Add descriptions to example notebook

2026-02-21 14:43:07 +00:00 · 2023-08-28 08:56:46 +00:00
parent fd964e6f05
commit db404ca7c6
1 changed files with 145 additions and 83 deletions
--- a/docs/extras/use_cases/data_anonymization.ipynb
+++ b/docs/extras/use_cases/data_anonymization.ipynb
@@ -28,81 +28,57 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
-    "%load_ext autoreload\n",
-    "%autoreload 2"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "True"
-      ]
-     },
-     "execution_count": 2,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "# Import necessary packages\n",
+    "# Install necessary packages\n",
    "# ! pip install langchain langchain-experimental openai\n",
-    "# ! python -m spacy download en_core_web_lg\n",
-    "\n",
-    "# Set env var OPENAI_API_KEY or load from a .env file:\n",
-    "import dotenv\n",
-    "\n",
-    "dotenv.load_dotenv()"
+    "# ! python -m spacy download en_core_web_lg"
   ]
  },
  {
-   "cell_type": "code",
-   "execution_count": 3,
+   "cell_type": "markdown",
   "metadata": {},
-   "outputs": [],
   "source": [
-    "from langchain_experimental.data_anonymizer import PresidioAnonymizer"
+    "\\\n",
+    "Let's see how PII anonymization works using a sample sentence:"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 4,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "anonymizer = PresidioAnonymizer(analyzed_fields=[\"PERSON\"])"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
-       "'My name is Brenda Nelson, call me at 313-666-7440 or email me at real.slim.shady@gmail.com'"
+       "'My name is Katherine Hancock, call me at 313-666-7440 or email me at real.slim.shady@gmail.com'"
      ]
     },
-     "execution_count": 5,
+     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
+    "from langchain_experimental.data_anonymizer import PresidioAnonymizer\n",
+    "\n",
+    "anonymizer = PresidioAnonymizer(analyzed_fields=[\"PERSON\"])\n",
+    "\n",
    "anonymizer.anonymize(\n",
    "    \"My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com\"\n",
    ")"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\\\n",
+    "As can be observed, the name was correctly identified and replaced with another. The `analyzed_fields` parameter controls which entity types are detected and substituted. We can add *PHONE_NUMBER* to the list:"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 6,
@@ -126,18 +102,30 @@
    ")"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\\\n",
+    "If `analyzed_fields` is not specified, the anonymizer detects all supported entity types by default. The full list is:\n",
+    "\n",
+    "`['PERSON', 'EMAIL_ADDRESS', 'PHONE_NUMBER', 'IBAN_CODE', 'CREDIT_CARD', 'CRYPTO', 'IP_ADDRESS', 'LOCATION', 'DATE_TIME', 'NRP', 'MEDICAL_LICENSE', 'URL', 'US_BANK_NUMBER', 'US_DRIVER_LICENSE', 'US_ITIN', 'US_PASSPORT', 'US_SSN']`\n",
+    "\n",
+    "**Disclaimer:** We suggest explicitly defining which private data should be detected. Presidio is not perfect and sometimes makes mistakes, so it is better to keep control over what gets anonymized."
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
-       "'My name is Joseph Vang, call me at (986)925-6310 or email me at harmonashley@example.net'"
+       "'My name is Martin Quinn, call me at 3538218419 or email me at padillasteven@example.org'"
      ]
     },
-     "execution_count": 7,
+     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -149,18 +137,26 @@
    ")"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\\\n",
+    "The above list of detected fields may not be sufficient. For example, the built-in *PHONE_NUMBER* field does not support Polish phone numbers and confuses them with another field:"
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
-       "'My polish phone number is LCBN14037514713276'"
+       "'My polish phone number is JXRR35989266946179'"
      ]
     },
-     "execution_count": 8,
+     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -170,9 +166,17 @@
    "anonymizer.anonymize(\"My polish phone number is 666555444\")"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\\\n",
+    "In such cases, you can write your own recognizers and add them to the existing pool. How to create recognizers is described in the [Presidio documentation](https://microsoft.github.io/presidio/samples/python/customizing_presidio_analyzer/)."
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -192,18 +196,34 @@
    ")"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\\\n",
+    "Now we can add the recognizer by calling the `add_recognizer` method on the anonymizer:"
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "anonymizer.add_recognizer(polish_phone_numbers_recognizer)"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\\\n",
+    "And voilà! With the added pattern-based recognizer, the anonymizer now handles Polish phone numbers."
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
@@ -222,18 +242,26 @@
    "print(anonymizer.anonymize(\"My polish phone number is +48 666 555 444\"))"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\\\n",
+    "The problem is that, even though we now recognize Polish phone numbers, we don't have a method (operator) that tells the anonymizer how to substitute the field. As a result, the output contains only the placeholder string `<POLISH_PHONE_NUMBER>`. We need to create a method to replace it correctly:"
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": 12,
+   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
-       "'+48 32 615 90 45'"
+       "'+48 32 290 20 48'"
      ]
     },
-     "execution_count": 12,
+     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -251,9 +279,17 @@
    "fake_polish_phone_number()"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\\\n",
+    "We used Faker to create pseudo data. Now we can create an operator and add it to the anonymizer. For complete information about operators and their creation, see the Presidio documentation for [simple](https://microsoft.github.io/presidio/tutorial/10_simple_anonymization/) and [custom](https://microsoft.github.io/presidio/tutorial/11_custom_anonymization/) anonymization."
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": 13,
+   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -268,7 +304,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 14,
+   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -277,16 +313,16 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 15,
+   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
-       "'My polish phone number is 882 897 705'"
+       "'My polish phone number is +48 537 219 801'"
      ]
     },
-     "execution_count": 15,
+     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -295,19 +331,27 @@
    "anonymizer.anonymize(\"My polish phone number is 666555444\")"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\\\n",
+    "Finally, it is worth showing how to implement the anonymizer as a chain. Since anonymization is based on string operations, we can use `TransformChain` for this:"
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": 16,
+   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'text': 'You can find our super secret data at https://supersecretdata.com',\n",
-       " 'output_text': 'You can find our super secret data at http://grant.com/'}"
+       " 'output_text': 'You can find our super secret data at http://www.shea-fernandez.com/'}"
      ]
     },
-     "execution_count": 16,
+     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -330,19 +374,50 @@
    "anonymize_chain(\"You can find our super secret data at https://supersecretdata.com\")"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\\\n",
+    "Later, you can, for example, use this anonymization chain as part of a `SimpleSequentialChain`, as shown below:"
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": 17,
+   "execution_count": 40,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "True"
+      ]
+     },
+     "execution_count": 40,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Set env var OPENAI_API_KEY or load from a .env file:\n",
+    "import dotenv\n",
+    "\n",
+    "dotenv.load_dotenv()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'input': 'You can find our super secret data at https://supersecretdata.com',\n",
-       " 'output': ' https://www.brown-hunter.info/'}"
+       " 'output': '\\nhttp://www.porter-hart.net/'}"
      ]
     },
-     "execution_count": 17,
+     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -365,18 +440,6 @@
    "sequential_chain = SimpleSequentialChain(chains=[anonymize_chain, llm_chain])\n",
    "sequential_chain(\"You can find our super secret data at https://supersecretdata.com\")"
   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Future Works"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": []
  }
 ],
 "metadata": {
@@ -396,9 +459,8 @@
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
-  },
-  "orig_nbformat": 4
+  }
 },
 "nbformat": 4,
- "nbformat_minor": 2
+ "nbformat_minor": 4
 }