community[minor]: [Pebblo] Enhance PebbloSafeLoader to take anonymize flag (#26812)

- **Description:** Adds an `anonymize_snippets` flag to `PebbloSafeLoader`. When set to
`True`, the Pebblo server anonymizes snippets by redacting all
personally identifiable information (PII) from the snippets going into
VectorDB and the generated reports.
- **Issue:** NA
- **Dependencies:** NA
- **docs**: Updated
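
To illustrate the intended effect of the flag, here is a minimal, self-contained sketch of snippet redaction. It is not Pebblo's implementation: the real classification happens on the Pebblo server, and the `redact_snippet` helper, the regex patterns, and the `<LABEL>` placeholder format below are all hypothetical, chosen only to show what "anonymized" output might look like.

```python
import re

# Hypothetical patterns for illustration only. This is NOT how the
# Pebblo Entity Classifier works; it only mimics the visible effect.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact_snippet(snippet: str, anonymize_snippets: bool = False) -> str:
    """Return the snippet unchanged, or with matched PII replaced by labels."""
    if not anonymize_snippets:
        return snippet
    for label, pattern in PII_PATTERNS.items():
        # Replace every match with a placeholder such as <EMAIL>.
        snippet = pattern.sub(f"<{label}>", snippet)
    return snippet


snippet = "Contact joe.smith@example.com, SSN 123-45-6789."
plain = redact_snippet(snippet)  # flag off: text passes through untouched
redacted = redact_snippet(snippet, anonymize_snippets=True)
print(redacted)
```

With the flag off the snippet is returned as-is, which mirrors the documented default of `False`.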
Rajendra Kadam
2024-09-25 19:03:06 +05:30
committed by GitHub
parent 92003b3724
commit 7e5a9c317f
3 changed files with 42 additions and 0 deletions


@@ -124,6 +124,39 @@
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Anonymize the snippets to redact all PII details\n",
"\n",
"Set `anonymize_snippets` to `True` to anonymize all personally identifiable information (PII) from the snippets going into VectorDB and the generated reports.\n",
"\n",
"> Note: The _Pebblo Entity Classifier_ effectively identifies personally identifiable information (PII) and is continuously evolving. While its recall is not yet 100%, it is steadily improving.\n",
"> For more details, please refer to the [_Pebblo Entity Classifier docs_](https://daxa-ai.github.io/pebblo/entityclassifier/)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import CSVLoader, PebbloSafeLoader\n",
"\n",
"loader = PebbloSafeLoader(\n",
" CSVLoader(\"data/corp_sens_data.csv\"),\n",
" name=\"acme-corp-rag-1\", # App name (Mandatory)\n",
" owner=\"Joe Smith\", # Owner (Optional)\n",
" description=\"Support productivity RAG application\", # Description (Optional)\n",
" anonymize_snippets=True, # Whether to anonymize entities in the PDF Report (Optional, default=False)\n",
")\n",
"documents = loader.load()\n",
"print(documents[0].metadata)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],