mirror of https://github.com/hwchase17/langchain.git synced 2025-09-28 15:00:23 +00:00

Files

maks-operlejn-ds 4cc4534d81 Data deanonymization (#10093 )

### Description

The feature for pseudonymizing data with ability to retrieve original
text (deanonymization) has been implemented. In order to protect private
data, such as when querying external APIs (OpenAI), it is worth
pseudonymizing sensitive data to maintain full privacy. But then, after
the model response, it would be good to have the data in the original
form.

I implemented the `PresidioReversibleAnonymizer`, which consists of two
parts:

1. anonymization - it works the same way as `PresidioAnonymizer`, plus
the object itself stores a mapping of made-up values to original ones,
for example:
```
    {
        "PERSON": {
            "<anonymized>": "<original>",
            "John Doe": "Slim Shady"
        },
        "PHONE_NUMBER": {
            "111-111-1111": "555-555-5555"
        }
        ...
    }
```

2. deanonymization - using the mapping described above, it matches fake
data with original data and then substitutes it.

Between anonymization and deanonymization user can perform different
operations, for example, passing the output to LLM.

### Future works

- **instance anonymization** - at this point, each occurrence of PII is
treated as a separate entity and separately anonymized. Therefore, two
occurrences of the name John Doe in the text will be changed to two
different names. It is therefore worth introducing support for full
instance detection, so that repeated occurrences are treated as a single
object.
- **better matching and substitution of fake values for real ones** -
currently the strategy is based on matching full strings and then
substituting them. Due to the indeterminism of language models, it may
happen that the value in the answer is slightly changed (e.g. *John Doe*
-> *John* or *Main St, New York* -> *New York*) and such a substitution
is then no longer possible. Therefore, it is worth adjusting the
matching for your needs.
- **Q&A with anonymization** - when I'm done writing all the
functionality, I thought it would be a cool resource in documentation to
write a notebook about retrieval from documents using anonymization. An
iterative process, adding new recognizers to fit the data, lessons
learned and what to look out for

### Twitter handle
@deepsense_ai / @MaksOpp

---------

Co-authored-by: MaksOpp <maks.operlejn@gmail.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>

2023-09-06 21:33:24 -07:00

langchain_experimental

Data deanonymization (#10093 )

2023-09-06 21:33:24 -07:00

tests

Data deanonymization (#10093 )

2023-09-06 21:33:24 -07:00

Makefile

Add data anonymizer (#9863 )

2023-08-30 10:39:44 -07:00

poetry.lock

Resolve: VectorSearch enabled SQLChain? (#10177 )

2023-09-06 17:08:12 -07:00

poetry.toml

…

pyproject.toml

Diffbot Graph Transformer / Neo4j Graph document ingestion (#9979 )

2023-09-06 13:32:59 -07:00

README.md

Add notice about security-sensitive experimental code to experimental README. (#9936 )

2023-08-29 14:21:30 -04:00

README.md

🦜️🧪 LangChain Experimental

This package holds experimental LangChain code, intended for research and experimental uses.

Warning

Portions of the code in this package may be dangerous if not properly deployed in a sandboxed environment. Please be wary of deploying experimental code to production unless you've taken appropriate precautions and have already discussed it with your security team.

Some of the code here may be marked with security notices. However, given the exploratory and experimental nature of the code in this package, the lack of a security notice on a piece of code does not mean that the code in question does not require additional security considerations in order to be safe to use.

README.md Unescape Escape

🦜️🧪 LangChain Experimental

README.md