Building applications with LLMs through composability
maks-operlejn-ds 4cc4534d81
Data deanonymization (#10093)
### Description

This change implements pseudonymization of data with the ability to
retrieve the original text (deanonymization). To protect private data,
for example when querying external APIs (OpenAI), it is worth
pseudonymizing sensitive values before they leave the application.
After the model responds, however, it is useful to restore the data to
its original form.

I implemented the `PresidioReversibleAnonymizer`, which consists of two
parts:

1. anonymization - it works the same way as `PresidioAnonymizer`, plus
the object itself stores a mapping of made-up values to original ones,
for example:
```
    {
        "PERSON": {
            "<anonymized>": "<original>",
            "John Doe": "Slim Shady"
        },
        "PHONE_NUMBER": {
            "111-111-1111": "555-555-5555"
        }
        ...
    }
```

2. deanonymization - using the mapping described above, it matches fake
data with original data and then substitutes it.

Between anonymization and deanonymization, the user can perform other
operations, for example passing the output to an LLM.
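
The deanonymization step can be sketched in plain Python as follows. This is a minimal illustration of exact-string substitution over a mapping shaped like the one above; it is not the `PresidioReversibleAnonymizer` implementation, and the `deanonymize` function and sample values are hypothetical.

```python
# Hypothetical mapping in the shape shown above:
# entity type -> {fake value: original value}.
mapping = {
    "PERSON": {"John Doe": "Slim Shady"},
    "PHONE_NUMBER": {"111-111-1111": "555-555-5555"},
}


def deanonymize(text: str, mapping: dict[str, dict[str, str]]) -> str:
    """Replace each fake value found in `text` with its original value.

    Matching is on the full string only, so a shortened fake value
    (for example "John" instead of "John Doe") is left untouched.
    """
    for fake_to_original in mapping.values():
        for fake, original in fake_to_original.items():
            text = text.replace(fake, original)
    return text


# Stand-in for text that came back from an LLM.
llm_output = "John Doe can be reached at 111-111-1111."
restored = deanonymize(llm_output, mapping)
print(restored)
```

Because the lookup key is the complete fake value, any rewording of that value by the model defeats the substitution.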

### Future work

- **instance anonymization** - at this point, each occurrence of PII is
treated as a separate entity and separately anonymized. Therefore, two
occurrences of the name John Doe in the text will be changed to two
different names. It is therefore worth introducing support for full
instance detection, so that repeated occurrences are treated as a single
object.
- **better matching and substitution of fake values for real ones** -
currently the strategy is based on matching full strings and then
substituting them. Because language models are nondeterministic, the
value in the answer may be slightly changed (e.g. *John Doe* -> *John*
or *Main St, New York* -> *New York*), in which case the substitution
is no longer possible. It is therefore worth adjusting the matching to
your needs.
- **Q&A with anonymization** - once all the functionality is written, a
useful documentation resource would be a notebook about retrieval from
documents using anonymization: an iterative process of adding new
recognizers to fit the data, with lessons learned and what to look out
for.
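
One possible shape for the instance anonymization item is sketched below, assuming the detector already yields `(entity_type, original)` pairs. The `make_pseudonymizer` helper and the `<PERSON_1>`-style placeholders are hypothetical and are not part of this change; the point is only that a lookup keyed on the original value makes repeated occurrences share one substitute.

```python
from itertools import count


def make_pseudonymizer():
    """Build a pseudonymizer that reuses one placeholder per distinct value."""
    forward: dict[tuple[str, str], str] = {}  # (entity type, original) -> placeholder
    counters: dict[str, count] = {}           # one running counter per entity type

    def pseudonymize(entity_type: str, original: str) -> str:
        key = (entity_type, original)
        if key not in forward:
            # First time this value is seen: allocate the next placeholder.
            n = next(counters.setdefault(entity_type, count(1)))
            forward[key] = f"<{entity_type}_{n}>"
        return forward[key]

    return pseudonymize, forward


pseudonymize, forward = make_pseudonymizer()
first = pseudonymize("PERSON", "John Doe")
other = pseudonymize("PERSON", "Jane Roe")
again = pseudonymize("PERSON", "John Doe")  # same person, same placeholder
print(first, other, again)
```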

### Twitter handle
@deepsense_ai / @MaksOpp

---------

Co-authored-by: MaksOpp <maks.operlejn@gmail.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
2023-09-06 21:33:24 -07:00
.devcontainer Devcontainer README -> Clarification. (#8414) 2023-07-28 15:09:42 -07:00
.github Install dev, lint, test, typing extra deps for linting steps. (#10249) 2023-09-06 11:15:28 -04:00
docs Data deanonymization (#10093) 2023-09-06 21:33:24 -07:00
libs Data deanonymization (#10093) 2023-09-06 21:33:24 -07:00
.gitattributes Update dev container (#6189) 2023-06-16 15:42:14 -07:00
.gitignore add experimental ref (#8435) 2023-07-28 14:26:47 -07:00
.gitmodules Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
.readthedocs.yaml use top nav docs (#8090) 2023-07-21 13:52:03 -07:00
CITATION.cff bump version to 0069 (#710) 2023-01-24 00:24:54 -08:00
LICENSE add license (#50) 2022-11-01 21:12:02 -07:00
Makefile fix makefile help (#8723) 2023-08-04 15:37:00 -04:00
MIGRATE.md cr 2023-07-28 17:47:00 -07:00
poetry.lock poetry lock the top-level environment. (#9477) 2023-08-22 14:09:11 -04:00
poetry.toml Unbreak devcontainer (#8154) 2023-07-23 19:33:47 -07:00
pyproject.toml Pinecone upsert parallelization (#9859) 2023-09-03 15:37:41 -07:00
README.md docs(readme): fixed badges with new github url (#9493) 2023-08-19 14:51:38 -07:00
SECURITY.md Update SECURITY.md email address. (#9558) 2023-08-21 14:52:21 -04:00

🦜🔗 LangChain

Looking for the JS/TS version? Check out LangChain.js.

Production Support: As you move your LangChains into production, we'd love to offer more hands-on support. Fill out this form to share more about what you're building, and our team will get in touch.

🚨Breaking Changes for select chains (SQLDatabase) on 7/28/23

In an effort to make langchain leaner and safer, we are moving select chains to langchain_experimental. This migration has already started, but we are remaining backwards compatible until 7/28. On that date, we will remove functionality from langchain. Read more about the motivation and the progress here. Read how to migrate your code here.

Quick Install

`pip install langchain` or `pip install langsmith && conda install langchain -c conda-forge`

🤔 What is this?

Large language models (LLMs) are emerging as a transformative technology, enabling developers to build applications that they previously could not. However, using these LLMs in isolation is often insufficient for creating a truly powerful app - the real power comes when you can combine them with other sources of computation or knowledge.

This library aims to assist in the development of those types of applications. Common examples of these applications include:

Question Answering over specific documents

💬 Chatbots

🤖 Agents

📖 Documentation

Please see here for full documentation on:

  • Getting started (installation, setting up the environment, simple examples)
  • How-To examples (demos, integrations, helper functions)
  • Reference (full API docs)
  • Resources (high-level explanation of core concepts)

🚀 What can this help with?

There are six main areas that LangChain is designed to help with. These are, in increasing order of complexity:

📃 LLMs and Prompts:

This includes prompt management, prompt optimization, a generic interface for all LLMs, and common utilities for working with LLMs.

🔗 Chains:

Chains go beyond a single LLM call and involve sequences of calls (whether to an LLM or a different utility). LangChain provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications.
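
As a rough illustration of the idea, a chain can be thought of as a sequence of steps where each step's output feeds the next. The sketch below is plain Python and does not use LangChain's actual chain interface; `make_chain`, `fake_llm`, and the other names are hypothetical stand-ins.

```python
from typing import Callable

Step = Callable[[str], str]


def make_chain(*steps: Step) -> Step:
    """Compose steps so that each one receives the previous step's output."""
    def run(text: str) -> str:
        for step in steps:
            text = step(text)
        return text
    return run


def build_prompt(question: str) -> str:
    return f"Answer concisely: {question}"


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call.
    return f"[model reply to: {prompt}]"


def strip_brackets(reply: str) -> str:
    # A non-LLM utility step that post-processes the reply.
    return reply.strip("[]")


chain = make_chain(build_prompt, fake_llm, strip_brackets)
result = chain("What is a chain?")
print(result)
```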

📚 Data Augmented Generation:

Data Augmented Generation involves specific types of chains that first interact with an external data source to fetch data for use in the generation step. Examples include summarization of long pieces of text and question/answering over specific data sources.

🤖 Agents:

Agents involve an LLM making decisions about which Actions to take, taking that Action, seeing an Observation, and repeating that until done. LangChain provides a standard interface for agents, a selection of agents to choose from, and examples of end-to-end agents.
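
The action/observation loop described above can be sketched in plain Python. This is not LangChain's agent interface: `scripted_policy` is a hypothetical stand-in for the LLM's decision, and the `add` tool exists only for the example.

```python
def add(args: str) -> str:
    """Toy tool: add two whitespace-separated integers."""
    a, b = args.split()
    return str(int(a) + int(b))


TOOLS = {"add": add}


def scripted_policy(question: str, observations: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM: pick the next action from what has been seen."""
    if not observations:
        return ("add", "2 3")
    return ("finish", f"The answer is {observations[-1]}")


def run_agent(question, policy, tools, max_steps: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_steps):
        action, action_input = policy(question, observations)
        if action == "finish":
            return action_input
        # Take the action and record the observation for the next decision.
        observations.append(tools[action](action_input))
    raise RuntimeError("agent did not finish within max_steps")


answer = run_agent("What is 2 + 3?", scripted_policy, TOOLS)
print(answer)
```

The `max_steps` bound is a design choice for the sketch: a loop driven by model decisions needs some stopping condition other than the model choosing to finish.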

🧠 Memory:

Memory refers to persisting state between calls of a chain/agent. LangChain provides a standard interface for memory, a collection of memory implementations, and examples of chains/agents that use memory.
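
A minimal sketch of the idea, assuming the simplest possible memory: a buffer of earlier turns that is prepended to each new prompt. `BufferMemory`, `fake_llm`, and `chat` are hypothetical names for illustration and are not LangChain's memory classes.

```python
class BufferMemory:
    """Keep every (human, ai) turn and render them as prompt context."""

    def __init__(self) -> None:
        self.turns: list[tuple[str, str]] = []

    def save(self, human: str, ai: str) -> None:
        self.turns.append((human, ai))

    def as_context(self) -> str:
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model: reports how much context it received.
    return f"prompt had {len(prompt.splitlines())} lines"


def chat(memory: BufferMemory, user_input: str) -> str:
    # State from earlier calls is injected ahead of the new input.
    parts = [memory.as_context(), f"Human: {user_input}"]
    prompt = "\n".join(part for part in parts if part)
    reply = fake_llm(prompt)
    memory.save(user_input, reply)
    return reply


memory = BufferMemory()
first_reply = chat(memory, "hi")
second_reply = chat(memory, "again")
print(first_reply, "|", second_reply)
```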

🧐 Evaluation:

[BETA] Generative models are notoriously hard to evaluate with traditional metrics. One new way of evaluating them is using language models themselves to do the evaluation. LangChain provides some prompts/chains for assisting in this.

For more information on these concepts, please see our full documentation.

💁 Contributing

As an open-source project in a rapidly developing field, we are extremely open to contributions, whether it be in the form of a new feature, improved infrastructure, or better documentation.

For detailed information on how to contribute, see here.