# Retrieval
Many LLM applications require user-specific data that is not part of the model's training set.
The primary way of accomplishing this is through Retrieval Augmented Generation (RAG).
In this process, external data is *retrieved* and then passed to the LLM when doing the *generation* step.
LangChain provides all the building blocks for RAG applications - from simple to complex.
This section of the documentation covers everything related to the *retrieval* step - i.e. the fetching of the data.
Although this sounds simple, it can be subtly complex and encompasses several key modules.
![data_connection_diagram](/img/data_connection.jpg)
**[Document loaders](/docs/modules/data_connection/document_loaders/)**

Load documents from many different sources.
LangChain provides over 100 different document loaders as well as integrations with other major providers in the space, like AirByte and Unstructured.
We provide integrations to load all types of documents (HTML, PDF, code) from all types of locations (private S3 buckets, public websites).
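
For illustration, here is a minimal sketch of loading a single public web page with one of these loaders (the URL is a placeholder, and this assumes the `langchain` and `beautifulsoup4` packages are installed):

```python
from langchain.document_loaders import WebBaseLoader

# Load a web page into a list of Document objects
# (each Document carries `page_content` and `metadata`).
loader = WebBaseLoader("https://example.com/some-article")  # placeholder URL
docs = loader.load()

print(len(docs), docs[0].metadata)
```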
**[Document transformers](/docs/modules/data_connection/document_transformers/)**

A key part of retrieval is fetching only the relevant parts of documents.
This involves several transformation steps to best prepare the documents for retrieval.
One of the primary ones is splitting (or chunking) a large document into smaller chunks.
LangChain provides several different algorithms for doing this, as well as logic optimized for specific document types (code, markdown, etc.).
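
As a rough sketch of the splitting step, reusing the `docs` from the loader example above (the chunk sizes here are arbitrary and should be tuned for your documents):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the loaded documents into overlapping chunks sized for retrieval.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

print(len(chunks), chunks[0].page_content[:100])
```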
**[Text embedding models](/docs/modules/data_connection/text_embedding/)**

Another key part of retrieval is creating embeddings for documents.
Embeddings capture the semantic meaning of the text, allowing you to quickly and efficiently find other pieces of text that are similar.
LangChain provides integrations with over 25 different embedding providers and methods, from open-source to proprietary APIs, allowing you to choose the one best suited for your needs.
LangChain exposes a standard interface, allowing you to easily swap between models.
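
A minimal sketch using the OpenAI integration as one example provider (this assumes an `OPENAI_API_KEY` environment variable and the `openai` package; other embedding integrations expose the same `embed_query` / `embed_documents` interface):

```python
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Embed a query and a batch of document chunks into vectors.
query_vector = embeddings.embed_query("How do I load a PDF?")
doc_vectors = embeddings.embed_documents([c.page_content for c in chunks])

print(len(query_vector), len(doc_vectors))
```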
**[Vector stores](/docs/modules/data_connection/vectorstores/)**

With the rise of embeddings, there has emerged a need for databases to support efficient storage and searching of these embeddings.
LangChain provides integrations with over 50 different vector stores, from open-source local ones to cloud-hosted proprietary ones, allowing you to choose the one best suited for your needs.
LangChain exposes a standard interface, allowing you to easily swap between vector stores.
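
Continuing the sketch with FAISS as one example vector store (this assumes the `faiss-cpu` package is installed; the `chunks` and `embeddings` objects come from the snippets above):

```python
from langchain.vectorstores import FAISS

# Embed the chunks and index them for similarity search.
vectorstore = FAISS.from_documents(chunks, embeddings)

results = vectorstore.similarity_search("How do I load a PDF?", k=4)
print(results[0].page_content[:100])
```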
**[Retrievers](/docs/modules/data_connection/retrievers/)**

Once the data is in the database, you still need to retrieve it.
LangChain supports many different retrieval algorithms, and this is one of the places where we add the most value.
We support basic methods that are easy to get started with - namely simple semantic search (a minimal sketch of this appears after the list below).
However, we have also added a collection of algorithms on top of this to increase performance.
These include:
- [Parent Document Retriever](/docs/modules/data_connection/retrievers/parent_document_retriever): This lets you create multiple embeddings per parent document, so you can look up smaller chunks but return larger context.
- [Self Query Retriever](/docs/modules/data_connection/retrievers/self_query): User questions often contain a reference to something that isn't just semantic but rather expresses some logic that can best be represented as a metadata filter. Self-query allows you to parse out the *semantic* part of a query from other *metadata filters* present in the query.
- [Ensemble Retriever](/docs/modules/data_connection/retrievers/ensemble): Sometimes you may want to retrieve documents from multiple different sources, or using multiple different algorithms. The ensemble retriever allows you to easily do this.
- And more!
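
As a final sketch, here is the simple semantic-search case: wrapping the vector store from the previous snippet as a retriever and fetching relevant documents for a query (the more advanced retrievers linked above are configured separately, as described on their pages):

```python
# Expose the vector store as a retriever using plain semantic search.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

relevant_docs = retriever.get_relevant_documents("How do I load a PDF?")
for doc in relevant_docs:
    print(doc.metadata.get("source"), "-", doc.page_content[:80])
```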