docs: misc retrievers fixes (#9791)
Various miscellaneous fixes to most pages in the 'Retrievers' section of the documentation:

- "VectorStore" and "vectorstore" changed to "vector store" for consistency
- Various spelling, grammar, and formatting improvements for readability

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
@@ -1,4 +1,4 @@
-The simplest loader reads in a file as text and places it all into one Document.
+The simplest loader reads in a file as text and places it all into one document.

```python
from langchain.document_loaders import TextLoader
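
# A sketch of typical usage (the file path is illustrative):
loader = TextLoader("./index.md")
docs = loader.load()
docs[0]  # the whole file comes back as a single Document
```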
@@ -19,7 +19,7 @@ print(data)
</CodeOutputBlock>

-## Customizing the csv parsing and loading
+## Customizing the CSV parsing and loading

See the [csv module](https://docs.python.org/3/library/csv.html) documentation for more information on what csv args are supported.
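For instance, the `csv_args` dict is passed straight through to Python's `csv.DictReader` — a minimal sketch (the file name and field names are illustrative):

```python
from langchain.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(
    file_path="example_data/mlb_teams_2012.csv",
    csv_args={
        "delimiter": ",",
        "quotechar": '"',
        "fieldnames": ["MLB Team", "Payroll in millions", "Wins"],
    },
)
data = loader.load()
```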
@@ -1,4 +1,4 @@
-Under the hood, by default this uses the [UnstructuredLoader](/docs/integrations/document_loaders/unstructured_file.html)
+Under the hood, by default this uses the [UnstructuredLoader](/docs/integrations/document_loaders/unstructured_file.html).

```python
from langchain.document_loaders import DirectoryLoader
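
# A sketch of typical usage: load every Markdown file under a directory
# (the path and glob pattern are illustrative).
loader = DirectoryLoader("../", glob="**/*.md")
docs = loader.load()
len(docs)
```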
@@ -121,7 +121,7 @@ len(docs)
</CodeOutputBlock>

-## Auto detect file encodings with TextLoader
+## Auto-detect file encodings with TextLoader

In this example we will see some strategies that can be useful when loading a big list of arbitrary files from a directory using the `TextLoader` class.
@@ -212,7 +212,7 @@ loader.load()
</HTMLOutputBlock>

-The file `example-non-utf8.txt` uses a different encoding the `load()` function fails with a helpful message indicating which file failed decoding.
+The file `example-non-utf8.txt` uses a different encoding, so the `load()` function fails with a helpful message indicating which file failed decoding.

With the default behavior of `TextLoader`, any failure to load any of the documents will fail the whole loading process, and no documents are loaded.
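Two mitigations are sketched below: `silent_errors` skips the failing files, and `autodetect_encoding` asks `TextLoader` to guess each file's encoding before failing (the directory path is illustrative):

```python
from langchain.document_loaders import DirectoryLoader, TextLoader

path = "example_data/"  # illustrative

# Option A: skip files that fail to decode instead of aborting the run
loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader, silent_errors=True)
docs = loader.load()

# Option B: let TextLoader try to autodetect each file's encoding
text_loader_kwargs = {"autodetect_encoding": True}
loader = DirectoryLoader(
    path, glob="**/*.txt", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs
)
docs = loader.load()
```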
@@ -139,9 +139,9 @@ data[0]
### Fetching remote PDFs using Unstructured

-This covers how to load online pdfs into a document format that we can use downstream. This can be used for various online pdf sites such as https://open.umn.edu/opentextbooks/textbooks/ and https://arxiv.org/archive/
+This covers how to load online PDFs into a document format that we can use downstream. This can be used for various online PDF sites such as https://open.umn.edu/opentextbooks/textbooks/ and https://arxiv.org/archive/

-Note: all other pdf loaders can also be used to fetch remote PDFs, but `OnlinePDFLoader` is a legacy function, and works specifically with `UnstructuredPDFLoader`.
+Note: all other PDF loaders can also be used to fetch remote PDFs, but `OnlinePDFLoader` is a legacy function, and works specifically with `UnstructuredPDFLoader`.
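A minimal sketch of that legacy path (the URL is illustrative):

```python
from langchain.document_loaders import OnlinePDFLoader

loader = OnlinePDFLoader("https://arxiv.org/pdf/2302.03803.pdf")
data = loader.load()
```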
@@ -208,7 +208,7 @@ data = loader.load()
### Using PDFMiner to generate HTML text

-This can be helpful for chunking texts semantically into sections as the output html content can be parsed via `BeautifulSoup` to get more structured and rich information about font size, page numbers, pdf headers/footers, etc.
+This can be helpful for chunking texts semantically into sections, as the output HTML content can be parsed via `BeautifulSoup` to get more structured and rich information about font size, page numbers, PDF headers/footers, etc.
```python
@@ -222,7 +222,7 @@ loader = PDFMinerPDFasHTMLLoader("example_data/layout-parser-paper.pdf")
```python
-data = loader.load()[0] # entire pdf is loaded as a single Document
+data = loader.load()[0] # entire PDF is loaded as a single Document
```
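For context, the `snippets` list manipulated in the next hunk collects `(text, font size)` pairs by walking PDFMiner's `div`/`span` output — a sketch of that loop (assuming `data` is the Document loaded above):

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(data.page_content, "html.parser")
content = soup.find_all("div")

cur_fs = None   # font size of the run being accumulated
cur_text = ""
snippets = []   # (text, font size) pairs
for c in content:
    sp = c.find("span")
    if not sp:
        continue
    st = sp.get("style")
    if not st:
        continue
    fs = re.findall(r"font-size:(\d+)px", st)
    if not fs:
        continue
    fs = int(fs[0])
    if not cur_fs:
        cur_fs = fs
    if fs == cur_fs:
        cur_text += c.text
    else:
        snippets.append((cur_text, cur_fs))
        cur_fs = fs
        cur_text = c.text
snippets.append((cur_text, cur_fs))
```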
@@ -259,7 +259,7 @@ for c in content:
        cur_text = c.text
snippets.append((cur_text,cur_fs))
# Note: The above logic is very straightforward. One can also add more strategies such as removing duplicate snippets (as
-# headers/footers in a PDF appear on multiple pages so if we find duplicatess safe to assume that it is redundant info)
+# headers/footers in a PDF appear on multiple pages so if we find duplicates it's safe to assume that it is redundant info)
```
@@ -285,7 +285,7 @@ for s in snippets:
        continue

    # if current snippet's font size > previous section's content but less than previous section's heading then also make a new
-    # section (e.g. title of a pdf will have the highest font size but we don't want it to subsume all sections)
+    # section (e.g. title of a PDF will have the highest font size but we don't want it to subsume all sections)
    metadata={'heading':s[0], 'content_font': 0, 'heading_font': s[1]}
    metadata.update(data.metadata)
    semantic_snippets.append(Document(page_content='',metadata=metadata))
@@ -358,7 +358,7 @@ loader = PyPDFDirectoryLoader("example_data/")
docs = loader.load()
```

-## Using pdfplumber
+## Using PDFPlumber

Like PyMuPDF, the output Documents contain detailed metadata about the PDF and its pages, and one document is returned per page.
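A minimal sketch (the file path is illustrative):

```python
from langchain.document_loaders import PDFPlumberLoader

loader = PDFPlumberLoader("example_data/layout-parser-paper.pdf")
data = loader.load()
data[0].metadata  # page-level metadata from the PDF
```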
@@ -50,7 +50,7 @@ RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)
## Python

-Here's an example using the PythonTextSplitter
+Here's an example using the PythonTextSplitter:

```python
@@ -78,7 +78,7 @@ python_docs
</CodeOutputBlock>
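For reference, the splitter call elided above follows the `from_language` pattern; the same shape applies to the JS, Markdown, Latex, HTML, and Solidity examples below, swapping the `Language` value (chunk sizes are illustrative):

```python
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
```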
## JS
-Here's an example using the JS text splitter
+Here's an example using the JS text splitter:

```python
@@ -109,7 +109,7 @@ js_docs
## Markdown

-Here's an example using the Markdown text splitter.
+Here's an example using the Markdown text splitter:

````python
@@ -155,7 +155,7 @@ md_docs
## Latex

-Here's an example on Latex text
+Here's an example on Latex text:

```python
@@ -219,7 +219,7 @@ latex_docs
## HTML

-Here's an example using an HTML text splitter
+Here's an example using an HTML text splitter:

```python
@@ -281,7 +281,7 @@ html_docs
## Solidity
-Here's an example using the Solidity text splitter
+Here's an example using the Solidity text splitter:

```python
SOL_CODE = """
@@ -36,27 +36,27 @@ class BaseRetriever(ABC):
It's that simple! You can call `get_relevant_documents` or the async `aget_relevant_documents` methods to retrieve documents relevant to a query, where "relevance" is defined by the specific retriever object you are calling.
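In use, that looks something like this (a sketch; `retriever` stands for any concrete retriever instance):

```python
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")

# or, from async code:
docs = await retriever.aget_relevant_documents("what did he say about ketanji brown jackson")
```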

-Of course, we also help construct what we think useful Retrievers are. The main type of Retriever that we focus on is a Vectorstore retriever. We will focus on that for the rest of this guide.
+Of course, we also help construct what we think useful retrievers are. The main type of retriever that we focus on is a vector store retriever. We will focus on that for the rest of this guide.

-In order to understand what a vectorstore retriever is, it's important to understand what a Vectorstore is. So let's look at that.
+In order to understand what a vector store retriever is, it's important to understand what a vector store is. So let's look at that.

-By default, LangChain uses [Chroma](/docs/ecosystem/integrations/chroma.html) as the vectorstore to index and search embeddings. To walk through this tutorial, we'll first need to install `chromadb`.
+By default, LangChain uses [Chroma](/docs/ecosystem/integrations/chroma.html) as the vector store to index and search embeddings. To walk through this tutorial, we'll first need to install `chromadb`.

```
pip install chromadb
```

This example showcases question answering over documents.
-We have chosen this as the example for getting started because it nicely combines a lot of different elements (Text splitters, embeddings, vectorstores) and then also shows how to use them in a chain.
+We have chosen this as the example for getting started because it nicely combines a lot of different elements (text splitters, embeddings, vector stores) and then also shows how to use them in a chain.

Question answering over documents consists of four steps:

1. Create an index
-2. Create a Retriever from that index
+2. Create a retriever from that index
3. Create a question answering chain
4. Ask questions!

-Each of the steps has multiple sub steps and potential configurations. In this notebook we will primarily focus on (1). We will start by showing the one-liner for doing so, but then break down what is actually going on.
+Each of the steps has multiple substeps and potential configurations. In this notebook we will primarily focus on (1). We will start by showing the one-liner for doing so, but then break down what is actually going on.

First, let's import some common classes we'll use no matter what.
@@ -66,7 +66,7 @@ from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
```

-Next in the generic setup, let's specify the document loader we want to use. You can download the `state_of_the_union.txt` file [here](https://github.com/hwchase17/langchain/blob/master/docs/extras/modules/state_of_the_union.txt)
+Next in the generic setup, let's specify the document loader we want to use. You can download the `state_of_the_union.txt` file [here](https://github.com/hwchase17/langchain/blob/master/docs/extras/modules/state_of_the_union.txt).
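The elided setup then reduces to a few lines — a sketch (assuming the file was saved next to the notebook):

```python
from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator

loader = TextLoader("state_of_the_union.txt")

# The one-liner: split, embed, and index the documents in one call
index = VectorstoreIndexCreator().from_loaders([loader])

query = "What did the president say about Ketanji Brown Jackson?"
index.query(query)
```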
```python
@@ -129,7 +129,7 @@ index.query_with_sources(query)

</CodeOutputBlock>

-What is returned from the `VectorstoreIndexCreator` is `VectorStoreIndexWrapper`, which provides these nice `query` and `query_with_sources` functionality. If we just wanted to access the vectorstore directly, we can also do that.
+What is returned from the `VectorstoreIndexCreator` is a `VectorStoreIndexWrapper`, which provides these nice `query` and `query_with_sources` functionalities. If we just want to access the vector store directly, we can also do that.
```python
@@ -144,7 +144,7 @@ index.vectorstore

</CodeOutputBlock>

-If we then want to access the VectorstoreRetriever, we can do that with:
+If we then want to access the `VectorStoreRetriever`, we can do that with:
```python
@@ -159,7 +159,7 @@ index.vectorstore.as_retriever()

</CodeOutputBlock>

-It can also be convenient to filter the vectorstore by the metadata associated with documents, particularly when your vectorstore has multiple sources. This can be done using the `query` method like so:
+It can also be convenient to filter the vector store by the metadata associated with documents, particularly when your vector store has multiple sources. This can be done using the `query` method like so:
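A sketch of such a filtered query (the `source` value is illustrative):

```python
index.query(
    "Summarize the general content of this document.",
    retriever_kwargs={"search_kwargs": {"filter": {"source": "state_of_the_union.txt"}}},
)
```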
```python
@@ -185,7 +185,7 @@ There are three main steps going on after the documents are loaded:

1. Splitting documents into chunks
2. Creating embeddings for each document
-3. Storing documents and embeddings in a vectorstore
+3. Storing documents and embeddings in a vector store

Let's walk through this in code.
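Before the step-by-step walkthrough, here is the whole pipeline in one sketch (assuming `documents` holds the loaded documents; chunk sizes are illustrative):

```python
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# 1. Split the loaded documents into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# 2. Create the embedding function applied to each chunk
embeddings = OpenAIEmbeddings()

# 3. Embed the chunks and store them in a vector store
db = Chroma.from_documents(texts, embeddings)
```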
@@ -211,7 +211,7 @@ from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
```

-We now create the vectorstore to use as the index.
+We now create the vector store to use as the index.

```python
@@ -9,9 +9,9 @@ from langchain.schema import Document
from langchain.vectorstores import FAISS
```

-## Low Decay Rate
+## Low decay rate

-A low `decay rate` (in this, to be extreme, we will set close to 0) means memories will be "remembered" for longer. A `decay rate` of 0 means memories never be forgotten, making this retriever equivalent to the vector lookup.
+A low `decay rate` (in this example, to be extreme, we will set it close to 0) means memories will be "remembered" for longer. A `decay rate` of 0 means memories will never be forgotten, making this retriever equivalent to the vector lookup.
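A sketch of such a retriever over an empty FAISS index (1536 is the OpenAI embedding dimension; the decay rate is deliberately absurdly small):

```python
import faiss
from langchain.docstore import InMemoryDocstore
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from langchain.vectorstores import FAISS

embeddings_model = OpenAIEmbeddings()
index = faiss.IndexFlatL2(1536)
vectorstore = FAISS(embeddings_model.embed_query, index, InMemoryDocstore({}), {})

# Near-zero decay: old memories keep almost all of their recency score
retriever = TimeWeightedVectorStoreRetriever(
    vectorstore=vectorstore, decay_rate=0.0000000000000000000000001, k=1
)
```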
```python
@@ -53,7 +53,7 @@ retriever.get_relevant_documents("hello world")

</CodeOutputBlock>

-## High Decay Rate
+## High decay rate

With a high `decay rate` (e.g., several 9's), the `recency score` quickly goes to 0! If you set this all the way to 1, `recency` is 0 for all objects, once again making this equivalent to a vector lookup.
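The setup mirrors the low-decay sketch above; only the rate changes:

```python
# Aggressive decay: recency dominates, and old memories fade almost immediately
retriever = TimeWeightedVectorStoreRetriever(
    vectorstore=vectorstore, decay_rate=0.999, k=1
)
```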
@@ -98,9 +98,9 @@ retriever.get_relevant_documents("hello world")

</CodeOutputBlock>

-## Virtual Time
+## Virtual time

-Using some utils in LangChain, you can mock out the time component
+Using some utils in LangChain, you can mock out the time component.
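A sketch with `mock_now` (the date is illustrative):

```python
import datetime
from langchain.utils import mock_now

# Recency is computed against the mocked "now" inside this block
with mock_now(datetime.datetime(2024, 2, 3, 10, 11)):
    print(retriever.get_relevant_documents("hello world"))
```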
```python
@@ -34,8 +34,8 @@ retriever = db.as_retriever()
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
```

-## Maximum Marginal Relevance Retrieval
-By default, the vectorstore retriever uses similarity search. If the underlying vectorstore support maximum marginal relevance search, you can specify that as the search type.
+## Maximum marginal relevance retrieval
+By default, the vector store retriever uses similarity search. If the underlying vector store supports maximum marginal relevance search, you can specify that as the search type.
```python
@@ -47,9 +47,9 @@ retriever = db.as_retriever(search_type="mmr")
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
```

-## Similarity Score Threshold Retrieval
+## Similarity score threshold retrieval

-You can also a retrieval method that sets a similarity score threshold and only returns documents with a score above that threshold
+You can also use a retrieval method that sets a similarity score threshold and only returns documents with a score above that threshold.
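A sketch of that search type (the threshold value is illustrative):

```python
retriever = db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.5},
)
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
```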
```python
@@ -1,11 +1,11 @@
## Get started
We'll use a Pinecone vector store in this example.

-First we'll want to create a `Pinecone` VectorStore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.
+First we'll want to create a `Pinecone` vector store and seed it with some data. We've created a small demo set of documents that contain summaries of movies.

-To use Pinecone, you need to have `pinecone` package installed and you must have an API key and an Environment. Here are the [installation instructions](https://docs.pinecone.io/docs/quickstart).
+To use Pinecone, you need to have the `pinecone` package installed and you must have an API key and an environment. Here are the [installation instructions](https://docs.pinecone.io/docs/quickstart).

-NOTE: The self-query retriever requires you to have `lark` package installed.
+**Note:** The self-query retriever requires you to have the `lark` package installed.
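Initialization then looks roughly like this (the environment variable names are illustrative; this assumes the classic `pinecone` client API):

```python
import os
import pinecone

pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment=os.environ["PINECONE_ENV"],
)
```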
```python
@@ -20,7 +20,7 @@ from langchain.embeddings import OpenAIEmbeddings
embeddings_model = OpenAIEmbeddings(openai_api_key="...")
```

-otherwise you can initialize without any params:
+Otherwise you can initialize without any params:
```python
from langchain.embeddings import OpenAIEmbeddings
@@ -1,4 +1,4 @@
-Langchain supports async operation on vector stores. All the methods might be called using their async counterparts, with the prefix `a`, meaning `async`.
+LangChain supports async operation on vector stores. All the methods can be called using their async counterparts, with the prefix `a`, meaning `async`.

`Qdrant` is a vector store that supports all the async operations, so it will be used in this walkthrough.
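A sketch of the async entry points (assuming `texts` and `embeddings` are already defined, and a Qdrant instance running locally):

```python
from langchain.vectorstores import Qdrant

db = await Qdrant.afrom_texts(texts, embeddings, "http://localhost:6333")
docs = await db.asimilarity_search("What did the president say about Ketanji Brown Jackson")
```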
@@ -47,7 +47,7 @@ docs = await db.asimilarity_search_by_vector(embedding_vector)

## Maximum marginal relevance search (MMR)

-Maximal marginal relevance optimizes for similarity to query AND diversity among selected documents. It is also supported in async API.
+Maximal marginal relevance optimizes for similarity to query **and** diversity among selected documents. It is also supported in the async API.

```python
query = "What did the president say about Ketanji Brown Jackson"
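
# A sketch of the async MMR call (assumes `db` from the setup above;
# k and fetch_k are illustrative):
found_docs = await db.amax_marginal_relevance_search(query, k=2, fetch_k=10)
for i, doc in enumerate(found_docs):
    print(f"{i + 1}.", doc.page_content, "\n")
```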