docs: misc retrievers fixes (#9791)

Various miscellaneous fixes to most pages in the 'Retrievers' section of
the documentation:
- "VectorStore" and "vectorstore" changed to "vector store" for
consistency
- Various spelling, grammar, and formatting improvements for readability

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Author: seamusp
Date: 2023-09-03 20:26:49 -07:00 (committed via GitHub)
parent 8bc452a466
commit 16945c9922
39 changed files with 148 additions and 163 deletions

View File

@@ -1,4 +1,4 @@
-The simplest loader reads in a file as text and places it all into one Document.
+The simplest loader reads in a file as text and places it all into one document.
```python
from langchain.document_loaders import TextLoader
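# A minimal usage sketch; the file path below is illustrative.
loader = TextLoader("./example.txt")
docs = loader.load()  # the entire file becomes one Document
```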

View File

@@ -19,7 +19,7 @@ print(data)
</CodeOutputBlock>
-## Customizing the csv parsing and loading
+## Customizing the CSV parsing and loading
See the [csv module](https://docs.python.org/3/library/csv.html) documentation for more information on what csv args are supported.
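As a rough sketch, custom parsing options can be passed through `csv_args` (the file path and field names below are illustrative):

```python
from langchain.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(
    file_path="./example_data/example.csv",  # illustrative path
    csv_args={
        "delimiter": ",",
        "quotechar": '"',
        "fieldnames": ["Team", "Payroll", "Wins"],  # illustrative headers
    },
)
data = loader.load()
```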

View File

@@ -1,4 +1,4 @@
-Under the hood, by default this uses the [UnstructuredLoader](/docs/integrations/document_loaders/unstructured_file.html)
+Under the hood, by default this uses the [UnstructuredLoader](/docs/integrations/document_loaders/unstructured_file.html).
```python
from langchain.document_loaders import DirectoryLoader
@@ -121,7 +121,7 @@ len(docs)
</CodeOutputBlock>
-## Auto detect file encodings with TextLoader
+## Auto-detect file encodings with TextLoader
In this example we will see some strategies that can be useful when loading a big list of arbitrary files from a directory using the `TextLoader` class.
@@ -212,7 +212,7 @@ loader.load()
</HTMLOutputBlock>
-The file `example-non-utf8.txt` uses a different encoding the `load()` function fails with a helpful message indicating which file failed decoding.
+The file `example-non-utf8.txt` uses a different encoding, so the `load()` function fails with a helpful message indicating which file failed decoding.
With the default behavior of `TextLoader`, any failure to load any of the documents will fail the whole loading process, and no documents are loaded.
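As a sketch of two possible mitigations (paths are illustrative), you can either skip the failing files or let `TextLoader` try to detect each file's encoding:

```python
from langchain.document_loaders import DirectoryLoader, TextLoader

# Option A: log failures and keep going instead of aborting the whole load
loader = DirectoryLoader(
    "./example_data/", glob="**/*.txt", loader_cls=TextLoader, silent_errors=True
)

# Option B: ask TextLoader to auto-detect each file's encoding before failing
loader = DirectoryLoader(
    "./example_data/",
    glob="**/*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"autodetect_encoding": True},
)
docs = loader.load()
```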

View File

@@ -139,9 +139,9 @@ data[0]
### Fetching remote PDFs using Unstructured
-This covers how to load online pdfs into a document format that we can use downstream. This can be used for various online pdf sites such as https://open.umn.edu/opentextbooks/textbooks/ and https://arxiv.org/archive/
+This covers how to load online PDFs into a document format that we can use downstream. This can be used for various online PDF sites such as https://open.umn.edu/opentextbooks/textbooks/ and https://arxiv.org/archive/
-Note: all other pdf loaders can also be used to fetch remote PDFs, but `OnlinePDFLoader` is a legacy function, and works specifically with `UnstructuredPDFLoader`.
+Note: all other PDF loaders can also be used to fetch remote PDFs, but `OnlinePDFLoader` is a legacy function, and works specifically with `UnstructuredPDFLoader`.
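As a minimal sketch (the URL is a placeholder):

```python
from langchain.document_loaders import OnlinePDFLoader

loader = OnlinePDFLoader("https://example.com/some-paper.pdf")  # placeholder URL
data = loader.load()
```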
@@ -208,7 +208,7 @@ data = loader.load()
### Using PDFMiner to generate HTML text
-This can be helpful for chunking texts semantically into sections as the output html content can be parsed via `BeautifulSoup` to get more structured and rich information about font size, page numbers, pdf headers/footers, etc.
+This can be helpful for chunking texts semantically into sections, as the output HTML content can be parsed via `BeautifulSoup` to get more structured and rich information about font size, page numbers, PDF headers/footers, etc.
```python
@@ -222,7 +222,7 @@ loader = PDFMinerPDFasHTMLLoader("example_data/layout-parser-paper.pdf")
```python
-data = loader.load()[0] # entire pdf is loaded as a single Document
+data = loader.load()[0] # entire PDF is loaded as a single Document
```
@@ -259,7 +259,7 @@ for c in content:
cur_text = c.text
snippets.append((cur_text,cur_fs))
# Note: The above logic is very straightforward. One can also add more strategies such as removing duplicate snippets (as
-# headers/footers in a PDF appear on multiple pages so if we find duplicatess safe to assume that it is redundant info)
+# headers/footers in a PDF appear on multiple pages so if we find duplicates it's safe to assume that it is redundant info)
```
@@ -285,7 +285,7 @@ for s in snippets:
continue
# if current snippet's font size > previous section's content but less than previous section's heading then also make a new
-# section (e.g. title of a pdf will have the highest font size but we don't want it to subsume all sections)
+# section (e.g. title of a PDF will have the highest font size but we don't want it to subsume all sections)
metadata={'heading':s[0], 'content_font': 0, 'heading_font': s[1]}
metadata.update(data.metadata)
semantic_snippets.append(Document(page_content='',metadata=metadata))
@@ -358,7 +358,7 @@ loader = PyPDFDirectoryLoader("example_data/")
docs = loader.load()
```
-## Using pdfplumber
+## Using PDFPlumber
Like PyMuPDF, the output Documents contain detailed metadata about the PDF and its pages, and the loader returns one document per page.
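A short sketch of the loader in use (reusing the example file from earlier in this page):

```python
from langchain.document_loaders import PDFPlumberLoader

loader = PDFPlumberLoader("example_data/layout-parser-paper.pdf")
data = loader.load()  # one Document per page, with detailed metadata
```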

View File

@@ -50,7 +50,7 @@ RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)
## Python
-Here's an example using the PythonTextSplitter
+Here's an example using the PythonTextSplitter:
```python
@@ -78,7 +78,7 @@ python_docs
</CodeOutputBlock>
## JS
-Here's an example using the JS text splitter
+Here's an example using the JS text splitter:
```python
@@ -109,7 +109,7 @@ js_docs
## Markdown
-Here's an example using the Markdown text splitter.
+Here's an example using the Markdown text splitter:
````python
@@ -155,7 +155,7 @@ md_docs
## Latex
-Here's an example on Latex text
+Here's an example using LaTeX text:
```python
@@ -219,7 +219,7 @@ latex_docs
## HTML
-Here's an example using an HTML text splitter
+Here's an example using an HTML text splitter:
```python
@@ -281,7 +281,7 @@ html_docs
## Solidity
-Here's an example using the Solidity text splitter
+Here's an example using the Solidity text splitter:
```python
SOL_CODE = """

View File

@@ -36,27 +36,27 @@ class BaseRetriever(ABC):
It's that simple! You can call `get_relevant_documents` or the async `aget_relevant_documents` methods to retrieve documents relevant to a query, where "relevance" is defined by
the specific retriever object you are calling.
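To make the interface concrete, here is a toy sketch (the class and its canned document are purely illustrative):

```python
from typing import List
from langchain.schema import Document

class ToyRetriever:
    """Satisfies the retriever interface above with a canned answer."""

    def get_relevant_documents(self, query: str) -> List[Document]:
        # "relevance" here is trivial: every query gets the same document back
        return [Document(page_content=f"Echo: {query}")]
```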
-Of course, we also help construct what we think useful Retrievers are. The main type of Retriever that we focus on is a Vectorstore retriever. We will focus on that for the rest of this guide.
+Of course, we also help construct what we think useful retrievers are. The main type of retriever that we focus on is a vector store retriever. We will focus on that for the rest of this guide.
-In order to understand what a vectorstore retriever is, it's important to understand what a Vectorstore is. So let's look at that.
+In order to understand what a vector store retriever is, it's important to understand what a vector store is. So let's look at that.
-By default, LangChain uses [Chroma](/docs/ecosystem/integrations/chroma.html) as the vectorstore to index and search embeddings. To walk through this tutorial, we'll first need to install `chromadb`.
+By default, LangChain uses [Chroma](/docs/ecosystem/integrations/chroma.html) as the vector store to index and search embeddings. To walk through this tutorial, we'll first need to install `chromadb`.
```
pip install chromadb
```
This example showcases question answering over documents.
-We have chosen this as the example for getting started because it nicely combines a lot of different elements (Text splitters, embeddings, vectorstores) and then also shows how to use them in a chain.
+We have chosen this as the example for getting started because it nicely combines a lot of different elements (text splitters, embeddings, vector stores) and then also shows how to use them in a chain.
Question answering over documents consists of four steps:
1. Create an index
-2. Create a Retriever from that index
+2. Create a retriever from that index
3. Create a question answering chain
4. Ask questions!
-Each of the steps has multiple sub steps and potential configurations. In this notebook we will primarily focus on (1). We will start by showing the one-liner for doing so, but then break down what is actually going on.
+Each of the steps has multiple substeps and potential configurations. In this notebook we will primarily focus on (1). We will start by showing the one-liner for doing so, but then break down what is actually going on.
First, let's import some common classes we'll use no matter what.
@@ -66,7 +66,7 @@ from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
```
-Next in the generic setup, let's specify the document loader we want to use. You can download the `state_of_the_union.txt` file [here](https://github.com/hwchase17/langchain/blob/master/docs/extras/modules/state_of_the_union.txt)
+Next in the generic setup, let's specify the document loader we want to use. You can download the `state_of_the_union.txt` file [here](https://github.com/hwchase17/langchain/blob/master/docs/extras/modules/state_of_the_union.txt).
```python
@@ -129,7 +129,7 @@ index.query_with_sources(query)
</CodeOutputBlock>
-What is returned from the `VectorstoreIndexCreator` is `VectorStoreIndexWrapper`, which provides these nice `query` and `query_with_sources` functionality. If we just wanted to access the vectorstore directly, we can also do that.
+What is returned from the `VectorstoreIndexCreator` is a `VectorStoreIndexWrapper`, which provides these nice `query` and `query_with_sources` functionalities. If we just want to access the vector store directly, we can also do that.
```python
@@ -144,7 +144,7 @@ index.vectorstore
</CodeOutputBlock>
-If we then want to access the VectorstoreRetriever, we can do that with:
+If we then want to access the `VectorStoreRetriever`, we can do that with:
```python
@@ -159,7 +159,7 @@ index.vectorstore.as_retriever()
</CodeOutputBlock>
-It can also be convenient to filter the vectorstore by the metadata associated with documents, particularly when your vectorstore has multiple sources. This can be done using the `query` method like so:
+It can also be convenient to filter the vector store by the metadata associated with documents, particularly when your vector store has multiple sources. This can be done using the `query` method like so:
```python
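# sketch: restrict results to one source; the filter value is illustrative
index.query(
    "Summarize this document.",
    retriever_kwargs={"search_kwargs": {"filter": {"source": "state_of_the_union.txt"}}},
)
```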
@@ -185,7 +185,7 @@ There are three main steps going on after the documents are loaded:
1. Splitting documents into chunks
2. Creating embeddings for each document
-3. Storing documents and embeddings in a vectorstore
+3. Storing documents and embeddings in a vector store
Let's walk through this in code.
@@ -211,7 +211,7 @@ from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
```
-We now create the vectorstore to use as the index.
+We now create the vector store to use as the index.
```python
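from langchain.vectorstores import Chroma

# sketch, assuming `texts` (the split documents) and `embeddings` from the steps above
db = Chroma.from_documents(texts, embeddings)
```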

View File

@@ -9,9 +9,9 @@ from langchain.schema import Document
from langchain.vectorstores import FAISS
```
-## Low Decay Rate
+## Low decay rate
-A low `decay rate` (in this, to be extreme, we will set close to 0) means memories will be "remembered" for longer. A `decay rate` of 0 means memories never be forgotten, making this retriever equivalent to the vector lookup.
+A low `decay rate` (in this case, to be extreme, we will set it close to 0) means memories will be "remembered" for longer. A `decay rate` of 0 means memories will never be forgotten, making this retriever equivalent to the vector lookup.
```python
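# sketch, assuming a FAISS vector store was built with the imports above
from langchain.retrievers import TimeWeightedVectorStoreRetriever

retriever = TimeWeightedVectorStoreRetriever(
    vectorstore=vectorstore, decay_rate=0.0000000000000000000000001, k=1
)
```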
@@ -53,7 +53,7 @@ retriever.get_relevant_documents("hello world")
</CodeOutputBlock>
-## High Decay Rate
+## High decay rate
With a high `decay rate` (e.g., several 9's), the `recency score` quickly goes to 0! If you set this all the way to 1, `recency` is 0 for all objects, once again making this equivalent to a vector lookup.
@@ -98,9 +98,9 @@ retriever.get_relevant_documents("hello world")
</CodeOutputBlock>
-## Virtual Time
+## Virtual time
-Using some utils in LangChain, you can mock out the time component
+Using some utils in LangChain, you can mock out the time component.
```python
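import datetime

from langchain.utils import mock_now

# sketch: pretend "now" is another moment so the recency scores shift accordingly
with mock_now(datetime.datetime(2023, 2, 3, 10, 11)):
    print(retriever.get_relevant_documents("hello world"))
```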

View File

@@ -34,8 +34,8 @@ retriever = db.as_retriever()
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
```
-## Maximum Marginal Relevance Retrieval
-By default, the vectorstore retriever uses similarity search. If the underlying vectorstore support maximum marginal relevance search, you can specify that as the search type.
+## Maximum marginal relevance retrieval
+By default, the vector store retriever uses similarity search. If the underlying vector store supports maximum marginal relevance search, you can specify that as the search type.
```python
@@ -47,9 +47,9 @@ retriever = db.as_retriever(search_type="mmr")
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
```
-## Similarity Score Threshold Retrieval
+## Similarity score threshold retrieval
-You can also a retrieval method that sets a similarity score threshold and only returns documents with a score above that threshold
+You can also use a retrieval method that sets a similarity score threshold and only returns documents with a score above that threshold.
```python
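# sketch: only return documents scoring above the (illustrative) threshold
retriever = db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.5},
)
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
```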

View File

@@ -1,11 +1,11 @@
## Get started
We'll use a Pinecone vector store in this example.
-First we'll want to create a `Pinecone` VectorStore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.
+First we'll want to create a `Pinecone` vector store and seed it with some data. We've created a small demo set of documents that contain summaries of movies.
-To use Pinecone, you need to have `pinecone` package installed and you must have an API key and an Environment. Here are the [installation instructions](https://docs.pinecone.io/docs/quickstart).
+To use Pinecone, you need to have the `pinecone` package installed, and you must have an API key and an environment. Here are the [installation instructions](https://docs.pinecone.io/docs/quickstart).
-NOTE: The self-query retriever requires you to have `lark` package installed.
+**Note:** The self-query retriever requires you to have the `lark` package installed.
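A hedged sketch of the setup this implies (the environment variable names are placeholders):

```python
import os

import pinecone

pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],  # placeholder env var
    environment=os.environ["PINECONE_ENVIRONMENT"],  # placeholder env var
)
```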
```python

View File

@@ -20,7 +20,7 @@ from langchain.embeddings import OpenAIEmbeddings
embeddings_model = OpenAIEmbeddings(openai_api_key="...")
```
-otherwise you can initialize without any params:
+Otherwise you can initialize without any params:
```python
from langchain.embeddings import OpenAIEmbeddings
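# sketch: with no arguments, the key is read from the OPENAI_API_KEY env var
embeddings_model = OpenAIEmbeddings()
```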

View File

@@ -1,4 +1,4 @@
-Langchain supports async operation on vector stores. All the methods might be called using their async counterparts, with the prefix `a`, meaning `async`.
+LangChain supports async operation on vector stores. All the methods might be called using their async counterparts, with the prefix `a`, meaning `async`.
`Qdrant` is a vector store that supports all the async operations, so it will be used in this walkthrough.
@@ -47,7 +47,7 @@ docs = await db.asimilarity_search_by_vector(embedding_vector)
## Maximum marginal relevance search (MMR)
-Maximal marginal relevance optimizes for similarity to query AND diversity among selected documents. It is also supported in async API.
+Maximal marginal relevance optimizes for similarity to query **and** diversity among selected documents. It is also supported in the async API.
```python
query = "What did the president say about Ketanji Brown Jackson"