docs: misc retrievers fixes (#9791)

Various miscellaneous fixes to most pages in the 'Retrievers' section of
the documentation:
- "VectorStore" and "vectorstore" changed to "vector store" for
consistency
- Various spelling, grammar, and formatting improvements for readability

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Author: seamusp
Date: 2023-09-03 20:26:49 -07:00 (committed via GitHub)
parent 8bc452a466
commit 16945c9922
39 changed files with 148 additions and 163 deletions

View File

@@ -1,4 +1,4 @@
-The simplest loader reads in a file as text and places it all into one Document.
+The simplest loader reads in a file as text and places it all into one document.
```python
from langchain.document_loaders import TextLoader
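# A minimal usage sketch; the file path below is illustrative.
loader = TextLoader("./example.txt")
docs = loader.load()  # the entire file becomes one Document
```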

View File

@@ -19,7 +19,7 @@ print(data)
</CodeOutputBlock>
-## Customizing the csv parsing and loading
+## Customizing the CSV parsing and loading
See the [csv module](https://docs.python.org/3/library/csv.html) documentation for more information on what csv args are supported.
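As a rough sketch, custom parsing options can be passed through `csv_args` (the file path and field names below are illustrative):

```python
from langchain.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(
    file_path="./example_data/example.csv",  # illustrative path
    csv_args={
        "delimiter": ",",
        "quotechar": '"',
        "fieldnames": ["Team", "Payroll", "Wins"],  # illustrative headers
    },
)
data = loader.load()
```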

View File

@@ -1,4 +1,4 @@
-Under the hood, by default this uses the [UnstructuredLoader](/docs/integrations/document_loaders/unstructured_file.html)
+Under the hood, by default this uses the [UnstructuredLoader](/docs/integrations/document_loaders/unstructured_file.html).
```python
from langchain.document_loaders import DirectoryLoader
@@ -121,7 +121,7 @@ len(docs)
</CodeOutputBlock>
-## Auto detect file encodings with TextLoader
+## Auto-detect file encodings with TextLoader
In this example we will see some strategies that can be useful when loading a big list of arbitrary files from a directory using the `TextLoader` class.
@@ -212,7 +212,7 @@ loader.load()
</HTMLOutputBlock>
-The file `example-non-utf8.txt` uses a different encoding the `load()` function fails with a helpful message indicating which file failed decoding.
+The file `example-non-utf8.txt` uses a different encoding, so the `load()` function fails with a helpful message indicating which file failed decoding.
With the default behavior of `TextLoader`, any failure to load any of the documents will fail the whole loading process, and no documents are loaded.
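As a sketch of two possible mitigations (paths are illustrative), you can either skip the failing files or let `TextLoader` try to detect each file's encoding:

```python
from langchain.document_loaders import DirectoryLoader, TextLoader

# Option A: log failures and keep going instead of aborting the whole load
loader = DirectoryLoader(
    "./example_data/", glob="**/*.txt", loader_cls=TextLoader, silent_errors=True
)

# Option B: ask TextLoader to auto-detect each file's encoding before failing
loader = DirectoryLoader(
    "./example_data/",
    glob="**/*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"autodetect_encoding": True},
)
docs = loader.load()
```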

View File

@@ -139,9 +139,9 @@ data[0]
### Fetching remote PDFs using Unstructured
-This covers how to load online pdfs into a document format that we can use downstream. This can be used for various online pdf sites such as https://open.umn.edu/opentextbooks/textbooks/ and https://arxiv.org/archive/
+This covers how to load online PDFs into a document format that we can use downstream. This can be used for various online PDF sites such as https://open.umn.edu/opentextbooks/textbooks/ and https://arxiv.org/archive/
-Note: all other pdf loaders can also be used to fetch remote PDFs, but `OnlinePDFLoader` is a legacy function, and works specifically with `UnstructuredPDFLoader`.
+Note: all other PDF loaders can also be used to fetch remote PDFs, but `OnlinePDFLoader` is a legacy function, and works specifically with `UnstructuredPDFLoader`.
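As a minimal sketch (the URL is a placeholder):

```python
from langchain.document_loaders import OnlinePDFLoader

loader = OnlinePDFLoader("https://example.com/some-paper.pdf")  # placeholder URL
data = loader.load()
```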
@@ -208,7 +208,7 @@ data = loader.load()
### Using PDFMiner to generate HTML text
-This can be helpful for chunking texts semantically into sections as the output html content can be parsed via `BeautifulSoup` to get more structured and rich information about font size, page numbers, pdf headers/footers, etc.
+This can be helpful for chunking texts semantically into sections, as the output HTML content can be parsed via `BeautifulSoup` to get more structured and rich information about font size, page numbers, PDF headers/footers, etc.
```python
@@ -222,7 +222,7 @@ loader = PDFMinerPDFasHTMLLoader("example_data/layout-parser-paper.pdf")
```python
-data = loader.load()[0] # entire pdf is loaded as a single Document
+data = loader.load()[0] # entire PDF is loaded as a single Document
```
@@ -259,7 +259,7 @@ for c in content:
cur_text = c.text
snippets.append((cur_text,cur_fs))
# Note: The above logic is very straightforward. One can also add more strategies such as removing duplicate snippets (as
-# headers/footers in a PDF appear on multiple pages so if we find duplicatess safe to assume that it is redundant info)
+# headers/footers in a PDF appear on multiple pages so if we find duplicates it's safe to assume that it is redundant info)
```
@@ -285,7 +285,7 @@ for s in snippets:
continue
# if current snippet's font size > previous section's content but less than previous section's heading then also make a new
-# section (e.g. title of a pdf will have the highest font size but we don't want it to subsume all sections)
+# section (e.g. title of a PDF will have the highest font size but we don't want it to subsume all sections)
metadata={'heading':s[0], 'content_font': 0, 'heading_font': s[1]}
metadata.update(data.metadata)
semantic_snippets.append(Document(page_content='',metadata=metadata))
@@ -358,7 +358,7 @@ loader = PyPDFDirectoryLoader("example_data/")
docs = loader.load()
```
-## Using pdfplumber
+## Using PDFPlumber
Like PyMuPDF, the output Documents contain detailed metadata about the PDF and its pages, and the loader returns one document per page.
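A short sketch of the loader in use (reusing the example file from earlier in this page):

```python
from langchain.document_loaders import PDFPlumberLoader

loader = PDFPlumberLoader("example_data/layout-parser-paper.pdf")
data = loader.load()  # one Document per page, with detailed metadata
```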

View File

@@ -50,7 +50,7 @@ RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)
## Python
-Here's an example using the PythonTextSplitter
+Here's an example using the PythonTextSplitter:
```python
@@ -78,7 +78,7 @@ python_docs
</CodeOutputBlock>
## JS
-Here's an example using the JS text splitter
+Here's an example using the JS text splitter:
```python
@@ -109,7 +109,7 @@ js_docs
## Markdown
-Here's an example using the Markdown text splitter.
+Here's an example using the Markdown text splitter:
````python
@@ -155,7 +155,7 @@ md_docs
## Latex
-Here's an example on Latex text
+Here's an example using LaTeX text:
```python
@@ -219,7 +219,7 @@ latex_docs
## HTML
-Here's an example using an HTML text splitter
+Here's an example using an HTML text splitter:
```python
@@ -281,7 +281,7 @@ html_docs
## Solidity
-Here's an example using the Solidity text splitter
+Here's an example using the Solidity text splitter:
```python
SOL_CODE = """

View File

@@ -36,27 +36,27 @@ class BaseRetriever(ABC):
It's that simple! You can call `get_relevant_documents` or the async `aget_relevant_documents` methods to retrieve documents relevant to a query, where "relevance" is defined by
the specific retriever object you are calling.
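To make the interface concrete, here is a toy sketch (the class and its canned document are purely illustrative):

```python
from typing import List
from langchain.schema import Document

class ToyRetriever:
    """Satisfies the retriever interface above with a canned answer."""

    def get_relevant_documents(self, query: str) -> List[Document]:
        # "relevance" here is trivial: every query gets the same document back
        return [Document(page_content=f"Echo: {query}")]
```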
-Of course, we also help construct what we think useful Retrievers are. The main type of Retriever that we focus on is a Vectorstore retriever. We will focus on that for the rest of this guide.
+Of course, we also help construct what we think useful retrievers are. The main type of retriever that we focus on is a vector store retriever. We will focus on that for the rest of this guide.
-In order to understand what a vectorstore retriever is, it's important to understand what a Vectorstore is. So let's look at that.
+In order to understand what a vector store retriever is, it's important to understand what a vector store is. So let's look at that.
-By default, LangChain uses [Chroma](/docs/ecosystem/integrations/chroma.html) as the vectorstore to index and search embeddings. To walk through this tutorial, we'll first need to install `chromadb`.
+By default, LangChain uses [Chroma](/docs/ecosystem/integrations/chroma.html) as the vector store to index and search embeddings. To walk through this tutorial, we'll first need to install `chromadb`.
```
pip install chromadb
```
This example showcases question answering over documents.
-We have chosen this as the example for getting started because it nicely combines a lot of different elements (Text splitters, embeddings, vectorstores) and then also shows how to use them in a chain.
+We have chosen this as the example for getting started because it nicely combines a lot of different elements (text splitters, embeddings, vector stores) and then also shows how to use them in a chain.
Question answering over documents consists of four steps:
1. Create an index
-2. Create a Retriever from that index
+2. Create a retriever from that index
3. Create a question answering chain
4. Ask questions!
-Each of the steps has multiple sub steps and potential configurations. In this notebook we will primarily focus on (1). We will start by showing the one-liner for doing so, but then break down what is actually going on.
+Each of the steps has multiple substeps and potential configurations. In this notebook we will primarily focus on (1). We will start by showing the one-liner for doing so, but then break down what is actually going on.
First, let's import some common classes we'll use no matter what.
@@ -66,7 +66,7 @@ from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
```
-Next in the generic setup, let's specify the document loader we want to use. You can download the `state_of_the_union.txt` file [here](https://github.com/hwchase17/langchain/blob/master/docs/extras/modules/state_of_the_union.txt)
+Next in the generic setup, let's specify the document loader we want to use. You can download the `state_of_the_union.txt` file [here](https://github.com/hwchase17/langchain/blob/master/docs/extras/modules/state_of_the_union.txt).
```python
@@ -129,7 +129,7 @@ index.query_with_sources(query)
</CodeOutputBlock>
-What is returned from the `VectorstoreIndexCreator` is `VectorStoreIndexWrapper`, which provides these nice `query` and `query_with_sources` functionality. If we just wanted to access the vectorstore directly, we can also do that.
+What is returned from the `VectorstoreIndexCreator` is a `VectorStoreIndexWrapper`, which provides these nice `query` and `query_with_sources` functionalities. If we just want to access the vector store directly, we can also do that.
```python
@@ -144,7 +144,7 @@ index.vectorstore
</CodeOutputBlock>
-If we then want to access the VectorstoreRetriever, we can do that with:
+If we then want to access the `VectorStoreRetriever`, we can do that with:
```python
@@ -159,7 +159,7 @@ index.vectorstore.as_retriever()
</CodeOutputBlock>
-It can also be convenient to filter the vectorstore by the metadata associated with documents, particularly when your vectorstore has multiple sources. This can be done using the `query` method like so:
+It can also be convenient to filter the vector store by the metadata associated with documents, particularly when your vector store has multiple sources. This can be done using the `query` method like so:
```python
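# sketch: restrict results to one source; the filter value is illustrative
index.query(
    "Summarize this document.",
    retriever_kwargs={"search_kwargs": {"filter": {"source": "state_of_the_union.txt"}}},
)
```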
@@ -185,7 +185,7 @@ There are three main steps going on after the documents are loaded:
1. Splitting documents into chunks
2. Creating embeddings for each document
-3. Storing documents and embeddings in a vectorstore
+3. Storing documents and embeddings in a vector store
Let's walk through this in code.
@@ -211,7 +211,7 @@ from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
```
-We now create the vectorstore to use as the index.
+We now create the vector store to use as the index.
```python
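from langchain.vectorstores import Chroma

# sketch, assuming `texts` (the split documents) and `embeddings` from the steps above
db = Chroma.from_documents(texts, embeddings)
```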

View File

@@ -9,9 +9,9 @@ from langchain.schema import Document
from langchain.vectorstores import FAISS
```
-## Low Decay Rate
+## Low decay rate
-A low `decay rate` (in this, to be extreme, we will set close to 0) means memories will be "remembered" for longer. A `decay rate` of 0 means memories never be forgotten, making this retriever equivalent to the vector lookup.
+A low `decay rate` (in this case, to be extreme, we will set it close to 0) means memories will be "remembered" for longer. A `decay rate` of 0 means memories will never be forgotten, making this retriever equivalent to the vector lookup.
```python
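# sketch, assuming a FAISS vector store was built with the imports above
from langchain.retrievers import TimeWeightedVectorStoreRetriever

retriever = TimeWeightedVectorStoreRetriever(
    vectorstore=vectorstore, decay_rate=0.0000000000000000000000001, k=1
)
```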
@@ -53,7 +53,7 @@ retriever.get_relevant_documents("hello world")
</CodeOutputBlock>
-## High Decay Rate
+## High decay rate
With a high `decay rate` (e.g., several 9's), the `recency score` quickly goes to 0! If you set this all the way to 1, `recency` is 0 for all objects, once again making this equivalent to a vector lookup.
@@ -98,9 +98,9 @@ retriever.get_relevant_documents("hello world")
</CodeOutputBlock>
-## Virtual Time
+## Virtual time
-Using some utils in LangChain, you can mock out the time component
+Using some utils in LangChain, you can mock out the time component.
```python
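import datetime

from langchain.utils import mock_now

# sketch: pretend "now" is another moment so the recency scores shift accordingly
with mock_now(datetime.datetime(2023, 2, 3, 10, 11)):
    print(retriever.get_relevant_documents("hello world"))
```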

View File

@@ -34,8 +34,8 @@ retriever = db.as_retriever()
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
```
-## Maximum Marginal Relevance Retrieval
-By default, the vectorstore retriever uses similarity search. If the underlying vectorstore support maximum marginal relevance search, you can specify that as the search type.
+## Maximum marginal relevance retrieval
+By default, the vector store retriever uses similarity search. If the underlying vector store supports maximum marginal relevance search, you can specify that as the search type.
```python
@@ -47,9 +47,9 @@ retriever = db.as_retriever(search_type="mmr")
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
```
-## Similarity Score Threshold Retrieval
+## Similarity score threshold retrieval
-You can also a retrieval method that sets a similarity score threshold and only returns documents with a score above that threshold
+You can also use a retrieval method that sets a similarity score threshold and only returns documents with a score above that threshold.
```python
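# sketch: only return documents scoring above the (illustrative) threshold
retriever = db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.5},
)
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
```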

View File

@@ -1,11 +1,11 @@
## Get started
We'll use a Pinecone vector store in this example.
-First we'll want to create a `Pinecone` VectorStore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.
+First we'll want to create a `Pinecone` vector store and seed it with some data. We've created a small demo set of documents that contain summaries of movies.
-To use Pinecone, you need to have `pinecone` package installed and you must have an API key and an Environment. Here are the [installation instructions](https://docs.pinecone.io/docs/quickstart).
+To use Pinecone, you need to have the `pinecone` package installed, and you must have an API key and an environment. Here are the [installation instructions](https://docs.pinecone.io/docs/quickstart).
-NOTE: The self-query retriever requires you to have `lark` package installed.
+**Note:** The self-query retriever requires you to have the `lark` package installed.
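A hedged sketch of the setup this implies (the environment variable names are placeholders):

```python
import os

import pinecone

pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],  # placeholder env var
    environment=os.environ["PINECONE_ENVIRONMENT"],  # placeholder env var
)
```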
```python

View File

@@ -20,7 +20,7 @@ from langchain.embeddings import OpenAIEmbeddings
embeddings_model = OpenAIEmbeddings(openai_api_key="...")
```
-otherwise you can initialize without any params:
+Otherwise you can initialize without any params:
```python
from langchain.embeddings import OpenAIEmbeddings
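# sketch: with no arguments, the key is read from the OPENAI_API_KEY env var
embeddings_model = OpenAIEmbeddings()
```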

View File

@@ -1,4 +1,4 @@
-Langchain supports async operation on vector stores. All the methods might be called using their async counterparts, with the prefix `a`, meaning `async`.
+LangChain supports async operation on vector stores. All the methods might be called using their async counterparts, with the prefix `a`, meaning `async`.
`Qdrant` is a vector store that supports all the async operations, so it will be used in this walkthrough.
@@ -47,7 +47,7 @@ docs = await db.asimilarity_search_by_vector(embedding_vector)
## Maximum marginal relevance search (MMR)
-Maximal marginal relevance optimizes for similarity to query AND diversity among selected documents. It is also supported in async API.
+Maximal marginal relevance optimizes for similarity to query **and** diversity among selected documents. It is also supported in the async API.
```python
query = "What did the president say about Ketanji Brown Jackson"