docs: misc retrievers fixes (#9791)
Various miscellaneous fixes to most pages in the 'Retrievers' section of the documentation:
- "VectorStore" and "vectorstore" changed to "vector store" for consistency
- Various spelling, grammar, and formatting improvements for readability

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
@@ -19,7 +19,7 @@ print(data)
 </CodeOutputBlock>
 
-## Customizing the csv parsing and loading
+## Customizing the CSV parsing and loading
 
 See the [csv module](https://docs.python.org/3/library/csv.html) documentation for more information of what csv args are supported.
 
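For context on the `csv_args` this hunk's section mentions, a minimal sketch of customized CSV loading — the file path and field names are illustrative, not part of the commit:

```python
from langchain.document_loaders.csv_loader import CSVLoader

# csv_args is passed through to Python's csv.DictReader.
loader = CSVLoader(
    file_path="./example_data/mlb_teams_2012.csv",  # illustrative path
    csv_args={
        "delimiter": ",",
        "quotechar": '"',
        "fieldnames": ["MLB Team", "Payroll in millions", "Wins"],
    },
)
data = loader.load()
```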
@@ -1,4 +1,4 @@
-Under the hood, by default this uses the [UnstructuredLoader](/docs/integrations/document_loaders/unstructured_file.html)
+Under the hood, by default this uses the [UnstructuredLoader](/docs/integrations/document_loaders/unstructured_file.html).
 
 ```python
 from langchain.document_loaders import DirectoryLoader
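The hunk ends at the start of the page's first code sample; a minimal sketch of the `DirectoryLoader` usage it introduces (the glob pattern is illustrative):

```python
from langchain.document_loaders import DirectoryLoader

# Recursively load all Markdown files under the parent directory.
loader = DirectoryLoader("../", glob="**/*.md")
docs = loader.load()
print(len(docs))
```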
@@ -121,7 +121,7 @@ len(docs)
 </CodeOutputBlock>
 
-## Auto detect file encodings with TextLoader
+## Auto-detect file encodings with TextLoader
 
 In this example we will see some strategies that can be useful when loading a big list of arbitrary files from a directory using the `TextLoader` class.
 
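A minimal sketch of the auto-detection strategy this section describes, using `TextLoader`'s `autodetect_encoding` flag (paths illustrative):

```python
from langchain.document_loaders import DirectoryLoader, TextLoader

# Let TextLoader guess each file's encoding instead of assuming UTF-8.
text_loader_kwargs = {"autodetect_encoding": True}
loader = DirectoryLoader(
    "example_data/",  # illustrative path
    glob="**/*.txt",
    loader_cls=TextLoader,
    loader_kwargs=text_loader_kwargs,
)
docs = loader.load()
```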
@@ -212,7 +212,7 @@ loader.load()
 </HTMLOutputBlock>
 
-The file `example-non-utf8.txt` uses a different encoding the `load()` function fails with a helpful message indicating which file failed decoding.
+The file `example-non-utf8.txt` uses a different encoding, so the `load()` function fails with a helpful message indicating which file failed decoding.
 
 With the default behavior of `TextLoader` any failure to load any of the documents will fail the whole loading process and no documents are loaded.
 
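To skip unreadable files rather than failing the whole load, the docs page pairs this with `DirectoryLoader`'s `silent_errors` flag — a sketch:

```python
from langchain.document_loaders import DirectoryLoader, TextLoader

# Files that fail to decode are logged and skipped instead of aborting the load.
loader = DirectoryLoader(
    "example_data/", glob="**/*.txt", loader_cls=TextLoader, silent_errors=True
)
docs = loader.load()
```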
@@ -139,9 +139,9 @@ data[0]
 ### Fetching remote PDFs using Unstructured
 
-This covers how to load online pdfs into a document format that we can use downstream. This can be used for various online pdf sites such as https://open.umn.edu/opentextbooks/textbooks/ and https://arxiv.org/archive/
+This covers how to load online PDFs into a document format that we can use downstream. This can be used for various online PDF sites such as https://open.umn.edu/opentextbooks/textbooks/ and https://arxiv.org/archive/
 
-Note: all other pdf loaders can also be used to fetch remote PDFs, but `OnlinePDFLoader` is a legacy function, and works specifically with `UnstructuredPDFLoader`.
+Note: all other PDF loaders can also be used to fetch remote PDFs, but `OnlinePDFLoader` is a legacy function, and works specifically with `UnstructuredPDFLoader`.
 
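A minimal sketch of the `OnlinePDFLoader` usage this section covers; the URL is illustrative:

```python
from langchain.document_loaders import OnlinePDFLoader

# Any publicly reachable PDF URL should work here.
loader = OnlinePDFLoader("https://arxiv.org/pdf/2302.03803.pdf")
data = loader.load()
```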
@@ -208,7 +208,7 @@ data = loader.load()
 ### Using PDFMiner to generate HTML text
 
-This can be helpful for chunking texts semantically into sections as the output html content can be parsed via `BeautifulSoup` to get more structured and rich information about font size, page numbers, pdf headers/footers, etc.
+This can be helpful for chunking texts semantically into sections as the output html content can be parsed via `BeautifulSoup` to get more structured and rich information about font size, page numbers, PDF headers/footers, etc.
 
 ```python
@@ -222,7 +222,7 @@ loader = PDFMinerPDFasHTMLLoader("example_data/layout-parser-paper.pdf")
 
 ```python
-data = loader.load()[0]  # entire pdf is loaded as a single Document
+data = loader.load()[0]  # entire PDF is loaded as a single Document
 ```
 
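Putting the two hunks above together, the PDFMiner-to-HTML flow the section describes looks roughly like this, with the `BeautifulSoup` parsing step included:

```python
from bs4 import BeautifulSoup
from langchain.document_loaders import PDFMinerPDFasHTMLLoader

loader = PDFMinerPDFasHTMLLoader("example_data/layout-parser-paper.pdf")
data = loader.load()[0]  # entire PDF is loaded as a single Document

# PDFMiner emits each text block as a <div> with inline font-size styling.
soup = BeautifulSoup(data.page_content, "html.parser")
content = soup.find_all("div")
```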
@@ -259,7 +259,7 @@ for c in content:
         cur_text = c.text
 snippets.append((cur_text,cur_fs))
 # Note: The above logic is very straightforward. One can also add more strategies such as removing duplicate snippets (as
-# headers/footers in a PDF appear on multiple pages so if we find duplicatess safe to assume that it is redundant info)
+# headers/footers in a PDF appear on multiple pages so if we find duplicates it's safe to assume that it is redundant info)
 ```
 
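The hunk shows only the tail of the docs page's snippet-collection loop; an approximate self-contained version of it (variable names as on the page, the font-size regex assumed from PDFMiner's inline styles):

```python
import re

cur_fs = None
cur_text = ""
snippets = []  # collect (text, font size) runs of consecutive same-size blocks
for c in content:
    sp = c.find("span")
    if not sp:
        continue
    st = sp.get("style")
    if not st:
        continue
    fs = re.findall(r"font-size:(\d+)px", st)
    if not fs:
        continue
    fs = int(fs[0])
    if not cur_fs:
        cur_fs = fs
    if fs == cur_fs:
        cur_text += c.text
    else:
        snippets.append((cur_text, cur_fs))
        cur_fs = fs
        cur_text = c.text
snippets.append((cur_text, cur_fs))
```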
@@ -285,7 +285,7 @@ for s in snippets:
         continue
 
     # if current snippet's font size > previous section's content but less than previous section's heading than also make a new
-    # section (e.g. title of a pdf will have the highest font size but we don't want it to subsume all sections)
+    # section (e.g. title of a PDF will have the highest font size but we don't want it to subsume all sections)
     metadata={'heading':s[0], 'content_font': 0, 'heading_font': s[1]}
     metadata.update(data.metadata)
     semantic_snippets.append(Document(page_content='',metadata=metadata))
@@ -358,7 +358,7 @@ loader = PyPDFDirectoryLoader("example_data/")
 docs = loader.load()
 ```
 
-## Using pdfplumber
+## Using PDFPlumber
 
 Like PyMuPDF, the output Documents contain detailed metadata about the PDF and its pages, and returns one document per page.
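A minimal sketch of the loader behind the renamed PDFPlumber section; the exact metadata keys may vary:

```python
from langchain.document_loaders import PDFPlumberLoader

loader = PDFPlumberLoader("example_data/layout-parser-paper.pdf")
data = loader.load()  # one Document per page
print(data[0].metadata)  # detailed PDF/page metadata, as the section notes
```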