docs: misc retrievers fixes (#9791)

Various miscellaneous fixes to most pages in the 'Retrievers' section of
the documentation:
- "VectorStore" and "vectorstore" changed to "vector store" for
consistency
- Various spelling, grammar, and formatting improvements for readability

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
seamusp
2023-09-03 20:26:49 -07:00
committed by GitHub
parent 8bc452a466
commit 16945c9922
39 changed files with 148 additions and 163 deletions


@@ -19,7 +19,7 @@ print(data)
</CodeOutputBlock>
-## Customizing the csv parsing and loading
+## Customizing the CSV parsing and loading
See the [csv module](https://docs.python.org/3/library/csv.html) documentation for more information of what csv args are supported.
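The `csv_args` idea maps directly onto the stdlib `csv` module. A minimal stdlib-only sketch (the sample data and the `;` delimiter are made up for illustration):

```python
import csv
import io

# Hypothetical sample data that uses ';' as its delimiter.
raw = "name;age\nAlice;30\nBob;25\n"

# The same kind of keyword arguments a loader can forward to csv.DictReader.
csv_args = {"delimiter": ";", "quotechar": '"'}

rows = list(csv.DictReader(io.StringIO(raw), **csv_args))
# Each row is a dict keyed by the header line.
```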


@@ -1,4 +1,4 @@
-Under the hood, by default this uses the [UnstructuredLoader](/docs/integrations/document_loaders/unstructured_file.html)
+Under the hood, by default this uses the [UnstructuredLoader](/docs/integrations/document_loaders/unstructured_file.html).
```python
from langchain.document_loaders import DirectoryLoader
@@ -121,7 +121,7 @@ len(docs)
</CodeOutputBlock>
-## Auto detect file encodings with TextLoader
+## Auto-detect file encodings with TextLoader
In this example we will see some strategies that can be useful when loading a big list of arbitrary files from a directory using the `TextLoader` class.
@@ -212,7 +212,7 @@ loader.load()
</HTMLOutputBlock>
-The file `example-non-utf8.txt` uses a different encoding the `load()` function fails with a helpful message indicating which file failed decoding.
+The file `example-non-utf8.txt` uses a different encoding, so the `load()` function fails with a helpful message indicating which file failed decoding.
With the default behavior of `TextLoader` any failure to load any of the documents will fail the whole loading process and no documents are loaded.
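One such strategy can be sketched with the stdlib alone: instead of failing on the first `UnicodeDecodeError`, fall back through a list of candidate encodings. (The fixed encoding list here is an illustrative assumption; `TextLoader`'s `autodetect_encoding` option detects the encoding rather than trying a hardcoded list.)

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) using the first candidate encoding that succeeds."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the data")

# b'caf\xe9' is not valid UTF-8, so the fallback kicks in.
text, used = decode_with_fallback("café".encode("cp1252"))
```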


@@ -139,9 +139,9 @@ data[0]
### Fetching remote PDFs using Unstructured
-This covers how to load online pdfs into a document format that we can use downstream. This can be used for various online pdf sites such as https://open.umn.edu/opentextbooks/textbooks/ and https://arxiv.org/archive/
+This covers how to load online PDFs into a document format that we can use downstream. This can be used for various online PDF sites such as https://open.umn.edu/opentextbooks/textbooks/ and https://arxiv.org/archive/
-Note: all other pdf loaders can also be used to fetch remote PDFs, but `OnlinePDFLoader` is a legacy function, and works specifically with `UnstructuredPDFLoader`.
+Note: all other PDF loaders can also be used to fetch remote PDFs, but `OnlinePDFLoader` is a legacy function, and works specifically with `UnstructuredPDFLoader`.
@@ -208,7 +208,7 @@ data = loader.load()
### Using PDFMiner to generate HTML text
-This can be helpful for chunking texts semantically into sections as the output html content can be parsed via `BeautifulSoup` to get more structured and rich information about font size, page numbers, pdf headers/footers, etc.
+This can be helpful for chunking texts semantically into sections as the output html content can be parsed via `BeautifulSoup` to get more structured and rich information about font size, page numbers, PDF headers/footers, etc.
```python
@@ -222,7 +222,7 @@ loader = PDFMinerPDFasHTMLLoader("example_data/layout-parser-paper.pdf")
```python
-data = loader.load()[0] # entire pdf is loaded as a single Document
+data = loader.load()[0] # entire PDF is loaded as a single Document
```
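The font-size extraction described here can be sketched without BeautifulSoup, using only the stdlib `html.parser`. (The inline `font-size` styles below are an assumption modeled on PDFMiner-style HTML output; the real docs parse with `BeautifulSoup`.)

```python
import re
from html.parser import HTMLParser

class FontSizeParser(HTMLParser):
    """Collect (text, font_size) snippets from spans with inline styles."""

    def __init__(self):
        super().__init__()
        self.cur_fs = None
        self.snippets = []  # (text, font_size) pairs

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style", "")
        m = re.search(r"font-size:(\d+)px", style)
        if m:
            self.cur_fs = int(m.group(1))

    def handle_data(self, data):
        if self.cur_fs is not None and data.strip():
            self.snippets.append((data.strip(), self.cur_fs))

# Made-up fragment resembling PDFMiner's HTML output.
html_doc = ('<div><span style="font-size:22px">Title</span>'
            '<span style="font-size:10px">Body text</span></div>')
p = FontSizeParser()
p.feed(html_doc)
```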
@@ -259,7 +259,7 @@ for c in content:
cur_text = c.text
snippets.append((cur_text,cur_fs))
# Note: The above logic is very straightforward. One can also add more strategies such as removing duplicate snippets (as
-# headers/footers in a PDF appear on multiple pages so if we find duplicatess safe to assume that it is redundant info)
+# headers/footers in a PDF appear on multiple pages so if we find duplicates it's safe to assume that it is redundant info)
```
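The duplicate-removal strategy that comment suggests can be sketched as follows (the snippet values are made-up stand-ins for real `(text, font_size)` pairs):

```python
# Made-up (text, font_size) snippets; the running header repeats per page.
snippets = [
    ("LayoutParser Paper", 9),
    ("1 Introduction", 16),
    ("LayoutParser Paper", 9),
    ("Deep learning approaches ...", 10),
]

seen = set()
deduped = []
for text, fs in snippets:
    if text in seen:
        continue  # repeated across pages: assume header/footer, drop it
    seen.add(text)
    deduped.append((text, fs))
```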
@@ -285,7 +285,7 @@ for s in snippets:
continue
# if current snippet's font size > previous section's content but less than previous section's heading than also make a new
-# section (e.g. title of a pdf will have the highest font size but we don't want it to subsume all sections)
+# section (e.g. title of a PDF will have the highest font size but we don't want it to subsume all sections)
metadata={'heading':s[0], 'content_font': 0, 'heading_font': s[1]}
metadata.update(data.metadata)
semantic_snippets.append(Document(page_content='',metadata=metadata))
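A minimal sketch of that heading heuristic, with plain dicts standing in for `Document` objects and made-up snippet data:

```python
# Made-up (text, font_size) snippets in reading order.
snippets = [
    ("Abstract", 12),
    ("Recent advances in document analysis ...", 9),
    ("1 Introduction", 12),
    ("Deep learning has been applied ...", 9),
]

sections = []
for text, fs in snippets:
    if not sections or fs > sections[-1]["heading_font"]:
        # Larger than the current heading: start a new section.
        sections.append({"heading": text, "heading_font": fs,
                         "content": "", "content_font": 0})
    elif sections[-1]["content_font"] == 0 or fs <= sections[-1]["content_font"]:
        # First content line, or no larger than earlier content: append.
        sections[-1]["content"] += text + " "
        sections[-1]["content_font"] = max(sections[-1]["content_font"], fs)
    else:
        # Between content and heading font sizes: treat as a new heading.
        sections.append({"heading": text, "heading_font": fs,
                         "content": "", "content_font": 0})
```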
@@ -358,7 +358,7 @@ loader = PyPDFDirectoryLoader("example_data/")
docs = loader.load()
```
-## Using pdfplumber
+## Using PDFPlumber
Like PyMuPDF, the output Documents contain detailed metadata about the PDF and its pages, and returns one document per page.