docs: misc retrievers fixes (#9791)
Various miscellaneous fixes to most pages in the 'Retrievers' section of the documentation:
- "VectorStore" and "vectorstore" changed to "vector store" for consistency
- Various spelling, grammar, and formatting improvements for readability

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
@@ -19,7 +19,7 @@ print(data)
 </CodeOutputBlock>
 
-## Customizing the csv parsing and loading
+## Customizing the CSV parsing and loading
 
 See the [csv module](https://docs.python.org/3/library/csv.html) documentation for more information of what csv args are supported.
 
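For context on the `csv_args` this hunk's section mentions, a minimal sketch of customized CSV loading — the file path and field names are illustrative, not part of the commit:

```python
from langchain.document_loaders.csv_loader import CSVLoader

# csv_args is passed through to Python's csv.DictReader.
loader = CSVLoader(
    file_path="./example_data/mlb_teams_2012.csv",  # illustrative path
    csv_args={
        "delimiter": ",",
        "quotechar": '"',
        "fieldnames": ["MLB Team", "Payroll in millions", "Wins"],
    },
)
data = loader.load()
```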
@@ -1,4 +1,4 @@
-Under the hood, by default this uses the [UnstructuredLoader](/docs/integrations/document_loaders/unstructured_file.html)
+Under the hood, by default this uses the [UnstructuredLoader](/docs/integrations/document_loaders/unstructured_file.html).
 
 ```python
 from langchain.document_loaders import DirectoryLoader
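The hunk ends at the start of the page's first code sample; a minimal sketch of the `DirectoryLoader` usage it introduces (the glob pattern is illustrative):

```python
from langchain.document_loaders import DirectoryLoader

# Recursively load all Markdown files under the parent directory.
loader = DirectoryLoader("../", glob="**/*.md")
docs = loader.load()
print(len(docs))
```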
@@ -121,7 +121,7 @@ len(docs)
 </CodeOutputBlock>
 
-## Auto detect file encodings with TextLoader
+## Auto-detect file encodings with TextLoader
 
 In this example we will see some strategies that can be useful when loading a big list of arbitrary files from a directory using the `TextLoader` class.
 
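A minimal sketch of the auto-detection strategy this section describes, using `TextLoader`'s `autodetect_encoding` flag (paths illustrative):

```python
from langchain.document_loaders import DirectoryLoader, TextLoader

# Let TextLoader guess each file's encoding instead of assuming UTF-8.
text_loader_kwargs = {"autodetect_encoding": True}
loader = DirectoryLoader(
    "example_data/",  # illustrative path
    glob="**/*.txt",
    loader_cls=TextLoader,
    loader_kwargs=text_loader_kwargs,
)
docs = loader.load()
```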
@@ -212,7 +212,7 @@ loader.load()
 </HTMLOutputBlock>
 
-The file `example-non-utf8.txt` uses a different encoding the `load()` function fails with a helpful message indicating which file failed decoding.
+The file `example-non-utf8.txt` uses a different encoding, so the `load()` function fails with a helpful message indicating which file failed decoding.
 
 With the default behavior of `TextLoader` any failure to load any of the documents will fail the whole loading process and no documents are loaded.
 
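To skip unreadable files rather than failing the whole load, the docs page pairs this with `DirectoryLoader`'s `silent_errors` flag — a sketch:

```python
from langchain.document_loaders import DirectoryLoader, TextLoader

# Files that fail to decode are logged and skipped instead of aborting the load.
loader = DirectoryLoader(
    "example_data/", glob="**/*.txt", loader_cls=TextLoader, silent_errors=True
)
docs = loader.load()
```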
@@ -139,9 +139,9 @@ data[0]
 ### Fetching remote PDFs using Unstructured
 
-This covers how to load online pdfs into a document format that we can use downstream. This can be used for various online pdf sites such as https://open.umn.edu/opentextbooks/textbooks/ and https://arxiv.org/archive/
+This covers how to load online PDFs into a document format that we can use downstream. This can be used for various online PDF sites such as https://open.umn.edu/opentextbooks/textbooks/ and https://arxiv.org/archive/
 
-Note: all other pdf loaders can also be used to fetch remote PDFs, but `OnlinePDFLoader` is a legacy function, and works specifically with `UnstructuredPDFLoader`.
+Note: all other PDF loaders can also be used to fetch remote PDFs, but `OnlinePDFLoader` is a legacy function, and works specifically with `UnstructuredPDFLoader`.
 
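A minimal sketch of the `OnlinePDFLoader` usage this section covers; the URL is illustrative:

```python
from langchain.document_loaders import OnlinePDFLoader

# Any publicly reachable PDF URL should work here.
loader = OnlinePDFLoader("https://arxiv.org/pdf/2302.03803.pdf")
data = loader.load()
```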
@@ -208,7 +208,7 @@ data = loader.load()
 ### Using PDFMiner to generate HTML text
 
-This can be helpful for chunking texts semantically into sections as the output html content can be parsed via `BeautifulSoup` to get more structured and rich information about font size, page numbers, pdf headers/footers, etc.
+This can be helpful for chunking texts semantically into sections as the output html content can be parsed via `BeautifulSoup` to get more structured and rich information about font size, page numbers, PDF headers/footers, etc.
 
 ```python
@@ -222,7 +222,7 @@ loader = PDFMinerPDFasHTMLLoader("example_data/layout-parser-paper.pdf")
 
 ```python
-data = loader.load()[0]  # entire pdf is loaded as a single Document
+data = loader.load()[0]  # entire PDF is loaded as a single Document
 ```
 
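Putting the two hunks above together, the PDFMiner-to-HTML flow the section describes looks roughly like this, with the `BeautifulSoup` parsing step included:

```python
from bs4 import BeautifulSoup
from langchain.document_loaders import PDFMinerPDFasHTMLLoader

loader = PDFMinerPDFasHTMLLoader("example_data/layout-parser-paper.pdf")
data = loader.load()[0]  # entire PDF is loaded as a single Document

# PDFMiner emits each text block as a <div> with inline font-size styling.
soup = BeautifulSoup(data.page_content, "html.parser")
content = soup.find_all("div")
```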
@@ -259,7 +259,7 @@ for c in content:
         cur_text = c.text
 snippets.append((cur_text,cur_fs))
 # Note: The above logic is very straightforward. One can also add more strategies such as removing duplicate snippets (as
-# headers/footers in a PDF appear on multiple pages so if we find duplicatess safe to assume that it is redundant info)
+# headers/footers in a PDF appear on multiple pages so if we find duplicates it's safe to assume that it is redundant info)
 ```
 
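The hunk shows only the tail of the docs page's snippet-collection loop; an approximate self-contained version of it (variable names as on the page, the font-size regex assumed from PDFMiner's inline styles):

```python
import re

cur_fs = None
cur_text = ""
snippets = []  # collect (text, font size) runs of consecutive same-size blocks
for c in content:
    sp = c.find("span")
    if not sp:
        continue
    st = sp.get("style")
    if not st:
        continue
    fs = re.findall(r"font-size:(\d+)px", st)
    if not fs:
        continue
    fs = int(fs[0])
    if not cur_fs:
        cur_fs = fs
    if fs == cur_fs:
        cur_text += c.text
    else:
        snippets.append((cur_text, cur_fs))
        cur_fs = fs
        cur_text = c.text
snippets.append((cur_text, cur_fs))
```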
@@ -285,7 +285,7 @@ for s in snippets:
         continue
 
     # if current snippet's font size > previous section's content but less than previous section's heading than also make a new
-    # section (e.g. title of a pdf will have the highest font size but we don't want it to subsume all sections)
+    # section (e.g. title of a PDF will have the highest font size but we don't want it to subsume all sections)
     metadata={'heading':s[0], 'content_font': 0, 'heading_font': s[1]}
     metadata.update(data.metadata)
     semantic_snippets.append(Document(page_content='',metadata=metadata))
@@ -358,7 +358,7 @@ loader = PyPDFDirectoryLoader("example_data/")
 docs = loader.load()
 ```
 
-## Using pdfplumber
+## Using PDFPlumber
 
 Like PyMuPDF, the output Documents contain detailed metadata about the PDF and its pages, and returns one document per page.
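A minimal sketch of the loader behind the renamed PDFPlumber section; the exact metadata keys may vary:

```python
from langchain.document_loaders import PDFPlumberLoader

loader = PDFPlumberLoader("example_data/layout-parser-paper.pdf")
data = loader.load()  # one Document per page
print(data[0].metadata)  # detailed PDF/page metadata, as the section notes
```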