langchain/docs/modules/document_loaders/how_to_guides.rst
Andrew White 9011f690c6
Added PyPDF Loader and Splitter [Ready for Review] (#958)
Per discussion on Discord. This adds a PDF loader built on `PyPDF`, a
simple PDF reader. It also tracks page numbers in per-split metadata.
Here's an example:

```python
from langchain.document_loaders import PagedPDFSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings

loader = PagedPDFSplitter(chunk_size=250)
splits, metadatas = loader.load_and_split("examples/example_data/layout-parser-paper.pdf")

faiss_index = FAISS.from_texts(splits, OpenAIEmbeddings(), metadatas=metadatas)
docs = faiss_index.similarity_search("How will the community be engaged?", k=2)
for doc in docs:
    print(doc.metadata["pages"] + ":", doc.page_content)
```
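
To illustrate what tracking page numbers in per-split metadata can look like, here is a minimal, library-free sketch. It is not the `PagedPDFSplitter` implementation and does not use `PyPDF`: the function names `split_with_pages` and `page_label`, the string format of the `"pages"` value, and the fixed-width character chunking are all assumptions made for illustration, with page texts supplied as plain strings.

```python
from typing import Dict, List, Tuple


def page_label(numbers: List[int]) -> str:
    """Format a run of page numbers as "3" or "3-4" (hypothetical format)."""
    first, last = numbers[0], numbers[-1]
    return str(first) if first == last else f"{first}-{last}"


def split_with_pages(
    pages: List[str], chunk_size: int = 250
) -> Tuple[List[str], List[Dict[str, str]]]:
    """Cut page texts into chunks of at most chunk_size characters and
    record, for each chunk, which pages its text came from."""
    splits: List[str] = []
    metadatas: List[Dict[str, str]] = []
    buffer = ""                    # text collected but not yet emitted
    buffer_pages: List[int] = []   # pages that contributed to the buffer

    for number, text in enumerate(pages, start=1):
        if not text:
            continue  # an empty page contributes nothing to any chunk
        buffer += text
        buffer_pages.append(number)
        while len(buffer) >= chunk_size:
            splits.append(buffer[:chunk_size])
            metadatas.append({"pages": page_label(buffer_pages)})
            buffer = buffer[chunk_size:]
            # The buffer held fewer than chunk_size characters before this
            # page was added, so any leftover text comes from this page only.
            buffer_pages = [number] if buffer else []

    if buffer:  # emit the final, possibly shorter, chunk
        splits.append(buffer)
        metadatas.append({"pages": page_label(buffer_pages)})
    return splits, metadatas


# Stand-in page texts; a real loader would extract these from a PDF.
demo_splits, demo_metadatas = split_with_pages(["abcdef", "gh"], chunk_size=4)
for chunk, meta in zip(demo_splits, demo_metadatas):
    print(meta["pages"] + ":", chunk)
```

The two returned lists are parallel, so they can be passed together to anything that accepts texts alongside a `metadatas` list, as in the example above.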

## TODO

- [x] Learn where to add `pypdf` as dependency for building docs
- [x] Add unit test?

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
2023-02-09 23:33:18 -08:00

How To Guides
====================================

LangChain supports many different document loaders. Below are how-to guides for working with them.

`File Loader <./examples/unstructured_file.html>`_: A walkthrough of how to use Unstructured to load files of arbitrary types (PDF, TXT, HTML, etc.).

`Directory Loader <./examples/directory_loader.html>`_: A walkthrough of how to use Unstructured to load files from a given directory.

`Notion <./examples/notion.html>`_: A walkthrough of how to load data from an arbitrary Notion DB.

`ReadTheDocs <./examples/readthedocs_documentation.html>`_: A walkthrough of how to load data from documentation generated by ReadTheDocs.

`HTML <./examples/html.html>`_: A walkthrough of how to load data from an HTML file.

`PDF <./examples/pdf.html>`_: A walkthrough of how to load data from a PDF file.

`Paged PDF <./examples/paged_pdf_splitter.html>`_: A walkthrough of how to load data from a PDF file with a text reader while tracking page numbers.

`PowerPoint <./examples/powerpoint.html>`_: A walkthrough of how to load data from a PowerPoint file.

`Email <./examples/email.html>`_: A walkthrough of how to load data from an email (``.eml``) file.

`GoogleDrive <./examples/googledrive.html>`_: A walkthrough of how to load data from Google Drive.

`Microsoft Word <./examples/microsoft_word.html>`_: A walkthrough of how to load data from Microsoft Word files.

`Obsidian <./examples/obsidian.html>`_: A walkthrough of how to load data from an Obsidian file dump.

`Roam <./examples/roam.html>`_: A walkthrough of how to load data from a Roam file export.

`YouTube <./examples/youtube.html>`_: A walkthrough of how to load the transcript from a YouTube video.

`s3 File <./examples/s3_file.html>`_: A walkthrough of how to load a file from S3.

`s3 Directory <./examples/s3_directory.html>`_: A walkthrough of how to load all files in a directory from S3.

`GCS File <./examples/gcs_file.html>`_: A walkthrough of how to load a file from Google Cloud Storage (GCS).

`GCS Directory <./examples/gcs_directory.html>`_: A walkthrough of how to load all files in a directory from Google Cloud Storage (GCS).

`Web Base <./examples/web_base.html>`_: A walkthrough of how to load all text data from webpages.

`IMSDb <./examples/imsdb.html>`_: A walkthrough of how to load all text data from an IMSDb webpage.

`AZLyrics <./examples/azlyrics.html>`_: A walkthrough of how to load all text data from an AZLyrics webpage.

`College Confidential <./examples/college_confidential.html>`_: A walkthrough of how to load all text data from a College Confidential webpage.

`Gutenberg <./examples/gutenberg.html>`_: A walkthrough of how to load data from the text of a Gutenberg ebook.

.. toctree::
   :maxdepth: 1
   :glob:
   :hidden:

   examples/*