How To Guides
====================================

LangChain supports many different document loaders. Below are how-to guides for working with them.

`File Loader <./examples/unstructured_file.html>`_: A walkthrough of how to use Unstructured to load files of arbitrary types (PDF, TXT, HTML, etc.).

`Directory Loader <./examples/directory_loader.html>`_: A walkthrough of how to use Unstructured to load files from a given directory.

`Notion <./examples/notion.html>`_: A walkthrough of how to load data from an arbitrary Notion DB.

`ReadTheDocs <./examples/readthedocs_documentation.html>`_: A walkthrough of how to load data from documentation generated by ReadTheDocs.

`HTML <./examples/html.html>`_: A walkthrough of how to load data from an HTML file.

`PDF <./examples/pdf.html>`_: A walkthrough of how to load data from a PDF file.

`Paged PDF <./examples/paged_pdf_splitter.html>`_: A walkthrough of how to load data from a PDF file with a text reader while tracking page numbers.

`PowerPoint <./examples/powerpoint.html>`_: A walkthrough of how to load data from a PowerPoint file.

`Email <./examples/email.html>`_: A walkthrough of how to load data from an email (``.eml``) file.

`GoogleDrive <./examples/googledrive.html>`_: A walkthrough of how to load data from Google Drive.

`Microsoft Word <./examples/microsoft_word.html>`_: A walkthrough of how to load data from Microsoft Word files.

`Obsidian <./examples/obsidian.html>`_: A walkthrough of how to load data from an Obsidian file dump.

`Roam <./examples/roam.html>`_: A walkthrough of how to load data from a Roam file export.

`YouTube <./examples/youtube.html>`_: A walkthrough of how to load the transcript from a YouTube video.

`s3 File <./examples/s3_file.html>`_: A walkthrough of how to load a file from S3.

`s3 Directory <./examples/s3_directory.html>`_: A walkthrough of how to load all files in a directory from S3.

`GCS File <./examples/gcs_file.html>`_: A walkthrough of how to load a file from Google Cloud Storage (GCS).

`GCS Directory <./examples/gcs_directory.html>`_: A walkthrough of how to load all files in a directory from Google Cloud Storage (GCS).

`Web Base <./examples/web_base.html>`_: A walkthrough of how to load all text data from webpages.

`IMSDb <./examples/imsdb.html>`_: A walkthrough of how to load all text data from an IMSDb webpage.

`AZLyrics <./examples/azlyrics.html>`_: A walkthrough of how to load all text data from an AZLyrics webpage.

`College Confidential <./examples/college_confidential.html>`_: A walkthrough of how to load all text data from a College Confidential webpage.

`Gutenberg <./examples/gutenberg.html>`_: A walkthrough of how to load data from a Gutenberg ebook text.

.. toctree::
   :maxdepth: 1
   :glob:
   :hidden:

   examples/*