Commit Graph

2 Commits

Author SHA1 Message Date
Harrison Chase
05d125ac23 cr 2023-02-09 23:44:14 -08:00
Andrew White
9011f690c6
Added PyPDF Loader and Splitter [Ready for Review] (#958)
Per discussion on Discord, this adds a PDF loader that uses `PyPDF`, a
simple PDF reader. It also tracks page numbers in per-split metadata.
Here's an example:

```python
from langchain.document_loaders import PagedPDFSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings

loader = PagedPDFSplitter(chunk_size=250)
splits, metadatas = loader.load_and_split("examples/example_data/layout-parser-paper.pdf")

faiss_index = FAISS.from_texts(splits, OpenAIEmbeddings(), metadatas=metadatas)
docs = faiss_index.similarity_search("How will the community be engaged?", k=2)
for doc in docs:
    print(f'{doc.metadata["pages"]}:', doc.page_content)
```
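
To make the per-split page tracking concrete, here is a minimal sketch of the idea in plain Python. It is not the code in this PR: `split_pages` is a hypothetical helper, fixed-size character chunking is an assumption, and the `"1-2"` range format for the `pages` value is also an assumption. Only the `pages` metadata key comes from the example above.

```python
from typing import Dict, List, Tuple


def split_pages(
    pages: List[str], chunk_size: int = 250
) -> Tuple[List[str], List[Dict[str, str]]]:
    """Cut page texts into chunks of at most `chunk_size` characters and
    record, for each chunk, which pages its text came from.

    Hypothetical helper for illustration; chunks may span page boundaries.
    """
    splits: List[str] = []
    metadatas: List[Dict[str, str]] = []
    buffer = ""                    # text accumulated for the current chunk
    buffer_pages: List[int] = []   # 1-based page numbers feeding the chunk

    def flush() -> None:
        """Emit the current chunk (if any) with its page label, then reset."""
        nonlocal buffer, buffer_pages
        if buffer:
            first, last = buffer_pages[0], buffer_pages[-1]
            label = str(first) if first == last else f"{first}-{last}"
            splits.append(buffer)
            metadatas.append({"pages": label})
        buffer, buffer_pages = "", []

    for page_number, text in enumerate(pages, start=1):
        while text:
            # Take as much of this page as still fits in the current chunk.
            room = chunk_size - len(buffer)
            buffer += text[:room]
            text = text[room:]
            if not buffer_pages or buffer_pages[-1] != page_number:
                buffer_pages.append(page_number)
            if len(buffer) >= chunk_size:
                flush()
    flush()  # emit the final, possibly short, chunk
    return splits, metadatas
```

Called with a list of page strings, the helper returns two parallel lists with the same shape as the `splits, metadatas` pair unpacked in the example above, so a chunk that straddles pages 1 and 2 would carry `{"pages": "1-2"}`.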

## TODO

- [x] Learn where to add `pypdf` as a dependency for building docs
- [x] Add unit test?

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
2023-02-09 23:33:18 -08:00