# Paged PDF Splitter

This notebook shows how to load a PDF file with `PagedPDFSplitter`, which
uses the [pypdf](https://github.com/mstamy2/PyPDF2) library to read the
file. **Note that this loader both reads and splits the PDF.**

## Comparison with other PDF readers
Compared with the `unstructured` PDF reader, this one runs locally
and does not require a model; it just extracts text. This means it will
not work for scanned documents or PDFs containing images of text.
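
One way to notice the scanned-document case early is to check how much text the extraction actually produced. The sketch below is a heuristic, not part of the loader: `looks_scanned` and its `min_chars_per_page` threshold are hypothetical names chosen for illustration, and the function takes plain page strings so it runs without a PDF.

```python
from typing import Sequence


def looks_scanned(page_texts: Sequence[str], min_chars_per_page: int = 20) -> bool:
    """Heuristically flag a PDF whose pages yielded little or no text.

    Hypothetical helper: the name and the threshold are assumptions,
    not part of LangChain or pypdf.
    """
    if not page_texts:
        # Nothing was extracted at all.
        return True
    # Count non-whitespace characters only, so blank pages score zero.
    total = sum(len("".join(text.split())) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page
```

With the loader used in this notebook, the input could plausibly be built as `[p.page_content for p in pages]`; a `True` result suggests the file needs OCR instead of plain text extraction.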

In [3]:
from langchain.document_loaders import PagedPDFSplitter

# Read the PDF and split it into documents that keep their page numbers.
loader = PagedPDFSplitter("example_data/layout-parser-paper.pdf")
pages = loader.load_and_split()

## Using with document retrieval

An advantage of this approach is that each document records its source page in `metadata["page"]`, so retrieved results can be traced back to a page number.
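
The idea can be sketched without a vector store or an API key. The `Doc` dataclass below is a hypothetical stand-in that only mimics the `page_content` and `metadata["page"]` attributes used in this notebook, and `format_hit` is an illustrative helper, not a LangChain function.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded page (not the LangChain class)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def format_hit(doc: Doc, max_chars: int = 60) -> str:
    """Prefix a short snippet of the text with the page it came from."""
    # Fall back to "?" when no page number was recorded.
    page = doc.metadata.get("page", "?")
    # Collapse runs of whitespace, then truncate the snippet.
    snippet = " ".join(doc.page_content.split())[:max_chars]
    return f"{page}: {snippet}"


hits = [
    Doc("LayoutParser Community Platform", {"page": 9}),
    Doc("text with no page recorded"),
]
for hit in hits:
    print(format_hit(hit))
```

Keeping the page number in metadata, not in the text itself, means the text can be embedded unchanged while the citation is assembled only at display time.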

In [7]:
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings

# Embed the pages, index them, then fetch the two closest matches.
faiss_index = FAISS.from_documents(pages, OpenAIEmbeddings())
docs = faiss_index.similarity_search("How will the community be engaged?", k=2)
for doc in docs:
    # Prefix each result with the page it came from.
    print(str(doc.metadata["page"]) + ":", doc.page_content)

9: 10 Z. Shen et al.
Fig. 4: Illustration of (a) the original historical Japanese document with layout
detection results and (b) a recreated version of the document image that achieves
much better character recognition recall. The reorganization algorithm rearranges
the tokens based on the their detected bounding boxes given a maximum allowed
height.
4LayoutParser Community Platform
Another focus of LayoutParser is promoting the reusability of layout detection
models and full digitization pipelines. Similar to many existing deep learning
libraries, LayoutParser comes with a community model hub for distributing
layout models. End-users can upload their self-trained models to the model hub,
and these models can be loaded into a similar interface as the currently available
LayoutParser pre-trained models. For example, the model trained on the News
Navigator dataset [17] has been incorporated in the model hub.
Beyond DL models, LayoutParser also promotes the sharing of entire doc-
ument di