langchain/docs/docs
Martin Triska 7a9149f5dd
community: ZeroxPDFLoader (#27800)
# OCR-based PDF loader

This implements a [Zerox](https://github.com/getomni-ai/zerox) PDF
document loader.
Zerox takes a simple but very powerful approach to parsing PDF documents
(albeit a slower and more costly one): it converts the PDF to a series of
images and passes them to a vision model, requesting the contents in
markdown.

It is especially suitable for complex PDFs that other loaders do not
parse well.
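
The approach described above (render each page to an image, then ask a
vision model for markdown) can be sketched as follows. This is a minimal
illustration, not the Zerox or LangChain implementation: `PageDocument`,
`parse_pdf_with_vision_model`, and the stand-in `fake_vision_model` are
hypothetical names, and the rendering and model calls are replaced by
stubs so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class PageDocument:
    """Hypothetical container for one parsed page (not LangChain's Document)."""

    page_content: str
    page: int


def parse_pdf_with_vision_model(
    page_images: Iterable[bytes],
    image_to_markdown: Callable[[bytes], str],
) -> List[PageDocument]:
    """Pass each rendered page image to a vision model and collect the markdown."""
    docs = []
    for page_number, image in enumerate(page_images, start=1):
        # One model call per page: this is why the approach is slower and
        # more costly than text-extraction-based parsers.
        markdown = image_to_markdown(image)
        docs.append(PageDocument(page_content=markdown, page=page_number))
    return docs


# Stand-ins (assumptions): real code would render PDF pages to images and
# call an actual vision model here.
fake_images = [b"page-1-pixels", b"page-2-pixels"]


def fake_vision_model(image: bytes) -> str:
    return f"# Markdown for an image of {len(image)} bytes"


for doc in parse_pdf_with_vision_model(fake_images, fake_vision_model):
    print(doc.page, doc.page_content)
```

Keeping the model call behind a plain callable is only a convenience for
the sketch; it makes the per-page cost visible and lets the pipeline be
exercised without network access.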

## Example use:
```python
import os

from langchain_community.document_loaders.pdf import ZeroxPDFLoader

os.environ["OPENAI_API_KEY"] = ""  # your API key

model = "gpt-4o-mini"  # OpenAI model
pdf_url = "https://assets.ctfassets.net/f1df9zr7wr1a/soP1fjvG1Wu66HJhu3FBS/034d6ca48edb119ae77dec5ce01a8612/OpenAI_Sacra_Teardown.pdf"

loader = ZeroxPDFLoader(file_path=pdf_url, model=model)
docs = loader.load()
```

The Zerox library supports a wide range of providers/models. See the Zerox
documentation for details.

- **Dependencies:** `zerox`
- **Twitter handle:** @martintriska1


---------

Co-authored-by: Erick Friis <erickfriis@gmail.com>
2024-11-07 03:14:57 +00:00
..
_templates
additional_resources docs: make docs mdxv2 compatible (#26798) 2024-09-23 21:24:23 -07:00
changes/changelog docs: Add langchain over time (#21434) 2024-05-10 00:34:35 +00:00
concepts docs: Update messages.mdx (#27856) 2024-11-04 20:36:31 +00:00
contributing fix the grammar and markdown component (#27657) 2024-10-30 14:47:26 +00:00
example_data docs[minor]: Add "Build a PDF ingestion and Question/Answering system" tutorial (#22570) 2024-06-05 17:09:28 -07:00
how_to update llm graph transformer documentation (#27905) 2024-11-05 11:54:26 -05:00
integrations community: ZeroxPDFLoader (#27800) 2024-11-07 03:14:57 +00:00
troubleshooting/errors docs: fix more links (#27809) 2024-10-31 17:15:46 -04:00
tutorials docs: Update VectorStore api reference url in rag.ipynb (#27841) 2024-11-04 20:27:03 +00:00
versions docs: fix more links (#27809) 2024-10-31 17:15:46 -04:00
.gitignore
introduction.mdx docs: fix more links (#27809) 2024-10-31 17:15:46 -04:00
people.mdx
security.md community[minor]: add proxy support to RecursiveUrlLoader (#27364) 2024-10-16 16:29:59 +00:00