mirror of
https://github.com/hwchase17/langchain.git
synced 2025-07-05 20:58:25 +00:00
community: ZeroxPDFLoader (#27800)
# OCR-based PDF loader

This implements the [Zerox](https://github.com/getomni-ai/zerox) PDF document loader. Zerox uses a simple but very powerful (though slower and more costly) approach to parsing PDF documents: it converts the PDF into a series of images and passes them to a vision model, requesting the contents in Markdown. It is especially suitable for complex PDFs that are not parsed well by other alternatives.

## Example use:
```python
import os

from langchain_community.document_loaders.pdf import ZeroxPDFLoader

os.environ["OPENAI_API_KEY"] = ""  ## your-api-key

model = "gpt-4o-mini"  ## openai model
pdf_url = "https://assets.ctfassets.net/f1df9zr7wr1a/soP1fjvG1Wu66HJhu3FBS/034d6ca48edb119ae77dec5ce01a8612/OpenAI_Sacra_Teardown.pdf"

loader = ZeroxPDFLoader(file_path=pdf_url, model=model)
docs = loader.load()
```

The Zerox library supports a wide range of providers/models. See the Zerox documentation for details.

- **Dependencies:** `zerox`
- **Twitter handle:** @martintriska1

If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

---------

Co-authored-by: Erick Friis <erickfriis@gmail.com>
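As the docs below note, Zerox runs asynchronously and the loader drives it with `asyncio.run()`, which fails inside an already-running event loop (e.g. Jupyter) unless `nest_asyncio.apply()` is used. A minimal stdlib-only sketch of that failure mode (no Zerox involved, just `asyncio`):

```python
import asyncio


async def outer() -> str:
    # Inside an already-running event loop (as in Jupyter), a nested
    # asyncio.run() raises RuntimeError; nest_asyncio.apply() lifts
    # this restriction by allowing re-entrant event loops.
    try:
        asyncio.run(asyncio.sleep(0))
    except RuntimeError as exc:
        return str(exc)
    return "no error"


# The outer asyncio.run() starts a loop; the nested call then fails.
message = asyncio.run(outer())
print(message)  # asyncio.run() cannot be called from a running event loop
```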
This commit is contained in:
parent
53b0a99f37
commit
7a9149f5dd
277
docs/docs/integrations/document_loaders/zeroxpdfloader.ipynb
Normal file
@ -0,0 +1,277 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ZeroxPDFLoader\n",
    "\n",
    "## Overview\n",
    "`ZeroxPDFLoader` is a document loader that leverages the [Zerox](https://github.com/getomni-ai/zerox) library. Zerox converts PDF documents into images, processes them using a vision-capable language model, and generates a structured Markdown representation. This loader allows for asynchronous operations and provides page-level document extraction.\n",
    "\n",
    "### Integration details\n",
    "\n",
    "| Class | Package | Local | Serializable | JS support |\n",
    "| :--- | :--- | :---: | :---: | :---: |\n",
    "| [ZeroxPDFLoader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.ZeroxPDFLoader.html) | [langchain_community](https://python.langchain.com/api_reference/community/index.html) | ❌ | ❌ | ❌ |\n",
    "\n",
    "### Loader features\n",
    "\n",
    "| Source | Document Lazy Loading | Native Async Support |\n",
    "| :---: | :---: | :---: |\n",
    "| ZeroxPDFLoader | ✅ | ❌ |\n",
    "\n",
    "## Setup\n",
    "\n",
    "### Credentials\n",
    "Appropriate credentials need to be set up as environment variables. The loader supports a number of different models and model providers. See the _Usage_ section below for a few examples, or the [Zerox documentation](https://github.com/getomni-ai/zerox) for a full list of supported models.\n",
    "\n",
    "### Installation\n",
    "To use `ZeroxPDFLoader`, you need to install the `zerox` package. Also make sure to have `langchain-community` installed.\n",
    "\n",
    "```bash\n",
    "pip install zerox langchain-community\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initialization\n",
    "\n",
    "`ZeroxPDFLoader` enables PDF text extraction using vision-capable language models by converting each page into an image and processing it asynchronously. To use this loader, you need to specify a model and configure any necessary environment variables for Zerox, such as API keys.\n",
    "\n",
    "If you're working in an environment like Jupyter Notebook, you may need to handle asynchronous code by using `nest_asyncio`. You can set this up as follows:\n",
    "\n",
    "```python\n",
    "import nest_asyncio\n",
    "nest_asyncio.apply()\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# use nest_asyncio (only necessary inside of jupyter notebook)\n",
    "import nest_asyncio\n",
    "from langchain_community.document_loaders.pdf import ZeroxPDFLoader\n",
    "\n",
    "nest_asyncio.apply()\n",
    "\n",
    "# Specify the url or file path for the PDF you want to process\n",
    "# In this case let's use a pdf from the web\n",
    "file_path = \"https://assets.ctfassets.net/f1df9zr7wr1a/soP1fjvG1Wu66HJhu3FBS/034d6ca48edb119ae77dec5ce01a8612/OpenAI_Sacra_Teardown.pdf\"\n",
    "\n",
    "# Set up the necessary env vars for your vision model\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"\"  ## your-api-key\n",
    "\n",
    "# Initialize ZeroxPDFLoader with the desired model\n",
    "loader = ZeroxPDFLoader(file_path=file_path, model=\"gpt-4o-mini\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Document(metadata={'source': 'https://assets.ctfassets.net/f1df9zr7wr1a/soP1fjvG1Wu66HJhu3FBS/034d6ca48edb119ae77dec5ce01a8612/OpenAI_Sacra_Teardown.pdf', 'page': 1, 'num_pages': 5}, page_content='# OpenAI\\n\\nOpenAI is an AI research laboratory.\\n\\n#ai-models #ai\\n\\n## Revenue\\n- **$1,000,000,000** \\n 2023\\n\\n## Valuation\\n- **$28,000,000,000** \\n 2023\\n\\n## Growth Rate (Y/Y)\\n- **400%** \\n 2023\\n\\n## Funding\\n- **$11,300,000,000** \\n 2023\\n\\n---\\n\\n## Details\\n- **Headquarters:** San Francisco, CA\\n- **CEO:** Sam Altman\\n\\n[Visit Website](#)\\n\\n---\\n\\n## Revenue\\n### ARR ($M) | Growth\\n--- | ---\\n$1000M | 456%\\n$750M | \\n$500M | \\n$250M | $36M\\n$0 | $200M\\n\\nis on track to hit $1B in annual recurring revenue by the end of 2023, up about 400% from an estimated $200M at the end of 2022.\\n\\nOpenAI overall lost about $540M last year while developing ChatGPT, and those losses are expected to increase dramatically in 2023 with the growth in popularity of their consumer tools, with CEO Sam Altman remarking that OpenAI is likely to be \\\"the most capital-intensive startup in Silicon Valley history.\\\"\\n\\nThe reason for that is operating ChatGPT is massively expensive. One analysis of ChatGPT put the running cost at about $700,000 per day taking into account the underlying costs of GPU hours and hardware. That amount—derived from the 175 billion parameter-large architecture of GPT-3—would be even higher with the 100 trillion parameters of GPT-4.\\n\\n---\\n\\n## Valuation\\nIn April 2023, OpenAI raised its latest round of $300M at a roughly $29B valuation from Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global.\\n\\nAssuming OpenAI was at roughly $300M in ARR at the time, that would have given them a 96x forward revenue multiple.\\n\\n---\\n\\n## Product\\n\\n### ChatGPT\\n| Examples | Capabilities | Limitations |\\n|---------------------------------|-------------------------------------|------------------------------------|\\n| \\\"Explain quantum computing in simple terms\\\" | \\\"Remember what users said earlier in the conversation\\\" | May occasionally generate incorrect information |\\n| \\\"What can you give me for my dad\\'s birthday?\\\" | \\\"Allows users to follow-up questions\\\" | Limited knowledge of world events after 2021 |\\n| \\\"How do I make an HTTP request in JavaScript?\\\" | \\\"Trained to provide harmless requests\\\" | |')"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load the document and look at the first page:\n",
    "documents = loader.load()\n",
    "documents[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# OpenAI\n",
      "\n",
      "OpenAI is an AI research laboratory.\n",
      "\n",
      "#ai-models #ai\n",
      "\n",
      "## Revenue\n",
      "- **$1,000,000,000** \n",
      " 2023\n",
      "\n",
      "## Valuation\n",
      "- **$28,000,000,000** \n",
      " 2023\n",
      "\n",
      "## Growth Rate (Y/Y)\n",
      "- **400%** \n",
      " 2023\n",
      "\n",
      "## Funding\n",
      "- **$11,300,000,000** \n",
      " 2023\n",
      "\n",
      "---\n",
      "\n",
      "## Details\n",
      "- **Headquarters:** San Francisco, CA\n",
      "- **CEO:** Sam Altman\n",
      "\n",
      "[Visit Website](#)\n",
      "\n",
      "---\n",
      "\n",
      "## Revenue\n",
      "### ARR ($M) | Growth\n",
      "--- | ---\n",
      "$1000M | 456%\n",
      "$750M | \n",
      "$500M | \n",
      "$250M | $36M\n",
      "$0 | $200M\n",
      "\n",
      "is on track to hit $1B in annual recurring revenue by the end of 2023, up about 400% from an estimated $200M at the end of 2022.\n",
      "\n",
      "OpenAI overall lost about $540M last year while developing ChatGPT, and those losses are expected to increase dramatically in 2023 with the growth in popularity of their consumer tools, with CEO Sam Altman remarking that OpenAI is likely to be \"the most capital-intensive startup in Silicon Valley history.\"\n",
      "\n",
      "The reason for that is operating ChatGPT is massively expensive. One analysis of ChatGPT put the running cost at about $700,000 per day taking into account the underlying costs of GPU hours and hardware. That amount—derived from the 175 billion parameter-large architecture of GPT-3—would be even higher with the 100 trillion parameters of GPT-4.\n",
      "\n",
      "---\n",
      "\n",
      "## Valuation\n",
      "In April 2023, OpenAI raised its latest round of $300M at a roughly $29B valuation from Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global.\n",
      "\n",
      "Assuming OpenAI was at roughly $300M in ARR at the time, that would have given them a 96x forward revenue multiple.\n",
      "\n",
      "---\n",
      "\n",
      "## Product\n",
      "\n",
      "### ChatGPT\n",
      "| Examples | Capabilities | Limitations |\n",
      "|---------------------------------|-------------------------------------|------------------------------------|\n",
      "| \"Explain quantum computing in simple terms\" | \"Remember what users said earlier in the conversation\" | May occasionally generate incorrect information |\n",
      "| \"What can you give me for my dad's birthday?\" | \"Allows users to follow-up questions\" | Limited knowledge of world events after 2021 |\n",
      "| \"How do I make an HTTP request in JavaScript?\" | \"Trained to provide harmless requests\" | |\n"
     ]
    }
   ],
   "source": [
    "# Let's look at the parsed first page\n",
    "print(documents[0].page_content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lazy Load\n",
    "The loader always fetches results lazily; the `.load()` method is equivalent to exhausting the iterator returned by `.lazy_load()`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## API reference\n",
    "\n",
    "### `ZeroxPDFLoader`\n",
    "\n",
    "This loader class initializes with a file path and model type, and supports custom configurations via `zerox_kwargs` for handling Zerox-specific parameters.\n",
    "\n",
    "**Arguments**:\n",
    "- `file_path` (Union[str, Path]): Path to the PDF file.\n",
    "- `model` (str): Vision-capable model to use for processing, in the format `<provider>/<model>`. Defaults to `\"gpt-4o-mini\"`.\n",
    "  Some examples of valid values are:\n",
    "  - `model = \"gpt-4o-mini\"` (OpenAI model)\n",
    "  - `model = \"azure/gpt-4o-mini\"`\n",
    "  - `model = \"gemini/gemini-1.5-flash\"`\n",
    "  - `model = \"claude-3-opus-20240229\"`\n",
    "  - `model = \"vertex_ai/gemini-1.5-flash-001\"`\n",
    "  - See more details in the [Zerox documentation](https://github.com/getomni-ai/zerox)\n",
    "- `**zerox_kwargs` (dict): Additional Zerox-specific parameters such as API key, endpoint, etc.\n",
    "  - See the [Zerox documentation](https://github.com/getomni-ai/zerox)\n",
    "\n",
    "**Methods**:\n",
    "- `lazy_load`: Generates an iterator of `Document` instances, each representing a page of the PDF, along with metadata including page number and source.\n",
    "\n",
    "See the full API documentation [here](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.ZeroxPDFLoader.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Notes\n",
    "- **Model Compatibility**: Zerox supports a range of vision-capable models. Refer to [Zerox's GitHub documentation](https://github.com/getomni-ai/zerox) for a list of supported models and configuration details.\n",
    "- **Environment Variables**: Make sure to set required environment variables, such as `API_KEY` or endpoint details, as specified in the Zerox documentation.\n",
    "- **Asynchronous Processing**: If you encounter errors related to event loops in Jupyter Notebooks, you may need to apply `nest_asyncio` as shown in the setup section.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Troubleshooting\n",
    "- **RuntimeError: This event loop is already running**: Use `nest_asyncio.apply()` to prevent asynchronous loop conflicts in environments like Jupyter.\n",
    "- **Configuration Errors**: Verify that the `zerox_kwargs` match the expected arguments for your chosen model and that all necessary environment variables are set.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Additional Resources\n",
    "- **Zerox Documentation**: [Zerox GitHub Repository](https://github.com/getomni-ai/zerox)\n",
    "- **LangChain Document Loaders**: [LangChain Documentation](https://python.langchain.com/docs/integrations/document_loaders/)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
@ -945,5 +945,82 @@ class DocumentIntelligenceLoader(BasePDFLoader):
        yield from self.parser.parse(blob)


class ZeroxPDFLoader(BasePDFLoader):
    """
    Document loader utilizing the Zerox library:
    https://github.com/getomni-ai/zerox

    Zerox converts a PDF document to a series of images (page-wise) and
    uses a vision-capable LLM to generate a Markdown representation.

    Zerox utilizes async operations. Therefore, when using this loader
    inside Jupyter Notebook (or any environment running an event loop)
    you will need to:
    ```python
    import nest_asyncio
    nest_asyncio.apply()
    ```
    """

    def __init__(
        self,
        file_path: Union[str, Path],
        model: str = "gpt-4o-mini",
        **zerox_kwargs: Any,
    ) -> None:
        """
        Initialize the parser with arguments to be passed to the zerox function.
        Make sure to set the necessary environment variables such as API key,
        endpoint, etc. Check the zerox documentation for the list of necessary
        environment variables for any given model.

        Args:
            file_path:
                Path or URL of the PDF file.
            model:
                Vision-capable model to use. Defaults to "gpt-4o-mini".
                Hosted models are passed in the format "<provider>/<model>".
                Examples: "azure/gpt-4o-mini", "vertex_ai/gemini-1.5-flash-001".
                See more details in the zerox documentation.
            **zerox_kwargs:
                Arguments specific to the zerox function.
                See the detailed list of arguments in the zerox repository:
                https://github.com/getomni-ai/zerox/blob/main/py_zerox/pyzerox/core/zerox.py#L25
        """  # noqa: E501
        super().__init__(file_path=file_path)
        self.zerox_kwargs = zerox_kwargs
        self.model = model

    def lazy_load(self) -> Iterator[Document]:
        """
        Load documents from a PDF utilizing the zerox library:
        https://github.com/getomni-ai/zerox

        Returns:
            Iterator[Document]: An iterator over parsed Document instances.
        """
        import asyncio

        from pyzerox import zerox

        # Directly call asyncio.run to execute zerox synchronously
        zerox_output = asyncio.run(
            zerox(file_path=self.file_path, model=self.model, **self.zerox_kwargs)
        )

        # Convert zerox output to Document instances and yield them
        if len(zerox_output.pages) > 0:
            num_pages = zerox_output.pages[-1].page
            for page in zerox_output.pages:
                yield Document(
                    page_content=page.content,
                    metadata={
                        "source": self.source,
                        "page": page.page,
                        "num_pages": num_pages,
                    },
                )


# Legacy: only for backwards compatibility. Use PyPDFLoader instead
PagedPDFSplitter = PyPDFLoader
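The per-page conversion that `lazy_load` performs (one `Document` per Zerox page, with `source`, `page`, and `num_pages` metadata, where `num_pages` is taken from the last page) can be sketched without any API calls. `SimplePage`, `pages_to_records`, and the sample data below are illustrative stand-ins, not part of the Zerox or LangChain APIs:

```python
from dataclasses import dataclass


@dataclass
class SimplePage:
    # Stub mirroring the two fields lazy_load reads from a Zerox page
    content: str
    page: int


def pages_to_records(pages, source):
    # Mirrors the loader's metadata layout: source, page, num_pages
    if not pages:
        return []
    num_pages = pages[-1].page  # page number of the last page
    return [
        {
            "page_content": p.content,
            "metadata": {"source": source, "page": p.page, "num_pages": num_pages},
        }
        for p in pages
    ]


records = pages_to_records(
    [SimplePage("# Page one", 1), SimplePage("# Page two", 2)],
    source="example.pdf",
)
print(records[1]["metadata"])  # {'source': 'example.pdf', 'page': 2, 'num_pages': 2}
```

In the real loader each record is a `langchain_core.documents.Document`; the dicts here only show the shape of the data.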