Feature: pdfplumber PDF loader with BaseBlobParser (#4552)

# Feature: pdfplumber PDF loader with BaseBlobParser * Adds pdfplumber as a PDF loader * Adds pdfplumber as a blob parser.
2025-09-07 05:52:15 +00:00 · 2023-05-15 21:47:02 +08:00
parent b6e3ac17c4
commit cd3f9865f3
8 changed files with 149 additions and 4 deletions
--- a/docs/modules/indexes/document_loaders/examples/pdf.ipynb
+++ b/docs/modules/indexes/document_loaders/examples/pdf.ipynb
@@ -97,7 +97,7 @@
   },
   "outputs": [
    {
-     "name": "stdin",
+     "name": "stdout",
     "output_type": "stream",
     "text": [
      "OpenAI API Key: ········\n"
@@ -673,6 +673,68 @@
    "docs = loader.load()"
   ]
  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "45bb0415",
+   "metadata": {},
+   "source": [
+    "## Using pdfplumber\n",
+    "\n",
+    "Like PyMuPDF, the output Documents contain detailed metadata about the PDF and its pages, and returns one document per page."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "aefa758d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.document_loaders import PDFPlumberLoader"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "049e9d9a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = PDFPlumberLoader(\"example_data/layout-parser-paper.pdf\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "a8610efa",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "8132e551",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Document(page_content='LayoutParser: A Unified Toolkit for Deep\\nLearning Based Document Image Analysis\\nZejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\\nLee4, Jacob Carlson3, and Weining Li5\\n1 Allen Institute for AI\\n1202 shannons@allenai.org\\n2 Brown University\\nruochen zhang@brown.edu\\n3 Harvard University\\nnuJ {melissadell,jacob carlson}@fas.harvard.edu\\n4 University of Washington\\nbcgl@cs.washington.edu\\n12 5 University of Waterloo\\nw422li@uwaterloo.ca\\n]VC.sc[\\nAbstract. Recentadvancesindocumentimageanalysis(DIA)havebeen\\nprimarily driven by the application of neural networks. Ideally, research\\noutcomescouldbeeasilydeployedinproductionandextendedforfurther\\ninvestigation. However, various factors like loosely organized codebases\\nand sophisticated model configurations complicate the easy reuse of im-\\n2v84351.3012:viXra portantinnovationsbyawideaudience.Thoughtherehavebeenon-going\\nefforts to improve reusability and simplify deep learning (DL) model\\ndevelopmentindisciplineslikenaturallanguageprocessingandcomputer\\nvision, none of them are optimized for challenges in the domain of DIA.\\nThis represents a major gap in the existing toolkit, as DIA is central to\\nacademicresearchacross awiderangeof disciplinesinthesocialsciences\\nand humanities. This paper introduces LayoutParser, an open-source\\nlibrary for streamlining the usage of DL in DIA research and applica-\\ntions. The core LayoutParser library comes with a set of simple and\\nintuitiveinterfacesforapplyingandcustomizingDLmodelsforlayoutde-\\ntection,characterrecognition,andmanyotherdocumentprocessingtasks.\\nTo promote extensibility, LayoutParser also incorporates a community\\nplatform for sharing both pre-trained models and full document digiti-\\nzation pipelines. We demonstrate that LayoutParser is helpful for both\\nlightweight and large-scale digitization pipelines in real-word use cases.\\nThe library is publicly available at https://layout-parser.github.io.\\nKeywords: DocumentImageAnalysis·DeepLearning·LayoutAnalysis\\n· Character Recognition · Open Source library · Toolkit.\\n1 Introduction\\nDeep Learning(DL)-based approaches are the state-of-the-art for a wide range of\\ndocumentimageanalysis(DIA)tasksincludingdocumentimageclassification[11,', metadata={'source': 'example_data/layout-parser-paper.pdf', 'file_path': 'example_data/layout-parser-paper.pdf', 'page': 1, 'total_pages': 16, 'Author': '', 'CreationDate': 'D:20210622012710Z', 'Creator': 'LaTeX with hyperref', 'Keywords': '', 'ModDate': 'D:20210622012710Z', 'PTEX.Fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'Producer': 'pdfTeX-1.40.21', 'Subject': '', 'Title': '', 'Trapped': 'False'})"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "data[0]"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -698,7 +760,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.11.3"
+   "version": "3.9.16"
  }
 },
 "nbformat": 4,