community: support advanced text extraction options for pdf documents (#20265)

**Description:**
- Updated constructors in PyPDFParser and PyPDFLoader to handle `extraction_mode` and additional kwargs, aligning with the capabilities of `PageObject.extract_text()` from pypdf.
- Added `test_pypdf_loader_with_layout` along with a corresponding example text file to validate layout extraction from PDFs.

**Issue:** fixes #19735

**Dependencies:** This change requires updating the pypdf dependency from version 3.4.0 to at least 4.0.0.

Additional changes include the addition of a new test `test_pypdf_loader_with_layout` and an example text file to ensure the functionality of layout extraction from PDFs aligns with the new capabilities.

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
2025-09-17 07:26:16 +00:00 · 2024-07-17 22:47:09 +02:00
parent a402de3dae
commit 034a8c7c1b
7 changed files with 101 additions and 6 deletions
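
The description says the constructors now accept `extraction_mode` plus additional kwargs and pass them on to pypdf's `PageObject.extract_text()`. The forwarding pattern might look roughly like the sketch below. It is not the code from this commit: `FakePage`, `SketchParser`, and the `extraction_kwargs` parameter name are hypothetical stand-ins chosen so the example runs without pypdf or LangChain installed.

```python
from typing import Any, Dict, List, Optional


class FakePage:
    """Hypothetical stand-in for pypdf's PageObject (assumption, not the real class)."""

    def __init__(self, text: str) -> None:
        self._text = text

    def extract_text(self, extraction_mode: str = "plain", **kwargs: Any) -> str:
        # The real library rebuilds horizontal spacing in "layout" mode.
        # This stand-in only tags the output so the forwarding is observable.
        if extraction_mode not in ("plain", "layout"):
            raise ValueError(f"unsupported extraction_mode: {extraction_mode!r}")
        return f"[{extraction_mode}] {self._text}"


class SketchParser:
    """Hypothetical parser that stores extraction options and forwards them per page."""

    def __init__(
        self,
        extraction_mode: str = "plain",
        extraction_kwargs: Optional[Dict[str, Any]] = None,  # assumed name
    ) -> None:
        self.extraction_mode = extraction_mode
        self.extraction_kwargs = extraction_kwargs or {}

    def parse(self, pages: List[FakePage]) -> List[str]:
        # Every page receives the same mode and extra keyword arguments.
        return [
            page.extract_text(
                extraction_mode=self.extraction_mode, **self.extraction_kwargs
            )
            for page in pages
        ]


pages = [FakePage("LayoutParser: A Unified Toolkit")]
plain_text = SketchParser().parse(pages)
layout_text = SketchParser(extraction_mode="layout").parse(pages)
```

Keeping the default at `"plain"` means existing callers see no behaviour change. With the real loader, the test added in this commit opts in via `PyPDFLoader(str(file_path), extraction_mode="layout")`.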
--- a/libs/community/tests/integration_tests/document_loaders/test_pdf.py
+++ b/libs/community/tests/integration_tests/document_loaders/test_pdf.py
@@ -1,3 +1,4 @@
+import re
 from pathlib import Path
 from typing import Sequence, Union

@@ -100,6 +101,22 @@ def test_pypdf_loader() -> None:
    assert len(docs) == 16


+def test_pypdf_loader_with_layout() -> None:
+    """Test PyPDFLoader with layout mode."""
+    file_path = Path(__file__).parent.parent / "examples/layout-parser-paper.pdf"
+    loader = PyPDFLoader(str(file_path), extraction_mode="layout")
+
+    docs = loader.load()
+    first_page = docs[0].page_content
+
+    expected = (
+        Path(__file__).parent.parent / "examples/layout-parser-paper-page-1.txt"
+    ).read_text(encoding="utf-8")
+    cleaned_first_page = re.sub(r"\x00", "", first_page)
+    cleaned_expected = re.sub(r"\x00", "", expected)
+    assert cleaned_first_page == cleaned_expected
+
+
 def test_pypdfium2_loader() -> None:
    """Test PyPDFium2Loader."""
    file_path = Path(__file__).parent.parent / "examples/hello.pdf"
--- a/libs/community/tests/integration_tests/examples/layout-parser-paper-page-1.txt
+++ b/libs/community/tests/integration_tests/examples/layout-parser-paper-page-1.txt
@@ -0,0 +1,49 @@
+             LayoutParser         : A Uniﬁed Toolkit for Deep
+          Learning Based Document Image Analysis
+
+
+Zejiang Shen           1  (     ), Ruochen Zhang                2, Melissa Dell         3, Benjamin Charles Germain
+                                         Lee   4, Jacob Carlson            3, and Weining Li              5
+
+                                                           1  Allen Institute for AI
+                                                           shannons@allenai.org
+                                                               2  Brown University
+                                                        ruochen          zhang@brown.edu
+                                                             3  Harvard University
+                                  {melissadell,jacob                       carlson       }@fas.harvard.edu
+                                                       4  University of Washington
+                                                         bcgl@cs.washington.edu
+                                                          5  University of Waterloo
+                                                            w422li@uwaterloo.ca
+
+
+
+             Abstract.        Recentadvancesindocumentimageanalysis(DIA)havebeen
+             primarily driven by the application of neural networks. Ideally, research
+             outcomes could be easily deployed in production and extended for further
+             investigation. However, various factors like loosely organized codebases
+             and sophisticated model conﬁgurations complicate the easy reuse of im-
+             portant innovations by awide audience. Though there havebeen on-going
+             eﬀorts to improve reusability and simplify deep learning (DL) model
+             development in disciplines like natural language processing and computer
+             vision, none of them are optimized for challenges in the domain of DIA.
+             This represents a major gap in the existing toolkit, as DIA is central to
+             academic research across a wide range of disciplines in the social sciences
+             and humanities. This paper introduces                           LayoutParser           , an open-source
+             library for streamlining the usage of DL in DIA research and applica-
+             tions. The core          LayoutParser            library comes with a set of simple and
+             intuitive interfaces for applying and customizing DL models for layout de-
+             tection,characterrecognition,andmanyotherdocumentprocessingtasks.
+             To promote extensibility,                 LayoutParser            also incorporates a community
+             platform for sharing both pre-trained models and full document digiti-
+             zation pipelines. We demonstrate that                         LayoutParser            is helpful for both
+             lightweight and large-scale digitization pipelines in real-word use cases.
+             The library is publicly available at                    https://layout-parser.github.io                            .
+
+             Keywords:          DocumentImageAnalysis                                    ·DeepLearning                    ·LayoutAnalysis
+             · Character Recognition                              · Open Source library                           · Toolkit.
+
+1   Introduction
+
+Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of
+documentimageanalysis(DIA)tasksincludingdocumentimageclassiﬁcation[                                              11 ,