Mirror of https://github.com/hwchase17/langchain.git (synced 2025-09-17 07:26:16 +00:00)
community: support advanced text extraction options for pdf documents (#20265)
**Description:**
- Updated the constructors of `PyPDFParser` and `PyPDFLoader` to accept `extraction_mode` and additional keyword arguments, matching the capabilities of `PageObject.extract_text()` in pypdf.
- Added `test_pypdf_loader_with_layout`, along with a corresponding example text file, to validate layout extraction from PDFs.

**Issue:** fixes #19735

**Dependencies:** This change requires updating the pypdf dependency from version 3.4.0 to at least 4.0.0.

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
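As a rough illustration of the constructor change described in this commit message, the sketch below stores an extraction mode plus extra keyword arguments and forwards them to each page's `extract_text()`. `SketchParser`, `FakePage`, and the `extraction_kwargs` parameter name are hypothetical stand-ins chosen for the example; they are not the actual langchain or pypdf classes. `layout_mode_space_vertically` is used only as an example of an option that pypdf's layout mode accepts.

```python
from typing import Any, Dict, List, Optional


class FakePage:
    """Hypothetical stand-in for a pypdf page; records how it was called."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.calls: List[Dict[str, Any]] = []

    def extract_text(self, extraction_mode: str = "plain", **kwargs: Any) -> str:
        # Record the arguments so the forwarding can be inspected afterwards.
        self.calls.append({"extraction_mode": extraction_mode, **kwargs})
        return self.text


class SketchParser:
    """Hypothetical parser that forwards extraction options to each page."""

    def __init__(
        self,
        extraction_mode: str = "plain",
        extraction_kwargs: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.extraction_mode = extraction_mode
        self.extraction_kwargs = extraction_kwargs or {}

    def parse(self, pages: List[FakePage]) -> List[str]:
        # Pass the stored mode and any extra options straight through.
        return [
            page.extract_text(
                extraction_mode=self.extraction_mode, **self.extraction_kwargs
            )
            for page in pages
        ]


page = FakePage("hello")
parser = SketchParser(
    extraction_mode="layout",
    extraction_kwargs={"layout_mode_space_vertically": False},
)
texts = parser.parse([page])
```

Keeping the extra options in a single dictionary means the wrapper does not need a new constructor parameter each time the underlying extraction call gains an option.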
@@ -1,3 +1,4 @@
+import re
 from pathlib import Path
 from typing import Sequence, Union
 
@@ -100,6 +101,22 @@ def test_pypdf_loader() -> None:
     assert len(docs) == 16
 
 
+def test_pypdf_loader_with_layout() -> None:
+    """Test PyPDFLoader with layout mode."""
+    file_path = Path(__file__).parent.parent / "examples/layout-parser-paper.pdf"
+    loader = PyPDFLoader(str(file_path), extraction_mode="layout")
+
+    docs = loader.load()
+    first_page = docs[0].page_content
+
+    expected = (
+        Path(__file__).parent.parent / "examples/layout-parser-paper-page-1.txt"
+    ).read_text(encoding="utf-8")
+    cleaned_first_page = re.sub(r"\x00", "", first_page)
+    cleaned_expected = re.sub(r"\x00", "", expected)
+    assert cleaned_first_page == cleaned_expected
+
+
 def test_pypdfium2_loader() -> None:
     """Test PyPDFium2Loader."""
     file_path = Path(__file__).parent.parent / "examples/hello.pdf"
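The new test strips NUL characters from both the extracted page and the expected fixture before comparing them. A minimal standalone sketch of that normalization follows; `strip_nul` is a hypothetical helper name that does not appear in the commit, and the sample strings are invented for illustration.

```python
import re


def strip_nul(text: str) -> str:
    """Remove NUL characters, as the test does before comparing strings."""
    return re.sub(r"\x00", "", text)


# Invented sample data: the extracted text carries a stray NUL character.
extracted = "Layout\x00Parser"
expected = "LayoutParser"
same = strip_nul(extracted) == strip_nul(expected)
```

Applying the same cleanup to both sides keeps the assertion symmetric, so a NUL character in either the PDF output or the fixture file does not cause a spurious mismatch.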
@@ -0,0 +1,49 @@
|
||||
LayoutParser : A Unified Toolkit for Deep
|
||||
Learning Based Document Image Analysis
|
||||
|
||||
|
||||
Zejiang Shen 1 ( ), Ruochen Zhang 2, Melissa Dell 3, Benjamin Charles Germain
|
||||
Lee 4, Jacob Carlson 3, and Weining Li 5
|
||||
|
||||
1 Allen Institute for AI
|
||||
shannons@allenai.org
|
||||
2 Brown University
|
||||
ruochen zhang@brown.edu
|
||||
3 Harvard University
|
||||
{melissadell,jacob carlson }@fas.harvard.edu
|
||||
4 University of Washington
|
||||
bcgl@cs.washington.edu
|
||||
5 University of Waterloo
|
||||
w422li@uwaterloo.ca
|
||||
|
||||
|
||||
|
||||
Abstract. Recentadvancesindocumentimageanalysis(DIA)havebeen
|
||||
primarily driven by the application of neural networks. Ideally, research
|
||||
outcomes could be easily deployed in production and extended for further
|
||||
investigation. However, various factors like loosely organized codebases
|
||||
and sophisticated model configurations complicate the easy reuse of im-
|
||||
portant innovations by awide audience. Though there havebeen on-going
|
||||
efforts to improve reusability and simplify deep learning (DL) model
|
||||
development in disciplines like natural language processing and computer
|
||||
vision, none of them are optimized for challenges in the domain of DIA.
|
||||
This represents a major gap in the existing toolkit, as DIA is central to
|
||||
academic research across a wide range of disciplines in the social sciences
|
||||
and humanities. This paper introduces LayoutParser , an open-source
|
||||
library for streamlining the usage of DL in DIA research and applica-
|
||||
tions. The core LayoutParser library comes with a set of simple and
|
||||
intuitive interfaces for applying and customizing DL models for layout de-
|
||||
tection,characterrecognition,andmanyotherdocumentprocessingtasks.
|
||||
To promote extensibility, LayoutParser also incorporates a community
|
||||
platform for sharing both pre-trained models and full document digiti-
|
||||
zation pipelines. We demonstrate that LayoutParser is helpful for both
|
||||
lightweight and large-scale digitization pipelines in real-word use cases.
|
||||
The library is publicly available at https://layout-parser.github.io .
|
||||
|
||||
Keywords: DocumentImageAnalysis ·DeepLearning ·LayoutAnalysis
|
||||
· Character Recognition · Open Source library · Toolkit.
|
||||
|
||||
1 Introduction
|
||||
|
||||
Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of
|
||||
documentimageanalysis(DIA)tasksincludingdocumentimageclassification[ 11 ,
|