mirror of
https://github.com/hwchase17/langchain.git
synced 2025-06-04 14:13:29 +00:00
**Description:** - Updated constructors in PyPDFParser and PyPDFLoader to handle `extraction_mode` and additional kwargs, aligning with the capabilities of `PageObject.extract_text()` from pypdf. - Added `test_pypdf_loader_with_layout` along with a corresponding example text file to validate layout extraction from PDFs. **Issue:** fixes #19735 **Dependencies:** This change requires updating the pypdf dependency from version 3.4.0 to at least 4.0.0. Additional changes include the addition of a new test test_pypdf_loader_with_layout and an example text file to ensure the functionality of layout extraction from PDFs aligns with the new capabilities. --------- Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com> Co-authored-by: Bagatur <baskaryan@gmail.com> Co-authored-by: Erick Friis <erick@langchain.dev>
49 lines
3.5 KiB
Plaintext
49 lines
3.5 KiB
Plaintext
LayoutParser : A Unified Toolkit for Deep
|
|
Learning Based Document Image Analysis
|
|
|
|
|
|
Zejiang Shen 1 ( ), Ruochen Zhang 2, Melissa Dell 3, Benjamin Charles Germain
|
|
Lee 4, Jacob Carlson 3, and Weining Li 5
|
|
|
|
1 Allen Institute for AI
|
|
shannons@allenai.org
|
|
2 Brown University
|
|
ruochen zhang@brown.edu
|
|
3 Harvard University
|
|
{melissadell,jacob carlson }@fas.harvard.edu
|
|
4 University of Washington
|
|
bcgl@cs.washington.edu
|
|
5 University of Waterloo
|
|
w422li@uwaterloo.ca
|
|
|
|
|
|
|
|
Abstract. Recentadvancesindocumentimageanalysis(DIA)havebeen
|
|
primarily driven by the application of neural networks. Ideally, research
|
|
outcomes could be easily deployed in production and extended for further
|
|
investigation. However, various factors like loosely organized codebases
|
|
and sophisticated model configurations complicate the easy reuse of im-
|
|
portant innovations by awide audience. Though there havebeen on-going
|
|
efforts to improve reusability and simplify deep learning (DL) model
|
|
development in disciplines like natural language processing and computer
|
|
vision, none of them are optimized for challenges in the domain of DIA.
|
|
This represents a major gap in the existing toolkit, as DIA is central to
|
|
academic research across a wide range of disciplines in the social sciences
|
|
and humanities. This paper introduces LayoutParser , an open-source
|
|
library for streamlining the usage of DL in DIA research and applica-
|
|
tions. The core LayoutParser library comes with a set of simple and
|
|
intuitive interfaces for applying and customizing DL models for layout de-
|
|
tection,characterrecognition,andmanyotherdocumentprocessingtasks.
|
|
To promote extensibility, LayoutParser also incorporates a community
|
|
platform for sharing both pre-trained models and full document digiti-
|
|
zation pipelines. We demonstrate that LayoutParser is helpful for both
|
|
lightweight and large-scale digitization pipelines in real-word use cases.
|
|
The library is publicly available at https://layout-parser.github.io .
|
|
|
|
Keywords: DocumentImageAnalysis ·DeepLearning ·LayoutAnalysis
|
|
· Character Recognition · Open Source library · Toolkit.
|
|
|
|
1 Introduction
|
|
|
|
Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of
|
|
documentimageanalysis(DIA)tasksincludingdocumentimageclassification[ 11 , |