community: support advanced text extraction options for pdf documents (#20265)

**Description:**
- Updated constructors in PyPDFParser and PyPDFLoader to handle `extraction_mode` and additional kwargs, aligning with the capabilities of `PageObject.extract_text()` from pypdf.
- Added `test_pypdf_loader_with_layout` along with a corresponding example text file to validate layout extraction from PDFs.

**Issue:** fixes #19735

**Dependencies:** This change requires updating the pypdf dependency from version 3.4.0 to at least 4.0.0.

Additional changes include the addition of a new test `test_pypdf_loader_with_layout` and an example text file to ensure the functionality of layout extraction from PDFs aligns with the new capabilities.

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
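The forwarding pattern described above can be illustrated with a minimal sketch. It is not the code from this commit: the helper `extract_page_texts`, the parameter name `extraction_kwargs`, and the `FakePage` stand-in are all hypothetical, chosen so the example runs without pypdf or a PDF file. The only assumption about pypdf is that a page object exposes `extract_text()` accepting `extraction_mode` plus extra keyword arguments.

```python
from typing import Any, Dict, Iterable, Iterator, List, Optional, Tuple


def extract_page_texts(
    pages: Iterable[Any],
    extraction_mode: str = "plain",
    extraction_kwargs: Optional[Dict[str, Any]] = None,
) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, text), forwarding the extraction options to each page."""
    kwargs = extraction_kwargs or {}
    for page_number, page in enumerate(pages):
        # Each page decides how to interpret the mode and any extra options.
        yield page_number, page.extract_text(extraction_mode=extraction_mode, **kwargs)


class FakePage:
    """Hypothetical stand-in for a pypdf page object (not part of pypdf)."""

    def __init__(self, words: List[str]) -> None:
        self.words = words

    def extract_text(self, extraction_mode: str = "plain", **kwargs: Any) -> str:
        # Mimic the idea of layout mode by widening the gaps between words.
        separator = "   " if extraction_mode == "layout" else " "
        return separator.join(self.words)


if __name__ == "__main__":
    demo_pages = [FakePage(["Name", "Qty"]), FakePage(["Total"])]
    for number, text in extract_page_texts(demo_pages, extraction_mode="layout"):
        print(number, repr(text))
```

With pypdf 4.x installed, the pages of a `PdfReader` could presumably be passed in place of the `FakePage` list, in which case `extraction_mode="layout"` would select pypdf's layout-preserving extraction.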
2025-09-17 15:35:14 +00:00 · 2024-07-17 22:47:09 +02:00
parent a402de3dae
commit 034a8c7c1b
7 changed files with 101 additions and 6 deletions
--- a/templates/mongo-parent-document-retrieval/pyproject.toml
+++ b/templates/mongo-parent-document-retrieval/pyproject.toml
@@ -10,7 +10,7 @@ python = ">=3.8.1,<4.0"
 langchain = "^0.1"
 openai = "<2"
 pymongo = "^4.6.0"
-pypdf = "^3.17.0"
+pypdf = "^4.0.0"
 tiktoken = "^0.5.1"
 langchain-text-splitters = ">=0.0.1,<0.1"