langchain

Mirror of https://github.com/hwchase17/langchain.git, synced 2025-09-10 15:33:11 +00:00

Files

Brice Fotzo 034a8c7c1b community: support advanced text extraction options for pdf documents (#20265)

**Description:** 
- Updated the `PyPDFParser` and `PyPDFLoader` constructors to accept
`extraction_mode` and additional kwargs, aligning with the capabilities
of `PageObject.extract_text()` from pypdf.

- Added `test_pypdf_loader_with_layout`, along with a corresponding
example text file, to validate layout extraction from PDFs.
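
The forwarding described above can be sketched without pypdf installed. In this minimal sketch, `StubPage` and `SketchParser` are hypothetical stand-ins, not the langchain or pypdf classes. Only the `extraction_mode` parameter and the `extract_text()` method name come from the description; the `extraction_kwargs` argument name is an assumption made for illustration.

```python
from typing import Any, Dict, List, Optional


class StubPage:
    """Hypothetical stand-in for a pypdf page; not the real PageObject."""

    def __init__(self, text: str) -> None:
        self._text = text

    def extract_text(self, extraction_mode: str = "plain", **kwargs: Any) -> str:
        # Echo the received arguments so the forwarding is observable.
        extras = ",".join(f"{k}={v}" for k, v in sorted(kwargs.items()))
        return f"[{extraction_mode}|{extras}] {self._text}"


class SketchParser:
    """Hypothetical parser that stores extraction options at construction time."""

    def __init__(
        self,
        extraction_mode: str = "plain",
        extraction_kwargs: Optional[Dict[str, Any]] = None,  # assumed name
    ) -> None:
        self.extraction_mode = extraction_mode
        self.extraction_kwargs = extraction_kwargs or {}

    def parse(self, pages: List[StubPage]) -> List[str]:
        # Pass the stored options through to every page's extract_text() call.
        return [
            page.extract_text(
                extraction_mode=self.extraction_mode, **self.extraction_kwargs
            )
            for page in pages
        ]
```

Storing the options on the parser, rather than passing them per call, keeps the loader's lazy page iteration unchanged: the mode is chosen once and applied to every page.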

**Issue:** fixes #19735 

**Dependencies:** This change requires updating the pypdf dependency
from version 3.4.0 to at least 4.0.0.
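
A rough sketch of how the minimum-version requirement could be checked before relying on layout extraction. The helper names `version_tuple` and `meets_minimum` are hypothetical and not part of langchain or pypdf. The parsing is deliberately simplified: it compares numeric components only, so a pre-release suffix such as `rc1` is ignored.

```python
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '4.0.0' into (4, 0, 0), keeping only the leading digits of each part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "4.0.0") -> bool:
    """Return True if `installed` is at least `minimum` (numeric parts only)."""
    a, b = version_tuple(installed), version_tuple(minimum)
    # Pad the shorter tuple with zeros so '4' compares equal to '4.0.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

In practice the installed version string could be read with `importlib.metadata.version("pypdf")` from the standard library and passed to `meets_minimum`.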

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Erick Friis <erick@langchain.dev>

2024-07-17 20:47:09 +00:00

examples

…

integration_tests

community: support advanced text extraction options for pdf documents (#20265)

2024-07-17 20:47:09 +00:00

unit_tests

community[minor]: Add ApertureDB as a vectorstore (#24088)

2024-07-16 09:32:59 -07:00

__init__.py

…

data.py

infra: update mypy 1.10, ruff 0.5 (#23721)

2024-07-03 10:33:27 -07:00