mirror of https://github.com/hwchase17/langchain.git synced 2025-08-22 19:08:40 +00:00

Nadeem Sajjad eaf2fb287f community(pypdfloader): added page_label in metadata for pypdf loader (#29225)

# Description

## Summary

This PR adds support for handling multi-labeled page numbers in the PyPDFLoader. Some PDFs use complex page numbering systems where the actual content may begin after multiple introductory pages. The page_label field helps accurately reflect the document’s page structure, making it easier to handle such cases during document parsing.

## Motivation

This feature improves document parsing accuracy by allowing users to access the actual page labels instead of relying only on the physical page numbers. This is particularly useful for documents where the first few pages have roman numerals or other non-standard page labels.

## Use Case

This feature is especially useful for Retrieval-Augmented Generation (RAG) systems where users may reference page numbers when asking questions. Some PDFs have both labeled page numbers (like roman numerals for introductory sections) and index-based page numbers.

For example, a user might ask: "What is mentioned on page 5?" The system can now check both:

• Index-based page number (page)
• Labeled page number (page_label)

This dual-check helps improve retrieval accuracy. Additionally, the results can be validated with an agent or tool to ensure the retrieved pages match the user’s query contextually.

## Code Changes

- Added a page_label field to the metadata of the Document class in PyPDFLoader.
- Implemented support for retrieving page_label from the pdf_reader.page_labels.
- Created a test case (test_pypdf_loader_with_multi_label_page_numbers) with a sample PDF containing multi-labeled pages (geotopo-komprimiert.pdf) [[Source of pdf](https://github.com/py-pdf/sample-files/blob/main/009-pdflatex-geotopo/GeoTopo-komprimiert.pdf)].
- Updated existing tests to ensure compatibility and verify page_label extraction.

## Tests Added

- Added a new test case for a PDF with multi-labeled pages.
- Verified both page and page_label metadata fields are correctly extracted.

## Screenshots

<img width="549" alt="image" src="https://github.com/user-attachments/assets/65db9f5c-032e-4592-926f-824777c28f33" />

2025-01-15 14:18:07 -05:00
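The dual check described in the commit message can be sketched roughly as follows. This is an illustration only, not code from the PR: plain dictionaries stand in for loaded documents, the helper name `pages_matching` is hypothetical, and the sketch assumes `page` is a 0-based index while `page_label` is the label string printed in the PDF.

```python
from typing import Dict, List, Union


def pages_matching(docs: List[Dict], requested: Union[int, str]) -> List[Dict]:
    """Return documents whose physical page number or page label equals `requested`."""
    wanted = str(requested).strip()
    matches = []
    for doc in docs:
        meta = doc["metadata"]
        # Assumption: `page` is 0-based, so the physical page number is page + 1.
        physical = str(meta["page"] + 1)
        # `page_label` may be absent for documents loaded without label support.
        label = str(meta.get("page_label", ""))
        if wanted in (physical, label):
            matches.append(doc)
    return matches


# Hypothetical metadata for a PDF with two roman-numeral front-matter pages.
docs = [
    {"page_content": "Preface", "metadata": {"page": 0, "page_label": "i"}},
    {"page_content": "Contents", "metadata": {"page": 1, "page_label": "ii"}},
    {"page_content": "Chapter 1", "metadata": {"page": 2, "page_label": "1"}},
    {"page_content": "Chapter 1, continued", "metadata": {"page": 3, "page_label": "2"}},
]

for hit in pages_matching(docs, 2):
    print(hit["metadata"], hit["page_content"])
```

A request for "page 2" returns two candidates here, the second physical page and the page labeled "2", which is why the commit message suggests validating the retrieved pages against the query context.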
langchain_community	community(pypdfloader): added page_label in metadata for pypdf loader (#29225)	2025-01-15 14:18:07 -05:00
scripts	community: adding langchain-predictionguard partner package documentation (#28832)	2024-12-20 10:51:44 -05:00
tests	community(pypdfloader): added page_label in metadata for pypdf loader (#29225)	2025-01-15 14:18:07 -05:00
extended_testing_deps.txt	community: Add configurable `VisualFeatures` to the `AzureAiServicesImageAnalysisTool` (#27444)	2024-12-16 18:30:04 +00:00
Makefile	infra: speed up unit tests (#28974)	2025-01-02 04:13:08 +00:00
poetry.lock	community[patch]: release 0.3.14 (#29019)	2025-01-03 15:34:24 -05:00
pyproject.toml	community[patch]: release 0.3.14 (#29019)	2025-01-03 15:34:24 -05:00
README.md	docs: Fix stack diagram in community README (#28685)	2024-12-12 13:33:50 -08:00

README.md

🦜️🧑‍🤝‍🧑 LangChain Community

Quick Install

pip install langchain-community

What is it?

LangChain Community contains third-party integrations that implement the base interfaces defined in LangChain Core, so they are ready to use in any LangChain application.
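The idea of an integration implementing a shared base interface can be illustrated with a small standard-library sketch. The classes below (`SketchDocument`, `SketchLoader`, `LineLoader`) are hypothetical stand-ins written for this example. They are not the actual LangChain Core classes, and the real interfaces may differ in names and methods.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class SketchDocument:
    """Stand-in for a document: text plus free-form metadata."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


class SketchLoader(ABC):
    """Stand-in base interface: subclasses only implement lazy_load()."""

    @abstractmethod
    def lazy_load(self) -> Iterator[SketchDocument]:
        """Yield documents one at a time."""

    def load(self) -> List[SketchDocument]:
        # Shared behavior that every implementation inherits.
        return list(self.lazy_load())


class LineLoader(SketchLoader):
    """Example 'integration': one document per non-empty line of text."""

    def __init__(self, text: str, source: str) -> None:
        self.text = text
        self.source = source

    def lazy_load(self) -> Iterator[SketchDocument]:
        for number, line in enumerate(self.text.splitlines()):
            if line.strip():
                yield SketchDocument(line, {"source": self.source, "line": number})


def count_documents(loader: SketchLoader) -> int:
    """Application code depends only on the base interface."""
    return len(loader.load())


loader = LineLoader("alpha\n\nbeta", source="memo")
print(count_documents(loader))
```

Because `count_documents` is written against the base class, any other loader that implements `lazy_load` could be swapped in without changing the calling code.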

For full documentation see the API reference.

📕 Releases & Versioning

langchain-community is currently on version 0.3.x

All changes will be accompanied by a patch version increase.

💁 Contributing

As an open-source project in a rapidly developing field, we are extremely open to contributions, whether it be in the form of a new feature, improved infrastructure, or better documentation.

For detailed information on how to contribute, see the Contributing Guide.
