langchain/libs/community/langchain_community
Nadeem Sajjad eaf2fb287f
community(pypdfloader): added page_label in metadata for pypdf loader (#29225)
# Description

## Summary
This PR adds support for handling multi-labeled page numbers in the
**PyPDFLoader**. Some PDFs use complex page numbering systems where the
actual content may begin after multiple introductory pages. The
page_label field helps accurately reflect the document’s page structure,
making it easier to handle such cases during document parsing.

## Motivation
This feature improves document parsing accuracy by allowing users to
access the actual page labels instead of relying only on the physical
page numbers. This is particularly useful for documents where the first
few pages have roman numerals or other non-standard page labels.

## Use Case
This feature is especially useful for **Retrieval-Augmented Generation**
(RAG) systems where users may reference page numbers when asking
questions. Some PDFs have both labeled page numbers (like roman numerals
for introductory sections) and index-based page numbers.

For example, a user might ask:

	"What is mentioned on page 5?"

The system can now check both:
- **Index-based page number** (page)
- **Labeled page number** (page_label)

This dual-check helps improve retrieval accuracy. Additionally, the
results can be validated with an **agent or tool** to ensure the
retrieved pages match the user’s query contextually.
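The dual check described above might look roughly like the following sketch. It works on plain metadata dictionaries rather than real Document objects, the helper name match_pages is hypothetical, and it assumes that page is a 0-based physical index while page_label is the printed label stored as a string.

```python
from typing import Dict, List


def match_pages(
    metadatas: List[Dict[str, object]], requested: str
) -> Dict[str, List[int]]:
    """Return positions of entries matching a user-supplied page reference.

    Hypothetical helper: `metadatas` stands in for the metadata of loaded
    documents, each carrying `page` and `page_label` keys.
    """
    # Match against the printed label, e.g. "ii" or "2".
    by_label = [
        i for i, m in enumerate(metadatas) if str(m.get("page_label")) == requested
    ]

    # Match against the physical position. Assumption: `page` is 0-based,
    # so a user asking for "page 2" means index 1.
    by_index: List[int] = []
    if requested.isdigit():
        target = int(requested) - 1
        by_index = [i for i, m in enumerate(metadatas) if m.get("page") == target]

    return {"by_label": by_label, "by_index": by_index}


# Two front-matter pages with roman numerals, then the main content.
sample = [
    {"page": 0, "page_label": "i"},
    {"page": 1, "page_label": "ii"},
    {"page": 2, "page_label": "1"},
    {"page": 3, "page_label": "2"},
]

result = match_pages(sample, "2")
print(result)
```

For the sample above, a request for page "2" yields one candidate per interpretation: the entry labeled "2" and the second physical page. A downstream agent or tool could then decide which candidate fits the query.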

## Code Changes

- Added a page_label field to the metadata of each Document produced by
**PyPDFLoader**.
- Implemented retrieval of page_label from pdf_reader.page_labels.
- Created a test case (test_pypdf_loader_with_multi_label_page_numbers)
with a sample PDF containing multi-labeled pages
(geotopo-komprimiert.pdf, [source of
PDF](https://github.com/py-pdf/sample-files/blob/main/009-pdflatex-geotopo/GeoTopo-komprimiert.pdf)).
- Updated existing tests to ensure compatibility and verify page_label
extraction.
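As a rough illustration of the change listed above (not the actual loader code), per-page metadata could be assembled as follows. The function build_page_metadata and the stand-in reader object are hypothetical. The only things taken from the description are that labels come from pdf_reader.page_labels and that both page and page_label end up in the metadata.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def build_page_metadata(pdf_reader: Any, source: str) -> List[Dict[str, Any]]:
    """Build one metadata dict per page, including its page label.

    Assumption: `pdf_reader` exposes `pages` (a sequence of pages) and
    `page_labels` (a parallel sequence of label strings).
    """
    labels = pdf_reader.page_labels
    return [
        {"source": source, "page": index, "page_label": labels[index]}
        for index in range(len(pdf_reader.pages))
    ]


# Stand-in for a real reader so the sketch runs without any PDF file.
fake_reader = SimpleNamespace(
    pages=["front matter", "contents", "chapter one"],
    page_labels=["i", "ii", "1"],
)

metadata = build_page_metadata(fake_reader, "example.pdf")
for entry in metadata:
    print(entry)
```

Keeping page as the physical index and adding page_label alongside it leaves existing consumers of page untouched.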

## Tests Added

- Added a new test case for a PDF with multi-labeled pages.
- Verified both page and page_label metadata fields are correctly
extracted.

## Screenshots

<img width="549" alt="image"
src="https://github.com/user-attachments/assets/65db9f5c-032e-4592-926f-824777c28f33"
/>
2025-01-15 14:18:07 -05:00
adapters community: Add the additonnal kward 'context' for openai (#28351) 2024-12-02 16:43:30 -05:00
agent_toolkits community[patch]: fix instantiation for Slack tools (#28990) 2025-01-02 16:14:17 +00:00
agents community: add truncation params when an openai assistant's run is created (#28158) 2024-11-27 10:53:53 -05:00
callbacks community[patch]: additional check for prompt caching support (#29008) 2025-01-03 10:14:07 -05:00
chains community: Deprecate Amazon Neptune resources in langchain-community (#29191) 2025-01-14 10:23:34 -05:00
chat_loaders
chat_message_histories Fixed adding float values into DynamoDB (#26562) 2024-12-18 13:45:00 -05:00
chat_models fix chatperplexity: remove 'stream' from params in _stream method (#29173) 2025-01-13 09:31:37 -05:00
cross_encoders
docstore
document_compressors community: Fix rank-llm import paths for new 0.20.3 version (#29154) 2025-01-13 10:22:14 -05:00
document_loaders community(pypdfloader): added page_label in metadata for pypdf loader (#29225) 2025-01-15 14:18:07 -05:00
document_transformers community: fallback on core async atransform_documents method for MarkdownifyTransformer (#27866) 2024-12-13 22:32:22 +00:00
embeddings partner: Update Upstage Model Names and Remove Deprecated Model (#29093) 2025-01-08 10:13:22 -05:00
example_selectors
graph_vectorstores core: add kwargs support to VectorStore (#25934) 2024-12-16 18:57:57 +00:00
graphs community: Deprecate Amazon Neptune resources in langchain-community (#29191) 2025-01-14 10:23:34 -05:00
indexes
llms [langchain_community.llms.xinference]: fix error in xinference.py (#29216) 2025-01-15 10:11:26 -05:00
memory all: test 3.13 ci (#27197) 2024-10-25 12:56:58 -07:00
output_parsers
query_constructors community: refactor opensearch query constructor to use wildcard instead of match in the contain comparator (#26653) 2024-12-16 11:16:34 -05:00
retrievers langchain_community: Add default None values to DocumentAttributeValue class properties (#28785) 2024-12-18 09:43:04 -05:00
storage Fixed typo in llibs/community/langchain_community/storage/sql.py (#27029) 2024-10-08 17:51:26 +00:00
tools community[patch]: fix instantiation for Slack tools (#28990) 2025-01-02 16:14:17 +00:00
utilities [fix] Convert table names to list for compatibility in SQLDatabase (#29229) 2025-01-15 10:00:03 -05:00
utils
vectorstores community(azuresearch): allow to use any valid credential (#28873) 2024-12-23 10:05:48 -05:00
__init__.py
cache.py cosmosdbnosql: Added Cosmos DB NoSQL Semantic Cache Integration with tests and jupyter notebook (#24424) 2024-12-16 21:57:05 -05:00
py.typed