langchain/libs/community/tests/unit_tests
Nadeem Sajjad eaf2fb287f
community(pypdfloader): added page_label in metadata for pypdf loader (#29225)
# Description

## Summary
This PR adds support for handling multi-labeled page numbers in the
**PyPDFLoader**. Some PDFs use complex page numbering systems where the
actual content may begin after multiple introductory pages. The
page_label field helps accurately reflect the document’s page structure,
making it easier to handle such cases during document parsing.

## Motivation
This feature improves document parsing accuracy by allowing users to
access the actual page labels instead of relying only on the physical
page numbers. This is particularly useful for documents where the first
few pages have roman numerals or other non-standard page labels.

## Use Case
This feature is especially useful for **Retrieval-Augmented Generation**
(RAG) systems where users may reference page numbers when asking
questions. Some PDFs have both labeled page numbers (like roman numerals
for introductory sections) and index-based page numbers.

For example, a user might ask:

	"What is mentioned on page 5?"

The system can now check both:
- **Index-based page number** (page)
- **Labeled page number** (page_label)

Checking both fields helps improve retrieval accuracy. Additionally, the
results can be validated with an **agent or tool** to ensure the
retrieved pages match the context of the user’s query.
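
The dual check described above could be sketched roughly as follows. This is a minimal illustration, not code from the PR: `match_page_reference` is a hypothetical helper, the metadata is represented as plain dicts rather than Document objects, and it assumes that `page` is a zero-based index while `page_label` is the label printed on the page.

```python
from typing import Any


def match_page_reference(
    pages: list[dict[str, Any]], reference: str
) -> list[dict[str, Any]]:
    """Return the metadata dicts whose page index or page label matches."""
    ref = reference.strip()
    matches = []
    for meta in pages:
        # Assumption: "page" is zero-based, so a user-facing "5" means index 4.
        index_hit = ref.isdecimal() and meta.get("page") == int(ref) - 1
        # Labels are compared as case-insensitive strings ("IV" matches "iv").
        label_hit = str(meta.get("page_label", "")).lower() == ref.lower()
        if index_hit or label_hit:
            matches.append(meta)
    return matches


# Hypothetical document: four front-matter pages labeled i-iv,
# followed by content pages labeled from "1".
labels = ["i", "ii", "iii", "iv", "1", "2", "3", "4", "5"]
pages = [{"page": i, "page_label": label} for i, label in enumerate(labels)]

hits = match_page_reference(pages, "5")
roman_hits = match_page_reference(pages, "iv")
```

Here "page 5" yields two candidates (the fifth physical page and the page labeled "5"), which is where a follow-up validation step could decide which one the user meant.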

## Code Changes

- Added a page_label field to the metadata of each Document returned by
**PyPDFLoader**.
- Implemented support for retrieving page_label from
pdf_reader.page_labels.
- Created a test case (test_pypdf_loader_with_multi_label_page_numbers)
with a sample PDF containing multi-labeled pages
(geotopo-komprimiert.pdf); see the [source of the
PDF](https://github.com/py-pdf/sample-files/blob/main/009-pdflatex-geotopo/GeoTopo-komprimiert.pdf).
- Updated existing tests to ensure compatibility and verify page_label
extraction.
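
As a rough sketch of the idea behind these changes (not the loader's actual code), per-page metadata can be built by pairing each page index with its label. The helper name `build_page_metadata` and the plain-dict output are assumptions for illustration; in practice the label list would come from something like pypdf's `PdfReader.page_labels`, which returns one label string per page.

```python
def build_page_metadata(source: str, page_labels: list[str]) -> list[dict]:
    """Pair each zero-based page index with the label shown on that page."""
    return [
        # Assumption: "page" is the zero-based physical index,
        # "page_label" is the label defined by the PDF itself.
        {"source": source, "page": index, "page_label": label}
        for index, label in enumerate(page_labels)
    ]


# Hypothetical labels for a PDF with two roman-numbered front-matter pages.
metadata = build_page_metadata("sample.pdf", ["i", "ii", "1", "2"])
```

Keeping both fields means existing code that relies on `page` keeps working, while new code can opt in to `page_label`.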

## Tests Added

- Added a new test case for a PDF with multi-labeled pages.
- Verified both page and page_label metadata fields are correctly
extracted.

## Screenshots

<img width="549" alt="image"
src="https://github.com/user-attachments/assets/65db9f5c-032e-4592-926f-824777c28f33"
/>
2025-01-15 14:18:07 -05:00
agent_toolkits
agents community: add truncation params when an openai assistant's run is created (#28158) 2024-11-27 10:53:53 -05:00
callbacks Community : Add OpenAI prompt caching and reasoning tokens tracking (#27135) 2024-12-19 09:31:13 -05:00
chains core: add kwargs support to VectorStore (#25934) 2024-12-16 18:57:57 +00:00
chat_loaders
chat_message_histories community[patch]: update dynamodb chat history to update instead of overwrite (#22397) 2024-12-16 10:38:00 -05:00
chat_models community: add new parameter default_headers (#28700) 2024-12-18 22:33:23 +00:00
cross_encoders
data community: Resolve refs recursively when generating openai_fn from OpenAPI spec (#19002) 2024-09-02 13:17:39 -07:00
docstore
document_compressors community: add InfinityRerank (#27043) 2024-11-06 17:26:30 -08:00
document_loaders community(pypdfloader): added page_label in metadata for pypdf loader (#29225) 2025-01-15 14:18:07 -05:00
document_transformers community: fallback on core async atransform_documents method for MarkdownifyTransformer (#27866) 2024-12-13 22:32:22 +00:00
embeddings Community: LlamaCppEmbeddings embed_documents and embed_query (#28827) 2024-12-23 09:50:22 -05:00
evaluation
examples
graph_vectorstores [community]: Render documents to graphviz (#24830) 2024-12-14 02:02:09 +00:00
graphs community: Apache AGE wrapper additional edge cases. (#28151) 2024-12-16 11:28:01 -05:00
imports
indexes
llms community: update documentation and model IDs for FriendliAI provider (#28984) 2025-01-02 12:15:59 -05:00
load multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
query_constructors langchain_community: updated query constructor for Databricks Vector Search due to LangChainDeprecationWarning: filters was deprecated since langchain-community 0.2.11 and will be removed in 0.3. Please use filter instead. (#27974) 2024-12-03 16:03:53 -08:00
retrievers community: BM25Retriever preservation of document id (#27019) 2024-12-04 00:36:00 +00:00
storage
tools community: Add configurable VisualFeatures to the AzureAiServicesImageAnalysisTool (#27444) 2024-12-16 18:30:04 +00:00
utilities Community: Google Books API Tool (#27307) 2024-11-07 15:29:35 -08:00
utils partners: Use simsimd types (#25299) 2024-08-23 10:41:39 -04:00
vectorstores community: added FalkorDB vector store support i.e implementation, test, docs an… (#26245) 2024-12-16 19:37:55 +00:00
__init__.py
conftest.py
test_cache.py
test_dependencies.py infra: speed up unit tests (#28974) 2025-01-02 04:13:08 +00:00
test_document_transformers.py
test_graph_vectorstores.py core, community: move graph vectorstores to community (#26678) 2024-09-19 11:38:14 -07:00
test_imports.py
test_sql_database_schema.py
test_sql_database.py
test_sqlalchemy.py