langchain/libs/community/langchain_community/document_loaders
Nadeem Sajjad eaf2fb287f
community(pypdfloader): added page_label in metadata for pypdf loader (#29225)
# Description

## Summary
This PR adds a page_label metadata field to **PyPDFLoader** so that PDFs
whose printed page labels differ from their physical page order are
handled correctly. Some PDFs number their front matter separately, so
the main content begins several pages into the file. The page_label
field records the label each page actually carries, which makes such
documents easier to handle during parsing.

## Motivation
This feature improves document parsing accuracy by allowing users to
access the actual page labels instead of relying only on the physical
page numbers. This is particularly useful for documents where the first
few pages have roman numerals or other non-standard page labels.

## Use Case
This feature is especially useful for **Retrieval-Augmented Generation**
(RAG) systems where users may reference page numbers when asking
questions. Some PDFs have both labeled page numbers (like roman numerals
for introductory sections) and index-based page numbers.

For example, a user might ask:

	"What is mentioned on page 5?"

The system can now check both:
- **Index-based page number** (page)
- **Labeled page number** (page_label)

This dual-check helps improve retrieval accuracy. Additionally, the
results can be validated with an **agent or tool** to ensure the
retrieved pages match the user’s query contextually.
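
The dual check described above could be sketched as follows. This is a minimal illustration, not code from the PR: `Doc` and `find_pages` are hypothetical names, the sample labels are made up, and the sketch assumes that page is a zero-based index while users count pages from 1.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded page (not the LangChain Document class)."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def find_pages(docs: list[Doc], requested: str) -> dict[str, list[Doc]]:
    """Collect candidate pages for a user-supplied page reference.

    The reference is compared against the printed label (page_label) and,
    if it is numeric, against the physical position (page), which is
    assumed to be zero-based here.
    """
    wanted = requested.strip().lower()
    if not wanted:
        return {"by_label": [], "by_index": []}

    by_label = [
        d for d in docs if str(d.metadata.get("page_label", "")).lower() == wanted
    ]

    by_index: list[Doc] = []
    if wanted.isdigit():
        index = int(wanted) - 1  # assumption: users count physical pages from 1
        by_index = [d for d in docs if d.metadata.get("page") == index]

    return {"by_label": by_label, "by_index": by_index}


# Illustrative data: four front-matter pages labeled i-iv, then body pages 1-5.
labels = ["i", "ii", "iii", "iv", "1", "2", "3", "4", "5"]
docs = [
    Doc(f"text of physical page {i + 1}", {"page": i, "page_label": label})
    for i, label in enumerate(labels)
]

matches = find_pages(docs, "5")
```

Here the two interpretations of "page 5" point at different physical pages, so a downstream agent or tool would still need to decide which candidate fits the question.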

## Code Changes

- Added a page_label field to the metadata of the Document objects
returned by **PyPDFLoader**.
- Implemented retrieval of page_label from pdf_reader.page_labels.
- Created a test case (test_pypdf_loader_with_multi_label_page_numbers)
with a sample PDF containing multi-labeled pages
(geotopo-komprimiert.pdf, [source of
PDF](https://github.com/py-pdf/sample-files/blob/main/009-pdflatex-geotopo/GeoTopo-komprimiert.pdf)).
- Updated existing tests to ensure compatibility and verify page_label
extraction.
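
As a rough sketch of the metadata shape these changes aim for (not the actual loader code), each page index can be paired with its label. The helper name `build_page_metadata`, the file name, and the sample labels are hypothetical, and the sketch assumes page is zero-based. The commented lines show how the labels could be read with pypdf's `PdfReader.page_labels`, assuming pypdf is installed.

```python
from typing import Any, Sequence


def build_page_metadata(
    source: str, page_labels: Sequence[str]
) -> list[dict[str, Any]]:
    """Pair each zero-based page index with the label printed on that page."""
    return [
        {"source": source, "page": index, "page_label": label}
        for index, label in enumerate(page_labels)
    ]


# With pypdf available, the labels could come from the reader instead:
#   from pypdf import PdfReader
#   labels = PdfReader("example.pdf").page_labels
labels = ["i", "ii", "1", "2"]  # illustrative values only
metadata = build_page_metadata("example.pdf", labels)
```

Keeping page and page_label side by side means existing code that relies on the index keeps working, while new code can opt in to the label.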

## Tests Added

- Added a new test case for a PDF with multi-labeled pages.
- Verified both page and page_label metadata fields are correctly
extracted.

2025-01-15 14:18:07 -05:00
blob_loaders all: test 3.13 ci (#27197) 2024-10-25 12:56:58 -07:00
parsers community(pypdfloader): added page_label in metadata for pypdf loader (#29225) 2025-01-15 14:18:07 -05:00
__init__.py community[patch]: Refactoring PDF loaders: 01 prepare (#29062) 2025-01-07 11:00:04 -05:00
acreom.py community[patch]: Add missing annotations (#24890) 2024-07-31 18:13:44 +00:00
airbyte_json.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
airbyte.py
airtable.py docs: fix kwargs docstring (#25010) 2024-08-02 19:54:54 -07:00
apify_dataset.py multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
arcgis_loader.py
arxiv.py docs: Arxiv docs update (#23871) 2024-07-05 11:43:51 -04:00
assemblyai.py community[patch]: docstrings update (#20301) 2024-04-11 16:23:27 -04:00
astradb.py multiple: update removal targets (#25361) 2024-08-14 09:50:39 -04:00
async_html.py community[patch]: Release 0.2.11 (#24989) 2024-08-02 20:08:44 +00:00
athena.py community: make AthenaLoader profile_name optional and fix type hint (#24958) 2024-08-05 14:28:58 +00:00
azlyrics.py
azure_ai_data.py
azure_blob_storage_container.py
azure_blob_storage_file.py
baiducloud_bos_directory.py
baiducloud_bos_file.py
base_o365.py Community: add modified_since argument to O365BaseLoader (#28708) 2024-12-13 17:30:17 +00:00
base.py
bibtex.py
bigquery.py multiple: update removal targets (#25361) 2024-08-14 09:50:39 -04:00
bilibili.py community[patch]: docstrings update (#20301) 2024-04-11 16:23:27 -04:00
blackboard.py community: add flag to toggle progress bar (#24463) 2024-07-20 13:18:02 +00:00
blockchain.py community: add supported blockchains to Blockchain Document Loader (#25428) 2024-08-23 14:39:42 +00:00
brave_search.py
browserbase.py community: updated Browserbase loader (#21757) 2024-05-16 08:21:23 -07:00
browserless.py
cassandra.py community[minor]: Add Cassandra ByteStore (#22064) 2024-05-23 10:46:23 -04:00
chatgpt.py
chm.py community: add init for unstructured file loader (#29101) 2025-01-13 09:26:00 -05:00
chromium.py community[minor]: add user agent for web scraping loaders (#22480) 2024-06-05 15:20:34 +00:00
college_confidential.py
concurrent.py community[patch]: import flattening fix (#20110) 2024-04-10 13:01:19 -04:00
confluence.py community: Fix ConfluenceLoader load() failure caused by deleted pages (#29232) 2025-01-15 09:56:23 -05:00
conllu.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
couchbase.py
csv_loader.py all: test 3.13 ci (#27197) 2024-10-25 12:56:58 -07:00
cube_semantic.py community: add missing format specifier in error log in CubeSemanticLoader (#29172) 2025-01-13 09:32:57 -05:00
datadog_logs.py
dataframe.py Update dataframe.py (#28871) 2024-12-22 19:16:16 -05:00
dedoc.py community[minor]: added new document loaders based on dedoc library (#24303) 2024-07-23 02:04:53 +00:00
diffbot.py
directory.py community: glob multiple patterns when using DirectoryLoader (#22852) 2024-06-18 09:24:50 -07:00
discord.py
doc_intelligence.py community: bytes as a source to AzureAIDocumentIntelligenceLoader (#26618) 2024-11-07 03:40:21 +00:00
docugami.py multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
docusaurus.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
dropbox.py multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
duckdb_loader.py
email.py all: test 3.13 ci (#27197) 2024-10-25 12:56:58 -07:00
epub.py community: add init for unstructured file loader (#29101) 2025-01-13 09:26:00 -05:00
etherscan.py
evernote.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
excel.py community: add init for unstructured file loader (#29101) 2025-01-13 09:26:00 -05:00
facebook_chat.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
fauna.py
figma.py
firecrawl.py Community: Updated Firecrawl Document Loader to v1 (#26548) 2024-10-15 13:13:28 +00:00
gcs_directory.py multiple: update removal targets (#25361) 2024-08-14 09:50:39 -04:00
gcs_file.py multiple: update removal targets (#25361) 2024-08-14 09:50:39 -04:00
generic.py community[patch]: import flattening fix (#20110) 2024-04-10 13:01:19 -04:00
geodataframe.py
git.py all: test 3.13 ci (#27197) 2024-10-25 12:56:58 -07:00
gitbook.py community: add flag to toggle progress bar (#24463) 2024-07-20 13:18:02 +00:00
github.py multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
glue_catalog.py community[minor]: Add glue catalog loader (#20220) 2024-04-16 11:39:23 -04:00
google_speech_to_text.py multiple: update removal targets (#25361) 2024-08-14 09:50:39 -04:00
googledrive.py multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
gutenberg.py
helpers.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
hn.py
html_bs.py multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
html.py community: add init for UnstructuredHTMLLoader to solve pathlib paths (#29091) 2025-01-08 10:19:27 -05:00
hugging_face_dataset.py
hugging_face_model.py community[patch]: Add missing annotations (#24890) 2024-07-31 18:13:44 +00:00
ifixit.py
image_captions.py all: test 3.13 ci (#27197) 2024-10-25 12:56:58 -07:00
image.py community: add init for unstructured file loader (#29101) 2025-01-13 09:26:00 -05:00
imsdb.py
iugu.py
joplin.py
json_loader.py community[minor]: Fix json._validate_metadata_func() (#22842) 2024-12-13 21:24:20 +00:00
kinetica_loader.py community[patch]: Kinetica Integrations handled error in querying; quotes in table names; updated gpudb API (#22724) 2024-06-11 10:01:26 -04:00
lakefs.py
larksuite.py community[minor]: Add LarkSuite wiki document loader. (#21016) 2024-04-29 10:37:50 -04:00
llmsherpa.py community[minor]: add support for llmsherpa (#19741) 2024-03-29 16:04:57 -07:00
markdown.py community: add init for unstructured file loader (#29101) 2025-01-13 09:26:00 -05:00
mastodon.py
max_compute.py
mediawikidump.py
merge.py
mhtml.py community[patch]: upgrade to recent version of mypy (#21616) 2024-05-13 14:55:07 -04:00
mintbase.py community[minor]: add mintbase loader to langchain (#20089) 2024-04-30 04:11:56 +00:00
modern_treasury.py
mongodb.py community: Enhance MongoDBLoader with flexible metadata and optimized field extraction (#23376) 2024-09-17 10:23:17 -04:00
needle.py community: add Needle retriever and document loader integration (#28157) 2024-12-03 22:06:25 +00:00
news.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
notebook.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
notion.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
notiondb.py community: Correctly handle multi-element rich text (#25762) 2024-12-16 20:20:27 +00:00
nuclia.py
obs_directory.py
obs_file.py
obsidian.py community[patch]: Add missing annotations (#24890) 2024-07-31 18:13:44 +00:00
odt.py community: add init for unstructured file loader (#29101) 2025-01-13 09:26:00 -05:00
onedrive_file.py multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
onedrive.py community: Allow other than default parsers in SharePointLoader and OneDriveLoader (#27716) 2024-11-06 17:44:34 -05:00
onenote.py community[patch]: Fix validation error in SettingsConfigDict across multiple Langchain modules (#26852) 2024-09-25 10:02:14 -04:00
open_city_data.py
oracleadb_loader.py community: Add support for clob datatype in oracle database (#27330) 2024-10-16 02:19:20 +00:00
oracleai.py community[minor]: Oraclevs integration (#21123) 2024-05-04 03:15:35 +00:00
org_mode.py community: add init for unstructured file loader (#29101) 2025-01-13 09:26:00 -05:00
pdf.py community: add init for unstructured file loader (#29101) 2025-01-13 09:26:00 -05:00
pebblo.py community[minor]: [Pebblo] Enhance PebbloSafeLoader to take anonymize flag (#26812) 2024-09-25 09:33:06 -04:00
polars_dataframe.py
powerpoint.py community: add init for unstructured file loader (#29101) 2025-01-13 09:26:00 -05:00
psychic.py multiple: Remove unnecessary Ruff suppression comments (#21050) 2024-04-30 17:13:48 +00:00
pubmed.py community[patch]: upgrade to recent version of mypy (#21616) 2024-05-13 14:55:07 -04:00
pyspark_dataframe.py
python.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
quip.py community[major]: lint for usage of xml library (#22132) 2024-05-24 15:23:53 +00:00
readthedocs.py
recursive_url_loader.py community[minor]: add proxy support to RecursiveUrlLoader (#27364) 2024-10-16 16:29:59 +00:00
reddit.py
roam.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
rocksetdb.py
rspace.py
rss.py multiple: Remove unnecessary Ruff suppression comments (#21050) 2024-04-30 17:13:48 +00:00
rst.py community: add init for unstructured file loader (#29101) 2025-01-13 09:26:00 -05:00
rtf.py community: add init for unstructured file loader (#29101) 2025-01-13 09:26:00 -05:00
s3_directory.py
s3_file.py community[patch]: support unstructured_kwargs for s3 loader (#15473) 2024-03-27 22:03:48 +00:00
scrapfly.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
scrapingant.py community[minor]: Add ScrapingAnt Loader Community Integration (#24514) 2024-07-24 21:11:43 -04:00
sharepoint.py community: Allow other than default parsers in SharePointLoader and OneDriveLoader (#27716) 2024-11-06 17:44:34 -05:00
sitemap.py community[patch]: SitemapLoader restrict depth of parsing sitemap (CVE-2024-2965) (#22903) 2024-06-14 13:04:40 -04:00
slack_directory.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
snowflake_loader.py community[patch]: upgrade to recent version of mypy (#21616) 2024-05-13 14:55:07 -04:00
spider.py doc list not empty (#21208) 2024-05-20 08:24:06 -07:00
spreedly.py
sql_database.py community[patch]: restore compatibility with SQLAlchemy 1.x (#22546) 2024-06-19 17:58:57 +00:00
srt.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
stripe.py
surrealdb.py
telegram.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
tencent_cos_directory.py
tencent_cos_file.py
tensorflow_datasets.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
text.py community: better support of pathlib paths in document loaders (#18396) 2024-03-26 11:51:52 -04:00
tidb.py
tomarkdown.py community[patch]: Update URL to the 2markdown API (#24546) 2024-07-23 14:27:55 +00:00
toml.py
trello.py
tsv.py community: add init for unstructured file loader (#29101) 2025-01-13 09:26:00 -05:00
twitter.py
unstructured.py multiple: update removal targets (#25361) 2024-08-14 09:50:39 -04:00
url_playwright.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
url_selenium.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
url.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
vsdx.py community[patch]: import flattening fix (#20110) 2024-04-10 13:01:19 -04:00
weather.py infra: update mypy 1.10, ruff 0.5 (#23721) 2024-07-03 10:33:27 -07:00
web_base.py community: Corrected aload func to be asynchronous from webBaseLoader (#28337) 2024-12-20 14:42:52 -05:00
whatsapp_chat.py
wikipedia.py community[patch]: upgrade to recent version of mypy (#21616) 2024-05-13 14:55:07 -04:00
word_document.py community: add init for unstructured file loader (#29101) 2025-01-13 09:26:00 -05:00
xml.py all: test 3.13 ci (#27197) 2024-10-25 12:56:58 -07:00
xorbits.py
youtube.py multiple: pydantic 2 compatibility, v0.3 (#26443) 2024-09-13 14:38:45 -07:00
yuque.py