langchain

mirror of https://github.com/hwchase17/langchain.git synced 2025-08-23 19:41:54 +00:00

Nadeem Sajjad eaf2fb287f community(pypdfloader): added page_label in metadata for pypdf loader (#29225)	2025-01-15 14:18:07 -05:00

# Description

## Summary

This PR adds support for handling multi-labeled page numbers in the PyPDFLoader. Some PDFs use complex page numbering systems where the actual content may begin after multiple introductory pages. The page_label field helps accurately reflect the document’s page structure, making it easier to handle such cases during document parsing.

## Motivation

This feature improves document parsing accuracy by allowing users to access the actual page labels instead of relying only on the physical page numbers. This is particularly useful for documents where the first few pages have roman numerals or other non-standard page labels.

## Use Case

This feature is especially useful for Retrieval-Augmented Generation (RAG) systems where users may reference page numbers when asking questions. Some PDFs have both labeled page numbers (like roman numerals for introductory sections) and index-based page numbers.

For example, a user might ask: "What is mentioned on page 5?"

The system can now check both:

- Index-based page number (page)
- Labeled page number (page_label)

This dual-check helps improve retrieval accuracy. Additionally, the results can be validated with an agent or tool to ensure the retrieved pages match the user’s query contextually.

## Code Changes

- Added a page_label field to the metadata of the Document class in PyPDFLoader.
- Implemented support for retrieving page_label from the pdf_reader.page_labels.
- Created a test case (test_pypdf_loader_with_multi_label_page_numbers) with a sample PDF containing multi-labeled pages (geotopo-komprimiert.pdf) [[Source of pdf](https://github.com/py-pdf/sample-files/blob/main/009-pdflatex-geotopo/GeoTopo-komprimiert.pdf)].
- Updated existing tests to ensure compatibility and verify page_label extraction.

## Tests Added

- Added a new test case for a PDF with multi-labeled pages.
- Verified both page and page_label metadata fields are correctly extracted.

## Screenshots

<img width="549" alt="image" src="https://github.com/user-attachments/assets/65db9f5c-032e-4592-926f-824777c28f33" />
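
The dual check described in the use case can be sketched roughly as follows. This is a minimal illustration and not the loader's implementation: it assumes each document's metadata is a plain dict with an integer `page` index (treated here as zero-based) and a string `page_label`, as the PR describes. The helper name `find_pages` and the sample metadata are hypothetical.

```python
from typing import Any


def find_pages(query: str, metadatas: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return metadata entries whose label or physical position matches `query`."""
    query = query.strip()
    matches = []
    for meta in metadatas:
        # Labeled page number, e.g. "iv" or "5", compared as a string.
        label_hit = str(meta.get("page_label", "")) == query
        # Index-based page number; ASSUMED zero-based, so physical page = index + 1.
        index_hit = query.isdigit() and meta.get("page") == int(query) - 1
        if label_hit or index_hit:
            matches.append(meta)
    return matches


# Hypothetical metadata: four front-matter pages labeled i-iv, then body pages from "1".
sample = [
    {"page": i, "page_label": label}
    for i, label in enumerate(["i", "ii", "iii", "iv", "1", "2", "3", "4", "5"])
]

# "page 5" is ambiguous: the fifth physical page and the page labeled "5" both match.
hits = find_pages("5", sample)
```

Returning both candidates, rather than picking one, leaves room for the contextual validation step the PR mentions (for example, an agent or tool choosing between them).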
..
adapters	community: Add the additonnal kward 'context' for openai (#28351 )	2024-12-02 16:43:30 -05:00
agent_toolkits	community[patch]: fix instantiation for Slack tools (#28990 )	2025-01-02 16:14:17 +00:00
agents	community: add truncation params when an openai assistant's run is created (#28158 )	2024-11-27 10:53:53 -05:00
callbacks	community[patch]: additional check for prompt caching support (#29008 )	2025-01-03 10:14:07 -05:00
chains	community: Deprecate Amazon Neptune resources in langchain-community (#29191 )	2025-01-14 10:23:34 -05:00
chat_loaders
chat_message_histories	Fixed adding float values into DynamoDB (#26562 )	2024-12-18 13:45:00 -05:00
chat_models	fix chatperplexity: remove 'stream' from params in _stream method (#29173 )	2025-01-13 09:31:37 -05:00
cross_encoders
docstore
document_compressors	community: Fix rank-llm import paths for new 0.20.3 version (#29154 )	2025-01-13 10:22:14 -05:00
document_loaders	community(pypdfloader): added page_label in metadata for pypdf loader (#29225 )	2025-01-15 14:18:07 -05:00
document_transformers	community: fallback on core async atransform_documents method for `MarkdownifyTransformer` (#27866 )	2024-12-13 22:32:22 +00:00
embeddings	partner: Update Upstage Model Names and Remove Deprecated Model (#29093 )	2025-01-08 10:13:22 -05:00
example_selectors
graph_vectorstores	core: add kwargs support to VectorStore (#25934 )	2024-12-16 18:57:57 +00:00
graphs	community: Deprecate Amazon Neptune resources in langchain-community (#29191 )	2025-01-14 10:23:34 -05:00
indexes
llms	[langchain_community.llms.xinference]: fix error in xinference.py (#29216 )	2025-01-15 10:11:26 -05:00
memory	all: test 3.13 ci (#27197 )	2024-10-25 12:56:58 -07:00
output_parsers
query_constructors	community: refactor opensearch query constructor to use wildcard instead of match in the contain comparator (#26653 )	2024-12-16 11:16:34 -05:00
retrievers	langchain_community: Add default None values to DocumentAttributeValue class properties (#28785 )	2024-12-18 09:43:04 -05:00
storage	Fixed typo in llibs/community/langchain_community/storage/sql.py (#27029 )	2024-10-08 17:51:26 +00:00
tools	community[patch]: fix instantiation for Slack tools (#28990 )	2025-01-02 16:14:17 +00:00
utilities	[fix] Convert table names to list for compatibility in SQLDatabase (#29229 )	2025-01-15 10:00:03 -05:00
utils
vectorstores	community(azuresearch): allow to use any valid credential (#28873 )	2024-12-23 10:05:48 -05:00
__init__.py
cache.py	cosmosdbnosql: Added Cosmos DB NoSQL Semantic Cache Integration with tests and jupyter notebook (#24424 )	2024-12-16 21:57:05 -05:00
py.typed