langchain

mirror of https://github.com/hwchase17/langchain.git synced 2025-08-23 11:32:10 +00:00

Nadeem Sajjad eaf2fb287f community(pypdfloader): added page_label in metadata for pypdf loader (#29225)

# Description

## Summary

This PR adds support for handling multi-labeled page numbers in the PyPDFLoader. Some PDFs use complex page numbering systems where the actual content may begin after multiple introductory pages. The page_label field helps accurately reflect the document’s page structure, making it easier to handle such cases during document parsing.

## Motivation

This feature improves document parsing accuracy by allowing users to access the actual page labels instead of relying only on the physical page numbers. This is particularly useful for documents where the first few pages have roman numerals or other non-standard page labels.

## Use Case

This feature is especially useful for Retrieval-Augmented Generation (RAG) systems where users may reference page numbers when asking questions. Some PDFs have both labeled page numbers (like roman numerals for introductory sections) and index-based page numbers.

For example, a user might ask: "What is mentioned on page 5?" The system can now check both:

- Index-based page number (page)
- Labeled page number (page_label)

This dual-check helps improve retrieval accuracy. Additionally, the results can be validated with an agent or tool to ensure the retrieved pages match the user’s query contextually.

## Code Changes

- Added a page_label field to the metadata of the Document class in PyPDFLoader.
- Implemented support for retrieving page_label from the pdf_reader.page_labels.
- Created a test case (test_pypdf_loader_with_multi_label_page_numbers) with a sample PDF containing multi-labeled pages (geotopo-komprimiert.pdf) [[Source of pdf](https://github.com/py-pdf/sample-files/blob/main/009-pdflatex-geotopo/GeoTopo-komprimiert.pdf)].
- Updated existing tests to ensure compatibility and verify page_label extraction.

## Tests Added

- Added a new test case for a PDF with multi-labeled pages.
- Verified both page and page_label metadata fields are correctly extracted.

## Screenshots

<img width="549" alt="image" src="https://github.com/user-attachments/assets/65db9f5c-032e-4592-926f-824777c28f33" />

2025-01-15 14:18:07 -05:00
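The dual check described under "Use Case" in the commit message above could look roughly like the sketch below. It is an illustration only, not code from the PR: `Doc` and `find_pages` are hypothetical names, `Doc` is a plain stand-in rather than the LangChain Document class, and the sketch assumes that `page` is a zero-based index while a user who says "page 2" means the second physical page.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for one loaded PDF page (not the LangChain class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def find_pages(docs, requested):
    """Return docs whose index-based page OR page label matches `requested`."""
    wanted = str(requested).strip()
    matches = []
    for doc in docs:
        # Assumption: metadata["page"] is zero-based, the user's number is one-based.
        index_hit = wanted.isdigit() and doc.metadata.get("page") == int(wanted) - 1
        # The label is compared as a string, so "ii" and "2" are both valid queries.
        label_hit = doc.metadata.get("page_label") == wanted
        if index_hit or label_hit:
            matches.append(doc)
    return matches


# Sample metadata shaped like the two fields named in the commit message.
docs = [
    Doc("Preface", {"page": 0, "page_label": "i"}),
    Doc("Contents", {"page": 1, "page_label": "ii"}),
    Doc("Chapter 1", {"page": 2, "page_label": "1"}),
    Doc("Chapter 1, continued", {"page": 3, "page_label": "2"}),
]

print([d.page_content for d in find_pages(docs, 2)])
print([d.page_content for d in find_pages(docs, "ii")])
```

A numeric query can match two different pages here (the second physical page and the page labeled "2"), which is why the commit message suggests validating the candidates against the query context afterwards.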
agent_toolkits
agents	community: add truncation params when an openai assistant's run is created (#28158)	2024-11-27 10:53:53 -05:00
callbacks	Community : Add OpenAI prompt caching and reasoning tokens tracking (#27135)	2024-12-19 09:31:13 -05:00
chains	core: add kwargs support to VectorStore (#25934)	2024-12-16 18:57:57 +00:00
chat_loaders
chat_message_histories	community[patch]: update dynamodb chat history to update instead of overwrite (#22397)	2024-12-16 10:38:00 -05:00
chat_models	community: add new parameter default_headers (#28700)	2024-12-18 22:33:23 +00:00
cross_encoders
data	community: Resolve refs recursively when generating openai_fn from OpenAPI spec (#19002)	2024-09-02 13:17:39 -07:00
docstore
document_compressors	community: add InfinityRerank (#27043)	2024-11-06 17:26:30 -08:00
document_loaders	community(pypdfloader): added page_label in metadata for pypdf loader (#29225)	2025-01-15 14:18:07 -05:00
document_transformers	community: fallback on core async atransform_documents method for `MarkdownifyTransformer` (#27866)	2024-12-13 22:32:22 +00:00
embeddings	Community: LlamaCppEmbeddings `embed_documents` and `embed_query` (#28827)	2024-12-23 09:50:22 -05:00
evaluation
examples
graph_vectorstores	[community]: Render documents to graphviz (#24830)	2024-12-14 02:02:09 +00:00
graphs	community: Apache AGE wrapper additional edge cases. (#28151)	2024-12-16 11:28:01 -05:00
imports
indexes
llms	community: update documentation and model IDs for FriendliAI provider (#28984)	2025-01-02 12:15:59 -05:00
load	multiple: pydantic 2 compatibility, v0.3 (#26443)	2024-09-13 14:38:45 -07:00
query_constructors	langchain_community: updated query constructor for Databricks Vector Search due to LangChainDeprecationWarning: `filters` was deprecated since langchain-community 0.2.11 and will be removed in 0.3. Please use `filter` instead. (#27974)	2024-12-03 16:03:53 -08:00
retrievers	community: BM25Retriever preservation of document id (#27019)	2024-12-04 00:36:00 +00:00
storage
tools	community: Add configurable `VisualFeatures` to the `AzureAiServicesImageAnalysisTool` (#27444)	2024-12-16 18:30:04 +00:00
utilities	Community: Google Books API Tool (#27307)	2024-11-07 15:29:35 -08:00
utils	partners: Use simsimd types (#25299)	2024-08-23 10:41:39 -04:00
vectorstores	community: added FalkorDB vector store support i.e implementation, test, docs an… (#26245)	2024-12-16 19:37:55 +00:00
__init__.py
conftest.py
test_cache.py
test_dependencies.py	infra: speed up unit tests (#28974)	2025-01-02 04:13:08 +00:00
test_document_transformers.py
test_graph_vectorstores.py	core, community: move graph vectorstores to community (#26678)	2024-09-19 11:38:14 -07:00
test_imports.py
test_sql_database_schema.py
test_sql_database.py
test_sqlalchemy.py