Mirror of https://github.com/hwchase17/langchain.git, synced 2025-09-04 12:39:32 +00:00
community[minor]: added new document loaders based on dedoc library (#24303)
### Description

This pull request added new document loaders to load documents of various formats using [Dedoc](https://github.com/ispras/dedoc):

- `DedocFileLoader` (determines file types automatically and parses them)
- `DedocPDFLoader` (parses `PDF` files and images)
- `DedocAPIFileLoader` (determines file types automatically and parses them using the Dedoc API, without installing the library)

[Dedoc](https://dedoc.readthedocs.io) is an open-source library/service that extracts texts, tables, attached files and document structure (e.g., titles, list items, etc.) from files of various formats. The library is actively developed and maintained by a group of developers.

`Dedoc` supports `DOCX`, `XLSX`, `PPTX`, `EML`, `HTML`, `PDF`, images and more. The full list of supported formats can be found [here](https://dedoc.readthedocs.io/en/latest/#id1). For `PDF` documents, `Dedoc` can determine whether the textual layer is correct and split the document into paragraphs.

### Issue

This pull request extends the variety of document loaders supported by `langchain_community`, allowing users to choose the most suitable option for parsing raw documents.

### Dependencies

The PR added a new (optional) dependency `dedoc>=2.2.5` ([library documentation](https://dedoc.readthedocs.io)) to `extended_testing_deps.txt`.

### Twitter handle

None

### Add tests and docs

1. Test for the integration: `libs/community/tests/integration_tests/document_loaders/test_dedoc.py`
2. Example notebook: `docs/docs/integrations/document_loaders/dedoc.ipynb`
3. Information about the library: `docs/docs/integrations/providers/dedoc.mdx`

### Lint and test

Done locally:

- `make format`
- `make lint`
- `make integration_tests`
- `make docs_build` (from the project root)

---------

Co-authored-by: Nasty <bogatenkova.anastasiya@mail.ru>
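### Usage sketch

The following is a minimal sketch, not part of the commit itself, of how `DedocPDFLoader` might be called with the parameters named in its docstring. The helper names `build_pdf_loader_kwargs` and `load_pdf` are hypothetical and exist only for illustration. The option list and defaults are copied from the docstring in this diff, and the loader is assumed to accept `pdf_with_text_layer` and `language` as keyword arguments, as that docstring suggests.

```python
from typing import Any, Dict

# Options the DedocPDFLoader docstring lists for `pdf_with_text_layer`.
PDF_TEXT_LAYER_OPTIONS = ("true", "false", "tabby", "auto", "auto_tabby")


def build_pdf_loader_kwargs(
    file_path: str,
    pdf_with_text_layer: str = "auto_tabby",  # default per the docstring
    language: str = "rus+eng",  # default per the docstring
) -> Dict[str, Any]:
    """Hypothetical helper: validate and collect keyword arguments for the loader."""
    if pdf_with_text_layer not in PDF_TEXT_LAYER_OPTIONS:
        raise ValueError(
            f"pdf_with_text_layer must be one of {PDF_TEXT_LAYER_OPTIONS}, "
            f"got {pdf_with_text_layer!r}"
        )
    return {
        "file_path": file_path,
        "pdf_with_text_layer": pdf_with_text_layer,
        "language": language,
    }


def load_pdf(file_path: str, **options: Any) -> list:
    """Hypothetical wrapper: load a PDF through DedocPDFLoader.

    Assumes `dedoc` and `langchain-community` are installed.
    """
    # Imported lazily so the validation helper works without the optional dependency.
    from langchain_community.document_loaders import DedocPDFLoader

    loader = DedocPDFLoader(**build_pdf_loader_kwargs(file_path, **options))
    return loader.load()


# Example call (needs dedoc installed and a real file on disk):
# docs = load_pdf("example.pdf", pdf_with_text_layer="auto", language="eng")
```

Validating the option before constructing the loader is a design choice of this sketch: it surfaces a typo in `pdf_with_text_layer` before any parsing starts. The language list is not validated because, per the docstring, it can be extended.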
Committed by GitHub
Parent: 5ac936a284
Commit: 2a70a07aad
@@ -26,6 +26,7 @@ from langchain_core.utils import get_from_dict_or_env

from langchain_community.document_loaders.base import BaseLoader
from langchain_community.document_loaders.blob_loaders import Blob
from langchain_community.document_loaders.dedoc import DedocBaseLoader
from langchain_community.document_loaders.parsers.pdf import (
    AmazonTextractPDFParser,
    DocumentIntelligenceParser,
@@ -738,6 +739,104 @@ class AmazonTextractPDFLoader(BasePDFLoader):
        raise ValueError(f"unsupported mime type: {blob.mimetype}")  # type: ignore[attr-defined]


class DedocPDFLoader(DedocBaseLoader):
    """
    DedocPDFLoader document loader integration to load PDF files using `dedoc`.
    The file loader can automatically detect the correctness of a textual layer in the
    PDF document.
    Note that `__init__` method supports parameters that differ from ones of
    DedocBaseLoader.

    Setup:
        Install ``dedoc`` package.

        .. code-block:: bash

            pip install -U dedoc

    Instantiate:
        .. code-block:: python

            from langchain_community.document_loaders import DedocPDFLoader

            loader = DedocPDFLoader(
                file_path="example.pdf",
                # split=...,
                # with_tables=...,
                # pdf_with_text_layer=...,
                # pages=...,
                # ...
            )

    Load:
        .. code-block:: python

            docs = loader.load()
            print(docs[0].page_content[:100])
            print(docs[0].metadata)

        .. code-block:: python

            Some text
            {
                'file_name': 'example.pdf',
                'file_type': 'application/pdf',
                # ...
            }

    Lazy load:
        .. code-block:: python

            docs = []
            docs_lazy = loader.lazy_load()

            for doc in docs_lazy:
                docs.append(doc)
            print(docs[0].page_content[:100])
            print(docs[0].metadata)

        .. code-block:: python

            Some text
            {
                'file_name': 'example.pdf',
                'file_type': 'application/pdf',
                # ...
            }

    Parameters used for document parsing via `dedoc`
        (https://dedoc.readthedocs.io/en/latest/parameters/pdf_handling.html):

        with_attachments: enable attached files extraction
        recursion_deep_attachments: recursion level for attached files extraction,
            works only when with_attachments==True
        pdf_with_text_layer: type of handler for parsing, available options
            ["true", "false", "tabby", "auto", "auto_tabby" (default)]
        language: language of the document for PDF without a textual layer,
            available options ["eng", "rus", "rus+eng" (default)], the list of
            languages can be extended, please see
            https://dedoc.readthedocs.io/en/latest/tutorials/add_new_language.html
        pages: page slice to define the reading range for parsing
        is_one_column_document: detect number of columns for PDF without a textual
            layer, available options ["true", "false", "auto" (default)]
        document_orientation: fix document orientation (90, 180, 270 degrees) for PDF
            without a textual layer, available options ["auto" (default), "no_change"]
        need_header_footer_analysis: remove headers and footers from the output result
        need_binarization: clean pages background (binarize) for PDF without a textual
            layer
        need_pdf_table_analysis: parse tables for PDF without a textual layer
    """

    def _make_config(self) -> dict:
        from dedoc.utils.langchain import make_manager_pdf_config

        return make_manager_pdf_config(
            file_path=self.file_path,
            parsing_params=self.parsing_parameters,
            split=self.split,
        )


class DocumentIntelligenceLoader(BasePDFLoader):
    """Load a PDF with Azure Document Intelligence"""
