mirror of
https://github.com/hwchase17/langchain.git
synced 2026-02-21 06:33:41 +00:00
### Description This pull request added new document loaders to load documents of various formats using [Dedoc](https://github.com/ispras/dedoc): - `DedocFileLoader` (determine file types automatically and parse) - `DedocPDFLoader` (for `PDF` and images parsing) - `DedocAPIFileLoader` (determine file types automatically and parse using Dedoc API without library installation) [Dedoc](https://dedoc.readthedocs.io) is an open-source library/service that extracts texts, tables, attached files and document structure (e.g., titles, list items, etc.) from files of various formats. The library is actively developed and maintained by a group of developers. `Dedoc` supports `DOCX`, `XLSX`, `PPTX`, `EML`, `HTML`, `PDF`, images and more. Full list of supported formats can be found [here](https://dedoc.readthedocs.io/en/latest/#id1). For `PDF` documents, `Dedoc` allows to determine textual layer correctness and split the document into paragraphs. ### Issue This pull request extends variety of document loaders supported by `langchain_community` allowing users to choose the most suitable option for raw documents parsing. ### Dependencies The PR added a new (optional) dependency `dedoc>=2.2.5` ([library documentation](https://dedoc.readthedocs.io)) to the `extended_testing_deps.txt` ### Twitter handle None ### Add tests and docs 1. Test for the integration: `libs/community/tests/integration_tests/document_loaders/test_dedoc.py` 2. Example notebook: `docs/docs/integrations/document_loaders/dedoc.ipynb` 3. Information about the library: `docs/docs/integrations/providers/dedoc.mdx` ### Lint and test Done locally: - `make format` - `make lint` - `make integration_tests` - `make docs_build` (from the project root) --------- Co-authored-by: Nasty <bogatenkova.anastasiya@mail.ru>
485 lines
17 KiB
Plaintext
485 lines
17 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "6b74f73d-1763-42d0-9c24-8f65f445bb72",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Dedoc\n",
|
|
"\n",
|
|
"This sample demonstrates the use of `Dedoc` in combination with `LangChain` as a `DocumentLoader`.\n",
|
|
"\n",
|
|
"## Overview\n",
|
|
"\n",
|
|
"[Dedoc](https://dedoc.readthedocs.io) is an [open-source](https://github.com/ispras/dedoc)\n",
|
|
"library/service that extracts texts, tables, attached files and document structure\n",
|
|
"(e.g., titles, list items, etc.) from files of various formats.\n",
|
|
"\n",
|
|
"`Dedoc` supports `DOCX`, `XLSX`, `PPTX`, `EML`, `HTML`, `PDF`, images and more.\n",
|
|
"Full list of supported formats can be found [here](https://dedoc.readthedocs.io/en/latest/#id1).\n",
|
|
"\n",
|
|
"\n",
|
|
"### Integration details\n",
|
|
"\n",
|
|
"| Class | Package | Local | Serializable | JS support |\n",
|
|
"|:-----------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------|:-----:|:------------:|:----------:|\n",
|
|
"| [DedocFileLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.dedoc.DedocFileLoader.html) | [langchain_community](https://api.python.langchain.com/en/latest/community_api_reference.html) | ❌ | beta | ❌ |\n",
|
|
"| [DedocPDFLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.DedocPDFLoader.html) | [langchain_community](https://api.python.langchain.com/en/latest/community_api_reference.html) | ❌ | beta | ❌ | \n",
|
|
"| [DedocAPIFileLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.dedoc.DedocAPIFileLoader.html) | [langchain_community](https://api.python.langchain.com/en/latest/community_api_reference.html) | ❌ | beta | ❌ | \n",
|
|
"\n",
|
|
"\n",
|
|
"### Loader features\n",
|
|
"\n",
|
|
"Methods for lazy loading and async loading are available, but in fact, document loading is executed synchronously.\n",
|
|
"\n",
|
|
"| Source | Document Lazy Loading | Async Support |\n",
|
|
"|:------------------:|:---------------------:|:-------------:| \n",
|
|
"| DedocFileLoader | ❌ | ❌ |\n",
|
|
"| DedocPDFLoader | ❌ | ❌ | \n",
|
|
"| DedocAPIFileLoader | ❌ | ❌ | \n",
|
|
"\n",
|
|
"## Setup\n",
|
|
"\n",
|
|
"* To access `DedocFileLoader` and `DedocPDFLoader` document loaders, you'll need to install the `dedoc` integration package.\n",
|
|
"* To access `DedocAPIFileLoader`, you'll need to run the `Dedoc` service, e.g. `Docker` container (please see [the documentation](https://dedoc.readthedocs.io/en/latest/getting_started/installation.html#install-and-run-dedoc-using-docker) \n",
|
|
"for more details):\n",
|
|
"\n",
|
|
"```bash\n",
|
|
"docker pull dedocproject/dedoc\n",
|
|
"docker run -p 1231:1231\n",
|
|
"```\n",
|
|
"\n",
|
|
"`Dedoc` installation instruction is given [here](https://dedoc.readthedocs.io/en/latest/getting_started/installation.html)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"id": "511c109d-a5c3-42ba-914e-5d1b385bc40f",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Note: you may need to restart the kernel to use updated packages.\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Install package\n",
|
|
"%pip install --quiet \"dedoc[torch]\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "6820c0e9-d56d-4899-b8c8-374760360e2b",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Instantiation"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"id": "c1f98cae-71ec-4d60-87fb-96c1a76851d8",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from langchain_community.document_loaders import DedocFileLoader\n",
|
|
"\n",
|
|
"loader = DedocFileLoader(\"./example_data/state_of_the_union.txt\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "5d7bc2b3-73a0-4cd6-8014-cc7184aa9d4a",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Load"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"id": "b9097c14-6168-4726-819e-24abb9a63b13",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"'\\nMadam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and t'"
|
|
]
|
|
},
|
|
"execution_count": 3,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"docs = loader.load()\n",
|
|
"docs[0].page_content[:100]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "9ed8bd46-0047-4ccc-b2d6-beb7761f7312",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Lazy Load"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"id": "6ae12d7e-8105-4bbe-9031-0e968475f6bf",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\n",
|
|
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and t\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"docs = loader.lazy_load()\n",
|
|
"\n",
|
|
"for doc in docs:\n",
|
|
" print(doc.page_content[:100])\n",
|
|
" break"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "8772ae40-6239-4751-bb2d-b4a9415c1ad1",
|
|
"metadata": {},
|
|
"source": [
|
|
"## API reference\n",
|
|
"\n",
|
|
"For detailed information on configuring and calling `Dedoc` loaders, please see the API references: \n",
|
|
"\n",
|
|
"* https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.dedoc.DedocFileLoader.html\n",
|
|
"* https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.DedocPDFLoader.html\n",
|
|
"* https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.dedoc.DedocAPIFileLoader.html"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "c4d5e702-0e21-4cad-a4c3-b9b3bff77203",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Loading any file\n",
|
|
"\n",
|
|
"For automatic handling of any file in a [supported format](https://dedoc.readthedocs.io/en/latest/#id1),\n",
|
|
"`DedocFileLoader` can be useful.\n",
|
|
"The file loader automatically detects the file type with a correct extension.\n",
|
|
"\n",
|
|
"File parsing process can be configured through `dedoc_kwargs` during the `DedocFileLoader` class initialization.\n",
|
|
"Here the basic examples of some options usage are given, \n",
|
|
"please see the documentation of `DedocFileLoader` and \n",
|
|
"[dedoc documentation](https://dedoc.readthedocs.io/en/latest/parameters/parameters.html) \n",
|
|
"to get more details about configuration parameters."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "de97d0ed-d6b1-44e0-b392-1f3d89c762f9",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Basic example"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"id": "50ffeeee-db12-4801-b208-7e32ea3d72ad",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"'\\nMadam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\n\\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\n\\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\n\\n\\nWith a duty to one another to the American people to '"
|
|
]
|
|
},
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"from langchain_community.document_loaders import DedocFileLoader\n",
|
|
"\n",
|
|
"loader = DedocFileLoader(\"./example_data/state_of_the_union.txt\")\n",
|
|
"\n",
|
|
"docs = loader.load()\n",
|
|
"\n",
|
|
"docs[0].page_content[:400]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "457e5d4c-a4ee-4f31-ae74-3f75a1bbd0af",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Modes of split\n",
|
|
"\n",
|
|
"`DedocFileLoader` supports different types of document splitting into parts (each part is returned separately).\n",
|
|
"For this purpose, `split` parameter is used with the following options:\n",
|
|
"* `document` (default value): document text is returned as a single langchain `Document` object (don't split);\n",
|
|
"* `page`: split document text into pages (works for `PDF`, `DJVU`, `PPTX`, `PPT`, `ODP`);\n",
|
|
"* `node`: split document text into `Dedoc` tree nodes (title nodes, list item nodes, raw text nodes);\n",
|
|
"* `line`: split document text into textual lines."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"id": "eec54d31-ae7a-4a3c-aa10-4ae276b1e4c4",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"2"
|
|
]
|
|
},
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"loader = DedocFileLoader(\n",
|
|
" \"./example_data/layout-parser-paper.pdf\",\n",
|
|
" split=\"page\",\n",
|
|
" pages=\":2\",\n",
|
|
")\n",
|
|
"\n",
|
|
"docs = loader.load()\n",
|
|
"\n",
|
|
"len(docs)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "61e11769-4780-4f77-b10e-27db6936f226",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Handling tables\n",
|
|
"\n",
|
|
"`DedocFileLoader` supports tables handling when `with_tables` parameter is \n",
|
|
"set to `True` during loader initialization (`with_tables=True` by default). \n",
|
|
"\n",
|
|
"Tables are not split - each table corresponds to one langchain `Document` object.\n",
|
|
"For tables, `Document` object has additional `metadata` fields `type=\"table\"` \n",
|
|
"and `text_as_html` with table `HTML` representation."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"id": "bbeb2f8a-ac5e-4b59-8026-7ea3fc14c928",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"('table',\n",
|
|
" '<table border=\"1\" style=\"border-collapse: collapse; width: 100%;\">\\n<tbody>\\n<tr>\\n<td colspan=\"1\" rowspan=\"1\">Team</td>\\n<td colspan=\"1\" rowspan=\"1\"> "Payroll (millions)"</td>\\n<td colspan=\"1\" r')"
|
|
]
|
|
},
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"loader = DedocFileLoader(\"./example_data/mlb_teams_2012.csv\")\n",
|
|
"\n",
|
|
"docs = loader.load()\n",
|
|
"\n",
|
|
"docs[1].metadata[\"type\"], docs[1].metadata[\"text_as_html\"][:200]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "b4a2b872-2aba-4e4c-8b2f-83a5a81ee1da",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Handling attached files\n",
|
|
"\n",
|
|
"`DedocFileLoader` supports attached files handling when `with_attachments` is set \n",
|
|
"to `True` during loader initialization (`with_attachments=False` by default). \n",
|
|
"\n",
|
|
"Attachments are split according to the `split` parameter.\n",
|
|
"For attachments, langchain `Document` object has an additional metadata \n",
|
|
"field `type=\"attachment\"`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"id": "bb9d6c1c-e24c-4979-88a0-38d54abd6332",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"('attachment',\n",
|
|
" '\\nContent-Type\\nmultipart/mixed; boundary=\"0000000000005d654405f082adb7\"\\nDate\\nFri, 23 Dec 2022 12:08:48 -0600\\nFrom\\nMallori Harrell <mallori@unstructured.io>\\nMIME-Version\\n1.0\\nMessage-ID\\n<CAPgNNXSzLVJ-d1OCX_TjFgJU7ugtQrjFybPtAMmmYZzphxNFYg@mail.gmail.com>\\nSubject\\nFake email with attachment\\nTo\\nMallori Harrell <mallori@unstructured.io>')"
|
|
]
|
|
},
|
|
"execution_count": 8,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"loader = DedocFileLoader(\n",
|
|
" \"./example_data/fake-email-attachment.eml\",\n",
|
|
" with_attachments=True,\n",
|
|
")\n",
|
|
"\n",
|
|
"docs = loader.load()\n",
|
|
"\n",
|
|
"docs[1].metadata[\"type\"], docs[1].page_content"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "d435c3f6-703a-4064-8307-ace140de967a",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Loading PDF file\n",
|
|
"\n",
|
|
"If you want to handle only `PDF` documents, you can use `DedocPDFLoader` with only `PDF` support.\n",
|
|
"The loader supports the same parameters for document split, tables and attachments extraction.\n",
|
|
"\n",
|
|
"`Dedoc` can extract `PDF` with or without a textual layer, \n",
|
|
"as well as automatically detect its presence and correctness.\n",
|
|
"Several `PDF` handlers are available, you can use `pdf_with_text_layer` \n",
|
|
"parameter to choose one of them.\n",
|
|
"Please see [parameters description](https://dedoc.readthedocs.io/en/latest/parameters/pdf_handling.html) \n",
|
|
"to get more details.\n",
|
|
"\n",
|
|
"For `PDF` without a textual layer, `Tesseract OCR` and its language packages should be installed.\n",
|
|
"In this case, [the instruction](https://dedoc.readthedocs.io/en/latest/tutorials/add_new_language.html) can be useful."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"id": "0103a7f3-6b5e-4444-8f4d-83dd3724a9af",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"'\\n2\\n\\nZ. Shen et al.\\n\\n37], layout detection [38, 22], table detection [26], and scene text detection [4].\\n\\nA generalized learning-based framework dramatically reduces the need for the\\n\\nmanual specification of complicated rules, which is the status quo with traditional\\n\\nmethods. DL has the potential to transform DIA pipelines and benefit a broad\\n\\nspectrum of large-scale document digitization projects.\\n'"
|
|
]
|
|
},
|
|
"execution_count": 9,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"from langchain_community.document_loaders import DedocPDFLoader\n",
|
|
"\n",
|
|
"loader = DedocPDFLoader(\n",
|
|
" \"./example_data/layout-parser-paper.pdf\", pdf_with_text_layer=\"true\", pages=\"2:2\"\n",
|
|
")\n",
|
|
"\n",
|
|
"docs = loader.load()\n",
|
|
"\n",
|
|
"docs[0].page_content[:400]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "13061995-1805-40c2-a77a-a6cd80999e20",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Dedoc API\n",
|
|
"\n",
|
|
"If you want to get up and running with less set up, you can use `Dedoc` as a service.\n",
|
|
"**`DedocAPIFileLoader` can be used without installation of `dedoc` library.**\n",
|
|
"The loader supports the same parameters as `DedocFileLoader` and\n",
|
|
"also automatically detects input file types.\n",
|
|
"\n",
|
|
"To use `DedocAPIFileLoader`, you should run the `Dedoc` service, e.g. `Docker` container (please see [the documentation](https://dedoc.readthedocs.io/en/latest/getting_started/installation.html#install-and-run-dedoc-using-docker) \n",
|
|
"for more details):\n",
|
|
"\n",
|
|
"```bash\n",
|
|
"docker pull dedocproject/dedoc\n",
|
|
"docker run -p 1231:1231\n",
|
|
"```\n",
|
|
"\n",
|
|
"Please do not use our demo URL `https://dedoc-readme.hf.space` in your code."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 10,
|
|
"id": "211fc0b5-6080-4974-a6c1-f982bafd87d6",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"'\\nMadam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\n\\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\n\\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\n\\n\\nWith a duty to one another to the American people to '"
|
|
]
|
|
},
|
|
"execution_count": 10,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"from langchain_community.document_loaders import DedocAPIFileLoader\n",
|
|
"\n",
|
|
"loader = DedocAPIFileLoader(\n",
|
|
" \"./example_data/state_of_the_union.txt\",\n",
|
|
" url=\"https://dedoc-readme.hf.space\",\n",
|
|
")\n",
|
|
"\n",
|
|
"docs = loader.load()\n",
|
|
"\n",
|
|
"docs[0].page_content[:400]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "faaff475-5209-436f-bcde-97d58daed05c",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.9.19"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|