unstructured, community, initialize langchain-unstructured package (#22779)

#### Update (2): 
A single `UnstructuredLoader` is added to handle both local and API
partitioning. This loader also handles single or multiple documents.
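A minimal sketch of the two modes, using the `partition_via_api` and `api_key`
parameters documented in the notebook below:

```python
from langchain_unstructured import UnstructuredLoader

# Local partitioning (requires the heavier local dependencies)
local_docs = UnstructuredLoader("./example_data/fake.docx").load()

# Remote partitioning against the hosted Unstructured API,
# for one file or a batch of files
api_docs = UnstructuredLoader(
    file_path=["./example_data/fake.docx", "./example_data/fake-email.eml"],
    api_key="YOUR_UNSTRUCTURED_API_KEY",  # placeholder key
    partition_via_api=True,
).load()
```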

#### Changes in `community`:
Changes here do not affect users. The loaders in `community` were refactored
during the initial work of adopting the SDK for the API loaders. Other changes
include:
- `UnstructuredBaseLoader` has a new check for the conflicting combination of
`mode="paged"` and `chunking_strategy="by_page"`, and it now adds
`Element.element_id` to the `Document.metadata`.
- `UnstructuredAPIFileLoader` and `UnstructuredAPIFileIOLoader` now both
inherit directly from `UnstructuredBaseLoader`, initialize their
`file_path`/`file` attributes respectively, and implement their own
`_post_process_elements` methods.
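A rough sketch of the refactored shape in `community` (the class and method
names come from this PR; the exact error behavior of the mode/chunking check
is an assumption):

```python
from typing import Any, Callable, List, Optional


class UnstructuredBaseLoader:
    """Sketch only; the real base class lives in langchain_community."""

    def __init__(
        self,
        mode: str = "single",  # deprecated
        post_processors: Optional[List[Callable[[str], str]]] = None,
        **unstructured_kwargs: Any,
    ) -> None:
        # New guard: mode="paged" and chunking_strategy="by_page" both group
        # text by page, so setting both is flagged (raising here is assumed).
        self._check_if_both_mode_and_chunking_strategy_are_by_page(
            mode, unstructured_kwargs
        )
        self.mode = mode
        self.unstructured_kwargs = unstructured_kwargs
        self.post_processors = post_processors or []

    @staticmethod
    def _check_if_both_mode_and_chunking_strategy_are_by_page(
        mode: str, unstructured_kwargs: dict
    ) -> None:
        if (
            mode == "paged"
            and unstructured_kwargs.get("chunking_strategy") == "by_page"
        ):
            raise ValueError(
                'Only one of mode="paged" and chunking_strategy="by_page" '
                "should be set."
            )


class UnstructuredAPIFileLoader(UnstructuredBaseLoader):
    """Now inherits directly from the base and owns its file_path attribute."""

    def __init__(self, file_path: str, **kwargs: Any) -> None:
        super().__init__(**kwargs)
        self.file_path = file_path

    def _post_process_elements(self, elements: List[Any]) -> List[Any]:
        # Each loader applies its own str -> str post-processors.
        for element in elements:
            for post_processor in self.post_processors:
                element.apply(post_processor)
        return elements
```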

--------
#### Update:
New SDK Loaders in a [partner
package](https://python.langchain.com/v0.1/docs/contributing/integrations/#partner-package-in-langchain-repo)
are introduced to prevent breaking changes for users (see discussion
below).
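For context, the user-facing import moves from the community loaders to the
new partner package (a sketch; both import paths appear elsewhere in this PR):

```python
# Before: separate loaders in langchain-community
from langchain_community.document_loaders import (
    UnstructuredAPIFileIOLoader,
    UnstructuredAPIFileLoader,
    UnstructuredFileLoader,
)

# After: a single loader from the partner package
from langchain_unstructured import UnstructuredLoader
```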

##### TODO:
- [x] Test docstring examples
--------
- **Description:** `UnstructuredAPIFileIOLoader` and
`UnstructuredAPIFileLoader` calls to the Unstructured API are now made using
the `unstructured-client` SDK.
- **New Dependencies:** `unstructured-client`
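For reference, a rough sketch of a direct `unstructured-client` call; treat
the request and response field names here as assumptions, since they vary
across SDK versions:

```python
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared

client = UnstructuredClient(api_key_auth="YOUR_UNSTRUCTURED_API_KEY")

with open("example_data/fake.docx", "rb") as f:
    req = shared.PartitionParameters(
        files=shared.Files(content=f.read(), file_name="example_data/fake.docx"),
        strategy="fast",
    )

# The API returns JSON element dicts rather than unstructured Element objects.
resp = client.general.partition(req)
for element in resp.elements or []:
    print(element["type"], element["element_id"], element["text"][:60])
```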

- [x] **Add tests and docs**: If you're adding a new integration, please
include
- [x] a test for the integration, preferably unit tests that do not rely
on network access,
- [x] update the description in
`docs/docs/integrations/providers/unstructured.mdx`
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.

TODO:
- [x] Update
https://python.langchain.com/v0.1/docs/integrations/document_loaders/unstructured_file/#unstructured-api
-
`langchain/docs/docs/integrations/document_loaders/unstructured_file.ipynb`
- The description here needs to indicate that users should install
`unstructured-client` instead of `unstructured`. Read over closely to
look for any other changes that need to be made.
- [x] Update the `lazy_load` method in `UnstructuredBaseLoader` to
handle JSON responses from the API instead of just lists of elements (a rough
sketch of this mapping appears after this list).
- This method may need to be overridden by the API loaders instead of
changing it in the `UnstructuredBaseLoader`.
- [x] Update the documentation links in the class docstrings (the
Unstructured docs have moved)
- [x] Update Document.metadata to include `element_id` (see thread
[here](https://unstructuredw-kbe4326.slack.com/archives/C044N0YV08G/p1718187499818419))
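A minimal sketch of that JSON-to-`Document` mapping, assuming the standard
Unstructured API element shape (`text`, `type`, `element_id`, and `metadata`
keys); the helper name is hypothetical:

```python
from typing import Any, Dict, List

from langchain_core.documents import Document


def elements_json_to_documents(elements_json: List[Dict[str, Any]]) -> List[Document]:
    """Hypothetical helper mapping API JSON elements onto LangChain Documents."""
    docs = []
    for el in elements_json:
        metadata = dict(el.get("metadata", {}))
        metadata["category"] = el.get("type")
        metadata["element_id"] = el.get("element_id")  # newly surfaced by this PR
        docs.append(Document(page_content=el.get("text", ""), metadata=metadata))
    return docs
```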

---------

Signed-off-by: ChengZi <chen.zhang@zilliz.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Isaac Francisco <78627776+isahers1@users.noreply.github.com>
Co-authored-by: ChengZi <chen.zhang@zilliz.com>
John 2024-07-24 19:21:20 -04:00 committed by GitHub
parent 2394807033
commit d59c656ea5
23 changed files with 5929 additions and 347 deletions

View File

@@ -5,7 +5,7 @@
"id": "20deed05",
"metadata": {},
"source": [
"# Unstructured File\n",
"# Unstructured\n",
"\n",
"This notebook covers how to use `Unstructured` package to load files of many types. `Unstructured` currently supports loading of text files, powerpoints, html, pdfs, images, and more.\n",
"\n",
@@ -14,79 +14,69 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"id": "2886982e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.1.1\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"# # Install package\n",
"%pip install --upgrade --quiet \"unstructured[all-docs]\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "54d62efd",
"metadata": {},
"outputs": [],
"source": [
"# # Install other dependencies\n",
"# # https://github.com/Unstructured-IO/unstructured/blob/main/docs/source/installing.rst\n",
"# !brew install libmagic\n",
"# !brew install poppler\n",
"# !brew install tesseract\n",
"# # If parsing xml / html documents:\n",
"# !brew install libxml2\n",
"# !brew install libxslt"
"# Install package, compatible with API partitioning\n",
"%pip install --upgrade --quiet \"langchain-unstructured\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "af6a64f5",
"cell_type": "markdown",
"id": "e75e2a6d",
"metadata": {},
"outputs": [],
"source": [
"# import nltk\n",
"# nltk.download('punkt')"
"### Local Partitioning (Optional)\n",
"\n",
"By default, `langchain-unstructured` installs a smaller footprint that requires\n",
"offloading of the partitioning logic to the Unstructured API.\n",
"\n",
"If you would like to run the partitioning logic locally, you will need to install\n",
"a combination of system dependencies, as outlined in the \n",
"[Unstructured documentation here](https://docs.unstructured.io/open-source/installation/full-installation).\n",
"\n",
"For example, on Macs you can install the required dependencies with:\n",
"\n",
"```bash\n",
"# base dependencies\n",
"brew install libmagic poppler tesseract\n",
"\n",
"# If parsing xml / html documents:\n",
"brew install libxml2 libxslt\n",
"```\n",
"\n",
"You can install the required `pip` dependencies with:\n",
"\n",
"```bash\n",
"pip install \"langchain-unstructured[local]\"\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "a9c1c775",
"metadata": {},
"source": [
"### Quickstart\n",
"\n",
"To simply load a file as a document, you can use the LangChain `DocumentLoader.load` \n",
"interface:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"id": "79d3e549",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.\\n\\nLast year COVID-19 kept us apart. This year we are finally together again.\\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.\\n\\nWith a duty to one another to the American people to the Constit'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"outputs": [],
"source": [
"from langchain_community.document_loaders import UnstructuredFileLoader\n",
"from langchain_unstructured import UnstructuredLoader\n",
"\n",
"loader = UnstructuredFileLoader(\"./example_data/state_of_the_union.txt\")\n",
"loader = UnstructuredLoader(\"./example_data/state_of_the_union.txt\")\n",
"\n",
"docs = loader.load()\n",
"\n",
"docs[0].page_content[:400]"
"docs = loader.load()"
]
},
{
@@ -99,113 +89,31 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 5,
"id": "092d9a0b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!\\n\\n1/22/23, 8:24 PM - User 2: Goodmorning! $50 is too low.\\n\\n1/23/23, 2:59 AM - User 1: How much do you want?\\n\\n1/23/23, 3:00 AM - User 2: Online is at least $100\\n\\n1/23/23, 3:01 AM - User 2: Here is $129\\n\\n1/23/23, 3:01 AM - User 2: <Media omitted>\\n\\n1/23/23, 3:01 AM - User 1: Im not int'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
"name": "stdout",
"output_type": "stream",
"text": [
"whatsapp_chat.txt : 1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are in\n",
"state_of_the_union.txt : May God bless you all. May God protect our troops.\n"
]
}
],
"source": [
"files = [\"./example_data/whatsapp_chat.txt\", \"./example_data/layout-parser-paper.pdf\"]\n",
"file_paths = [\n",
" \"./example_data/whatsapp_chat.txt\",\n",
" \"./example_data/state_of_the_union.txt\",\n",
"]\n",
"\n",
"loader = UnstructuredFileLoader(files)\n",
"loader = UnstructuredLoader(file_paths)\n",
"\n",
"docs = loader.load()\n",
"\n",
"docs[0].page_content[:400]"
]
},
{
"cell_type": "markdown",
"id": "7874d01d",
"metadata": {},
"source": [
"## Retain Elements\n",
"\n",
"Under the hood, Unstructured creates different \"elements\" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode=\"elements\"`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ff5b616d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.', metadata={'source': './example_data/state_of_the_union.txt', 'file_directory': './example_data', 'filename': 'state_of_the_union.txt', 'last_modified': '2024-07-01T11:18:22', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'NarrativeText'}),\n",
" Document(page_content='Last year COVID-19 kept us apart. This year we are finally together again.', metadata={'source': './example_data/state_of_the_union.txt', 'file_directory': './example_data', 'filename': 'state_of_the_union.txt', 'last_modified': '2024-07-01T11:18:22', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'NarrativeText'}),\n",
" Document(page_content='Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.', metadata={'source': './example_data/state_of_the_union.txt', 'file_directory': './example_data', 'filename': 'state_of_the_union.txt', 'last_modified': '2024-07-01T11:18:22', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'NarrativeText'}),\n",
" Document(page_content='With a duty to one another to the American people to the Constitution.', metadata={'source': './example_data/state_of_the_union.txt', 'file_directory': './example_data', 'filename': 'state_of_the_union.txt', 'last_modified': '2024-07-01T11:18:22', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'UncategorizedText'}),\n",
" Document(page_content='And with an unwavering resolve that freedom will always triumph over tyranny.', metadata={'source': './example_data/state_of_the_union.txt', 'file_directory': './example_data', 'filename': 'state_of_the_union.txt', 'last_modified': '2024-07-01T11:18:22', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'NarrativeText'})]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"loader = UnstructuredFileLoader(\n",
" \"./example_data/state_of_the_union.txt\", mode=\"elements\"\n",
")\n",
"\n",
"docs = loader.load()\n",
"\n",
"docs[:5]"
]
},
{
"cell_type": "markdown",
"id": "672733fd",
"metadata": {},
"source": [
"## Define a Partitioning Strategy\n",
"\n",
"Unstructured document loader allow users to pass in a `strategy` parameter that lets `unstructured` know how to partition the document. Currently supported strategies are `\"hi_res\"` (the default) and `\"fast\"`. Hi res partitioning strategies are more accurate, but take longer to process. Fast strategies partition the document more quickly, but trade-off accuracy. Not all document types have separate hi res and fast partitioning strategies. For those document types, the `strategy` kwarg is ignored. In some cases, the high res strategy will fallback to fast if there is a dependency missing (i.e. a model for document partitioning). You can see how to apply a strategy to an `UnstructuredFileLoader` below."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "767238a4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
" Document(page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title'}),\n",
" Document(page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
" Document(page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
" Document(page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io.', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText'})]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_community.document_loaders import UnstructuredFileLoader\n",
"\n",
"loader = UnstructuredFileLoader(\n",
" \"./example_data/layout-parser-paper.pdf\", strategy=\"fast\", mode=\"elements\"\n",
")\n",
"\n",
"docs = loader.load()\n",
"\n",
"docs[5:10]"
"print(docs[0].metadata.get(\"filename\"), \": \", docs[0].page_content[:100])\n",
"print(docs[-1].metadata.get(\"filename\"), \": \", docs[-1].page_content[:100])"
]
},
{
@@ -215,37 +123,52 @@
"source": [
"## PDF Example\n",
"\n",
"Processing PDF documents works exactly the same way. Unstructured detects the file type and extracts the same types of elements. Modes of operation are \n",
"- `single` all the text from all elements are combined into one (default)\n",
"- `elements` maintain individual elements\n",
"- `paged` texts from each page are only combined"
"Processing PDF documents works exactly the same way. Unstructured detects the file type and extracts the same types of elements."
]
},
{
"cell_type": "markdown",
"id": "672733fd",
"metadata": {},
"source": [
"### Define a Partitioning Strategy\n",
"\n",
"Unstructured document loader allow users to pass in a `strategy` parameter that lets Unstructured\n",
"know how to partition pdf and other OCR'd documents. Currently supported strategies are `\"auto\"`,\n",
"`\"hi_res\"`, `\"ocr_only\"`, and `\"fast\"`. Learn more about the different strategies\n",
"[here](https://docs.unstructured.io/open-source/core-functionality/partitioning#partition-pdf). \n",
"\n",
"Not all document types have separate hi res and fast partitioning strategies. For those document types, the `strategy` kwarg is\n",
"ignored. In some cases, the high res strategy will fallback to fast if there is a dependency missing\n",
"(i.e. a model for document partitioning). You can see how to apply a strategy to an\n",
"`UnstructuredLoader` below."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "686e5eb4",
"execution_count": 6,
"id": "60685353",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
" Document(page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title'}),\n",
" Document(page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
" Document(page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
" Document(page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io.', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText'})]"
"[Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'e9fa370aef7ee5c05744eb7bb7d9981b'}, page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a'),\n",
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title', 'element_id': 'bde0b230a1aa488e3ce837d33015181b'}, page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis'),\n",
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': '54700f902899f0c8c90488fa8d825bce'}, page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5'),\n",
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'b650f5867bad9bb4e30384282c79bcfe'}, page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca'),\n",
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText', 'element_id': 'cfc957c94fe63c8fd7c7f4bcb56e75a7'}, page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io.')]"
]
},
"execution_count": 12,
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"loader = UnstructuredFileLoader(\n",
" \"./example_data/layout-parser-paper.pdf\", mode=\"elements\"\n",
")\n",
"from langchain_unstructured import UnstructuredLoader\n",
"\n",
"loader = UnstructuredLoader(\"./example_data/layout-parser-paper.pdf\", strategy=\"fast\")\n",
"\n",
"docs = loader.load()\n",
"\n",
@@ -257,37 +180,39 @@
"id": "1cf27fc8",
"metadata": {},
"source": [
"If you need to post process the `unstructured` elements after extraction, you can pass in a list of `str` -> `str` functions to the `post_processors` kwarg when you instantiate the `UnstructuredFileLoader`. This applies to other Unstructured loaders as well. Below is an example."
"## Post Processing\n",
"\n",
"If you need to post process the `unstructured` elements after extraction, you can pass in a list of\n",
"`str` -> `str` functions to the `post_processors` kwarg when you instantiate the `UnstructuredLoader`. This applies to other Unstructured loaders as well. Below is an example."
]
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 7,
"id": "112e5538",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
" Document(page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title'}),\n",
" Document(page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
" Document(page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
" Document(page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io.', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText'})]"
"[Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'e9fa370aef7ee5c05744eb7bb7d9981b'}, page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a'),\n",
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title', 'element_id': 'bde0b230a1aa488e3ce837d33015181b'}, page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis'),\n",
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': '54700f902899f0c8c90488fa8d825bce'}, page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5'),\n",
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'b650f5867bad9bb4e30384282c79bcfe'}, page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca'),\n",
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText', 'element_id': 'cfc957c94fe63c8fd7c7f4bcb56e75a7'}, page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io.')]"
]
},
"execution_count": 14,
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_community.document_loaders import UnstructuredFileLoader\n",
"from langchain_unstructured import UnstructuredLoader\n",
"from unstructured.cleaners.core import clean_extra_whitespace\n",
"\n",
"loader = UnstructuredFileLoader(\n",
"loader = UnstructuredLoader(\n",
" \"./example_data/layout-parser-paper.pdf\",\n",
" mode=\"elements\",\n",
" post_processors=[clean_extra_whitespace],\n",
")\n",
"\n",
@@ -303,34 +228,70 @@
"source": [
"## Unstructured API\n",
"\n",
"If you want to get up and running with less set up, you can simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or `UnstructuredAPIFileIOLoader`. That will process your document using the hosted Unstructured API. You can generate a free Unstructured API key [here](https://www.unstructured.io/api-key/). The [Unstructured documentation](https://unstructured-io.github.io/unstructured/) page will have instructions on how to generate an API key once theyre available. Check out the instructions [here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image) if youd like to self-host the Unstructured API or run it locally."
"If you want to get up and running with smaller packages and get the most up-to-date partitioning you can `pip install\n",
"unstructured-client` and `pip install langchain-unstructured`. For\n",
"more information about the `UnstructuredLoader`, refer to the\n",
"[Unstructured provider page](https://python.langchain.com/v0.1/docs/integrations/document_loaders/unstructured_file/).\n",
"\n",
"The loader will process your document using the hosted Unstructured serverless API when you pass in\n",
"your `api_key` and set `partition_via_api=True`. You can generate a free\n",
"Unstructured API key [here](https://unstructured.io/api-key/).\n",
"\n",
"Check out the instructions [here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image)\n",
"if youd like to self-host the Unstructured API or run it locally."
]
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"id": "6e5fde16",
"metadata": {},
"outputs": [],
"source": [
"# Install package\n",
"%pip install \"langchain-unstructured\"\n",
"%pip install \"unstructured-client\"\n",
"\n",
"# Set API key\n",
"import os\n",
"\n",
"os.environ[\"UNSTRUCTURED_API_KEY\"] = \"FAKE_API_KEY\""
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "386eb63c",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: Preparing to split document for partition.\n",
"INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.\n",
"INFO: Partitioning without split.\n",
"INFO: Successfully partitioned the document.\n"
]
},
{
"data": {
"text/plain": [
"Document(page_content='Lorem ipsum dolor sit amet.', metadata={'source': 'example_data/fake.docx'})"
"Document(metadata={'source': 'example_data/fake.docx', 'category_depth': 0, 'filename': 'fake.docx', 'languages': ['por', 'cat'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title', 'element_id': '56d531394823d81787d77a04462ed096'}, page_content='Lorem ipsum dolor sit amet.')"
]
},
"execution_count": 4,
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_community.document_loaders import UnstructuredAPIFileLoader\n",
"from langchain_unstructured import UnstructuredLoader\n",
"\n",
"filenames = [\"example_data/fake.docx\", \"example_data/fake-email.eml\"]\n",
"\n",
"loader = UnstructuredAPIFileLoader(\n",
" file_path=filenames[0],\n",
" api_key=\"FAKE_API_KEY\",\n",
"loader = UnstructuredLoader(\n",
" file_path=\"example_data/fake.docx\",\n",
" api_key=os.getenv(\"UNSTRUCTURED_API_KEY\"),\n",
" partition_via_api=True,\n",
")\n",
"\n",
"docs = loader.load()\n",
@@ -342,43 +303,197 @@
"id": "94158999",
"metadata": {},
"source": [
"You can also batch multiple files through the Unstructured API in a single API using `UnstructuredAPIFileLoader`."
"You can also batch multiple files through the Unstructured API in a single API using `UnstructuredLoader`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 10,
"id": "a3d7c846",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='Lorem ipsum dolor sit amet.\\n\\nThis is a test email to use for unit tests.\\n\\nImportant points:\\n\\nRoses are red\\n\\nViolets are blue', metadata={'source': ['example_data/fake.docx', 'example_data/fake-email.eml']})"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: Preparing to split document for partition.\n",
"INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.\n",
"INFO: Partitioning without split.\n",
"INFO: Successfully partitioned the document.\n",
"INFO: Preparing to split document for partition.\n",
"INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.\n",
"INFO: Partitioning without split.\n",
"INFO: Successfully partitioned the document.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"fake.docx : Lorem ipsum dolor sit amet.\n",
"fake-email.eml : Violets are blue\n"
]
}
],
"source": [
"loader = UnstructuredAPIFileLoader(\n",
" file_path=filenames,\n",
" api_key=\"FAKE_API_KEY\",\n",
"loader = UnstructuredLoader(\n",
" file_path=[\"example_data/fake.docx\", \"example_data/fake-email.eml\"],\n",
" api_key=os.getenv(\"UNSTRUCTURED_API_KEY\"),\n",
" partition_via_api=True,\n",
")\n",
"\n",
"docs = loader.load()\n",
"docs[0]"
"\n",
"print(docs[0].metadata[\"filename\"], \": \", docs[0].page_content[:100])\n",
"print(docs[-1].metadata[\"filename\"], \": \", docs[-1].page_content[:100])"
]
},
{
"cell_type": "markdown",
"id": "a324a0db",
"metadata": {},
"source": [
"### Unstructured SDK Client\n",
"\n",
"Partitioning with the Unstructured API relies on the [Unstructured SDK\n",
"Client](https://docs.unstructured.io/api-reference/api-services/sdk).\n",
"\n",
"Below is an example showing how you can customize some features of the client and use your own\n",
"`requests.Session()`, pass in an alternative `server_url`, or customize the `RetryConfig` object for more control over how failed requests are handled."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0e510495",
"execution_count": 11,
"id": "58e55264",
"metadata": {},
"outputs": [],
"source": []
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: Preparing to split document for partition.\n",
"INFO: Concurrency level set to 5\n",
"INFO: Splitting pages 1 to 16 (16 total)\n",
"INFO: Determined optimal split size of 4 pages.\n",
"INFO: Partitioning 4 files with 4 page(s) each.\n",
"INFO: Partitioning set #1 (pages 1-4).\n",
"INFO: Partitioning set #2 (pages 5-8).\n",
"INFO: Partitioning set #3 (pages 9-12).\n",
"INFO: Partitioning set #4 (pages 13-16).\n",
"INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general \"HTTP/1.1 200 OK\"\n",
"INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general \"HTTP/1.1 200 OK\"\n",
"INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general \"HTTP/1.1 200 OK\"\n",
"INFO: Successfully partitioned set #1, elements added to the final result.\n",
"INFO: Successfully partitioned set #2, elements added to the final result.\n",
"INFO: Successfully partitioned set #3, elements added to the final result.\n",
"INFO: Successfully partitioned set #4, elements added to the final result.\n",
"INFO: Successfully partitioned the document.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"layout-parser-paper.pdf : LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis\n"
]
}
],
"source": [
"import requests\n",
"from langchain_unstructured import UnstructuredLoader\n",
"from unstructured_client import UnstructuredClient\n",
"from unstructured_client.utils import BackoffStrategy, RetryConfig\n",
"\n",
"client = UnstructuredClient(\n",
" api_key_auth=os.getenv(\n",
" \"UNSTRUCTURED_API_KEY\"\n",
" ), # Note: the client API param is \"api_key_auth\" instead of \"api_key\"\n",
" client=requests.Session(),\n",
" server_url=\"https://api.unstructuredapp.io/general/v0/general\",\n",
" retry_config=RetryConfig(\n",
" strategy=\"backoff\",\n",
" retry_connection_errors=True,\n",
" backoff=BackoffStrategy(\n",
" initial_interval=500,\n",
" max_interval=60000,\n",
" exponent=1.5,\n",
" max_elapsed_time=900000,\n",
" ),\n",
" ),\n",
")\n",
"\n",
"loader = UnstructuredLoader(\n",
" \"./example_data/layout-parser-paper.pdf\",\n",
" partition_via_api=True,\n",
" client=client,\n",
")\n",
"\n",
"docs = loader.load()\n",
"\n",
"print(docs[0].metadata[\"filename\"], \": \", docs[0].page_content[:100])"
]
},
{
"cell_type": "markdown",
"id": "c66fbeb3",
"metadata": {},
"source": [
"## Chunking\n",
"\n",
"The `UnstructuredLoader` does not support `mode` as parameter for grouping text like the older\n",
"loader `UnstructuredFileLoader` and others did. It instead supports \"chunking\". Chunking in\n",
"unstructured differs from other chunking mechanisms you may be familiar with that form chunks based\n",
"on plain-text features--character sequences like \"\\n\\n\" or \"\\n\" that might indicate a paragraph\n",
"boundary or list-item boundary. Instead, all documents are split using specific knowledge about each\n",
"document format to partition the document into semantic units (document elements) and we only need to\n",
"resort to text-splitting when a single element exceeds the desired maximum chunk size. In general,\n",
"chunking combines consecutive elements to form chunks as large as possible without exceeding the\n",
"maximum chunk size. Chunking produces a sequence of CompositeElement, Table, or TableChunk elements.\n",
"Each “chunk” is an instance of one of these three types.\n",
"\n",
"See this [page](https://docs.unstructured.io/open-source/core-functionality/chunking) for more\n",
"details about chunking options, but to reproduce the same behavior as `mode=\"single\"`, you can set\n",
"`chunking_strategy=\"basic\"`, `max_characters=<some-really-big-number>`, and `include_orig_elements=False`."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "e9f1c20d",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING: Partitioning locally even though api_key is defined since partition_via_api=False.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of LangChain documents: 1\n",
"Length of text in the document: 42772\n"
]
}
],
"source": [
"from langchain_unstructured import UnstructuredLoader\n",
"\n",
"loader = UnstructuredLoader(\n",
" \"./example_data/layout-parser-paper.pdf\",\n",
" chunking_strategy=\"basic\",\n",
" max_characters=1000000,\n",
" include_orig_elements=False,\n",
")\n",
"\n",
"docs = loader.load()\n",
"\n",
"print(\"Number of LangChain documents:\", len(docs))\n",
"print(\"Length of text in the document:\", len(docs[0].page_content))"
]
}
],
"metadata": {
@@ -397,7 +512,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.5"
"version": "3.10.13"
}
},
"nbformat": 4,

View File

@@ -40,6 +40,7 @@ These providers have standalone `langchain-{provider}` packages for improved ver
- [Qdrant](/docs/integrations/providers/qdrant)
- [Robocorp](/docs/integrations/providers/robocorp)
- [Together AI](/docs/integrations/providers/together)
- [Unstructured](/docs/integrations/providers/unstructured)
- [Upstage](/docs/integrations/providers/upstage)
- [Voyage AI](/docs/integrations/providers/voyageai)

View File

@@ -8,11 +8,21 @@ ecosystem within LangChain.
## Installation and Setup
If you are using a loader that runs locally, use the following steps to get `unstructured` and
its dependencies running locally.
If you are using a loader that runs locally, use the following steps to get `unstructured` and its
dependencies running.
- Install the Python SDK with `pip install unstructured`.
- You can install document specific dependencies with extras, i.e. `pip install "unstructured[docx]"`.
- For the smallest installation footprint and to take advantage of features not available in the
open-source `unstructured` package, install the Python SDK with `pip install unstructured-client`
along with `pip install langchain-unstructured` to use the `UnstructuredLoader` and partition
remotely against the Unstructured API. This loader lives
in a LangChain partner repo instead of the `langchain-community` repo and you will need an
`api_key`; you can generate a free key [here](https://unstructured.io/api-key/).
- Unstructured's documentation for the SDK can be found here:
https://docs.unstructured.io/api-reference/api-services/sdk
- To run everything locally, install the open-source Python package with `pip install unstructured`
along with `pip install langchain-community` and use the same `UnstructuredLoader` as mentioned above.
- You can install document specific dependencies with extras, e.g. `pip install "unstructured[docx]"`.
- To install the dependencies for all document types, use `pip install "unstructured[all-docs]"`.
- Install the following system dependencies if they are not already available on your system with e.g. `brew install` for Mac.
Depending on what document types you're parsing, you may not need all of these.
@@ -22,16 +32,11 @@ its dependencies running locally.
- `qpdf` (PDFs)
- `libreoffice` (MS Office docs)
- `pandoc` (EPUBs)
- When running locally, Unstructured also recommends using Docker [by following this
guide](https://docs.unstructured.io/open-source/installation/docker-installation) to ensure all
system dependencies are installed correctly.
When running locally, Unstructured also recommends using Docker [by following this guide](https://docs.unstructured.io/open-source/installation/docker-installation)
to ensure all system dependencies are installed correctly.
If you want to get up and running with less set up, you can
simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or
`UnstructuredAPIFileIOLoader`. That will process your document using the hosted Unstructured API.
The `Unstructured API` requires API keys to make requests.
The Unstructured API requires API keys to make requests.
You can request an API key [here](https://unstructured.io/api-key-hosted) and start using it today!
Check out the README [here](https://github.com/Unstructured-IO/unstructured-api) to get started making API calls.
We'd love to hear your feedback; let us know how it goes in our [community slack](https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-1x7cgo0pg-PTptXWylzPQF9xZolzCnwQ).
@@ -42,30 +47,21 @@ Check out the instructions
## Data Loaders
The primary usage of the `Unstructured` is in data loaders.
The primary usage of `Unstructured` is in data loaders.
### UnstructuredAPIFileIOLoader
### UnstructuredLoader
See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api).
See a [usage example](/docs/integrations/document_loaders/unstructured_file) showing how you can use
this loader to partition both locally and remotely with the serverless Unstructured API.
```python
from langchain_community.document_loaders import UnstructuredAPIFileIOLoader
```
### UnstructuredAPIFileLoader
See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api).
```python
from langchain_community.document_loaders import UnstructuredAPIFileLoader
from langchain_unstructured import UnstructuredLoader
```
### UnstructuredCHMLoader
`CHM` means `Microsoft Compiled HTML Help`.
See a usage example in the API documentation.
```python
from langchain_community.document_loaders import UnstructuredCHMLoader
```
@@ -119,15 +115,6 @@ See a [usage example](/docs/integrations/document_loaders/google_drive#passing-i
from langchain_community.document_loaders import UnstructuredFileIOLoader
```
### UnstructuredFileLoader
See a [usage example](/docs/integrations/document_loaders/unstructured_file).
```python
from langchain_community.document_loaders import UnstructuredFileLoader
```
### UnstructuredHTMLLoader
See a [usage example](/docs/how_to/document_loader_html).


@@ -1,14 +1,23 @@
"""Loader that uses unstructured to load files."""
import collections
from __future__ import annotations
import logging
import os
from abc import ABC, abstractmethod
from pathlib import Path
from typing import IO, Any, Callable, Iterator, List, Optional, Sequence, Union
from langchain_core._api.deprecation import deprecated
from langchain_core.documents import Document
from typing_extensions import TypeAlias
from langchain_community.document_loaders.base import BaseLoader
Element: TypeAlias = Any
logger = logging.getLogger(__file__)
def satisfies_min_unstructured_version(min_version: str) -> bool:
"""Check if the installed `Unstructured` version exceeds the minimum version
@@ -41,8 +50,8 @@ class UnstructuredBaseLoader(BaseLoader, ABC):
def __init__(
self,
mode: str = "single",
post_processors: Optional[List[Callable]] = None,
mode: str = "single", # deprecated
post_processors: Optional[List[Callable[[str], str]]] = None,
**unstructured_kwargs: Any,
):
"""Initialize with file path."""
@@ -53,32 +62,41 @@ class UnstructuredBaseLoader(BaseLoader, ABC):
"unstructured package not found, please install it with "
"`pip install unstructured`"
)
# `single` - elements are combined into one (default)
# `elements` - maintain individual elements
# `paged` - elements are combined by page
_valid_modes = {"single", "elements", "paged"}
if mode not in _valid_modes:
raise ValueError(
f"Got {mode} for `mode`, but should be one of `{_valid_modes}`"
)
if not satisfies_min_unstructured_version("0.5.4"):
if "strategy" in unstructured_kwargs:
unstructured_kwargs.pop("strategy")
self._check_if_both_mode_and_chunking_strategy_are_by_page(
mode, unstructured_kwargs
)
self.mode = mode
self.unstructured_kwargs = unstructured_kwargs
self.post_processors = post_processors or []
@abstractmethod
def _get_elements(self) -> List[Element]:
"""Get elements."""
@abstractmethod
def _get_metadata(self) -> dict[str, Any]:
"""Get file_path metadata if available."""
def _post_process_elements(self, elements: List[Element]) -> List[Element]:
"""Apply post processing functions to extracted unstructured elements.
Post processing functions are str -> str callables passed
in using the post_processors kwarg when the loader is instantiated.
"""
for element in elements:
for post_processor in self.post_processors:
element.apply(post_processor)
@@ -97,18 +115,25 @@ class UnstructuredBaseLoader(BaseLoader, ABC):
metadata.update(element.metadata.to_dict())
if hasattr(element, "category"):
metadata["category"] = element.category
if element.to_dict().get("element_id"):
metadata["element_id"] = element.to_dict().get("element_id")
yield Document(page_content=str(element), metadata=metadata)
elif self.mode == "paged":
logger.warning(
"`mode='paged'` is deprecated in favor of the 'by_page' chunking"
" strategy. Learn more about chunking here:"
" https://docs.unstructured.io/open-source/core-functionality/chunking"
)
text_dict: dict[int, str] = {}
meta_dict: dict[int, dict[str, Any]] = {}
for element in elements:
metadata = self._get_metadata()
if hasattr(element, "metadata"):
metadata.update(element.metadata.to_dict())
page_number = metadata.get("page_number", 1)
# Check if this page_number already exists in text_dict
if page_number not in text_dict:
# If not, create new entry with initial text and metadata
text_dict[page_number] = str(element) + "\n\n"
@@ -128,18 +153,37 @@
else:
raise ValueError(f"mode of {self.mode} not supported.")
def _check_if_both_mode_and_chunking_strategy_are_by_page(
self, mode: str, unstructured_kwargs: dict[str, Any]
) -> None:
if (
mode == "paged"
and unstructured_kwargs.get("chunking_strategy") == "by_page"
):
raise ValueError(
"Only one of `chunking_strategy='by_page'` or `mode='paged'` may be"
" set. `chunking_strategy` is preferred."
)
@deprecated(
since="0.2.8",
removal="0.4.0",
alternative_import="langchain_unstructured.UnstructuredLoader",
)
class UnstructuredFileLoader(UnstructuredBaseLoader):
"""Load files using `Unstructured`.
The file loader uses the unstructured partition function and will automatically
detect the file type. You can run the loader in different modes: "single",
"elements", and "paged". The default "single" mode will return a single langchain
Document object. If you use "elements" mode, the unstructured library will split
the document into elements such as Title and NarrativeText and return those as
individual langchain Document objects. In addition to these post-processing modes
(which are specific to the LangChain Loaders), Unstructured has its own "chunking"
parameters for post-processing elements into more useful chunks for use cases such
as Retrieval Augmented Generation (RAG). You can pass in additional unstructured
kwargs to configure different unstructured settings.
Examples
--------
@@ -152,24 +196,27 @@ class UnstructuredFileLoader(UnstructuredBaseLoader):
References
----------
https://docs.unstructured.io/open-source/core-functionality/partitioning
https://docs.unstructured.io/open-source/core-functionality/chunking
"""
def __init__(
self,
file_path: Union[str, List[str], Path, List[Path]],
*,
mode: str = "single",
**unstructured_kwargs: Any,
):
"""Initialize with file path."""
self.file_path = file_path
super().__init__(mode=mode, **unstructured_kwargs)
def _get_elements(self) -> List[Element]:
from unstructured.partition.auto import partition
if isinstance(self.file_path, list):
elements: List[Element] = []
for file in self.file_path:
if isinstance(file, Path):
file = str(file)
@@ -180,35 +227,33 @@ class UnstructuredFileLoader(UnstructuredBaseLoader):
self.file_path = str(self.file_path)
return partition(filename=self.file_path, **self.unstructured_kwargs)
def _get_metadata(self) -> dict[str, Any]:
return {"source": self.file_path}
def get_elements_from_api(
file_path: Union[str, List[str], Path, List[Path], None] = None,
file: Union[IO[bytes], Sequence[IO[bytes]], None] = None,
api_url: str = "https://api.unstructuredapp.io/general/v0/general",
api_key: str = "",
**unstructured_kwargs: Any,
) -> List[Element]:
"""Retrieve a list of elements from the `Unstructured API`."""
if is_list := isinstance(file_path, list):
file_path = [str(path) for path in file_path]
if isinstance(file, Sequence) or is_list:
from unstructured.partition.api import partition_multiple_via_api
_doc_elements = partition_multiple_via_api(
filenames=file_path, # type: ignore
files=file, # type: ignore
api_key=api_key,
api_url=api_url,
**unstructured_kwargs,
)
elements = []
for _elements in _doc_elements:
elements.extend(_elements)
return elements
else:
from unstructured.partition.api import partition_via_api
@@ -222,59 +267,69 @@ def get_elements_from_api(
)
@deprecated(
since="0.2.8",
removal="0.4.0",
alternative_import="langchain_unstructured.UnstructuredLoader",
)
class UnstructuredAPIFileLoader(UnstructuredBaseLoader):
"""Load files using `Unstructured` API.
By default, the loader makes a call to the hosted Unstructured API. If you are
running the unstructured API locally, you can change the API URL by passing in the
url parameter when you initialize the loader. The hosted Unstructured API requires
an API key. See the links below to learn more about our API offerings and get an
API key.
You can run the loader in different modes: "single", "elements", and "paged". The
default "single" mode will return a single langchain Document object. If you use
"elements" mode, the unstructured library will split the document into elements such
as Title and NarrativeText and return those as individual langchain Document
objects. In addition to these post-processing modes (which are specific to the
LangChain Loaders), Unstructured has its own "chunking" parameters for
post-processing elements into more useful chunks for use cases such as Retrieval
Augmented Generation (RAG). You can pass in additional unstructured kwargs to
configure different unstructured settings.
Examples
```python
from langchain_community.document_loaders import UnstructuredAPIFileLoader
loader = UnstructuredAPIFileLoader(
"example.pdf", mode="elements", strategy="fast", api_key="MY_API_KEY",
)
docs = loader.load()
References
----------
https://docs.unstructured.io/api-reference/api-services/sdk
https://docs.unstructured.io/api-reference/api-services/overview
https://docs.unstructured.io/open-source/core-functionality/partitioning
https://docs.unstructured.io/open-source/core-functionality/chunking
"""
def __init__(
self,
file_path: Union[str, List[str]],
*,
mode: str = "single",
url: str = "https://api.unstructured.io/general/v0/general",
url: str = "https://api.unstructuredapp.io/general/v0/general",
api_key: str = "",
**unstructured_kwargs: Any,
):
"""Initialize with file path."""
validate_unstructured_version(min_unstructured_version="0.10.15")
self.file_path = file_path
self.url = url
self.api_key = os.getenv("UNSTRUCTURED_API_KEY") or api_key
super().__init__(mode=mode, **unstructured_kwargs)
def _get_metadata(self) -> dict[str, Any]:
return {"source": self.file_path}
def _get_elements(self) -> List[Element]:
return get_elements_from_api(
file_path=self.file_path,
api_key=self.api_key,
@@ -282,18 +337,36 @@ class UnstructuredAPIFileLoader(UnstructuredFileLoader):
**self.unstructured_kwargs,
)
def _post_process_elements(self, elements: List[Element]) -> List[Element]:
"""Apply post processing functions to extracted unstructured elements.
Post processing functions are str -> str callables passed
in using the post_processors kwarg when the loader is instantiated.
"""
for element in elements:
for post_processor in self.post_processors:
element.apply(post_processor)
return elements
@deprecated(
since="0.2.8",
removal="0.4.0",
alternative_import="langchain_unstructured.UnstructuredLoader",
)
class UnstructuredFileIOLoader(UnstructuredBaseLoader):
"""Load files using `Unstructured`.
"""Load file-like objects opened in read mode using `Unstructured`.
The file loader uses the unstructured partition function and will automatically
detect the file type. You can run the loader in different modes: "single",
"elements", and "paged". The default "single" mode will return a single langchain
Document object. If you use "elements" mode, the unstructured library will split
the document into elements such as Title and NarrativeText and return those as
individual langchain Document objects. In addition to these post-processing modes
(which are specific to the LangChain Loaders), Unstructured has its own "chunking"
parameters for post-processing elements into more useful chunks for use cases
such as Retrieval Augmented Generation (RAG). You can pass in additional
unstructured kwargs to configure different unstructured settings.
Examples
--------
@@ -308,12 +381,14 @@ class UnstructuredFileIOLoader(UnstructuredBaseLoader):
References
----------
https://docs.unstructured.io/open-source/core-functionality/partitioning
https://docs.unstructured.io/open-source/core-functionality/chunking
"""
def __init__(
self,
file: IO[bytes],
*,
mode: str = "single",
**unstructured_kwargs: Any,
):
@@ -321,72 +396,114 @@ class UnstructuredFileIOLoader(UnstructuredBaseLoader):
self.file = file
super().__init__(mode=mode, **unstructured_kwargs)
def _get_elements(self) -> List[Element]:
from unstructured.partition.auto import partition
return partition(file=self.file, **self.unstructured_kwargs)
def _get_metadata(self) -> dict[str, Any]:
return {}
def _post_process_elements(self, elements: List[Element]) -> List[Element]:
"""Apply post processing functions to extracted unstructured elements.
Post processing functions are str -> str callables passed
in using the post_processors kwarg when the loader is instantiated.
"""
for element in elements:
for post_processor in self.post_processors:
element.apply(post_processor)
return elements
@deprecated(
since="0.2.8",
removal="0.4.0",
alternative_import="langchain_unstructured.UnstructuredLoader",
)
class UnstructuredAPIFileIOLoader(UnstructuredBaseLoader):
"""Send file-like objects with `unstructured-client` sdk to the Unstructured API.
By default, the loader makes a call to the hosted Unstructured API. If you are
running the unstructured API locally, you can change the API URL by passing in the
url parameter when you initialize the loader. The hosted Unstructured API requires
an API key. See the links below to learn more about our API offerings and get an
API key.
You can run the loader in different modes: "single", "elements", and "paged". The
default "single" mode will return a single langchain Document object. If you use
"elements" mode, the unstructured library will split the document into elements
such as Title and NarrativeText and return those as individual langchain Document
objects. In addition to these post-processing modes (which are specific to the
LangChain Loaders), Unstructured has its own "chunking" parameters for
post-processing elements into more useful chunks for use cases such as Retrieval
Augmented Generation (RAG). You can pass in additional unstructured kwargs to
configure different unstructured settings.
Examples
--------
from langchain_community.document_loaders import UnstructuredAPIFileIOLoader
with open("example.pdf", "rb") as f:
loader = UnstructuredAPIFileIOLoader(
f, mode="elements", strategy="fast", api_key="MY_API_KEY",
)
docs = loader.load()
References
----------
https://docs.unstructured.io/api-reference/api-services/sdk
https://docs.unstructured.io/api-reference/api-services/overview
https://docs.unstructured.io/open-source/core-functionality/partitioning
https://docs.unstructured.io/open-source/core-functionality/chunking
"""
def __init__(
self,
file: Union[IO[bytes], Sequence[IO[bytes]]],
*,
mode: str = "single",
url: str = "https://api.unstructured.io/general/v0/general",
url: str = "https://api.unstructuredapp.io/general/v0/general",
api_key: str = "",
**unstructured_kwargs: Any,
):
"""Initialize with file path."""
if isinstance(file, Sequence):
validate_unstructured_version(min_unstructured_version="0.6.3")
validate_unstructured_version(min_unstructured_version="0.6.2")
self.file = file
self.url = url
self.api_key = os.getenv("UNSTRUCTURED_API_KEY") or api_key
super().__init__(mode=mode, **unstructured_kwargs)
def _get_elements(self) -> List[Element]:
if self.unstructured_kwargs.get("metadata_filename"):
return get_elements_from_api(
file=self.file,
file_path=self.unstructured_kwargs.pop("metadata_filename"),
api_key=self.api_key,
api_url=self.url,
**self.unstructured_kwargs,
)
else:
raise ValueError(
"If partitioning a file via api,"
" metadata_filename must be specified as well.",
)
def _get_metadata(self) -> dict[str, Any]:
return {}
def _post_process_elements(self, elements: List[Element]) -> List[Element]:
"""Apply post processing functions to extracted unstructured elements.
Post processing functions are str -> str callables passed
in using the post_processors kwarg when the loader is instantiated.
"""
for element in elements:
for post_processor in self.post_processors:
element.apply(post_processor)
return elements

libs/partners/unstructured/.gitignore vendored Normal file

@@ -0,0 +1 @@
__pycache__


@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2024 LangChain, Inc.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


@@ -0,0 +1,66 @@
.PHONY: all format lint test tests integration_tests docker_tests help extended_tests
# Default target executed when no arguments are given to make.
all: help
# Define a variable for the test file path.
TEST_FILE ?= tests/unit_tests/
integration_test integration_tests: TEST_FILE = tests/integration_tests/
# unit tests are run with the --disable-socket flag to prevent network calls
test tests:
poetry run pytest --disable-socket --allow-unix-socket $(TEST_FILE)
# integration tests are run without the --disable-socket flag to allow network calls
integration_test:
poetry run pytest $(TEST_FILE)
# skip tests marked as local in CI
integration_tests:
poetry run pytest $(TEST_FILE) -m "not local"
######################
# LINTING AND FORMATTING
######################
# Define a variable for Python and notebook files.
PYTHON_FILES=.
MYPY_CACHE=.mypy_cache
lint format: PYTHON_FILES=.
lint_diff format_diff: PYTHON_FILES=$(shell git diff --relative=libs/partners/unstructured --name-only --diff-filter=d master | grep -E '\.py$$|\.ipynb$$')
lint_package: PYTHON_FILES=langchain_unstructured
lint_tests: PYTHON_FILES=tests
lint_tests: MYPY_CACHE=.mypy_cache_test
lint lint_diff lint_package lint_tests:
poetry run ruff .
poetry run ruff format $(PYTHON_FILES) --diff
poetry run ruff --select I $(PYTHON_FILES)
mkdir -p $(MYPY_CACHE); poetry run mypy $(PYTHON_FILES) --cache-dir $(MYPY_CACHE)
format format_diff:
poetry run ruff format $(PYTHON_FILES)
poetry run ruff --select I --fix $(PYTHON_FILES)
spell_check:
poetry run codespell --toml pyproject.toml
spell_fix:
poetry run codespell --toml pyproject.toml -w
check_imports: $(shell find langchain_unstructured -name '*.py')
poetry run python ./scripts/check_imports.py $^
######################
# HELP
######################
help:
@echo '----'
@echo 'check_imports - check imports'
@echo 'format - run code formatters'
@echo 'lint - run linters'
@echo 'test - run unit tests'
@echo 'tests - run unit tests'
@echo 'test TEST_FILE=<test_file> - run all tests in file'


@@ -0,0 +1,71 @@
# langchain-unstructured
This package contains the LangChain integration with Unstructured
## Installation
```bash
pip install -U langchain-unstructured
```
And you should configure credentials by setting the following environment variables:
```bash
export UNSTRUCTURED_API_KEY="your-api-key"
```
## Loaders
Partition and load files using either the `unstructured-client` SDK and the
Unstructured API, or locally using the `unstructured` library.
API:
To partition via the Unstructured API, `pip install unstructured-client`, set
`partition_via_api=True`, and define `api_key`. If you are running the Unstructured API
locally, you can change the API URL by defining `server_url` when you initialize the
loader. The hosted Unstructured API requires an API key. See the links below to
learn more about our API offerings and get an API key.
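As a sketch, pointing the loader at a self-hosted API deployment (the URL below is illustrative):

```python
from langchain_unstructured import UnstructuredLoader

# Override the server URL to target a locally running Unstructured API
# instead of the hosted service.
loader = UnstructuredLoader(
    file_path="example.pdf",
    partition_via_api=True,
    server_url="http://localhost:8000/general/v0/general",
)
docs = loader.load()
```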
Local:
By default the file loader uses the Unstructured `partition` function and will
automatically detect the file type.
In addition to document specific partition parameters, Unstructured has a rich set
of "chunking" parameters for post-processing elements into more useful text segments
for use cases such as Retrieval Augmented Generation (RAG). You can pass additional
Unstructured kwargs to the loader to configure different unstructured settings.
Setup:
```bash
pip install -U langchain-unstructured
pip install -U unstructured-client
export UNSTRUCTURED_API_KEY="your-api-key"
```
Instantiate:
```python
from langchain_unstructured import UnstructuredLoader
loader = UnstructuredLoader(
file_path=["example.pdf", "fake.pdf"],
api_key=UNSTRUCTURED_API_KEY,
partition_via_api=True,
chunking_strategy="by_title",
strategy="fast",
)
```
Load:
```python
docs = loader.load()
print(docs[0].page_content[:100])
print(docs[0].metadata)
```
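You can also stream documents with `lazy_load` and clean up element text with
`post_processors`; a minimal sketch (the whitespace-collapsing function is illustrative):

```python
from langchain_unstructured import UnstructuredLoader

def collapse_whitespace(text: str) -> str:
    # Illustrative str -> str post-processor applied to each element's text
    return " ".join(text.split())

loader = UnstructuredLoader(
    file_path="example.pdf",
    post_processors=[collapse_whitespace],
    strategy="fast",
)
for doc in loader.lazy_load():
    print(doc.metadata.get("category"), doc.page_content[:60])
```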
References
----------
https://docs.unstructured.io/api-reference/api-services/sdk
https://docs.unstructured.io/api-reference/api-services/overview
https://docs.unstructured.io/open-source/core-functionality/partitioning
https://docs.unstructured.io/open-source/core-functionality/chunking


@@ -0,0 +1,15 @@
from importlib import metadata
from langchain_unstructured.document_loaders import UnstructuredLoader
try:
__version__ = metadata.version(__package__)
except metadata.PackageNotFoundError:
# Case where package metadata is not available.
__version__ = ""
del metadata # optional, avoids polluting the results of dir(__package__)
__all__ = [
"UnstructuredLoader",
"__version__",
]


@@ -0,0 +1,280 @@
"""Unstructured document loader."""
from __future__ import annotations
import json
import logging
import os
from pathlib import Path
from typing import IO, Any, Callable, Iterator, Optional, cast
from langchain_core.document_loaders.base import BaseLoader
from langchain_core.documents import Document
from typing_extensions import TypeAlias
from unstructured_client import UnstructuredClient # type: ignore
from unstructured_client.models import operations, shared # type: ignore
Element: TypeAlias = Any
logger = logging.getLogger(__file__)
_DEFAULT_URL = "https://api.unstructuredapp.io/general/v0/general"
class UnstructuredLoader(BaseLoader):
"""Unstructured document loader interface.
Partition and load files using either the `unstructured-client` sdk and the
Unstructured API or locally using the `unstructured` library.
API:
This package is configured to work with the Unstructured API by default.
To use the Unstructured API, set
`partition_via_api=True` and define `api_key`. If you are running the unstructured
API locally, you can change the API URL by defining `server_url` when you initialize the
loader. The hosted Unstructured API requires an API key. See the links below to
learn more about our API offerings and get an API key.
Local:
To partition files locally, you must have the `unstructured` package installed.
You can install it with `pip install unstructured`.
By default the file loader uses the Unstructured `partition` function and will
automatically detect the file type.
In addition to document specific partition parameters, Unstructured has a rich set
of "chunking" parameters for post-processing elements into more useful text segments
for use cases such as Retrieval Augmented Generation (RAG). You can pass additional
Unstructured kwargs to the loader to configure different unstructured settings.
Setup:
.. code-block:: bash
pip install -U langchain-unstructured
export UNSTRUCTURED_API_KEY="your-api-key"
Instantiate:
.. code-block:: python
from langchain_unstructured import UnstructuredLoader
loader = UnstructuredLoader(
file_path=["example.pdf", "fake.pdf"],
api_key=UNSTRUCTURED_API_KEY,
partition_via_api=True,
chunking_strategy="by_title",
strategy="fast",
)
Load:
.. code-block:: python
docs = loader.load()
print(docs[0].page_content[:100])
print(docs[0].metadata)
References
----------
https://docs.unstructured.io/api-reference/api-services/sdk
https://docs.unstructured.io/api-reference/api-services/overview
https://docs.unstructured.io/open-source/core-functionality/partitioning
https://docs.unstructured.io/open-source/core-functionality/chunking
"""
def __init__(
self,
file_path: Optional[str | Path | list[str] | list[Path]] = None,
*,
file: Optional[IO[bytes] | list[IO[bytes]]] = None,
partition_via_api: bool = False,
post_processors: Optional[list[Callable[[str], str]]] = None,
# SDK parameters
api_key: Optional[str] = None,
client: Optional[UnstructuredClient] = None,
server_url: Optional[str] = None,
**kwargs: Any,
):
"""Initialize loader."""
if file_path is not None and file is not None:
raise ValueError("file_path and file cannot be defined simultaneously.")
if client is not None:
disallowed_params = [("api_key", api_key), ("server_url", server_url)]
bad_params = [
param for param, value in disallowed_params if value is not None
]
if bad_params:
raise ValueError(
"if you are passing a custom `client`, you cannot also pass these "
f"params: {', '.join(bad_params)}."
)
unstructured_api_key = api_key or os.getenv("UNSTRUCTURED_API_KEY")
unstructured_url = server_url or os.getenv("UNSTRUCTURED_URL") or _DEFAULT_URL
self.client = client or UnstructuredClient(
api_key_auth=unstructured_api_key, server_url=unstructured_url
)
self.file_path = file_path
self.file = file
self.partition_via_api = partition_via_api
self.post_processors = post_processors
self.unstructured_kwargs = kwargs
def lazy_load(self) -> Iterator[Document]:
"""Load file(s) to the _UnstructuredBaseLoader."""
def load_file(
f: Optional[IO[bytes]] = None, f_path: Optional[str | Path] = None
) -> Iterator[Document]:
"""Load an individual file to the _UnstructuredBaseLoader."""
return _SingleDocumentLoader(
file=f,
file_path=f_path,
partition_via_api=self.partition_via_api,
post_processors=self.post_processors,
# SDK parameters
client=self.client,
**self.unstructured_kwargs,
).lazy_load()
if isinstance(self.file, list):
for f in self.file:
yield from load_file(f=f)
return
if isinstance(self.file_path, list):
for f_path in self.file_path:
yield from load_file(f_path=f_path)
return
# Call _SingleDocumentLoader normally since file and file_path are not lists
yield from load_file(f=self.file, f_path=self.file_path)
class _SingleDocumentLoader(BaseLoader):
"""Provides loader functionality for individual document/file objects.
Encapsulates partitioning individual file objects (file or file_path) either
locally or via the Unstructured API.
"""
def __init__(
self,
file_path: Optional[str | Path] = None,
*,
client: UnstructuredClient,
file: Optional[IO[bytes]] = None,
partition_via_api: bool = False,
post_processors: Optional[list[Callable[[str], str]]] = None,
# SDK parameters
**kwargs: Any,
):
"""Initialize loader."""
self.file_path = str(file_path) if isinstance(file_path, Path) else file_path
self.file = file
self.partition_via_api = partition_via_api
self.post_processors = post_processors
# SDK parameters
self.client = client
self.unstructured_kwargs = kwargs
def lazy_load(self) -> Iterator[Document]:
"""Load file."""
elements_json = (
self._post_process_elements_json(self._elements_json)
if self.post_processors
else self._elements_json
)
for element in elements_json:
metadata = self._get_metadata()
metadata.update(element.get("metadata")) # type: ignore
metadata.update(
{"category": element.get("category") or element.get("type")}
)
metadata.update({"element_id": element.get("element_id")})
yield Document(
page_content=cast(str, element.get("text")), metadata=metadata
)
@property
def _elements_json(self) -> list[dict[str, Any]]:
"""Get elements as a list of dictionaries from local partition or via API."""
if self.partition_via_api:
return self._elements_via_api
return self._convert_elements_to_dicts(self._elements_via_local)
@property
def _elements_via_local(self) -> list[Element]:
try:
from unstructured.partition.auto import partition # type: ignore
except ImportError:
raise ImportError(
"unstructured package not found, please install it with "
"`pip install unstructured`"
)
if self.file and self.unstructured_kwargs.get("metadata_filename") is None:
raise ValueError(
"If partitioning a fileIO object, metadata_filename must be specified"
" as well.",
)
return partition(
file=self.file, filename=self.file_path, **self.unstructured_kwargs
) # type: ignore
@property
def _elements_via_api(self) -> list[dict[str, Any]]:
"""Retrieve a list of element dicts from the API using the SDK client."""
client = self.client
req = self._sdk_partition_request
response = client.general.partition(req) # type: ignore
if response.status_code == 200:
return json.loads(response.raw_response.text)
raise ValueError(
f"Receive unexpected status code {response.status_code} from the API.",
)
@property
def _file_content(self) -> bytes:
"""Get content from either file or file_path."""
if self.file is not None:
return self.file.read()
elif self.file_path:
with open(self.file_path, "rb") as f:
return f.read()
raise ValueError("file or file_path must be defined.")
@property
def _sdk_partition_request(self) -> operations.PartitionRequest:
return operations.PartitionRequest(
partition_parameters=shared.PartitionParameters(
files=shared.Files(
content=self._file_content, file_name=str(self.file_path)
),
**self.unstructured_kwargs,
),
)
def _convert_elements_to_dicts(
self, elements: list[Element]
) -> list[dict[str, Any]]:
return [element.to_dict() for element in elements]
def _get_metadata(self) -> dict[str, Any]:
"""Get file_path metadata if available."""
return {"source": self.file_path} if self.file_path else {}
def _post_process_elements_json(
self, elements_json: list[dict[str, Any]]
) -> list[dict[str, Any]]:
"""Apply post processing functions to extracted unstructured elements.
Post processing functions are str -> str callables passed
in using the post_processors kwarg when the loader is instantiated.
"""
if self.post_processors:
for element in elements_json:
for post_processor in self.post_processors:
element["text"] = post_processor(str(element.get("text")))
return elements_json

libs/partners/unstructured/poetry.lock generated Normal file

File diff suppressed because it is too large


@@ -0,0 +1,97 @@
[tool.poetry]
name = "langchain-unstructured"
version = "0.1.0"
description = "An integration package connecting Unstructured and LangChain"
authors = []
readme = "README.md"
repository = "https://github.com/langchain-ai/langchain"
license = "MIT"
[tool.poetry.urls]
"Source Code" = "https://github.com/langchain-ai/langchain/tree/master/libs/partners/unstructured"
"Release Notes" = "https://github.com/langchain-ai/langchain/releases?q=tag%3A%22langchain-unstructured%3D%3D0%22&expanded=true"
[tool.poetry.dependencies]
python = ">=3.9,<4.0"
langchain-core = "^0.2.23"
unstructured-client = { version = "^0.24.1" }
unstructured = { version = "^0.15.0", optional = true, python = "<3.13", extras = [
"all-docs",
] }
[tool.poetry.extras]
local = ["unstructured"]
[tool.poetry.group.test]
optional = true
[tool.poetry.group.test.dependencies]
pytest = "^7.4.3"
pytest-asyncio = "^0.23.2"
pytest-socket = "^0.7.0"
langchain-core = { path = "../../core", develop = true }
[tool.poetry.group.codespell]
optional = true
[tool.poetry.group.codespell.dependencies]
codespell = "^2.2.6"
[tool.poetry.group.test_integration]
optional = true
[tool.poetry.group.test_integration.dependencies]
[tool.poetry.group.lint]
optional = true
[tool.poetry.group.lint.dependencies]
ruff = "^0.1.8"
[tool.poetry.group.typing.dependencies]
mypy = "^1.7.1"
unstructured = { version = "^0.15.0", python = "<3.13", extras = ["all-docs"] }
langchain-core = { path = "../../core", develop = true }
[tool.poetry.group.dev]
optional = true
[tool.poetry.group.dev.dependencies]
langchain-core = { path = "../../core", develop = true }
[tool.ruff.lint]
select = [
"E", # pycodestyle
"F", # pyflakes
"I", # isort
"T201", # print
]
[tool.mypy]
disallow_untyped_defs = "True"
[tool.coverage.run]
omit = ["tests/*"]
[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
[tool.pytest.ini_options]
# --strict-markers will raise errors on unknown marks.
# https://docs.pytest.org/en/7.1.x/how-to/mark.html#raising-errors-on-unknown-marks
#
# https://docs.pytest.org/en/7.1.x/reference/reference.html
# --strict-config any warnings encountered while parsing the `pytest`
# section of the configuration file raise errors.
#
# https://github.com/tophat/syrupy
# --snapshot-warn-unused Prints a warning on unused snapshots rather than fail the test suite.
addopts = "--strict-markers --strict-config --durations=5"
# Registering custom markers.
# https://docs.pytest.org/en/7.1.x/example/markers.html#registering-markers
markers = [
"compile: mark placeholder test used to compile integration tests without running them",
"local: mark tests as requiring a local install, which isn't compatible with CI currently",
]
asyncio_mode = "auto"


@@ -0,0 +1,17 @@
import sys
import traceback
from importlib.machinery import SourceFileLoader
if __name__ == "__main__":
files = sys.argv[1:]
has_failure = False
for file in files:
try:
SourceFileLoader("x", file).load_module()
except Exception:
has_failure = True
print(file) # noqa: T201
traceback.print_exc()
print() # noqa: T201
sys.exit(1 if has_failure else 0)


@@ -0,0 +1,27 @@
#!/bin/bash
#
# This script searches for lines starting with "import pydantic" or "from pydantic"
# in tracked files within a Git repository.
#
# Usage: ./scripts/check_pydantic.sh /path/to/repository
# Check if a path argument is provided
if [ $# -ne 1 ]; then
echo "Usage: $0 /path/to/repository"
exit 1
fi
repository_path="$1"
# Search for lines matching the pattern within the specified repository
result=$(git -C "$repository_path" grep -E '^import pydantic|^from pydantic')
# Check if any matching lines were found
if [ -n "$result" ]; then
echo "ERROR: The following lines need to be updated:"
echo "$result"
echo "Please replace the code with an import from langchain_core.pydantic_v1."
echo "For example, replace 'from pydantic import BaseModel'"
echo "with 'from langchain_core.pydantic_v1 import BaseModel'"
exit 1
fi


@@ -0,0 +1,18 @@
#!/bin/bash
set -eu
# Initialize a variable to keep track of errors
errors=0
# make sure not importing from langchain, langchain_experimental, or langchain_community
git --no-pager grep '^from langchain\.' . && errors=$((errors+1))
git --no-pager grep '^from langchain_experimental\.' . && errors=$((errors+1))
git --no-pager grep '^from langchain_community\.' . && errors=$((errors+1))
# Decide on an exit status based on the errors
if [ "$errors" -gt 0 ]; then
exit 1
else
exit 0
fi


@@ -0,0 +1,7 @@
import pytest
@pytest.mark.compile
def test_placeholder() -> None:
"""Used for compiling integration tests without running any real tests."""
pass


@@ -0,0 +1,135 @@
import os
from pathlib import Path
from typing import Callable
import pytest
from langchain_unstructured import UnstructuredLoader
EXAMPLE_DOCS_DIRECTORY = str(
Path(__file__).parent.parent.parent.parent.parent
/ "community/tests/integration_tests/examples/"
)
UNSTRUCTURED_API_KEY = os.getenv("UNSTRUCTURED_API_KEY")
# -- Local partition --
@pytest.mark.local
def test_loader_partitions_locally() -> None:
file_path = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper.pdf")
docs = UnstructuredLoader(
file_path=file_path,
# Unstructured kwargs
strategy="fast",
include_page_breaks=True,
).load()
assert all(
doc.metadata.get("filename") == "layout-parser-paper.pdf" for doc in docs
)
assert any(doc.metadata.get("category") == "PageBreak" for doc in docs)
@pytest.mark.local
def test_loader_partition_ignores_invalid_arg() -> None:
file_path = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper.pdf")
docs = UnstructuredLoader(
file_path=file_path,
# Unstructured kwargs
strategy="fast",
# mode is no longer a valid argument and is ignored when partitioning locally
mode="single",
).load()
assert len(docs) > 1
assert all(
doc.metadata.get("filename") == "layout-parser-paper.pdf" for doc in docs
)
@pytest.mark.local
def test_loader_partitions_locally_and_applies_post_processors(
get_post_processor: Callable[[str], str],
) -> None:
file_path = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper.pdf")
loader = UnstructuredLoader(
file_path=file_path,
post_processors=[get_post_processor],
strategy="fast",
)
docs = loader.load()
assert len(docs) > 1
assert docs[0].page_content.endswith("THE END!")
# -- API partition --
def test_loader_partitions_via_api() -> None:
file_path = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper.pdf")
loader = UnstructuredLoader(
file_path=file_path,
partition_via_api=True,
# Unstructured kwargs
strategy="fast",
include_page_breaks=True,
)
docs = loader.load()
assert len(docs) > 1
assert any(doc.metadata.get("category") == "PageBreak" for doc in docs)
assert all(
doc.metadata.get("filename") == "layout-parser-paper.pdf" for doc in docs
)
assert docs[0].metadata.get("element_id") is not None
def test_loader_partitions_multiple_via_api() -> None:
file_paths = [
os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper.pdf"),
os.path.join(EXAMPLE_DOCS_DIRECTORY, "fake-email-attachment.eml"),
]
loader = UnstructuredLoader(
file_path=file_paths,
api_key=UNSTRUCTURED_API_KEY,
partition_via_api=True,
# Unstructured kwargs
strategy="fast",
)
docs = loader.load()
assert len(docs) > 1
assert docs[0].metadata.get("filename") == "layout-parser-paper.pdf"
assert docs[-1].metadata.get("filename") == "fake-email-attachment.eml"
def test_loader_partition_via_api_raises_TypeError_with_invalid_arg() -> None:
file_path = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper.pdf")
loader = UnstructuredLoader(
file_path=file_path,
api_key=UNSTRUCTURED_API_KEY,
partition_via_api=True,
mode="elements",
)
with pytest.raises(TypeError, match="unexpected keyword argument 'mode'"):
loader.load()
# -- fixtures ---
@pytest.fixture()
def get_post_processor() -> Callable[[str], str]:
def append_the_end(text: str) -> str:
return text + "THE END!"
return append_the_end


@@ -0,0 +1,178 @@
from pathlib import Path
from typing import Any, Callable
from unittest import mock
from unittest.mock import Mock, mock_open, patch
import pytest
from unstructured.documents.elements import Text # type: ignore
from langchain_unstructured.document_loaders import (
_SingleDocumentLoader, # type: ignore
)
EXAMPLE_DOCS_DIRECTORY = str(
Path(__file__).parent.parent.parent.parent.parent
/ "community/tests/integration_tests/examples/"
)
# --- _SingleDocumentLoader._file_content ---
def test_it_gets_content_from_file() -> None:
mock_file = Mock()
mock_file.read.return_value = b"content from file"
loader = _SingleDocumentLoader(
client=Mock(), file=mock_file, metadata_filename="fake.txt"
)
content = loader._file_content # type: ignore
assert content == b"content from file"
mock_file.read.assert_called_once()
@patch("builtins.open", new_callable=mock_open, read_data=b"content from file_path")
def test_it_gets_content_from_file_path(mock_file: Mock) -> None:
loader = _SingleDocumentLoader(client=Mock(), file_path="dummy_path")
content = loader._file_content # type: ignore
assert content == b"content from file_path"
mock_file.assert_called_once_with("dummy_path", "rb")
handle = mock_file()
handle.read.assert_called_once()
def test_it_raises_value_error_without_file_or_file_path() -> None:
loader = _SingleDocumentLoader(
client=Mock(),
)
with pytest.raises(ValueError) as e:
loader._file_content # type: ignore
assert str(e.value) == "file or file_path must be defined."
# --- _SingleDocumentLoader._elements_json ---
def test_it_calls_elements_via_api_with_valid_args() -> None:
with patch.object(
_SingleDocumentLoader, "_elements_via_api", new_callable=mock.PropertyMock
) as mock_elements_via_api:
mock_elements_via_api.return_value = [{"element": "data"}]
loader = _SingleDocumentLoader(
client=Mock(),
# Minimum required args for self._elements_via_api to be called:
partition_via_api=True,
api_key="some_key",
)
result = loader._elements_json # type: ignore
mock_elements_via_api.assert_called_once()
assert result == [{"element": "data"}]
@patch.object(_SingleDocumentLoader, "_convert_elements_to_dicts")
def test_it_partitions_locally_by_default(mock_convert_elements_to_dicts: Mock) -> None:
mock_convert_elements_to_dicts.return_value = [{}]
with patch.object(
_SingleDocumentLoader, "_elements_via_local", new_callable=mock.PropertyMock
) as mock_elements_via_local:
mock_elements_via_local.return_value = [{}]
# partition_via_api defaults to False, so elements are partitioned locally:
loader = _SingleDocumentLoader(
client=Mock(),
)
result = loader._elements_json # type: ignore
mock_elements_via_local.assert_called_once_with()
mock_convert_elements_to_dicts.assert_called_once_with([{}])
assert result == [{}]
def test_it_partitions_locally_and_logs_warning_with_partition_via_api_False(
caplog: pytest.LogCaptureFixture,
) -> None:
with patch.object(
_SingleDocumentLoader, "_elements_via_local"
) as mock_get_elements_locally:
mock_get_elements_locally.return_value = [Text("Mock text element.")]
loader = _SingleDocumentLoader(
client=Mock(), partition_via_api=False, api_key="some_key"
)
_ = loader._elements_json # type: ignore
# -- fixtures -------------------------------
@pytest.fixture()
def get_post_processor() -> Callable[[str], str]:
def append_the_end(text: str) -> str:
return text + "THE END!"
return append_the_end
@pytest.fixture()
def fake_json_response() -> list[dict[str, Any]]:
return [
{
"type": "Title",
"element_id": "b7f58c2fd9c15949a55a62eb84e39575",
"text": "LayoutParser: A Unified Toolkit for Deep Learning Based Document"
"Image Analysis",
"metadata": {
"languages": ["eng"],
"page_number": 1,
"filename": "layout-parser-paper.pdf",
"filetype": "application/pdf",
},
},
{
"type": "UncategorizedText",
"element_id": "e1c4facddf1f2eb1d0db5be34ad0de18",
"text": "1 2 0 2",
"metadata": {
"languages": ["eng"],
"page_number": 1,
"parent_id": "b7f58c2fd9c15949a55a62eb84e39575",
"filename": "layout-parser-paper.pdf",
"filetype": "application/pdf",
},
},
]
@pytest.fixture()
def fake_multiple_docs_json_response() -> list[dict[str, Any]]:
return [
{
"type": "Title",
"element_id": "b7f58c2fd9c15949a55a62eb84e39575",
"text": "LayoutParser: A Unified Toolkit for Deep Learning Based Document"
" Image Analysis",
"metadata": {
"languages": ["eng"],
"page_number": 1,
"filename": "layout-parser-paper.pdf",
"filetype": "application/pdf",
},
},
{
"type": "NarrativeText",
"element_id": "3c4ac9e7f55f1e3dbd87d3a9364642fe",
"text": "6/29/23, 12:16\u202fam - User 4: This message was deleted",
"metadata": {
"filename": "whatsapp_chat.txt",
"languages": ["eng"],
"filetype": "text/plain",
},
},
]


@@ -0,0 +1,10 @@
from langchain_unstructured import __all__
EXPECTED_ALL = [
"UnstructuredLoader",
"__version__",
]
def test_all_imports() -> None:
assert sorted(EXPECTED_ALL) == sorted(__all__)