mirror of
https://github.com/hwchase17/langchain.git
synced 2025-09-06 13:33:37 +00:00
unstructured, community, initialize langchain-unstructured package (#22779)
#### Update (2):
A single `UnstructuredLoader` is added to handle both local and API partitioning. This loader also handles single or multiple documents.

#### Changes in `community`:
Changes here do not affect users. In the initial process of using the SDK for the API Loaders, the Loaders in community were refactored. Other changes include:

- The `UnstructuredBaseLoader` has a new check to see if both `mode="paged"` and `chunking_strategy="by_page"` are set. It also now has `Element.element_id` added to the `Document.metadata`.
- `UnstructuredAPIFileLoader` and `UnstructuredAPIFileIOLoader` now both directly inherit from `UnstructuredBaseLoader`, initialize their `file_path`/`file` attributes respectively, and implement their own `_post_process_elements` methods.

--------

#### Update:
New SDK Loaders in a [partner package](https://python.langchain.com/v0.1/docs/contributing/integrations/#partner-package-in-langchain-repo) are introduced to prevent breaking changes for users (see discussion below).

##### TODO:
- [x] Test docstring examples

--------

- **Description:** UnstructuredAPIFileIOLoader and UnstructuredAPIFileLoader calls to the unstructured api are now made using the unstructured-client sdk.
- **New Dependencies:** unstructured-client
- [x] **Add tests and docs**: If you're adding a new integration, please include
  - [x] a test for the integration, preferably unit tests that do not rely on network access,
  - [x] update the description in `docs/docs/integrations/providers/unstructured.mdx`
- [x] **Lint and test**: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified. See contribution guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in langchain.

TODO:
- [x] Update https://python.langchain.com/v0.1/docs/integrations/document_loaders/unstructured_file/#unstructured-api
  - `langchain/docs/docs/integrations/document_loaders/unstructured_file.ipynb`
  - The description here needs to indicate that users should install `unstructured-client` instead of `unstructured`. Read over closely to look for any other changes that need to be made.
- [x] Update the `lazy_load` method in `UnstructuredBaseLoader` to handle json responses from the API instead of just lists of elements.
  - This method may need to be overwritten by the API loaders instead of changing it in the `UnstructuredBaseLoader`.
- [x] Update the documentation links in the class docstrings (the Unstructured documents have moved)
- [x] Update Document.metadata to include `element_id` (see thread [here](https://unstructuredw-kbe4326.slack.com/archives/C044N0YV08G/p1718187499818419))

---------

Signed-off-by: ChengZi <chen.zhang@zilliz.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Isaac Francisco <78627776+isahers1@users.noreply.github.com>
Co-authored-by: ChengZi <chen.zhang@zilliz.com>
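
One thing the new `element_id` metadata key makes possible is linking child elements back to their parent through `parent_id`. The following is a minimal sketch of that idea, not part of this change: it uses plain dicts as stand-ins for `Document.metadata` (so it runs without `langchain` installed), the IDs are copied from the notebook output in this diff, and the helper name `children_by_parent` is hypothetical.

```python
from collections import defaultdict

# Stand-ins for Document.metadata; only the keys relevant to this sketch are kept.
# The ID values are taken from the layout-parser-paper.pdf output in this diff.
metadatas = [
    {
        "category": "UncategorizedText",
        "element_id": "e9fa370aef7ee5c05744eb7bb7d9981b",
        "parent_id": "89565df026a24279aaea20dc08cedbec",
    },
    {
        "category": "Title",
        "element_id": "bde0b230a1aa488e3ce837d33015181b",
    },
    {
        "category": "UncategorizedText",
        "element_id": "54700f902899f0c8c90488fa8d825bce",
        "parent_id": "bde0b230a1aa488e3ce837d33015181b",
    },
    {
        "category": "UncategorizedText",
        "element_id": "b650f5867bad9bb4e30384282c79bcfe",
        "parent_id": "bde0b230a1aa488e3ce837d33015181b",
    },
    {
        "category": "NarrativeText",
        "element_id": "cfc957c94fe63c8fd7c7f4bcb56e75a7",
        "parent_id": "bde0b230a1aa488e3ce837d33015181b",
    },
]


def children_by_parent(metadatas):
    """Hypothetical helper: map each known parent element_id to its children's element_ids."""
    known_ids = {m["element_id"] for m in metadatas}
    children = defaultdict(list)
    for m in metadatas:
        parent = m.get("parent_id")
        # Skip elements with no parent, or whose parent is not in this batch.
        if parent in known_ids:
            children[parent].append(m["element_id"])
    return dict(children)


grouped = children_by_parent(metadatas)
for parent_id, child_ids in grouped.items():
    print(parent_id, "->", len(child_ids), "child elements")
```

With real loader output, the same function could be fed `[doc.metadata for doc in docs]`, assuming each document carries an `element_id` as described above.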
@@ -5,7 +5,7 @@
|
||||
"id": "20deed05",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Unstructured File\n",
|
||||
"# Unstructured\n",
|
||||
"\n",
|
||||
"This notebook covers how to use the `Unstructured` package to load files of many types. `Unstructured` currently supports loading text files, PowerPoints, HTML, PDFs, images, and more.\n",
|
||||
"\n",
|
||||
@@ -14,79 +14,69 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"execution_count": null,
|
||||
"id": "2886982e",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.1.1\u001b[0m\n",
|
||||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
|
||||
"Note: you may need to restart the kernel to use updated packages.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# # Install package\n",
|
||||
"%pip install --upgrade --quiet \"unstructured[all-docs]\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "54d62efd",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# # Install other dependencies\n",
|
||||
"# # https://github.com/Unstructured-IO/unstructured/blob/main/docs/source/installing.rst\n",
|
||||
"# !brew install libmagic\n",
|
||||
"# !brew install poppler\n",
|
||||
"# !brew install tesseract\n",
|
||||
"# # If parsing xml / html documents:\n",
|
||||
"# !brew install libxml2\n",
|
||||
"# !brew install libxslt"
|
||||
"# Install package, compatible with API partitioning\n",
|
||||
"%pip install --upgrade --quiet \"langchain-unstructured\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"id": "af6a64f5",
|
||||
"cell_type": "markdown",
|
||||
"id": "e75e2a6d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# import nltk\n",
|
||||
"# nltk.download('punkt')"
|
||||
"### Local Partitioning (Optional)\n",
|
||||
"\n",
|
||||
"By default, `langchain-unstructured` installs a smaller footprint that requires\n",
|
||||
"offloading of the partitioning logic to the Unstructured API.\n",
|
||||
"\n",
|
||||
"If you would like to run the partitioning logic locally, you will need to install\n",
|
||||
"a combination of system dependencies, as outlined in the \n",
|
||||
"[Unstructured documentation here](https://docs.unstructured.io/open-source/installation/full-installation).\n",
|
||||
"\n",
|
||||
"For example, on Macs you can install the required dependencies with:\n",
|
||||
"\n",
|
||||
"```bash\n",
|
||||
"# base dependencies\n",
|
||||
"brew install libmagic poppler tesseract\n",
|
||||
"\n",
|
||||
"# If parsing xml / html documents:\n",
|
||||
"brew install libxml2 libxslt\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"You can install the required `pip` dependencies with:\n",
|
||||
"\n",
|
||||
"```bash\n",
|
||||
"pip install \"langchain-unstructured[local]\"\n",
|
||||
"```"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a9c1c775",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Quickstart\n",
|
||||
"\n",
|
||||
"To simply load a file as a document, you can use the LangChain `DocumentLoader.load` \n",
|
||||
"interface:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"execution_count": null,
|
||||
"id": "79d3e549",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.\\n\\nLast year COVID-19 kept us apart. This year we are finally together again.\\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.\\n\\nWith a duty to one another to the American people to the Constit'"
|
||||
]
|
||||
},
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain_community.document_loaders import UnstructuredFileLoader\n",
|
||||
"from langchain_unstructured import UnstructuredLoader\n",
|
||||
"\n",
|
||||
"loader = UnstructuredFileLoader(\"./example_data/state_of_the_union.txt\")\n",
|
||||
"loader = UnstructuredLoader(\"./example_data/state_of_the_union.txt\")\n",
|
||||
"\n",
|
||||
"docs = loader.load()\n",
|
||||
"\n",
|
||||
"docs[0].page_content[:400]"
|
||||
"docs = loader.load()"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -99,113 +89,31 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"execution_count": 5,
|
||||
"id": "092d9a0b",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!\\n\\n1/22/23, 8:24 PM - User 2: Goodmorning! $50 is too low.\\n\\n1/23/23, 2:59 AM - User 1: How much do you want?\\n\\n1/23/23, 3:00 AM - User 2: Online is at least $100\\n\\n1/23/23, 3:01 AM - User 2: Here is $129\\n\\n1/23/23, 3:01 AM - User 2: <Media omitted>\\n\\n1/23/23, 3:01 AM - User 1: Im not int'"
|
||||
]
|
||||
},
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"whatsapp_chat.txt : 1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are in\n",
|
||||
"state_of_the_union.txt : May God bless you all. May God protect our troops.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"files = [\"./example_data/whatsapp_chat.txt\", \"./example_data/layout-parser-paper.pdf\"]\n",
|
||||
"file_paths = [\n",
|
||||
" \"./example_data/whatsapp_chat.txt\",\n",
|
||||
" \"./example_data/state_of_the_union.txt\",\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"loader = UnstructuredFileLoader(files)\n",
|
||||
"loader = UnstructuredLoader(file_paths)\n",
|
||||
"\n",
|
||||
"docs = loader.load()\n",
|
||||
"\n",
|
||||
"docs[0].page_content[:400]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "7874d01d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Retain Elements\n",
|
||||
"\n",
|
||||
"Under the hood, Unstructured creates different \"elements\" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode=\"elements\"`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"id": "ff5b616d",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.', metadata={'source': './example_data/state_of_the_union.txt', 'file_directory': './example_data', 'filename': 'state_of_the_union.txt', 'last_modified': '2024-07-01T11:18:22', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'NarrativeText'}),\n",
|
||||
" Document(page_content='Last year COVID-19 kept us apart. This year we are finally together again.', metadata={'source': './example_data/state_of_the_union.txt', 'file_directory': './example_data', 'filename': 'state_of_the_union.txt', 'last_modified': '2024-07-01T11:18:22', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'NarrativeText'}),\n",
|
||||
" Document(page_content='Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.', metadata={'source': './example_data/state_of_the_union.txt', 'file_directory': './example_data', 'filename': 'state_of_the_union.txt', 'last_modified': '2024-07-01T11:18:22', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'NarrativeText'}),\n",
|
||||
" Document(page_content='With a duty to one another to the American people to the Constitution.', metadata={'source': './example_data/state_of_the_union.txt', 'file_directory': './example_data', 'filename': 'state_of_the_union.txt', 'last_modified': '2024-07-01T11:18:22', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'UncategorizedText'}),\n",
|
||||
" Document(page_content='And with an unwavering resolve that freedom will always triumph over tyranny.', metadata={'source': './example_data/state_of_the_union.txt', 'file_directory': './example_data', 'filename': 'state_of_the_union.txt', 'last_modified': '2024-07-01T11:18:22', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'NarrativeText'})]"
|
||||
]
|
||||
},
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"loader = UnstructuredFileLoader(\n",
|
||||
" \"./example_data/state_of_the_union.txt\", mode=\"elements\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"docs = loader.load()\n",
|
||||
"\n",
|
||||
"docs[:5]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "672733fd",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Define a Partitioning Strategy\n",
|
||||
"\n",
|
||||
"Unstructured document loader allow users to pass in a `strategy` parameter that lets `unstructured` know how to partition the document. Currently supported strategies are `\"hi_res\"` (the default) and `\"fast\"`. Hi res partitioning strategies are more accurate, but take longer to process. Fast strategies partition the document more quickly, but trade-off accuracy. Not all document types have separate hi res and fast partitioning strategies. For those document types, the `strategy` kwarg is ignored. In some cases, the high res strategy will fallback to fast if there is a dependency missing (i.e. a model for document partitioning). You can see how to apply a strategy to an `UnstructuredFileLoader` below."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "767238a4",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[Document(page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
|
||||
" Document(page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title'}),\n",
|
||||
" Document(page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
|
||||
" Document(page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
|
||||
" Document(page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. 
The library is publicly available at https://layout-parser.github.io.', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText'})]"
|
||||
]
|
||||
},
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from langchain_community.document_loaders import UnstructuredFileLoader\n",
|
||||
"\n",
|
||||
"loader = UnstructuredFileLoader(\n",
|
||||
" \"./example_data/layout-parser-paper.pdf\", strategy=\"fast\", mode=\"elements\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"docs = loader.load()\n",
|
||||
"\n",
|
||||
"docs[5:10]"
|
||||
"print(docs[0].metadata.get(\"filename\"), \": \", docs[0].page_content[:100])\n",
|
||||
"print(docs[-1].metadata.get(\"filename\"), \": \", docs[-1].page_content[:100])"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -215,37 +123,52 @@
|
||||
"source": [
|
||||
"## PDF Example\n",
|
||||
"\n",
|
||||
"Processing PDF documents works exactly the same way. Unstructured detects the file type and extracts the same types of elements. Modes of operation are \n",
|
||||
"- `single` all the text from all elements are combined into one (default)\n",
|
||||
"- `elements` maintain individual elements\n",
|
||||
"- `paged` texts from each page are only combined"
|
||||
"Processing PDF documents works exactly the same way. Unstructured detects the file type and extracts the same types of elements."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "672733fd",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Define a Partitioning Strategy\n",
|
||||
"\n",
|
||||
"The Unstructured document loader allows users to pass in a `strategy` parameter that lets Unstructured\n",
|
||||
"know how to partition PDFs and other OCR'd documents. Currently supported strategies are `\"auto\"`,\n",
|
||||
"`\"hi_res\"`, `\"ocr_only\"`, and `\"fast\"`. Learn more about the different strategies\n",
|
||||
"[here](https://docs.unstructured.io/open-source/core-functionality/partitioning#partition-pdf). \n",
|
||||
"\n",
|
||||
"Not all document types have separate hi res and fast partitioning strategies. For those document types, the `strategy` kwarg is\n",
|
||||
"ignored. In some cases, the hi res strategy will fall back to fast if there is a dependency missing\n",
|
||||
"(i.e. a model for document partitioning). You can see how to apply a strategy to an\n",
|
||||
"`UnstructuredLoader` below."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"id": "686e5eb4",
|
||||
"execution_count": 6,
|
||||
"id": "60685353",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[Document(page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
|
||||
" Document(page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title'}),\n",
|
||||
" Document(page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
|
||||
" Document(page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
|
||||
" Document(page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. 
The library is publicly available at https://layout-parser.github.io.', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText'})]"
|
||||
"[Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'e9fa370aef7ee5c05744eb7bb7d9981b'}, page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a'),\n",
|
||||
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title', 'element_id': 'bde0b230a1aa488e3ce837d33015181b'}, page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis'),\n",
|
||||
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': '54700f902899f0c8c90488fa8d825bce'}, page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5'),\n",
|
||||
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'b650f5867bad9bb4e30384282c79bcfe'}, page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca'),\n",
|
||||
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText', 'element_id': 'cfc957c94fe63c8fd7c7f4bcb56e75a7'}, page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. 
To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io.')]"
|
||||
]
|
||||
},
|
||||
"execution_count": 12,
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"loader = UnstructuredFileLoader(\n",
|
||||
" \"./example_data/layout-parser-paper.pdf\", mode=\"elements\"\n",
|
||||
")\n",
|
||||
"from langchain_unstructured import UnstructuredLoader\n",
|
||||
"\n",
|
||||
"loader = UnstructuredLoader(\"./example_data/layout-parser-paper.pdf\", strategy=\"fast\")\n",
|
||||
"\n",
|
||||
"docs = loader.load()\n",
|
||||
"\n",
|
||||
@@ -257,37 +180,39 @@
|
||||
"id": "1cf27fc8",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"If you need to post process the `unstructured` elements after extraction, you can pass in a list of `str` -> `str` functions to the `post_processors` kwarg when you instantiate the `UnstructuredFileLoader`. This applies to other Unstructured loaders as well. Below is an example."
|
||||
"## Post Processing\n",
|
||||
"\n",
|
||||
"If you need to post process the `unstructured` elements after extraction, you can pass in a list of\n",
|
||||
"`str` -> `str` functions to the `post_processors` kwarg when you instantiate the `UnstructuredLoader`. This applies to other Unstructured loaders as well. Below is an example."
|
||||
]
|
||||
},
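The `post_processors` kwarg described above takes plain `str` -> `str` callables, so a custom cleaner can be written without any `unstructured` import. Below is a minimal sketch, not part of this PR: `dehyphenate` and `collapse_whitespace` are hypothetical helpers, aimed at the line-break hyphenation visible in the sample output (`im- portant`, `applica- tions`). The order in which they are applied here is an assumption for illustration.

```python
import re


def dehyphenate(text: str) -> str:
    """Join words split by a hyphen followed by a space, e.g. 'im- portant'."""
    # Hypothetical helper: only "letter, hyphen, space, letter" is rewritten,
    # so ordinary hyphenated words such as 'on-going' are left untouched.
    return re.sub(r"(\w)- (\w)", r"\1\2", text)


def collapse_whitespace(text: str) -> str:
    """Replace runs of whitespace with a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


sample = "complicate the easy reuse of im- portant   innovations"

cleaned = sample
for post_processor in (dehyphenate, collapse_whitespace):
    cleaned = post_processor(cleaned)

print(cleaned)  # complicate the easy reuse of important innovations
```

Functions like these could then be passed as `post_processors=[dehyphenate, collapse_whitespace]`. Note that the regex would also join intentional constructs such as `pre- and post-`, so it is a heuristic rather than a safe default.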
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"execution_count": 7,
|
||||
"id": "112e5538",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[Document(page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
|
||||
" Document(page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title'}),\n",
|
||||
" Document(page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
|
||||
" Document(page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
|
||||
" Document(page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. 
The library is publicly available at https://layout-parser.github.io.', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText'})]"
|
||||
"[Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'e9fa370aef7ee5c05744eb7bb7d9981b'}, page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a'),\n",
|
||||
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title', 'element_id': 'bde0b230a1aa488e3ce837d33015181b'}, page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis'),\n",
|
||||
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': '54700f902899f0c8c90488fa8d825bce'}, page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5'),\n",
|
||||
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'b650f5867bad9bb4e30384282c79bcfe'}, page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca'),\n",
|
||||
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText', 'element_id': 'cfc957c94fe63c8fd7c7f4bcb56e75a7'}, page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. 
To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io.')]"
|
||||
]
|
||||
},
|
||||
"execution_count": 14,
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from langchain_community.document_loaders import UnstructuredFileLoader\n",
|
||||
"from langchain_unstructured import UnstructuredLoader\n",
|
||||
"from unstructured.cleaners.core import clean_extra_whitespace\n",
|
||||
"\n",
|
||||
"loader = UnstructuredFileLoader(\n",
|
||||
"loader = UnstructuredLoader(\n",
|
||||
" \"./example_data/layout-parser-paper.pdf\",\n",
|
||||
" mode=\"elements\",\n",
|
||||
" post_processors=[clean_extra_whitespace],\n",
|
||||
")\n",
|
||||
"\n",
|
||||
@@ -303,34 +228,70 @@
|
||||
"source": [
|
||||
"## Unstructured API\n",
|
||||
"\n",
|
||||
"If you want to get up and running with less set up, you can simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or `UnstructuredAPIFileIOLoader`. That will process your document using the hosted Unstructured API. You can generate a free Unstructured API key [here](https://www.unstructured.io/api-key/). The [Unstructured documentation](https://unstructured-io.github.io/unstructured/) page will have instructions on how to generate an API key once they’re available. Check out the instructions [here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image) if you’d like to self-host the Unstructured API or run it locally."
|
||||
"If you want to get up and running with smaller packages and get the most up-to-date partitioning you can `pip install\n",
|
||||
"unstructured-client` and `pip install langchain-unstructured`. For\n",
|
||||
"more information about the `UnstructuredLoader`, refer to the\n",
|
||||
"[Unstructured provider page](https://python.langchain.com/v0.1/docs/integrations/document_loaders/unstructured_file/).\n",
|
||||
"\n",
|
||||
"The loader will process your document using the hosted Unstructured serverless API when you pass in\n",
|
||||
"your `api_key` and set `partition_via_api=True`. You can generate a free\n",
|
||||
"Unstructured API key [here](https://unstructured.io/api-key/).\n",
|
||||
"\n",
|
||||
"Check out the instructions [here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image)\n",
|
||||
"if you’d like to self-host the Unstructured API or run it locally."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"execution_count": null,
|
||||
"id": "6e5fde16",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Install package\n",
|
||||
"%pip install \"langchain-unstructured\"\n",
|
||||
"%pip install \"unstructured-client\"\n",
|
||||
"\n",
|
||||
"# Set API key\n",
|
||||
"import os\n",
|
||||
"\n",
|
||||
"os.environ[\"UNSTRUCTURED_API_KEY\"] = \"FAKE_API_KEY\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"id": "386eb63c",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"INFO: Preparing to split document for partition.\n",
|
||||
"INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.\n",
|
||||
"INFO: Partitioning without split.\n",
|
||||
"INFO: Successfully partitioned the document.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"Document(page_content='Lorem ipsum dolor sit amet.', metadata={'source': 'example_data/fake.docx'})"
|
||||
"Document(metadata={'source': 'example_data/fake.docx', 'category_depth': 0, 'filename': 'fake.docx', 'languages': ['por', 'cat'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title', 'element_id': '56d531394823d81787d77a04462ed096'}, page_content='Lorem ipsum dolor sit amet.')"
|
||||
]
|
||||
},
|
||||
"execution_count": 4,
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from langchain_community.document_loaders import UnstructuredAPIFileLoader\n",
|
||||
"from langchain_unstructured import UnstructuredLoader\n",
|
||||
"\n",
|
||||
"filenames = [\"example_data/fake.docx\", \"example_data/fake-email.eml\"]\n",
|
||||
"\n",
|
||||
"loader = UnstructuredAPIFileLoader(\n",
|
||||
" file_path=filenames[0],\n",
|
||||
" api_key=\"FAKE_API_KEY\",\n",
|
||||
"loader = UnstructuredLoader(\n",
|
||||
" file_path=\"example_data/fake.docx\",\n",
|
||||
" api_key=os.getenv(\"UNSTRUCTURED_API_KEY\"),\n",
|
||||
" partition_via_api=True,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"docs = loader.load()\n",
|
||||
@@ -342,43 +303,197 @@
|
||||
"id": "94158999",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You can also batch multiple files through the Unstructured API in a single API using `UnstructuredAPIFileLoader`."
|
||||
"You can also batch multiple files through the Unstructured API in a single API using `UnstructuredLoader`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"execution_count": 10,
|
||||
"id": "a3d7c846",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"Document(page_content='Lorem ipsum dolor sit amet.\\n\\nThis is a test email to use for unit tests.\\n\\nImportant points:\\n\\nRoses are red\\n\\nViolets are blue', metadata={'source': ['example_data/fake.docx', 'example_data/fake-email.eml']})"
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"INFO: Preparing to split document for partition.\n",
|
||||
"INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.\n",
|
||||
"INFO: Partitioning without split.\n",
|
||||
"INFO: Successfully partitioned the document.\n",
|
||||
"INFO: Preparing to split document for partition.\n",
|
||||
"INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.\n",
|
||||
"INFO: Partitioning without split.\n",
|
||||
"INFO: Successfully partitioned the document.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"fake.docx : Lorem ipsum dolor sit amet.\n",
|
||||
"fake-email.eml : Violets are blue\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"loader = UnstructuredAPIFileLoader(\n",
|
||||
" file_path=filenames,\n",
|
||||
" api_key=\"FAKE_API_KEY\",\n",
|
||||
"loader = UnstructuredLoader(\n",
|
||||
" file_path=[\"example_data/fake.docx\", \"example_data/fake-email.eml\"],\n",
|
||||
" api_key=os.getenv(\"UNSTRUCTURED_API_KEY\"),\n",
|
||||
" partition_via_api=True,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"docs = loader.load()\n",
|
||||
"docs[0]"
|
||||
"\n",
|
||||
"print(docs[0].metadata[\"filename\"], \": \", docs[0].page_content[:100])\n",
|
||||
"print(docs[-1].metadata[\"filename\"], \": \", docs[-1].page_content[:100])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a324a0db",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Unstructured SDK Client\n",
|
||||
"\n",
|
||||
"Partitioning with the Unstructured API relies on the [Unstructured SDK\n",
|
||||
"Client](https://docs.unstructured.io/api-reference/api-services/sdk).\n",
|
||||
"\n",
|
||||
"Below is an example showing how you can customize some features of the client and use your own\n",
|
||||
"`requests.Session()`, pass in an alternative `server_url`, or customize the `RetryConfig` object for more control over how failed requests are handled."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "0e510495",
|
||||
"execution_count": 11,
|
||||
"id": "58e55264",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"INFO: Preparing to split document for partition.\n",
|
||||
"INFO: Concurrency level set to 5\n",
|
||||
"INFO: Splitting pages 1 to 16 (16 total)\n",
|
||||
"INFO: Determined optimal split size of 4 pages.\n",
|
||||
"INFO: Partitioning 4 files with 4 page(s) each.\n",
|
||||
"INFO: Partitioning set #1 (pages 1-4).\n",
|
||||
"INFO: Partitioning set #2 (pages 5-8).\n",
|
||||
"INFO: Partitioning set #3 (pages 9-12).\n",
|
||||
"INFO: Partitioning set #4 (pages 13-16).\n",
|
||||
"INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general \"HTTP/1.1 200 OK\"\n",
|
||||
"INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general \"HTTP/1.1 200 OK\"\n",
|
||||
"INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general \"HTTP/1.1 200 OK\"\n",
|
||||
"INFO: Successfully partitioned set #1, elements added to the final result.\n",
|
||||
"INFO: Successfully partitioned set #2, elements added to the final result.\n",
|
||||
"INFO: Successfully partitioned set #3, elements added to the final result.\n",
|
||||
"INFO: Successfully partitioned set #4, elements added to the final result.\n",
|
||||
"INFO: Successfully partitioned the document.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"layout-parser-paper.pdf : LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import requests\n",
|
||||
"from langchain_unstructured import UnstructuredLoader\n",
|
||||
"from unstructured_client import UnstructuredClient\n",
|
||||
"from unstructured_client.utils import BackoffStrategy, RetryConfig\n",
|
||||
"\n",
|
||||
"client = UnstructuredClient(\n",
|
||||
" api_key_auth=os.getenv(\n",
|
||||
" \"UNSTRUCTURED_API_KEY\"\n",
|
||||
" ), # Note: the client API param is \"api_key_auth\" instead of \"api_key\"\n",
|
||||
" client=requests.Session(),\n",
|
||||
" server_url=\"https://api.unstructuredapp.io/general/v0/general\",\n",
|
||||
" retry_config=RetryConfig(\n",
|
||||
" strategy=\"backoff\",\n",
|
||||
" retry_connection_errors=True,\n",
|
||||
" backoff=BackoffStrategy(\n",
|
||||
" initial_interval=500,\n",
|
||||
" max_interval=60000,\n",
|
||||
" exponent=1.5,\n",
|
||||
" max_elapsed_time=900000,\n",
|
||||
" ),\n",
|
||||
" ),\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"loader = UnstructuredLoader(\n",
|
||||
" \"./example_data/layout-parser-paper.pdf\",\n",
|
||||
" partition_via_api=True,\n",
|
||||
" client=client,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"docs = loader.load()\n",
|
||||
"\n",
|
||||
"print(docs[0].metadata[\"filename\"], \": \", docs[0].page_content[:100])"
|
||||
]
|
||||
},
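To get a feel for the `BackoffStrategy` values used above, the sketch below computes the wait times of a plain exponential backoff with the same four numbers. It is an illustration only: `backoff_schedule` is a hypothetical helper, the values are assumed to be milliseconds, and the SDK's real retry logic may differ (for example by adding jitter).

```python
def backoff_schedule(initial_interval, max_interval, exponent, max_elapsed_time):
    """Return successive wait times for a plain exponential backoff.

    Each wait is initial_interval * exponent**attempt, capped at max_interval.
    Retrying stops once the next wait would exceed max_elapsed_time in total.
    """
    waits = []
    elapsed = 0.0
    attempt = 0
    while True:
        wait = min(initial_interval * exponent**attempt, max_interval)
        if elapsed + wait > max_elapsed_time:
            break
        waits.append(wait)
        elapsed += wait
        attempt += 1
    return waits


# Same values as the RetryConfig in the example above (assumed to be ms).
waits = backoff_schedule(500, 60000, 1.5, 900000)

print(waits[:4])  # [500.0, 750.0, 1125.0, 1687.5]
print("number of retries before giving up:", len(waits))
```

Under these assumptions the wait grows by 50% per attempt until it reaches the 60-second cap, and retries stop once the total wait would pass 15 minutes.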
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "c66fbeb3",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Chunking\n",
|
||||
"\n",
|
||||
"The `UnstructuredLoader` does not support `mode` as parameter for grouping text like the older\n",
|
||||
"loader `UnstructuredFileLoader` and others did. It instead supports \"chunking\". Chunking in\n",
|
||||
"unstructured differs from other chunking mechanisms you may be familiar with that form chunks based\n",
|
||||
"on plain-text features--character sequences like \"\\n\\n\" or \"\\n\" that might indicate a paragraph\n",
|
||||
"boundary or list-item boundary. Instead, all documents are split using specific knowledge about each\n",
|
||||
"document format to partition the document into semantic units (document elements) and we only need to\n",
|
||||
"resort to text-splitting when a single element exceeds the desired maximum chunk size. In general,\n",
|
||||
"chunking combines consecutive elements to form chunks as large as possible without exceeding the\n",
|
||||
"maximum chunk size. Chunking produces a sequence of CompositeElement, Table, or TableChunk elements.\n",
|
||||
"Each “chunk” is an instance of one of these three types.\n",
|
||||
"\n",
|
||||
"See this [page](https://docs.unstructured.io/open-source/core-functionality/chunking) for more\n",
|
||||
"details about chunking options, but to reproduce the same behavior as `mode=\"single\"`, you can set\n",
|
||||
"`chunking_strategy=\"basic\"`, `max_characters=<some-really-big-number>`, and `include_orig_elements=False`."
|
||||
]
|
||||
},
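The combining behaviour described above can be pictured with a toy example. The sketch below is not `unstructured`'s implementation: `combine_elements` is a hypothetical function that works on plain strings instead of document elements. It greedily merges consecutive element texts and falls back to text-splitting only when a single element is larger than the limit.

```python
def combine_elements(elements, max_characters, separator="\n\n"):
    """Greedily merge consecutive element texts into chunks.

    A chunk never exceeds max_characters. An element that is too large on
    its own is split by character count, mimicking the text-splitting fallback.
    """
    chunks = []
    current = ""
    for text in elements:
        candidate = text if not current else current + separator + text
        if len(candidate) <= max_characters:
            current = candidate
            continue
        # The element does not fit: close the chunk built so far.
        if current:
            chunks.append(current)
        # Split an oversized element into max_characters-sized pieces.
        while len(text) > max_characters:
            chunks.append(text[:max_characters])
            text = text[max_characters:]
        current = text
    if current:
        chunks.append(current)
    return chunks


elements = ["Title", "First paragraph.", "Second paragraph."]

print(combine_elements(elements, max_characters=30))
# ['Title\n\nFirst paragraph.', 'Second paragraph.']

print(len(combine_elements(elements, max_characters=1_000_000)))  # 1
```

With a very large `max_characters` everything ends up in one chunk, which is the same idea as setting `max_characters=<some-really-big-number>` to reproduce `mode="single"`.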
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"id": "e9f1c20d",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"WARNING: Partitioning locally even though api_key is defined since partition_via_api=False.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Number of LangChain documents: 1\n",
|
||||
"Length of text in the document: 42772\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from langchain_unstructured import UnstructuredLoader\n",
|
||||
"\n",
|
||||
"loader = UnstructuredLoader(\n",
|
||||
" \"./example_data/layout-parser-paper.pdf\",\n",
|
||||
" chunking_strategy=\"basic\",\n",
|
||||
" max_characters=1000000,\n",
|
||||
" include_orig_elements=False,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"docs = loader.load()\n",
|
||||
"\n",
|
||||
"print(\"Number of LangChain documents:\", len(docs))\n",
|
||||
"print(\"Length of text in the document:\", len(docs[0].page_content))"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
@@ -397,7 +512,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.5"
|
||||
"version": "3.10.13"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
@@ -40,6 +40,7 @@ These providers have standalone `langchain-{provider}` packages for improved ver
|
||||
- [Qdrant](/docs/integrations/providers/qdrant)
|
||||
- [Robocorp](/docs/integrations/providers/robocorp)
|
||||
- [Together AI](/docs/integrations/providers/together)
|
||||
- [Unstructured](/docs/integrations/providers/unstructured)
|
||||
- [Upstage](/docs/integrations/providers/upstage)
|
||||
- [Voyage AI](/docs/integrations/providers/voyageai)
|
||||
|
||||
|
@@ -8,11 +8,21 @@ ecosystem within LangChain.
|
||||
|
||||
## Installation and Setup
|
||||
|
||||
If you are using a loader that runs locally, use the following steps to get `unstructured` and
|
||||
its dependencies running locally.
|
||||
If you are using a loader that runs locally, use the following steps to get `unstructured` and its
|
||||
dependencies running.
|
||||
|
||||
- Install the Python SDK with `pip install unstructured`.
|
||||
- You can install document specific dependencies with extras, i.e. `pip install "unstructured[docx]"`.
|
||||
- For the smallest installation footprint and to take advantage of features not available in the
|
||||
open-source `unstructured` package, install the Python SDK with `pip install unstructured-client`
|
||||
along with `pip install langchain-unstructured` to use the `UnstructuredLoader` and partition
|
||||
remotely against the Unstructured API. This loader lives
|
||||
in a LangChain partner repo instead of the `langchain-community` repo and you will need an
|
||||
`api_key`, which you can generate for free [here](https://unstructured.io/api-key/).
|
||||
- Unstructured's documentation for the sdk can be found here:
|
||||
https://docs.unstructured.io/api-reference/api-services/sdk
|
||||
|
||||
- To run everything locally, install the open-source python package with `pip install unstructured`
|
||||
along with `pip install langchain-community` and use the same `UnstructuredLoader` as mentioned above.
|
||||
- You can install document specific dependencies with extras, e.g. `pip install "unstructured[docx]"`.
|
||||
- To install the dependencies for all document types, use `pip install "unstructured[all-docs]"`.
|
||||
- Install the following system dependencies if they are not already available on your system with e.g. `brew install` for Mac.
|
||||
Depending on what document types you're parsing, you may not need all of these.
|
||||
@@ -22,16 +32,11 @@ its dependencies running locally.
|
||||
- `qpdf` (PDFs)
|
||||
- `libreoffice` (MS Office docs)
|
||||
- `pandoc` (EPUBs)
|
||||
- When running locally, Unstructured also recommends using Docker [by following this
|
||||
guide](https://docs.unstructured.io/open-source/installation/docker-installation) to ensure all
|
||||
system dependencies are installed correctly.
|
||||
|
||||
When running locally, Unstructured also recommends using Docker [by following this guide](https://docs.unstructured.io/open-source/installation/docker-installation)
|
||||
to ensure all system dependencies are installed correctly.
|
||||
|
||||
If you want to get up and running with less set up, you can
|
||||
simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or
|
||||
`UnstructuredAPIFileIOLoader`. That will process your document using the hosted Unstructured API.
|
||||
|
||||
|
||||
The `Unstructured API` requires API keys to make requests.
|
||||
The Unstructured API requires API keys to make requests.
|
||||
You can request an API key [here](https://unstructured.io/api-key-hosted) and start using it today!
|
||||
Check out the README [here](https://github.com/Unstructured-IO/unstructured-api) to get started making API calls.
|
||||
We'd love to hear your feedback, let us know how it goes in our [community slack](https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-1x7cgo0pg-PTptXWylzPQF9xZolzCnwQ).
|
||||
@@ -42,30 +47,21 @@ Check out the instructions
|
||||
|
||||
## Data Loaders
|
||||
|
||||
The primary usage of the `Unstructured` is in data loaders.
|
||||
The primary usage of `Unstructured` is in data loaders.
|
||||
|
||||
### UnstructuredAPIFileIOLoader
|
||||
### UnstructuredLoader
|
||||
|
||||
See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api).
|
||||
See a [usage example](/docs/integrations/document_loaders/unstructured_file) of how to use
|
||||
this loader for both partitioning locally and remotely with the serverless Unstructured API.
|
||||
|
||||
```python
|
||||
from langchain_community.document_loaders import UnstructuredAPIFileIOLoader
|
||||
```
|
||||
|
||||
### UnstructuredAPIFileLoader
|
||||
|
||||
See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api).
|
||||
|
||||
```python
|
||||
from langchain_community.document_loaders import UnstructuredAPIFileLoader
|
||||
from langchain_unstructured import UnstructuredLoader
|
||||
```
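Since the same loader covers local and remote partitioning, one way to switch between them is to build the keyword arguments from the environment. This is a sketch under assumptions: `loader_kwargs` and `make_loader` are hypothetical helpers, while `file_path`, `api_key` and `partition_via_api` are the parameters shown in the usage example. The loader import is deferred so the helper can be exercised without `langchain-unstructured` installed.

```python
import os


def loader_kwargs(file_path, api_key=None):
    """Build UnstructuredLoader kwargs: API partitioning only if a key is given."""
    kwargs = {"file_path": file_path}
    if api_key:
        # With a key, partition remotely through the Unstructured API.
        kwargs.update(api_key=api_key, partition_via_api=True)
    return kwargs


def make_loader(file_path):
    """Create a loader, partitioning locally when no API key is configured."""
    # Imported lazily so this module loads even without the package installed.
    from langchain_unstructured import UnstructuredLoader

    return UnstructuredLoader(
        **loader_kwargs(file_path, os.getenv("UNSTRUCTURED_API_KEY"))
    )


print(loader_kwargs("example_data/fake.docx"))
# {'file_path': 'example_data/fake.docx'}
```

Calling `make_loader("example_data/fake.docx").load()` would then partition through the API when `UNSTRUCTURED_API_KEY` is set and locally otherwise. The local path additionally needs the open-source `unstructured` package.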
|
||||
|
||||
### UnstructuredCHMLoader
|
||||
|
||||
`CHM` means `Microsoft Compiled HTML Help`.
|
||||
|
||||
See a usage example in the API documentation.
|
||||
|
||||
```python
|
||||
from langchain_community.document_loaders import UnstructuredCHMLoader
|
||||
```
|
||||
@@ -119,15 +115,6 @@ See a [usage example](/docs/integrations/document_loaders/google_drive#passing-i
|
||||
from langchain_community.document_loaders import UnstructuredFileIOLoader
|
||||
```
|
||||
|
||||
### UnstructuredFileLoader
|
||||
|
||||
See a [usage example](/docs/integrations/document_loaders/unstructured_file).
|
||||
|
||||
|
||||
```python
|
||||
from langchain_community.document_loaders import UnstructuredFileLoader
|
||||
```
|
||||
|
||||
### UnstructuredHTMLLoader
|
||||
|
||||
See a [usage example](/docs/how_to/document_loader_html).
|
||||
|