Files
langchain/docs/docs/integrations/document_loaders/unstructured_file.ipynb

551 lines
24 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
{
"cells": [
{
"cell_type": "markdown",
"id": "20deed05",
"metadata": {},
"source": [
"# Unstructured\n",
"\n",
"This notebook covers how to use `Unstructured` [document loader](https://python.langchain.com/v0.2/docs/concepts/#document-loaders) to load files of many types. `Unstructured` currently supports loading of text files, powerpoints, html, pdfs, images, and more.\n",
"\n",
"Please see [this guide](../../integrations/providers/unstructured.mdx) for more instructions on setting up Unstructured locally, including setting up required system dependencies.\n",
"\n",
"## Overview\n",
"### Integration details\n",
"\n",
"| Class | Package | Local | Serializable | [JS support](https://js.langchain.com/v0.2/docs/integrations/document_loaders/file_loaders/unstructured/)|\n",
"| :--- | :--- | :---: | :---: | :---: |\n",
"| [UnstructuredLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_unstructured.document_loaders.UnstructuredLoader.html) | [langchain_community](https://api.python.langchain.com/en/latest/unstructured_api_reference.html) | ✅ | ❌ | ✅ | \n",
"### Loader features\n",
"| Source | Document Lazy Loading | Native Async Support\n",
"| :---: | :---: | :---: | \n",
"| UnstructuredLoader | ✅ | ❌ | \n",
"\n",
"## Setup\n",
"\n",
"### Credentials\n",
"\n",
"By default, `langchain-unstructured` installs a smaller footprint that requires offloading of the partitioning logic to the Unstructured API, which requires an API key. If you use the local installation, you do not need an API key. To get your API key, head over to [this site](https://unstructured.io) and get an API key, and then set it in the cell below:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "2886982e",
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"os.environ[\"UNSTRUCTURED_API_KEY\"] = getpass.getpass(\n",
" \"Enter your Unstructured API key: \"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "e75e2a6d",
"metadata": {},
"source": [
"### Installation\n",
"\n",
"#### Normal Installation\n",
"\n",
"The following packages are required to run the rest of this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d9de83b3",
"metadata": {},
"outputs": [],
"source": [
"# Install package, compatible with API partitioning\n",
"%pip install --upgrade --quiet langchain-unstructured unstructured-client unstructured \"unstructured[pdf]\" python-magic"
]
},
{
"cell_type": "markdown",
"id": "637eda35",
"metadata": {},
"source": [
"#### Installation for Local\n",
"\n",
"If you would like to run the partitioning logic locally, you will need to install a combination of system dependencies, as outlined in the [Unstructured documentation here](https://docs.unstructured.io/open-source/installation/full-installation).\n",
"\n",
"For example, on Macs you can install the required dependencies with:\n",
"\n",
"```bash\n",
"# base dependencies\n",
"brew install libmagic poppler tesseract\n",
"\n",
"# If parsing xml / html documents:\n",
"brew install libxml2 libxslt\n",
"```\n",
"\n",
"You can install the required `pip` dependencies needed for local with:\n",
"\n",
"```bash\n",
"pip install \"langchain-unstructured[local]\"\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "a9c1c775",
"metadata": {},
"source": [
"## Initialization\n",
"\n",
"The `UnstructuredLoader` allows loading from a variety of different file types. To read all about the `unstructured` package please refer to their [documentation](https://docs.unstructured.io/open-source/introduction/overview)/. In this example, we show loading from both a text file and a PDF file."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "79d3e549",
"metadata": {},
"outputs": [],
"source": [
"from langchain_unstructured import UnstructuredLoader\n",
"\n",
"file_paths = [\n",
" \"./example_data/layout-parser-paper.pdf\",\n",
" \"./example_data/state_of_the_union.txt\",\n",
"]\n",
"\n",
"\n",
"loader = UnstructuredLoader(file_paths)"
]
},
{
"cell_type": "markdown",
"id": "8b68dcab",
"metadata": {},
"source": [
"## Load"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "8da59ef8",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: NumExpr defaulting to 12 threads.\n",
"INFO: pikepdf C++ to Python logger bridge initialized\n"
]
},
{
"data": {
"text/plain": [
"Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-07-25T21:28:58', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}, page_content='1 2 0 2')"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs = loader.load()\n",
"\n",
"docs[0]"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "97f7aa1f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-07-25T21:28:58', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}\n"
]
}
],
"source": [
"print(docs[0].metadata)"
]
},
{
"cell_type": "markdown",
"id": "0d7f991b",
"metadata": {},
"source": [
"## Lazy Load"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "b05604d2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 213.36), (16.34, 253.36), (36.34, 253.36), (36.34, 213.36)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-07-25T21:28:58', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'd3ce55f220dfb75891b4394a18bcb973'}, page_content='1 2 0 2')"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pages = []\n",
"for doc in loader.lazy_load():\n",
" pages.append(doc)\n",
"\n",
"pages[0]"
]
},
{
"cell_type": "markdown",
"id": "1cf27fc8",
"metadata": {},
"source": [
"## Post Processing\n",
"\n",
"If you need to post process the `unstructured` elements after extraction, you can pass in a list of\n",
"`str` -> `str` functions to the `post_processors` kwarg when you instantiate the `UnstructuredLoader`. This applies to other Unstructured loaders as well. Below is an example."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "112e5538",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'e9fa370aef7ee5c05744eb7bb7d9981b'}, page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a'),\n",
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title', 'element_id': 'bde0b230a1aa488e3ce837d33015181b'}, page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis'),\n",
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': '54700f902899f0c8c90488fa8d825bce'}, page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5'),\n",
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'b650f5867bad9bb4e30384282c79bcfe'}, page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca'),\n",
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText', 'element_id': 'cfc957c94fe63c8fd7c7f4bcb56e75a7'}, page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io.')]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_unstructured import UnstructuredLoader\n",
"from unstructured.cleaners.core import clean_extra_whitespace\n",
"\n",
"loader = UnstructuredLoader(\n",
" \"./example_data/layout-parser-paper.pdf\",\n",
" post_processors=[clean_extra_whitespace],\n",
")\n",
"\n",
"docs = loader.load()\n",
"\n",
"docs[5:10]"
]
},
{
"cell_type": "markdown",
"id": "b066cb5a",
"metadata": {},
"source": [
"## Unstructured API\n",
"\n",
"If you want to get up and running with smaller packages and get the most up-to-date partitioning you can `pip install\n",
"unstructured-client` and `pip install langchain-unstructured`. For\n",
"more information about the `UnstructuredLoader`, refer to the\n",
"[Unstructured provider page](https://python.langchain.com/v0.1/docs/integrations/document_loaders/unstructured_file/).\n",
"\n",
"The loader will process your document using the hosted Unstructured serverless API when you pass in\n",
"your `api_key` and set `partition_via_api=True`. You can generate a free\n",
"Unstructured API key [here](https://unstructured.io/api-key/).\n",
"\n",
"Check out the instructions [here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image)\n",
"if youd like to self-host the Unstructured API or run it locally."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "386eb63c",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: Preparing to split document for partition.\n",
"INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.\n",
"INFO: Partitioning without split.\n",
"INFO: Successfully partitioned the document.\n"
]
},
{
"data": {
"text/plain": [
"Document(metadata={'source': 'example_data/fake.docx', 'category_depth': 0, 'filename': 'fake.docx', 'languages': ['por', 'cat'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title', 'element_id': '56d531394823d81787d77a04462ed096'}, page_content='Lorem ipsum dolor sit amet.')"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_unstructured import UnstructuredLoader\n",
"\n",
"loader = UnstructuredLoader(\n",
" file_path=\"example_data/fake.docx\",\n",
" api_key=os.getenv(\"UNSTRUCTURED_API_KEY\"),\n",
" partition_via_api=True,\n",
")\n",
"\n",
"docs = loader.load()\n",
"docs[0]"
]
},
{
"cell_type": "markdown",
"id": "94158999",
"metadata": {},
"source": [
"You can also batch multiple files through the Unstructured API in a single API using `UnstructuredLoader`."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "a3d7c846",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: Preparing to split document for partition.\n",
"INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.\n",
"INFO: Partitioning without split.\n",
"INFO: Successfully partitioned the document.\n",
"INFO: Preparing to split document for partition.\n",
"INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.\n",
"INFO: Partitioning without split.\n",
"INFO: Successfully partitioned the document.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"fake.docx : Lorem ipsum dolor sit amet.\n",
"fake-email.eml : Violets are blue\n"
]
}
],
"source": [
"loader = UnstructuredLoader(\n",
" file_path=[\"example_data/fake.docx\", \"example_data/fake-email.eml\"],\n",
" api_key=os.getenv(\"UNSTRUCTURED_API_KEY\"),\n",
" partition_via_api=True,\n",
")\n",
"\n",
"docs = loader.load()\n",
"\n",
"print(docs[0].metadata[\"filename\"], \": \", docs[0].page_content[:100])\n",
"print(docs[-1].metadata[\"filename\"], \": \", docs[-1].page_content[:100])"
]
},
{
"cell_type": "markdown",
"id": "a324a0db",
"metadata": {},
"source": [
"### Unstructured SDK Client\n",
"\n",
"Partitioning with the Unstructured API relies on the [Unstructured SDK\n",
"Client](https://docs.unstructured.io/api-reference/api-services/sdk).\n",
"\n",
"Below is an example showing how you can customize some features of the client and use your own `requests.Session()`, pass in an alternative `server_url`, or customize the `RetryConfig` object for more control over how failed requests are handled.\n",
"\n",
"Note that the example below may not use the latest version of the UnstructuredClient and there could be breaking changes in future releases. For the latest examples, refer to the [Unstructured Python SDK](https://docs.unstructured.io/api-reference/api-services/sdk-python) docs."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "58e55264",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: Preparing to split document for partition.\n",
"INFO: Concurrency level set to 5\n",
"INFO: Splitting pages 1 to 16 (16 total)\n",
"INFO: Determined optimal split size of 4 pages.\n",
"INFO: Partitioning 4 files with 4 page(s) each.\n",
"INFO: Partitioning set #1 (pages 1-4).\n",
"INFO: Partitioning set #2 (pages 5-8).\n",
"INFO: Partitioning set #3 (pages 9-12).\n",
"INFO: Partitioning set #4 (pages 13-16).\n",
"INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general \"HTTP/1.1 200 OK\"\n",
"INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general \"HTTP/1.1 200 OK\"\n",
"INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general \"HTTP/1.1 200 OK\"\n",
"INFO: Successfully partitioned set #1, elements added to the final result.\n",
"INFO: Successfully partitioned set #2, elements added to the final result.\n",
"INFO: Successfully partitioned set #3, elements added to the final result.\n",
"INFO: Successfully partitioned set #4, elements added to the final result.\n",
"INFO: Successfully partitioned the document.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"layout-parser-paper.pdf : LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis\n"
]
}
],
"source": [
"import requests\n",
"from langchain_unstructured import UnstructuredLoader\n",
"from unstructured_client import UnstructuredClient\n",
"from unstructured_client.utils import BackoffStrategy, RetryConfig\n",
"\n",
"client = UnstructuredClient(\n",
" api_key_auth=os.getenv(\n",
" \"UNSTRUCTURED_API_KEY\"\n",
" ), # Note: the client API param is \"api_key_auth\" instead of \"api_key\"\n",
" client=requests.Session(),\n",
" server_url=\"https://api.unstructuredapp.io/general/v0/general\",\n",
" retry_config=RetryConfig(\n",
" strategy=\"backoff\",\n",
" retry_connection_errors=True,\n",
" backoff=BackoffStrategy(\n",
" initial_interval=500,\n",
" max_interval=60000,\n",
" exponent=1.5,\n",
" max_elapsed_time=900000,\n",
" ),\n",
" ),\n",
")\n",
"\n",
"loader = UnstructuredLoader(\n",
" \"./example_data/layout-parser-paper.pdf\",\n",
" partition_via_api=True,\n",
" client=client,\n",
")\n",
"\n",
"docs = loader.load()\n",
"\n",
"print(docs[0].metadata[\"filename\"], \": \", docs[0].page_content[:100])"
]
},
{
"cell_type": "markdown",
"id": "c66fbeb3",
"metadata": {},
"source": [
"## Chunking\n",
"\n",
"The `UnstructuredLoader` does not support `mode` as parameter for grouping text like the older\n",
"loader `UnstructuredFileLoader` and others did. It instead supports \"chunking\". Chunking in\n",
"unstructured differs from other chunking mechanisms you may be familiar with that form chunks based\n",
"on plain-text features--character sequences like \"\\n\\n\" or \"\\n\" that might indicate a paragraph\n",
"boundary or list-item boundary. Instead, all documents are split using specific knowledge about each\n",
"document format to partition the document into semantic units (document elements) and we only need to\n",
"resort to text-splitting when a single element exceeds the desired maximum chunk size. In general,\n",
"chunking combines consecutive elements to form chunks as large as possible without exceeding the\n",
"maximum chunk size. Chunking produces a sequence of CompositeElement, Table, or TableChunk elements.\n",
"Each “chunk” is an instance of one of these three types.\n",
"\n",
"See this [page](https://docs.unstructured.io/open-source/core-functionality/chunking) for more\n",
"details about chunking options, but to reproduce the same behavior as `mode=\"single\"`, you can set\n",
"`chunking_strategy=\"basic\"`, `max_characters=<some-really-big-number>`, and `include_orig_elements=False`."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "e9f1c20d",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING: Partitioning locally even though api_key is defined since partition_via_api=False.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of LangChain documents: 1\n",
"Length of text in the document: 42772\n"
]
}
],
"source": [
"from langchain_unstructured import UnstructuredLoader\n",
"\n",
"loader = UnstructuredLoader(\n",
" \"./example_data/layout-parser-paper.pdf\",\n",
" chunking_strategy=\"basic\",\n",
" max_characters=1000000,\n",
" include_orig_elements=False,\n",
")\n",
"\n",
"docs = loader.load()\n",
"\n",
"print(\"Number of LangChain documents:\", len(docs))\n",
"print(\"Length of text in the document:\", len(docs[0].page_content))"
]
},
{
"cell_type": "markdown",
"id": "ce01aa40",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all `UnstructuredLoader` features and configurations head to the API reference: https://api.python.langchain.com/en/latest/document_loaders/langchain_unstructured.document_loaders.UnstructuredLoader.html"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}