unstructured, community, initialize langchain-unstructured package (#22779)

#### Update (2): 
A single `UnstructuredLoader` is added to handle both local and API
partitioning. This loader also handles single or multiple documents.
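A minimal sketch of the two modes, using the `partition_via_api` and `api_key`
parameters documented in the notebook below:

```python
from langchain_unstructured import UnstructuredLoader

# Local partitioning (requires the heavier local dependencies)
local_docs = UnstructuredLoader("./example_data/fake.docx").load()

# Remote partitioning against the hosted Unstructured API,
# for one file or a batch of files
api_docs = UnstructuredLoader(
    file_path=["./example_data/fake.docx", "./example_data/fake-email.eml"],
    api_key="YOUR_UNSTRUCTURED_API_KEY",  # placeholder key
    partition_via_api=True,
).load()
```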

#### Changes in `community`:
Changes here do not affect users. The loaders in `community` were refactored
during the initial work of adopting the SDK for the API loaders. Other changes
include:
- `UnstructuredBaseLoader` has a new check for the conflicting combination of
`mode="paged"` and `chunking_strategy="by_page"`, and it now adds
`Element.element_id` to the `Document.metadata`.
- `UnstructuredAPIFileLoader` and `UnstructuredAPIFileIOLoader` now both
inherit directly from `UnstructuredBaseLoader`, initialize their
`file_path`/`file` attributes respectively, and implement their own
`_post_process_elements` methods.
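A rough sketch of the refactored shape in `community` (the class and method
names come from this PR; the exact error behavior of the mode/chunking check
is an assumption):

```python
from typing import Any, Callable, List, Optional


class UnstructuredBaseLoader:
    """Sketch only; the real base class lives in langchain_community."""

    def __init__(
        self,
        mode: str = "single",  # deprecated
        post_processors: Optional[List[Callable[[str], str]]] = None,
        **unstructured_kwargs: Any,
    ) -> None:
        # New guard: mode="paged" and chunking_strategy="by_page" both group
        # text by page, so setting both is flagged (raising here is assumed).
        self._check_if_both_mode_and_chunking_strategy_are_by_page(
            mode, unstructured_kwargs
        )
        self.mode = mode
        self.unstructured_kwargs = unstructured_kwargs
        self.post_processors = post_processors or []

    @staticmethod
    def _check_if_both_mode_and_chunking_strategy_are_by_page(
        mode: str, unstructured_kwargs: dict
    ) -> None:
        if (
            mode == "paged"
            and unstructured_kwargs.get("chunking_strategy") == "by_page"
        ):
            raise ValueError(
                'Only one of mode="paged" and chunking_strategy="by_page" '
                "should be set."
            )


class UnstructuredAPIFileLoader(UnstructuredBaseLoader):
    """Now inherits directly from the base and owns its file_path attribute."""

    def __init__(self, file_path: str, **kwargs: Any) -> None:
        super().__init__(**kwargs)
        self.file_path = file_path

    def _post_process_elements(self, elements: List[Any]) -> List[Any]:
        # Each loader applies its own str -> str post-processors.
        for element in elements:
            for post_processor in self.post_processors:
                element.apply(post_processor)
        return elements
```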

--------
#### Update:
New SDK Loaders in a [partner
package](https://python.langchain.com/v0.1/docs/contributing/integrations/#partner-package-in-langchain-repo)
are introduced to prevent breaking changes for users (see discussion
below).
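For context, the user-facing import moves from the community loaders to the
new partner package (a sketch; both import paths appear elsewhere in this PR):

```python
# Before: separate loaders in langchain-community
from langchain_community.document_loaders import (
    UnstructuredAPIFileIOLoader,
    UnstructuredAPIFileLoader,
    UnstructuredFileLoader,
)

# After: a single loader from the partner package
from langchain_unstructured import UnstructuredLoader
```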

##### TODO:
- [x] Test docstring examples
--------
- **Description:** `UnstructuredAPIFileIOLoader` and
`UnstructuredAPIFileLoader` calls to the Unstructured API are now made using
the `unstructured-client` SDK.
- **New Dependencies:** `unstructured-client`
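For reference, a rough sketch of a direct `unstructured-client` call; treat
the request and response field names here as assumptions, since they vary
across SDK versions:

```python
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared

client = UnstructuredClient(api_key_auth="YOUR_UNSTRUCTURED_API_KEY")

with open("example_data/fake.docx", "rb") as f:
    req = shared.PartitionParameters(
        files=shared.Files(content=f.read(), file_name="example_data/fake.docx"),
        strategy="fast",
    )

# The API returns JSON element dicts rather than unstructured Element objects.
resp = client.general.partition(req)
for element in resp.elements or []:
    print(element["type"], element["element_id"], element["text"][:60])
```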

- [x] **Add tests and docs**: If you're adding a new integration, please
include
- [x] a test for the integration, preferably unit tests that do not rely
on network access,
- [x] update the description in
`docs/docs/integrations/providers/unstructured.mdx`
- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/

Additional guidelines:
- Make sure optional dependencies are imported within a function.
- Please do not add dependencies to pyproject.toml files (even optional
ones) unless they are required for unit tests.
- Most PRs should not touch more than one package.
- Changes should be backwards compatible.
- If you are adding something to community, do not re-import it in
langchain.

TODO:
- [x] Update
https://python.langchain.com/v0.1/docs/integrations/document_loaders/unstructured_file/#unstructured-api
-
`langchain/docs/docs/integrations/document_loaders/unstructured_file.ipynb`
- The description here needs to indicate that users should install
`unstructured-client` instead of `unstructured`. Read over closely to
look for any other changes that need to be made.
- [x] Update the `lazy_load` method in `UnstructuredBaseLoader` to
handle JSON responses from the API instead of just lists of elements (a rough
sketch of this mapping appears after this list).
- This method may need to be overridden by the API loaders instead of
changing it in the `UnstructuredBaseLoader`.
- [x] Update the documentation links in the class docstrings (the
Unstructured docs have moved)
- [x] Update Document.metadata to include `element_id` (see thread
[here](https://unstructuredw-kbe4326.slack.com/archives/C044N0YV08G/p1718187499818419))
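A minimal sketch of that JSON-to-`Document` mapping, assuming the standard
Unstructured API element shape (`text`, `type`, `element_id`, and `metadata`
keys); the helper name is hypothetical:

```python
from typing import Any, Dict, List

from langchain_core.documents import Document


def elements_json_to_documents(elements_json: List[Dict[str, Any]]) -> List[Document]:
    """Hypothetical helper mapping API JSON elements onto LangChain Documents."""
    docs = []
    for el in elements_json:
        metadata = dict(el.get("metadata", {}))
        metadata["category"] = el.get("type")
        metadata["element_id"] = el.get("element_id")  # newly surfaced by this PR
        docs.append(Document(page_content=el.get("text", ""), metadata=metadata))
    return docs
```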

---------

Signed-off-by: ChengZi <chen.zhang@zilliz.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Isaac Francisco <78627776+isahers1@users.noreply.github.com>
Co-authored-by: ChengZi <chen.zhang@zilliz.com>
John 2024-07-24 19:21:20 -04:00 committed by GitHub
parent 2394807033
commit d59c656ea5
23 changed files with 5929 additions and 347 deletions

View File

@@ -5,7 +5,7 @@
"id": "20deed05",
"metadata": {},
"source": [
"# Unstructured File\n",
"# Unstructured\n",
"\n",
"This notebook covers how to use `Unstructured` package to load files of many types. `Unstructured` currently supports loading of text files, powerpoints, html, pdfs, images, and more.\n",
"\n",
@@ -14,79 +14,69 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"id": "2886982e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.1.1\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"# # Install package\n",
"%pip install --upgrade --quiet \"unstructured[all-docs]\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "54d62efd",
"metadata": {},
"outputs": [],
"source": [
"# # Install other dependencies\n",
"# # https://github.com/Unstructured-IO/unstructured/blob/main/docs/source/installing.rst\n",
"# !brew install libmagic\n",
"# !brew install poppler\n",
"# !brew install tesseract\n",
"# # If parsing xml / html documents:\n",
"# !brew install libxml2\n",
"# !brew install libxslt"
"# Install package, compatible with API partitioning\n",
"%pip install --upgrade --quiet \"langchain-unstructured\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "af6a64f5",
"cell_type": "markdown",
"id": "e75e2a6d",
"metadata": {},
"outputs": [],
"source": [
"# import nltk\n",
"# nltk.download('punkt')"
"### Local Partitioning (Optional)\n",
"\n",
"By default, `langchain-unstructured` installs a smaller footprint that requires\n",
"offloading of the partitioning logic to the Unstructured API.\n",
"\n",
"If you would like to run the partitioning logic locally, you will need to install\n",
"a combination of system dependencies, as outlined in the \n",
"[Unstructured documentation here](https://docs.unstructured.io/open-source/installation/full-installation).\n",
"\n",
"For example, on Macs you can install the required dependencies with:\n",
"\n",
"```bash\n",
"# base dependencies\n",
"brew install libmagic poppler tesseract\n",
"\n",
"# If parsing xml / html documents:\n",
"brew install libxml2 libxslt\n",
"```\n",
"\n",
"You can install the required `pip` dependencies with:\n",
"\n",
"```bash\n",
"pip install \"langchain-unstructured[local]\"\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "a9c1c775",
"metadata": {},
"source": [
"### Quickstart\n",
"\n",
"To simply load a file as a document, you can use the LangChain `DocumentLoader.load` \n",
"interface:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"id": "79d3e549",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.\\n\\nLast year COVID-19 kept us apart. This year we are finally together again.\\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.\\n\\nWith a duty to one another to the American people to the Constit'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"outputs": [],
"source": [
"from langchain_community.document_loaders import UnstructuredFileLoader\n",
"from langchain_unstructured import UnstructuredLoader\n",
"\n",
"loader = UnstructuredFileLoader(\"./example_data/state_of_the_union.txt\")\n",
"loader = UnstructuredLoader(\"./example_data/state_of_the_union.txt\")\n",
"\n",
"docs = loader.load()\n",
"\n",
"docs[0].page_content[:400]"
"docs = loader.load()"
]
},
{
@@ -99,113 +89,31 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 5,
"id": "092d9a0b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!\\n\\n1/22/23, 8:24 PM - User 2: Goodmorning! $50 is too low.\\n\\n1/23/23, 2:59 AM - User 1: How much do you want?\\n\\n1/23/23, 3:00 AM - User 2: Online is at least $100\\n\\n1/23/23, 3:01 AM - User 2: Here is $129\\n\\n1/23/23, 3:01 AM - User 2: <Media omitted>\\n\\n1/23/23, 3:01 AM - User 1: Im not int'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
"name": "stdout",
"output_type": "stream",
"text": [
"whatsapp_chat.txt : 1/22/23, 6:30 PM - User 1: Hi! Im interested in your bag. Im offering $50. Let me know if you are in\n",
"state_of_the_union.txt : May God bless you all. May God protect our troops.\n"
]
}
],
"source": [
"files = [\"./example_data/whatsapp_chat.txt\", \"./example_data/layout-parser-paper.pdf\"]\n",
"file_paths = [\n",
" \"./example_data/whatsapp_chat.txt\",\n",
" \"./example_data/state_of_the_union.txt\",\n",
"]\n",
"\n",
"loader = UnstructuredFileLoader(files)\n",
"loader = UnstructuredLoader(file_paths)\n",
"\n",
"docs = loader.load()\n",
"\n",
"docs[0].page_content[:400]"
]
},
{
"cell_type": "markdown",
"id": "7874d01d",
"metadata": {},
"source": [
"## Retain Elements\n",
"\n",
"Under the hood, Unstructured creates different \"elements\" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode=\"elements\"`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ff5b616d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.', metadata={'source': './example_data/state_of_the_union.txt', 'file_directory': './example_data', 'filename': 'state_of_the_union.txt', 'last_modified': '2024-07-01T11:18:22', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'NarrativeText'}),\n",
" Document(page_content='Last year COVID-19 kept us apart. This year we are finally together again.', metadata={'source': './example_data/state_of_the_union.txt', 'file_directory': './example_data', 'filename': 'state_of_the_union.txt', 'last_modified': '2024-07-01T11:18:22', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'NarrativeText'}),\n",
" Document(page_content='Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.', metadata={'source': './example_data/state_of_the_union.txt', 'file_directory': './example_data', 'filename': 'state_of_the_union.txt', 'last_modified': '2024-07-01T11:18:22', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'NarrativeText'}),\n",
" Document(page_content='With a duty to one another to the American people to the Constitution.', metadata={'source': './example_data/state_of_the_union.txt', 'file_directory': './example_data', 'filename': 'state_of_the_union.txt', 'last_modified': '2024-07-01T11:18:22', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'UncategorizedText'}),\n",
" Document(page_content='And with an unwavering resolve that freedom will always triumph over tyranny.', metadata={'source': './example_data/state_of_the_union.txt', 'file_directory': './example_data', 'filename': 'state_of_the_union.txt', 'last_modified': '2024-07-01T11:18:22', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'NarrativeText'})]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"loader = UnstructuredFileLoader(\n",
" \"./example_data/state_of_the_union.txt\", mode=\"elements\"\n",
")\n",
"\n",
"docs = loader.load()\n",
"\n",
"docs[:5]"
]
},
{
"cell_type": "markdown",
"id": "672733fd",
"metadata": {},
"source": [
"## Define a Partitioning Strategy\n",
"\n",
"Unstructured document loader allow users to pass in a `strategy` parameter that lets `unstructured` know how to partition the document. Currently supported strategies are `\"hi_res\"` (the default) and `\"fast\"`. Hi res partitioning strategies are more accurate, but take longer to process. Fast strategies partition the document more quickly, but trade-off accuracy. Not all document types have separate hi res and fast partitioning strategies. For those document types, the `strategy` kwarg is ignored. In some cases, the high res strategy will fallback to fast if there is a dependency missing (i.e. a model for document partitioning). You can see how to apply a strategy to an `UnstructuredFileLoader` below."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "767238a4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
" Document(page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title'}),\n",
" Document(page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
" Document(page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
" Document(page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io.', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText'})]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_community.document_loaders import UnstructuredFileLoader\n",
"\n",
"loader = UnstructuredFileLoader(\n",
" \"./example_data/layout-parser-paper.pdf\", strategy=\"fast\", mode=\"elements\"\n",
")\n",
"\n",
"docs = loader.load()\n",
"\n",
"docs[5:10]"
"print(docs[0].metadata.get(\"filename\"), \": \", docs[0].page_content[:100])\n",
"print(docs[-1].metadata.get(\"filename\"), \": \", docs[-1].page_content[:100])"
]
},
{
@@ -215,37 +123,52 @@
"source": [
"## PDF Example\n",
"\n",
"Processing PDF documents works exactly the same way. Unstructured detects the file type and extracts the same types of elements. Modes of operation are \n",
"- `single` all the text from all elements are combined into one (default)\n",
"- `elements` maintain individual elements\n",
"- `paged` texts from each page are only combined"
"Processing PDF documents works exactly the same way. Unstructured detects the file type and extracts the same types of elements."
]
},
{
"cell_type": "markdown",
"id": "672733fd",
"metadata": {},
"source": [
"### Define a Partitioning Strategy\n",
"\n",
"Unstructured document loader allow users to pass in a `strategy` parameter that lets Unstructured\n",
"know how to partition pdf and other OCR'd documents. Currently supported strategies are `\"auto\"`,\n",
"`\"hi_res\"`, `\"ocr_only\"`, and `\"fast\"`. Learn more about the different strategies\n",
"[here](https://docs.unstructured.io/open-source/core-functionality/partitioning#partition-pdf). \n",
"\n",
"Not all document types have separate hi res and fast partitioning strategies. For those document types, the `strategy` kwarg is\n",
"ignored. In some cases, the high res strategy will fallback to fast if there is a dependency missing\n",
"(i.e. a model for document partitioning). You can see how to apply a strategy to an\n",
"`UnstructuredLoader` below."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "686e5eb4",
"execution_count": 6,
"id": "60685353",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
" Document(page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title'}),\n",
" Document(page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
" Document(page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
" Document(page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io.', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText'})]"
"[Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'e9fa370aef7ee5c05744eb7bb7d9981b'}, page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a'),\n",
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title', 'element_id': 'bde0b230a1aa488e3ce837d33015181b'}, page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis'),\n",
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': '54700f902899f0c8c90488fa8d825bce'}, page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5'),\n",
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'b650f5867bad9bb4e30384282c79bcfe'}, page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca'),\n",
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText', 'element_id': 'cfc957c94fe63c8fd7c7f4bcb56e75a7'}, page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io.')]"
]
},
"execution_count": 12,
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"loader = UnstructuredFileLoader(\n",
" \"./example_data/layout-parser-paper.pdf\", mode=\"elements\"\n",
")\n",
"from langchain_unstructured import UnstructuredLoader\n",
"\n",
"loader = UnstructuredLoader(\"./example_data/layout-parser-paper.pdf\", strategy=\"fast\")\n",
"\n",
"docs = loader.load()\n",
"\n",
@@ -257,37 +180,39 @@
"id": "1cf27fc8",
"metadata": {},
"source": [
"If you need to post process the `unstructured` elements after extraction, you can pass in a list of `str` -> `str` functions to the `post_processors` kwarg when you instantiate the `UnstructuredFileLoader`. This applies to other Unstructured loaders as well. Below is an example."
"## Post Processing\n",
"\n",
"If you need to post process the `unstructured` elements after extraction, you can pass in a list of\n",
"`str` -> `str` functions to the `post_processors` kwarg when you instantiate the `UnstructuredLoader`. This applies to other Unstructured loaders as well. Below is an example."
]
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 7,
"id": "112e5538",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
" Document(page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title'}),\n",
" Document(page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
" Document(page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText'}),\n",
" Document(page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io.', metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2023-12-19T13:42:18', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText'})]"
"[Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((16.34, 393.9), (16.34, 560.0), (36.34, 560.0), (36.34, 393.9)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': '89565df026a24279aaea20dc08cedbec', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'e9fa370aef7ee5c05744eb7bb7d9981b'}, page_content='2 v 8 4 3 5 1 . 3 0 1 2 : v i X r a'),\n",
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((157.62199999999999, 114.23496279999995), (157.62199999999999, 146.5141628), (457.7358962799999, 146.5141628), (457.7358962799999, 114.23496279999995)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'filetype': 'application/pdf', 'category': 'Title', 'element_id': 'bde0b230a1aa488e3ce837d33015181b'}, page_content='LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis'),\n",
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((134.809, 168.64029940800003), (134.809, 192.2517444), (480.5464199080001, 192.2517444), (480.5464199080001, 168.64029940800003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': '54700f902899f0c8c90488fa8d825bce'}, page_content='Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain Lee4, Jacob Carlson3, and Weining Li5'),\n",
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((207.23000000000002, 202.57205439999996), (207.23000000000002, 311.8195408), (408.12676, 311.8195408), (408.12676, 202.57205439999996)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'UncategorizedText', 'element_id': 'b650f5867bad9bb4e30384282c79bcfe'}, page_content='1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca'),\n",
" Document(metadata={'source': './example_data/layout-parser-paper.pdf', 'coordinates': {'points': ((162.779, 338.45008160000003), (162.779, 566.8455408), (454.0372021523199, 566.8455408), (454.0372021523199, 338.45008160000003)), 'system': 'PixelSpace', 'layout_width': 612, 'layout_height': 792}, 'file_directory': './example_data', 'filename': 'layout-parser-paper.pdf', 'languages': ['eng'], 'last_modified': '2024-02-27T15:49:27', 'links': [{'text': ':// layout - parser . github . io', 'url': 'https://layout-parser.github.io', 'start_index': 1477}], 'page_number': 1, 'parent_id': 'bde0b230a1aa488e3ce837d33015181b', 'filetype': 'application/pdf', 'category': 'NarrativeText', 'element_id': 'cfc957c94fe63c8fd7c7f4bcb56e75a7'}, page_content='Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applica- tions. The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility, LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io.')]"
]
},
"execution_count": 14,
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_community.document_loaders import UnstructuredFileLoader\n",
"from langchain_unstructured import UnstructuredLoader\n",
"from unstructured.cleaners.core import clean_extra_whitespace\n",
"\n",
"loader = UnstructuredFileLoader(\n",
"loader = UnstructuredLoader(\n",
" \"./example_data/layout-parser-paper.pdf\",\n",
" mode=\"elements\",\n",
" post_processors=[clean_extra_whitespace],\n",
")\n",
"\n",
@@ -303,34 +228,70 @@
"source": [
"## Unstructured API\n",
"\n",
"If you want to get up and running with less set up, you can simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or `UnstructuredAPIFileIOLoader`. That will process your document using the hosted Unstructured API. You can generate a free Unstructured API key [here](https://www.unstructured.io/api-key/). The [Unstructured documentation](https://unstructured-io.github.io/unstructured/) page will have instructions on how to generate an API key once theyre available. Check out the instructions [here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image) if youd like to self-host the Unstructured API or run it locally."
"If you want to get up and running with smaller packages and get the most up-to-date partitioning you can `pip install\n",
"unstructured-client` and `pip install langchain-unstructured`. For\n",
"more information about the `UnstructuredLoader`, refer to the\n",
"[Unstructured provider page](https://python.langchain.com/v0.1/docs/integrations/document_loaders/unstructured_file/).\n",
"\n",
"The loader will process your document using the hosted Unstructured serverless API when you pass in\n",
"your `api_key` and set `partition_via_api=True`. You can generate a free\n",
"Unstructured API key [here](https://unstructured.io/api-key/).\n",
"\n",
"Check out the instructions [here](https://github.com/Unstructured-IO/unstructured-api#dizzy-instructions-for-using-the-docker-image)\n",
"if youd like to self-host the Unstructured API or run it locally."
]
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"id": "6e5fde16",
"metadata": {},
"outputs": [],
"source": [
"# Install package\n",
"%pip install \"langchain-unstructured\"\n",
"%pip install \"unstructured-client\"\n",
"\n",
"# Set API key\n",
"import os\n",
"\n",
"os.environ[\"UNSTRUCTURED_API_KEY\"] = \"FAKE_API_KEY\""
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "386eb63c",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: Preparing to split document for partition.\n",
"INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.\n",
"INFO: Partitioning without split.\n",
"INFO: Successfully partitioned the document.\n"
]
},
{
"data": {
"text/plain": [
"Document(page_content='Lorem ipsum dolor sit amet.', metadata={'source': 'example_data/fake.docx'})"
"Document(metadata={'source': 'example_data/fake.docx', 'category_depth': 0, 'filename': 'fake.docx', 'languages': ['por', 'cat'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title', 'element_id': '56d531394823d81787d77a04462ed096'}, page_content='Lorem ipsum dolor sit amet.')"
]
},
"execution_count": 4,
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_community.document_loaders import UnstructuredAPIFileLoader\n",
"from langchain_unstructured import UnstructuredLoader\n",
"\n",
"filenames = [\"example_data/fake.docx\", \"example_data/fake-email.eml\"]\n",
"\n",
"loader = UnstructuredAPIFileLoader(\n",
" file_path=filenames[0],\n",
" api_key=\"FAKE_API_KEY\",\n",
"loader = UnstructuredLoader(\n",
" file_path=\"example_data/fake.docx\",\n",
" api_key=os.getenv(\"UNSTRUCTURED_API_KEY\"),\n",
" partition_via_api=True,\n",
")\n",
"\n",
"docs = loader.load()\n",
@@ -342,43 +303,197 @@
"id": "94158999",
"metadata": {},
"source": [
"You can also batch multiple files through the Unstructured API in a single API using `UnstructuredAPIFileLoader`."
"You can also batch multiple files through the Unstructured API in a single API using `UnstructuredLoader`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 10,
"id": "a3d7c846",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='Lorem ipsum dolor sit amet.\\n\\nThis is a test email to use for unit tests.\\n\\nImportant points:\\n\\nRoses are red\\n\\nViolets are blue', metadata={'source': ['example_data/fake.docx', 'example_data/fake-email.eml']})"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: Preparing to split document for partition.\n",
"INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.\n",
"INFO: Partitioning without split.\n",
"INFO: Successfully partitioned the document.\n",
"INFO: Preparing to split document for partition.\n",
"INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.\n",
"INFO: Partitioning without split.\n",
"INFO: Successfully partitioned the document.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"fake.docx : Lorem ipsum dolor sit amet.\n",
"fake-email.eml : Violets are blue\n"
]
}
],
"source": [
"loader = UnstructuredAPIFileLoader(\n",
" file_path=filenames,\n",
" api_key=\"FAKE_API_KEY\",\n",
"loader = UnstructuredLoader(\n",
" file_path=[\"example_data/fake.docx\", \"example_data/fake-email.eml\"],\n",
" api_key=os.getenv(\"UNSTRUCTURED_API_KEY\"),\n",
" partition_via_api=True,\n",
")\n",
"\n",
"docs = loader.load()\n",
"docs[0]"
"\n",
"print(docs[0].metadata[\"filename\"], \": \", docs[0].page_content[:100])\n",
"print(docs[-1].metadata[\"filename\"], \": \", docs[-1].page_content[:100])"
]
},
{
"cell_type": "markdown",
"id": "a324a0db",
"metadata": {},
"source": [
"### Unstructured SDK Client\n",
"\n",
"Partitioning with the Unstructured API relies on the [Unstructured SDK\n",
"Client](https://docs.unstructured.io/api-reference/api-services/sdk).\n",
"\n",
"Below is an example showing how you can customize some features of the client and use your own\n",
"`requests.Session()`, pass in an alternative `server_url`, or customize the `RetryConfig` object for more control over how failed requests are handled."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0e510495",
"execution_count": 11,
"id": "58e55264",
"metadata": {},
"outputs": [],
"source": []
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: Preparing to split document for partition.\n",
"INFO: Concurrency level set to 5\n",
"INFO: Splitting pages 1 to 16 (16 total)\n",
"INFO: Determined optimal split size of 4 pages.\n",
"INFO: Partitioning 4 files with 4 page(s) each.\n",
"INFO: Partitioning set #1 (pages 1-4).\n",
"INFO: Partitioning set #2 (pages 5-8).\n",
"INFO: Partitioning set #3 (pages 9-12).\n",
"INFO: Partitioning set #4 (pages 13-16).\n",
"INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general \"HTTP/1.1 200 OK\"\n",
"INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general \"HTTP/1.1 200 OK\"\n",
"INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general \"HTTP/1.1 200 OK\"\n",
"INFO: Successfully partitioned set #1, elements added to the final result.\n",
"INFO: Successfully partitioned set #2, elements added to the final result.\n",
"INFO: Successfully partitioned set #3, elements added to the final result.\n",
"INFO: Successfully partitioned set #4, elements added to the final result.\n",
"INFO: Successfully partitioned the document.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"layout-parser-paper.pdf : LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis\n"
]
}
],
"source": [
"import requests\n",
"from langchain_unstructured import UnstructuredLoader\n",
"from unstructured_client import UnstructuredClient\n",
"from unstructured_client.utils import BackoffStrategy, RetryConfig\n",
"\n",
"client = UnstructuredClient(\n",
" api_key_auth=os.getenv(\n",
" \"UNSTRUCTURED_API_KEY\"\n",
" ), # Note: the client API param is \"api_key_auth\" instead of \"api_key\"\n",
" client=requests.Session(),\n",
" server_url=\"https://api.unstructuredapp.io/general/v0/general\",\n",
" retry_config=RetryConfig(\n",
" strategy=\"backoff\",\n",
" retry_connection_errors=True,\n",
" backoff=BackoffStrategy(\n",
" initial_interval=500,\n",
" max_interval=60000,\n",
" exponent=1.5,\n",
" max_elapsed_time=900000,\n",
" ),\n",
" ),\n",
")\n",
"\n",
"loader = UnstructuredLoader(\n",
" \"./example_data/layout-parser-paper.pdf\",\n",
" partition_via_api=True,\n",
" client=client,\n",
")\n",
"\n",
"docs = loader.load()\n",
"\n",
"print(docs[0].metadata[\"filename\"], \": \", docs[0].page_content[:100])"
]
},
{
"cell_type": "markdown",
"id": "c66fbeb3",
"metadata": {},
"source": [
"## Chunking\n",
"\n",
"The `UnstructuredLoader` does not support `mode` as parameter for grouping text like the older\n",
"loader `UnstructuredFileLoader` and others did. It instead supports \"chunking\". Chunking in\n",
"unstructured differs from other chunking mechanisms you may be familiar with that form chunks based\n",
"on plain-text features--character sequences like \"\\n\\n\" or \"\\n\" that might indicate a paragraph\n",
"boundary or list-item boundary. Instead, all documents are split using specific knowledge about each\n",
"document format to partition the document into semantic units (document elements) and we only need to\n",
"resort to text-splitting when a single element exceeds the desired maximum chunk size. In general,\n",
"chunking combines consecutive elements to form chunks as large as possible without exceeding the\n",
"maximum chunk size. Chunking produces a sequence of CompositeElement, Table, or TableChunk elements.\n",
"Each “chunk” is an instance of one of these three types.\n",
"\n",
"See this [page](https://docs.unstructured.io/open-source/core-functionality/chunking) for more\n",
"details about chunking options, but to reproduce the same behavior as `mode=\"single\"`, you can set\n",
"`chunking_strategy=\"basic\"`, `max_characters=<some-really-big-number>`, and `include_orig_elements=False`."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "e9f1c20d",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING: Partitioning locally even though api_key is defined since partition_via_api=False.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of LangChain documents: 1\n",
"Length of text in the document: 42772\n"
]
}
],
"source": [
"from langchain_unstructured import UnstructuredLoader\n",
"\n",
"loader = UnstructuredLoader(\n",
" \"./example_data/layout-parser-paper.pdf\",\n",
" chunking_strategy=\"basic\",\n",
" max_characters=1000000,\n",
" include_orig_elements=False,\n",
")\n",
"\n",
"docs = loader.load()\n",
"\n",
"print(\"Number of LangChain documents:\", len(docs))\n",
"print(\"Length of text in the document:\", len(docs[0].page_content))"
]
}
],
"metadata": {
@@ -397,7 +512,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.5"
"version": "3.10.13"
}
},
"nbformat": 4,

View File

@@ -40,6 +40,7 @@ These providers have standalone `langchain-{provider}` packages for improved ver
- [Qdrant](/docs/integrations/providers/qdrant)
- [Robocorp](/docs/integrations/providers/robocorp)
- [Together AI](/docs/integrations/providers/together)
- [Unstructured](/docs/integrations/providers/unstructured)
- [Upstage](/docs/integrations/providers/upstage)
- [Voyage AI](/docs/integrations/providers/voyageai)

View File

@@ -8,11 +8,21 @@ ecosystem within LangChain.
## Installation and Setup
If you are using a loader that runs locally, use the following steps to get `unstructured` and
its dependencies running locally.
If you are using a loader that runs locally, use the following steps to get `unstructured` and its
dependencies running.
- Install the Python SDK with `pip install unstructured`.
- You can install document specific dependencies with extras, i.e. `pip install "unstructured[docx]"`.
- For the smallest installation footprint and to take advantage of features not available in the
open-source `unstructured` package, install the Python SDK with `pip install unstructured-client`
along with `pip install langchain-unstructured` to use the `UnstructuredLoader` and partition
remotely against the Unstructured API. This loader lives
in a LangChain partner repo instead of the `langchain-community` repo and you will need an
`api_key`; you can generate a free key [here](https://unstructured.io/api-key/).
- Unstructured's documentation for the SDK can be found here:
https://docs.unstructured.io/api-reference/api-services/sdk
- To run everything locally, install the open-source Python package with `pip install unstructured`
along with `pip install langchain-community` and use the same `UnstructuredLoader` as mentioned above.
- You can install document specific dependencies with extras, e.g. `pip install "unstructured[docx]"`.
- To install the dependencies for all document types, use `pip install "unstructured[all-docs]"`.
- Install the following system dependencies if they are not already available on your system with e.g. `brew install` for Mac.
Depending on what document types you're parsing, you may not need all of these.
@@ -22,16 +32,11 @@ its dependencies running locally.
- `qpdf` (PDFs)
- `libreoffice` (MS Office docs)
- `pandoc` (EPUBs)
- When running locally, Unstructured also recommends using Docker [by following this
guide](https://docs.unstructured.io/open-source/installation/docker-installation) to ensure all
system dependencies are installed correctly.
When running locally, Unstructured also recommends using Docker [by following this guide](https://docs.unstructured.io/open-source/installation/docker-installation)
to ensure all system dependencies are installed correctly.
If you want to get up and running with less set up, you can
simply run `pip install unstructured` and use `UnstructuredAPIFileLoader` or
`UnstructuredAPIFileIOLoader`. That will process your document using the hosted Unstructured API.
The `Unstructured API` requires API keys to make requests.
The Unstructured API requires API keys to make requests.
You can request an API key [here](https://unstructured.io/api-key-hosted) and start using it today!
Check out the README [here](https://github.com/Unstructured-IO/unstructured-api) to get started making API calls.
We'd love to hear your feedback; let us know how it goes in our [community slack](https://join.slack.com/t/unstructuredw-kbe4326/shared_invite/zt-1x7cgo0pg-PTptXWylzPQF9xZolzCnwQ).
@@ -42,30 +47,21 @@ Check out the instructions
## Data Loaders
The primary usage of the `Unstructured` is in data loaders.
The primary usage of `Unstructured` is in data loaders.
### UnstructuredAPIFileIOLoader
### UnstructuredLoader
See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api).
See a [usage example](/docs/integrations/document_loaders/unstructured_file) showing how you can use
this loader to partition both locally and remotely with the serverless Unstructured API.
```python
from langchain_community.document_loaders import UnstructuredAPIFileIOLoader
```
### UnstructuredAPIFileLoader
See a [usage example](/docs/integrations/document_loaders/unstructured_file#unstructured-api).
```python
from langchain_community.document_loaders import UnstructuredAPIFileLoader
from langchain_unstructured import UnstructuredLoader
```
### UnstructuredCHMLoader
`CHM` means `Microsoft Compiled HTML Help`.
See a usage example in the API documentation.
```python
from langchain_community.document_loaders import UnstructuredCHMLoader
```
@@ -119,15 +115,6 @@ See a [usage example](/docs/integrations/document_loaders/google_drive#passing-i
from langchain_community.document_loaders import UnstructuredFileIOLoader
```
### UnstructuredFileLoader
See a [usage example](/docs/integrations/document_loaders/unstructured_file).
```python
from langchain_community.document_loaders import UnstructuredFileLoader
```
### UnstructuredHTMLLoader
See a [usage example](/docs/how_to/document_loader_html).


@@ -1,14 +1,23 @@
"""Loader that uses unstructured to load files."""
import collections
from __future__ import annotations
import logging
import os
from abc import ABC, abstractmethod
from pathlib import Path
from typing import IO, Any, Callable, Iterator, List, Optional, Sequence, Union
from langchain_core._api.deprecation import deprecated
from langchain_core.documents import Document
from typing_extensions import TypeAlias
from langchain_community.document_loaders.base import BaseLoader
Element: TypeAlias = Any
logger = logging.getLogger(__file__)
def satisfies_min_unstructured_version(min_version: str) -> bool:
"""Check if the installed `Unstructured` version exceeds the minimum version
@@ -41,8 +50,8 @@ class UnstructuredBaseLoader(BaseLoader, ABC):
def __init__(
self,
mode: str = "single",
post_processors: Optional[List[Callable]] = None,
mode: str = "single", # deprecated
post_processors: Optional[List[Callable[[str], str]]] = None,
**unstructured_kwargs: Any,
):
"""Initialize with file path."""
@@ -53,32 +62,41 @@ class UnstructuredBaseLoader(BaseLoader, ABC):
"unstructured package not found, please install it with "
"`pip install unstructured`"
)
# `single` - elements are combined into one (default)
# `elements` - maintain individual elements
# `paged` - elements are combined by page
_valid_modes = {"single", "elements", "paged"}
if mode not in _valid_modes:
raise ValueError(
f"Got {mode} for `mode`, but should be one of `{_valid_modes}`"
)
if not satisfies_min_unstructured_version("0.5.4"):
if "strategy" in unstructured_kwargs:
unstructured_kwargs.pop("strategy")
self._check_if_both_mode_and_chunking_strategy_are_by_page(
mode, unstructured_kwargs
)
self.mode = mode
self.unstructured_kwargs = unstructured_kwargs
self.post_processors = post_processors or []
@abstractmethod
def _get_elements(self) -> List[Element]:
"""Get elements."""
@abstractmethod
def _get_metadata(self) -> dict[str, Any]:
"""Get file_path metadata if available."""
def _post_process_elements(self, elements: List[Element]) -> List[Element]:
"""Apply post processing functions to extracted unstructured elements.
Post processing functions are str -> str callables passed
in using the post_processors kwarg when the loader is instantiated.
"""
for element in elements:
for post_processor in self.post_processors:
element.apply(post_processor)
@@ -97,18 +115,25 @@ class UnstructuredBaseLoader(BaseLoader, ABC):
metadata.update(element.metadata.to_dict())
if hasattr(element, "category"):
metadata["category"] = element.category
if element.to_dict().get("element_id"):
metadata["element_id"] = element.to_dict().get("element_id")
yield Document(page_content=str(element), metadata=metadata)
elif self.mode == "paged":
logger.warning(
"`mode='paged'` is deprecated in favor of the 'by_page' chunking"
" strategy. Learn more about chunking here:"
" https://docs.unstructured.io/open-source/core-functionality/chunking"
)
text_dict: dict[int, str] = {}
meta_dict: dict[int, dict[str, Any]] = {}
for element in elements:
metadata = self._get_metadata()
if hasattr(element, "metadata"):
metadata.update(element.metadata.to_dict())
page_number = metadata.get("page_number", 1)
# Check if this page_number already exists in text_dict
if page_number not in text_dict:
# If not, create new entry with initial text and metadata
text_dict[page_number] = str(element) + "\n\n"
@@ -128,18 +153,37 @@
else:
raise ValueError(f"mode of {self.mode} not supported.")
def _check_if_both_mode_and_chunking_strategy_are_by_page(
self, mode: str, unstructured_kwargs: dict[str, Any]
) -> None:
if (
mode == "paged"
and unstructured_kwargs.get("chunking_strategy") == "by_page"
):
raise ValueError(
"Only one of `chunking_strategy='by_page'` or `mode='paged'` may be"
" set. `chunking_strategy` is preferred."
)
@deprecated(
since="0.2.8",
removal="0.4.0",
alternative_import="langchain_unstructured.UnstructuredLoader",
)
class UnstructuredFileLoader(UnstructuredBaseLoader):
"""Load files using `Unstructured`.
The file loader uses the unstructured partition function and will automatically
detect the file type. You can run the loader in different modes: "single",
"elements", and "paged". The default "single" mode will return a single langchain
Document object. If you use "elements" mode, the unstructured library will split
the document into elements such as Title and NarrativeText and return those as
individual langchain Document objects. In addition to these post-processing modes
(which are specific to the LangChain Loaders), Unstructured has its own "chunking"
parameters for post-processing elements into more useful chunks for use cases such
as Retrieval Augmented Generation (RAG). You can pass in additional unstructured
kwargs to configure different unstructured settings.
Examples
--------
@@ -152,24 +196,27 @@ class UnstructuredFileLoader(UnstructuredBaseLoader):
References
----------
https://docs.unstructured.io/open-source/core-functionality/partitioning
https://docs.unstructured.io/open-source/core-functionality/chunking
"""
def __init__(
self,
file_path: Union[str, List[str], Path, List[Path]],
*,
mode: str = "single",
**unstructured_kwargs: Any,
):
"""Initialize with file path."""
self.file_path = file_path
super().__init__(mode=mode, **unstructured_kwargs)
def _get_elements(self) -> List[Element]:
from unstructured.partition.auto import partition
if isinstance(self.file_path, list):
elements: List[Element] = []
for file in self.file_path:
if isinstance(file, Path):
file = str(file)
@@ -180,35 +227,33 @@ class UnstructuredFileLoader(UnstructuredBaseLoader):
self.file_path = str(self.file_path)
return partition(filename=self.file_path, **self.unstructured_kwargs)
def _get_metadata(self) -> dict[str, Any]:
return {"source": self.file_path}
def get_elements_from_api(
file_path: Union[str, List[str], Path, List[Path], None] = None,
file: Union[IO[bytes], Sequence[IO[bytes]], None] = None,
api_url: str = "https://api.unstructuredapp.io/general/v0/general",
api_key: str = "",
**unstructured_kwargs: Any,
) -> List[Element]:
"""Retrieve a list of elements from the `Unstructured API`."""
if is_list := isinstance(file_path, list):
file_path = [str(path) for path in file_path]
if isinstance(file, Sequence) or is_list:
from unstructured.partition.api import partition_multiple_via_api
_doc_elements = partition_multiple_via_api(
filenames=file_path, # type: ignore
files=file, # type: ignore
api_key=api_key,
api_url=api_url,
**unstructured_kwargs,
)
elements = []
for _elements in _doc_elements:
elements.extend(_elements)
return elements
else:
from unstructured.partition.api import partition_via_api
@@ -222,59 +267,69 @@ def get_elements_from_api(
)
@deprecated(
since="0.2.8",
removal="0.4.0",
alternative_import="langchain_unstructured.UnstructuredLoader",
)
class UnstructuredAPIFileLoader(UnstructuredBaseLoader):
"""Load files using `Unstructured` API.
By default, the loader makes a call to the hosted Unstructured API. If you are
running the unstructured API locally, you can change the API URL by passing in the
url parameter when you initialize the loader. The hosted Unstructured API requires
an API key. See the links below to learn more about our API offerings and get an
API key.
You can run the loader in different modes: "single", "elements", and "paged". The
default "single" mode will return a single langchain Document object. If you use
"elements" mode, the unstructured library will split the document into elements such
as Title and NarrativeText and return those as individual langchain Document
objects. In addition to these post-processing modes (which are specific to the
LangChain Loaders), Unstructured has its own "chunking" parameters for
post-processing elements into more useful chunks for use cases such as Retrieval
Augmented Generation (RAG). You can pass in additional unstructured kwargs to
configure different unstructured settings.
Examples
```python
from langchain_community.document_loaders import UnstructuredAPIFileLoader
loader = UnstructuredAPIFileLoader(
"example.pdf", mode="elements", strategy="fast", api_key="MY_API_KEY",
)
docs = loader.load()
References
----------
https://docs.unstructured.io/api-reference/api-services/sdk
https://docs.unstructured.io/api-reference/api-services/overview
https://docs.unstructured.io/open-source/core-functionality/partitioning
https://docs.unstructured.io/open-source/core-functionality/chunking
"""
def __init__(
self,
file_path: Union[str, List[str]],
*,
mode: str = "single",
url: str = "https://api.unstructured.io/general/v0/general",
url: str = "https://api.unstructuredapp.io/general/v0/general",
api_key: str = "",
**unstructured_kwargs: Any,
):
"""Initialize with file path."""
validate_unstructured_version(min_unstructured_version="0.10.15")
self.file_path = file_path
self.url = url
self.api_key = os.getenv("UNSTRUCTURED_API_KEY") or api_key
super().__init__(mode=mode, **unstructured_kwargs)
def _get_metadata(self) -> dict[str, Any]:
return {"source": self.file_path}
def _get_elements(self) -> List[Element]:
return get_elements_from_api(
file_path=self.file_path,
api_key=self.api_key,
@@ -282,18 +337,36 @@ class UnstructuredAPIFileLoader(UnstructuredFileLoader):
**self.unstructured_kwargs,
)
def _post_process_elements(self, elements: List[Element]) -> List[Element]:
"""Apply post processing functions to extracted unstructured elements.
Post processing functions are str -> str callables passed
in using the post_processors kwarg when the loader is instantiated.
"""
for element in elements:
for post_processor in self.post_processors:
element.apply(post_processor)
return elements
@deprecated(
since="0.2.8",
removal="0.4.0",
alternative_import="langchain_unstructured.UnstructuredLoader",
)
class UnstructuredFileIOLoader(UnstructuredBaseLoader):
"""Load files using `Unstructured`.
"""Load file-like objects opened in read mode using `Unstructured`.
The file loader uses the unstructured partition function and will automatically
detect the file type. You can run the loader in different modes: "single",
"elements", and "paged". The default "single" mode will return a single langchain
Document object. If you use "elements" mode, the unstructured library will split
the document into elements such as Title and NarrativeText and return those as
individual langchain Document objects. In addition to these post-processing modes
(which are specific to the LangChain Loaders), Unstructured has its own "chunking"
parameters for post-processing elements into more useful chunks for use cases
such as Retrieval Augmented Generation (RAG). You can pass in additional
unstructured kwargs to configure different unstructured settings.
Examples
--------
@@ -308,12 +381,14 @@ class UnstructuredFileIOLoader(UnstructuredBaseLoader):
References
----------
https://docs.unstructured.io/open-source/core-functionality/partitioning
https://docs.unstructured.io/open-source/core-functionality/chunking
"""
def __init__(
self,
file: IO[bytes],
*,
mode: str = "single",
**unstructured_kwargs: Any,
):
@@ -321,72 +396,114 @@ class UnstructuredFileIOLoader(UnstructuredBaseLoader):
self.file = file
super().__init__(mode=mode, **unstructured_kwargs)
def _get_elements(self) -> List[Element]:
from unstructured.partition.auto import partition
return partition(file=self.file, **self.unstructured_kwargs)
def _get_metadata(self) -> dict[str, Any]:
return {}
def _post_process_elements(self, elements: List[Element]) -> List[Element]:
"""Apply post processing functions to extracted unstructured elements.
Post processing functions are str -> str callables passed
in using the post_processors kwarg when the loader is instantiated.
"""
for element in elements:
for post_processor in self.post_processors:
element.apply(post_processor)
return elements
@deprecated(
since="0.2.8",
removal="0.4.0",
alternative_import="langchain_unstructured.UnstructuredLoader",
)
class UnstructuredAPIFileIOLoader(UnstructuredBaseLoader):
"""Send file-like objects with `unstructured-client` sdk to the Unstructured API.
By default, the loader makes a call to the hosted Unstructured API. If you are
running the unstructured API locally, you can change the API URL by passing in the
url parameter when you initialize the loader. The hosted Unstructured API requires
an API key. See the links below to learn more about our API offerings and get an
API key.
You can run the loader in different modes: "single", "elements", and "paged". The
default "single" mode will return a single langchain Document object. If you use
"elements" mode, the unstructured library will split the document into elements
such as Title and NarrativeText and return those as individual langchain Document
objects. In addition to these post-processing modes (which are specific to the
LangChain Loaders), Unstructured has its own "chunking" parameters for
post-processing elements into more useful chunks for use cases such as Retrieval
Augmented Generation (RAG). You can pass in additional unstructured kwargs to
configure different unstructured settings.
Examples
--------
from langchain_community.document_loaders import UnstructuredAPIFileIOLoader
with open("example.pdf", "rb") as f:
loader = UnstructuredAPIFileIOLoader(
f, mode="elements", strategy="fast", api_key="MY_API_KEY",
)
docs = loader.load()
References
----------
https://docs.unstructured.io/api-reference/api-services/sdk
https://docs.unstructured.io/api-reference/api-services/overview
https://docs.unstructured.io/open-source/core-functionality/partitioning
https://docs.unstructured.io/open-source/core-functionality/chunking
"""
def __init__(
self,
file: Union[IO[bytes], Sequence[IO[bytes]]],
*,
mode: str = "single",
url: str = "https://api.unstructured.io/general/v0/general",
url: str = "https://api.unstructuredapp.io/general/v0/general",
api_key: str = "",
**unstructured_kwargs: Any,
):
"""Initialize with file path."""
if isinstance(file, Sequence):
validate_unstructured_version(min_unstructured_version="0.6.3")
validate_unstructured_version(min_unstructured_version="0.6.2")
self.file = file
self.url = url
self.api_key = os.getenv("UNSTRUCTURED_API_KEY") or api_key
super().__init__(mode=mode, **unstructured_kwargs)
def _get_elements(self) -> List[Element]:
if self.unstructured_kwargs.get("metadata_filename"):
return get_elements_from_api(
file=self.file,
file_path=self.unstructured_kwargs.pop("metadata_filename"),
api_key=self.api_key,
api_url=self.url,
**self.unstructured_kwargs,
)
else:
raise ValueError(
"If partitioning a file via api,"
" metadata_filename must be specified as well.",
)
def _get_metadata(self) -> dict[str, Any]:
return {}
def _post_process_elements(self, elements: List[Element]) -> List[Element]:
"""Apply post processing functions to extracted unstructured elements.
Post processing functions are str -> str callables passed
in using the post_processors kwarg when the loader is instantiated.
"""
for element in elements:
for post_processor in self.post_processors:
element.apply(post_processor)
return elements

libs/partners/unstructured/.gitignore vendored Normal file

@@ -0,0 +1 @@
__pycache__


@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2024 LangChain, Inc.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


@@ -0,0 +1,66 @@
.PHONY: all format lint test tests integration_tests docker_tests help extended_tests
# Default target executed when no arguments are given to make.
all: help
# Define a variable for the test file path.
TEST_FILE ?= tests/unit_tests/
integration_test integration_tests: TEST_FILE = tests/integration_tests/
# unit tests are run with the --disable-socket flag to prevent network calls
test tests:
poetry run pytest --disable-socket --allow-unix-socket $(TEST_FILE)
# integration tests are run without the --disable-socket flag to allow network calls
integration_test:
poetry run pytest $(TEST_FILE)
# skip tests marked as local in CI
integration_tests:
poetry run pytest $(TEST_FILE) -m "not local"
######################
# LINTING AND FORMATTING
######################
# Define a variable for Python and notebook files.
PYTHON_FILES=.
MYPY_CACHE=.mypy_cache
lint format: PYTHON_FILES=.
lint_diff format_diff: PYTHON_FILES=$(shell git diff --relative=libs/partners/unstructured --name-only --diff-filter=d master | grep -E '\.py$$|\.ipynb$$')
lint_package: PYTHON_FILES=langchain_unstructured
lint_tests: PYTHON_FILES=tests
lint_tests: MYPY_CACHE=.mypy_cache_test
lint lint_diff lint_package lint_tests:
poetry run ruff .
poetry run ruff format $(PYTHON_FILES) --diff
poetry run ruff --select I $(PYTHON_FILES)
mkdir -p $(MYPY_CACHE); poetry run mypy $(PYTHON_FILES) --cache-dir $(MYPY_CACHE)
format format_diff:
poetry run ruff format $(PYTHON_FILES)
poetry run ruff --select I --fix $(PYTHON_FILES)
spell_check:
poetry run codespell --toml pyproject.toml
spell_fix:
poetry run codespell --toml pyproject.toml -w
check_imports: $(shell find langchain_unstructured -name '*.py')
poetry run python ./scripts/check_imports.py $^
######################
# HELP
######################
help:
@echo '----'
@echo 'check_imports - check imports'
@echo 'format - run code formatters'
@echo 'lint - run linters'
@echo 'test - run unit tests'
@echo 'tests - run unit tests'
@echo 'test TEST_FILE=<test_file> - run all tests in file'


@@ -0,0 +1,71 @@
# langchain-unstructured
This package contains the LangChain integration with Unstructured
## Installation
```bash
pip install -U langchain-unstructured
```
And you should configure credentials by setting the following environment variables:
```bash
export UNSTRUCTURED_API_KEY="your-api-key"
```
## Loaders
Partition and load files using either the `unstructured-client` SDK and the
Unstructured API, or locally using the `unstructured` library.
API:
To partition via the Unstructured API, `pip install unstructured-client`, set
`partition_via_api=True`, and define `api_key`. If you are running the Unstructured API
locally, you can change the API URL by defining `server_url` when you initialize the
loader. The hosted Unstructured API requires an API key. See the links below to
learn more about our API offerings and get an API key.
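As a sketch, pointing the loader at a self-hosted API deployment (the URL below is illustrative):

```python
from langchain_unstructured import UnstructuredLoader

# Override the server URL to target a locally running Unstructured API
# instead of the hosted service.
loader = UnstructuredLoader(
    file_path="example.pdf",
    partition_via_api=True,
    server_url="http://localhost:8000/general/v0/general",
)
docs = loader.load()
```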
Local:
By default the file loader uses the Unstructured `partition` function and will
automatically detect the file type.
In addition to document specific partition parameters, Unstructured has a rich set
of "chunking" parameters for post-processing elements into more useful text segments
for use cases such as Retrieval Augmented Generation (RAG). You can pass additional
Unstructured kwargs to the loader to configure different unstructured settings.
Setup:
```bash
pip install -U langchain-unstructured
pip install -U unstructured-client
export UNSTRUCTURED_API_KEY="your-api-key"
```
Instantiate:
```python
from langchain_unstructured import UnstructuredLoader
loader = UnstructuredLoader(
file_path=["example.pdf", "fake.pdf"],
api_key=UNSTRUCTURED_API_KEY,
partition_via_api=True,
chunking_strategy="by_title",
strategy="fast",
)
```
Load:
```python
docs = loader.load()
print(docs[0].page_content[:100])
print(docs[0].metadata)
```
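You can also stream documents with `lazy_load` and clean up element text with
`post_processors`; a minimal sketch (the whitespace-collapsing function is illustrative):

```python
from langchain_unstructured import UnstructuredLoader

def collapse_whitespace(text: str) -> str:
    # Illustrative str -> str post-processor applied to each element's text
    return " ".join(text.split())

loader = UnstructuredLoader(
    file_path="example.pdf",
    post_processors=[collapse_whitespace],
    strategy="fast",
)
for doc in loader.lazy_load():
    print(doc.metadata.get("category"), doc.page_content[:60])
```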
References
----------
https://docs.unstructured.io/api-reference/api-services/sdk
https://docs.unstructured.io/api-reference/api-services/overview
https://docs.unstructured.io/open-source/core-functionality/partitioning
https://docs.unstructured.io/open-source/core-functionality/chunking


@@ -0,0 +1,15 @@
from importlib import metadata
from langchain_unstructured.document_loaders import UnstructuredLoader
try:
__version__ = metadata.version(__package__)
except metadata.PackageNotFoundError:
# Case where package metadata is not available.
__version__ = ""
del metadata # optional, avoids polluting the results of dir(__package__)
__all__ = [
"UnstructuredLoader",
"__version__",
]


@@ -0,0 +1,280 @@
"""Unstructured document loader."""
from __future__ import annotations
import json
import logging
import os
from pathlib import Path
from typing import IO, Any, Callable, Iterator, Optional, cast
from langchain_core.document_loaders.base import BaseLoader
from langchain_core.documents import Document
from typing_extensions import TypeAlias
from unstructured_client import UnstructuredClient # type: ignore
from unstructured_client.models import operations, shared # type: ignore
Element: TypeAlias = Any
logger = logging.getLogger(__file__)
_DEFAULT_URL = "https://api.unstructuredapp.io/general/v0/general"
class UnstructuredLoader(BaseLoader):
"""Unstructured document loader interface.
Partition and load files using either the `unstructured-client` sdk and the
Unstructured API or locally using the `unstructured` library.
API:
This package is configured to work with the Unstructured API by default.
To use the Unstructured API, set
`partition_via_api=True` and define `api_key`. If you are running the unstructured
API locally, you can change the API URL by defining `server_url` when you initialize the
loader. The hosted Unstructured API requires an API key. See the links below to
learn more about our API offerings and get an API key.
Local:
To partition files locally, you must have the `unstructured` package installed.
You can install it with `pip install unstructured`.
By default the file loader uses the Unstructured `partition` function and will
automatically detect the file type.
In addition to document specific partition parameters, Unstructured has a rich set
of "chunking" parameters for post-processing elements into more useful text segments
for use cases such as Retrieval Augmented Generation (RAG). You can pass additional
Unstructured kwargs to the loader to configure different unstructured settings.
Setup:
.. code-block:: bash
pip install -U langchain-unstructured
export UNSTRUCTURED_API_KEY="your-api-key"
Instantiate:
.. code-block:: python
from langchain_unstructured import UnstructuredLoader
loader = UnstructuredLoader(
file_path=["example.pdf", "fake.pdf"],
api_key=UNSTRUCTURED_API_KEY,
partition_via_api=True,
chunking_strategy="by_title",
strategy="fast",
)
Load:
.. code-block:: python
docs = loader.load()
print(docs[0].page_content[:100])
print(docs[0].metadata)
References
----------
https://docs.unstructured.io/api-reference/api-services/sdk
https://docs.unstructured.io/api-reference/api-services/overview
https://docs.unstructured.io/open-source/core-functionality/partitioning
https://docs.unstructured.io/open-source/core-functionality/chunking
"""
def __init__(
self,
file_path: Optional[str | Path | list[str] | list[Path]] = None,
*,
file: Optional[IO[bytes] | list[IO[bytes]]] = None,
partition_via_api: bool = False,
post_processors: Optional[list[Callable[[str], str]]] = None,
# SDK parameters
api_key: Optional[str] = None,
client: Optional[UnstructuredClient] = None,
server_url: Optional[str] = None,
**kwargs: Any,
):
"""Initialize loader."""
if file_path is not None and file is not None:
raise ValueError("file_path and file cannot be defined simultaneously.")
if client is not None:
disallowed_params = [("api_key", api_key), ("server_url", server_url)]
bad_params = [
param for param, value in disallowed_params if value is not None
]
if bad_params:
raise ValueError(
"if you are passing a custom `client`, you cannot also pass these "
f"params: {', '.join(bad_params)}."
)
unstructured_api_key = api_key or os.getenv("UNSTRUCTURED_API_KEY")
unstructured_url = server_url or os.getenv("UNSTRUCTURED_URL") or _DEFAULT_URL
self.client = client or UnstructuredClient(
api_key_auth=unstructured_api_key, server_url=unstructured_url
)
self.file_path = file_path
self.file = file
self.partition_via_api = partition_via_api
self.post_processors = post_processors
self.unstructured_kwargs = kwargs
def lazy_load(self) -> Iterator[Document]:
"""Load file(s) to the _UnstructuredBaseLoader."""
def load_file(
f: Optional[IO[bytes]] = None, f_path: Optional[str | Path] = None
) -> Iterator[Document]:
"""Load an individual file to the _UnstructuredBaseLoader."""
return _SingleDocumentLoader(
file=f,
file_path=f_path,
partition_via_api=self.partition_via_api,
post_processors=self.post_processors,
# SDK parameters
client=self.client,
**self.unstructured_kwargs,
).lazy_load()
if isinstance(self.file, list):
for f in self.file:
yield from load_file(f=f)
return
if isinstance(self.file_path, list):
for f_path in self.file_path:
yield from load_file(f_path=f_path)
return
# Call _SingleDocumentLoader normally since file and file_path are not lists
yield from load_file(f=self.file, f_path=self.file_path)
class _SingleDocumentLoader(BaseLoader):
"""Provides loader functionality for individual document/file objects.
Encapsulates partitioning individual file objects (file or file_path) either
locally or via the Unstructured API.
"""
def __init__(
self,
file_path: Optional[str | Path] = None,
*,
client: UnstructuredClient,
file: Optional[IO[bytes]] = None,
partition_via_api: bool = False,
post_processors: Optional[list[Callable[[str], str]]] = None,
# SDK parameters
**kwargs: Any,
):
"""Initialize loader."""
self.file_path = str(file_path) if isinstance(file_path, Path) else file_path
self.file = file
self.partition_via_api = partition_via_api
self.post_processors = post_processors
# SDK parameters
self.client = client
self.unstructured_kwargs = kwargs
def lazy_load(self) -> Iterator[Document]:
"""Load file."""
elements_json = (
self._post_process_elements_json(self._elements_json)
if self.post_processors
else self._elements_json
)
for element in elements_json:
metadata = self._get_metadata()
metadata.update(element.get("metadata")) # type: ignore
metadata.update(
{"category": element.get("category") or element.get("type")}
)
metadata.update({"element_id": element.get("element_id")})
yield Document(
page_content=cast(str, element.get("text")), metadata=metadata
)
@property
def _elements_json(self) -> list[dict[str, Any]]:
"""Get elements as a list of dictionaries from local partition or via API."""
if self.partition_via_api:
return self._elements_via_api
return self._convert_elements_to_dicts(self._elements_via_local)
@property
def _elements_via_local(self) -> list[Element]:
try:
from unstructured.partition.auto import partition # type: ignore
except ImportError:
raise ImportError(
"unstructured package not found, please install it with "
"`pip install unstructured`"
)
if self.file and self.unstructured_kwargs.get("metadata_filename") is None:
raise ValueError(
"If partitioning a fileIO object, metadata_filename must be specified"
" as well.",
)
return partition(
file=self.file, filename=self.file_path, **self.unstructured_kwargs
) # type: ignore
@property
def _elements_via_api(self) -> list[dict[str, Any]]:
"""Retrieve a list of element dicts from the API using the SDK client."""
client = self.client
req = self._sdk_partition_request
response = client.general.partition(req) # type: ignore
if response.status_code == 200:
return json.loads(response.raw_response.text)
raise ValueError(
f"Receive unexpected status code {response.status_code} from the API.",
)
@property
def _file_content(self) -> bytes:
"""Get content from either file or file_path."""
if self.file is not None:
return self.file.read()
elif self.file_path:
with open(self.file_path, "rb") as f:
return f.read()
raise ValueError("file or file_path must be defined.")
@property
def _sdk_partition_request(self) -> operations.PartitionRequest:
return operations.PartitionRequest(
partition_parameters=shared.PartitionParameters(
files=shared.Files(
content=self._file_content, file_name=str(self.file_path)
),
**self.unstructured_kwargs,
),
)
def _convert_elements_to_dicts(
self, elements: list[Element]
) -> list[dict[str, Any]]:
return [element.to_dict() for element in elements]
def _get_metadata(self) -> dict[str, Any]:
"""Get file_path metadata if available."""
return {"source": self.file_path} if self.file_path else {}
def _post_process_elements_json(
self, elements_json: list[dict[str, Any]]
) -> list[dict[str, Any]]:
"""Apply post processing functions to extracted unstructured elements.
Post processing functions are str -> str callables passed
in using the post_processors kwarg when the loader is instantiated.
"""
if self.post_processors:
for element in elements_json:
for post_processor in self.post_processors:
element["text"] = post_processor(str(element.get("text")))
return elements_json

libs/partners/unstructured/poetry.lock generated Normal file

File diff suppressed because it is too large


@@ -0,0 +1,97 @@
[tool.poetry]
name = "langchain-unstructured"
version = "0.1.0"
description = "An integration package connecting Unstructured and LangChain"
authors = []
readme = "README.md"
repository = "https://github.com/langchain-ai/langchain"
license = "MIT"
[tool.poetry.urls]
"Source Code" = "https://github.com/langchain-ai/langchain/tree/master/libs/partners/unstructured"
"Release Notes" = "https://github.com/langchain-ai/langchain/releases?q=tag%3A%22langchain-unstructured%3D%3D0%22&expanded=true"
[tool.poetry.dependencies]
python = ">=3.9,<4.0"
langchain-core = "^0.2.23"
unstructured-client = { version = "^0.24.1" }
unstructured = { version = "^0.15.0", optional = true, python = "<3.13", extras = [
"all-docs",
] }
[tool.poetry.extras]
local = ["unstructured"]
[tool.poetry.group.test]
optional = true
[tool.poetry.group.test.dependencies]
pytest = "^7.4.3"
pytest-asyncio = "^0.23.2"
pytest-socket = "^0.7.0"
langchain-core = { path = "../../core", develop = true }
[tool.poetry.group.codespell]
optional = true
[tool.poetry.group.codespell.dependencies]
codespell = "^2.2.6"
[tool.poetry.group.test_integration]
optional = true
[tool.poetry.group.test_integration.dependencies]
[tool.poetry.group.lint]
optional = true
[tool.poetry.group.lint.dependencies]
ruff = "^0.1.8"
[tool.poetry.group.typing.dependencies]
mypy = "^1.7.1"
unstructured = { version = "^0.15.0", python = "<3.13", extras = ["all-docs"] }
langchain-core = { path = "../../core", develop = true }
[tool.poetry.group.dev]
optional = true
[tool.poetry.group.dev.dependencies]
langchain-core = { path = "../../core", develop = true }
[tool.ruff.lint]
select = [
"E", # pycodestyle
"F", # pyflakes
"I", # isort
"T201", # print
]
[tool.mypy]
disallow_untyped_defs = "True"
[tool.coverage.run]
omit = ["tests/*"]
[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
[tool.pytest.ini_options]
# --strict-markers will raise errors on unknown marks.
# https://docs.pytest.org/en/7.1.x/how-to/mark.html#raising-errors-on-unknown-marks
#
# https://docs.pytest.org/en/7.1.x/reference/reference.html
# --strict-config any warnings encountered while parsing the `pytest`
# section of the configuration file raise errors.
#
# https://github.com/tophat/syrupy
# --snapshot-warn-unused Prints a warning on unused snapshots rather than fail the test suite.
addopts = "--strict-markers --strict-config --durations=5"
# Registering custom markers.
# https://docs.pytest.org/en/7.1.x/example/markers.html#registering-markers
markers = [
"compile: mark placeholder test used to compile integration tests without running them",
"local: mark tests as requiring a local install, which isn't compatible with CI currently",
]
asyncio_mode = "auto"


@@ -0,0 +1,17 @@
import sys
import traceback
from importlib.machinery import SourceFileLoader
if __name__ == "__main__":
files = sys.argv[1:]
has_failure = False
for file in files:
try:
SourceFileLoader("x", file).load_module()
except Exception:
has_failure = True
print(file) # noqa: T201
traceback.print_exc()
print() # noqa: T201
sys.exit(1 if has_failure else 0)


@@ -0,0 +1,27 @@
#!/bin/bash
#
# This script searches for lines starting with "import pydantic" or "from pydantic"
# in tracked files within a Git repository.
#
# Usage: ./scripts/check_pydantic.sh /path/to/repository
# Check if a path argument is provided
if [ $# -ne 1 ]; then
echo "Usage: $0 /path/to/repository"
exit 1
fi
repository_path="$1"
# Search for lines matching the pattern within the specified repository
result=$(git -C "$repository_path" grep -E '^import pydantic|^from pydantic')
# Check if any matching lines were found
if [ -n "$result" ]; then
echo "ERROR: The following lines need to be updated:"
echo "$result"
echo "Please replace the code with an import from langchain_core.pydantic_v1."
echo "For example, replace 'from pydantic import BaseModel'"
echo "with 'from langchain_core.pydantic_v1 import BaseModel'"
exit 1
fi


@@ -0,0 +1,18 @@
#!/bin/bash
set -eu
# Initialize a variable to keep track of errors
errors=0
# make sure not importing from langchain, langchain_experimental, or langchain_community
git --no-pager grep '^from langchain\.' . && errors=$((errors+1))
git --no-pager grep '^from langchain_experimental\.' . && errors=$((errors+1))
git --no-pager grep '^from langchain_community\.' . && errors=$((errors+1))
# Decide on an exit status based on the errors
if [ "$errors" -gt 0 ]; then
exit 1
else
exit 0
fi


@@ -0,0 +1,7 @@
import pytest
@pytest.mark.compile
def test_placeholder() -> None:
"""Used for compiling integration tests without running any real tests."""
pass


@@ -0,0 +1,135 @@
import os
from pathlib import Path
from typing import Callable
import pytest
from langchain_unstructured import UnstructuredLoader
EXAMPLE_DOCS_DIRECTORY = str(
Path(__file__).parent.parent.parent.parent.parent
/ "community/tests/integration_tests/examples/"
)
UNSTRUCTURED_API_KEY = os.getenv("UNSTRUCTURED_API_KEY")
# -- Local partition --
@pytest.mark.local
def test_loader_partitions_locally() -> None:
file_path = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper.pdf")
docs = UnstructuredLoader(
file_path=file_path,
# Unstructured kwargs
strategy="fast",
include_page_breaks=True,
).load()
assert all(
doc.metadata.get("filename") == "layout-parser-paper.pdf" for doc in docs
)
assert any(doc.metadata.get("category") == "PageBreak" for doc in docs)
@pytest.mark.local
def test_loader_partition_ignores_invalid_arg() -> None:
file_path = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper.pdf")
docs = UnstructuredLoader(
file_path=file_path,
# Unstructured kwargs
strategy="fast",
# mode is no longer a valid argument and is ignored when partitioning locally
mode="single",
).load()
assert len(docs) > 1
assert all(
doc.metadata.get("filename") == "layout-parser-paper.pdf" for doc in docs
)
@pytest.mark.local
def test_loader_partitions_locally_and_applies_post_processors(
get_post_processor: Callable[[str], str],
) -> None:
file_path = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper.pdf")
loader = UnstructuredLoader(
file_path=file_path,
post_processors=[get_post_processor],
strategy="fast",
)
docs = loader.load()
assert len(docs) > 1
assert docs[0].page_content.endswith("THE END!")
# -- API partition --
def test_loader_partitions_via_api() -> None:
file_path = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper.pdf")
loader = UnstructuredLoader(
file_path=file_path,
partition_via_api=True,
# Unstructured kwargs
strategy="fast",
include_page_breaks=True,
)
docs = loader.load()
assert len(docs) > 1
assert any(doc.metadata.get("category") == "PageBreak" for doc in docs)
assert all(
doc.metadata.get("filename") == "layout-parser-paper.pdf" for doc in docs
)
assert docs[0].metadata.get("element_id") is not None
def test_loader_partitions_multiple_via_api() -> None:
file_paths = [
os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper.pdf"),
os.path.join(EXAMPLE_DOCS_DIRECTORY, "fake-email-attachment.eml"),
]
loader = UnstructuredLoader(
file_path=file_paths,
api_key=UNSTRUCTURED_API_KEY,
partition_via_api=True,
# Unstructured kwargs
strategy="fast",
)
docs = loader.load()
assert len(docs) > 1
assert docs[0].metadata.get("filename") == "layout-parser-paper.pdf"
assert docs[-1].metadata.get("filename") == "fake-email-attachment.eml"
def test_loader_partition_via_api_raises_TypeError_with_invalid_arg() -> None:
file_path = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper.pdf")
loader = UnstructuredLoader(
file_path=file_path,
api_key=UNSTRUCTURED_API_KEY,
partition_via_api=True,
mode="elements",
)
with pytest.raises(TypeError, match="unexpected keyword argument 'mode'"):
loader.load()
# -- fixtures ---
@pytest.fixture()
def get_post_processor() -> Callable[[str], str]:
def append_the_end(text: str) -> str:
return text + "THE END!"
return append_the_end


@@ -0,0 +1,178 @@
from pathlib import Path
from typing import Any, Callable
from unittest import mock
from unittest.mock import Mock, mock_open, patch
import pytest
from unstructured.documents.elements import Text # type: ignore
from langchain_unstructured.document_loaders import (
_SingleDocumentLoader, # type: ignore
)
EXAMPLE_DOCS_DIRECTORY = str(
Path(__file__).parent.parent.parent.parent.parent
/ "community/tests/integration_tests/examples/"
)
# --- _SingleDocumentLoader._file_content ---
def test_it_gets_content_from_file() -> None:
mock_file = Mock()
mock_file.read.return_value = b"content from file"
loader = _SingleDocumentLoader(
client=Mock(), file=mock_file, metadata_filename="fake.txt"
)
content = loader._file_content # type: ignore
assert content == b"content from file"
mock_file.read.assert_called_once()
@patch("builtins.open", new_callable=mock_open, read_data=b"content from file_path")
def test_it_gets_content_from_file_path(mock_file: Mock) -> None:
loader = _SingleDocumentLoader(client=Mock(), file_path="dummy_path")
content = loader._file_content # type: ignore
assert content == b"content from file_path"
mock_file.assert_called_once_with("dummy_path", "rb")
handle = mock_file()
handle.read.assert_called_once()
def test_it_raises_value_error_without_file_or_file_path() -> None:
loader = _SingleDocumentLoader(
client=Mock(),
)
with pytest.raises(ValueError) as e:
loader._file_content # type: ignore
assert str(e.value) == "file or file_path must be defined."
# --- _SingleDocumentLoader._elements_json ---
def test_it_calls_elements_via_api_with_valid_args() -> None:
with patch.object(
_SingleDocumentLoader, "_elements_via_api", new_callable=mock.PropertyMock
) as mock_elements_via_api:
mock_elements_via_api.return_value = [{"element": "data"}]
loader = _SingleDocumentLoader(
client=Mock(),
# Minimum required args for self._elements_via_api to be called:
partition_via_api=True,
api_key="some_key",
)
result = loader._elements_json # type: ignore
mock_elements_via_api.assert_called_once()
assert result == [{"element": "data"}]
@patch.object(_SingleDocumentLoader, "_convert_elements_to_dicts")
def test_it_partitions_locally_by_default(mock_convert_elements_to_dicts: Mock) -> None:
mock_convert_elements_to_dicts.return_value = [{}]
with patch.object(
_SingleDocumentLoader, "_elements_via_local", new_callable=mock.PropertyMock
) as mock_elements_via_local:
mock_elements_via_local.return_value = [{}]
# partition_via_api defaults to False, so elements are partitioned locally:
loader = _SingleDocumentLoader(
client=Mock(),
)
result = loader._elements_json # type: ignore
mock_elements_via_local.assert_called_once_with()
mock_convert_elements_to_dicts.assert_called_once_with([{}])
assert result == [{}]
def test_it_partitions_locally_and_logs_warning_with_partition_via_api_False(
caplog: pytest.LogCaptureFixture,
) -> None:
with patch.object(
_SingleDocumentLoader, "_elements_via_local"
) as mock_get_elements_locally:
mock_get_elements_locally.return_value = [Text("Mock text element.")]
loader = _SingleDocumentLoader(
client=Mock(), partition_via_api=False, api_key="some_key"
)
_ = loader._elements_json # type: ignore
# -- fixtures -------------------------------
@pytest.fixture()
def get_post_processor() -> Callable[[str], str]:
def append_the_end(text: str) -> str:
return text + "THE END!"
return append_the_end
@pytest.fixture()
def fake_json_response() -> list[dict[str, Any]]:
return [
{
"type": "Title",
"element_id": "b7f58c2fd9c15949a55a62eb84e39575",
"text": "LayoutParser: A Unified Toolkit for Deep Learning Based Document"
"Image Analysis",
"metadata": {
"languages": ["eng"],
"page_number": 1,
"filename": "layout-parser-paper.pdf",
"filetype": "application/pdf",
},
},
{
"type": "UncategorizedText",
"element_id": "e1c4facddf1f2eb1d0db5be34ad0de18",
"text": "1 2 0 2",
"metadata": {
"languages": ["eng"],
"page_number": 1,
"parent_id": "b7f58c2fd9c15949a55a62eb84e39575",
"filename": "layout-parser-paper.pdf",
"filetype": "application/pdf",
},
},
]
@pytest.fixture()
def fake_multiple_docs_json_response() -> list[dict[str, Any]]:
return [
{
"type": "Title",
"element_id": "b7f58c2fd9c15949a55a62eb84e39575",
"text": "LayoutParser: A Unified Toolkit for Deep Learning Based Document"
" Image Analysis",
"metadata": {
"languages": ["eng"],
"page_number": 1,
"filename": "layout-parser-paper.pdf",
"filetype": "application/pdf",
},
},
{
"type": "NarrativeText",
"element_id": "3c4ac9e7f55f1e3dbd87d3a9364642fe",
"text": "6/29/23, 12:16\u202fam - User 4: This message was deleted",
"metadata": {
"filename": "whatsapp_chat.txt",
"languages": ["eng"],
"filetype": "text/plain",
},
},
]


@@ -0,0 +1,10 @@
from langchain_unstructured import __all__
EXPECTED_ALL = [
"UnstructuredLoader",
"__version__",
]
def test_all_imports() -> None:
assert sorted(EXPECTED_ALL) == sorted(__all__)