unstructured[patch]: support loading URLs (#26670)

`unstructured.partition.auto.partition` supports a `url` kwarg, but `url` in `UnstructuredLoader.__init__` is reserved for the server URL. Here we add a `web_url` kwarg that is passed to the partition kwargs:

```python
self.unstructured_kwargs["url"] = web_url
```
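
The forwarding described above can be illustrated with a minimal, self-contained sketch. `SketchLoader`, its `server_url` attribute, and its `partition_kwargs` helper are hypothetical names for illustration only and are not the actual `UnstructuredLoader` implementation. The sketch shows only the naming pattern: `url` stays reserved for the server, while `web_url` is stored under the `url` key that the partition function expects.

```python
from typing import Any, Dict, Optional


class SketchLoader:
    """Hypothetical stand-in for the loader; NOT the real UnstructuredLoader."""

    def __init__(
        self,
        file_path: Optional[str] = None,
        *,
        url: Optional[str] = None,      # reserved for the server URL
        web_url: Optional[str] = None,  # remote document to partition
        **unstructured_kwargs: Any,
    ) -> None:
        self.file_path = file_path
        self.server_url = url  # assumed attribute name, for illustration
        self.unstructured_kwargs: Dict[str, Any] = dict(unstructured_kwargs)
        if web_url is not None:
            # Forward the page address under the name partition expects.
            self.unstructured_kwargs["url"] = web_url

    def partition_kwargs(self) -> Dict[str, Any]:
        """Return the kwargs that would be handed to the partition call."""
        kwargs = dict(self.unstructured_kwargs)
        if self.file_path is not None:
            kwargs["filename"] = self.file_path
        return kwargs


loader = SketchLoader(web_url="https://www.example.com")
print(loader.partition_kwargs())
```

Keeping the two names separate means existing callers that pass `url` for the server are unaffected, and the remote-document address only reaches the partition kwargs when `web_url` is set.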
2026-01-05 07:55:18 +00:00 · 2024-09-19 14:40:25 -04:00
parent 311f861547
commit eef18dec44
3 changed files with 76 additions and 2 deletions
--- a/docs/docs/integrations/document_loaders/unstructured_file.ipynb
+++ b/docs/docs/integrations/document_loaders/unstructured_file.ipynb
@@ -16,7 +16,7 @@
    "\n",
    "| Class | Package | Local | Serializable | [JS support](https://js.langchain.com/docs/integrations/document_loaders/file_loaders/unstructured/)|\n",
    "| :--- | :--- | :---: | :---: |  :---: |\n",
-    "| [UnstructuredLoader](https://python.langchain.com/api_reference/unstructured/document_loaders/langchain_unstructured.document_loaders.UnstructuredLoader.html) | [langchain_community](https://python.langchain.com/api_reference/unstructured/index.html) | ✅ | ❌ | ✅ | \n",
+    "| [UnstructuredLoader](https://python.langchain.com/api_reference/unstructured/document_loaders/langchain_unstructured.document_loaders.UnstructuredLoader.html) | [langchain_unstructured](https://python.langchain.com/api_reference/unstructured/index.html) | ✅ | ❌ | ✅ | \n",
    "### Loader features\n",
    "| Source | Document Lazy Loading | Native Async Support\n",
    "| :---: | :---: | :---: | \n",
@@ -519,6 +519,47 @@
    "print(\"Length of text in the document:\", len(docs[0].page_content))"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "3ec3c22d-02cd-498b-921f-b839d1404f32",
+   "metadata": {},
+   "source": [
+    "## Loading web pages\n",
+    "\n",
+    "`UnstructuredLoader` accepts a `web_url` kwarg when run locally that populates the `url` parameter of the underlying Unstructured [partition](https://docs.unstructured.io/open-source/core-functionality/partitioning). This allows for the parsing of remotely hosted documents, such as HTML web pages.\n",
+    "\n",
+    "Example usage:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "bf9a8546-659d-4861-bff2-fdf1ad93ac65",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "page_content='Example Domain' metadata={'category_depth': 0, 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://www.example.com', 'category': 'Title', 'element_id': 'fdaa78d856f9d143aeeed85bf23f58f8'}\n",
+      "\n",
+      "page_content='This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.' metadata={'languages': ['eng'], 'parent_id': 'fdaa78d856f9d143aeeed85bf23f58f8', 'filetype': 'text/html', 'url': 'https://www.example.com', 'category': 'NarrativeText', 'element_id': '3652b8458b0688639f973fe36253c992'}\n",
+      "\n",
+      "page_content='More information...' metadata={'category_depth': 0, 'link_texts': ['More information...'], 'link_urls': ['https://www.iana.org/domains/example'], 'languages': ['eng'], 'filetype': 'text/html', 'url': 'https://www.example.com', 'category': 'Title', 'element_id': '793ab98565d6f6d6f3a6d614e3ace2a9'}\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "from langchain_unstructured import UnstructuredLoader\n",
+    "\n",
+    "loader = UnstructuredLoader(web_url=\"https://www.example.com\")\n",
+    "docs = loader.load()\n",
+    "\n",
+    "for doc in docs:\n",
+    "    print(f\"{doc}\\n\")"
+   ]
+  },
  {
   "cell_type": "markdown",
   "id": "ce01aa40",
@@ -546,7 +587,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.10.13"
+   "version": "3.10.4"
  }
 },
 "nbformat": 4,