Harrison/unstructured structured (#1004)

2025-10-09 16:08:24 +00:00 · 2023-02-12 07:36:11 -08:00
parent bbb06ca4cf
commit 0998577dfe
11 changed files with 363 additions and 121 deletions
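
Note on the "Retain Elements" cell added to pdf.ipynb in this commit: the notebook text says the loader combines Unstructured's elements by default and keeps them separate when `mode="elements"` is passed. The sketch below only illustrates that distinction and is not the loader's implementation. The `Document` stand-in, the `to_documents` helper, the `"single"` default mode name, the `"\n\n"` separator, and the `category` metadata key are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for the loader's document type."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def to_documents(elements, source, mode="single"):
    """Turn (category, text) pairs into documents.

    mode="single" joins every element's text into one document.
    mode="elements" keeps one document per element.
    """
    if mode == "elements":
        return [
            Document(text, {"source": source, "category": category})
            for category, text in elements
        ]
    if mode == "single":
        # Assumed separator: a blank line between element texts.
        combined = "\n\n".join(text for _, text in elements)
        return [Document(combined, {"source": source})]
    raise ValueError(f"unsupported mode: {mode!r}")


# Made-up elements standing in for what a PDF partitioner might return.
elements = [
    ("Title", "LayoutParser"),
    ("NarrativeText", "A unified toolkit for document image analysis."),
]
single = to_documents(elements, "example.pdf")
separate = to_documents(elements, "example.pdf", mode="elements")
print(len(single), len(separate))
```

With the combined mode, `data[0]` would hold the whole text; with the per-element mode, `data[0]` would hold only the first element, which is what the notebook's final `data[0]` cell inspects.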
--- a/docs/modules/document_loaders/examples/pdf.ipynb
+++ b/docs/modules/document_loaders/examples/pdf.ipynb
@@ -139,7 +139,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": 3,
   "id": "0cc0cd42",
   "metadata": {},
   "outputs": [],
@@ -149,7 +149,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 4,
   "id": "082d557c",
   "metadata": {},
   "outputs": [],
@@ -159,14 +159,54 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 3,
-   "id": "5c41106f",
+   "execution_count": null,
+   "id": "df11c953",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = loader.load()"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "09957371",
+   "metadata": {},
+   "source": [
+    "### Retain Elements\n",
+    "\n",
+    "Under the hood, Unstructured creates different \"elements\" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode=\"elements\"`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0fab833b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = UnstructuredPDFLoader(\"example_data/layout-parser-paper.pdf\", mode=\"elements\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c3e8ff1b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "43c23d2d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data[0]"
+   ]
+  },
  {
   "cell_type": "markdown",
   "id": "21998d18",
@@ -177,7 +217,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": 7,
   "id": "2f0cc9ff",
   "metadata": {},
   "outputs": [],
@@ -187,7 +227,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 8,
   "id": "42b531e8",
   "metadata": {},
   "outputs": [],
@@ -197,7 +237,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 9,
   "id": "010d5cdd",
   "metadata": {},
   "outputs": [],