Harrison/unstructured structured (#1004)

Harrison Chase
2023-02-12 07:36:11 -08:00
committed by GitHub
parent bbb06ca4cf
commit 0998577dfe
11 changed files with 363 additions and 121 deletions
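
The notebook text added in this commit describes two loader behaviours: by default the "elements" that Unstructured produces are combined into one document, while `mode="elements"` keeps them separate. A minimal sketch of that distinction follows. `Document` and `to_documents` are hypothetical stand-ins written for illustration, not LangChain or Unstructured APIs, and the blank-line separator and the `"single"` mode name are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for the document type a loader returns."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def to_documents(elements: List[str], source: str, mode: str = "single") -> List[Document]:
    """Turn extracted text elements into documents.

    mode="single"   -> one document holding all elements joined together
    mode="elements" -> one document per element, separation retained
    """
    if mode == "elements":
        return [Document(text, {"source": source}) for text in elements]
    if mode == "single":
        # Assumption: elements are joined with a blank line between them.
        return [Document("\n\n".join(elements), {"source": source})]
    raise ValueError(f"unknown mode: {mode!r}")


elements = ["LayoutParser", "Abstract", "Recent advances in document image analysis"]
combined = to_documents(elements, "example_data/layout-parser-paper.pdf")
separate = to_documents(elements, "example_data/layout-parser-paper.pdf", mode="elements")
```

With this sketch, `combined` holds a single document and `separate[0]` holds only the first element, which mirrors what the added `data[0]` cell inspects in the notebook.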

@@ -139,7 +139,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 1,
+"execution_count": 3,
 "id": "0cc0cd42",
 "metadata": {},
 "outputs": [],
@@ -149,7 +149,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 2,
+"execution_count": 4,
 "id": "082d557c",
 "metadata": {},
 "outputs": [],
@@ -159,14 +159,54 @@
 },
 {
 "cell_type": "code",
-"execution_count": 3,
-"id": "5c41106f",
+"execution_count": null,
+"id": "df11c953",
 "metadata": {},
 "outputs": [],
 "source": [
 "data = loader.load()"
 ]
 },
+{
+"cell_type": "markdown",
+"id": "09957371",
+"metadata": {},
+"source": [
+"### Retain Elements\n",
+"\n",
+"Under the hood, Unstructured creates different \"elements\" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode=\"elements\"`."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "0fab833b",
+"metadata": {},
+"outputs": [],
+"source": [
+"loader = UnstructuredPDFLoader(\"example_data/layout-parser-paper.pdf\", mode=\"elements\")"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "c3e8ff1b",
+"metadata": {},
+"outputs": [],
+"source": [
+"data = loader.load()"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "43c23d2d",
+"metadata": {},
+"outputs": [],
+"source": [
+"data[0]"
+]
+},
 {
 "cell_type": "markdown",
 "id": "21998d18",
@@ -177,7 +217,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 1,
+"execution_count": 7,
 "id": "2f0cc9ff",
 "metadata": {},
 "outputs": [],
@@ -187,7 +227,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 2,
+"execution_count": 8,
 "id": "42b531e8",
 "metadata": {},
 "outputs": [],
@@ -197,7 +237,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 3,
+"execution_count": 9,
 "id": "010d5cdd",
 "metadata": {},
 "outputs": [],