community[patch]: adding linearization config to AmazonTextractPDFLoader (#17489)

- **Description:** Adding an optional parameter `linearization_config` to the `AmazonTextractPDFLoader` so the caller can define how the output will be linearized, instead of forcing a predefined set of linearization configs. It will still have a default configuration as this will be an optional parameter.
- **Issue:** #17457
- **Dependencies:** The same ones that already exist for `AmazonTextractPDFLoader`
- **Twitter handle:** [@lvieirajr19](https://twitter.com/lvieirajr19)

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2025-09-11 16:01:33 +00:00 · 2024-03-08 17:25:22 -08:00
parent 37e89ba5b1
commit 67c880af74
3 changed files with 86 additions and 16 deletions
--- a/docs/docs/integrations/document_loaders/amazon_textract.ipynb
+++ b/docs/docs/integrations/document_loaders/amazon_textract.ipynb
@@ -206,6 +206,42 @@
    "len(documents)"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "a56ba97505c8d140",
+   "metadata": {
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    }
+   },
+   "source": [
+    "## Sample 4\n",
+    "\n",
+    "You have the option to pass an additional parameter called `linearization_config` to the AmazonTextractPDFLoader, which will determine how the text output will be linearized by the parser after Textract runs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1efbc4b6-f3cb-45c5-bbe8-16e7df060b92",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_community.document_loaders import AmazonTextractPDFLoader\n",
+    "from textractor.data.text_linearization_config import TextLinearizationConfig\n",
+    "\n",
+    "loader = AmazonTextractPDFLoader(\n",
+    "    \"s3://amazon-textract-public-content/langchain/layout-parser-paper.pdf\",\n",
+    "    linearization_config=TextLinearizationConfig(\n",
+    "        hide_header_layout=True,\n",
+    "        hide_footer_layout=True,\n",
+    "        hide_figure_layout=True,\n",
+    "    ),\n",
+    ")\n",
+    "documents = loader.load()"
+   ]
+  },
  {
   "cell_type": "markdown",
   "id": "b3e41b4d-b159-4274-89be-80d8159134ef",
@@ -276,11 +312,14 @@
   ]
  },
  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "1a09d18b-ab7b-468e-ae66-f92abf666b9b",
-   "metadata": {},
-   "outputs": [],
+   "cell_type": "markdown",
+   "id": "bd97f1c90aff6a83",
+   "metadata": {
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    }
+   },
   "source": []
  }
 ],
@@ -876,7 +915,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.10.12"
+   "version": "3.10.13"
  }
 },
 "nbformat": 4,