mirror of
https://github.com/hwchase17/langchain.git
synced 2025-09-11 16:01:33 +00:00
community[patch]: adding linearization config to AmazonTextractPDFLoader (#17489)
- **Description:** Adding an optional parameter `linearization_config` to the `AmazonTextractPDFLoader` so the caller can define how the output will be linearized, instead of forcing a predefined set of linearization configs. It will still have a default configuration as this will be an optional parameter.
- **Issue:** #17457
- **Dependencies:** The same ones that already exist for `AmazonTextractPDFLoader`
- **Twitter handle:** [@lvieirajr19](https://twitter.com/lvieirajr19)

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
committed by GitHub
parent 37e89ba5b1
commit 67c880af74
@@ -206,6 +206,42 @@
 "len(documents)"
 ]
 },
+{
+"cell_type": "markdown",
+"id": "a56ba97505c8d140",
+"metadata": {
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+}
+},
+"source": [
+"## Sample 4\n",
+"\n",
+"You have the option to pass an additional parameter called `linearization_config` to the AmazonTextractPDFLoader which will determine how the text output will be linearized by the parser after Textract runs."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "1efbc4b6-f3cb-45c5-bbe8-16e7df060b92",
+"metadata": {},
+"outputs": [],
+"source": [
+"from langchain_community.document_loaders import AmazonTextractPDFLoader\n",
+"from textractor.data.text_linearization_config import TextLinearizationConfig\n",
+"\n",
+"loader = AmazonTextractPDFLoader(\n",
+"    \"s3://amazon-textract-public-content/langchain/layout-parser-paper.pdf\",\n",
+"    linearization_config=TextLinearizationConfig(\n",
+"        hide_header_layout=True,\n",
+"        hide_footer_layout=True,\n",
+"        hide_figure_layout=True,\n",
+"    ),\n",
+")\n",
+"documents = loader.load()"
+]
+},
 {
 "cell_type": "markdown",
 "id": "b3e41b4d-b159-4274-89be-80d8159134ef",
@@ -276,11 +312,14 @@
 ]
 },
 {
-"cell_type": "code",
-"execution_count": null,
-"id": "1a09d18b-ab7b-468e-ae66-f92abf666b9b",
-"metadata": {},
-"outputs": [],
+"cell_type": "markdown",
+"id": "bd97f1c90aff6a83",
+"metadata": {
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+}
+},
 "source": []
 }
 ],
@@ -876,7 +915,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.10.12"
+"version": "3.10.13"
 }
 },
 "nbformat": 4,
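
The commit description says `linearization_config` is optional and that a default configuration is used when it is omitted. The following is a minimal, self-contained sketch of that pattern only. `SketchLinearizationConfig`, `SketchLoader`, `DEFAULT_CONFIG`, and the default values are hypothetical stand-ins invented for illustration. They are not the textractor `TextLinearizationConfig` class or the actual `AmazonTextractPDFLoader` implementation, and the real default settings may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class SketchLinearizationConfig:
    """Hypothetical stand-in for a linearization config (not the textractor class)."""

    hide_header_layout: bool = False
    hide_footer_layout: bool = False
    hide_figure_layout: bool = False


# Hypothetical default, used only when the caller supplies no config.
DEFAULT_CONFIG = SketchLinearizationConfig(hide_figure_layout=True)


class SketchLoader:
    """Hypothetical loader showing an optional config parameter with a fallback."""

    def __init__(
        self,
        file_path: str,
        linearization_config: Optional[SketchLinearizationConfig] = None,
    ) -> None:
        self.file_path = file_path
        # Fall back to the default only when no config was passed.
        self.linearization_config = (
            linearization_config
            if linearization_config is not None
            else DEFAULT_CONFIG
        )

    def linearize(self, blocks: Iterable[Tuple[str, str]]) -> str:
        """Join (layout_type, text) pairs, skipping layout types the config hides."""
        cfg = self.linearization_config
        hidden = set()
        if cfg.hide_header_layout:
            hidden.add("header")
        if cfg.hide_footer_layout:
            hidden.add("footer")
        if cfg.hide_figure_layout:
            hidden.add("figure")
        return "\n".join(text for kind, text in blocks if kind not in hidden)
```

In this sketch, a caller who omits `linearization_config` gets the made-up default, while a caller who passes a config with all three `hide_*` flags set keeps only the body text, which mirrors the shape of the notebook cell added in this commit.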