docs: adding context for Textract linearization-config param (#32064)

Before jumping into the technical implementation, I added context for the
`linearization_config` parameter and explained what linearization means in
this context.
I also linked an AWS blog post for more advanced use cases, since this single
example doesn't cover them all.
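
For reviewers unfamiliar with the term, here is a toy sketch of what linearization means in this context. It is not Textract's or the loader's implementation: `Block`, `linearize`, and `hidden_kinds` are hypothetical names made up for illustration, and the reading-order rule (top to bottom, then left to right) is a simplifying assumption. The real loader is configured through `linearization_config`.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical layout element; not a Textract type."""
    kind: str    # assumed layout label, e.g. "header", "text", "footer"
    top: float   # distance from the top of the page, 0.0 to 1.0
    left: float  # distance from the left edge, 0.0 to 1.0
    text: str


def linearize(blocks, hidden_kinds=frozenset()):
    """Flatten 2D layout blocks into one string, skipping hidden kinds."""
    visible = [b for b in blocks if b.kind not in hidden_kinds]
    # Assumed reading order: top to bottom, then left to right.
    ordered = sorted(visible, key=lambda b: (b.top, b.left))
    return "\n".join(b.text for b in ordered)


page = [
    Block("footer", 0.95, 0.1, "Page 3"),
    Block("text", 0.30, 0.1, "Linearization flattens the layout."),
    Block("header", 0.02, 0.1, "ACME Quarterly Report"),
    Block("text", 0.50, 0.1, "Noise such as headers can be dropped."),
]

# Hide headers and footers, keeping only the body text in reading order.
clean = linearize(page, hidden_kinds={"header", "footer"})
print(clean)
```

Calling `linearize(page)` with no hidden kinds keeps the header and footer lines, which is the "noisy" output the new docs text describes.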

---------

Co-authored-by: Mason Daugherty <mason@langchain.dev>
Ahmad Elmalah 2025-07-16 17:17:20 +03:00 committed by GitHub
parent 2ab2cab203
commit 1892a67eef

@@ -216,7 +216,13 @@
"source": [
"## Example 4: Customizing the output format\n",
"\n",
"You have the option to pass an additional parameter called `linearization_config` to the AmazonTextractPDFLoader which will determine how the text output will be linearized by the parser after Textract runs."
"When Amazon Textract processes a PDF, it extracts all text, including elements like headers, footers, and page numbers. This extra information can be \"noisy\" and reduce the effectiveness of the output.\n",
"\n",
"The process of converting a document's 2D layout into a clean, one-dimensional string of text is called linearization.\n",
"\n",
"The AmazonTextractPDFLoader gives you precise control over this process with the `linearization_config` parameter. You can use it to specify which elements to exclude from the final output.\n",
"\n",
"The following example shows how to hide headers, footers, and figures, resulting in a much cleaner text block. For more advanced use cases, see this [AWS blog post](https://aws.amazon.com/blogs/machine-learning/amazon-textracts-new-layout-feature-introduces-efficiencies-in-general-purpose-and-generative-ai-document-processing-tasks/)."
]
},
{