diff --git a/docs/docs/integrations/document_loaders/amazon_textract.ipynb b/docs/docs/integrations/document_loaders/amazon_textract.ipynb index 9eb475180be..318a22e29ae 100644 --- a/docs/docs/integrations/document_loaders/amazon_textract.ipynb +++ b/docs/docs/integrations/document_loaders/amazon_textract.ipynb @@ -216,7 +216,13 @@ "source": [ "## Example 4: Customizing the output format\n", "\n", - "You have the option to pass an additional parameter called `linearization_config` to the AmazonTextractPDFLoader which will determine how the text output will be linearized by the parser after Textract runs." + "When Amazon Textract processes a PDF, it extracts all text, including elements like headers, footers, and page numbers. This extra information can be \"noisy\" and reduce the effectiveness of the output.\n", + "\n", + "The process of converting a document's 2D layout into a clean, one-dimensional string of text is called linearization.\n", + "\n", + "The AmazonTextractPDFLoader gives you precise control over this process with the `linearization_config` parameter. You can use it to specify which elements to exclude from the final output.\n", + "\n", + "The following example shows how to hide headers, footers, and figures, resulting in a much cleaner text block. For more advanced use cases, see this [AWS blog post](https://aws.amazon.com/blogs/machine-learning/amazon-textracts-new-layout-feature-introduces-efficiencies-in-general-purpose-and-generative-ai-document-processing-tasks/)." ] }, {