AmazonTextractPDFLoader documentation updates (#9415)

Description: Updating documentation to add AmazonTextractPDFLoader according to [comment](https://github.com/langchain-ai/langchain/pull/8661#issuecomment-1666572992) from [baskaryan](https://github.com/baskaryan). Adding one notebook and instructions to the modules/data_connection/document_loaders/pdf.mdx.

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
2025-10-23 11:16:58 +00:00 · 2023-08-21 01:40:15 +02:00
parent 08feed3332
commit 0c8a88b3fa
2 changed files with 893 additions and 1 deletions
--- a/docs/snippets/modules/data_connection/document_loaders/how_to/pdf.mdx
+++ b/docs/snippets/modules/data_connection/document_loaders/how_to/pdf.mdx
@@ -1,4 +1,4 @@
-## Using PyPDF
+# Using PyPDF

 Load PDF using `pypdf` into array of documents, where each document contains the page content and metadata with `page` number.

@@ -389,3 +389,17 @@ data[0]
 ```

 </CodeOutputBlock>
+
+## Using AmazonTextractPDFLoader
+
+The AmazonTextractPDFLoader calls the [Amazon Textract Service](https://aws.amazon.com/textract/) to convert PDFs into a Document structure. The loader currently performs pure OCR; more features, such as layout support, are planned depending on demand. Single- and multi-page documents are supported, up to 3000 pages and 512 MB in size.
+
+For the call to succeed, an AWS account is required, similar to the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) requirements.
+
+Apart from the AWS configuration, it is very similar to the other PDF loaders, and it also supports JPEG, PNG, TIFF, and non-native PDF formats.
+
+```python
+from langchain.document_loaders import AmazonTextractPDFLoader
+loader = AmazonTextractPDFLoader("example_data/alejandro_rosalez_sample-small.jpeg")
+documents = loader.load()
+```
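
A minimal sketch, not part of the diff above, of how the documented limits could be checked locally before calling the loader. The helper names `is_supported_input` and `load_with_textract` are hypothetical and do not exist in LangChain. The suffix list is an assumption derived from the formats named in the added text (`.jpg` and `.tif` are included as common aliases), and the size cap simply mirrors the 512 MB figure stated there.

```python
from pathlib import Path

# Formats and size limit taken from the documentation text in the diff.
# The ".jpg" and ".tif" aliases are an assumption, not stated in the source.
SUPPORTED_SUFFIXES = {".pdf", ".jpeg", ".jpg", ".png", ".tiff", ".tif"}
MAX_SIZE_BYTES = 512 * 1024 * 1024  # 512 MB, as stated in the docs text


def is_supported_input(file_path: str, size_bytes: int) -> bool:
    """Return True if the suffix and size fit the documented limits."""
    suffix = Path(file_path).suffix.lower()
    return suffix in SUPPORTED_SUFFIXES and 0 < size_bytes <= MAX_SIZE_BYTES


def load_with_textract(file_path: str):
    """Validate a local file, then load it with AmazonTextractPDFLoader."""
    size_bytes = Path(file_path).stat().st_size
    if not is_supported_input(file_path, size_bytes):
        raise ValueError(f"Unsupported format or size: {file_path}")
    # Imported lazily so the validation helper works without LangChain installed.
    from langchain.document_loaders import AmazonTextractPDFLoader

    loader = AmazonTextractPDFLoader(file_path)
    return loader.load()
```

Calling `load_with_textract("example_data/alejandro_rosalez_sample-small.jpeg")` would behave like the snippet in the diff, but fail early with a `ValueError` for an unsupported suffix or an oversized file instead of making a service call. It still requires valid AWS credentials to succeed.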