feat: Update Google Document AI Parser (#11413)
- **Description:** Code refactoring and documentation improvements for the Google Document AI PDF parser.
  - Adds an online (synchronous) processing option.
  - Adds a default field mask to limit payload size.
  - Skips human review by default.
- **Issue:** Fixes #10589

---------

Co-authored-by: Erick Friis <erick@langchain.dev>
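As a quick orientation before the diff, here is a minimal sketch of the simple parsing path the updated notebook documents, assembled only from calls that appear in the notebook below; the bucket, folder, and processor values are placeholders:

```python
from langchain.document_loaders.blob_loaders import Blob
from langchain.document_loaders.parsers import DocAIParser

# Placeholders: use your own GCS folder and Document AI processor.
GCS_OUTPUT_PATH = "gs://BUCKET_NAME/FOLDER_PATH"
PROCESSOR_NAME = "projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID"

parser = DocAIParser(
    location="us", processor_name=PROCESSOR_NAME, gcs_output_path=GCS_OUTPUT_PATH
)

# Parse a PDF stored on GCS; lazy_parse() yields one Document per page.
blob = Blob(path="gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/2022Q1_alphabet_earnings_release.pdf")
docs = list(parser.lazy_parse(blob))
print(len(docs))  # 11 pages for this sample report, per the notebook
```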
@@ -2,39 +2,45 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "310fce10-e051-40db-89b0-5b5bb85cd145",
+   "id": "b317191d",
    "metadata": {},
    "source": [
-    "# Document AI\n"
+    "# Google Cloud Document AI\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "f95ac25b-f025-40c3-95b8-77919fc4da7f",
+   "id": "a19e6f94",
    "metadata": {},
    "source": [
-    ">[Document AI](https://cloud.google.com/document-ai/docs/overview) is a `Google Cloud Platform` service to transform unstructured data from documents into structured data, making it easier to understand, analyze, and consume. "
+    "Document AI is a document understanding platform from Google Cloud to transform unstructured data from documents into structured data, making it easier to understand, analyze, and consume.\n",
+    "\n",
+    "Learn more:\n",
+    "\n",
+    "- [Document AI overview](https://cloud.google.com/document-ai/docs/overview)\n",
+    "- [Document AI videos and labs](https://cloud.google.com/document-ai/docs/videos)\n",
+    "- [Try it!](https://cloud.google.com/document-ai/docs/drag-and-drop)\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "275f2193-248f-4565-a872-93a89589cf2b",
+   "id": "184c0af8",
    "metadata": {},
    "source": [
     "The module contains a `PDF` parser based on DocAI from Google Cloud.\n",
     "\n",
-    "You need to install two libraries to use this parser:"
+    "You need to install two libraries to use this parser:\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "34132fab-0069-4942-b68b-5b093ccfc92a",
+   "id": "c86b2f59",
    "metadata": {},
    "outputs": [],
    "source": [
-    "!pip install google-cloud-documentai\n",
-    "!pip install google-cloud-documentai-toolbox"
+    "%pip install google-cloud-documentai\n",
+    "%pip install google-cloud-documentai-toolbox\n"
    ]
   },
   {
@@ -42,8 +48,9 @@
    "id": "51946817-798c-4d11-abd6-db2ae53a0270",
    "metadata": {},
    "source": [
-    "First, you need to set up a [`GCS` bucket and create your own OCR processor](https://cloud.google.com/document-ai/docs/create-processor) \n",
-    "The `GCS_OUTPUT_PATH` should be a path to a folder on GCS (starting with `gs://`) and a processor name should look like `projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID`. You can get it either programmatically or copy from the `Prediction endpoint` section of the `Processor details` tab in the Google Cloud Console."
+    "First, you need to set up a Google Cloud Storage (GCS) bucket and create your own Optical Character Recognition (OCR) processor as described here: https://cloud.google.com/document-ai/docs/create-processor\n",
+    "\n",
+    "The `GCS_OUTPUT_PATH` should be a path to a folder on GCS (starting with `gs://`) and a `PROCESSOR_NAME` should look like `projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID` or `projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID/processorVersions/PROCESSOR_VERSION_ID`. You can get it either programmatically or copy from the `Prediction endpoint` section of the `Processor details` tab in the Google Cloud Console.\n"
    ]
   },
   {
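The markdown cell above mentions that the processor name can be obtained programmatically. A sketch of one way to build it using the path helper on the generated Document AI client (assuming `google-cloud-documentai` is installed; the project and processor values are placeholders, and this helper comes from the standard client library, not from the notebook itself):

```python
from google.cloud import documentai

# Path helper on the generated Document AI client; values are placeholders.
PROCESSOR_NAME = documentai.DocumentProcessorServiceClient.processor_path(
    "PROJECT_NUMBER", "us", "PROCESSOR_ID"
)
print(PROCESSOR_NAME)
# -> projects/PROJECT_NUMBER/locations/us/processors/PROCESSOR_ID
```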
@@ -53,9 +60,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "PROJECT = \"PUT_SOMETHING_HERE\"\n",
-    "GCS_OUTPUT_PATH = \"PUT_SOMETHING_HERE\"\n",
-    "PROCESSOR_NAME = \"PUT_SOMETHING_HERE\""
+    "GCS_OUTPUT_PATH = \"gs://BUCKET_NAME/FOLDER_PATH\"\n",
+    "PROCESSOR_NAME = \"projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID\"\n"
    ]
   },
   {
@@ -66,7 +72,7 @@
    "outputs": [],
    "source": [
     "from langchain.document_loaders.blob_loaders import Blob\n",
-    "from langchain.document_loaders.parsers import DocAIParser"
+    "from langchain.document_loaders.parsers import DocAIParser\n"
    ]
   },
   {
@@ -74,7 +80,7 @@
    "id": "fad2bcca-1c0e-4888-b82d-15823ba57e60",
    "metadata": {},
    "source": [
-    "Now, let's create a parser:"
+    "Now, create a `DocAIParser`.\n"
    ]
   },
   {
@@ -84,7 +90,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "parser = DocAIParser(location=\"us\", processor_name=PROCESSOR_NAME, gcs_output_path=GCS_OUTPUT_PATH)"
+    "parser = DocAIParser(\n",
+    "    location=\"us\", processor_name=PROCESSOR_NAME, gcs_output_path=GCS_OUTPUT_PATH)\n"
    ]
   },
   {
@@ -92,7 +99,11 @@
    "id": "b8b5a3ff-650a-4ad3-a73a-395f86e4c9e1",
    "metadata": {},
    "source": [
-    "Let's go and parse an Alphabet's take from here: https://abc.xyz/assets/a7/5b/9e5ae0364b12b4c883f3cf748226/goog-exhibit-99-1-q1-2023-19.pdf. Copy it to your GCS bucket first, and adjust the path below."
+    "For this example, you can use an Alphabet earnings report that's uploaded to a public GCS bucket.\n",
+    "\n",
+    "[2022Q1_alphabet_earnings_release.pdf](https://storage.googleapis.com/cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/2022Q1_alphabet_earnings_release.pdf)\n",
+    "\n",
+    "Pass the document to the `lazy_parse()` method to\n"
    ]
   },
   {
@@ -102,17 +113,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "blob = Blob(path=\"gs://vertex-pgt/examples/goog-exhibit-99-1-q1-2023-19.pdf\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 5,
-   "id": "6ef84fad-2981-456d-a6b4-3a6a1a46d511",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "docs = list(parser.lazy_parse(blob))"
+    "blob = Blob(path=\"gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/2022Q1_alphabet_earnings_release.pdf\")\n"
    ]
   },
   {
@@ -120,7 +121,7 @@
    "id": "3f8e4ee1-e07d-4c29-a120-4d56aae91859",
    "metadata": {},
    "source": [
-    "We'll get one document per page, 11 in total:"
+    "We'll get one document per page, 11 in total:\n"
    ]
   },
   {
@@ -138,7 +139,8 @@
     }
    ],
    "source": [
-    "print(len(docs))"
+    "docs = list(parser.lazy_parse(blob))\n",
+    "print(len(docs))\n"
    ]
   },
   {
@@ -146,7 +148,7 @@
    "id": "b104ae56-011b-4abe-ac07-e999c69494c5",
    "metadata": {},
    "source": [
-    "You can run end-to-end parsing of a blob one-by-one. If you have many documents, it might be a better approach to batch them together and maybe even detach parsing from handling the results of parsing."
+    "You can run end-to-end parsing of a blob one-by-one. If you have many documents, it might be a better approach to batch them together and maybe even detach parsing from handling the results of parsing.\n"
    ]
   },
   {
@@ -165,7 +167,7 @@
    ],
    "source": [
     "operations = parser.docai_parse([blob])\n",
-    "print([op.operation.name for op in operations])"
+    "print([op.operation.name for op in operations])\n"
    ]
   },
   {
@@ -173,7 +175,7 @@
    "id": "a2d24d63-c2c7-454c-9df3-2a9cf51309a6",
    "metadata": {},
    "source": [
-    "You can check whether operations are finished:"
+    "You can check whether operations are finished:\n"
    ]
   },
   {
@@ -194,7 +196,7 @@
     }
    ],
    "source": [
-    "parser.is_running(operations)"
+    "parser.is_running(operations)\n"
    ]
   },
   {
@@ -202,7 +204,7 @@
    "id": "602ca0bc-080a-4a4e-a413-0e705aeab189",
    "metadata": {},
    "source": [
-    "And when they're finished, you can parse the results:"
+    "And when they're finished, you can parse the results:\n"
    ]
   },
   {
@@ -223,7 +225,7 @@
     }
    ],
    "source": [
-    "parser.is_running(operations)"
+    "parser.is_running(operations)\n"
    ]
   },
   {
@@ -242,7 +244,7 @@
    ],
    "source": [
     "results = parser.get_results(operations)\n",
-    "print(results[0])"
+    "print(results[0])\n"
    ]
   },
   {
@@ -250,7 +252,7 @@
    "id": "87e5b606-1679-46c7-9577-4cf9bc93a752",
    "metadata": {},
    "source": [
-    "And now we can finally generate Documents from parsed results:"
+    "And now we can finally generate Documents from parsed results:\n"
    ]
   },
   {
@@ -260,7 +262,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "docs = list(parser.parse_from_results(results))"
+    "docs = list(parser.parse_from_results(results))\n"
    ]
   },
   {
@@ -278,7 +280,7 @@
     }
    ],
    "source": [
-    "print(len(docs))"
+    "print(len(docs))\n"
    ]
   }
  ],
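Taken together, the batch cells above compose into the detached flow the notebook describes. A minimal sketch using only calls shown in the notebook; the polling loop and its 10-second interval are an arbitrary choice, not something the notebook prescribes:

```python
import time

# Kick off one long-running Document AI operation per blob;
# results are written under GCS_OUTPUT_PATH.
operations = parser.docai_parse([blob])
print([op.operation.name for op in operations])

# Wait for the operations to finish (10 s is an arbitrary poll interval).
while parser.is_running(operations):
    time.sleep(10)

# Fetch the operation results from GCS and turn them into Documents.
results = parser.get_results(operations)
docs = list(parser.parse_from_results(results))
print(len(docs))
```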
@@ -290,7 +292,7 @@
    "uri": "gcr.io/deeplearning-platform-release/base-cpu:m109"
   },
   "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
+   "display_name": "Python 3",
    "language": "python",
    "name": "python3"
   },
@@ -304,7 +306,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.12"
+   "version": "3.10.11"
   }
  },
  "nbformat": 4,