langchain/docs/docs/integrations/document_transformers/google_docai.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b317191d",
   "metadata": {},
   "source": [
    "# Google Cloud Document AI\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a19e6f94",
   "metadata": {},
   "source": [
    "Document AI is a document understanding platform from Google Cloud to transform unstructured data from documents into structured data, making it easier to understand, analyze, and consume.\n",
    "\n",
    "Learn more:\n",
    "\n",
    "- [Document AI overview](https://cloud.google.com/document-ai/docs/overview)\n",
    "- [Document AI videos and labs](https://cloud.google.com/document-ai/docs/videos)\n",
    "- [Try it!](https://cloud.google.com/document-ai/docs/drag-and-drop)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "184c0af8",
   "metadata": {},
   "source": [
    "The module contains a `PDF` parser based on DocAI from Google Cloud.\n",
    "\n",
    "You need to install two libraries to use this parser:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c86b2f59",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install --upgrade --quiet  langchain-google-community[docai]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51946817-798c-4d11-abd6-db2ae53a0270",
   "metadata": {},
   "source": [
    "First, you need to set up a Google Cloud Storage (GCS) bucket and create your own Optical Character Recognition (OCR) processor as described here: https://cloud.google.com/document-ai/docs/create-processor\n",
    "\n",
    "The `GCS_OUTPUT_PATH` should be a path to a folder on GCS (starting with `gs://`) and a `PROCESSOR_NAME` should look like `projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID` or `projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID/processorVersions/PROCESSOR_VERSION_ID`. You can get it either programmatically or copy from the `Prediction endpoint` section of the `Processor details` tab in the Google Cloud Console.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "ac85f7f3-3ef6-41d5-920a-b55f2939c202",
   "metadata": {},
   "outputs": [],
   "source": [
    "GCS_OUTPUT_PATH = \"gs://BUCKET_NAME/FOLDER_PATH\"\n",
    "PROCESSOR_NAME = \"projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "48438efb-9f0d-473b-a91c-9f1e29c2539d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.document_loaders.blob_loaders import Blob\n",
    "from langchain_google_community import DocAIParser"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fad2bcca-1c0e-4888-b82d-15823ba57e60",
   "metadata": {},
   "source": [
    "Now, create a `DocAIParser`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "dcc0c65a-86c5-448d-8b21-2e564b1903b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "parser = DocAIParser(\n",
    "    location=\"us\", processor_name=PROCESSOR_NAME, gcs_output_path=GCS_OUTPUT_PATH\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8b5a3ff-650a-4ad3-a73a-395f86e4c9e1",
   "metadata": {},
   "source": [
    "For this example, you can use an Alphabet earnings report that's uploaded to a public GCS bucket.\n",
    "\n",
    "[2022Q1_alphabet_earnings_release.pdf](https://storage.googleapis.com/cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/2022Q1_alphabet_earnings_release.pdf)\n",
    "\n",
    "Pass the document to the `lazy_parse()` method to\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "373cc18e-a311-4c8d-8180-47e4ade1d2ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "blob = Blob(\n",
    "    path=\"gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/2022Q1_alphabet_earnings_release.pdf\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f8e4ee1-e07d-4c29-a120-4d56aae91859",
   "metadata": {},
   "source": [
    "We'll get one document per page, 11 in total:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "343919f5-35d2-47fb-9790-de464649ebdf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "11\n"
     ]
    }
   ],
   "source": [
    "docs = list(parser.lazy_parse(blob))\n",
    "print(len(docs))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b104ae56-011b-4abe-ac07-e999c69494c5",
   "metadata": {},
   "source": [
    "You can run end-to-end parsing of a blob one-by-one. If you have many documents, it might be a better approach to batch them together and maybe even detach parsing from handling the results of parsing.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "9ecc1b99-5cef-47b0-a125-dbb2c41d2224",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['projects/543079149601/locations/us/operations/16447136779727347991']\n"
     ]
    }
   ],
   "source": [
    "operations = parser.docai_parse([blob])\n",
    "print([op.operation.name for op in operations])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2d24d63-c2c7-454c-9df3-2a9cf51309a6",
   "metadata": {},
   "source": [
    "You can check whether operations are finished:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "ab11efb0-e514-4f44-9ba5-3d638a59c9e6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "parser.is_running(operations)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "602ca0bc-080a-4a4e-a413-0e705aeab189",
   "metadata": {},
   "source": [
    "And when they're finished, you can parse the results:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "ec1e6041-bc10-47d4-ba64-d09055c14f27",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "parser.is_running(operations)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "95d89da4-1c8a-413d-8473-ddd4a39375a5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DocAIParsingResults(source_path='gs://vertex-pgt/examples/goog-exhibit-99-1-q1-2023-19.pdf', parsed_path='gs://vertex-pgt/test/run1/16447136779727347991/0')\n"
     ]
    }
   ],
   "source": [
    "results = parser.get_results(operations)\n",
    "print(results[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87e5b606-1679-46c7-9577-4cf9bc93a752",
   "metadata": {},
   "source": [
    "And now we can finally generate Documents from parsed results:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "08e8878d-889b-41ad-9500-2f772d38782f",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = list(parser.parse_from_results(results))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "c59525fb-448d-444b-8f12-c4aea791e19b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "11\n"
     ]
    }
   ],
   "source": [
    "print(len(docs))"
   ]
  }
 ],
 "metadata": {
  "environment": {
   "kernel": "python3",
   "name": "common-cpu.m109",
   "type": "gcloud",
   "uri": "gcr.io/deeplearning-platform-release/base-cpu:m109"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}