langchain/docs/docs/integrations/document_loaders/google_spanner.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Google Spanner\n",
    "\n",
    "> [Spanner](https://cloud.google.com/spanner) is a highly scalable database that combines unlimited scalability with relational semantics, such as secondary indexes, strong consistency, schemas, and SQL providing 99.999% availability in one easy solution. Extend your database application to build AI-powered experiences leveraging Spanner's Langchain integrations.\n",
    "\n",
    "This notebook goes over how to use [Spanner](https://cloud.google.com/spanner) to [save, load and delete langchain documents](https://python.langchain.com/docs/modules/data_connection/document_loaders/) with `SpannerLoader` and `SpannerDocumentSaver`.\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/googleapis/langchain-google-spanner-python/blob/main/docs/document_loader.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Before You Begin\n",
    "\n",
    "To run this notebook, you will need to do the following:\n",
    "\n",
    "* [Create a Google Cloud Project](https://developers.google.com/workspace/guides/create-project)\n",
    "* [Create a Spanner instance](https://cloud.google.com/spanner/docs/create-manage-instances)\n",
    "* [Create a Spanner database](https://cloud.google.com/spanner/docs/create-manage-databases)\n",
    "* [Create a Spanner table](https://cloud.google.com/spanner/docs/create-query-database-console#create-schema)\n",
    "\n",
    "After confirmed access to database in the runtime environment of this notebook, filling the following values and run the cell before running example scripts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# @markdown Please specify an instance id, a database, and a table for demo purpose.\n",
    "INSTANCE_ID = \"test_instance\"  # @param {type:\"string\"}\n",
    "DATABASE_ID = \"test_database\"  # @param {type:\"string\"}\n",
    "TABLE_NAME = \"test_table\"  # @param {type:\"string\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 🦜🔗 Library Installation\n",
    "\n",
    "The integration lives in its own `langchain-google-spanner` package, so we need to install it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%pip install -upgrade --quiet langchain-google-spanner"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Colab only**: Uncomment the following cell to restart the kernel or use the button to restart the kernel. For Vertex AI Workbench you can restart the terminal using the button on top."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # Automatically restart kernel after installs so that your environment can access the new packages\n",
    "# import IPython\n",
    "\n",
    "# app = IPython.Application.instance()\n",
    "# app.kernel.do_shutdown(True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ☁ Set Your Google Cloud Project\n",
    "Set your Google Cloud project so that you can leverage Google Cloud resources within this notebook.\n",
    "\n",
    "If you don't know your project ID, try the following:\n",
    "\n",
    "* Run `gcloud config list`.\n",
    "* Run `gcloud projects list`.\n",
    "* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# @markdown Please fill in the value below with your Google Cloud project ID and then run the cell.\n",
    "\n",
    "PROJECT_ID = \"my-project-id\"  # @param {type:\"string\"}\n",
    "\n",
    "# Set the project id\n",
    "!gcloud config set project {PROJECT_ID}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 🔐 Authentication\n",
    "\n",
    "Authenticate to Google Cloud as the IAM user logged into this notebook in order to access your Google Cloud Project.\n",
    "\n",
    "- If you are using Colab to run this notebook, use the cell below and continue.\n",
    "- If you are using Vertex AI Workbench, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from google.colab import auth\n",
    "\n",
    "auth.authenticate_user()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Usage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save documents\n",
    "\n",
    "Save langchain documents with `SpannerDocumentSaver.add_documents(<documents>)`. To initialize `SpannerDocumentSaver` class you need to provide 3 things:\n",
    "\n",
    "1. `instance_id` - An instance of Spanner to load data from.\n",
    "1. `database_id` - An instance of Spanner database to load data from.\n",
    "1. `table_name` - The name of the table within the Spanner database to store langchain documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.documents import Document\n",
    "from langchain_google_spanner import SpannerDocumentSaver\n",
    "\n",
    "test_docs = [\n",
    "    Document(\n",
    "        page_content=\"Apple Granny Smith 150 0.99 1\",\n",
    "        metadata={\"fruit_id\": 1},\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"Banana Cavendish 200 0.59 0\",\n",
    "        metadata={\"fruit_id\": 2},\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"Orange Navel 80 1.29 1\",\n",
    "        metadata={\"fruit_id\": 3},\n",
    "    ),\n",
    "]\n",
    "\n",
    "saver = SpannerDocumentSaver(\n",
    "    instance_id=INSTANCE_ID,\n",
    "    database_id=DATABASE_ID,\n",
    "    table_name=TABLE_NAME,\n",
    ")\n",
    "saver.add_documents(test_docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Querying for Documents from Spanner\n",
    "\n",
    "For more details on connecting to a Spanner table, please check the [Python SDK documentation](https://cloud.google.com/python/docs/reference/spanner/latest)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Load documents from table\n",
    "\n",
    "Load langchain documents with `SpannerLoader.load()` or `SpannerLoader.lazy_load()`. `lazy_load` returns a generator that only queries database during the iteration. To initialize `SpannerLoader` class you need to provide:\n",
    "\n",
    "1. `instance_id` - An instance of Spanner to load data from.\n",
    "1. `database_id` - An instance of Spanner database to load data from.\n",
    "1. `query` - A query of the database dialect."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_google_spanner import SpannerLoader\n",
    "\n",
    "query = f\"SELECT * from {TABLE_NAME}\"\n",
    "loader = SpannerLoader(\n",
    "    instance_id=INSTANCE_ID,\n",
    "    database_id=DATABASE_ID,\n",
    "    query=query,\n",
    ")\n",
    "\n",
    "for doc in loader.lazy_load():\n",
    "    print(doc)\n",
    "    break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Delete documents\n",
    "\n",
    "Delete a list of langchain documents from the table with `SpannerDocumentSaver.delete(<documents>)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = loader.load()\n",
    "print(\"Documents before delete:\", docs)\n",
    "\n",
    "doc = test_docs[0]\n",
    "saver.delete([doc])\n",
    "print(\"Documents after delete:\", loader.load())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Advanced Usage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Customize Document Page Content & Metadata\n",
    "\n",
    "The loader will returns a list of Documents with page content from a specific data columns. All other data columns will be added to metadata. Each row becomes a document."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Customize page content format\n",
    "\n",
    "The SpannerLoader assumes there is a column called `page_content`. These defaults can be changed like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "custom_content_loader = SpannerLoader(\n",
    "    INSTANCE_ID, DATABASE_ID, query, content_columns=[\"custom_content\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If multiple columns are specified, the page content's string format will default to `text` (space-separated string concatenation). There are other format that user can specify, including `text`, `JSON`, `YAML`, `CSV`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Customize metadata format\n",
    "\n",
    "The SpannerLoader assumes there is a metadata column called `langchain_metadata` that store JSON data. The metadata column will be used as the base dictionary. By default, all other column data will be added and may overwrite the original value. These defaults can be changed like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "custom_metadata_loader = SpannerLoader(\n",
    "    INSTANCE_ID, DATABASE_ID, query, metadata_columns=[\"column1\", \"column2\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Customize JSON metadata column name\n",
    "\n",
    "By default, the loader uses `langchain_metadata` as the base dictionary. This can be customized to select a JSON column to use as base dictionary for the Document's metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "custom_metadata_json_loader = SpannerLoader(\n",
    "    INSTANCE_ID, DATABASE_ID, query, metadata_json_column=\"another-json-column\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Custom staleness\n",
    "\n",
    "The default [staleness](https://cloud.google.com/python/docs/reference/spanner/latest/snapshot-usage#beginning-a-snapshot) is 15s. This can be customized by specifying a weaker bound (which can either be to perform all reads as of a given timestamp), or as of a given duration in the past."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import datetime\n",
    "\n",
    "timestamp = datetime.datetime.utcnow()\n",
    "custom_timestamp_loader = SpannerLoader(\n",
    "    INSTANCE_ID,\n",
    "    DATABASE_ID,\n",
    "    query,\n",
    "    staleness=timestamp,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "duration = 20.0\n",
    "custom_duration_loader = SpannerLoader(\n",
    "    INSTANCE_ID,\n",
    "    DATABASE_ID,\n",
    "    query,\n",
    "    staleness=duration,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Turn on data boost\n",
    "\n",
    "By default, the loader will not use [data boost](https://cloud.google.com/spanner/docs/databoost/databoost-overview) since it has additional costs associated, and require additional IAM permissions. However, user can choose to turn it on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "custom_databoost_loader = SpannerLoader(\n",
    "    INSTANCE_ID,\n",
    "    DATABASE_ID,\n",
    "    query,\n",
    "    databoost=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Custom client\n",
    "\n",
    "The client created by default is the default client. To pass in `credentials` and `project` explicitly, a custom client can be passed to the constructor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from google.cloud import spanner\n",
    "from google.oauth2 import service_account\n",
    "\n",
    "creds = service_account.Credentials.from_service_account_file(\"/path/to/key.json\")\n",
    "custom_client = spanner.Client(project=\"my-project\", credentials=creds)\n",
    "saver = SpannerDocumentSaver(\n",
    "    INSTANCE_ID,\n",
    "    DATABASE_ID,\n",
    "    TABLE_NAME,\n",
    "    client=custom_client,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Custom initialization for SpannerDocumentSaver\n",
    "\n",
    "The SpannerDocumentSaver allows custom initialization. This allows user to specify how the Document is saved into the table.\n",
    "\n",
    "\n",
    "content_column: This will be used as the column name for the Document's page content. Defaulted to `page_content`.\n",
    "\n",
    "metadata_columns: These metadata will be saved into specific columns if the key exists in the Document's metadata.\n",
    "\n",
    "metadata_json_column: This will be the column name for the spcial JSON column. Defaulted to `langchain_metadata`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "custom_saver = SpannerDocumentSaver(\n",
    "    INSTANCE_ID,\n",
    "    DATABASE_ID,\n",
    "    TABLE_NAME,\n",
    "    content_column=\"my-content\",\n",
    "    metadata_columns=[\"foo\"],\n",
    "    metadata_json_column=\"my-special-json-column\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Initialize custom schema for Spanner\n",
    "\n",
    "The SpannerDocumentSaver will have a `init_document_table` method to create a new table to store docs with custom schema."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_google_spanner import Column\n",
    "\n",
    "new_table_name = \"my_new_table\"\n",
    "\n",
    "SpannerDocumentSaver.init_document_table(\n",
    "    INSTANCE_ID,\n",
    "    DATABASE_ID,\n",
    "    new_table_name,\n",
    "    content_column=\"my-page-content\",\n",
    "    metadata_columns=[\n",
    "        Column(\"category\", \"STRING(36)\", True),\n",
    "        Column(\"price\", \"FLOAT64\", False),\n",
    "    ],\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}