Files
langchain/docs/versioned_docs/version-0.2.x/integrations/document_loaders/google_bigtable.ipynb
Jacob Lee aff771923a Jacob/new docs (#20570)
Use docusaurus versioning with a callout, merged master as well

@hwchase17 @baskaryan

---------

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Rahul Tripathi <rauhl.psit.ec@gmail.com>
Co-authored-by: Leonid Ganeline <leo.gan.57@gmail.com>
Co-authored-by: Leonid Kuligin <lkuligin@yandex.ru>
Co-authored-by: Averi Kitsch <akitsch@google.com>
Co-authored-by: Erick Friis <erick@langchain.dev>
Co-authored-by: Nuno Campos <nuno@langchain.dev>
Co-authored-by: Nuno Campos <nuno@boringbits.io>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Co-authored-by: Martín Gotelli Ferenaz <martingotelliferenaz@gmail.com>
Co-authored-by: Fayfox <admin@fayfox.com>
Co-authored-by: Eugene Yurtsev <eugene@langchain.dev>
Co-authored-by: Dawson Bauer <105886620+djbauer2@users.noreply.github.com>
Co-authored-by: Ravindu Somawansa <ravindu.somawansa@gmail.com>
Co-authored-by: Dhruv Chawla <43818888+Dominastorm@users.noreply.github.com>
Co-authored-by: ccurme <chester.curme@gmail.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: WeichenXu <weichen.xu@databricks.com>
Co-authored-by: Benito Geordie <89472452+benitoThree@users.noreply.github.com>
Co-authored-by: kartikTAI <129414343+kartikTAI@users.noreply.github.com>
Co-authored-by: Kartik Sarangmath <kartik@thirdai.com>
Co-authored-by: Sevin F. Varoglu <sfvaroglu@octoml.ai>
Co-authored-by: MacanPN <martin.triska@gmail.com>
Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
Co-authored-by: Hyeongchan Kim <kozistr@gmail.com>
Co-authored-by: sdan <git@sdan.io>
Co-authored-by: Guangdong Liu <liugddx@gmail.com>
Co-authored-by: Rahul Triptahi <rahul.psit.ec@gmail.com>
Co-authored-by: Rahul Tripathi <rauhl.psit.ec@gmail.com>
Co-authored-by: pjb157 <84070455+pjb157@users.noreply.github.com>
Co-authored-by: Eun Hye Kim <ehkim1440@gmail.com>
Co-authored-by: kaijietti <43436010+kaijietti@users.noreply.github.com>
Co-authored-by: Pengcheng Liu <pcliu.fd@gmail.com>
Co-authored-by: Tomer Cagan <tomer@tomercagan.com>
Co-authored-by: Christophe Bornet <cbornet@hotmail.com>
2024-04-18 11:10:55 -07:00

473 lines
15 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Google Bigtable\n",
"\n",
"> [Bigtable](https://cloud.google.com/bigtable) is a key-value and wide-column store, ideal for fast access to structured, semi-structured, or unstructured data. Extend your database application to build AI-powered experiences leveraging Bigtable's Langchain integrations.\n",
"\n",
"This notebook goes over how to use [Bigtable](https://cloud.google.com/bigtable) to [save, load and delete langchain documents](/docs/modules/data_connection/document_loaders/) with `BigtableLoader` and `BigtableSaver`.\n",
"\n",
"Learn more about the package on [GitHub](https://github.com/googleapis/langchain-google-bigtable-python/).\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/googleapis/langchain-google-bigtable-python/blob/main/docs/document_loader.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Before You Begin\n",
"\n",
"To run this notebook, you will need to do the following:\n",
"\n",
"* [Create a Google Cloud Project](https://developers.google.com/workspace/guides/create-project)\n",
"* [Enable the Bigtable API](https://console.cloud.google.com/flows/enableapi?apiid=bigtable.googleapis.com)\n",
"* [Create a Bigtable instance](https://cloud.google.com/bigtable/docs/creating-instance)\n",
"* [Create a Bigtable table](https://cloud.google.com/bigtable/docs/managing-tables)\n",
"* [Create Bigtable access credentials](https://developers.google.com/workspace/guides/create-credentials)\n",
"\n",
"After confirmed access to database in the runtime environment of this notebook, filling the following values and run the cell before running example scripts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# @markdown Please specify an instance and a table for demo purpose.\n",
"INSTANCE_ID = \"my_instance\" # @param {type:\"string\"}\n",
"TABLE_ID = \"my_table\" # @param {type:\"string\"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 🦜🔗 Library Installation\n",
"\n",
"The integration lives in its own `langchain-google-bigtable` package, so we need to install it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -upgrade --quiet langchain-google-bigtable"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Colab only**: Uncomment the following cell to restart the kernel or use the button to restart the kernel. For Vertex AI Workbench you can restart the terminal using the button on top."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# # Automatically restart kernel after installs so that your environment can access the new packages\n",
"# import IPython\n",
"\n",
"# app = IPython.Application.instance()\n",
"# app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ☁ Set Your Google Cloud Project\n",
"Set your Google Cloud project so that you can leverage Google Cloud resources within this notebook.\n",
"\n",
"If you don't know your project ID, try the following:\n",
"\n",
"* Run `gcloud config list`.\n",
"* Run `gcloud projects list`.\n",
"* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# @markdown Please fill in the value below with your Google Cloud project ID and then run the cell.\n",
"\n",
"PROJECT_ID = \"my-project-id\" # @param {type:\"string\"}\n",
"\n",
"# Set the project id\n",
"!gcloud config set project {PROJECT_ID}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 🔐 Authentication\n",
"\n",
"Authenticate to Google Cloud as the IAM user logged into this notebook in order to access your Google Cloud Project.\n",
"\n",
"- If you are using Colab to run this notebook, use the cell below and continue.\n",
"- If you are using Vertex AI Workbench, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from google.colab import auth\n",
"\n",
"auth.authenticate_user()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic Usage"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using the saver\n",
"\n",
"Save langchain documents with `BigtableSaver.add_documents(<documents>)`. To initialize `BigtableSaver` class you need to provide 2 things:\n",
"\n",
"1. `instance_id` - An instance of Bigtable.\n",
"1. `table_id` - The name of the table within the Bigtable to store langchain documents."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.documents import Document\n",
"from langchain_google_bigtable import BigtableSaver\n",
"\n",
"test_docs = [\n",
" Document(\n",
" page_content=\"Apple Granny Smith 150 0.99 1\",\n",
" metadata={\"fruit_id\": 1},\n",
" ),\n",
" Document(\n",
" page_content=\"Banana Cavendish 200 0.59 0\",\n",
" metadata={\"fruit_id\": 2},\n",
" ),\n",
" Document(\n",
" page_content=\"Orange Navel 80 1.29 1\",\n",
" metadata={\"fruit_id\": 3},\n",
" ),\n",
"]\n",
"\n",
"saver = BigtableSaver(\n",
" instance_id=INSTANCE_ID,\n",
" table_id=TABLE_ID,\n",
")\n",
"\n",
"saver.add_documents(test_docs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Querying for Documents from Bigtable\n",
"For more details on connecting to a Bigtable table, please check the [Python SDK documentation](https://cloud.google.com/python/docs/reference/bigtable/latest/client)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Load documents from table\n",
"\n",
"Load langchain documents with `BigtableLoader.load()` or `BigtableLoader.lazy_load()`. `lazy_load` returns a generator that only queries database during the iteration. To initialize `BigtableLoader` class you need to provide:\n",
"\n",
"1. `instance_id` - An instance of Bigtable.\n",
"1. `table_id` - The name of the table within the Bigtable to store langchain documents."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_google_bigtable import BigtableLoader\n",
"\n",
"loader = BigtableLoader(\n",
" instance_id=INSTANCE_ID,\n",
" table_id=TABLE_ID,\n",
")\n",
"\n",
"for doc in loader.lazy_load():\n",
" print(doc)\n",
" break"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Delete documents\n",
"\n",
"Delete a list of langchain documents from Bigtable table with `BigtableSaver.delete(<documents>)`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_google_bigtable import BigtableSaver\n",
"\n",
"docs = loader.load()\n",
"print(\"Documents before delete: \", docs)\n",
"\n",
"onedoc = test_docs[0]\n",
"saver.delete([onedoc])\n",
"print(\"Documents after delete: \", loader.load())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Advanced Usage"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Limiting the returned rows\n",
"There are two ways to limit the returned rows:\n",
"\n",
"1. Using a [filter](https://cloud.google.com/python/docs/reference/bigtable/latest/row-filters)\n",
"2. Using a [row_set](https://cloud.google.com/python/docs/reference/bigtable/latest/row-set#google.cloud.bigtable.row_set.RowSet)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import google.cloud.bigtable.row_filters as row_filters\n",
"\n",
"filter_loader = BigtableLoader(\n",
" INSTANCE_ID, TABLE_ID, filter=row_filters.ColumnQualifierRegexFilter(b\"os_build\")\n",
")\n",
"\n",
"\n",
"from google.cloud.bigtable.row_set import RowSet\n",
"\n",
"row_set = RowSet()\n",
"row_set.add_row_range_from_keys(\n",
" start_key=\"phone#4c410523#20190501\", end_key=\"phone#4c410523#201906201\"\n",
")\n",
"\n",
"row_set_loader = BigtableLoader(\n",
" INSTANCE_ID,\n",
" TABLE_ID,\n",
" row_set=row_set,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Custom client\n",
"The client created by default is the default client, using only admin=True option. To use a non-default, a [custom client](https://cloud.google.com/python/docs/reference/bigtable/latest/client#class-googlecloudbigtableclientclientprojectnone-credentialsnone-readonlyfalse-adminfalse-clientinfonone-clientoptionsnone-adminclientoptionsnone-channelnone) can be passed to the constructor."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from google.cloud import bigtable\n",
"\n",
"custom_client_loader = BigtableLoader(\n",
" INSTANCE_ID,\n",
" TABLE_ID,\n",
" client=bigtable.Client(...),\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Custom content\n",
"The BigtableLoader assumes there is a column family called `langchain`, that has a column called `content`, that contains values encoded in UTF-8. These defaults can be changed like so:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_google_bigtable import Encoding\n",
"\n",
"custom_content_loader = BigtableLoader(\n",
" INSTANCE_ID,\n",
" TABLE_ID,\n",
" content_encoding=Encoding.ASCII,\n",
" content_column_family=\"my_content_family\",\n",
" content_column_name=\"my_content_column_name\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Metadata mapping\n",
"By default, the `metadata` map on the `Document` object will contain a single key, `rowkey`, with the value of the row's rowkey value. To add more items to that map, use metadata_mapping."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"from langchain_google_bigtable import MetadataMapping\n",
"\n",
"metadata_mapping_loader = BigtableLoader(\n",
" INSTANCE_ID,\n",
" TABLE_ID,\n",
" metadata_mappings=[\n",
" MetadataMapping(\n",
" column_family=\"my_int_family\",\n",
" column_name=\"my_int_column\",\n",
" metadata_key=\"key_in_metadata_map\",\n",
" encoding=Encoding.INT_BIG_ENDIAN,\n",
" ),\n",
" MetadataMapping(\n",
" column_family=\"my_custom_family\",\n",
" column_name=\"my_custom_column\",\n",
" metadata_key=\"custom_key\",\n",
" encoding=Encoding.CUSTOM,\n",
" custom_decoding_func=lambda input: json.loads(input.decode()),\n",
" custom_encoding_func=lambda input: str.encode(json.dumps(input)),\n",
" ),\n",
" ],\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Metadata as JSON\n",
"\n",
"If there is a column in Bigtable that contains a JSON string that you would like to have added to the output document metadata, it is possible to add the following parameters to BigtableLoader. Note, the default value for `metadata_as_json_encoding` is UTF-8."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metadata_as_json_loader = BigtableLoader(\n",
" INSTANCE_ID,\n",
" TABLE_ID,\n",
" metadata_as_json_encoding=Encoding.ASCII,\n",
" metadata_as_json_family=\"my_metadata_as_json_family\",\n",
" metadata_as_json_name=\"my_metadata_as_json_column_name\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Customize BigtableSaver\n",
"\n",
"The BigtableSaver is also customizable similar to BigtableLoader."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"saver = BigtableSaver(\n",
" INSTANCE_ID,\n",
" TABLE_ID,\n",
" client=bigtable.Client(...),\n",
" content_encoding=Encoding.ASCII,\n",
" content_column_family=\"my_content_family\",\n",
" content_column_name=\"my_content_column_name\",\n",
" metadata_mappings=[\n",
" MetadataMapping(\n",
" column_family=\"my_int_family\",\n",
" column_name=\"my_int_column\",\n",
" metadata_key=\"key_in_metadata_map\",\n",
" encoding=Encoding.INT_BIG_ENDIAN,\n",
" ),\n",
" MetadataMapping(\n",
" column_family=\"my_custom_family\",\n",
" column_name=\"my_custom_column\",\n",
" metadata_key=\"custom_key\",\n",
" encoding=Encoding.CUSTOM,\n",
" custom_decoding_func=lambda input: json.loads(input.decode()),\n",
" custom_encoding_func=lambda input: str.encode(json.dumps(input)),\n",
" ),\n",
" ],\n",
" metadata_as_json_encoding=Encoding.ASCII,\n",
" metadata_as_json_family=\"my_metadata_as_json_family\",\n",
" metadata_as_json_name=\"my_metadata_as_json_column_name\",\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}