mirror of
https://github.com/hwchase17/langchain.git
synced 2026-02-21 06:33:41 +00:00
- feat(docs): add Bigtable Key-value store doc
- feat(docs): add Bigtable Vector store doc

This PR adds a doc for the Bigtable and LangChain key-value store integration. It contains guides on how to add, delete, get, and yield key-value pairs from the Bigtable key-value store for LangChain.

Co-authored-by: Mason Daugherty <mason@langchain.dev>
484 lines
15 KiB
Plaintext
{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "---\n",
    "sidebar_label: Google Bigtable\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# BigtableByteStore\n",
    "\n",
    "This guide covers how to use Google Cloud Bigtable as a key-value store.\n",
    "\n",
    "[Bigtable](https://cloud.google.com/bigtable) is a key-value and wide-column store, ideal for fast access to structured, semi-structured, or unstructured data.\n",
    "\n",
    "[Open in Colab](https://colab.research.google.com/github/googleapis/langchain-google-bigtable-python/blob/main/docs/key_value_store.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "The `BigtableByteStore` uses Google Cloud Bigtable as a backend for a key-value store. It supports synchronous and asynchronous operations for setting, getting, and deleting key-value pairs.\n",
    "\n",
    "### Integration details\n",
    "| Class | Package | Local | JS support | Package downloads | Package latest |\n",
    "| :--- | :--- | :---: | :---: | :---: | :---: |\n",
    "| [BigtableByteStore](https://github.com/googleapis/langchain-google-bigtable-python/blob/main/src/langchain_google_bigtable/key_value_store.py) | [langchain-google-bigtable](https://pypi.org/project/langchain-google-bigtable/) | ❌ | ❌ |  |  |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "### Prerequisites\n",
    "\n",
    "To get started, you will need a Google Cloud project with an active Bigtable instance and table.\n",
    "* [Create a Google Cloud Project](https://developers.google.com/workspace/guides/create-project)\n",
    "* [Enable the Bigtable API](https://console.cloud.google.com/flows/enableapi?apiid=bigtable.googleapis.com)\n",
    "* [Create a Bigtable instance and table](https://cloud.google.com/bigtable/docs/creating-instance)\n",
    "\n",
    "### Installation\n",
    "\n",
    "The integration is in the `langchain-google-bigtable` package. The command below also installs `langchain-google-vertexai` for the embedding cache example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -qU langchain-google-bigtable langchain-google-vertexai"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ☁ Set Your Google Cloud Project\n",
    "Set your Google Cloud project to use its resources within this notebook.\n",
    "\n",
    "If you don't know your project ID, you can run `gcloud config list` or see the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# @markdown Please fill in your project, instance, and table details.\n",
    "PROJECT_ID = \"your-gcp-project-id\"  # @param {type:\"string\"}\n",
    "INSTANCE_ID = \"your-instance-id\"  # @param {type:\"string\"}\n",
    "TABLE_ID = \"your-table-id\"  # @param {type:\"string\"}\n",
    "\n",
    "!gcloud config set project {PROJECT_ID}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 🔐 Authentication\n",
    "Authenticate to Google Cloud to access your project resources.\n",
    "- For **Colab**, use the cell below.\n",
    "- For **Vertex AI Workbench**, see the [setup instructions](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from google.colab import auth\n",
    "\n",
    "auth.authenticate_user()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Instantiation\n",
    "\n",
    "To use `BigtableByteStore`, we first ensure a table exists and then initialize a `BigtableEngine` to manage connections."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_google_bigtable import (\n",
    "    BigtableByteStore,\n",
    "    BigtableEngine,\n",
    "    init_key_value_store_table,\n",
    ")\n",
    "\n",
    "# Ensure the table and column family exist.\n",
    "init_key_value_store_table(\n",
    "    project_id=PROJECT_ID,\n",
    "    instance_id=INSTANCE_ID,\n",
    "    table_id=TABLE_ID,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### BigtableEngine\n",
    "A `BigtableEngine` object handles the execution context for the store, especially for async operations. It's recommended to initialize a single engine and reuse it across multiple stores for better performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize the engine to manage async operations.\n",
    "engine = await BigtableEngine.async_initialize(\n",
    "    project_id=PROJECT_ID, instance_id=INSTANCE_ID\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### BigtableByteStore\n",
    "\n",
    "This is the main class for interacting with the key-value store. It provides the methods for setting, getting, and deleting data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize the store.\n",
    "store = await BigtableByteStore.create(engine=engine, table_id=TABLE_ID)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Usage\n",
    "\n",
    "The store supports both sync (`mset`, `mget`) and async (`amset`, `amget`) methods. The basic examples below use the async versions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set\n",
    "Use `amset` to save key-value pairs to the store."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "kv_pairs = [\n",
    "    (\"key1\", b\"value1\"),\n",
    "    (\"key2\", b\"value2\"),\n",
    "    (\"key3\", b\"value3\"),\n",
    "]\n",
    "\n",
    "await store.amset(kv_pairs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get\n",
    "Use `amget` to retrieve values. If a key is not found, `None` is returned for that key."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "retrieved_vals = await store.amget([\"key1\", \"key2\", \"nonexistent_key\"])\n",
    "print(retrieved_vals)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Delete\n",
    "Use `amdelete` to remove keys from the store."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "await store.amdelete([\"key3\"])\n",
    "\n",
    "# Verify that key3 was deleted; its value should now be None.\n",
    "await store.amget([\"key1\", \"key3\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Iterate over keys\n",
    "Use `ayield_keys` to iterate over all keys or keys with a specific prefix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_keys = [key async for key in store.ayield_keys()]\n",
    "print(f\"All keys: {all_keys}\")\n",
    "\n",
    "prefixed_keys = [key async for key in store.ayield_keys(prefix=\"key1\")]\n",
    "print(f\"Prefixed keys: {prefixed_keys}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Advanced Usage: Embedding Caching\n",
    "\n",
    "A common use case for a key-value store is to cache expensive operations like computing text embeddings, which saves time and cost."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import CacheBackedEmbeddings\n",
    "from langchain_google_vertexai.embeddings import VertexAIEmbeddings\n",
    "\n",
    "underlying_embeddings = VertexAIEmbeddings(\n",
    "    project=PROJECT_ID, model_name=\"textembedding-gecko@003\"\n",
    ")\n",
    "\n",
    "# Use a namespace to avoid key collisions with other data.\n",
    "cached_embedder = CacheBackedEmbeddings.from_bytes_store(\n",
    "    underlying_embeddings,\n",
    "    store,\n",
    "    namespace=\"text-embeddings\",\n",
    "    # Query embeddings are not cached by default; reuse the same store for them\n",
    "    # so that the repeated `aembed_query` call below is served from the cache.\n",
    "    query_embedding_cache=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"First call (computes and caches embedding):\")\n",
    "%time embedding_result_1 = await cached_embedder.aembed_query(\"Hello, world!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\nSecond call (retrieves from cache):\")\n",
    "%time embedding_result_2 = await cached_embedder.aembed_query(\"Hello, world!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### As a Simple Document Retriever\n",
    "\n",
    "This section shows how to create a simple retriever using the Bigtable store. The store acts as a document persistence layer, and the retriever fetches documents whose keys start with the query string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "from typing import List, Optional, Union\n",
    "\n",
    "from langchain_core.callbacks import (\n",
    "    AsyncCallbackManagerForRetrieverRun,\n",
    "    CallbackManagerForRetrieverRun,\n",
    ")\n",
    "from langchain_core.documents import Document\n",
    "from langchain_core.retrievers import BaseRetriever\n",
    "\n",
    "\n",
    "class SimpleKVStoreRetriever(BaseRetriever):\n",
    "    \"\"\"A simple retriever that returns documents whose keys start with the query.\"\"\"\n",
    "\n",
    "    store: BigtableByteStore\n",
    "    documents: List[Union[Document, str]]\n",
    "    k: int  # Maximum number of documents to return.\n",
    "\n",
    "    def set_up_store(self):\n",
    "        \"\"\"Serialize every document and write it to the store under its ID.\"\"\"\n",
    "        kv_pairs_to_set = []\n",
    "        for i, doc in enumerate(self.documents):\n",
    "            if isinstance(doc, str):\n",
    "                doc = Document(page_content=doc)\n",
    "            if not doc.id:\n",
    "                # Fall back to the list position when no ID is given.\n",
    "                doc.id = str(i)\n",
    "            # Stored layout: \"Page Content\\n<content>\\nMetadata<metadata as JSON>\".\n",
    "            value = (\n",
    "                \"Page Content\\n\"\n",
    "                + doc.page_content\n",
    "                + \"\\nMetadata\"\n",
    "                + json.dumps(doc.metadata)\n",
    "            )\n",
    "            kv_pairs_to_set.append((doc.id, value.encode(\"utf-8\")))\n",
    "        self.store.mset(kv_pairs_to_set)\n",
    "\n",
    "    async def _aget_relevant_documents(\n",
    "        self,\n",
    "        query: str,\n",
    "        *,\n",
    "        run_manager: Optional[AsyncCallbackManagerForRetrieverRun] = None,\n",
    "    ) -> List[Document]:\n",
    "        keys = [key async for key in self.store.ayield_keys(prefix=query)][: self.k]\n",
    "        documents_retrieved = []\n",
    "        # `amget` returns a plain list, so iterate with a regular `for` loop.\n",
    "        for document in await self.store.amget(keys):\n",
    "            if document:\n",
    "                # Split the stored value back into page content and metadata.\n",
    "                # This simple parsing assumes the content has no \"\\nMetadata\" marker.\n",
    "                document_str = document.decode(\"utf-8\")\n",
    "                page_content = document_str.split(\"Content\\n\")[1].split(\"\\nMetadata\")[0]\n",
    "                metadata = json.loads(document_str.split(\"\\nMetadata\")[1])\n",
    "                documents_retrieved.append(\n",
    "                    Document(page_content=page_content, metadata=metadata)\n",
    "                )\n",
    "        return documents_retrieved\n",
    "\n",
    "    def _get_relevant_documents(\n",
    "        self,\n",
    "        query: str,\n",
    "        *,\n",
    "        run_manager: Optional[CallbackManagerForRetrieverRun] = None,\n",
    "    ) -> List[Document]:\n",
    "        keys = [key for key in self.store.yield_keys(prefix=query)][: self.k]\n",
    "        documents_retrieved = []\n",
    "        for document in self.store.mget(keys):\n",
    "            if document:\n",
    "                # Same parsing as in the async variant above.\n",
    "                document_str = document.decode(\"utf-8\")\n",
    "                page_content = document_str.split(\"Content\\n\")[1].split(\"\\nMetadata\")[0]\n",
    "                metadata = json.loads(document_str.split(\"\\nMetadata\")[1])\n",
    "                documents_retrieved.append(\n",
    "                    Document(page_content=page_content, metadata=metadata)\n",
    "                )\n",
    "        return documents_retrieved"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents = [\n",
    "    Document(\n",
    "        page_content=\"Goldfish are popular pets for beginners, requiring relatively simple care.\",\n",
    "        metadata={\"type\": \"fish\", \"trait\": \"low maintenance\"},\n",
    "        id=\"fish#Goldfish\",\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"Cats are independent pets that often enjoy their own space.\",\n",
    "        metadata={\"type\": \"cat\", \"trait\": \"independence\"},\n",
    "        id=\"mammals#Cats\",\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"Rabbits are social animals that need plenty of space to hop around.\",\n",
    "        metadata={\"type\": \"rabbit\", \"trait\": \"social\"},\n",
    "        id=\"mammals#Rabbits\",\n",
    "    ),\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "retriever_store = BigtableByteStore.create_sync(\n",
    "    engine=engine, instance_id=INSTANCE_ID, table_id=TABLE_ID\n",
    ")\n",
    "\n",
    "KVDocumentRetriever = SimpleKVStoreRetriever(\n",
    "    store=retriever_store, documents=documents, k=2\n",
    ")\n",
    "\n",
    "KVDocumentRetriever.set_up_store()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "KVDocumentRetriever.invoke(\"fish\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "KVDocumentRetriever.invoke(\"mammals\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## API reference\n",
    "\n",
    "For full details on the `BigtableByteStore` class, see the source code on [GitHub](https://github.com/googleapis/langchain-google-bigtable-python/blob/main/src/langchain_google_bigtable/key_value_store.py)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}