docs: Add Databricks Vector Search example notebook (#14158)

This PR adds an example notebook for the Databricks Vector Search vector
store. It also adds an introduction to the Databricks Vector Search
product on the Databricks's provider page.

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
This commit is contained in:
Chengzu Ou 2023-12-12 17:40:29 -08:00 committed by GitHub
parent 414bddd5f0
commit df95abb7e7
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 392 additions and 1 deletions

View File

@ -79,3 +79,9 @@ Databricks as an LLM provider
The notebook [Wrap Databricks endpoints as LLMs](/docs/integrations/llms/databricks#wrapping-a-serving-endpoint-custom-model) demonstrates how to serve a custom model that has been registered by MLflow as a Databricks endpoint.
It supports two types of endpoints: the serving endpoint, which is recommended for both production and development, and the cluster driver proxy app, which is recommended for interactive development.
Databricks Vector Search
------------------------
Databricks Vector Search is a serverless similarity search engine that allows you to store a vector representation of your data, including metadata, in a vector database. With Vector Search, you can create auto-updating vector search indexes from Delta tables managed by Unity Catalog and query them with a simple API to return the most similar vectors. See the notebook [Databricks Vector Search](/docs/integrations/vectorstores/databricks_vector_search) for instructions to use it with LangChain.

View File

@ -0,0 +1,385 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "5a8c5767-adfe-4b9d-a665-a898756d7a6c",
"showTitle": false,
"title": ""
}
},
"source": [
"# Databricks Vector Search\n",
"\n",
"Databricks Vector Search is a serverless similarity search engine that allows you to store a vector representation of your data, including metadata, in a vector database. With Vector Search, you can create auto-updating vector search indexes from Delta tables managed by Unity Catalog and query them with a simple API to return the most similar vectors.\n",
"\n",
"This notebook shows how to use LangChain with Databricks Vector Search."
]
},
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "746cfacd-fb30-48fd-96a5-bbcc0d15ae49",
"showTitle": false,
"title": ""
}
},
"source": [
"Install `databricks-vectorsearch` and related Python packages used in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "9258a3e7-e050-4390-9d3f-9adff1460dab",
"showTitle": false,
"title": ""
}
},
"outputs": [],
"source": [
"!pip install langchain-core databricks-vectorsearch openai tiktoken"
]
},
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "f4f09d6d-002d-4cb0-a664-0a83bd2a13da",
"showTitle": false,
"title": ""
}
},
"source": [
"Use `OpenAIEmbeddings` for the embeddings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "f11b902d-a772-45e0-bbd9-526218b717cc",
"showTitle": false,
"title": ""
}
},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "59b568f3-8db2-427e-9a4a-1df6fa7a1739",
"showTitle": false,
"title": ""
}
},
"source": [
"Split documents and get embeddings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "b28e1c7b-eae4-4be8-abbd-8433c7557dc2",
"showTitle": false,
"title": ""
}
},
"outputs": [],
"source": [
"from langchain.document_loaders import TextLoader\n",
"from langchain.embeddings import OpenAIEmbeddings\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"\n",
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"embeddings = OpenAIEmbeddings()\n",
"emb_dim = len(embeddings.embed_query(\"hello\"))"
]
},
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "e8fcdda1-208a-45c9-816e-ff0d2c8f59d6",
"showTitle": false,
"title": ""
}
},
"source": [
"## Setup Databricks Vector Search client"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "9b87fff1-99e5-4d9f-aba3-d21a7ccc498e",
"showTitle": false,
"title": ""
}
},
"outputs": [],
"source": [
"from databricks.vector_search.client import VectorSearchClient\n",
"\n",
"vsc = VectorSearchClient()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a Vector Search Endpoint\n",
"This endpoint is used to create and access vector search indexes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vsc.create_endpoint(name=\"vector_search_demo_endpoint\", endpoint_type=\"STANDARD\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "81090f87-3efd-4c1e-9f58-8d6adba7553d",
"showTitle": false,
"title": ""
}
},
"source": [
"## Create Direct Vector Access Index\n",
"Direct Vector Access Index supports direct read and write of embedding vectors and metadata through a REST API or an SDK. For this index, you manage embedding vectors and index updates yourself."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "9389ec6b-5885-411f-a26e-1a4b03651f5c",
"showTitle": false,
"title": ""
}
},
"outputs": [],
"source": [
"vector_search_endpoint_name = \"vector_search_demo_endpoint\"\n",
"index_name = \"ml.llm.demo_index\"\n",
"\n",
"index = vsc.create_direct_access_index(\n",
" endpoint_name=vector_search_endpoint_name,\n",
" index_name=index_name,\n",
" primary_key=\"id\",\n",
" embedding_dimension=emb_dim,\n",
" embedding_vector_column=\"text_vector\",\n",
" schema={\n",
" \"id\": \"string\",\n",
" \"text\": \"string\",\n",
" \"text_vector\": \"array<float>\",\n",
" \"source\": \"string\",\n",
" },\n",
")\n",
"\n",
"index.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "047a14c9-2f06-4f74-883d-815b2c69786c",
"showTitle": false,
"title": ""
}
},
"outputs": [],
"source": [
"from langchain.vectorstores import DatabricksVectorSearch\n",
"\n",
"dvs = DatabricksVectorSearch(\n",
" index, text_column=\"text\", embedding=embeddings, columns=[\"source\"]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "951bd581-2ced-497f-9c70-4fda902fd3a1",
"showTitle": false,
"title": ""
}
},
"source": [
"## Add docs to the index"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "1e85f235-901f-4cf5-845f-5dbf4ce42078",
"showTitle": false,
"title": ""
}
},
"outputs": [],
"source": [
"dvs.add_documents(docs)"
]
},
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "8bea6f0a-b305-455a-acba-99cc8c9350b5",
"showTitle": false,
"title": ""
}
},
"source": [
"## Similarity search"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "25c5a044-a61a-4929-9e65-a0f0462925df",
"showTitle": false,
"title": ""
}
},
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"dvs.similarity_search(query)\n",
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "46e3f41b-dac2-4bed-91cb-a3914c25d275",
"showTitle": false,
"title": ""
}
},
"source": [
"## Work with Delta Sync Index\n",
"\n",
"You can also use `DatabricksVectorSearch` to search in a Delta Sync Index. Delta Sync Index automatically syncs from a Delta table. You don't need to call `add_text`/`add_documents` manually. See [Databricks documentation page](https://docs.databricks.com/en/generative-ai/vector-search.html#delta-sync-index-with-managed-embeddings) for more details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "0c1f448e-77ca-41ce-887c-15948e866a0e",
"showTitle": false,
"title": ""
}
},
"outputs": [],
"source": [
"dvs_delta_sync = DatabricksVectorSearch(\"catalog_name.schema_name.delta_sync_index\")\n",
"dvs_delta_sync.similarity_search(query)"
]
}
],
"metadata": {
"application/vnd.databricks.v1+notebook": {
"dashboards": [],
"language": "python",
"notebookMetadata": {
"pythonIndentUnit": 2
},
"notebookName": "databricks_vector_search",
"widgets": {}
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -96,7 +96,7 @@ class DatabricksVectorSearch(VectorStore):
)
For more information on Databricks Vector Search, see `Databricks Vector Search
documentation <TODO: pending-link-to-documentation-page>`.
documentation: https://docs.databricks.com/en/generative-ai/vector-search.html.
"""