diff --git a/docs/docs/integrations/providers/databricks.md b/docs/docs/integrations/providers/databricks.md index 3ac64692cbf..96911ab02f5 100644 --- a/docs/docs/integrations/providers/databricks.md +++ b/docs/docs/integrations/providers/databricks.md @@ -79,3 +79,9 @@ Databricks as an LLM provider The notebook [Wrap Databricks endpoints as LLMs](/docs/integrations/llms/databricks#wrapping-a-serving-endpoint-custom-model) demonstrates how to serve a custom model that has been registered by MLflow as a Databricks endpoint. It supports two types of endpoints: the serving endpoint, which is recommended for both production and development, and the cluster driver proxy app, which is recommended for interactive development. + + +Databricks Vector Search +------------------------ + +Databricks Vector Search is a serverless similarity search engine that allows you to store a vector representation of your data, including metadata, in a vector database. With Vector Search, you can create auto-updating vector search indexes from Delta tables managed by Unity Catalog and query them with a simple API to return the most similar vectors. See the notebook [Databricks Vector Search](/docs/integrations/vectorstores/databricks_vector_search) for instructions to use it with LangChain. diff --git a/docs/docs/integrations/vectorstores/databricks_vector_search.ipynb b/docs/docs/integrations/vectorstores/databricks_vector_search.ipynb new file mode 100644 index 00000000000..52a53c758a0 --- /dev/null +++ b/docs/docs/integrations/vectorstores/databricks_vector_search.ipynb @@ -0,0 +1,385 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "5a8c5767-adfe-4b9d-a665-a898756d7a6c", + "showTitle": false, + "title": "" + } + }, + "source": [ + "# Databricks Vector Search\n", + "\n", + "Databricks Vector Search is a serverless similarity search engine that allows you to store a vector representation of your data, including metadata, in a vector database. With Vector Search, you can create auto-updating vector search indexes from Delta tables managed by Unity Catalog and query them with a simple API to return the most similar vectors.\n", + "\n", + "This notebook shows how to use LangChain with Databricks Vector Search." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "746cfacd-fb30-48fd-96a5-bbcc0d15ae49", + "showTitle": false, + "title": "" + } + }, + "source": [ + "Install `databricks-vectorsearch` and related Python packages used in this notebook." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "9258a3e7-e050-4390-9d3f-9adff1460dab", + "showTitle": false, + "title": "" + } + }, + "outputs": [], + "source": [ + "!pip install langchain-core databricks-vectorsearch openai tiktoken" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "f4f09d6d-002d-4cb0-a664-0a83bd2a13da", + "showTitle": false, + "title": "" + } + }, + "source": [ + "Use `OpenAIEmbeddings` for the embeddings." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "f11b902d-a772-45e0-bbd9-526218b717cc", + "showTitle": false, + "title": "" + } + }, + "outputs": [], + "source": [ + "import getpass\n", + "import os\n", + "\n", + "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "59b568f3-8db2-427e-9a4a-1df6fa7a1739", + "showTitle": false, + "title": "" + } + }, + "source": [ + "Split documents and get embeddings." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "b28e1c7b-eae4-4be8-abbd-8433c7557dc2", + "showTitle": false, + "title": "" + } + }, + "outputs": [], + "source": [ + "from langchain.document_loaders import TextLoader\n", + "from langchain.embeddings import OpenAIEmbeddings\n", + "from langchain.text_splitter import CharacterTextSplitter\n", + "\n", + "loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n", + "documents = loader.load()\n", + "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n", + "docs = text_splitter.split_documents(documents)\n", + "\n", + "embeddings = OpenAIEmbeddings()\n", + "emb_dim = len(embeddings.embed_query(\"hello\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "e8fcdda1-208a-45c9-816e-ff0d2c8f59d6", + "showTitle": false, + "title": "" + } + }, + "source": [ + "## Setup Databricks Vector Search client" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "9b87fff1-99e5-4d9f-aba3-d21a7ccc498e", + "showTitle": false, + "title": "" + } + }, + "outputs": [], + "source": [ + "from databricks.vector_search.client import VectorSearchClient\n", + "\n", + "vsc = VectorSearchClient()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create a Vector Search Endpoint\n", + "This endpoint is used to create and access vector search indexes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "vsc.create_endpoint(name=\"vector_search_demo_endpoint\", endpoint_type=\"STANDARD\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "81090f87-3efd-4c1e-9f58-8d6adba7553d", + "showTitle": false, + "title": "" + } + }, + "source": [ + "## Create Direct Vector Access Index\n", + "Direct Vector Access Index supports direct read and write of embedding vectors and metadata through a REST API or an SDK. For this index, you manage embedding vectors and index updates yourself." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "9389ec6b-5885-411f-a26e-1a4b03651f5c", + "showTitle": false, + "title": "" + } + }, + "outputs": [], + "source": [ + "vector_search_endpoint_name = \"vector_search_demo_endpoint\"\n", + "index_name = \"ml.llm.demo_index\"\n", + "\n", + "index = vsc.create_direct_access_index(\n", + " endpoint_name=vector_search_endpoint_name,\n", + " index_name=index_name,\n", + " primary_key=\"id\",\n", + " embedding_dimension=emb_dim,\n", + " embedding_vector_column=\"text_vector\",\n", + " schema={\n", + " \"id\": \"string\",\n", + " \"text\": \"string\",\n", + " \"text_vector\": \"array\",\n", + " \"source\": \"string\",\n", + " },\n", + ")\n", + "\n", + "index.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "047a14c9-2f06-4f74-883d-815b2c69786c", + "showTitle": false, + "title": "" + } + }, + "outputs": [], + "source": [ + "from langchain.vectorstores import DatabricksVectorSearch\n", + "\n", + "dvs = DatabricksVectorSearch(\n", + " index, text_column=\"text\", embedding=embeddings, columns=[\"source\"]\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "951bd581-2ced-497f-9c70-4fda902fd3a1", + "showTitle": false, + "title": "" + } + }, + "source": [ + "## Add docs to the index" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "1e85f235-901f-4cf5-845f-5dbf4ce42078", + "showTitle": false, + "title": "" + } + }, + "outputs": [], + "source": [ + "dvs.add_documents(docs)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "8bea6f0a-b305-455a-acba-99cc8c9350b5", + "showTitle": false, + "title": "" + } + }, + "source": [ + "## Similarity search" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "25c5a044-a61a-4929-9e65-a0f0462925df", + "showTitle": false, + "title": "" + } + }, + "outputs": [], + "source": [ + "query = \"What did the president say about Ketanji Brown Jackson\"\n", + "dvs.similarity_search(query)\n", + "print(docs[0].page_content)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "46e3f41b-dac2-4bed-91cb-a3914c25d275", + "showTitle": false, + "title": "" + } + }, + "source": [ + "## Work with Delta Sync Index\n", + "\n", + "You can also use `DatabricksVectorSearch` to search in a Delta Sync Index. Delta Sync Index automatically syncs from a Delta table. You don't need to call `add_text`/`add_documents` manually. See [Databricks documentation page](https://docs.databricks.com/en/generative-ai/vector-search.html#delta-sync-index-with-managed-embeddings) for more details." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "0c1f448e-77ca-41ce-887c-15948e866a0e", + "showTitle": false, + "title": "" + } + }, + "outputs": [], + "source": [ + "dvs_delta_sync = DatabricksVectorSearch(\"catalog_name.schema_name.delta_sync_index\")\n", + "dvs_delta_sync.similarity_search(query)" + ] + } + ], + "metadata": { + "application/vnd.databricks.v1+notebook": { + "dashboards": [], + "language": "python", + "notebookMetadata": { + "pythonIndentUnit": 2 + }, + "notebookName": "databricks_vector_search", + "widgets": {} + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.10" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/libs/community/langchain_community/vectorstores/databricks_vector_search.py b/libs/community/langchain_community/vectorstores/databricks_vector_search.py index 5f5882ae100..8505ea9cbc0 100644 --- a/libs/community/langchain_community/vectorstores/databricks_vector_search.py +++ b/libs/community/langchain_community/vectorstores/databricks_vector_search.py @@ -96,7 +96,7 @@ class DatabricksVectorSearch(VectorStore): ) For more information on Databricks Vector Search, see `Databricks Vector Search - documentation `. + documentation: https://docs.databricks.com/en/generative-ai/vector-search.html. """