diff --git a/docs/extras/integrations/vectorstores/vespa.ipynb b/docs/extras/integrations/vectorstores/vespa.ipynb new file mode 100644 index 00000000000..c7500944093 --- /dev/null +++ b/docs/extras/integrations/vectorstores/vespa.ipynb @@ -0,0 +1,883 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "ce0f17b9", + "metadata": {}, + "source": [ + "# Vespa\n", + "\n", + ">[Vespa](https://vespa.ai/) is a fully featured search engine and vector database. It supports vector search (ANN), lexical search, and search in structured data, all in the same query.\n", + "\n", + "This notebook shows how to use `Vespa.ai` as a LangChain vector store.\n", + "\n", + "In order to create the vector store, we use\n", + "[pyvespa](https://pyvespa.readthedocs.io/en/latest/index.html) to create a\n", + "connection to a `Vespa` service." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7e6a11ab-38bd-4920-ba11-60cb2f075754", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "#!pip install pyvespa" + ] + }, + { + "cell_type": "markdown", + "source": [ + "Using the `pyvespa` package, you can either connect to a\n", + "[Vespa Cloud instance](https://pyvespa.readthedocs.io/en/latest/deploy-vespa-cloud.html)\n", + "or a local\n", + "[Docker instance](https://pyvespa.readthedocs.io/en/latest/deploy-docker.html).\n", + "Here, we will create a new Vespa application and deploy it using Docker.\n", + "\n", + "#### Creating a Vespa application\n", + "\n", + "First, we need to create an application package:" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "from vespa.package import ApplicationPackage, Field, RankProfile\n", + "\n", + "app_package = ApplicationPackage(name=\"testapp\")\n", + "app_package.schema.add_fields(\n", + " Field(name=\"text\", type=\"string\", indexing=[\"index\", \"summary\"], index=\"enable-bm25\"),\n", + " Field(name=\"embedding\", type=\"tensor(x[384])\",\n", + " indexing=[\"attribute\", \"summary\"],\n", + " attribute=[\"distance-metric: angular\"]),\n", + ")\n", + "app_package.schema.add_rank_profile(\n", + " RankProfile(name=\"default\",\n", + " first_phase=\"closeness(field, embedding)\",\n", + " inputs=[(\"query(query_embedding)\", \"tensor(x[384])\")]\n", + " )\n", + ")" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "This sets up a Vespa application with a document schema containing\n", + "two fields: `text` for holding the document text and `embedding` for holding\n", + "the embedding vector. The `text` field is set up to use a BM25 index for\n", + "efficient text retrieval, and we'll see how to use this and hybrid search a\n", + "bit later.\n", + "\n", + "The `embedding` field is set up with a vector of length 384 to hold the\n", + "embedding representation of the text. See\n", + "[Vespa's Tensor Guide](https://docs.vespa.ai/en/tensor-user-guide.html)\n", + "for more on tensors in Vespa.\n", + "\n", + "Lastly, we add a [rank profile](https://docs.vespa.ai/en/ranking.html) to\n", + "instruct Vespa how to order documents. Here we set this up with a\n", + "[nearest neighbor search](https://docs.vespa.ai/en/nearest-neighbor-search.html).\n",
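+ "\n", + "For reference, the schema generated by this application package is roughly\n", + "equivalent to the following sketch (the exact output may differ between\n", + "pyvespa versions):\n", + "\n", + "```\n", + "schema testapp {\n", + "    document testapp {\n", + "        field text type string {\n", + "            indexing: index | summary\n", + "            index: enable-bm25\n", + "        }\n", + "        field embedding type tensor(x[384]) {\n", + "            indexing: attribute | summary\n", + "            attribute {\n", + "                distance-metric: angular\n", + "            }\n", + "        }\n", + "    }\n", + "    rank-profile default {\n", + "        inputs {\n", + "            query(query_embedding) tensor(x[384])\n", + "        }\n", + "        first-phase {\n", + "            expression: closeness(field, embedding)\n", + "        }\n", + "    }\n", + "}\n", + "```\n",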
+ "\n", + "Now we can deploy this application locally:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "c10dd962", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from vespa.deployment import VespaDocker\n", + "\n", + "vespa_docker = VespaDocker()\n", + "vespa_app = vespa_docker.deploy(application_package=app_package)" + ] + }, + { + "cell_type": "markdown", + "id": "3df4ce53", + "metadata": {}, + "source": [ + "This deploys and creates a connection to a `Vespa` service. In case you\n", + "already have a Vespa application running, for instance in the cloud,\n", + "please refer to the PyVespa documentation for how to connect.\n", + "\n", + "#### Creating a Vespa vector store\n", + "\n", + "Now, let's load some documents:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "from langchain.document_loaders import TextLoader\n", + "from langchain.text_splitter import CharacterTextSplitter\n", + "\n", + "loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n", + "documents = loader.load()\n", + "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n", + "docs = text_splitter.split_documents(documents)\n", + "\n", + "from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings\n", + "\n", + "embedding_function = SentenceTransformerEmbeddings(model_name=\"all-MiniLM-L6-v2\")" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Here, we also set up a local sentence embedder to transform the text into embedding\n", + "vectors. One could also use OpenAI embeddings, but then the vector length needs to\n", + "be updated to `1536` to reflect the larger size of that embedding.\n", + "\n", + "To feed these to Vespa, we need to configure how the vector store should map to\n",
+ "fields in the Vespa application. Then we create the vector store directly from\n", + "this set of documents:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "vespa_config = dict(\n", + " page_content_field=\"text\",\n", + " embedding_field=\"embedding\",\n", + " input_field=\"query_embedding\"\n", + ")\n", + "\n", + "from langchain.vectorstores import VespaStore\n", + "\n", + "db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "This creates a Vespa vector store and feeds that set of documents to Vespa.\n", + "The vector store takes care of calling the embedding function for each document\n", + "and inserting the results into the database.\n", + "\n", + "We can now query the vector store:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7ccca1f4", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "query = \"What did the president say about Ketanji Brown Jackson\"\n", + "results = db.similarity_search(query)\n", + "\n", + "print(results[0].page_content)" + ] + }, + { + "cell_type": "markdown", + "id": "1e7e34e1", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "This will use the embedding function given above to create a representation\n", + "for the query and use that to search Vespa. Note that this will use the\n", + "`default` rank profile, which we set up in the application package\n", + "above. You can use the `ranking` argument to `similarity_search` to\n", + "specify which rank profile to use.\n", + "\n", + "Please refer to the [pyvespa documentation](https://pyvespa.readthedocs.io/en/latest/getting-started-pyvespa.html#Query)\n", + "for more information.\n", + "\n", + "This covers the basic usage of the Vespa store in LangChain.\n", + "You can now return these results and continue using them in LangChain.\n", + "\n", + "#### Updating documents\n", + "\n", + "As an alternative to calling `from_documents`, you can create the vector\n", + "store directly and call `add_texts` on it.\n",
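+ "\n", + "For example, a minimal sketch that reuses the connection and configuration\n", + "from above (the text here is just a placeholder):\n", + "\n", + "```python\n", + "db = VespaStore(vespa_app, embedding_function, **vespa_config)\n", + "ids = db.add_texts([\"A new text to index\"])  # embeds the text and feeds it to Vespa\n", + "```\n",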
+ "\n", + "The same method can also be used to update\n", + "documents:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "query = \"What did the president say about Ketanji Brown Jackson\"\n", + "results = db.similarity_search(query)\n", + "result = results[0]\n", + "\n", + "result.page_content = \"UPDATED: \" + result.page_content\n", + "db.add_texts([result.page_content], [result.metadata], [result.metadata[\"id\"]])\n", + "\n", + "results = db.similarity_search(query)\n", + "print(results[0].page_content)" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "However, the `pyvespa` library contains methods to manipulate\n", + "content on Vespa which you can use directly.\n", + "\n", + "#### Deleting documents\n", + "\n", + "You can delete documents using the `delete` function:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "result = db.similarity_search(query)\n", + "# result[0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n", + "\n", + "db.delete([\"32\"])\n", + "result = db.similarity_search(query)\n", + "# result[0].metadata[\"id\"] != \"id:testapp:testapp::32\"" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Again, the `pyvespa` connection contains methods to delete documents as well.\n", + "\n", + "### Returning with scores\n", + "\n", + "The `similarity_search` method only returns the documents in order of\n", + "relevance. To retrieve the actual scores:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "results = db.similarity_search_with_score(query)\n", + "result = results[0]\n", + "# result[1] ~= 0.463" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "This score comes from using the `\"all-MiniLM-L6-v2\"` embedding model with the\n", + "cosine distance function (as given by the `angular` distance metric in the\n", + "application package).\n", + "\n", + "Different embedding functions need different distance functions, and Vespa\n", + "needs to know which distance function to use when ordering documents.\n", + "Please refer to the\n", + "[documentation on distance functions](https://docs.vespa.ai/en/reference/schema-reference.html#distance-metric)\n", + "for more information.\n", + "\n", + "### As retriever\n", + "\n", + "To use this vector store as a\n", + "[LangChain retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/),\n", + "simply call the `as_retriever` function, which is a standard vector store\n", + "method:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)\n", + "retriever = db.as_retriever()\n", + "query = \"What did the president say about Ketanji Brown Jackson\"\n", + "results = retriever.get_relevant_documents(query)\n", + "\n", + "# results[0].metadata[\"id\"] == \"id:testapp:testapp::32\"" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + {
"cell_type": "markdown", + "source": [ + "This allows for more general, unstructured, retrieval from the vector store.\n", + "\n", + "### Metadata\n", + "\n", + "In the example so far, we've only used the text and the embedding for that\n", + "text. Documents usually contain additional information, which in LangChain\n", + "is referred to as metadata.\n", + "\n", + "Vespa can contain many fields with different types by adding them to the application\n", + "package:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "app_package.schema.add_fields(\n", + " # ...\n", + " Field(name=\"date\", type=\"string\", indexing=[\"attribute\", \"summary\"]),\n", + " Field(name=\"rating\", type=\"int\", indexing=[\"attribute\", \"summary\"]),\n", + " Field(name=\"author\", type=\"string\", indexing=[\"attribute\", \"summary\"]),\n", + " # ...\n", + ")\n", + "vespa_app = vespa_docker.deploy(application_package=app_package)" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "We can add some metadata fields in the documents:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "# Add metadata\n", + "for i, doc in enumerate(docs):\n", + " doc.metadata[\"date\"] = f\"2023-{(i % 12)+1}-{(i % 28)+1}\"\n", + " doc.metadata[\"rating\"] = range(1, 6)[i % 5]\n", + " doc.metadata[\"author\"] = [\"Joe Biden\", \"Unknown\"][min(i, 1)]" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "And let the Vespa vector store know about these fields:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "vespa_config.update(dict(metadata_fields=[\"date\", \"rating\", \"author\"]))" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Now, when searching for these documents, these fields will be returned.\n", + "Also, these fields can be filtered on:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)\n", + "query = \"What did the president say about Ketanji Brown Jackson\"\n", + "results = db.similarity_search(query, filter=\"rating > 3\")\n", + "# results[0].metadata[\"id\"] == \"id:testapp:testapp::34\"\n", + "# results[0].metadata[\"author\"] == \"Unknown\"" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "### Custom query\n", + "\n", + "If the default behavior of the similarity search does not fit your\n", + "requirements, you can always provide your own query. 
+ "That way, you don't need to provide all of the configuration to the\n", + "vector store; you can simply write the query yourself.\n", + "\n", + "First, let's add a BM25 rank profile to our application:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "from vespa.package import FieldSet\n", + "\n", + "app_package.schema.add_field_set(FieldSet(name=\"default\", fields=[\"text\"]))\n", + "app_package.schema.add_rank_profile(RankProfile(name=\"bm25\", first_phase=\"bm25(text)\"))\n", + "vespa_app = vespa_docker.deploy(application_package=app_package)\n", + "db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Then, to perform a regular text search based on BM25:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "query = \"What did the president say about Ketanji Brown Jackson\"\n", + "custom_query = {\n", + " \"yql\": \"select * from sources * where userQuery()\",\n", + " \"query\": query,\n", + " \"type\": \"weakAnd\",\n", + " \"ranking\": \"bm25\",\n", + " \"hits\": 4\n", + "}\n", + "results = db.similarity_search_with_score(query, custom_query=custom_query)\n", + "# results[0][0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n", + "# results[0][1] ~= 14.384" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "All of the powerful search and query capabilities of Vespa are available\n", + "through a custom query. Please refer to the Vespa documentation on its\n", + "[Query API](https://docs.vespa.ai/en/query-api.html) for more details.\n", + "\n", + "### Hybrid search\n", + "\n", + "Hybrid search means using both a classic term-based search such as\n", + "BM25 and a vector search, combining the results. We need to create\n", + "a new rank profile for hybrid search on Vespa:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "app_package.schema.add_rank_profile(\n", + " RankProfile(name=\"hybrid\",\n", + " first_phase=\"log(bm25(text)) + 0.5 * closeness(field, embedding)\",\n", + " inputs=[(\"query(query_embedding)\", \"tensor(x[384])\")]\n", + " )\n", + ")\n", + "vespa_app = vespa_docker.deploy(application_package=app_package)\n", + "db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Here, we score each document as a combination of its BM25 score and its\n", + "distance score.\n",
+ "We can query using a custom query:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "query = \"What did the president say about Ketanji Brown Jackson\"\n", + "query_embedding = embedding_function.embed_query(query)\n", + "nearest_neighbor_expression = \"{targetHits: 4}nearestNeighbor(embedding, query_embedding)\"\n", + "custom_query = {\n", + " \"yql\": f\"select * from sources * where {nearest_neighbor_expression} and userQuery()\",\n", + " \"query\": query,\n", + " \"type\": \"weakAnd\",\n", + " \"input.query(query_embedding)\": query_embedding,\n", + " \"ranking\": \"hybrid\",\n", + " \"hits\": 4\n", + "}\n", + "results = db.similarity_search_with_score(query, custom_query=custom_query)\n", + "# results[0][0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n", + "# results[0][1] ~= 2.897" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "### Native embedders in Vespa\n", + "\n", + "Up until this point, we've used an embedding function in Python to provide\n", + "embeddings for the texts. Vespa supports embedding functions natively, so\n", + "you can defer this calculation to Vespa. One benefit is the ability to use\n", + "GPUs when embedding documents if you have a large collection.\n", + "\n", + "Please refer to [Vespa embeddings](https://docs.vespa.ai/en/embedding.html)\n", + "for more information.\n", + "\n", + "First, we need to modify our application package:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "from vespa.package import Component, Parameter\n", + "\n", + "app_package.components = [\n", + " Component(id=\"hf-embedder\", type=\"hugging-face-embedder\",\n", + " parameters=[\n", + " Parameter(\"transformer-model\", {\"path\": \"...\"}),\n", + " Parameter(\"tokenizer-model\", {\"url\": \"...\"}),\n", + " ]\n", + " )\n", + "]\n", + "app_package.schema.add_fields(\n", + " Field(name=\"hfembedding\", type=\"tensor(x[384])\",\n", + " is_document_field=False,\n", + " indexing=[\"input text\", \"embed hf-embedder\", \"attribute\", \"summary\"],\n", + " attribute=[\"distance-metric: angular\"],\n", + " )\n", + ")\n", + "app_package.schema.add_rank_profile(\n", + " RankProfile(name=\"hf_similarity\",\n", + " first_phase=\"closeness(field, hfembedding)\",\n", + " inputs=[(\"query(query_embedding)\", \"tensor(x[384])\")]\n", + " )\n", + ")" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Please refer to the embeddings documentation on adding embedder models\n", + "and tokenizers to the application.\n",
+ "Note that the `hfembedding` field\n", + "includes instructions for embedding using the `hf-embedder`.\n", + "\n", + "Now we can query with a custom query:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "query = \"What did the president say about Ketanji Brown Jackson\"\n", + "nearest_neighbor_expression = \"{targetHits: 4}nearestNeighbor(hfembedding, query_embedding)\"\n", + "custom_query = {\n", + " \"yql\": f\"select * from sources * where {nearest_neighbor_expression}\",\n", + " \"input.query(query_embedding)\": f\"embed(hf-embedder, \\\"{query}\\\")\",\n", + " \"ranking\": \"hf_similarity\",\n", + " \"hits\": 4\n", + "}\n", + "results = db.similarity_search_with_score(query, custom_query=custom_query)\n", + "# results[0][0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n", + "# results[0][1] ~= 0.630" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Note that the query here includes an `embed` instruction to embed the query\n", + "using the same model as for the documents.\n", + "\n", + "### Approximate nearest neighbor\n", + "\n", + "In all of the above examples, we've used exact nearest neighbor to\n", + "find results. However, for large collections of documents this is\n", + "not feasible, as one has to scan through all documents to find the\n", + "best matches. To avoid this, we can use\n", + "[approximate nearest neighbors](https://docs.vespa.ai/en/approximate-nn-hnsw.html).\n", + "\n", + "First, we can change the embedding field to create an HNSW index:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "from vespa.package import HNSW\n", + "\n", + "app_package.schema.add_fields(\n", + " Field(name=\"embedding\", type=\"tensor(x[384])\",\n", + " indexing=[\"attribute\", \"summary\", \"index\"],\n", + " ann=HNSW(distance_metric=\"angular\", max_links_per_node=16, neighbors_to_explore_at_insert=200)\n", + " )\n", + ")\n" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "This creates an HNSW index on the embedding data, which allows for efficient\n", + "searching.\n",
+ "With this set, we can easily search using ANN by setting\n", + "the `approximate` argument to `True`:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "query = \"What did the president say about Ketanji Brown Jackson\"\n", + "results = db.similarity_search(query, approximate=True)\n", + "# results[0].metadata[\"id\"] == \"id:testapp:testapp::32\"" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "This covers most of the functionality in the Vespa vector store in LangChain.\n", + "\n" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/libs/langchain/langchain/retrievers/vespa_retriever.py b/libs/langchain/langchain/retrievers/vespa_retriever.py index 97c71d998bd..54f64628a50 100644 --- a/libs/langchain/langchain/retrievers/vespa_retriever.py +++ b/libs/langchain/langchain/retrievers/vespa_retriever.py @@ -1,19 +1,16 @@ from __future__ import annotations import json -from typing import TYPE_CHECKING, Any, Dict, List, Literal, Optional, Sequence, Union +from typing import Any, Dict, List, Literal, Optional, Sequence, Union from langchain.callbacks.manager import CallbackManagerForRetrieverRun from langchain.schema import BaseRetriever, Document -if TYPE_CHECKING: - from vespa.application import Vespa - class VespaRetriever(BaseRetriever): """`Vespa` retriever.""" - app: Vespa + app: Any """Vespa application to query.""" body: Dict """Body of the query.""" diff --git a/libs/langchain/langchain/vectorstores/__init__.py b/libs/langchain/langchain/vectorstores/__init__.py index 0b113e04e8c..f066c4eb3cf 100644 --- a/libs/langchain/langchain/vectorstores/__init__.py +++ b/libs/langchain/langchain/vectorstores/__init__.py @@ -76,6 +76,7 @@ from langchain.vectorstores.usearch import USearch from langchain.vectorstores.vald import Vald from langchain.vectorstores.vearch import Vearch from langchain.vectorstores.vectara import Vectara +from langchain.vectorstores.vespa import VespaStore from langchain.vectorstores.weaviate import Weaviate from langchain.vectorstores.zep import ZepVectorStore from langchain.vectorstores.zilliz import Zilliz @@ -143,6 +144,7 @@ __all__ = [ "Vearch", "Vectara", "VectorStore", + "VespaStore", "Weaviate", "ZepVectorStore", "Zilliz", diff --git a/libs/langchain/langchain/vectorstores/vespa.py b/libs/langchain/langchain/vectorstores/vespa.py new file mode 100644 index 00000000000..ae0a7ce7647 --- /dev/null +++ b/libs/langchain/langchain/vectorstores/vespa.py @@ -0,0 +1,267 @@ +from __future__ import annotations + +from typing import Any, Dict, Iterable, List, Optional, Tuple, Type, Union + +from langchain.docstore.document import Document +from langchain.schema.embeddings import Embeddings +from langchain.vectorstores.base import VectorStore, VectorStoreRetriever + + +class VespaStore(VectorStore): + """ + `Vespa` vector store.
+ + To use, you should have the python client library ``pyvespa`` installed. + + Example: + .. code-block:: python + + from langchain.vectorstores import VespaStore + from langchain.embeddings.openai import OpenAIEmbeddings + from vespa.application import Vespa + + # Create a vespa client dependent upon your application, + # e.g. either connecting to Vespa Cloud or a local deployment + # such as Docker. Please refer to the PyVespa documentation on + # how to initialize the client. + + vespa_app = Vespa(url="...", port=..., application_package=...) + + # You need to instruct LangChain on which fields to use for embeddings + vespa_config = dict( + page_content_field="text", + embedding_field="embedding", + input_field="query_embedding", + metadata_fields=["date", "rating", "author"] + ) + + embedding_function = OpenAIEmbeddings() + vectorstore = VespaStore(vespa_app, embedding_function, **vespa_config) + + """ + + def __init__( + self, + app: Any, + embedding_function: Optional[Embeddings] = None, + page_content_field: Optional[str] = None, + embedding_field: Optional[str] = None, + input_field: Optional[str] = None, + metadata_fields: Optional[List[str]] = None, + ) -> None: + """ + Initialize with a PyVespa client. + """ + try: + from vespa.application import Vespa + except ImportError: + raise ImportError( + "Could not import Vespa python package. " + "Please install it with `pip install pyvespa`." + ) + if not isinstance(app, Vespa): + raise ValueError( + f"app should be an instance of vespa.application.Vespa, got {type(app)}" + ) + + self._vespa_app = app + self._embedding_function = embedding_function + self._page_content_field = page_content_field + self._embedding_field = embedding_field + self._input_field = input_field + self._metadata_fields = metadata_fields + + def add_texts( + self, + texts: Iterable[str], + metadatas: Optional[List[dict]] = None, + ids: Optional[List[str]] = None, + **kwargs: Any, + ) -> List[str]: + """ + Add texts to the vectorstore. + + Args: + texts: Iterable of strings to add to the vectorstore. + metadatas: Optional list of metadatas associated with the texts. + ids: Optional list of ids associated with the texts. + kwargs: vectorstore specific parameters + + Returns: + List of ids from adding the texts into the vectorstore. + """ + + embeddings = None + if self._embedding_function is not None: + embeddings = self._embedding_function.embed_documents(list(texts)) + + if ids is None: + ids = [str(i + 1) for i, _ in enumerate(texts)] + + batch = [] + for i, text in enumerate(texts): + fields: Dict[str, Union[str, List[float]]] = {} + if self._page_content_field is not None: + fields[self._page_content_field] = text + if self._embedding_field is not None and embeddings is not None: + fields[self._embedding_field] = embeddings[i] + if metadatas is not None and self._metadata_fields is not None: + for metadata_field in self._metadata_fields: + if metadata_field in metadatas[i]: + fields[metadata_field] = metadatas[i][metadata_field] + batch.append({"id": ids[i], "fields": fields}) + + results = self._vespa_app.feed_batch(batch) + for result in results: + if not str(result.status_code).startswith("2"): + raise RuntimeError( + f"Could not add document to Vespa. "
" + f"Message: {result.json['message']}" + ) + return ids + + def delete(self, ids: Optional[List[str]] = None, **kwargs: Any) -> Optional[bool]: + if ids is None: + return False + batch = [{"id": id} for id in ids] + result = self._vespa_app.delete_batch(batch) + return sum([0 if r.status_code == 200 else 1 for r in result]) == 0 + + def _create_query( + self, query_embedding: List[float], k: int = 4, **kwargs: Any + ) -> Dict: + hits = k + doc_embedding_field = self._embedding_field + input_embedding_field = self._input_field + ranking_function = kwargs["ranking"] if "ranking" in kwargs else "default" + filter = kwargs["filter"] if "filter" in kwargs else None + + approximate = kwargs["approximate"] if "approximate" in kwargs else False + approximate = "true" if approximate else "false" + + yql = "select * from sources * where " + yql += f"{{targetHits: {hits}, approximate: {approximate}}}" + yql += f"nearestNeighbor({doc_embedding_field}, {input_embedding_field})" + if filter is not None: + yql += f" and {filter}" + + query = { + "yql": yql, + f"input.query({input_embedding_field})": query_embedding, + "ranking": ranking_function, + "hits": hits, + } + return query + + def similarity_search_by_vector_with_score( + self, query_embedding: List[float], k: int = 4, **kwargs: Any + ) -> List[Tuple[Document, float]]: + """ + Performs similarity search from a embeddings vector. + + Args: + query_embedding: Embeddings vector to search for. + k: Number of results to return. + custom_query: Use this custom query instead default query (kwargs) + kwargs: other vector store specific parameters + + Returns: + List of ids from adding the texts into the vectorstore. + """ + if "custom_query" in kwargs: + query = kwargs["custom_query"] + else: + query = self._create_query(query_embedding, k, **kwargs) + + try: + response = self._vespa_app.query(body=query) + except Exception as e: + raise RuntimeError( + f"Could not retrieve data from Vespa: " + f"{e.args[0][0]['summary']}. " + f"Error: {e.args[0][0]['message']}" + ) + if not str(response.status_code).startswith("2"): + raise RuntimeError( + f"Could not retrieve data from Vespa. " + f"Error code: {response.status_code}. 
" + f"Message: {response.json['message']}" + ) + + root = response.json["root"] + if "errors" in root: + import json + + raise RuntimeError(json.dumps(root["errors"])) + + if response is None or response.hits is None: + return [] + + docs = [] + for child in response.hits: + page_content = child["fields"][self._page_content_field] + score = child["relevance"] + metadata = {"id": child["id"]} + if self._metadata_fields is not None: + for field in self._metadata_fields: + metadata[field] = child["fields"].get(field) + doc = Document(page_content=page_content, metadata=metadata) + docs.append((doc, score)) + return docs + + def similarity_search_by_vector( + self, embedding: List[float], k: int = 4, **kwargs: Any + ) -> List[Document]: + results = self.similarity_search_by_vector_with_score(embedding, k, **kwargs) + return [r[0] for r in results] + + def similarity_search_with_score( + self, query: str, k: int = 4, **kwargs: Any + ) -> List[Tuple[Document, float]]: + query_emb = [] + if self._embedding_function is not None: + query_emb = self._embedding_function.embed_query(query) + return self.similarity_search_by_vector_with_score(query_emb, k, **kwargs) + + def similarity_search( + self, query: str, k: int = 4, **kwargs: Any + ) -> List[Document]: + results = self.similarity_search_with_score(query, k, **kwargs) + return [r[0] for r in results] + + def max_marginal_relevance_search( + self, + query: str, + k: int = 4, + fetch_k: int = 20, + lambda_mult: float = 0.5, + **kwargs: Any, + ) -> List[Document]: + raise NotImplementedError("MMR search not implemented") + + def max_marginal_relevance_search_by_vector( + self, + embedding: List[float], + k: int = 4, + fetch_k: int = 20, + lambda_mult: float = 0.5, + **kwargs: Any, + ) -> List[Document]: + raise NotImplementedError("MMR search by vector not implemented") + + @classmethod + def from_texts( + cls: Type[VespaStore], + texts: List[str], + embedding: Embeddings, + metadatas: Optional[List[dict]] = None, + ids: Optional[List[str]] = None, + **kwargs: Any, + ) -> VespaStore: + vespa = cls(embedding_function=embedding, **kwargs) + vespa.add_texts(texts=texts, metadatas=metadatas, ids=ids) + return vespa + + def as_retriever(self, **kwargs: Any) -> VectorStoreRetriever: + return super().as_retriever(**kwargs)