mirror of
https://github.com/hwchase17/langchain.git
synced 2025-05-16 12:32:06 +00:00
Add Vespa vector store (#11329)
Addition of the Vespa vector store integration, including a notebook showing its use. Maintainer: @lesters. Twitter handle: LesterSolbakken.
This commit is contained in: parent 58a88f3911, commit a30f98f534
883 docs/extras/integrations/vectorstores/vespa.ipynb Normal file
@ -0,0 +1,883 @@
# Vespa

>[Vespa](https://vespa.ai/) is a fully featured search engine and vector database. It supports vector search (ANN), lexical search, and search in structured data, all in the same query.

This notebook shows how to use `Vespa.ai` as a LangChain vector store.

In order to create the vector store, we use [pyvespa](https://pyvespa.readthedocs.io/en/latest/index.html) to create a connection to a `Vespa` service.
```python
#!pip install pyvespa
```

Using the `pyvespa` package, you can either connect to a [Vespa Cloud instance](https://pyvespa.readthedocs.io/en/latest/deploy-vespa-cloud.html) or a local [Docker instance](https://pyvespa.readthedocs.io/en/latest/deploy-docker.html). Here, we will create a new Vespa application and deploy that using Docker.

#### Creating a Vespa application

First, we need to create an application package:
```python
from vespa.package import ApplicationPackage, Field, RankProfile

app_package = ApplicationPackage(name="testapp")
app_package.schema.add_fields(
    Field(name="text", type="string", indexing=["index", "summary"], index="enable-bm25"),
    Field(name="embedding", type="tensor<float>(x[384])",
          indexing=["attribute", "summary"],
          attribute=["distance-metric: angular"]),
)
app_package.schema.add_rank_profile(
    RankProfile(name="default",
                first_phase="closeness(field, embedding)",
                inputs=[("query(query_embedding)", "tensor<float>(x[384])")]
    )
)
```
This sets up a Vespa application with a schema for each document that contains two fields: `text` for holding the document text and `embedding` for holding the embedding vector. The `text` field is set up to use a BM25 index for efficient text retrieval, and we'll see how to use this and hybrid search a bit later.

The `embedding` field is set up with a vector of length 384 to hold the embedding representation of the text. See [Vespa's Tensor Guide](https://docs.vespa.ai/en/tensor-user-guide.html) for more on tensors in Vespa.

Lastly, we add a [rank profile](https://docs.vespa.ai/en/ranking.html) to instruct Vespa how to order documents. Here we set this up with a [nearest neighbor search](https://docs.vespa.ai/en/nearest-neighbor-search.html).

Now we can deploy this application locally:
```python
from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
vespa_app = vespa_docker.deploy(application_package=app_package)
```
This deploys and creates a connection to a `Vespa` service. In case you already have a Vespa application running, for instance in the cloud, please refer to the PyVespa documentation for how to connect.
#### Creating a Vespa vector store

Now, let's load some documents:
```python
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

loader = TextLoader("../../modules/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
```
Here, we also set up a local sentence embedder to transform the text to embedding vectors. One could also use OpenAI embeddings, but the vector length needs to be updated to `1536` to reflect the larger size of that embedding.
To feed these to Vespa, we need to configure how the vector store should map to fields in the Vespa application. Then we create the vector store directly from this set of documents:
```python
vespa_config = dict(
    page_content_field="text",
    embedding_field="embedding",
    input_field="query_embedding"
)

from langchain.vectorstores import VespaStore

db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)
```
This creates a Vespa vector store and feeds that set of documents to Vespa. The vector store takes care of calling the embedding function for each document and inserts them into the database.

We can now query the vector store:
```python
query = "What did the president say about Ketanji Brown Jackson"
results = db.similarity_search(query)

print(results[0].page_content)
```
This will use the embedding function given above to create a representation for the query and use that to search Vespa. Note that this will use the `default` ranking function, which we set up in the application package above. You can use the `ranking` argument to `similarity_search` to specify which ranking function to use, as sketched below.
Please refer to the [pyvespa documentation](https://pyvespa.readthedocs.io/en/latest/getting-started-pyvespa.html#Query) for more information.

This covers the basic usage of the Vespa store in LangChain. Now you can return the results and continue using these in LangChain.

#### Updating documents

As an alternative to calling `from_documents`, you can create the vector store directly and call `add_texts` on it. This can also be used to update documents:
```python
query = "What did the president say about Ketanji Brown Jackson"
results = db.similarity_search(query)
result = results[0]

result.page_content = "UPDATED: " + result.page_content
db.add_texts([result.page_content], [result.metadata], [result.metadata["id"]])

results = db.similarity_search(query)
print(results[0].page_content)
```
However, the `pyvespa` library contains methods to manipulate content on Vespa which you can use directly, as sketched below.
#### Deleting documents

You can delete documents using the `delete` function:
```python
result = db.similarity_search(query)
# docs[0].metadata["id"] == "id:testapp:testapp::32"

db.delete(["32"])
result = db.similarity_search(query)
# docs[0].metadata["id"] != "id:testapp:testapp::32"
```
Again, the `pyvespa` connection contains methods to delete documents as well, for example:
### Returning with scores

The `similarity_search` method only returns the documents in order of relevancy. To retrieve the actual scores:
```python
results = db.similarity_search_with_score(query)
result = results[0]
# result[1] ~= 0.463
```
This is a result of using the `"all-MiniLM-L6-v2"` embedding model with the cosine distance function (as given by the argument `angular` in the application package above).

Different embedding functions need different distance functions, and Vespa needs to know which distance function to use when ordering documents. Please refer to the [documentation on distance functions](https://docs.vespa.ai/en/reference/schema-reference.html#distance-metric) for more information.
### As retriever

To use this vector store as a [LangChain retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/), simply call the `as_retriever` function, which is a standard vector store method:
```python
db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)
retriever = db.as_retriever()
query = "What did the president say about Ketanji Brown Jackson"
results = retriever.get_relevant_documents(query)

# results[0].metadata["id"] == "id:testapp:testapp::32"
```
This allows for more general, unstructured retrieval from the vector store.

### Metadata

In the examples so far, we've only used the text and the embedding for that text. Documents usually contain additional information, which in LangChain is referred to as metadata.

Vespa can contain many fields with different types by adding them to the application package:
```python
app_package.schema.add_fields(
    # ...
    Field(name="date", type="string", indexing=["attribute", "summary"]),
    Field(name="rating", type="int", indexing=["attribute", "summary"]),
    Field(name="author", type="string", indexing=["attribute", "summary"]),
    # ...
)
vespa_app = vespa_docker.deploy(application_package=app_package)
```
We can add some metadata fields to the documents:
```python
# Add metadata
for i, doc in enumerate(docs):
    doc.metadata["date"] = f"2023-{(i % 12)+1}-{(i % 28)+1}"
    doc.metadata["rating"] = range(1, 6)[i % 5]
    doc.metadata["author"] = ["Joe Biden", "Unknown"][min(i, 1)]
```
And let the Vespa vector store know about these fields:
```python
vespa_config.update(dict(metadata_fields=["date", "rating", "author"]))
```
Now, when searching for these documents, these fields will be returned. Also, these fields can be filtered on:
```python
db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)
query = "What did the president say about Ketanji Brown Jackson"
results = db.similarity_search(query, filter="rating > 3")
# results[0].metadata["id"] == "id:testapp:testapp::34"
# results[0].metadata["author"] == "Unknown"
```
### Custom query

If the default behavior of the similarity search does not fit your requirements, you can always provide your own query. Thus, you don't need to provide all of the configuration to the vector store, but rather just write this yourself.

First, let's add a BM25 ranking function to our application:
```python
from vespa.package import FieldSet

app_package.schema.add_field_set(FieldSet(name="default", fields=["text"]))
app_package.schema.add_rank_profile(RankProfile(name="bm25", first_phase="bm25(text)"))
vespa_app = vespa_docker.deploy(application_package=app_package)
db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)
```
Then, to perform a regular text search based on BM25:
```python
query = "What did the president say about Ketanji Brown Jackson"
custom_query = {
    "yql": "select * from sources * where userQuery()",
    "query": query,
    "type": "weakAnd",
    "ranking": "bm25",
    "hits": 4
}
results = db.similarity_search_with_score(query, custom_query=custom_query)
# results[0][0].metadata["id"] == "id:testapp:testapp::32"
# results[0][1] ~= 14.384
```
All of the powerful search and query capabilities of Vespa can be used through a custom query. Please refer to the Vespa documentation on its [Query API](https://docs.vespa.ai/en/query-api.html) for more details.

### Hybrid search

Hybrid search means using both a classic term-based search such as BM25 and a vector search and combining the results. We need to create a new rank profile for hybrid search on Vespa:
```python
app_package.schema.add_rank_profile(
    RankProfile(name="hybrid",
                first_phase="log(bm25(text)) + 0.5 * closeness(field, embedding)",
                inputs=[("query(query_embedding)", "tensor<float>(x[384])")]
    )
)
vespa_app = vespa_docker.deploy(application_package=app_package)
db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)
```
Here, we score each document as a combination of its BM25 score and its distance score. We can query using a custom query:
```python
query = "What did the president say about Ketanji Brown Jackson"
query_embedding = embedding_function.embed_query(query)
nearest_neighbor_expression = "{targetHits: 4}nearestNeighbor(embedding, query_embedding)"
custom_query = {
    "yql": f"select * from sources * where {nearest_neighbor_expression} and userQuery()",
    "query": query,
    "type": "weakAnd",
    "input.query(query_embedding)": query_embedding,
    "ranking": "hybrid",
    "hits": 4
}
results = db.similarity_search_with_score(query, custom_query=custom_query)
# results[0][0].metadata["id"] == "id:testapp:testapp::32"
# results[0][1] ~= 2.897
```
### Native embedders in Vespa

Up until this point, we've used an embedding function in Python to provide embeddings for the texts. Vespa supports embedding functions natively, so you can defer this calculation to Vespa. One benefit is the ability to use GPUs when embedding documents if you have large collections.

Please refer to [Vespa embeddings](https://docs.vespa.ai/en/embedding.html) for more information.

First, we need to modify our application package:
```python
from vespa.package import Component, Parameter

app_package.components = [
    Component(id="hf-embedder", type="hugging-face-embedder",
              parameters=[
                  Parameter("transformer-model", {"path": "..."}),
                  Parameter("tokenizer-model", {"url": "..."}),
              ]
    )
]
app_package.schema.add_fields(
    Field(name="hfembedding", type="tensor<float>(x[384])",
          is_document_field=False,
          indexing=["input text", "embed hf-embedder", "attribute", "summary"],
          attribute=["distance-metric: angular"],
    )
)
app_package.schema.add_rank_profile(
    RankProfile(name="hf_similarity",
                first_phase="closeness(field, hfembedding)",
                inputs=[("query(query_embedding)", "tensor<float>(x[384])")]
    )
)
```
Please refer to the embeddings documentation on adding embedder models and tokenizers to the application. Note that the `hfembedding` field includes instructions for embedding using the `hf-embedder`.

Now we can query with a custom query:
```python
query = "What did the president say about Ketanji Brown Jackson"
nearest_neighbor_expression = "{targetHits: 4}nearestNeighbor(hfembedding, query_embedding)"
custom_query = {
    "yql": f"select * from sources * where {nearest_neighbor_expression}",
    "input.query(query_embedding)": f"embed(hf-embedder, \"{query}\")",
    "ranking": "hf_similarity",
    "hits": 4
}
results = db.similarity_search_with_score(query, custom_query=custom_query)
# results[0][0].metadata["id"] == "id:testapp:testapp::32"
# results[0][1] ~= 0.630
```
Note that the query here includes an `embed` instruction to embed the query using the same model as for the documents.

### Approximate nearest neighbor

In all of the above examples, we've used exact nearest neighbor to find results. However, for large collections of documents this is not feasible, as one has to scan through all documents to find the best matches. To avoid this, we can use [approximate nearest neighbors](https://docs.vespa.ai/en/approximate-nn-hnsw.html).

First, we can change the embedding field to create an HNSW index:
```python
from vespa.package import HNSW

app_package.schema.add_fields(
    Field(name="embedding", type="tensor<float>(x[384])",
          indexing=["attribute", "summary", "index"],
          ann=HNSW(distance_metric="angular", max_links_per_node=16, neighbors_to_explore_at_insert=200)
    )
)
```
This creates an HNSW index on the embedding data which allows for efficient searching. With this set, we can easily search using ANN by setting the `approximate` argument to `True`:
```python
query = "What did the president say about Ketanji Brown Jackson"
results = db.similarity_search(query, approximate=True)
# results[0].metadata["id"] == "id:testapp:testapp::32"
```
This covers most of the functionality in the Vespa vector store in LangChain.
@ -1,19 +1,16 @@
 from __future__ import annotations

 import json
-from typing import TYPE_CHECKING, Any, Dict, List, Literal, Optional, Sequence, Union
+from typing import Any, Dict, List, Literal, Optional, Sequence, Union

 from langchain.callbacks.manager import CallbackManagerForRetrieverRun
 from langchain.schema import BaseRetriever, Document

-if TYPE_CHECKING:
-    from vespa.application import Vespa
-

 class VespaRetriever(BaseRetriever):
     """`Vespa` retriever."""

-    app: Vespa
+    app: Any
     """Vespa application to query."""
     body: Dict
     """Body of the query."""
@ -76,6 +76,7 @@ from langchain.vectorstores.usearch import USearch
 from langchain.vectorstores.vald import Vald
 from langchain.vectorstores.vearch import Vearch
 from langchain.vectorstores.vectara import Vectara
+from langchain.vectorstores.vespa import VespaStore
 from langchain.vectorstores.weaviate import Weaviate
 from langchain.vectorstores.zep import ZepVectorStore
 from langchain.vectorstores.zilliz import Zilliz
@ -143,6 +144,7 @@ __all__ = [
     "Vearch",
     "Vectara",
     "VectorStore",
+    "VespaStore",
     "Weaviate",
     "ZepVectorStore",
     "Zilliz",
267 libs/langchain/langchain/vectorstores/vespa.py Normal file
@ -0,0 +1,267 @@
```python
from __future__ import annotations

from typing import Any, Dict, Iterable, List, Optional, Tuple, Type, Union

from langchain.docstore.document import Document
from langchain.schema.embeddings import Embeddings
from langchain.vectorstores.base import VectorStore, VectorStoreRetriever


class VespaStore(VectorStore):
    """
    `Vespa` vector store.

    To use, you should have the python client library ``pyvespa`` installed.

    Example:
        .. code-block:: python

            from langchain.vectorstores import VespaStore
            from langchain.embeddings.openai import OpenAIEmbeddings
            from vespa.application import Vespa

            # Create a vespa client dependent upon your application,
            # e.g. either connecting to Vespa Cloud or a local deployment
            # such as Docker. Please refer to the PyVespa documentation on
            # how to initialize the client.

            vespa_app = Vespa(url="...", port=..., application_package=...)

            # You need to instruct LangChain on which fields to use for embeddings
            vespa_config = dict(
                page_content_field="text",
                embedding_field="embedding",
                input_field="query_embedding",
                metadata_fields=["date", "rating", "author"]
            )

            embedding_function = OpenAIEmbeddings()
            vectorstore = VespaStore(vespa_app, embedding_function, **vespa_config)

    """

    def __init__(
        self,
        app: Any,
        embedding_function: Optional[Embeddings] = None,
        page_content_field: Optional[str] = None,
        embedding_field: Optional[str] = None,
        input_field: Optional[str] = None,
        metadata_fields: Optional[List[str]] = None,
    ) -> None:
        """
        Initialize with a PyVespa client.
        """
        try:
            from vespa.application import Vespa
        except ImportError:
            raise ImportError(
                "Could not import Vespa python package. "
                "Please install it with `pip install pyvespa`."
            )
        if not isinstance(app, Vespa):
            raise ValueError(
                f"app should be an instance of vespa.application.Vespa, got {type(app)}"
            )

        self._vespa_app = app
        self._embedding_function = embedding_function
        self._page_content_field = page_content_field
        self._embedding_field = embedding_field
        self._input_field = input_field
        self._metadata_fields = metadata_fields

    def add_texts(
        self,
        texts: Iterable[str],
        metadatas: Optional[List[dict]] = None,
        ids: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> List[str]:
        """
        Add texts to the vectorstore.

        Args:
            texts: Iterable of strings to add to the vectorstore.
            metadatas: Optional list of metadatas associated with the texts.
            ids: Optional list of ids associated with the texts.
            kwargs: vectorstore specific parameters

        Returns:
            List of ids from adding the texts into the vectorstore.
        """
        embeddings = None
        if self._embedding_function is not None:
            embeddings = self._embedding_function.embed_documents(list(texts))

        if ids is None:
            ids = [str(i + 1) for i, _ in enumerate(texts)]

        batch = []
        for i, text in enumerate(texts):
            fields: Dict[str, Union[str, List[float]]] = {}
            if self._page_content_field is not None:
                fields[self._page_content_field] = text
            if self._embedding_field is not None and embeddings is not None:
                fields[self._embedding_field] = embeddings[i]
            if metadatas is not None and self._metadata_fields is not None:
                for metadata_field in self._metadata_fields:
                    if metadata_field in metadatas[i]:
                        fields[metadata_field] = metadatas[i][metadata_field]
            batch.append({"id": ids[i], "fields": fields})

        results = self._vespa_app.feed_batch(batch)
        for result in results:
            if not (str(result.status_code).startswith("2")):
                raise RuntimeError(
                    f"Could not add document to Vespa. "
                    f"Error code: {result.status_code}. "
                    f"Message: {result.json['message']}"
                )
        return ids

    def delete(self, ids: Optional[List[str]] = None, **kwargs: Any) -> Optional[bool]:
        if ids is None:
            return False
        batch = [{"id": id} for id in ids]
        result = self._vespa_app.delete_batch(batch)
        return sum([0 if r.status_code == 200 else 1 for r in result]) == 0

    def _create_query(
        self, query_embedding: List[float], k: int = 4, **kwargs: Any
    ) -> Dict:
        hits = k
        doc_embedding_field = self._embedding_field
        input_embedding_field = self._input_field
        ranking_function = kwargs["ranking"] if "ranking" in kwargs else "default"
        filter = kwargs["filter"] if "filter" in kwargs else None

        approximate = kwargs["approximate"] if "approximate" in kwargs else False
        approximate = "true" if approximate else "false"

        yql = "select * from sources * where "
        yql += f"{{targetHits: {hits}, approximate: {approximate}}}"
        yql += f"nearestNeighbor({doc_embedding_field}, {input_embedding_field})"
        if filter is not None:
            yql += f" and {filter}"

        query = {
            "yql": yql,
            f"input.query({input_embedding_field})": query_embedding,
            "ranking": ranking_function,
            "hits": hits,
        }
        return query

    def similarity_search_by_vector_with_score(
        self, query_embedding: List[float], k: int = 4, **kwargs: Any
    ) -> List[Tuple[Document, float]]:
        """
        Performs similarity search from an embeddings vector.

        Args:
            query_embedding: Embeddings vector to search for.
            k: Number of results to return.
            custom_query: Use this custom query instead of the default query (kwargs)
            kwargs: other vector store specific parameters

        Returns:
            List of (document, score) tuples, ordered by relevance.
        """
        if "custom_query" in kwargs:
            query = kwargs["custom_query"]
        else:
            query = self._create_query(query_embedding, k, **kwargs)

        try:
            response = self._vespa_app.query(body=query)
        except Exception as e:
            raise RuntimeError(
                f"Could not retrieve data from Vespa: "
                f"{e.args[0][0]['summary']}. "
                f"Error: {e.args[0][0]['message']}"
            )
        if not str(response.status_code).startswith("2"):
            raise RuntimeError(
                f"Could not retrieve data from Vespa. "
                f"Error code: {response.status_code}. "
                f"Message: {response.json['message']}"
            )

        root = response.json["root"]
        if "errors" in root:
            import json

            raise RuntimeError(json.dumps(root["errors"]))

        if response is None or response.hits is None:
            return []

        docs = []
        for child in response.hits:
            page_content = child["fields"][self._page_content_field]
            score = child["relevance"]
            metadata = {"id": child["id"]}
            if self._metadata_fields is not None:
                for field in self._metadata_fields:
                    metadata[field] = child["fields"].get(field)
            doc = Document(page_content=page_content, metadata=metadata)
            docs.append((doc, score))
        return docs

    def similarity_search_by_vector(
        self, embedding: List[float], k: int = 4, **kwargs: Any
    ) -> List[Document]:
        results = self.similarity_search_by_vector_with_score(embedding, k, **kwargs)
        return [r[0] for r in results]

    def similarity_search_with_score(
        self, query: str, k: int = 4, **kwargs: Any
    ) -> List[Tuple[Document, float]]:
        query_emb = []
        if self._embedding_function is not None:
            query_emb = self._embedding_function.embed_query(query)
        return self.similarity_search_by_vector_with_score(query_emb, k, **kwargs)

    def similarity_search(
        self, query: str, k: int = 4, **kwargs: Any
    ) -> List[Document]:
        results = self.similarity_search_with_score(query, k, **kwargs)
        return [r[0] for r in results]

    def max_marginal_relevance_search(
        self,
        query: str,
        k: int = 4,
        fetch_k: int = 20,
        lambda_mult: float = 0.5,
        **kwargs: Any,
    ) -> List[Document]:
        raise NotImplementedError("MMR search not implemented")

    def max_marginal_relevance_search_by_vector(
        self,
        embedding: List[float],
        k: int = 4,
        fetch_k: int = 20,
        lambda_mult: float = 0.5,
        **kwargs: Any,
    ) -> List[Document]:
        raise NotImplementedError("MMR search by vector not implemented")

    @classmethod
    def from_texts(
        cls: Type[VespaStore],
        texts: List[str],
        embedding: Embeddings,
        metadatas: Optional[List[dict]] = None,
        ids: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> VespaStore:
        vespa = cls(embedding_function=embedding, **kwargs)
        vespa.add_texts(texts=texts, metadatas=metadatas, ids=ids)
        return vespa

    def as_retriever(self, **kwargs: Any) -> VectorStoreRetriever:
        return super().as_retriever(**kwargs)
```