mirror of
https://github.com/hwchase17/langchain.git
synced 2025-05-15 20:12:30 +00:00
Add Vespa vector store (#11329)
Addition of Vespa vector store integration including notebook showing its use. Maintainer: @lesters Twitter handle: LesterSolbakken
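The diff below shows `VespaStore.add_texts` assembling one field dictionary per text before feeding it to Vespa. As a rough, dependency-free sketch of that mapping (not the committed implementation): the helper name `build_feed_batch`, the metadata-copy rule, and the `{"id": ..., "fields": ...}` record shape are assumptions made for illustration, since the diff is truncated before that point.

```python
from typing import Any, Dict, Iterable, List, Optional


def build_feed_batch(
    texts: Iterable[str],
    embeddings: Optional[List[List[float]]] = None,
    metadatas: Optional[List[dict]] = None,
    ids: Optional[List[str]] = None,
    page_content_field: Optional[str] = "text",
    embedding_field: Optional[str] = "embedding",
    metadata_fields: Optional[List[str]] = None,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: map texts to per-document field dictionaries."""
    texts = list(texts)  # materialize once so a generator is not exhausted
    if ids is None:
        # Default ids are "1", "2", ... as in the diff below.
        ids = [str(i + 1) for i in range(len(texts))]

    batch: List[Dict[str, Any]] = []
    for i, text in enumerate(texts):
        fields: Dict[str, Any] = {}
        if page_content_field is not None:
            fields[page_content_field] = text
        if embedding_field is not None and embeddings is not None:
            fields[embedding_field] = embeddings[i]
        if metadatas is not None and metadata_fields is not None:
            # Assumption: copy only the configured metadata keys that exist.
            for name in metadata_fields:
                if name in metadatas[i]:
                    fields[name] = metadatas[i][name]
        # Assumption: one record per document, keyed by its id.
        batch.append({"id": ids[i], "fields": fields})
    return batch
```

Only the metadata keys listed in `metadata_fields` are copied, which is why the notebook has to register `date`, `rating` and `author` in `vespa_config` before they show up in results.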
parent
58a88f3911
commit
a30f98f534
883
docs/extras/integrations/vectorstores/vespa.ipynb
Normal file
@ -0,0 +1,883 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ce0f17b9",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Vespa\n",
|
||||
"\n",
|
||||
">[Vespa](https://vespa.ai/) is a fully featured search engine and vector database. It supports vector search (ANN), lexical search, and search in structured data, all in the same query.\n",
|
||||
"\n",
|
||||
"This notebook shows how to use `Vespa.ai` as a LangChain vector store.\n",
|
||||
"\n",
|
||||
"In order to create the vector store, we use\n",
|
||||
"[pyvespa](https://pyvespa.readthedocs.io/en/latest/index.html) to create a\n",
|
||||
"connection to a `Vespa` service."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "7e6a11ab-38bd-4920-ba11-60cb2f075754",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#!pip install pyvespa"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Using the `pyvespa` package, you can either connect to a\n",
|
||||
"[Vespa Cloud instance](https://pyvespa.readthedocs.io/en/latest/deploy-vespa-cloud.html)\n",
|
||||
"or a local\n",
|
||||
"[Docker instance](https://pyvespa.readthedocs.io/en/latest/deploy-docker.html).\n",
|
||||
"Here, we will create a new Vespa application and deploy that using Docker.\n",
|
||||
"\n",
|
||||
"#### Creating a Vespa application\n",
|
||||
"\n",
|
||||
"First, we need to create an application package:"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from vespa.package import ApplicationPackage, Field, RankProfile\n",
|
||||
"\n",
|
||||
"app_package = ApplicationPackage(name=\"testapp\")\n",
|
||||
"app_package.schema.add_fields(\n",
|
||||
" Field(name=\"text\", type=\"string\", indexing=[\"index\", \"summary\"], index=\"enable-bm25\"),\n",
|
||||
" Field(name=\"embedding\", type=\"tensor<float>(x[384])\",\n",
|
||||
" indexing=[\"attribute\", \"summary\"],\n",
|
||||
" attribute=[\"distance-metric: angular\"]),\n",
|
||||
")\n",
|
||||
"app_package.schema.add_rank_profile(\n",
|
||||
" RankProfile(name=\"default\",\n",
|
||||
" first_phase=\"closeness(field, embedding)\",\n",
|
||||
" inputs=[(\"query(query_embedding)\", \"tensor<float>(x[384])\")]\n",
|
||||
" )\n",
|
||||
")"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"This sets up a Vespa application with a schema for each document that contains\n",
|
||||
"two fields: `text` for holding the document text and `embedding` for holding\n",
|
||||
"the embedding vector. The `text` field is set up to use a BM25 index for\n",
|
||||
"efficient text retrieval, and we'll see how to use this and hybrid search a\n",
|
||||
"bit later.\n",
|
||||
"\n",
|
||||
"The `embedding` field is set up with a vector of length 384 to hold the\n",
|
||||
"embedding representation of the text. See\n",
|
||||
"[Vespa's Tensor Guide](https://docs.vespa.ai/en/tensor-user-guide.html)\n",
|
||||
"for more on tensors in Vespa.\n",
|
||||
"\n",
|
||||
"Lastly, we add a [rank profile](https://docs.vespa.ai/en/ranking.html) to\n",
|
||||
"instruct Vespa how to order documents. Here we set this up with a\n",
|
||||
"[nearest neighbor search](https://docs.vespa.ai/en/nearest-neighbor-search.html).\n",
|
||||
"\n",
|
||||
"Now we can deploy this application locally:"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "c10dd962",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from vespa.deployment import VespaDocker\n",
|
||||
"\n",
|
||||
"vespa_docker = VespaDocker()\n",
|
||||
"vespa_app = vespa_docker.deploy(application_package=app_package)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "3df4ce53",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This deploys and creates a connection to a `Vespa` service. In case you\n",
|
||||
"already have a Vespa application running, for instance in the cloud,\n",
|
||||
"please refer to the PyVespa documentation for how to connect.\n",
|
||||
"\n",
|
||||
"#### Creating a Vespa vector store\n",
|
||||
"\n",
|
||||
"Now, let's load some documents:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from langchain.document_loaders import TextLoader\n",
|
||||
"from langchain.text_splitter import CharacterTextSplitter\n",
|
||||
"\n",
|
||||
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n",
|
||||
"documents = loader.load()\n",
|
||||
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
|
||||
"docs = text_splitter.split_documents(documents)\n",
|
||||
"\n",
|
||||
"from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings\n",
|
||||
"\n",
|
||||
"embedding_function = SentenceTransformerEmbeddings(model_name=\"all-MiniLM-L6-v2\")"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Here, we also set up a local sentence embedder to transform the text to embedding\n",
|
||||
"vectors. One could also use OpenAI embeddings, but the vector length needs to\n",
|
||||
"be updated to `1536` to reflect the larger size of that embedding.\n",
|
||||
"\n",
|
||||
"To feed these to Vespa, we need to configure how the vector store should map to\n",
|
||||
"fields in the Vespa application. Then we create the vector store directly from\n",
|
||||
"this set of documents:"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"vespa_config = dict(\n",
|
||||
" page_content_field=\"text\",\n",
|
||||
" embedding_field=\"embedding\",\n",
|
||||
" input_field=\"query_embedding\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"from langchain.vectorstores import VespaStore\n",
|
||||
"\n",
|
||||
"db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"This creates a Vespa vector store and feeds that set of documents to Vespa.\n",
|
||||
"The vector store takes care of calling the embedding function for each document\n",
|
||||
"and inserts them into the database.\n",
|
||||
"\n",
|
||||
"We can now query the vector store:"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "7ccca1f4",
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||||
"results = db.similarity_search(query)\n",
|
||||
"\n",
|
||||
"print(results[0].page_content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "1e7e34e1",
|
||||
"metadata": {
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"This will use the embedding function given above to create a representation\n",
|
||||
"for the query and use that to search Vespa. Note that this will use the\n",
|
||||
"`default` ranking function, which we set up in the application package\n",
|
||||
"above. You can use the `ranking` argument to `similarity_search` to\n",
|
||||
"specify which ranking function to use.\n",
|
||||
"\n",
|
||||
"Please refer to the [pyvespa documentation](https://pyvespa.readthedocs.io/en/latest/getting-started-pyvespa.html#Query)\n",
|
||||
"for more information.\n",
|
||||
"\n",
|
||||
"This covers the basic usage of the Vespa store in LangChain.\n",
|
||||
"Now you can return the results and continue using these in LangChain.\n",
|
||||
"\n",
|
||||
"#### Updating documents\n",
|
||||
"\n",
|
||||
"As an alternative to calling `from_documents`, you can create the vector\n",
|
||||
"store directly and call `add_texts` from that. This can also be used to update\n",
|
||||
"documents:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||||
"results = db.similarity_search(query)\n",
|
||||
"result = results[0]\n",
|
||||
"\n",
|
||||
"result.page_content = \"UPDATED: \" + result.page_content\n",
|
||||
"db.add_texts([result.page_content], [result.metadata], [result.metadata[\"id\"]])\n",
|
||||
"\n",
|
||||
"results = db.similarity_search(query)\n",
|
||||
"print(results[0].page_content)"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"However, the `pyvespa` library contains methods to manipulate\n",
|
||||
"content on Vespa which you can use directly.\n",
|
||||
"\n",
|
||||
"#### Deleting documents\n",
|
||||
"\n",
|
||||
"You can delete documents using the `delete` function:"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"result = db.similarity_search(query)\n",
|
||||
"# result[0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n",
|
||||
"\n",
|
||||
"db.delete([\"32\"])\n",
|
||||
"result = db.similarity_search(query)\n",
|
||||
"# result[0].metadata[\"id\"] != \"id:testapp:testapp::32\""
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Again, the `pyvespa` connection contains methods to delete documents as well.\n",
|
||||
"\n",
|
||||
"### Returning with scores\n",
|
||||
"\n",
|
||||
"The `similarity_search` method only returns the documents in order of\n",
|
||||
"relevancy. To retrieve the actual scores:"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"results = db.similarity_search_with_score(query)\n",
|
||||
"result = results[0]\n",
|
||||
"# result[1] ~= 0.463"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"This is the result of using the `\"all-MiniLM-L6-v2\"` embedding model with the\n",
|
||||
"cosine distance function (as given by the `angular` distance metric in the\n",
|
||||
"application package).\n",
|
||||
"\n",
|
||||
"Different embedding functions need different distance functions, and Vespa\n",
|
||||
"needs to know which distance function to use when ordering documents.\n",
|
||||
"Please refer to the\n",
|
||||
"[documentation on distance functions](https://docs.vespa.ai/en/reference/schema-reference.html#distance-metric)\n",
|
||||
"for more information.\n",
|
||||
"\n",
|
||||
"### As retriever\n",
|
||||
"\n",
|
||||
"To use this vector store as a\n",
|
||||
"[LangChain retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/)\n",
|
||||
"simply call the `as_retriever` function, which is a standard vector store\n",
|
||||
"method:"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)\n",
|
||||
"retriever = db.as_retriever()\n",
|
||||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||||
"results = retriever.get_relevant_documents(query)\n",
|
||||
"\n",
|
||||
"# results[0].metadata[\"id\"] == \"id:testapp:testapp::32\""
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"This allows for more general, unstructured, retrieval from the vector store.\n",
|
||||
"\n",
|
||||
"### Metadata\n",
|
||||
"\n",
|
||||
"In the example so far, we've only used the text and the embedding for that\n",
|
||||
"text. Documents usually contain additional information, which in LangChain\n",
|
||||
"is referred to as metadata.\n",
|
||||
"\n",
|
||||
"Vespa can contain many fields with different types by adding them to the application\n",
|
||||
"package:"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"app_package.schema.add_fields(\n",
|
||||
" # ...\n",
|
||||
" Field(name=\"date\", type=\"string\", indexing=[\"attribute\", \"summary\"]),\n",
|
||||
" Field(name=\"rating\", type=\"int\", indexing=[\"attribute\", \"summary\"]),\n",
|
||||
" Field(name=\"author\", type=\"string\", indexing=[\"attribute\", \"summary\"]),\n",
|
||||
" # ...\n",
|
||||
")\n",
|
||||
"vespa_app = vespa_docker.deploy(application_package=app_package)"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"We can add some metadata fields in the documents:"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Add metadata\n",
|
||||
"for i, doc in enumerate(docs):\n",
|
||||
" doc.metadata[\"date\"] = f\"2023-{(i % 12)+1}-{(i % 28)+1}\"\n",
|
||||
" doc.metadata[\"rating\"] = range(1, 6)[i % 5]\n",
|
||||
" doc.metadata[\"author\"] = [\"Joe Biden\", \"Unknown\"][min(i, 1)]"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"And let the Vespa vector store know about these fields:"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"vespa_config.update(dict(metadata_fields=[\"date\", \"rating\", \"author\"]))"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Now, when searching for these documents, these fields will be returned.\n",
|
||||
"Also, these fields can be filtered on:"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)\n",
|
||||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||||
"results = db.similarity_search(query, filter=\"rating > 3\")\n",
|
||||
"# results[0].metadata[\"id\"] == \"id:testapp:testapp::34\"\n",
|
||||
"# results[0].metadata[\"author\"] == \"Unknown\""
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"### Custom query\n",
|
||||
"\n",
|
||||
"If the default behavior of the similarity search does not fit your\n",
|
||||
"requirements, you can always provide your own query. Thus, you don't\n",
|
||||
"need to provide all of the configuration to the vector store, but\n",
|
||||
"rather just write this yourself.\n",
|
||||
"\n",
|
||||
"First, let's add a BM25 ranking function to our application:"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from vespa.package import FieldSet\n",
|
||||
"\n",
|
||||
"app_package.schema.add_field_set(FieldSet(name=\"default\", fields=[\"text\"]))\n",
|
||||
"app_package.schema.add_rank_profile(RankProfile(name=\"bm25\", first_phase=\"bm25(text)\"))\n",
|
||||
"vespa_app = vespa_docker.deploy(application_package=app_package)\n",
|
||||
"db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Then, to perform a regular text search based on BM25:"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||||
"custom_query = {\n",
|
||||
" \"yql\": \"select * from sources * where userQuery()\",\n",
|
||||
" \"query\": query,\n",
|
||||
" \"type\": \"weakAnd\",\n",
|
||||
" \"ranking\": \"bm25\",\n",
|
||||
" \"hits\": 4\n",
|
||||
"}\n",
|
||||
"results = db.similarity_search_with_score(query, custom_query=custom_query)\n",
|
||||
"# results[0][0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n",
|
||||
"# results[0][1] ~= 14.384"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"All of the powerful search and query capabilities of Vespa can be used\n",
|
||||
"by using a custom query. Please refer to the Vespa documentation on its\n",
|
||||
"[Query API](https://docs.vespa.ai/en/query-api.html) for more details.\n",
|
||||
"\n",
|
||||
"### Hybrid search\n",
|
||||
"\n",
|
||||
"Hybrid search means using both a classic term-based search such as\n",
|
||||
"BM25 and a vector search and combining the results. We need to create\n",
|
||||
"a new rank profile for hybrid search on Vespa:"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"app_package.schema.add_rank_profile(\n",
|
||||
" RankProfile(name=\"hybrid\",\n",
|
||||
" first_phase=\"log(bm25(text)) + 0.5 * closeness(field, embedding)\",\n",
|
||||
" inputs=[(\"query(query_embedding)\", \"tensor<float>(x[384])\")]\n",
|
||||
" )\n",
|
||||
")\n",
|
||||
"vespa_app = vespa_docker.deploy(application_package=app_package)\n",
|
||||
"db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Here, we score each document as a combination of its BM25 score and its\n",
|
||||
"distance score. We can query using a custom query:"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||||
"query_embedding = embedding_function.embed_query(query)\n",
|
||||
"nearest_neighbor_expression = \"{targetHits: 4}nearestNeighbor(embedding, query_embedding)\"\n",
|
||||
"custom_query = {\n",
|
||||
" \"yql\": f\"select * from sources * where {nearest_neighbor_expression} and userQuery()\",\n",
|
||||
" \"query\": query,\n",
|
||||
" \"type\": \"weakAnd\",\n",
|
||||
" \"input.query(query_embedding)\": query_embedding,\n",
|
||||
" \"ranking\": \"hybrid\",\n",
|
||||
" \"hits\": 4\n",
|
||||
"}\n",
|
||||
"results = db.similarity_search_with_score(query, custom_query=custom_query)\n",
|
||||
"# results[0][0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n",
|
||||
"# results[0][1] ~= 2.897"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"### Native embedders in Vespa\n",
|
||||
"\n",
|
||||
"Up until this point we've used an embedding function in Python to provide\n",
|
||||
"embeddings for the texts. Vespa supports embedding functions natively, so\n",
|
||||
"you can defer this calculation to Vespa. One benefit is the ability to use\n",
|
||||
"GPUs when embedding documents if you have a large collection.\n",
|
||||
"\n",
|
||||
"Please refer to [Vespa embeddings](https://docs.vespa.ai/en/embedding.html)\n",
|
||||
"for more information.\n",
|
||||
"\n",
|
||||
"First, we need to modify our application package:"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from vespa.package import Component, Parameter\n",
|
||||
"\n",
|
||||
"app_package.components = [\n",
|
||||
" Component(id=\"hf-embedder\", type=\"hugging-face-embedder\",\n",
|
||||
" parameters=[\n",
|
||||
" Parameter(\"transformer-model\", {\"path\": \"...\"}),\n",
|
||||
" Parameter(\"tokenizer-model\", {\"url\": \"...\"}),\n",
|
||||
" ]\n",
|
||||
" )\n",
|
||||
"]\n",
|
||||
"app_package.schema.add_fields(\n",
|
||||
"    Field(name=\"hfembedding\", type=\"tensor<float>(x[384])\",\n",
|
||||
"          is_document_field=False,\n",
|
||||
"          indexing=[\"input text\", \"embed hf-embedder\", \"attribute\", \"summary\"],\n",
|
||||
"          attribute=[\"distance-metric: angular\"],\n",
|
||||
"          )\n",
|
||||
")\n",
|
||||
"app_package.schema.add_rank_profile(\n",
|
||||
" RankProfile(name=\"hf_similarity\",\n",
|
||||
" first_phase=\"closeness(field, hfembedding)\",\n",
|
||||
" inputs=[(\"query(query_embedding)\", \"tensor<float>(x[384])\")]\n",
|
||||
" )\n",
|
||||
")"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Please refer to the embeddings documentation on adding embedder models\n",
|
||||
"and tokenizers to the application. Note that the `hfembedding` field\n",
|
||||
"includes instructions for embedding using the `hf-embedder`.\n",
|
||||
"\n",
|
||||
"Now we can query with a custom query:"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||||
"nearest_neighbor_expression = \"{targetHits: 4}nearestNeighbor(hfembedding, query_embedding)\"\n",
|
||||
"custom_query = {\n",
|
||||
" \"yql\": f\"select * from sources * where {nearest_neighbor_expression}\",\n",
|
||||
" \"input.query(query_embedding)\": f\"embed(hf-embedder, \\\"{query}\\\")\",\n",
|
||||
" \"ranking\": \"hf_similarity\",\n",
|
||||
" \"hits\": 4\n",
|
||||
"}\n",
|
||||
"results = db.similarity_search_with_score(query, custom_query=custom_query)\n",
|
||||
"# results[0][0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n",
|
||||
"# results[0][1] ~= 0.630"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Note that the query here includes an `embed` instruction to embed the query\n",
|
||||
"using the same model as for the documents.\n",
|
||||
"\n",
|
||||
"### Approximate nearest neighbor\n",
|
||||
"\n",
|
||||
"In all of the above examples, we've used exact nearest neighbor to\n",
|
||||
"find results. However, for large collections of documents this is\n",
|
||||
"not feasible as one has to scan through all documents to find the\n",
|
||||
"best matches. To avoid this, we can use\n",
|
||||
"[approximate nearest neighbors](https://docs.vespa.ai/en/approximate-nn-hnsw.html).\n",
|
||||
"\n",
|
||||
"First, we can change the embedding field to create an HNSW index:"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from vespa.package import HNSW\n",
|
||||
"\n",
|
||||
"app_package.schema.add_fields(\n",
|
||||
" Field(name=\"embedding\", type=\"tensor<float>(x[384])\",\n",
|
||||
" indexing=[\"attribute\", \"summary\", \"index\"],\n",
|
||||
" ann=HNSW(distance_metric=\"angular\", max_links_per_node=16, neighbors_to_explore_at_insert=200)\n",
|
||||
" )\n",
|
||||
")\n"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"This creates an HNSW index on the embedding data which allows for efficient\n",
|
||||
"searching. With this set, we can easily search using ANN by setting\n",
|
||||
"the `approximate` argument to `True`:"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||||
"results = db.similarity_search(query, approximate=True)\n",
|
||||
"# results[0].metadata[\"id\"] == \"id:testapp:testapp::32\""
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%%\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"This covers most of the functionality in the Vespa vector store in LangChain.\n",
|
||||
"\n"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.6"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
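The notebook above ranks with `closeness(field, embedding)` and with the hybrid expression `log(bm25(text)) + 0.5 * closeness(field, embedding)`. The following is a small pure-Python sketch of that arithmetic, not Vespa's implementation. It assumes, per the Vespa ranking documentation as understood here, that closeness is `1 / (1 + distance)`, that the `angular` metric is the arccosine of the cosine similarity, and that `log` is the natural logarithm. The function names are illustrative only.

```python
import math
from typing import Sequence


def angular_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Angle in radians between two non-zero vectors (assumed `angular` metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cosine)


def closeness(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed definition: closeness = 1 / (1 + distance)."""
    return 1.0 / (1.0 + angular_distance(a, b))


def hybrid_score(bm25_score: float, closeness_score: float) -> float:
    """The `hybrid` first-phase expression from the notebook, in plain Python."""
    return math.log(bm25_score) + 0.5 * closeness_score


# Plugging in the scores quoted in the notebook for document 32:
# BM25 ~= 14.384 and closeness ~= 0.463.
score = hybrid_score(14.384, 0.463)
```

Under these assumptions the combined value comes out close to the hybrid score of about 2.897 quoted in the notebook, which suggests the three reported numbers are mutually consistent.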
|
@ -1,19 +1,16 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from typing import TYPE_CHECKING, Any, Dict, List, Literal, Optional, Sequence, Union
|
||||
from typing import Any, Dict, List, Literal, Optional, Sequence, Union
|
||||
|
||||
from langchain.callbacks.manager import CallbackManagerForRetrieverRun
|
||||
from langchain.schema import BaseRetriever, Document
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from vespa.application import Vespa
|
||||
|
||||
|
||||
class VespaRetriever(BaseRetriever):
|
||||
"""`Vespa` retriever."""
|
||||
|
||||
app: Vespa
|
||||
app: Any
|
||||
"""Vespa application to query."""
|
||||
body: Dict
|
||||
"""Body of the query."""
|
||||
|
@ -76,6 +76,7 @@ from langchain.vectorstores.usearch import USearch
|
||||
from langchain.vectorstores.vald import Vald
|
||||
from langchain.vectorstores.vearch import Vearch
|
||||
from langchain.vectorstores.vectara import Vectara
|
||||
from langchain.vectorstores.vespa import VespaStore
|
||||
from langchain.vectorstores.weaviate import Weaviate
|
||||
from langchain.vectorstores.zep import ZepVectorStore
|
||||
from langchain.vectorstores.zilliz import Zilliz
|
||||
@ -143,6 +144,7 @@ __all__ = [
|
||||
"Vearch",
|
||||
"Vectara",
|
||||
"VectorStore",
|
||||
"VespaStore",
|
||||
"Weaviate",
|
||||
"ZepVectorStore",
|
||||
"Zilliz",
|
||||
|
267
libs/langchain/langchain/vectorstores/vespa.py
Normal file
@ -0,0 +1,267 @@
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import Any, Dict, Iterable, List, Optional, Tuple, Type, Union
|
||||
|
||||
from langchain.docstore.document import Document
|
||||
from langchain.schema.embeddings import Embeddings
|
||||
from langchain.vectorstores.base import VectorStore, VectorStoreRetriever
|
||||
|
||||
|
||||
class VespaStore(VectorStore):
|
||||
"""
|
||||
`Vespa` vector store.
|
||||
|
||||
To use, you should have the python client library ``pyvespa`` installed.
|
||||
|
||||
Example:
|
||||
.. code-block:: python
|
||||
|
||||
from langchain.vectorstores import VespaStore
|
||||
from langchain.embeddings.openai import OpenAIEmbeddings
|
||||
from vespa.application import Vespa
|
||||
|
||||
# Create a vespa client dependent upon your application,
|
||||
# e.g. either connecting to Vespa Cloud or a local deployment
|
||||
# such as Docker. Please refer to the PyVespa documentation on
|
||||
# how to initialize the client.
|
||||
|
||||
vespa_app = Vespa(url="...", port=..., application_package=...)
|
||||
|
||||
# You need to instruct LangChain on which fields to use for embeddings
|
||||
vespa_config = dict(
|
||||
page_content_field="text",
|
||||
embedding_field="embedding",
|
||||
input_field="query_embedding",
|
||||
metadata_fields=["date", "rating", "author"]
|
||||
)
|
||||
|
||||
embedding_function = OpenAIEmbeddings()
|
||||
vectorstore = VespaStore(vespa_app, embedding_function, **vespa_config)
|
||||
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
app: Any,
|
||||
embedding_function: Optional[Embeddings] = None,
|
||||
page_content_field: Optional[str] = None,
|
||||
embedding_field: Optional[str] = None,
|
||||
input_field: Optional[str] = None,
|
||||
metadata_fields: Optional[List[str]] = None,
|
||||
) -> None:
|
||||
"""
|
||||
Initialize with a PyVespa client.
|
||||
"""
|
||||
try:
|
||||
from vespa.application import Vespa
|
||||
except ImportError:
|
||||
raise ImportError(
|
||||
"Could not import Vespa python package. "
|
||||
"Please install it with `pip install pyvespa`."
|
||||
)
|
||||
if not isinstance(app, Vespa):
|
||||
raise ValueError(
|
||||
f"app should be an instance of vespa.application.Vespa, got {type(app)}"
|
||||
)
|
||||
|
||||
self._vespa_app = app
|
||||
self._embedding_function = embedding_function
|
||||
self._page_content_field = page_content_field
|
||||
self._embedding_field = embedding_field
|
||||
self._input_field = input_field
|
||||
self._metadata_fields = metadata_fields
|
||||
|
||||
def add_texts(
|
||||
self,
|
||||
texts: Iterable[str],
|
||||
metadatas: Optional[List[dict]] = None,
|
||||
ids: Optional[List[str]] = None,
|
||||
**kwargs: Any,
|
||||
) -> List[str]:
|
||||
"""
|
||||
Add texts to the vectorstore.
|
||||
|
||||
Args:
|
||||
texts: Iterable of strings to add to the vectorstore.
|
||||
metadatas: Optional list of metadatas associated with the texts.
|
||||
ids: Optional list of ids associated with the texts.
|
||||
kwargs: vectorstore specific parameters
|
||||
|
||||
Returns:
|
||||
List of ids from adding the texts into the vectorstore.
|
||||
"""
|
||||
|
||||
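# Materialize the iterable once so a generator is not exhausted before the loop below.
texts = list(texts)
embeddings = None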
|
||||
if self._embedding_function is not None:
|
||||
embeddings = self._embedding_function.embed_documents(list(texts))
|
||||
|
||||
if ids is None:
|
||||
ids = [str(i + 1) for i, _ in enumerate(texts)]
|
||||
|
||||
batch = []
|
||||
for i, text in enumerate(texts):
|
||||
fields: Dict[str, Union[str, List[float]]] = {}
|
||||
if self._page_content_field is not None:
|
||||
fields[self._page_content_field] = text
|
||||
if self._embedding_field is not None and embeddings is not None:
|
||||
fields[self._embedding_field] = embeddings[i]
|
||||
if metadatas is not None and self._metadata_fields is not None:
|
||||
for metadata_field in self._metadata_fields:
|
||||
if metadata_field in metadatas[i]:
|
||||
fields[metadata_field] = metadatas[i][metadata_field]
|
||||
batch.append({"id": ids[i], "fields": fields})
|
||||
|
||||
results = self._vespa_app.feed_batch(batch)
|
||||
for result in results:
|
||||
if not (str(result.status_code).startswith("2")):
|
||||
raise RuntimeError(
|
||||
"Could not add document to Vespa. "
|
||||
f"Error code: {result.status_code}. "
|
||||
f"Message: {result.json['message']}"
|
||||
)
|
||||
return ids
|
||||
|
||||
def delete(self, ids: Optional[List[str]] = None, **kwargs: Any) -> Optional[bool]:
|
||||
if ids is None:
|
||||
return False
|
||||
batch = [{"id": doc_id} for doc_id in ids]
|
||||
result = self._vespa_app.delete_batch(batch)
|
||||
return all(r.status_code == 200 for r in result)
|
||||
|
||||
def _create_query(
|
||||
self, query_embedding: List[float], k: int = 4, **kwargs: Any
|
||||
) -> Dict:
|
||||
hits = k
|
||||
doc_embedding_field = self._embedding_field
|
||||
input_embedding_field = self._input_field
|
||||
ranking_function = kwargs.get("ranking", "default")
|
||||
filter = kwargs["filter"] if "filter" in kwargs else None
|
||||
|
||||
approximate = kwargs["approximate"] if "approximate" in kwargs else False
|
||||
approximate = "true" if approximate else "false"
|
||||
|
||||
yql = "select * from sources * where "
|
||||
yql += f"{{targetHits: {hits}, approximate: {approximate}}}"
|
||||
yql += f"nearestNeighbor({doc_embedding_field}, {input_embedding_field})"
|
||||
if filter is not None:
|
||||
yql += f" and {filter}"
|
||||
|
||||
query = {
|
||||
"yql": yql,
|
||||
f"input.query({input_embedding_field})": query_embedding,
|
||||
"ranking": ranking_function,
|
||||
"hits": hits,
|
||||
}
|
||||
return query
|
||||
|
||||
def similarity_search_by_vector_with_score(
|
||||
self, query_embedding: List[float], k: int = 4, **kwargs: Any
|
||||
) -> List[Tuple[Document, float]]:
|
||||
"""
|
||||
Performs a similarity search from an embedding vector.
|
||||
|
||||
Args:
|
||||
query_embedding: Embeddings vector to search for.
|
||||
k: Number of results to return.
|
||||
custom_query: Custom query to use instead of the default query (passed via kwargs).
|
||||
kwargs: other vector store specific parameters
|
||||
|
||||
Returns:
|
||||
List of (Document, relevance score) tuples for the matching hits.
|
||||
"""
|
||||
if "custom_query" in kwargs:
|
||||
query = kwargs["custom_query"]
|
||||
else:
|
||||
query = self._create_query(query_embedding, k, **kwargs)
|
||||
|
||||
try:
|
||||
response = self._vespa_app.query(body=query)
|
||||
except Exception as e:
|
||||
raise RuntimeError(
|
||||
"Could not retrieve data from Vespa: "
|
||||
f"{e.args[0][0]['summary']}. "
|
||||
f"Error: {e.args[0][0]['message']}"
|
||||
)
|
||||
if not str(response.status_code).startswith("2"):
|
||||
raise RuntimeError(
|
||||
"Could not retrieve data from Vespa. "
|
||||
f"Error code: {response.status_code}. "
|
||||
f"Message: {response.json['message']}"
|
||||
)
|
||||
|
||||
root = response.json["root"]
|
||||
if "errors" in root:
|
||||
import json
|
||||
|
||||
raise RuntimeError(json.dumps(root["errors"]))
|
||||
|
||||
if response is None or response.hits is None:
|
||||
return []
|
||||
|
||||
docs = []
|
||||
for child in response.hits:
|
||||
page_content = child["fields"][self._page_content_field]
|
||||
score = child["relevance"]
|
||||
metadata = {"id": child["id"]}
|
||||
if self._metadata_fields is not None:
|
||||
for field in self._metadata_fields:
|
||||
metadata[field] = child["fields"].get(field)
|
||||
doc = Document(page_content=page_content, metadata=metadata)
|
||||
docs.append((doc, score))
|
||||
return docs
|
||||
|
||||
def similarity_search_by_vector(
|
||||
self, embedding: List[float], k: int = 4, **kwargs: Any
|
||||
) -> List[Document]:
|
||||
results = self.similarity_search_by_vector_with_score(embedding, k, **kwargs)
|
||||
return [r[0] for r in results]
|
||||
|
||||
def similarity_search_with_score(
|
||||
self, query: str, k: int = 4, **kwargs: Any
|
||||
) -> List[Tuple[Document, float]]:
|
||||
query_emb = []
|
||||
if self._embedding_function is not None:
|
||||
query_emb = self._embedding_function.embed_query(query)
|
||||
return self.similarity_search_by_vector_with_score(query_emb, k, **kwargs)
|
||||
|
||||
def similarity_search(
|
||||
self, query: str, k: int = 4, **kwargs: Any
|
||||
) -> List[Document]:
|
||||
results = self.similarity_search_with_score(query, k, **kwargs)
|
||||
return [r[0] for r in results]
|
||||
|
||||
def max_marginal_relevance_search(
|
||||
self,
|
||||
query: str,
|
||||
k: int = 4,
|
||||
fetch_k: int = 20,
|
||||
lambda_mult: float = 0.5,
|
||||
**kwargs: Any,
|
||||
) -> List[Document]:
|
||||
raise NotImplementedError("MMR search not implemented")
|
||||
|
||||
def max_marginal_relevance_search_by_vector(
|
||||
self,
|
||||
embedding: List[float],
|
||||
k: int = 4,
|
||||
fetch_k: int = 20,
|
||||
lambda_mult: float = 0.5,
|
||||
**kwargs: Any,
|
||||
) -> List[Document]:
|
||||
raise NotImplementedError("MMR search by vector not implemented")
|
||||
|
||||
@classmethod
|
||||
def from_texts(
|
||||
cls: Type[VespaStore],
|
||||
texts: List[str],
|
||||
embedding: Embeddings,
|
||||
metadatas: Optional[List[dict]] = None,
|
||||
ids: Optional[List[str]] = None,
|
||||
**kwargs: Any,
|
||||
) -> VespaStore:
|
||||
vespa = cls(embedding_function=embedding, **kwargs)
|
||||
vespa.add_texts(texts=texts, metadatas=metadatas, ids=ids)
|
||||
return vespa
|
||||
|
||||
def as_retriever(self, **kwargs: Any) -> VectorStoreRetriever:
|
||||
return super().as_retriever(**kwargs)
|
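The query body that `_create_query` assembles can be sketched as a standalone function, which makes the resulting YQL easy to inspect without a running Vespa instance. This is a minimal sketch that mirrors the logic in the file above; the function name `build_nearest_neighbor_query` and the sample field names and filter are hypothetical and not part of the commit.

```python
from typing import Any, Dict, List, Optional


def build_nearest_neighbor_query(
    query_embedding: List[float],
    embedding_field: str,
    input_field: str,
    k: int = 4,
    ranking: str = "default",
    approximate: bool = False,
    filter: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper mirroring VespaStore._create_query."""
    # YQL expects lowercase boolean literals inside the annotation block.
    approximate_flag = "true" if approximate else "false"

    yql = "select * from sources * where "
    yql += f"{{targetHits: {k}, approximate: {approximate_flag}}}"
    yql += f"nearestNeighbor({embedding_field}, {input_field})"
    if filter is not None:
        # The filter is appended verbatim as an additional YQL condition.
        yql += f" and {filter}"

    return {
        "yql": yql,
        f"input.query({input_field})": query_embedding,
        "ranking": ranking,
        "hits": k,
    }


# Assumed field names, matching the class docstring example.
query = build_nearest_neighbor_query(
    [0.1, 0.2], "embedding", "query_embedding", k=2, filter="rating > 3"
)
print(query["yql"])
```

Because the filter string is concatenated into the YQL as-is, callers are responsible for passing a well-formed condition.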