mirror of
https://github.com/hwchase17/langchain.git
synced 2026-04-03 19:04:23 +00:00
Use docusaurus versioning with a callout, merged master as well @hwchase17 @baskaryan --------- Signed-off-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Rahul Tripathi <rauhl.psit.ec@gmail.com> Co-authored-by: Leonid Ganeline <leo.gan.57@gmail.com> Co-authored-by: Leonid Kuligin <lkuligin@yandex.ru> Co-authored-by: Averi Kitsch <akitsch@google.com> Co-authored-by: Erick Friis <erick@langchain.dev> Co-authored-by: Nuno Campos <nuno@langchain.dev> Co-authored-by: Nuno Campos <nuno@boringbits.io> Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com> Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com> Co-authored-by: Martín Gotelli Ferenaz <martingotelliferenaz@gmail.com> Co-authored-by: Fayfox <admin@fayfox.com> Co-authored-by: Eugene Yurtsev <eugene@langchain.dev> Co-authored-by: Dawson Bauer <105886620+djbauer2@users.noreply.github.com> Co-authored-by: Ravindu Somawansa <ravindu.somawansa@gmail.com> Co-authored-by: Dhruv Chawla <43818888+Dominastorm@users.noreply.github.com> Co-authored-by: ccurme <chester.curme@gmail.com> Co-authored-by: Bagatur <baskaryan@gmail.com> Co-authored-by: WeichenXu <weichen.xu@databricks.com> Co-authored-by: Benito Geordie <89472452+benitoThree@users.noreply.github.com> Co-authored-by: kartikTAI <129414343+kartikTAI@users.noreply.github.com> Co-authored-by: Kartik Sarangmath <kartik@thirdai.com> Co-authored-by: Sevin F. 
Varoglu <sfvaroglu@octoml.ai> Co-authored-by: MacanPN <martin.triska@gmail.com> Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com> Co-authored-by: Hyeongchan Kim <kozistr@gmail.com> Co-authored-by: sdan <git@sdan.io> Co-authored-by: Guangdong Liu <liugddx@gmail.com> Co-authored-by: Rahul Triptahi <rahul.psit.ec@gmail.com> Co-authored-by: Rahul Tripathi <rauhl.psit.ec@gmail.com> Co-authored-by: pjb157 <84070455+pjb157@users.noreply.github.com> Co-authored-by: Eun Hye Kim <ehkim1440@gmail.com> Co-authored-by: kaijietti <43436010+kaijietti@users.noreply.github.com> Co-authored-by: Pengcheng Liu <pcliu.fd@gmail.com> Co-authored-by: Tomer Cagan <tomer@tomercagan.com> Co-authored-by: Christophe Bornet <cbornet@hotmail.com>
946 lines
26 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "ce0f17b9",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Vespa\n",
|
|
"\n",
|
|
">[Vespa](https://vespa.ai/) is a fully featured search engine and vector database. It supports vector search (ANN), lexical search, and search in structured data, all in the same query.\n",
|
|
"\n",
|
|
"This notebook shows how to use `Vespa.ai` as a LangChain vector store.\n",
|
|
"\n",
"In order to create the vector store, we use\n",
"[pyvespa](https://pyvespa.readthedocs.io/en/latest/index.html) to create a\n",
"connection to a `Vespa` service."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "7e6a11ab-38bd-4920-ba11-60cb2f075754",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"%pip install --upgrade --quiet pyvespa"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "283b49c9",
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"source": [
|
|
"Using the `pyvespa` package, you can either connect to a\n",
|
|
"[Vespa Cloud instance](https://pyvespa.readthedocs.io/en/latest/deploy-vespa-cloud.html)\n",
|
|
"or a local\n",
|
|
"[Docker instance](https://pyvespa.readthedocs.io/en/latest/deploy-docker.html).\n",
|
|
"Here, we will create a new Vespa application and deploy that using Docker.\n",
|
|
"\n",
|
|
"#### Creating a Vespa application\n",
|
|
"\n",
|
|
"First, we need to create an application package:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "91150665",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from vespa.package import ApplicationPackage, Field, RankProfile\n",
|
|
"\n",
|
|
"app_package = ApplicationPackage(name=\"testapp\")\n",
|
|
"app_package.schema.add_fields(\n",
"    Field(\n",
"        name=\"text\", type=\"string\", indexing=[\"index\", \"summary\"], index=\"enable-bm25\"\n",
"    ),\n",
"    Field(\n",
"        name=\"embedding\",\n",
"        type=\"tensor<float>(x[384])\",\n",
"        indexing=[\"attribute\", \"summary\"],\n",
"        attribute=[\"distance-metric: angular\"],\n",
"    ),\n",
")\n",
"app_package.schema.add_rank_profile(\n",
"    RankProfile(\n",
"        name=\"default\",\n",
"        first_phase=\"closeness(field, embedding)\",\n",
"        inputs=[(\"query(query_embedding)\", \"tensor<float>(x[384])\")],\n",
"    )\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "15477106",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
},
|
|
"source": [
|
|
"This sets up a Vespa application with a schema for each document that contains\n",
|
|
"two fields: `text` for holding the document text and `embedding` for holding\n",
|
|
"the embedding vector. The `text` field is set up to use a BM25 index for\n",
|
|
"efficient text retrieval, and we'll see how to use this and hybrid search a\n",
|
|
"bit later.\n",
|
|
"\n",
|
|
"The `embedding` field is set up with a vector of length 384 to hold the\n",
|
|
"embedding representation of the text. See\n",
|
|
"[Vespa's Tensor Guide](https://docs.vespa.ai/en/tensor-user-guide.html)\n",
|
|
"for more on tensors in Vespa.\n",
|
|
"\n",
|
|
"Lastly, we add a [rank profile](https://docs.vespa.ai/en/ranking.html) to\n",
|
|
"instruct Vespa how to order documents. Here we set this up with a\n",
|
|
"[nearest neighbor search](https://docs.vespa.ai/en/nearest-neighbor-search.html).\n",
|
|
"\n",
|
|
"Now we can deploy this application locally:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"id": "c10dd962",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from vespa.deployment import VespaDocker\n",
|
|
"\n",
|
|
"vespa_docker = VespaDocker()\n",
|
|
"vespa_app = vespa_docker.deploy(application_package=app_package)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "3df4ce53",
|
|
"metadata": {},
|
|
"source": [
"This deploys and creates a connection to a `Vespa` service. In case you\n",
"already have a Vespa application running, for instance in the cloud,\n",
"please refer to the pyvespa documentation for how to connect.\n",
|
|
"\n",
|
|
"#### Creating a Vespa vector store\n",
|
|
"\n",
|
|
"Now, let's load some documents:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "7abde491",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from langchain_community.document_loaders import TextLoader\n",
|
|
"from langchain_text_splitters import CharacterTextSplitter\n",
|
|
"\n",
|
|
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n",
|
|
"documents = loader.load()\n",
|
|
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
|
|
"docs = text_splitter.split_documents(documents)\n",
|
|
"\n",
|
|
"from langchain_community.embeddings.sentence_transformer import (\n",
"    SentenceTransformerEmbeddings,\n",
|
|
")\n",
|
|
"\n",
|
|
"embedding_function = SentenceTransformerEmbeddings(model_name=\"all-MiniLM-L6-v2\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "d42365c7",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
},
|
|
"source": [
"Here, we also set up a local sentence embedder to transform the text into embedding\n",
|
|
"vectors. One could also use OpenAI embeddings, but the vector length needs to\n",
|
|
"be updated to `1536` to reflect the larger size of that embedding.\n",
|
|
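"\n",
"The dimension in the schema has to match the length of the vectors that the\n",
"embedder produces. As a minimal sketch, the hypothetical helpers below (they are not\n",
"part of pyvespa or LangChain) parse a tensor type such as `tensor<float>(x[384])`\n",
"and check a vector against it before feeding:\n",
"\n",
"```python\n",
"import re\n",
"\n",
"\n",
"def tensor_dimension(tensor_type: str) -> int:\n",
"    \"\"\"Return the size of a single indexed dimension, e.g. 384.\"\"\"\n",
"    match = re.fullmatch(r\"tensor<\\w+>\\(\\w+\\[(\\d+)\\]\\)\", tensor_type)\n",
"    if match is None:\n",
"        raise ValueError(f\"Unsupported tensor type: {tensor_type}\")\n",
"    return int(match.group(1))\n",
"\n",
"\n",
"def check_embedding(vector: list, tensor_type: str) -> None:\n",
"    \"\"\"Raise ValueError if the vector length differs from the schema dimension.\"\"\"\n",
"    expected = tensor_dimension(tensor_type)\n",
"    if len(vector) != expected:\n",
"        raise ValueError(f\"Expected {expected} values, got {len(vector)}\")\n",
"\n",
"\n",
"# A vector of the right length passes silently.\n",
"check_embedding([0.0] * 384, \"tensor<float>(x[384])\")\n",
"```\n",
"\n",
"With `all-MiniLM-L6-v2` the expected dimension is 384. Switching to an embedder with\n",
"1536-dimensional output means changing both the `embedding` field type and the\n",
"`query(query_embedding)` input of the rank profile.\n",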
"\n",
|
|
"To feed these to Vespa, we need to configure how the vector store should map to\n",
|
|
"fields in the Vespa application. Then we create the vector store directly from\n",
|
|
"this set of documents:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "0b647878",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"vespa_config = dict(\n",
"    page_content_field=\"text\",\n",
"    embedding_field=\"embedding\",\n",
"    input_field=\"query_embedding\",\n",
|
|
")\n",
|
|
"\n",
|
|
"from langchain_community.vectorstores import VespaStore\n",
|
|
"\n",
|
|
"db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "d6bd0aab",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
},
|
|
"source": [
|
|
"This creates a Vespa vector store and feeds that set of documents to Vespa.\n",
"The vector store takes care of calling the embedding function for each document\n",
"and inserting the documents into the database.\n",
|
|
"\n",
|
|
"We can now query the vector store:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "7ccca1f4",
|
|
"metadata": {
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
|
"results = db.similarity_search(query)\n",
|
|
"\n",
|
|
"print(results[0].page_content)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "1e7e34e1",
|
|
"metadata": {
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
},
|
|
"source": [
|
|
"This will use the embedding function given above to create a representation\n",
|
|
"for the query and use that to search Vespa. Note that this will use the\n",
|
|
"`default` ranking function, which we set up in the application package\n",
|
|
"above. You can use the `ranking` argument to `similarity_search` to\n",
|
|
"specify which ranking function to use.\n",
|
|
"\n",
|
|
"Please refer to the [pyvespa documentation](https://pyvespa.readthedocs.io/en/latest/getting-started-pyvespa.html#Query)\n",
|
|
"for more information.\n",
|
|
"\n",
|
|
"This covers the basic usage of the Vespa store in LangChain.\n",
|
|
"Now you can return the results and continue using these in LangChain.\n",
|
|
"\n",
|
|
"#### Updating documents\n",
|
|
"\n",
"As an alternative to calling `from_documents`, you can create the vector\n",
"store directly and call `add_texts` on it. This can also be used to update\n",
"documents:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "a5256284",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
|
"results = db.similarity_search(query)\n",
|
|
"result = results[0]\n",
|
|
"\n",
|
|
"result.page_content = \"UPDATED: \" + result.page_content\n",
|
|
"db.add_texts([result.page_content], [result.metadata], result.metadata[\"id\"])\n",
|
|
"\n",
|
|
"results = db.similarity_search(query)\n",
|
|
"print(results[0].page_content)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "2526b50e",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
},
|
|
"source": [
"The `pyvespa` library also contains methods to manipulate\n",
"content on Vespa, which you can use directly.\n",
|
|
"\n",
|
|
"#### Deleting documents\n",
|
|
"\n",
|
|
"You can delete documents using the `delete` function:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "52cab87e",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"result = db.similarity_search(query)\n",
"# result[0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n",
"\n",
"db.delete([\"32\"])\n",
"result = db.similarity_search(query)\n",
"# result[0].metadata[\"id\"] != \"id:testapp:testapp::32\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "deffaba5",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
},
|
|
"source": [
|
|
"Again, the `pyvespa` connection contains methods to delete documents as well.\n",
|
|
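"\n",
"Note that `delete` takes the short id `32`, while the `id` metadata holds the full\n",
"Vespa document id `id:testapp:testapp::32`. Vespa document ids have the form\n",
"`id:<namespace>:<document-type>:<key/value-pairs>:<user-specified>`. A small sketch\n",
"for going from one to the other (the helper name `user_specified_id` is an\n",
"assumption for illustration, not part of pyvespa or LangChain):\n",
"\n",
"```python\n",
"def user_specified_id(document_id: str) -> str:\n",
"    \"\"\"Return the user-specified part of a Vespa document id.\"\"\"\n",
"    # Split into at most five parts so that colons in the last part are kept.\n",
"    parts = document_id.split(\":\", 4)\n",
"    if len(parts) != 5 or parts[0] != \"id\":\n",
"        raise ValueError(f\"Not a Vespa document id: {document_id}\")\n",
"    return parts[4]\n",
"\n",
"\n",
"print(user_specified_id(\"id:testapp:testapp::32\"))  # prints 32\n",
"```\n",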
"\n",
|
|
"### Returning with scores\n",
|
|
"\n",
|
|
"The `similarity_search` method only returns the documents in order of\n",
|
|
"relevancy. To retrieve the actual scores:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "cd9ae173",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"results = db.similarity_search_with_score(query)\n",
|
|
"result = results[0]\n",
|
|
"# result[1] ~= 0.463"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "7257d67a",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
},
|
|
"source": [
"This score is the result of using the `\"all-MiniLM-L6-v2\"` embedding model with the\n",
"cosine (angular) distance function, as specified by the `distance-metric: angular`\n",
"attribute setting in the application package.\n",
"\n",
"Different embedding functions need different distance functions, and Vespa\n",
"needs to know which distance function to use when ordering documents.\n",
|
|
"Please refer to the\n",
|
|
"[documentation on distance functions](https://docs.vespa.ai/en/reference/schema-reference.html#distance-metric)\n",
|
|
"for more information.\n",
|
|
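"\n",
"To make the score more concrete: with the `angular` distance metric, the distance is\n",
"the angle between the two vectors in radians, and the `closeness` rank feature is\n",
"`1 / (1 + distance)`. The following plain-Python sketch is independent of Vespa, and\n",
"the function name is illustrative only:\n",
"\n",
"```python\n",
"import math\n",
"\n",
"\n",
"def angular_closeness(a, b):\n",
"    \"\"\"closeness = 1 / (1 + angle between a and b in radians).\"\"\"\n",
"    dot = sum(x * y for x, y in zip(a, b))\n",
"    norm_a = math.sqrt(sum(x * x for x in a))\n",
"    norm_b = math.sqrt(sum(y * y for y in b))\n",
"    # Clamp to [-1, 1] to guard against floating point rounding.\n",
"    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))\n",
"    return 1.0 / (1.0 + math.acos(cosine))\n",
"\n",
"\n",
"print(angular_closeness([1.0, 0.0], [1.0, 0.0]))  # same direction: 1.0\n",
"print(angular_closeness([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 1 / (1 + pi / 2)\n",
"```\n",
"\n",
"Under this definition, a score of about 0.463 corresponds to an angle of roughly\n",
"1.16 radians between the query and document embeddings.\n",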
"\n",
|
|
"### As retriever\n",
|
|
"\n",
|
|
"To use this vector store as a\n",
|
|
"[LangChain retriever](/docs/modules/data_connection/retrievers/)\n",
|
|
"simply call the `as_retriever` function, which is a standard vector store\n",
|
|
"method:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "7fb717a9",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)\n",
|
|
"retriever = db.as_retriever()\n",
|
|
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
|
"results = retriever.get_relevant_documents(query)\n",
|
|
"\n",
|
|
"# results[0].metadata[\"id\"] == \"id:testapp:testapp::32\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "fba7f07e",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
},
|
|
"source": [
|
|
"This allows for more general, unstructured, retrieval from the vector store.\n",
|
|
"\n",
|
|
"### Metadata\n",
|
|
"\n",
"In the examples so far, we've only used the text and the embedding for that\n",
"text. Documents usually contain additional information, which in LangChain\n",
"is referred to as metadata.\n",
"\n",
"Vespa documents can contain many fields of different types; you add them to the application\n",
"package:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "59cffcf2",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"app_package.schema.add_fields(\n",
"    # ...\n",
"    Field(name=\"date\", type=\"string\", indexing=[\"attribute\", \"summary\"]),\n",
"    Field(name=\"rating\", type=\"int\", indexing=[\"attribute\", \"summary\"]),\n",
"    Field(name=\"author\", type=\"string\", indexing=[\"attribute\", \"summary\"]),\n",
"    # ...\n",
|
|
")\n",
|
|
"vespa_app = vespa_docker.deploy(application_package=app_package)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "eebef70c",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
},
|
|
"source": [
|
|
"We can add some metadata fields in the documents:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "b21efbfa",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Add metadata\n",
|
|
"for i, doc in enumerate(docs):\n",
"    doc.metadata[\"date\"] = f\"2023-{(i % 12)+1}-{(i % 28)+1}\"\n",
"    doc.metadata[\"rating\"] = range(1, 6)[i % 5]\n",
"    doc.metadata[\"author\"] = [\"Joe Biden\", \"Unknown\"][min(i, 1)]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "9b42bd4d",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
},
|
|
"source": [
|
|
"And let the Vespa vector store know about these fields:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "6bb272f6",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"vespa_config.update(dict(metadata_fields=[\"date\", \"rating\", \"author\"]))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "43818655",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
},
|
|
"source": [
|
|
"Now, when searching for these documents, these fields will be returned.\n",
|
|
"Also, these fields can be filtered on:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "831759f3",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)\n",
|
|
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
|
"results = db.similarity_search(query, filter=\"rating > 3\")\n",
|
|
"# results[0].metadata[\"id\"] == \"id:testapp:testapp::34\"\n",
|
|
"# results[0].metadata[\"author\"] == \"Unknown\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "a49aad6e",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Custom query\n",
|
|
"\n",
"If the default behavior of the similarity search does not fit your\n",
"requirements, you can always provide your own query. In that case, you don't\n",
"need to provide all of the configuration to the vector store;\n",
"you write the query yourself instead.\n",
|
|
"\n",
|
|
"First, let's add a BM25 ranking function to our application:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "d0fb0562",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from vespa.package import FieldSet\n",
|
|
"\n",
|
|
"app_package.schema.add_field_set(FieldSet(name=\"default\", fields=[\"text\"]))\n",
|
|
"app_package.schema.add_rank_profile(RankProfile(name=\"bm25\", first_phase=\"bm25(text)\"))\n",
|
|
"vespa_app = vespa_docker.deploy(application_package=app_package)\n",
|
|
"db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "fe607747",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
},
|
|
"source": [
|
|
"Then, to perform a regular text search based on BM25:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "cee245c3",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
|
"custom_query = {\n",
"    \"yql\": \"select * from sources * where userQuery()\",\n",
"    \"query\": query,\n",
"    \"type\": \"weakAnd\",\n",
"    \"ranking\": \"bm25\",\n",
"    \"hits\": 4,\n",
|
|
"}\n",
|
|
"results = db.similarity_search_with_score(query, custom_query=custom_query)\n",
|
|
"# results[0][0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n",
|
|
"# results[0][1] ~= 14.384"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "41a4c081",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
},
|
|
"source": [
|
|
"All of the powerful search and query capabilities of Vespa can be used\n",
"by using a custom query. Please refer to the Vespa documentation on its\n",
|
|
"[Query API](https://docs.vespa.ai/en/query-api.html) for more details.\n",
|
|
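"\n",
"Because a custom query is simply a dictionary of Query API parameters, it can be\n",
"handy to wrap its construction in a function. The sketch below mirrors the BM25\n",
"query shown earlier; the helper name `build_text_query` and its defaults are\n",
"illustrative assumptions, not part of LangChain or pyvespa:\n",
"\n",
"```python\n",
"def build_text_query(query: str, ranking: str = \"bm25\", hits: int = 4) -> dict:\n",
"    \"\"\"Build a Vespa text query that is ranked with the given rank profile.\"\"\"\n",
"    return {\n",
"        \"yql\": \"select * from sources * where userQuery()\",\n",
"        \"query\": query,\n",
"        \"type\": \"weakAnd\",\n",
"        \"ranking\": ranking,\n",
"        \"hits\": hits,\n",
"    }\n",
"\n",
"\n",
"custom_query = build_text_query(\"What did the president say about Ketanji Brown Jackson\")\n",
"print(custom_query[\"ranking\"], custom_query[\"hits\"])\n",
"```\n",
"\n",
"The resulting dictionary can be passed as `custom_query` to\n",
"`similarity_search_with_score`, in the same way as the hand-written query.\n",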
"\n",
|
|
"### Hybrid search\n",
|
|
"\n",
|
|
"Hybrid search means using both a classic term-based search such as\n",
|
|
"BM25 and a vector search and combining the results. We need to create\n",
|
|
"a new rank profile for hybrid search on Vespa:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "bf73efc1",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"app_package.schema.add_rank_profile(\n",
"    RankProfile(\n",
"        name=\"hybrid\",\n",
"        first_phase=\"log(bm25(text)) + 0.5 * closeness(field, embedding)\",\n",
"        inputs=[(\"query(query_embedding)\", \"tensor<float>(x[384])\")],\n",
"    )\n",
|
|
")\n",
|
|
"vespa_app = vespa_docker.deploy(application_package=app_package)\n",
|
|
"db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "40f48711",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
},
|
|
"source": [
"Here, we score each document as a combination of its BM25 score and its\n",
"vector closeness score. We can query using a custom query:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "d2e289f0",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
|
"query_embedding = embedding_function.embed_query(query)\n",
|
|
"nearest_neighbor_expression = (\n",
"    \"{targetHits: 4}nearestNeighbor(embedding, query_embedding)\"\n",
")\n",
"custom_query = {\n",
"    \"yql\": f\"select * from sources * where {nearest_neighbor_expression} and userQuery()\",\n",
"    \"query\": query,\n",
"    \"type\": \"weakAnd\",\n",
"    \"input.query(query_embedding)\": query_embedding,\n",
"    \"ranking\": \"hybrid\",\n",
"    \"hits\": 4,\n",
"}\n",
"results = db.similarity_search_with_score(query, custom_query=custom_query)\n",
"# results[0][0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n",
"# results[0][1] ~= 2.897"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "958e269f",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
},
|
|
"source": [
|
|
"### Native embedders in Vespa\n",
|
|
"\n",
"Up until this point we've used an embedding function in Python to provide\n",
"embeddings for the texts. Vespa supports embedding functions natively, so\n",
"you can defer this calculation to Vespa. One benefit is the ability to use\n",
"GPUs when embedding documents if you have a large collection.\n",
|
|
"\n",
|
|
"Please refer to [Vespa embeddings](https://docs.vespa.ai/en/embedding.html)\n",
|
|
"for more information.\n",
|
|
"\n",
|
|
"First, we need to modify our application package:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "56b9686c",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
"from vespa.package import Component, Parameter\n",
"\n",
"app_package.components = [\n",
"    Component(\n",
"        id=\"hf-embedder\",\n",
"        type=\"hugging-face-embedder\",\n",
"        parameters=[\n",
"            Parameter(\"transformer-model\", {\"path\": \"...\"}),\n",
"            Parameter(\"tokenizer-model\", {\"url\": \"...\"}),\n",
"        ],\n",
"    )\n",
"]\n",
"# Add the field to the schema; a bare Field(...) expression would be discarded.\n",
"app_package.schema.add_fields(\n",
"    Field(\n",
"        name=\"hfembedding\",\n",
"        type=\"tensor<float>(x[384])\",\n",
"        is_document_field=False,\n",
"        indexing=[\"input text\", \"embed hf-embedder\", \"attribute\", \"summary\"],\n",
"        attribute=[\"distance-metric: angular\"],\n",
"    )\n",
")\n",
"app_package.schema.add_rank_profile(\n",
"    RankProfile(\n",
"        name=\"hf_similarity\",\n",
"        first_phase=\"closeness(field, hfembedding)\",\n",
"        inputs=[(\"query(query_embedding)\", \"tensor<float>(x[384])\")],\n",
"    )\n",
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "5cd721a8",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
},
|
|
"source": [
|
|
"Please refer to the embeddings documentation on adding embedder models\n",
|
|
"and tokenizers to the application. Note that the `hfembedding` field\n",
|
|
"includes instructions for embedding using the `hf-embedder`.\n",
|
|
"\n",
|
|
"Now we can query with a custom query:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "da631d13",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
|
"nearest_neighbor_expression = (\n",
"    \"{targetHits: 4}nearestNeighbor(hfembedding, query_embedding)\"\n",
")\n",
"custom_query = {\n",
"    \"yql\": f\"select * from sources * where {nearest_neighbor_expression}\",\n",
"    \"input.query(query_embedding)\": f'embed(hf-embedder, \"{query}\")',\n",
"    \"ranking\": \"hf_similarity\",\n",
"    \"hits\": 4,\n",
"}\n",
"results = db.similarity_search_with_score(query, custom_query=custom_query)\n",
"# results[0][0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n",
"# results[0][1] ~= 0.630"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "a333b553",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
},
|
|
"source": [
|
|
"Note that the query here includes an `embed` instruction to embed the query\n",
|
|
"using the same model as for the documents.\n",
|
|
"\n",
|
|
"### Approximate nearest neighbor\n",
|
|
"\n",
|
|
"In all of the above examples, we've used exact nearest neighbor to\n",
|
|
"find results. However, for large collections of documents this is\n",
|
|
"not feasible as one has to scan through all documents to find the\n",
|
|
"best matches. To avoid this, we can use\n",
|
|
"[approximate nearest neighbors](https://docs.vespa.ai/en/approximate-nn-hnsw.html).\n",
|
|
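"\n",
"To illustrate why exact search scales poorly, here is a plain-Python sketch of a\n",
"brute-force nearest neighbor scan. It is not how Vespa is implemented; the function\n",
"name and the toy data are assumptions for illustration. Every query has to compute\n",
"a distance to every stored vector:\n",
"\n",
"```python\n",
"import heapq\n",
"import math\n",
"\n",
"\n",
"def exact_nearest_neighbors(query, vectors, k=4):\n",
"    \"\"\"Return the indices of the k vectors with the smallest angle to the query.\"\"\"\n",
"\n",
"    def angle(a, b):\n",
"        dot = sum(x * y for x, y in zip(a, b))\n",
"        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))\n",
"        return math.acos(max(-1.0, min(1.0, dot / norms)))\n",
"\n",
"    # One distance computation per stored vector: O(N) work for each query.\n",
"    scored = ((angle(query, vector), index) for index, vector in enumerate(vectors))\n",
"    return [index for _, index in heapq.nsmallest(k, scored)]\n",
"\n",
"\n",
"vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]\n",
"print(exact_nearest_neighbors([1.0, 0.1], vectors, k=2))  # prints [0, 2]\n",
"```\n",
"\n",
"An HNSW index avoids the full scan by navigating a graph of neighboring vectors,\n",
"trading a small amount of accuracy for a much lower query cost.\n",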
"\n",
"First, we can change the embedding field to create an HNSW index:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "9ee955c8",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from vespa.package import HNSW\n",
|
|
"\n",
|
|
"app_package.schema.add_fields(\n",
"    Field(\n",
"        name=\"embedding\",\n",
"        type=\"tensor<float>(x[384])\",\n",
"        indexing=[\"attribute\", \"summary\", \"index\"],\n",
"        ann=HNSW(\n",
"            distance_metric=\"angular\",\n",
"            max_links_per_node=16,\n",
"            neighbors_to_explore_at_insert=200,\n",
"        ),\n",
"    )\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "2ed1c224",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
},
|
|
"source": [
"This creates an HNSW index on the embedding data which allows for efficient\n",
|
|
"searching. With this set, we can easily search using ANN by setting\n",
|
|
"the `approximate` argument to `True`:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "7981739a",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
|
"results = db.similarity_search(query, approximate=True)\n",
"# results[0].metadata[\"id\"] == \"id:testapp:testapp::32\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "24791204",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%% md\n"
|
|
}
|
|
},
|
|
"source": [
|
|
"This covers most of the functionality in the Vespa vector store in LangChain.\n",
|
|
"\n"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.10.6"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}