diff --git a/docs/extras/integrations/vectorstores/vespa.ipynb b/docs/extras/integrations/vectorstores/vespa.ipynb index c7500944093..62e2bd7679a 100644 --- a/docs/extras/integrations/vectorstores/vespa.ipynb +++ b/docs/extras/integrations/vectorstores/vespa.ipynb @@ -1,883 +1,922 @@ { - "cells": [ - { - "cell_type": "markdown", - "id": "ce0f17b9", - "metadata": {}, - "source": [ - "# Vespa\n", - "\n", - ">[Vespa](https://vespa.ai/) is a fully featured search engine and vector database. It supports vector search (ANN), lexical search, and search in structured data, all in the same query.\n", - "\n", - "This notebook shows how to use `Vespa.ai` as a LangChain vector store.\n", - "\n", - "In order to create the vector store, we use\n", - "[pyvespa](https://pyvespa.readthedocs.io/en/latest/index.html) to create a\n", - "connection a `Vespa` service." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7e6a11ab-38bd-4920-ba11-60cb2f075754", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "#!pip install pyvespa" - ] - }, - { - "cell_type": "markdown", - "source": [ - "Using the `pyvespa` package, you can either connect to a\n", - "[Vespa Cloud instance](https://pyvespa.readthedocs.io/en/latest/deploy-vespa-cloud.html)\n", - "or a local\n", - "[Docker instance](https://pyvespa.readthedocs.io/en/latest/deploy-docker.html).\n", - "Here, we will create a new Vespa application and deploy that using Docker.\n", - "\n", - "#### Creating a Vespa application\n", - "\n", - "First, we need to create an application package:" - ], - "metadata": { - "collapsed": false - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "from vespa.package import ApplicationPackage, Field, RankProfile\n", - "\n", - "app_package = ApplicationPackage(name=\"testapp\")\n", - "app_package.schema.add_fields(\n", - " Field(name=\"text\", type=\"string\", indexing=[\"index\", \"summary\"], index=\"enable-bm25\"),\n", - " 
Field(name=\"embedding\", type=\"tensor(x[384])\",\n", - " indexing=[\"attribute\", \"summary\"],\n", - " attribute=[f\"distance-metric: angular\"]),\n", - ")\n", - "app_package.schema.add_rank_profile(\n", - " RankProfile(name=\"default\",\n", - " first_phase=\"closeness(field, embedding)\",\n", - " inputs=[(\"query(query_embedding)\", \"tensor(x[384])\")]\n", - " )\n", - ")" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" + "cells": [ + { + "cell_type": "markdown", + "id": "ce0f17b9", + "metadata": {}, + "source": [ + "# Vespa\n", + "\n", + ">[Vespa](https://vespa.ai/) is a fully featured search engine and vector database. It supports vector search (ANN), lexical search, and search in structured data, all in the same query.\n", + "\n", + "This notebook shows how to use `Vespa.ai` as a LangChain vector store.\n", + "\n", + "In order to create the vector store, we use\n", + "[pyvespa](https://pyvespa.readthedocs.io/en/latest/index.html) to create a\n", + "connection to a `Vespa` service."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7e6a11ab-38bd-4920-ba11-60cb2f075754", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "#!pip install pyvespa" + ] + }, + { + "cell_type": "markdown", + "source": [ + "Using the `pyvespa` package, you can either connect to a\n", + "[Vespa Cloud instance](https://pyvespa.readthedocs.io/en/latest/deploy-vespa-cloud.html)\n", + "or a local\n", + "[Docker instance](https://pyvespa.readthedocs.io/en/latest/deploy-docker.html).\n", + "Here, we will create a new Vespa application and deploy that using Docker.\n", + "\n", + "#### Creating a Vespa application\n", + "\n", + "First, we need to create an application package:" + ], + "metadata": { + "collapsed": false + }, + "id": "283b49c9" + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "from vespa.package import ApplicationPackage, Field, RankProfile\n", + "\n", + "app_package = ApplicationPackage(name=\"testapp\")\n", + "app_package.schema.add_fields(\n", + " Field(name=\"text\", type=\"string\", indexing=[\"index\", \"summary\"], index=\"enable-bm25\"),\n", + " Field(name=\"embedding\", type=\"tensor(x[384])\",\n", + " indexing=[\"attribute\", \"summary\"],\n", + " attribute=[f\"distance-metric: angular\"]),\n", + ")\n", + "app_package.schema.add_rank_profile(\n", + " RankProfile(name=\"default\",\n", + " first_phase=\"closeness(field, embedding)\",\n", + " inputs=[(\"query(query_embedding)\", \"tensor(x[384])\")]\n", + " )\n", + ")" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + }, + "id": "91150665" + }, + { + "cell_type": "markdown", + "source": [ + "This sets up a Vespa application with a schema for each document that contains\n", + "two fields: `text` for holding the document text and `embedding` for holding\n", + "the embedding vector. 
The `text` field is set up to use a BM25 index for\n", + "efficient text retrieval, and we'll see how to use this and hybrid search a\n", + "bit later.\n", + "\n", + "The `embedding` field is set up with a vector of length 384 to hold the\n", + "embedding representation of the text. See\n", + "[Vespa's Tensor Guide](https://docs.vespa.ai/en/tensor-user-guide.html)\n", + "for more on tensors in Vespa.\n", + "\n", + "Lastly, we add a [rank profile](https://docs.vespa.ai/en/ranking.html) to\n", + "instruct Vespa how to order documents. Here we set this up with a\n", + "[nearest neighbor search](https://docs.vespa.ai/en/nearest-neighbor-search.html).\n", + "\n", + "Now we can deploy this application locally:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + }, + "id": "15477106" + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "c10dd962", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from vespa.deployment import VespaDocker\n", + "\n", + "vespa_docker = VespaDocker()\n", + "vespa_app = vespa_docker.deploy(application_package=app_package)" + ] + }, + { + "cell_type": "markdown", + "id": "3df4ce53", + "metadata": {}, + "source": [ + "This deploys and creates a connection to a `Vespa` service. 
In case you\n", + "already have a Vespa application running, for instance in the cloud,\n", + "please refer to the PyVespa documentation for how to connect.\n", + "\n", + "#### Creating a Vespa vector store\n", + "\n", + "Now, let's load some documents:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "from langchain.document_loaders import TextLoader\n", + "from langchain.text_splitter import CharacterTextSplitter\n", + "\n", + "loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n", + "documents = loader.load()\n", + "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n", + "docs = text_splitter.split_documents(documents)\n", + "\n", + "from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings\n", + "\n", + "embedding_function = SentenceTransformerEmbeddings(model_name=\"all-MiniLM-L6-v2\")" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + }, + "id": "7abde491" + }, + { + "cell_type": "markdown", + "source": [ + "Here, we also set up a local sentence embedder to transform the text to embedding\n", + "vectors. One could also use OpenAI embeddings, but the vector length needs to\n", + "be updated to `1536` to reflect the larger size of that embedding.\n", + "\n", + "To feed these to Vespa, we need to configure how the vector store should map to\n", + "fields in the Vespa application. 
Then we create the vector store directly from\n", + "this set of documents:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + }, + "id": "d42365c7" + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "vespa_config = dict(\n", + " page_content_field=\"text\",\n", + " embedding_field=\"embedding\",\n", + " input_field=\"query_embedding\"\n", + ")\n", + "\n", + "from langchain.vectorstores import VespaStore\n", + "\n", + "db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + }, + "id": "0b647878" + }, + { + "cell_type": "markdown", + "source": [ + "This creates a Vespa vector store and feeds that set of documents to Vespa.\n", + "The vector store takes care of calling the embedding function for each document\n", + "and inserts them into the database.\n", + "\n", + "We can now query the vector store:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + }, + "id": "d6bd0aab" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7ccca1f4", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "query = \"What did the president say about Ketanji Brown Jackson\"\n", + "results = db.similarity_search(query)\n", + "\n", + "print(results[0].page_content)" + ] + }, + { + "cell_type": "markdown", + "id": "1e7e34e1", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "This will use the embedding function given above to create a representation\n", + "for the query and use that to search Vespa. Note that this will use the\n", + "`default` ranking function, which we set up in the application package\n", + "above. 
You can use the `ranking` argument to `similarity_search` to\n", + "specify which ranking function to use.\n", + "\n", + "Please refer to the [pyvespa documentation](https://pyvespa.readthedocs.io/en/latest/getting-started-pyvespa.html#Query)\n", + "for more information.\n", + "\n", + "This covers the basic usage of the Vespa store in LangChain.\n", + "Now you can return the results and continue using these in LangChain.\n", + "\n", + "#### Updating documents\n", + "\n", + "As an alternative to calling `from_documents`, you can create the vector\n", + "store directly and call `add_texts` from that. This can also be used to update\n", + "documents:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "query = \"What did the president say about Ketanji Brown Jackson\"\n", + "results = db.similarity_search(query)\n", + "result = results[0]\n", + "\n", + "result.page_content = \"UPDATED: \" + result.page_content\n", + "db.add_texts([result.page_content], [result.metadata], result.metadata[\"id\"])\n", + "\n", + "results = db.similarity_search(query)\n", + "print(results[0].page_content)" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + }, + "id": "a5256284" + }, + { + "cell_type": "markdown", + "source": [ + "However, the `pyvespa` library contains methods to manipulate\n", + "content on Vespa which you can use directly.\n", + "\n", + "#### Deleting documents\n", + "\n", + "You can delete documents using the `delete` function:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + }, + "id": "2526b50e" + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "result = db.similarity_search(query)\n", + "# result[0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n", + "\n", + "db.delete([\"32\"])\n", + "result = db.similarity_search(query)\n", + "# result[0].metadata[\"id\"] != \"id:testapp:testapp::32\"" + ], + "metadata": { + 
"collapsed": false, + "pycharm": { + "name": "#%%\n" + } + }, + "id": "52cab87e" + }, + { + "cell_type": "markdown", + "source": [ + "Again, the `pyvespa` connection contains methods to delete documents as well.\n", + "\n", + "### Returning with scores\n", + "\n", + "The `similarity_search` method only returns the documents in order of\n", + "relevancy. To retrieve the actual scores:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + }, + "id": "deffaba5" + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "results = db.similarity_search_with_score(query)\n", + "result = results[0]\n", + "# result[1] ~= 0.463" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + }, + "id": "cd9ae173" + }, + { + "cell_type": "markdown", + "source": [ + "This is a result of using the `\"all-MiniLM-L6-v2\"` embedding model using the\n", + "cosine distance function (as given by the argument `angular` in the\n", + "application function).\n", + "\n", + "Different embedding functions need different distance functions, and Vespa\n", + "needs to know which distance function to use when orderings documents.\n", + "Please refer to the\n", + "[documentation on distance functions](https://docs.vespa.ai/en/reference/schema-reference.html#distance-metric)\n", + "for more information.\n", + "\n", + "### As retriever\n", + "\n", + "To use this vector store as a\n", + "[LangChain retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/)\n", + "simply call the `as_retriever` function, which is a standard vector store\n", + "method:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + }, + "id": "7257d67a" + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)\n", + "retriever = db.as_retriever()\n", + "query = \"What 
did the president say about Ketanji Brown Jackson\"\n", + "results = retriever.get_relevant_documents(query)\n", + "\n", + "# results[0].metadata[\"id\"] == \"id:testapp:testapp::32\"" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + }, + "id": "7fb717a9" + }, + { + "cell_type": "markdown", + "source": [ + "This allows for more general, unstructured, retrieval from the vector store.\n", + "\n", + "### Metadata\n", + "\n", + "In the example so far, we've only used the text and the embedding for that\n", + "text. Documents usually contain additional information, which in LangChain\n", + "is referred to as metadata.\n", + "\n", + "Vespa can contain many fields with different types by adding them to the application\n", + "package:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + }, + "id": "fba7f07e" + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "app_package.schema.add_fields(\n", + " # ...\n", + " Field(name=\"date\", type=\"string\", indexing=[\"attribute\", \"summary\"]),\n", + " Field(name=\"rating\", type=\"int\", indexing=[\"attribute\", \"summary\"]),\n", + " Field(name=\"author\", type=\"string\", indexing=[\"attribute\", \"summary\"]),\n", + " # ...\n", + ")\n", + "vespa_app = vespa_docker.deploy(application_package=app_package)" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + }, + "id": "59cffcf2" + }, + { + "cell_type": "markdown", + "source": [ + "We can add some metadata fields in the documents:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + }, + "id": "eebef70c" + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "# Add metadata\n", + "for i, doc in enumerate(docs):\n", + " doc.metadata[\"date\"] = f\"2023-{(i % 12)+1}-{(i % 28)+1}\"\n", + " doc.metadata[\"rating\"] = range(1, 6)[i % 5]\n", + " doc.metadata[\"author\"] = [\"Joe 
Biden\", \"Unknown\"][min(i, 1)]" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + }, + "id": "b21efbfa" + }, + { + "cell_type": "markdown", + "source": [ + "And let the Vespa vector store know about these fields:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + }, + "id": "9b42bd4d" + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "vespa_config.update(dict(metadata_fields=[\"date\", \"rating\", \"author\"]))" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + }, + "id": "6bb272f6" + }, + { + "cell_type": "markdown", + "source": [ + "Now, when searching for these documents, these fields will be returned.\n", + "Also, these fields can be filtered on:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + }, + "id": "43818655" + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)\n", + "query = \"What did the president say about Ketanji Brown Jackson\"\n", + "results = db.similarity_search(query, filter=\"rating > 3\")\n", + "# results[0].metadata[\"id\"] == \"id:testapp:testapp::34\"\n", + "# results[0].metadata[\"author\"] == \"Unknown\"" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + }, + "id": "831759f3" + }, + { + "cell_type": "markdown", + "source": [ + "### Custom query\n", + "\n", + "If the default behavior of the similarity search does not fit your\n", + "requirements, you can always provide your own query. 
Thus, you don't\n", + "need to provide all of the configuration to the vector store, but\n", + "rather just write this yourself.\n", + "\n", + "First, let's add a BM25 ranking function to our application:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + }, + "id": "a49aad6e" + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "from vespa.package import FieldSet\n", + "\n", + "app_package.schema.add_field_set(FieldSet(name=\"default\", fields=[\"text\"]))\n", + "app_package.schema.add_rank_profile(RankProfile(name=\"bm25\", first_phase=\"bm25(text)\"))\n", + "vespa_app = vespa_docker.deploy(application_package=app_package)\n", + "db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + }, + "id": "d0fb0562" + }, + { + "cell_type": "markdown", + "source": [ + "Then, to perform a regular text search based on BM25:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + }, + "id": "fe607747" + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "query = \"What did the president say about Ketanji Brown Jackson\"\n", + "custom_query = {\n", + " \"yql\": f\"select * from sources * where userQuery()\",\n", + " \"query\": query,\n", + " \"type\": \"weakAnd\",\n", + " \"ranking\": \"bm25\",\n", + " \"hits\": 4\n", + "}\n", + "results = db.similarity_search_with_score(query, custom_query=custom_query)\n", + "# results[0][0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n", + "# results[0][1] ~= 14.384" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + }, + "id": "cee245c3" + }, + { + "cell_type": "markdown", + "source": [ + "All of the powerful search and query capabilities of Vespa can be used\n", + "by using a custom query. 
Please refer to the Vespa documentation on its\n", + "[Query API](https://docs.vespa.ai/en/query-api.html) for more details.\n", + "\n", + "### Hybrid search\n", + "\n", + "Hybrid search means using both a classic term-based search such as\n", + "BM25 and a vector search and combining the results. We need to create\n", + "a new rank profile for hybrid search on Vespa:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + }, + "id": "41a4c081" + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "app_package.schema.add_rank_profile(\n", + " RankProfile(name=\"hybrid\",\n", + " first_phase=\"log(bm25(text)) + 0.5 * closeness(field, embedding)\",\n", + " inputs=[(\"query(query_embedding)\", \"tensor(x[384])\")]\n", + " )\n", + ")\n", + "vespa_app = vespa_docker.deploy(application_package=app_package)\n", + "db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + }, + "id": "bf73efc1" + }, + { + "cell_type": "markdown", + "source": [ + "Here, we score each document as a combination of its BM25 score and its\n", + "distance score. 
We can query using a custom query:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + }, + "id": "40f48711" + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "query = \"What did the president say about Ketanji Brown Jackson\"\n", + "query_embedding = embedding_function.embed_query(query)\n", + "nearest_neighbor_expression = \"{targetHits: 4}nearestNeighbor(embedding, query_embedding)\"\n", + "custom_query = {\n", + " \"yql\": f\"select * from sources * where {nearest_neighbor_expression} and userQuery()\",\n", + " \"query\": query,\n", + " \"type\": \"weakAnd\",\n", + " \"input.query(query_embedding)\": query_embedding,\n", + " \"ranking\": \"hybrid\",\n", + " \"hits\": 4\n", + "}\n", + "results = db.similarity_search_with_score(query, custom_query=custom_query)\n", + "# results[0][0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n", + "# results[0][1] ~= 2.897" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + }, + "id": "d2e289f0" + }, + { + "cell_type": "markdown", + "source": [ + "### Native embedders in Vespa\n", + "\n", + "Up until this point we've used an embedding function in Python to provide\n", + "embeddings for the texts. Vespa supports embedding functions natively, so\n", + "you can defer this calculation to Vespa. 
One benefit is the ability to use\n", + "GPUs when embedding documents if you have a large collection.\n", + "\n", + "Please refer to [Vespa embeddings](https://docs.vespa.ai/en/embedding.html)\n", + "for more information.\n", + "\n", + "First, we need to modify our application package:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + }, + "id": "958e269f" + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "from vespa.package import Component, Parameter\n", + "\n", + "app_package.components = [\n", + " Component(id=\"hf-embedder\", type=\"hugging-face-embedder\",\n", + " parameters=[\n", + " Parameter(\"transformer-model\", {\"path\": \"...\"}),\n", + " Parameter(\"tokenizer-model\", {\"url\": \"...\"}),\n", + " ]\n", + " )\n", + "]\n", + "app_package.schema.add_fields(Field(name=\"hfembedding\", type=\"tensor(x[384])\",\n", + " is_document_field=False,\n", + " indexing=[\"input text\", \"embed hf-embedder\", \"attribute\", \"summary\"],\n", + " attribute=[f\"distance-metric: angular\"],\n", + " ))\n", + "app_package.schema.add_rank_profile(\n", + " RankProfile(name=\"hf_similarity\",\n", + " first_phase=\"closeness(field, hfembedding)\",\n", + " inputs=[(\"query(query_embedding)\", \"tensor(x[384])\")]\n", + " )\n", + ")" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + }, + "id": "56b9686c" + }, + { + "cell_type": "markdown", + "source": [ + "Please refer to the embeddings documentation on adding embedder models\n", + "and tokenizers to the application. 
Note that the `hfembedding` field\n", + "includes instructions for embedding using the `hf-embedder`.\n", + "\n", + "Now we can query with a custom query:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + }, + "id": "5cd721a8" + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "query = \"What did the president say about Ketanji Brown Jackson\"\n", + "nearest_neighbor_expression = \"{targetHits: 4}nearestNeighbor(hfembedding, query_embedding)\"\n", + "custom_query = {\n", + " \"yql\": f\"select * from sources * where {nearest_neighbor_expression}\",\n", + " \"input.query(query_embedding)\": f\"embed(hf-embedder, \\\"{query}\\\")\",\n", + " \"ranking\": \"hf_similarity\",\n", + " \"hits\": 4\n", + "}\n", + "results = db.similarity_search_with_score(query, custom_query=custom_query)\n", + "# results[0][0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n", + "# results[0][1] ~= 0.630" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + }, + "id": "da631d13" + }, + { + "cell_type": "markdown", + "source": [ + "Note that the query here includes an `embed` instruction to embed the query\n", + "using the same model as for the documents.\n", + "\n", + "### Approximate nearest neighbor\n", + "\n", + "In all of the above examples, we've used exact nearest neighbor to\n", + "find results. However, for large collections of documents this is\n", + "not feasible as one has to scan through all documents to find the\n", + "best matches. 
To avoid this, we can use\n", + "[approximate nearest neighbors](https://docs.vespa.ai/en/approximate-nn-hnsw.html).\n", + "\n", + "First, we can change the embedding field to create an HNSW index:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + }, + "id": "a333b553" + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "from vespa.package import HNSW\n", + "\n", + "app_package.schema.add_fields(\n", + " Field(name=\"embedding\", type=\"tensor(x[384])\",\n", + " indexing=[\"attribute\", \"summary\", \"index\"],\n", + " ann=HNSW(distance_metric=\"angular\", max_links_per_node=16, neighbors_to_explore_at_insert=200)\n", + " )\n", + ")\n" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + }, + "id": "9ee955c8" + }, + { + "cell_type": "markdown", + "source": [ + "This creates an HNSW index on the embedding data which allows for efficient\n", + "searching. With this set, we can easily search using ANN by setting\n", + "the `approximate` argument to `True`:" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + }, + "id": "2ed1c224" + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "query = \"What did the president say about Ketanji Brown Jackson\"\n", + "results = db.similarity_search(query, approximate=True)\n", + "# results[0].metadata[\"id\"] == \"id:testapp:testapp::32\"" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + }, + "id": "7981739a" + }, + { + "cell_type": "markdown", + "source": [ + "This covers most of the functionality in the Vespa vector store in LangChain.\n", + "\n" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + }, + "id": "24791204" } - } - }, - { - "cell_type": "markdown", - "source": [ - "This sets up a Vespa application with a schema for each document that contains\n", - "two fields: `text` for 
holding the document text and `embedding` for holding\n", - "the embedding vector. The `text` field is set up to use a BM25 index for\n", - "efficient text retrieval, and we'll see how to use this and hybrid search a\n", - "bit later.\n", - "\n", - "The `embedding` field is set up with a vector of length 384 to hold the\n", - "embedding representation of the text. See\n", - "[Vespa's Tensor Guide](https://docs.vespa.ai/en/tensor-user-guide.html)\n", - "for more on tensors in Vespa.\n", - "\n", - "Lastly, we add a [rank profile](https://docs.vespa.ai/en/ranking.html) to\n", - "instruct Vespa how to order documents. Here we set this up with a\n", - "[nearest neighbor search](https://docs.vespa.ai/en/nearest-neighbor-search.html).\n", - "\n", - "Now we can deploy this application locally:" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.6" } - } }, - { - "cell_type": "code", - "execution_count": 2, - "id": "c10dd962", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "from vespa.deployment import VespaDocker\n", - "\n", - "vespa_docker = VespaDocker()\n", - "vespa_app = vespa_docker.deploy(application_package=app_package)" - ] - }, - { - "cell_type": "markdown", - "id": "3df4ce53", - "metadata": {}, - "source": [ - "This deploys and creates a connection to a `Vespa` service. 
In case you\n", - "already have a Vespa application running, for instance in the cloud,\n", - "please refer to the PyVespa application for how to connect.\n", - "\n", - "#### Creating a Vespa vector store\n", - "\n", - "Now, let's load some documents:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "from langchain.document_loaders import TextLoader\n", - "from langchain.text_splitter import CharacterTextSplitter\n", - "\n", - "loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n", - "documents = loader.load()\n", - "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n", - "docs = text_splitter.split_documents(documents)\n", - "\n", - "from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings\n", - "\n", - "embedding_function = SentenceTransformerEmbeddings(model_name=\"all-MiniLM-L6-v2\")" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "Here, we also set up local sentence embedder to transform the text to embedding\n", - "vectors. One could also use OpenAI embeddings, but the vector length needs to\n", - "be updated to `1536` to reflect the larger size of that embedding.\n", - "\n", - "To feed these to Vespa, we need to configure how the vector store should map to\n", - "fields in the Vespa application. 
Then we create the vector store directly from\n", - "this set of documents:" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "vespa_config = dict(\n", - " page_content_field=\"text\",\n", - " embedding_field=\"embedding\",\n", - " input_field=\"query_embedding\"\n", - ")\n", - "\n", - "from langchain.vectorstores import VespaStore\n", - "\n", - "db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "This creates a Vespa vector store and feeds that set of documents to Vespa.\n", - "The vector store takes care of calling the embedding function for each document\n", - "and inserts them into the database.\n", - "\n", - "We can now query the vector store:" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7ccca1f4", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "query = \"What did the president say about Ketanji Brown Jackson\"\n", - "results = db.similarity_search(query)\n", - "\n", - "print(results[0].page_content)" - ] - }, - { - "cell_type": "markdown", - "id": "1e7e34e1", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "This will use the embedding function given above to create a representation\n", - "for the query and use that to search Vespa. Note that this will use the\n", - "`default` ranking function, which we set up in the application package\n", - "above. 
You can use the `ranking` argument to `similarity_search` to\n", - "specify which ranking function to use.\n", - "\n", - "Please refer to the [pyvespa documentation](https://pyvespa.readthedocs.io/en/latest/getting-started-pyvespa.html#Query)\n", - "for more information.\n", - "\n", - "This covers the basic usage of the Vespa store in LangChain.\n", - "Now you can return the results and continue using these in LangChain.\n", - "\n", - "#### Updating documents\n", - "\n", - "An alternative to calling `from_documents`, you can create the vector\n", - "store directly and call `add_texts` from that. This can also be used to update\n", - "documents:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "query = \"What did the president say about Ketanji Brown Jackson\"\n", - "results = db.similarity_search(query)\n", - "result = results[0]\n", - "\n", - "result.page_content = \"UPDATED: \" + result.page_content\n", - "db.add_texts([result.page_content], [result.metadata], result.metadata[\"id\"])\n", - "\n", - "results = db.similarity_search(query)\n", - "print(results[0].page_content)" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "However, the `pyvespa` library contains methods to manipulate\n", - "content on Vespa which you can use directly.\n", - "\n", - "#### Deleting documents\n", - "\n", - "You can delete documents using the `delete` function:" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "result = db.similarity_search(query)\n", - "# docs[0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n", - "\n", - "db.delete([\"32\"])\n", - "result = db.similarity_search(query)\n", - "# docs[0].metadata[\"id\"] != \"id:testapp:testapp::32\"" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": 
"#%%\n" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "Again, the `pyvespa` connection contains methods to delete documents as well.\n", - "\n", - "### Returning with scores\n", - "\n", - "The `similarity_search` method only returns the documents in order of\n", - "relevancy. To retrieve the actual scores:" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "results = db.similarity_search_with_score(query)\n", - "result = results[0]\n", - "# result[1] ~= 0.463" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "This is a result of using the `\"all-MiniLM-L6-v2\"` embedding model using the\n", - "cosine distance function (as given by the argument `angular` in the\n", - "application function).\n", - "\n", - "Different embedding functions need different distance functions, and Vespa\n", - "needs to know which distance function to use when orderings documents.\n", - "Please refer to the\n", - "[documentation on distance functions](https://docs.vespa.ai/en/reference/schema-reference.html#distance-metric)\n", - "for more information.\n", - "\n", - "### As retriever\n", - "\n", - "To use this vector store as a\n", - "[LangChain retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/)\n", - "simply call the `as_retriever` function, which is a standard vector store\n", - "method:" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)\n", - "retriever = db.as_retriever()\n", - "query = \"What did the president say about Ketanji Brown Jackson\"\n", - "results = retriever.get_relevant_documents(query)\n", - "\n", - "# 
results[0].metadata[\"id\"] == \"id:testapp:testapp::32\"" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "This allows for more general, unstructured, retrieval from the vector store.\n", - "\n", - "### Metadata\n", - "\n", - "In the example so far, we've only used the text and the embedding for that\n", - "text. Documents usually contain additional information, which in LangChain\n", - "is referred to as metadata.\n", - "\n", - "Vespa can contain many fields with different types by adding them to the application\n", - "package:" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "app_package.schema.add_fields(\n", - " # ...\n", - " Field(name=\"date\", type=\"string\", indexing=[\"attribute\", \"summary\"]),\n", - " Field(name=\"rating\", type=\"int\", indexing=[\"attribute\", \"summary\"]),\n", - " Field(name=\"author\", type=\"string\", indexing=[\"attribute\", \"summary\"]),\n", - " # ...\n", - ")\n", - "vespa_app = vespa_docker.deploy(application_package=app_package)" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "We can add some metadata fields in the documents:" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "# Add metadata\n", - "for i, doc in enumerate(docs):\n", - " doc.metadata[\"date\"] = f\"2023-{(i % 12)+1}-{(i % 28)+1}\"\n", - " doc.metadata[\"rating\"] = range(1, 6)[i % 5]\n", - " doc.metadata[\"author\"] = [\"Joe Biden\", \"Unknown\"][min(i, 1)]" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "And let the Vespa vector store know 
about these fields:" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "vespa_config.update(dict(metadata_fields=[\"date\", \"rating\", \"author\"]))" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "Now, when searching for these documents, these fields will be returned.\n", - "Also, these fields can be filtered on:" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)\n", - "query = \"What did the president say about Ketanji Brown Jackson\"\n", - "results = db.similarity_search(query, filter=\"rating > 3\")\n", - "# results[0].metadata[\"id\"] == \"id:testapp:testapp::34\"\n", - "# results[0].metadata[\"author\"] == \"Unknown\"" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "### Custom query\n", - "\n", - "If the default behavior of the similarity search does not fit your\n", - "requirements, you can always provide your own query. 
Thus, you don't\n", - "need to provide all of the configuration to the vector store, but\n", - "rather just write this yourself.\n", - "\n", - "First, let's add a BM25 ranking function to our application:" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "from vespa.package import FieldSet\n", - "\n", - "app_package.schema.add_field_set(FieldSet(name=\"default\", fields=[\"text\"]))\n", - "app_package.schema.add_rank_profile(RankProfile(name=\"bm25\", first_phase=\"bm25(text)\"))\n", - "vespa_app = vespa_docker.deploy(application_package=app_package)\n", - "db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "Then, to perform a regular text search based on BM25:" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "query = \"What did the president say about Ketanji Brown Jackson\"\n", - "custom_query = {\n", - " \"yql\": f\"select * from sources * where userQuery()\",\n", - " \"query\": query,\n", - " \"type\": \"weakAnd\",\n", - " \"ranking\": \"bm25\",\n", - " \"hits\": 4\n", - "}\n", - "results = db.similarity_search_with_score(query, custom_query=custom_query)\n", - "# results[0][0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n", - "# results[0][1] ~= 14.384" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "All of the powerful search and query capabilities of Vespa can be used\n", - "by using a custom query. 
Please refer to the Vespa documentation on its\n", - "[Query API](https://docs.vespa.ai/en/query-api.html) for more details.\n", - "\n", - "### Hybrid search\n", - "\n", - "Hybrid search means using both a classic term-based search such as\n", - "BM25 and a vector search and combining the results. We need to create\n", - "a new rank profile for hybrid search on Vespa:" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "app_package.schema.add_rank_profile(\n", - " RankProfile(name=\"hybrid\",\n", - " first_phase=\"log(bm25(text)) + 0.5 * closeness(field, embedding)\",\n", - " inputs=[(\"query(query_embedding)\", \"tensor(x[384])\")]\n", - " )\n", - ")\n", - "vespa_app = vespa_docker.deploy(application_package=app_package)\n", - "db = VespaStore.from_documents(docs, embedding_function, app=vespa_app, **vespa_config)" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "Here, we score each document as a combination of its BM25 score and its\n", - "distance score. 
We can query using a custom query:" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "query = \"What did the president say about Ketanji Brown Jackson\"\n", - "query_embedding = embedding_function.embed_query(query)\n", - "nearest_neighbor_expression = \"{targetHits: 4}nearestNeighbor(embedding, query_embedding)\"\n", - "custom_query = {\n", - " \"yql\": f\"select * from sources * where {nearest_neighbor_expression} and userQuery()\",\n", - " \"query\": query,\n", - " \"type\": \"weakAnd\",\n", - " \"input.query(query_embedding)\": query_embedding,\n", - " \"ranking\": \"hybrid\",\n", - " \"hits\": 4\n", - "}\n", - "results = db.similarity_search_with_score(query, custom_query=custom_query)\n", - "# results[0][0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n", - "# results[0][1] ~= 2.897" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "### Native embedders in Vespa\n", - "\n", - "Up until this point we've used an embedding function in Python to provide\n", - "embeddings for the texts. Vespa supports embedding functions natively, so\n", - "you can defer this calculation to Vespa. 
One benefit is the ability to use\n", - "GPUs when embedding documents if you have a large collection.\n", - "\n", - "Please refer to [Vespa embeddings](https://docs.vespa.ai/en/embedding.html)\n", - "for more information.\n", - "\n", - "First, we need to modify our application package:" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "from vespa.package import Component, Parameter\n", - "\n", - "app_package.components = [\n", - " Component(id=\"hf-embedder\", type=\"hugging-face-embedder\",\n", - " parameters=[\n", - " Parameter(\"transformer-model\", {\"path\": \"...\"}),\n", - " Parameter(\"tokenizer-model\", {\"url\": \"...\"}),\n", - " ]\n", - " )\n", - "]\n", - "Field(name=\"hfembedding\", type=\"tensor(x[384])\",\n", - " is_document_field=False,\n", - " indexing=[\"input text\", \"embed hf-embedder\", \"attribute\", \"summary\"],\n", - " attribute=[f\"distance-metric: angular\"],\n", - " )\n", - "app_package.schema.add_rank_profile(\n", - " RankProfile(name=\"hf_similarity\",\n", - " first_phase=\"closeness(field, hfembedding)\",\n", - " inputs=[(\"query(query_embedding)\", \"tensor(x[384])\")]\n", - " )\n", - ")" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "Please refer to the embeddings documentation on adding embedder models\n", - "and tokenizers to the application. 
Note that the `hfembedding` field\n", - "includes instructions for embedding using the `hf-embedder`.\n", - "\n", - "Now we can query with a custom query:" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "query = \"What did the president say about Ketanji Brown Jackson\"\n", - "nearest_neighbor_expression = \"{targetHits: 4}nearestNeighbor(hfembedding, query_embedding)\"\n", - "custom_query = {\n", - " \"yql\": f\"select * from sources * where {nearest_neighbor_expression}\",\n", - " \"input.query(query_embedding)\": f\"embed(hf-embedder, \\\"{query}\\\")\",\n", - " \"ranking\": \"hf_similarity\",\n", - " \"hits\": 4\n", - "}\n", - "results = db.similarity_search_with_score(query, custom_query=custom_query)\n", - "# results[0][0].metadata[\"id\"] == \"id:testapp:testapp::32\"\n", - "# results[0][1] ~= 0.630" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "Note that the query here includes an `embed` instruction to embed the query\n", - "using the same model as for the documents.\n", - "\n", - "### Approximate nearest neighbor\n", - "\n", - "In all of the above examples, we've used exact nearest neighbor to\n", - "find results. However, for large collections of documents this is\n", - "not feasible as one has to scan through all documents to find the\n", - "best matches. 
To avoid this, we can use\n", - "[approximate nearest neighbors](https://docs.vespa.ai/en/approximate-nn-hnsw.html).\n", - "\n", - "First, we can change the embedding field to create a HNSW index:" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "from vespa.package import HNSW\n", - "\n", - "app_package.schema.add_fields(\n", - " Field(name=\"embedding\", type=\"tensor(x[384])\",\n", - " indexing=[\"attribute\", \"summary\", \"index\"],\n", - " ann=HNSW(distance_metric=\"angular\", max_links_per_node=16, neighbors_to_explore_at_insert=200)\n", - " )\n", - ")\n" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "This creates a HNSW index on the embedding data which allows for efficient\n", - "searching. With this set, we can easily search using ANN by setting\n", - "the `approximate` argument to `True`:" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "query = \"What did the president say about Ketanji Brown Jackson\"\n", - "results = db.similarity_search(query, approximate=True)\n", - "# results[0].metadata[\"id\"] == \"id:testapp:testapp::32\"" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } - }, - { - "cell_type": "markdown", - "source": [ - "This covers most of the functionality in the Vespa vector store in LangChain.\n", - "\n" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": 
"text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.6" - } - }, - "nbformat": 4, - "nbformat_minor": 5 + "nbformat": 4, + "nbformat_minor": 5 } \ No newline at end of file diff --git a/docs/extras/modules/data_connection/retrievers/self_query/opensearch_self_query.ipynb b/docs/extras/modules/data_connection/retrievers/self_query/opensearch_self_query.ipynb index 947960c2170..0045176faa5 100644 --- a/docs/extras/modules/data_connection/retrievers/self_query/opensearch_self_query.ipynb +++ b/docs/extras/modules/data_connection/retrievers/self_query/opensearch_self_query.ipynb @@ -1,439 +1,440 @@ { - "cells": [ - { - "cell_type": "markdown", - "id": "13afcae7", - "metadata": {}, - "source": [ - "# OpenSearch\n", - "\n", - "> [OpenSearch](https://opensearch.org/) is a scalable, flexible, and extensible open-source software suite for search, analytics, and observability applications licensed under Apache 2.0. `OpenSearch` is a distributed search and analytics engine based on `Apache Lucene`.\n", - "\n", - "In this notebook, we'll demo the `SelfQueryRetriever` with an `OpenSearch` vector store." - ] - }, - { - "cell_type": "markdown", - "id": "68e75fb9", - "metadata": {}, - "source": [ - "## Creating an OpenSearch vector store\n", - "\n", - "First, we'll want to create an `OpenSearch` vector store and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n", - "\n", - "**Note:** The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `opensearch-py` package." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "!pip install lark opensearch-py" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "cb4a5787", - "metadata": { - "tags": [] - }, - "outputs": [ + "cells": [ { - "name": "stdin", - "output_type": "stream", - "text": [ - "OpenAI API Key: ········\n" - ] - } - ], - "source": [ - "from langchain.schema import Document\n", - "from langchain.embeddings.openai import OpenAIEmbeddings\n", - "from langchain.vectorstores import OpenSearchVectorSearch\n", - "import os\n", - "import getpass\n", - "\n", - "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")\n", - "\n", - "embeddings = OpenAIEmbeddings()" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "bcbe04d9", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "docs = [\n", - " Document(\n", - " page_content=\"A bunch of scientists bring back dinosaurs and mayhem breaks loose\",\n", - " metadata={\"year\": 1993, \"rating\": 7.7, \"genre\": \"science fiction\"},\n", - " ),\n", - " Document(\n", - " page_content=\"Leo DiCaprio gets lost in a dream within a dream within a dream within a ...\",\n", - " metadata={\"year\": 2010, \"director\": \"Christopher Nolan\", \"rating\": 8.2},\n", - " ),\n", - " Document(\n", - " page_content=\"A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea\",\n", - " metadata={\"year\": 2006, \"director\": \"Satoshi Kon\", \"rating\": 8.6},\n", - " ),\n", - " Document(\n", - " page_content=\"A bunch of normal-sized women are supremely wholesome and some men pine after them\",\n", - " metadata={\"year\": 2019, \"director\": \"Greta Gerwig\", \"rating\": 8.3},\n", - " ),\n", - " Document(\n", - " page_content=\"Toys come alive and have a blast doing so\",\n", - " metadata={\"year\": 1995, 
\"genre\": \"animated\"},\n", - " ),\n", - " Document(\n", - " page_content=\"Three men walk into the Zone, three men walk out of the Zone\",\n", - " metadata={\n", - " \"year\": 1979,\n", - " \"rating\": 9.9,\n", - " \"director\": \"Andrei Tarkovsky\",\n", - " \"genre\": \"science fiction\",\n", - " },\n", - " ),\n", - "]\n", - "vectorstore = OpenSearchVectorSearch.from_documents(\n", - " docs, embeddings, index_name=\"opensearch-self-query-demo\", opensearch_url=\"http://localhost:9200\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "5ecaab6d", - "metadata": {}, - "source": [ - "## Creating our self-querying retriever\n", - "Now we can instantiate our retriever. To do this we'll need to provide some information upfront about the metadata fields that our documents support and a short description of the document contents." - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "86e34dbf", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "from langchain.llms import OpenAI\n", - "from langchain.retrievers.self_query.base import SelfQueryRetriever\n", - "from langchain.chains.query_constructor.base import AttributeInfo\n", - "\n", - "metadata_field_info = [\n", - " AttributeInfo(\n", - " name=\"genre\",\n", - " description=\"The genre of the movie\",\n", - " type=\"string or list[string]\",\n", - " ),\n", - " AttributeInfo(\n", - " name=\"year\",\n", - " description=\"The year the movie was released\",\n", - " type=\"integer\",\n", - " ),\n", - " AttributeInfo(\n", - " name=\"director\",\n", - " description=\"The name of the movie director\",\n", - " type=\"string\",\n", - " ),\n", - " AttributeInfo(\n", - " name=\"rating\", description=\"A 1-10 rating for the movie\", type=\"float\"\n", - " ),\n", - "]\n", - "document_content_description = \"Brief summary of a movie\"\n", - "llm = OpenAI(temperature=0)\n", - "retriever = SelfQueryRetriever.from_llm(\n", - " llm, vectorstore, document_content_description, metadata_field_info, 
verbose=True\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "ea9df8d4", - "metadata": {}, - "source": [ - "## Testing it out\n", - "And now we can try actually using our retriever!" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "38a126e9", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "query='dinosaur' filter=None limit=None\n" - ] + "cell_type": "markdown", + "id": "13afcae7", + "metadata": {}, + "source": [ + "# OpenSearch\n", + "\n", + "> [OpenSearch](https://opensearch.org/) is a scalable, flexible, and extensible open-source software suite for search, analytics, and observability applications licensed under Apache 2.0. `OpenSearch` is a distributed search and analytics engine based on `Apache Lucene`.\n", + "\n", + "In this notebook, we'll demo the `SelfQueryRetriever` with an `OpenSearch` vector store." + ] }, { - "data": { - "text/plain": [ - "[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'rating': 7.7, 'genre': 'science fiction'}),\n", - " Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'}),\n", - " Document(page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a ...', metadata={'year': 2010, 'director': 'Christopher Nolan', 'rating': 8.2}),\n", - " Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'rating': 9.9, 'director': 'Andrei Tarkovsky', 'genre': 'science fiction'})]" + "cell_type": "markdown", + "id": "68e75fb9", + "metadata": {}, + "source": [ + "## Creating an OpenSearch vector store\n", + "\n", + "First, we'll want to create an `OpenSearch` vector store and seed it with some data. 
We've created a small demo set of documents that contain summaries of movies.\n", + "\n", + "**Note:** The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `opensearch-py` package." ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# This example only specifies a relevant query\n", - "retriever.get_relevant_documents(\"What are some movies about dinosaurs\")" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "60bf0074-e65e-4558-a4f2-8190f3e4e2f9", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "query=' ' filter=Comparison(comparator=, attribute='rating', value=8.5) limit=None\n" - ] }, { - "data": { - "text/plain": [ - "[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'rating': 9.9, 'director': 'Andrei Tarkovsky', 'genre': 'science fiction'}),\n", - " Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006, 'director': 'Satoshi Kon', 'rating': 8.6})]" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# This example only specifies a filter\n", - "retriever.get_relevant_documents(\"I want to watch a movie rated higher than 8.5\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "b19d4da0", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "query='women' filter=Comparison(comparator=, attribute='director', value='Greta Gerwig') limit=None\n" - ] + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "!pip install lark opensearch-py" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + }, + "id": "6078a74d" }, { - "data": { - "text/plain": 
[ - "[Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'year': 2019, 'director': 'Greta Gerwig', 'rating': 8.3})]" + "cell_type": "code", + "execution_count": 3, + "id": "cb4a5787", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdin", + "output_type": "stream", + "text": [ + "OpenAI API Key: \u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\n" + ] + } + ], + "source": [ + "from langchain.schema import Document\n", + "from langchain.embeddings.openai import OpenAIEmbeddings\n", + "from langchain.vectorstores import OpenSearchVectorSearch\n", + "import os\n", + "import getpass\n", + "\n", + "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")\n", + "\n", + "embeddings = OpenAIEmbeddings()" ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# This example specifies a query and a filter\n", - "retriever.get_relevant_documents(\"Has Greta Gerwig directed any movies about women\")" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "a59f946b-78a1-4d3e-9942-63834c7d7589", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "query=' ' filter=Operation(operator=, arguments=[Comparison(comparator=, attribute='rating', value=8.5), Comparison(comparator=, attribute='genre', value='science fiction')]) limit=None\n" - ] }, { - "data": { - "text/plain": [ - "[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'rating': 9.9, 'director': 'Andrei Tarkovsky', 'genre': 'science fiction'})]" + "cell_type": "code", + "execution_count": 8, + "id": "bcbe04d9", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "docs = [\n", + " Document(\n", + " page_content=\"A bunch of scientists bring back dinosaurs and mayhem breaks loose\",\n", + " metadata={\"year\": 1993, \"rating\": 7.7, \"genre\": 
\"science fiction\"},\n", + " ),\n", + " Document(\n", + " page_content=\"Leo DiCaprio gets lost in a dream within a dream within a dream within a ...\",\n", + " metadata={\"year\": 2010, \"director\": \"Christopher Nolan\", \"rating\": 8.2},\n", + " ),\n", + " Document(\n", + " page_content=\"A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea\",\n", + " metadata={\"year\": 2006, \"director\": \"Satoshi Kon\", \"rating\": 8.6},\n", + " ),\n", + " Document(\n", + " page_content=\"A bunch of normal-sized women are supremely wholesome and some men pine after them\",\n", + " metadata={\"year\": 2019, \"director\": \"Greta Gerwig\", \"rating\": 8.3},\n", + " ),\n", + " Document(\n", + " page_content=\"Toys come alive and have a blast doing so\",\n", + " metadata={\"year\": 1995, \"genre\": \"animated\"},\n", + " ),\n", + " Document(\n", + " page_content=\"Three men walk into the Zone, three men walk out of the Zone\",\n", + " metadata={\n", + " \"year\": 1979,\n", + " \"rating\": 9.9,\n", + " \"director\": \"Andrei Tarkovsky\",\n", + " \"genre\": \"science fiction\",\n", + " },\n", + " ),\n", + "]\n", + "vectorstore = OpenSearchVectorSearch.from_documents(\n", + " docs, embeddings, index_name=\"opensearch-self-query-demo\", opensearch_url=\"http://localhost:9200\"\n", + ")" ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# This example specifies a composite filter\n", - "retriever.get_relevant_documents(\"What's a highly rated (above 8.5) science fiction film?\")" - ] - }, - { - "cell_type": "markdown", - "id": "39bd1de1-b9fe-4a98-89da-58d8a7a6ae51", - "metadata": {}, - "source": [ - "## Filter k\n", - "\n", - "We can also use the self query retriever to specify `k`: the number of documents to fetch.\n", - "\n", - "We can do this by passing `enable_limit=True` to the constructor." 
- ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "bff36b88-b506-4877-9c63-e5a1a8d78e64", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "retriever = SelfQueryRetriever.from_llm(\n", - " llm,\n", - " vectorstore,\n", - " document_content_description,\n", - " metadata_field_info,\n", - " enable_limit=True,\n", - " verbose=True,\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "2758d229-4f97-499c-819f-888acaf8ee10", - "metadata": { - "tags": [] - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "query='dinosaur' filter=None limit=2\n" - ] }, { - "data": { - "text/plain": [ - "[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'rating': 7.7, 'genre': 'science fiction'}),\n", - " Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'})]" + "cell_type": "markdown", + "id": "5ecaab6d", + "metadata": {}, + "source": [ + "## Creating our self-querying retriever\n", + "Now we can instantiate our retriever. To do this we'll need to provide some information upfront about the metadata fields that our documents support and a short description of the document contents." ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# This example only specifies a relevant query\n", - "retriever.get_relevant_documents(\"what are two movies about dinosaurs\")" - ] - }, - { - "cell_type": "markdown", - "id": "61a10294", - "metadata": {}, - "source": [ - "## Complex queries in Action!\n", - "We've tried out some simple queries, but what about more complex ones? Let's try out a few more complex queries that utilize the full power of OpenSearch." 
- ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "e460da93", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "query='animated toys' filter=Operation(operator=, arguments=[Operation(operator=, arguments=[Comparison(comparator=, attribute='genre', value='animated'), Comparison(comparator=, attribute='genre', value='comedy')]), Comparison(comparator=, attribute='year', value=1990)]) limit=None\n" - ] }, { - "data": { - "text/plain": [ - "[Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'})]" + "cell_type": "code", + "execution_count": 9, + "id": "86e34dbf", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from langchain.llms import OpenAI\n", + "from langchain.retrievers.self_query.base import SelfQueryRetriever\n", + "from langchain.chains.query_constructor.base import AttributeInfo\n", + "\n", + "metadata_field_info = [\n", + " AttributeInfo(\n", + " name=\"genre\",\n", + " description=\"The genre of the movie\",\n", + " type=\"string or list[string]\",\n", + " ),\n", + " AttributeInfo(\n", + " name=\"year\",\n", + " description=\"The year the movie was released\",\n", + " type=\"integer\",\n", + " ),\n", + " AttributeInfo(\n", + " name=\"director\",\n", + " description=\"The name of the movie director\",\n", + " type=\"string\",\n", + " ),\n", + " AttributeInfo(\n", + " name=\"rating\", description=\"A 1-10 rating for the movie\", type=\"float\"\n", + " ),\n", + "]\n", + "document_content_description = \"Brief summary of a movie\"\n", + "llm = OpenAI(temperature=0)\n", + "retriever = SelfQueryRetriever.from_llm(\n", + " llm, vectorstore, document_content_description, metadata_field_info, verbose=True\n", + ")" ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "retriever.get_relevant_documents(\"what animated or comedy movies have been released in the last 30 
years about animated toys?\")" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "0851fc42", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [ + }, { - "data": { - "text/plain": [ - "{'acknowledged': True}" + "cell_type": "markdown", + "id": "ea9df8d4", + "metadata": {}, + "source": [ + "## Testing it out\n", + "And now we can try actually using our retriever!" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "38a126e9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "query='dinosaur' filter=None limit=None\n" + ] + }, + { + "data": { + "text/plain": [ + "[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'rating': 7.7, 'genre': 'science fiction'}),\n", + " Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'}),\n", + " Document(page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a ...', metadata={'year': 2010, 'director': 'Christopher Nolan', 'rating': 8.2}),\n", + " Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'rating': 9.9, 'director': 'Andrei Tarkovsky', 'genre': 'science fiction'})]" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# This example only specifies a relevant query\n", + "retriever.get_relevant_documents(\"What are some movies about dinosaurs\")" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "60bf0074-e65e-4558-a4f2-8190f3e4e2f9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "query=' ' filter=Comparison(comparator=, attribute='rating', value=8.5) limit=None\n" + ] + }, + { + "data": { + "text/plain": [ + "[Document(page_content='Three men walk into the Zone, three men walk out of 
the Zone', metadata={'year': 1979, 'rating': 9.9, 'director': 'Andrei Tarkovsky', 'genre': 'science fiction'}),\n", + " Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006, 'director': 'Satoshi Kon', 'rating': 8.6})]" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# This example only specifies a filter\n", + "retriever.get_relevant_documents(\"I want to watch a movie rated higher than 8.5\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "b19d4da0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "query='women' filter=Comparison(comparator=, attribute='director', value='Greta Gerwig') limit=None\n" + ] + }, + { + "data": { + "text/plain": [ + "[Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'year': 2019, 'director': 'Greta Gerwig', 'rating': 8.3})]" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# This example specifies a query and a filter\n", + "retriever.get_relevant_documents(\"Has Greta Gerwig directed any movies about women\")" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "a59f946b-78a1-4d3e-9942-63834c7d7589", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "query=' ' filter=Operation(operator=, arguments=[Comparison(comparator=, attribute='rating', value=8.5), Comparison(comparator=, attribute='genre', value='science fiction')]) limit=None\n" + ] + }, + { + "data": { + "text/plain": [ + "[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'rating': 9.9, 'director': 'Andrei Tarkovsky', 'genre': 'science fiction'})]" + ] + }, + "execution_count": 
13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# This example specifies a composite filter\n", + "retriever.get_relevant_documents(\"What's a highly rated (above 8.5) science fiction film?\")" + ] + }, + { + "cell_type": "markdown", + "id": "39bd1de1-b9fe-4a98-89da-58d8a7a6ae51", + "metadata": {}, + "source": [ + "## Filter k\n", + "\n", + "We can also use the self query retriever to specify `k`: the number of documents to fetch.\n", + "\n", + "We can do this by passing `enable_limit=True` to the constructor." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "bff36b88-b506-4877-9c63-e5a1a8d78e64", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "retriever = SelfQueryRetriever.from_llm(\n", + " llm,\n", + " vectorstore,\n", + " document_content_description,\n", + " metadata_field_info,\n", + " enable_limit=True,\n", + " verbose=True,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "2758d229-4f97-499c-819f-888acaf8ee10", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "query='dinosaur' filter=None limit=2\n" + ] + }, + { + "data": { + "text/plain": [ + "[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'year': 1993, 'rating': 7.7, 'genre': 'science fiction'}),\n", + " Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'})]" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# This example only specifies a relevant query\n", + "retriever.get_relevant_documents(\"what are two movies about dinosaurs\")" + ] + }, + { + "cell_type": "markdown", + "id": "61a10294", + "metadata": {}, + "source": [ + "## Complex queries in Action!\n", + "We've tried out some simple queries, but what about more complex ones? 
Let's try out a few more complex queries that utilize the full power of OpenSearch." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "e460da93", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "query='animated toys' filter=Operation(operator=, arguments=[Operation(operator=, arguments=[Comparison(comparator=, attribute='genre', value='animated'), Comparison(comparator=, attribute='genre', value='comedy')]), Comparison(comparator=, attribute='year', value=1990)]) limit=None\n" + ] + }, + { + "data": { + "text/plain": [ + "[Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'})]" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "retriever.get_relevant_documents(\"what animated or comedy movies have been released in the last 30 years about animated toys?\")" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "0851fc42", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "{'acknowledged': True}" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "vectorstore.client.indices.delete(index=\"opensearch-self-query-demo\")\n" ] - }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" } - ], - "source": [ - "vectorstore.client.indices.delete(index=\"opensearch-self-query-demo\")\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": 
"python", + "pygments_lexer": "ipython3", + "version": "3.9.18" + } }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.18" - } - }, - "nbformat": 4, - "nbformat_minor": 5 + "nbformat": 4, + "nbformat_minor": 5 } \ No newline at end of file diff --git a/docs/snippets/modules/data_connection/document_transformers/text_splitters/code_splitter.mdx b/docs/snippets/modules/data_connection/document_transformers/text_splitters/code_splitter.mdx index 0a6135e3033..2087fc9e4ca 100644 --- a/docs/snippets/modules/data_connection/document_transformers/text_splitters/code_splitter.mdx +++ b/docs/snippets/modules/data_connection/document_transformers/text_splitters/code_splitter.mdx @@ -31,7 +31,8 @@ from langchain.text_splitter import ( 'markdown', 'latex', 'html', - 'sol',] + 'sol', + 'csharp'] ``` @@ -342,3 +343,72 @@ sol_docs ``` + + +## C# +Here's an example using the C# text splitter: + +```csharp +using System; +class Program +{ + static void Main() + { + int age = 30; // Change the age value as needed + + // Categorize the age without any console output + if (age < 18) + { + // Age is under 18 + } + else if (age >= 18 && age < 65) + { + // Age is an adult + } + else + { + // Age is a senior citizen + } + } +} +``` + + + +``` + [Document(page_content='using System;', metadata={}), + Document(page_content='class Program\n{', metadata={}), + Document(page_content='static void', metadata={}), + Document(page_content='Main()', metadata={}), + Document(page_content='{', metadata={}), + Document(page_content='int age', metadata={}), + Document(page_content='= 30; // Change', metadata={}), + Document(page_content='the age value', metadata={}), + Document(page_content='as needed', metadata={}), + Document(page_content='//', metadata={}), + Document(page_content='Categorize the', 
metadata={}), + Document(page_content='age without any', metadata={}), + Document(page_content='console output', metadata={}), + Document(page_content='if (age', metadata={}), + Document(page_content='< 18)', metadata={}), + Document(page_content='{', metadata={}), + Document(page_content='//', metadata={}), + Document(page_content='Age is under 18', metadata={}), + Document(page_content='}', metadata={}), + Document(page_content='else if', metadata={}), + Document(page_content='(age >= 18 &&', metadata={}), + Document(page_content='age < 65)', metadata={}), + Document(page_content='{', metadata={}), + Document(page_content='//', metadata={}), + Document(page_content='Age is an adult', metadata={}), + Document(page_content='}', metadata={}), + Document(page_content='else', metadata={}), + Document(page_content='{', metadata={}), + Document(page_content='//', metadata={}), + Document(page_content='Age is a senior', metadata={}), + Document(page_content='citizen', metadata={}), + Document(page_content='}\n }', metadata={}), + Document(page_content='}', metadata={})] + ``` + +
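The C# output above shows the code being cut at line breaks and then at spaces once a piece exceeds the chunk size. The sketch below illustrates that general idea in plain Python. It is a simplified stand-in, not LangChain's implementation: the function name `split_code`, the `SEPARATORS` list, and the chunk size of 16 are all assumptions made for illustration, and unlike the real splitter it does not merge adjacent small pieces back together.

```python
import re
from typing import List

# Hypothetical separator list for C#-like code, ordered from coarse to fine.
# The real separator list is not shown in this diff.
SEPARATORS = ["\nclass ", "\nstatic ", "\nif ", "\nelse", "\n\n", "\n", " ", ""]


def split_code(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Recursively split text into pieces of at most chunk_size characters."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= chunk_size:
        return [text]
    for index, separator in enumerate(separators):
        if separator == "":
            # Last resort: cut at fixed offsets.
            return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        if separator in text:
            # Lookahead split keeps the separator (e.g. the keyword) with the
            # piece that follows it instead of discarding it.
            parts = re.split("(?=" + re.escape(separator) + ")", text)
            chunks: List[str] = []
            for part in parts:
                # Oversized pieces fall through to the finer separators.
                chunks.extend(split_code(part, separators[index + 1:], chunk_size))
            return chunks
    return [text]


sample = (
    "using System;\n"
    "class Program\n"
    "{\n"
    "    static void Main()\n"
    "    {\n"
    "        int age = 30;\n"
    "    }\n"
    "}"
)
chunks = split_code(sample, SEPARATORS, 16)
for chunk in chunks:
    print(repr(chunk))
```

Trying coarse separators first keeps whole declarations together when they fit, and only falls back to spaces or fixed-width cuts for pieces that are still too long.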