langchain/docs/docs/integrations/retrievers/self_query/vectara_self_query.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "13afcae7",
   "metadata": {},
   "source": [
    "# Vectara self-querying \n",
    "\n",
    "[Vectara](https://vectara.com/) provides a Trusted Generative AI platform, allowing organizations to rapidly create a ChatGPT-like experience (an AI assistant) which is grounded in the data, documents, and knowledge that they have (technically, it is Retrieval-Augmented-Generation-as-a-service). \n",
    "\n",
    "Vectara serverless RAG-as-a-service provides all the components of RAG behind an easy-to-use API, including:\n",
    "1. A way to extract text from files (PDF, PPT, DOCX, etc)\n",
    "2. ML-based chunking that provides state of the art performance.\n",
    "3. The [Boomerang](https://vectara.com/how-boomerang-takes-retrieval-augmented-generation-to-the-next-level-via-grounded-generation/) embeddings model.\n",
    "4. Its own internal vector database where text chunks and embedding vectors are stored.\n",
    "5. A query service that automatically encodes the query into embedding, and retrieves the most relevant text segments (including support for [Hybrid Search](https://docs.vectara.com/docs/api-reference/search-apis/lexical-matching) and [MMR](https://vectara.com/get-diverse-results-and-comprehensive-summaries-with-vectaras-mmr-reranker/))\n",
    "7. An LLM to for creating a [generative summary](https://docs.vectara.com/docs/learn/grounded-generation/grounded-generation-overview), based on the retrieved documents (context), including citations.\n",
    "\n",
    "See the [Vectara API documentation](https://docs.vectara.com/docs/) for more information on how to use the API.\n",
    "\n",
    "This notebook shows how to use `SelfQueryRetriever` with Vectara."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "68e75fb9",
   "metadata": {},
   "source": [
    "# Getting Started\n",
    "\n",
    "To get started, use the following steps:\n",
    "1. If you don't already have one, [Sign up](https://www.vectara.com/integrations/langchain) for your free Vectara account. Once you have completed your sign up you will have a Vectara customer ID. You can find your customer ID by clicking on your name, on the top-right of the Vectara console window.\n",
    "2. Within your account you can create one or more corpora. Each corpus represents an area that stores text data upon ingest from input documents. To create a corpus, use the **\"Create Corpus\"** button. You then provide a name to your corpus as well as a description. Optionally you can define filtering attributes and apply some advanced options. If you click on your created corpus, you can see its name and corpus ID right on the top.\n",
    "3. Next you'll need to create API keys to access the corpus. Click on the **\"Access Control\"** tab in the corpus view and then the **\"Create API Key\"** button. Give your key a name, and choose whether you want query-only or query+index for your key. Click \"Create\" and you now have an active API key. Keep this key confidential. \n",
    "\n",
    "To use LangChain with Vectara, you'll need to have these three values: `customer ID`, `corpus ID` and `api_key`.\n",
    "You can provide those to LangChain in two ways:\n",
    "\n",
    "1. Include in your environment these three variables: `VECTARA_CUSTOMER_ID`, `VECTARA_CORPUS_ID` and `VECTARA_API_KEY`.\n",
    "\n",
    "   For example, you can set these variables using os.environ and getpass as follows:\n",
    "\n",
    "```python\n",
    "import os\n",
    "import getpass\n",
    "\n",
    "os.environ[\"VECTARA_CUSTOMER_ID\"] = getpass.getpass(\"Vectara Customer ID:\")\n",
    "os.environ[\"VECTARA_CORPUS_ID\"] = getpass.getpass(\"Vectara Corpus ID:\")\n",
    "os.environ[\"VECTARA_API_KEY\"] = getpass.getpass(\"Vectara API Key:\")\n",
    "```\n",
    "\n",
    "2. Add them to the `Vectara` vectorstore constructor:\n",
    "\n",
    "```python\n",
    "vectara = Vectara(\n",
    "                vectara_customer_id=vectara_customer_id,\n",
    "                vectara_corpus_id=vectara_corpus_id,\n",
    "                vectara_api_key=vectara_api_key\n",
    "            )\n",
    "```\n",
    "In this notebook we assume they are provided in the environment.\n",
    "\n",
    "**Notes:** The self-query retriever requires you to have `lark` installed (`pip install lark`). "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "742ac16d",
   "metadata": {},
   "source": [
    "## Connecting to Vectara from LangChain\n",
    "\n",
    "In this example, we assume that you've created an account and a corpus, and added your `VECTARA_CUSTOMER_ID`, `VECTARA_CORPUS_ID` and `VECTARA_API_KEY` (created with permissions for both indexing and query) as environment variables.\n",
    "\n",
    "We further assume the corpus has 4 fields defined as filterable metadata attributes: `year`, `director`, `rating`, and `genre`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "9d3aa44f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\"VECTARA_API_KEY\"] = \"<YOUR_VECTARA_API_KEY>\"\n",
    "os.environ[\"VECTARA_CORPUS_ID\"] = \"<YOUR_VECTARA_CORPUS_ID>\"\n",
    "os.environ[\"VECTARA_CUSTOMER_ID\"] = \"<YOUR_VECTARA_CUSTOMER_ID>\"\n",
    "\n",
    "from langchain.chains.query_constructor.base import AttributeInfo\n",
    "from langchain.retrievers.self_query.base import SelfQueryRetriever\n",
    "from langchain.schema import Document\n",
    "from langchain_community.vectorstores import Vectara\n",
    "from langchain_openai.chat_models import ChatOpenAI"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13a6be33-de3c-4628-acc8-b94102c275b7",
   "metadata": {},
   "source": [
    "## Dataset\n",
    "\n",
    "We first define an example dataset of movie, and upload those to the corpus, along with the metadata:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "bcbe04d9",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "docs = [\n",
    "    Document(\n",
    "        page_content=\"A bunch of scientists bring back dinosaurs and mayhem breaks loose\",\n",
    "        metadata={\"year\": 1993, \"rating\": 7.7, \"genre\": \"science fiction\"},\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"Leo DiCaprio gets lost in a dream within a dream within a dream within a ...\",\n",
    "        metadata={\"year\": 2010, \"director\": \"Christopher Nolan\", \"rating\": 8.2},\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea\",\n",
    "        metadata={\"year\": 2006, \"director\": \"Satoshi Kon\", \"rating\": 8.6},\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"A bunch of normal-sized women are supremely wholesome and some men pine after them\",\n",
    "        metadata={\"year\": 2019, \"director\": \"Greta Gerwig\", \"rating\": 8.3},\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"Toys come alive and have a blast doing so\",\n",
    "        metadata={\"year\": 1995, \"genre\": \"animated\"},\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"Three men walk into the Zone, three men walk out of the Zone\",\n",
    "        metadata={\n",
    "            \"year\": 1979,\n",
    "            \"rating\": 9.9,\n",
    "            \"director\": \"Andrei Tarkovsky\",\n",
    "            \"genre\": \"science fiction\",\n",
    "        },\n",
    "    ),\n",
    "]\n",
    "\n",
    "vectara = Vectara()\n",
    "for doc in docs:\n",
    "    vectara.add_texts([doc.page_content], doc_metadata=doc.metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ecaab6d",
   "metadata": {},
   "source": [
    "## Creating the self-querying retriever\n",
    "Now we can instantiate our retriever. To do this we'll need to provide some information upfront about the metadata fields that our documents support and a short description of the document contents.\n",
    "\n",
    "We then provide an llm (in this case OpenAI) and the `vectara` vectorstore as arguments:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "86e34dbf",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "metadata_field_info = [\n",
    "    AttributeInfo(\n",
    "        name=\"genre\",\n",
    "        description=\"The genre of the movie\",\n",
    "        type=\"string or list[string]\",\n",
    "    ),\n",
    "    AttributeInfo(\n",
    "        name=\"year\",\n",
    "        description=\"The year the movie was released\",\n",
    "        type=\"integer\",\n",
    "    ),\n",
    "    AttributeInfo(\n",
    "        name=\"director\",\n",
    "        description=\"The name of the movie director\",\n",
    "        type=\"string\",\n",
    "    ),\n",
    "    AttributeInfo(\n",
    "        name=\"rating\", description=\"A 1-10 rating for the movie\", type=\"float\"\n",
    "    ),\n",
    "]\n",
    "document_content_description = \"Brief summary of a movie\"\n",
    "llm = ChatOpenAI(temperature=0, model=\"gpt-4o\", max_tokens=4069)\n",
    "retriever = SelfQueryRetriever.from_llm(\n",
    "    llm, vectara, document_content_description, metadata_field_info, verbose=True\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea9df8d4",
   "metadata": {},
   "source": [
    "## Self-retrieval Queries\n",
    "And now we can try actually using our retriever!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "38a126e9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'lang': 'eng', 'offset': '0', 'len': '66', 'year': '1993', 'rating': '7.7', 'genre': 'science fiction', 'source': 'langchain'}),\n",
       " Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'lang': 'eng', 'offset': '0', 'len': '116', 'year': '2006', 'director': 'Satoshi Kon', 'rating': '8.6', 'source': 'langchain'}),\n",
       " Document(page_content='Toys come alive and have a blast doing so', metadata={'lang': 'eng', 'offset': '0', 'len': '41', 'year': '1995', 'genre': 'animated', 'source': 'langchain'}),\n",
       " Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'lang': 'eng', 'offset': '0', 'len': '60', 'year': '1979', 'rating': '9.9', 'director': 'Andrei Tarkovsky', 'genre': 'science fiction', 'source': 'langchain'}),\n",
       " Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'lang': 'eng', 'offset': '0', 'len': '82', 'year': '2019', 'director': 'Greta Gerwig', 'rating': '8.3', 'source': 'langchain'}),\n",
       " Document(page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a ...', metadata={'lang': 'eng', 'offset': '0', 'len': '76', 'year': '2010', 'director': 'Christopher Nolan', 'rating': '8.2', 'source': 'langchain'})]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# This example only specifies a relevant query\n",
    "retriever.invoke(\"What are movies about scientists\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "fc3f1e6e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'lang': 'eng', 'offset': '0', 'len': '116', 'year': '2006', 'director': 'Satoshi Kon', 'rating': '8.6', 'source': 'langchain'}),\n",
       " Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'lang': 'eng', 'offset': '0', 'len': '60', 'year': '1979', 'rating': '9.9', 'director': 'Andrei Tarkovsky', 'genre': 'science fiction', 'source': 'langchain'})]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# This example only specifies a filter\n",
    "retriever.invoke(\"I want to watch a movie rated higher than 8.5\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "b19d4da0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'lang': 'eng', 'offset': '0', 'len': '82', 'year': '2019', 'director': 'Greta Gerwig', 'rating': '8.3', 'source': 'langchain'})]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# This example specifies a query and a filter\n",
    "retriever.invoke(\"Has Greta Gerwig directed any movies about women\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "f900e40e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'lang': 'eng', 'offset': '0', 'len': '116', 'year': '2006', 'director': 'Satoshi Kon', 'rating': '8.6', 'source': 'langchain'}),\n",
       " Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'lang': 'eng', 'offset': '0', 'len': '60', 'year': '1979', 'rating': '9.9', 'director': 'Andrei Tarkovsky', 'genre': 'science fiction', 'source': 'langchain'})]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# This example specifies a composite filter\n",
    "retriever.invoke(\"What's a highly rated (above 8.5) science fiction film?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "12a51522",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='Toys come alive and have a blast doing so', metadata={'lang': 'eng', 'offset': '0', 'len': '41', 'year': '1995', 'genre': 'animated', 'source': 'langchain'}),\n",
       " Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'lang': 'eng', 'offset': '0', 'len': '66', 'year': '1993', 'rating': '7.7', 'genre': 'science fiction', 'source': 'langchain'})]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# This example specifies a query and composite filter\n",
    "retriever.invoke(\n",
    "    \"What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39bd1de1-b9fe-4a98-89da-58d8a7a6ae51",
   "metadata": {},
   "source": [
    "## Filter k\n",
    "\n",
    "We can also use the self query retriever to specify `k`: the number of documents to fetch.\n",
    "\n",
    "We can do this by passing `enable_limit=True` to the constructor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "bff36b88-b506-4877-9c63-e5a1a8d78e64",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "retriever = SelfQueryRetriever.from_llm(\n",
    "    llm,\n",
    "    vectara,\n",
    "    document_content_description,\n",
    "    metadata_field_info,\n",
    "    enable_limit=True,\n",
    "    verbose=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "00e8baad-a9d7-4498-bd8d-ca41d0691386",
   "metadata": {},
   "source": [
    "This is cool, we can include the number of results we would like to see in the query and the self retriever would correctly understand it. For example, let's look for "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "2758d229-4f97-499c-819f-888acaf8ee10",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'lang': 'eng', 'offset': '0', 'len': '116', 'year': '2006', 'director': 'Satoshi Kon', 'rating': '8.6', 'source': 'langchain'}),\n",
       " Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'lang': 'eng', 'offset': '0', 'len': '60', 'year': '1979', 'rating': '9.9', 'director': 'Andrei Tarkovsky', 'genre': 'science fiction', 'source': 'langchain'})]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# This example only specifies a relevant query\n",
    "retriever.invoke(\"what are two movies with a rating above 8.5\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed4b9dbc-e3cd-442d-b108-705295f51fa1",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}