Adding Self-querying for Vectara (#10332)

- Description: Adding support for self-querying to Vectara integration
  - Issue: per customer request
  - Tag maintainer: @rlancemartin @baskaryan 
  - Twitter handle: @ofermend 

Also updated some documentation, added self-query testing, and a demo
notebook with self-query example.
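
As a rough illustration of what self-querying involves: an LLM turns a natural-language question into a search string plus a metadata filter, and the filter is then rendered in the vector store's filter syntax. The sketch below is not the code added in this commit. The `Comparison` dataclass, the `to_filter_expression` helper, and the `doc.<field>` filter form are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical stand-in for the structured filter an LLM might produce;
# it is not the class used by LangChain.
@dataclass
class Comparison:
    attribute: str
    comparator: str  # one of the keys in OPERATORS below
    value: Union[str, int, float]


# Mapping from comparator names to filter operators (assumed syntax).
OPERATORS = {
    "eq": "=",
    "ne": "!=",
    "gt": ">",
    "gte": ">=",
    "lt": "<",
    "lte": "<=",
}


def to_filter_expression(comparison: Comparison) -> str:
    """Render a comparison as a filter string such as: doc.author = 'Biden'."""
    operator = OPERATORS[comparison.comparator]
    value = comparison.value
    # Strings are wrapped in single quotes; numbers are used as-is.
    # This sketch does not escape quotes inside string values.
    rendered = f"'{value}'" if isinstance(value, str) else str(value)
    return f"doc.{comparison.attribute} {operator} {rendered}"


# A question like "what did Biden say about the freedom?" might yield
# the search string "freedom" together with this filter.
question_filter = Comparison(attribute="author", comparator="eq", value="Biden")
print(to_filter_expression(question_filter))
```

The search string and the rendered filter would then be sent together to the vector store, so that only documents matching the metadata condition are ranked.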
Ofer Mendelevitch
2023-09-07 10:24:50 -07:00
committed by GitHub
parent 25ec655e4f
commit a9eb7c6cfc
8 changed files with 743 additions and 33 deletions

@@ -11,9 +11,10 @@ What is Vectara?
- You can use Vectara's integration with LangChain as a vector store or through the Retriever abstraction.
## Installation and Setup
To use Vectara with LangChain no special installation steps are required. You just have to provide your customer_id, corpus ID, and an API key created within the Vectara console to enable indexing and searching.
To use Vectara with LangChain, no special installation steps are required.
To get started, follow our [quickstart](https://docs.vectara.com/docs/quickstart) guide to create an account, a corpus, and an API key.
Once you have these, you can provide them as arguments to the Vectara vectorstore.
Alternatively, you can set them as environment variables:
- export `VECTARA_CUSTOMER_ID`="your_customer_id"
- export `VECTARA_CORPUS_ID`="your_corpus_id"
- export `VECTARA_API_KEY`="your-vectara-api-key"
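
For scripts that rely on the environment variables above, a small check at startup can catch a missing value early. The following is a minimal sketch in plain Python. The helper name `load_vectara_settings` is hypothetical and is not part of LangChain or Vectara.

```python
import os

# The three variables named in the list above.
REQUIRED_VARS = ("VECTARA_CUSTOMER_ID", "VECTARA_CORPUS_ID", "VECTARA_API_KEY")


def load_vectara_settings(env=os.environ):
    """Return the three Vectara settings as a dict, or raise if any is missing.

    Hypothetical helper for illustration; `env` is any mapping of
    variable names to values and defaults to the process environment.
    """
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_vectara_settings()` before creating the vectorstore gives a clear error message that names the unset variables, instead of a failure later at query time.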

@@ -26,7 +26,7 @@
"source": [
"# Setup\n",
"\n",
"You will need a Vectara account to use Vectara with LangChain. To get started, use the following steps:\n",
"You will need a Vectara account to use Vectara with LangChain. To get started, use the following steps (see our [quickstart](https://docs.vectara.com/docs/quickstart) guide):\n",
"1. [Sign up](https://console.vectara.com/signup) for a Vectara account if you don't already have one. Once you have completed your sign up you will have a Vectara customer ID. You can find your customer ID by clicking on your name, on the top-right of the Vectara console window.\n",
"2. Within your account you can create one or more corpora. Each corpus represents an area that stores text data upon ingest from input documents. To create a corpus, use the **\"Create Corpus\"** button. You then provide a name to your corpus as well as a description. Optionally you can define filtering attributes and apply some advanced options. If you click on your created corpus, you can see its name and corpus ID right on the top.\n",
"3. Next you'll need to create API keys to access the corpus. Click on the **\"Authorization\"** tab in the corpus view and then the **\"Create API Key\"** button. Give your key a name, and choose whether you want query only or query+index for your key. Click \"Create\" and you now have an active API key. Keep this key confidential. \n",
@@ -47,7 +47,7 @@
"os.environ[\"VECTARA_API_KEY\"] = getpass.getpass(\"Vectara API Key:\")\n",
"```\n",
"\n",
"2. Add them to the Vectara vectorstore constructor:\n",
"1. Provide them as arguments when creating the Vectara vectorstore object:\n",
"\n",
"```python\n",
"vectorstore = Vectara(\n",
@@ -65,13 +65,22 @@
"source": [
"## Connecting to Vectara from LangChain\n",
"\n",
"To get started, let's ingest the documents using the from_documents() method.\n",
"We assume here that you've added your VECTARA_CUSTOMER_ID, VECTARA_CORPUS_ID and query+indexing VECTARA_API_KEY as environment variables."
"In this example, we assume that you've created an account and a corpus, and added your VECTARA_CUSTOMER_ID, VECTARA_CORPUS_ID and VECTARA_API_KEY (created with permissions for both indexing and querying) as environment variables.\n",
"\n",
"The corpus has 3 fields defined as metadata for filtering:\n",
"* url: a string field containing the source URL of the document (where relevant)\n",
"* speech: a string field containing the name of the speech\n",
"* author: the name of the author\n",
"\n",
"Let's start by ingesting 3 documents into the corpus:\n",
"1. The State of the Union speech from 2022, available in the LangChain repository as a text file\n",
"2. The \"I have a dream\" speech by Dr. King\n",
"3. The \"We Shall Fight on the Beaches\" speech by Winston Churchill"
]
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 2,
"id": "04a1f1a0",
"metadata": {},
"outputs": [],
@@ -79,12 +88,17 @@
"from langchain.embeddings import FakeEmbeddings\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain.vectorstores import Vectara\n",
"from langchain.document_loaders import TextLoader"
"from langchain.document_loaders import TextLoader\n",
"\n",
"from langchain.llms import OpenAI\n",
"from langchain.chains import ConversationalRetrievalChain\n",
"from langchain.retrievers.self_query.base import SelfQueryRetriever\n",
"from langchain.chains.query_constructor.base import AttributeInfo"
]
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 3,
"id": "be0a4973",
"metadata": {},
"outputs": [],
@@ -97,7 +111,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 4,
"id": "8429667e",
"metadata": {
"ExecuteTime": {
@@ -111,7 +125,7 @@
"vectara = Vectara.from_documents(\n",
" docs,\n",
" embedding=FakeEmbeddings(size=768),\n",
" doc_metadata={\"speech\": \"state-of-the-union\"},\n",
" doc_metadata={\"speech\": \"state-of-the-union\", \"author\": \"Biden\"},\n",
")"
]
},
@@ -130,7 +144,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 5,
"id": "85ef3468",
"metadata": {},
"outputs": [],
@@ -142,14 +156,16 @@
" [\n",
" \"https://www.gilderlehrman.org/sites/default/files/inline-pdfs/king.dreamspeech.excerpts.pdf\",\n",
" \"I-have-a-dream\",\n",
" \"Dr. King\"\n",
" ],\n",
" [\n",
" \"https://www.parkwayschools.net/cms/lib/MO01931486/Centricity/Domain/1578/Churchill_Beaches_Speech.pdf\",\n",
" \"we shall fight on the beaches\",\n",
" \"Churchill\"\n",
" ],\n",
"]\n",
"files_list = []\n",
"for url, _ in urls:\n",
"for url, _, _ in urls:\n",
" name = tempfile.NamedTemporaryFile().name\n",
" urllib.request.urlretrieve(url, name)\n",
" files_list.append(name)\n",
@@ -157,7 +173,7 @@
"docsearch: Vectara = Vectara.from_files(\n",
" files=files_list,\n",
" embedding=FakeEmbeddings(size=768),\n",
" metadatas=[{\"url\": url, \"speech\": title} for url, title in urls],\n",
" metadatas=[{\"url\": url, \"speech\": title, \"author\": author} for url, title, author in urls],\n",
")"
]
},
@@ -178,7 +194,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 6,
"id": "a8c513ab",
"metadata": {
"ExecuteTime": {
@@ -197,7 +213,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 7,
"id": "fc516993",
"metadata": {
"ExecuteTime": {
@@ -231,7 +247,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 8,
"id": "8804a21d",
"metadata": {
"ExecuteTime": {
@@ -249,7 +265,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 9,
"id": "756a6887",
"metadata": {
"ExecuteTime": {
@@ -264,7 +280,7 @@
"text": [
"Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence. A former top litigator in private practice.\n",
"\n",
"Score: 0.786569\n"
"Score: 0.8299499\n"
]
}
],
@@ -284,7 +300,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 10,
"id": "47784de5",
"metadata": {},
"outputs": [
@@ -307,7 +323,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 11,
"id": "3e22949f",
"metadata": {},
"outputs": [
@@ -315,7 +331,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"With this threshold of 0.2 we have 3 documents\n"
"With this threshold of 0.2 we have 5 documents\n"
]
}
],
@@ -340,7 +356,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 12,
"id": "9427195f",
"metadata": {
"ExecuteTime": {
@@ -352,10 +368,10 @@
{
"data": {
"text/plain": [
"VectaraRetriever(tags=['Vectara'], metadata=None, vectorstore=<langchain.vectorstores.vectara.Vectara object at 0x1586bd330>, search_type='similarity', search_kwargs={'lambda_val': 0.025, 'k': 5, 'filter': '', 'n_sentence_context': '2'})"
"VectaraRetriever(tags=['Vectara'], metadata=None, vectorstore=<langchain.vectorstores.vectara.Vectara object at 0x13b15e9b0>, search_type='similarity', search_kwargs={'lambda_val': 0.025, 'k': 5, 'filter': '', 'n_sentence_context': '2'})"
]
},
"execution_count": 11,
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
@@ -367,7 +383,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 13,
"id": "f3c70c31",
"metadata": {
"ExecuteTime": {
@@ -379,10 +395,10 @@
{
"data": {
"text/plain": [
"Document(page_content='Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence. A former top litigator in private practice.', metadata={'source': 'langchain', 'lang': 'eng', 'offset': '596', 'len': '97', 'speech': 'state-of-the-union'})"
"Document(page_content='Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence. A former top litigator in private practice.', metadata={'source': 'langchain', 'lang': 'eng', 'offset': '596', 'len': '97', 'speech': 'state-of-the-union', 'author': 'Biden'})"
]
},
"execution_count": 12,
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
@@ -392,10 +408,118 @@
"retriever.get_relevant_documents(query)[0]"
]
},
{
"cell_type": "markdown",
"id": "e944c26a",
"metadata": {},
"source": [
"## Using Vectara as a SelfQuery Retriever"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "8be674de",
"metadata": {},
"outputs": [],
"source": [
"metadata_field_info = [\n",
" AttributeInfo(\n",
" name=\"speech\",\n",
" description=\"name of the speech\",\n",
" type=\"string or list[string]\",\n",
" ),\n",
" AttributeInfo(\n",
" name=\"author\",\n",
" description=\"author of the speech\",\n",
" type=\"string or list[string]\",\n",
" ),\n",
"]\n",
"document_content_description = \"the text of the speech\"\n",
"\n",
"vectordb = Vectara()\n",
"llm = OpenAI(temperature=0)\n",
"retriever = SelfQueryRetriever.from_llm(llm, vectordb, \n",
" document_content_description, metadata_field_info, \n",
" verbose=True)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "f8938999",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/ofer/dev/langchain/libs/langchain/langchain/chains/llm.py:278: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.\n",
" warnings.warn(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"query='freedom' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='author', value='Biden') limit=None\n"
]
},
{
"data": {
"text/plain": [
"[Document(page_content='Well I know this nation. We will meet the test. To protect freedom and liberty, to expand fairness and opportunity. We will save democracy. As hard as these times have been, I am more optimistic about America today than I have been my whole life.', metadata={'source': 'langchain', 'lang': 'eng', 'offset': '346', 'len': '67', 'speech': 'state-of-the-union', 'author': 'Biden'}),\n",
" Document(page_content='To our fellow Ukrainian Americans who forge a deep bond that connects our two nations we stand with you. Putin may circle Kyiv with tanks, but he will never gain the hearts and souls of the Ukrainian people. He will never extinguish their love of freedom. He will never weaken the resolve of the free world. We meet tonight in an America that has lived through two of the hardest years this nation has ever faced.', metadata={'source': 'langchain', 'lang': 'eng', 'offset': '740', 'len': '47', 'speech': 'state-of-the-union', 'author': 'Biden'}),\n",
" Document(page_content='But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russias Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated.', metadata={'source': 'langchain', 'lang': 'eng', 'offset': '413', 'len': '77', 'speech': 'state-of-the-union', 'author': 'Biden'}),\n",
" Document(page_content='We can do this. \\n\\nMy fellow Americans—tonight , we have gathered in a sacred space—the citadel of our democracy. In this Capitol, generation after generation, Americans have debated great questions amid great strife, and have done great things. We have fought for freedom, expanded liberty, defeated totalitarianism and terror. And built the strongest, freest, and most prosperous nation the world has ever known. Now is the hour. \\n\\nOur moment of responsibility.', metadata={'source': 'langchain', 'lang': 'eng', 'offset': '906', 'len': '82', 'speech': 'state-of-the-union', 'author': 'Biden'}),\n",
" Document(page_content='In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. We cannot let this happen. Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections.', metadata={'source': 'langchain', 'lang': 'eng', 'offset': '0', 'len': '63', 'speech': 'state-of-the-union', 'author': 'Biden'})]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever.get_relevant_documents(\"what did Biden say about the freedom?\")"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "a97037fb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"query='freedom' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='author', value='Dr. King') limit=None\n"
]
},
{
"data": {
"text/plain": [
"[Document(page_content='And if America is to be a great nation, this must become true. So\\nlet freedom ring from the prodigious hilltops of New Hampshire. Let freedom ring from the mighty\\nmountains of New York. Let freedom ring from the heightening Alleghenies of Pennsylvania. Let\\nfreedom ring from the snowcapped Rockies of Colorado.', metadata={'lang': 'eng', 'section': '3', 'offset': '1534', 'len': '55', 'CreationDate': '1424880481', 'Producer': 'Adobe PDF Library 10.0', 'Author': 'Sasha Rolon-Pereira', 'Title': 'Martin Luther King Jr.pdf', 'Creator': 'Acrobat PDFMaker 10.1 for Word', 'ModDate': '1424880524', 'url': 'https://www.gilderlehrman.org/sites/default/files/inline-pdfs/king.dreamspeech.excerpts.pdf', 'speech': 'I-have-a-dream', 'author': 'Dr. King', 'title': 'Martin Luther King Jr.pdf'}),\n",
" Document(page_content='And if America is to be a great nation, this must become true. So\\nlet freedom ring from the prodigious hilltops of New Hampshire. Let freedom ring from the mighty\\nmountains of New York. Let freedom ring from the heightening Alleghenies of Pennsylvania. Let\\nfreedom ring from the snowcapped Rockies of Colorado.', metadata={'lang': 'eng', 'section': '3', 'offset': '1534', 'len': '55', 'CreationDate': '1424880481', 'Producer': 'Adobe PDF Library 10.0', 'Author': 'Sasha Rolon-Pereira', 'Title': 'Martin Luther King Jr.pdf', 'Creator': 'Acrobat PDFMaker 10.1 for Word', 'ModDate': '1424880524', 'url': 'https://www.gilderlehrman.org/sites/default/files/inline-pdfs/king.dreamspeech.excerpts.pdf', 'speech': 'I-have-a-dream', 'author': 'Dr. King', 'title': 'Martin Luther King Jr.pdf'}),\n",
" Document(page_content='Let freedom ring from the curvaceous slopes of\\nCalifornia. But not only that. Let freedom ring from Stone Mountain of Georgia. Let freedom ring from Lookout\\nMountain of Tennessee. Let freedom ring from every hill and molehill of Mississippi, from every\\nmountain side. Let freedom ring . . .\\nWhen we allow freedom to ring—when we let it ring from every city and every hamlet, from every state\\nand every city, we will be able to speed up that day when all of Gods children, black men and white\\nmen, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the\\nold Negro spiritual, “Free at last, Free at last, Great God a-mighty, We are free at last.”', metadata={'lang': 'eng', 'section': '3', 'offset': '1842', 'len': '52', 'CreationDate': '1424880481', 'Producer': 'Adobe PDF Library 10.0', 'Author': 'Sasha Rolon-Pereira', 'Title': 'Martin Luther King Jr.pdf', 'Creator': 'Acrobat PDFMaker 10.1 for Word', 'ModDate': '1424880524', 'url': 'https://www.gilderlehrman.org/sites/default/files/inline-pdfs/king.dreamspeech.excerpts.pdf', 'speech': 'I-have-a-dream', 'author': 'Dr. King', 'title': 'Martin Luther King Jr.pdf'}),\n",
" Document(page_content='Let freedom ring from the curvaceous slopes of\\nCalifornia. But not only that. Let freedom ring from Stone Mountain of Georgia. Let freedom ring from Lookout\\nMountain of Tennessee. Let freedom ring from every hill and molehill of Mississippi, from every\\nmountain side. Let freedom ring . . .\\nWhen we allow freedom to ring—when we let it ring from every city and every hamlet, from every state\\nand every city, we will be able to speed up that day when all of Gods children, black men and white\\nmen, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the\\nold Negro spiritual, “Free at last, Free at last, Great God a-mighty, We are free at last.”', metadata={'lang': 'eng', 'section': '3', 'offset': '1842', 'len': '52', 'CreationDate': '1424880481', 'Producer': 'Adobe PDF Library 10.0', 'Author': 'Sasha Rolon-Pereira', 'Title': 'Martin Luther King Jr.pdf', 'Creator': 'Acrobat PDFMaker 10.1 for Word', 'ModDate': '1424880524', 'url': 'https://www.gilderlehrman.org/sites/default/files/inline-pdfs/king.dreamspeech.excerpts.pdf', 'speech': 'I-have-a-dream', 'author': 'Dr. King', 'title': 'Martin Luther King Jr.pdf'}),\n",
" Document(page_content='Let freedom ring from the mighty\\nmountains of New York. Let freedom ring from the heightening Alleghenies of Pennsylvania. Let\\nfreedom ring from the snowcapped Rockies of Colorado. Let freedom ring from the curvaceous slopes of\\nCalifornia. But not only that. Let freedom ring from Stone Mountain of Georgia.', metadata={'lang': 'eng', 'section': '3', 'offset': '1657', 'len': '57', 'CreationDate': '1424880481', 'Producer': 'Adobe PDF Library 10.0', 'Author': 'Sasha Rolon-Pereira', 'Title': 'Martin Luther King Jr.pdf', 'Creator': 'Acrobat PDFMaker 10.1 for Word', 'ModDate': '1424880524', 'url': 'https://www.gilderlehrman.org/sites/default/files/inline-pdfs/king.dreamspeech.excerpts.pdf', 'speech': 'I-have-a-dream', 'author': 'Dr. King', 'title': 'Martin Luther King Jr.pdf'})]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever.get_relevant_documents(\"what did Dr. King say about the freedom?\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2300e785",
"id": "f6d17e90",
"metadata": {},
"outputs": [],
"source": []