# Vectara

>[Vectara](https://vectara.com/) is an API platform for building GenAI applications. It provides an easy-to-use API for document indexing and querying that is managed by Vectara and optimized for performance and accuracy.
See the [Vectara API documentation](https://docs.vectara.com/docs/) for more information on how to use the API.

This notebook shows how to use Vectara's integration with LangChain.
Note that unlike many other integrations in this category, Vectara provides an end-to-end managed service for [Grounded Generation](https://vectara.com/grounded-generation/) (also known as retrieval-augmented generation), which includes:
1. A way to extract text from document files and chunk it into sentences.
2. Its own embeddings model and vector store: each text segment is encoded into a vector embedding and stored in Vectara's internal vector store.
3. A query service that automatically encodes the query into an embedding and retrieves the most relevant text segments (including support for [Hybrid Search](https://docs.vectara.com/docs/api-reference/search-apis/lexical-matching)).

All of these are supported in this LangChain integration.

# Setup

You will need a Vectara account to use Vectara with LangChain. To get started, use the following steps (see our [quickstart](https://docs.vectara.com/docs/quickstart) guide):
1. [Sign up](https://console.vectara.com/signup) for a Vectara account if you don't already have one. Once you have completed your sign-up, you will have a Vectara customer ID. You can find your customer ID by clicking on your name at the top right of the Vectara console window.
2. Within your account you can create one or more corpora. Each corpus represents an area that stores text data ingested from input documents. To create a corpus, use the **"Create Corpus"** button, then provide a name and a description for your corpus. Optionally, you can define filtering attributes and apply some advanced options. If you click on your created corpus, you can see its name and corpus ID at the top.
3. Next, create API keys to access the corpus. Click on the **"Authorization"** tab in the corpus view and then the **"Create API Key"** button. Give your key a name, and choose whether it should allow query only or query+index. Click "Create" and you now have an active API key. Keep this key confidential.

To use LangChain with Vectara, you need three values: your customer ID, corpus ID, and API key.
You can provide them to LangChain in two ways:

1. Include in your environment these three variables: `VECTARA_CUSTOMER_ID`, `VECTARA_CORPUS_ID` and `VECTARA_API_KEY`.

> For example, you can set these variables using `os.environ` and `getpass` as follows:

```python
import os
import getpass

os.environ["VECTARA_CUSTOMER_ID"] = getpass.getpass("Vectara Customer ID:")
os.environ["VECTARA_CORPUS_ID"] = getpass.getpass("Vectara Corpus ID:")
os.environ["VECTARA_API_KEY"] = getpass.getpass("Vectara API Key:")
```

2. Provide them as arguments when creating the Vectara vectorstore object:

```python
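from langchain.vectorstores import Vectara

# vectara_customer_id, vectara_corpus_id and vectara_api_key hold your own credentials
vectorstore = Vectara(
    vectara_customer_id=vectara_customer_id,
    vectara_corpus_id=vectara_corpus_id,
    vectara_api_key=vectara_api_key,
)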
```
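
Whichever option you choose, it can be convenient to confirm that all three values are present before creating the vector store. The following is a minimal sketch of such a check; `load_vectara_settings` is a hypothetical helper written for this example and is not part of LangChain or Vectara.

```python
import os

REQUIRED_VARS = ("VECTARA_CUSTOMER_ID", "VECTARA_CORPUS_ID", "VECTARA_API_KEY")


def load_vectara_settings(env=None):
    """Return the three Vectara settings, or raise if any is missing or empty.

    Hypothetical helper for illustration; it only reads environment variables.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("Missing Vectara settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


# Example with placeholder values instead of real credentials.
settings = load_vectara_settings(
    {
        "VECTARA_CUSTOMER_ID": "123456",
        "VECTARA_CORPUS_ID": "1",
        "VECTARA_API_KEY": "placeholder-key",
    }
)
print(sorted(settings))
```

Calling `load_vectara_settings()` with no argument reads from `os.environ`, so it works with the first option above.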

## Connecting to Vectara from LangChain

In this example, we assume that you've created an account and a corpus, and added your `VECTARA_CUSTOMER_ID`, `VECTARA_CORPUS_ID` and `VECTARA_API_KEY` (created with permissions for both indexing and query) as environment variables.

The corpus has 3 fields defined as metadata for filtering:
* url: a string field containing the source URL of the document (where relevant)
* speech: a string field containing the name of the speech
* author: the name of the author
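
The searches in this notebook restrict results with filter expressions such as `doc.speech = 'state-of-the-union'`, which reference these metadata fields with a `doc.` prefix. A small sketch of building such expressions from keyword arguments follows. `make_filter` is a hypothetical helper; joining conditions with `and` and rejecting values that contain quotes are assumptions made for this example, not documented behavior.

```python
def make_filter(**fields):
    """Build a filter string like "doc.speech = 'state-of-the-union'".

    Hypothetical helper: conditions are joined with "and" (an assumption),
    and values containing a single quote are rejected rather than escaped.
    """
    conditions = []
    for name, value in fields.items():
        if "'" in value:
            raise ValueError(f"unsupported quote in value for {name!r}")
        conditions.append(f"doc.{name} = '{value}'")
    return " and ".join(conditions)


print(make_filter(speech="state-of-the-union"))
print(make_filter(speech="I-have-a-dream", author="Dr. King"))
```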

Let's start by ingesting 3 documents into the corpus:
1. The State of the Union speech from 2022, available in the LangChain repository as a text file
2. The "I have a dream" speech by Dr. King
3. The "We Shall Fight on the Beaches" speech by Winston Churchill

```python
from langchain.embeddings import FakeEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Vectara
from langchain.document_loaders import TextLoader

from langchain.llms import OpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
```

```python
# Load the State of the Union text and split it into chunks
loader = TextLoader("../../modules/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
```

```python
vectara = Vectara.from_documents(
    docs,
    embedding=FakeEmbeddings(size=768),
    doc_metadata={"speech": "state-of-the-union", "author": "Biden"},
)
```

Vectara's indexing API also supports file upload, where the file is handled directly by Vectara: it is pre-processed, chunked optimally and added to the Vectara vector store.
To use this, we added the `add_files()` method (as well as `from_files()`).

Let's see this in action. We pick two PDF documents to upload: 
1. The "I have a dream" speech by Dr. King
2. Churchill's "We Shall Fight on the Beaches" speech

```python
import tempfile
import urllib.request

# Each entry is [url, speech name, author]
urls = [
    [
        "https://www.gilderlehrman.org/sites/default/files/inline-pdfs/king.dreamspeech.excerpts.pdf",
        "I-have-a-dream",
        "Dr. King",
    ],
    [
        "https://www.parkwayschools.net/cms/lib/MO01931486/Centricity/Domain/1578/Churchill_Beaches_Speech.pdf",
        "we shall fight on the beaches",
        "Churchill",
    ],
]

# Download each PDF to a temporary path
files_list = []
for url, _, _ in urls:
    name = tempfile.NamedTemporaryFile().name
    urllib.request.urlretrieve(url, name)
    files_list.append(name)

# Upload the files; one metadata dict per file, in the same order
docsearch: Vectara = Vectara.from_files(
    files=files_list,
    embedding=FakeEmbeddings(size=768),
    metadatas=[
        {"url": url, "speech": title, "author": author}
        for url, title, author in urls
    ],
)
```

## Similarity search

The simplest scenario for using Vectara is to perform a similarity search. 

```python
query = "What did the president say about Ketanji Brown Jackson"
found_docs = vectara.similarity_search(
    query, n_sentence_context=0, filter="doc.speech = 'state-of-the-union'"
)
```

```python
print(found_docs[0].page_content)
```

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson.


## Similarity search with score

Sometimes we might want to perform the search and also obtain a relevancy score that tells us how good a particular result is.

```python
query = "What did the president say about Ketanji Brown Jackson"
found_docs = vectara.similarity_search_with_score(
    query,
    filter="doc.speech = 'state-of-the-union'",
    score_threshold=0.2,
)
```

```python
# Each result is a (document, score) pair
document, score = found_docs[0]
print(document.page_content)
print(f"\nScore: {score}")
```

Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. A former top litigator in private practice.

Score: 0.8299499
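
The two results differ in how much surrounding text they return. The first call passed `n_sentence_context=0` and got back only the matching sentence, while this call left the parameter unset and returned the match with two sentences on either side. The sketch below imitates that windowing on a plain list of sentences. `with_context` is a hypothetical function, the default of 2 is inferred from the output above, and the real windowing happens inside Vectara's service.

```python
def with_context(sentences, hit_index, n_sentence_context=2):
    """Return the matching sentence plus up to n neighbors on each side."""
    start = max(0, hit_index - n_sentence_context)
    end = min(len(sentences), hit_index + n_sentence_context + 1)
    return " ".join(sentences[start:end])


# The five sentences from the result above; index 2 is the matching sentence.
sentences = [
    "Justice Breyer, thank you for your service.",
    "One of the most serious constitutional responsibilities a President has "
    "is nominating someone to serve on the United States Supreme Court.",
    "And I did that 4 days ago, when I nominated Circuit Court of Appeals "
    "Judge Ketanji Brown Jackson.",
    "One of our nation's top legal minds, who will continue Justice Breyer's "
    "legacy of excellence.",
    "A former top litigator in private practice.",
]

print(with_context(sentences, hit_index=2, n_sentence_context=0))
print(with_context(sentences, hit_index=2))
```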


Now let's run a similar search for content in the files we uploaded.

```python
query = "We must forever conduct our struggle"
min_score = 1.2
found_docs = vectara.similarity_search_with_score(
    query,
    filter="doc.speech = 'I-have-a-dream'",
    score_threshold=min_score,
)
print(f"With this threshold of {min_score} we have {len(found_docs)} documents")
```

With this threshold of 1.2 we have 0 documents


```python
query = "We must forever conduct our struggle"
min_score = 0.2
found_docs = vectara.similarity_search_with_score(
    query,
    filter="doc.speech = 'I-have-a-dream'",
    score_threshold=min_score,
)
print(f"With this threshold of {min_score} we have {len(found_docs)} documents")
```


With this threshold of 0.2 we have 5 documents
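
These two runs show the effect of `score_threshold`: with a cutoff of 1.2 nothing qualifies, while a cutoff of 0.2 lets five results through. Conceptually this is a filter over (document, score) pairs, as in the sketch below. The scores are made-up numbers for illustration and `apply_threshold` is a hypothetical function. Whether the real comparison is strict or inclusive at the boundary is not established by this notebook.

```python
def apply_threshold(scored_docs, score_threshold=None):
    """Keep (text, score) pairs whose score is at least score_threshold."""
    if score_threshold is None:
        return list(scored_docs)
    return [(text, score) for text, score in scored_docs if score >= score_threshold]


# Made-up scores for illustration only; real scores come from Vectara.
scored_docs = [
    ("passage A", 0.83),
    ("passage B", 0.61),
    ("passage C", 0.47),
    ("passage D", 0.35),
    ("passage E", 0.29),
]

for min_score in (1.2, 0.2):
    kept = apply_threshold(scored_docs, min_score)
    print(f"With this threshold of {min_score} we have {len(kept)} documents")
```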


## Vectara as a Retriever

Like the other LangChain vectorstores, Vectara is most often used as a LangChain Retriever:

```python
retriever = vectara.as_retriever()
retriever
```

VectaraRetriever(tags=['Vectara'], metadata=None, vectorstore=<langchain.vectorstores.vectara.Vectara object at 0x13b15e9b0>, search_type='similarity', search_kwargs={'lambda_val': 0.025, 'k': 5, 'filter': '', 'n_sentence_context': '2'})

```python
query = "What did the president say about Ketanji Brown Jackson"
retriever.get_relevant_documents(query)[0]
```

Document(page_content='Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. A former top litigator in private practice.', metadata={'source': 'langchain', 'lang': 'eng', 'offset': '596', 'len': '97', 'speech': 'state-of-the-union', 'author': 'Biden'})

## Using Vectara as a SelfQuery Retriever

```python
# Describe the metadata fields the LLM may filter on
metadata_field_info = [
    AttributeInfo(
        name="speech",
        description="the name of the speech",
        type="string or list[string]",
    ),
    AttributeInfo(
        name="author",
        description="author of the speech",
        type="string or list[string]",
    ),
]
document_content_description = "the text of the speech"

# Reuse the `vectara` vector store created earlier
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectara,
    document_content_description,
    metadata_field_info,
    verbose=True,
)
```
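
With `verbose=True`, the retriever prints the structured query the LLM produced, for example a `Comparison` with comparator `eq`, attribute `author` and value `Biden`. That comparison then has to be expressed in Vectara's filter syntax. The sketch below shows one way such a translation could look. The `Comparison` dataclass and `to_vectara_filter` are simplified stand-ins written for this example, not LangChain's own classes, and the operator table is an assumption.

```python
from dataclasses import dataclass
from typing import Union

# Assumed mapping from comparator names to filter operators.
OPERATORS = {"eq": "=", "ne": "!=", "gt": ">", "gte": ">=", "lt": "<", "lte": "<="}


@dataclass
class Comparison:
    """Simplified stand-in for a structured-query comparison."""

    comparator: str
    attribute: str
    value: Union[str, int, float]


def to_vectara_filter(comparison: Comparison) -> str:
    """Render a comparison as a filter on a document-level metadata field."""
    operator = OPERATORS[comparison.comparator]
    if isinstance(comparison.value, str):
        rendered = f"'{comparison.value}'"  # strings are single-quoted
    else:
        rendered = str(comparison.value)
    return f"doc.{comparison.attribute} {operator} {rendered}"


print(to_vectara_filter(Comparison("eq", "author", "Biden")))
```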

```python
retriever.get_relevant_documents("what did Biden say about the freedom?")
```



query='freedom' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='author', value='Biden') limit=None


[Document(page_content='Well I know this nation. We will meet the test. To protect freedom and liberty, to expand fairness and opportunity. We will save democracy. As hard as these times have been, I am more optimistic about America today than I have been my whole life.', metadata={'source': 'langchain', 'lang': 'eng', 'offset': '346', 'len': '67', 'speech': 'state-of-the-union', 'author': 'Biden'}),
 Document(page_content='To our fellow Ukrainian Americans who forge a deep bond that connects our two nations we stand with you. Putin may circle Kyiv with tanks, but he will never gain the hearts and souls of the Ukrainian people. He will never extinguish their love of freedom. He will never weaken the resolve of the free world. We meet tonight in an America that has lived through two of the hardest years this nation has ever faced.', metadata={'source': 'langchain', 'lang': 'eng', 'offset': '740', 'len': '47', 'speech': 'state-of-the-union', 'author': 'Biden'}),
 Document(page_content='B

```python
retriever.get_relevant_documents("what did Dr. King say about the freedom?")
```

query='freedom' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='author', value='Dr. King') limit=None


[Document(page_content='And if America is to be a great nation, this must become true. So\nlet freedom ring from the prodigious hilltops of New Hampshire. Let freedom ring from the mighty\nmountains of New York. Let freedom ring from the heightening Alleghenies of Pennsylvania. Let\nfreedom ring from the snowcapped Rockies of Colorado.', metadata={'lang': 'eng', 'section': '3', 'offset': '1534', 'len': '55', 'CreationDate': '1424880481', 'Producer': 'Adobe PDF Library 10.0', 'Author': 'Sasha Rolon-Pereira', 'Title': 'Martin Luther King Jr.pdf', 'Creator': 'Acrobat PDFMaker 10.1 for Word', 'ModDate': '1424880524', 'url': 'https://www.gilderlehrman.org/sites/default/files/inline-pdfs/king.dreamspeech.excerpts.pdf', 'speech': 'I-have-a-dream', 'author': 'Dr. King', 'title': 'Martin Luther King Jr.pdf'}),
 Document(page_content='And if America is to be a great nation, this must become true. So\nlet freedom ring from the prodigious hilltops of New Hampshire. Let freedom ring from the mighty