langchain/docs/versioned_docs/version-0.2.x/how_to/multi_vector.ipynb
Harrison Chase 66b2ac62eb cr
2024-04-24 17:07:56 -07:00

{
"cells": [
{
"cell_type": "markdown",
"id": "d9172545",
"metadata": {},
"source": [
"# How to use the MultiVector Retriever\n",
"\n",
"Storing multiple vectors per document is often useful: each vector can capture a different aspect of the document (a small chunk, a summary, a likely question), while retrieval still returns the full document. LangChain has a base `MultiVectorRetriever` that makes querying this type of setup easy. Most of the complexity lies in how to create the multiple vectors per document. This notebook covers some common ways to create those vectors and use the `MultiVectorRetriever`.\n",
"\n",
"The methods to create multiple vectors per document include:\n",
"\n",
"- Smaller chunks: split a document into smaller chunks and embed those (this is what `ParentDocumentRetriever` does).\n",
"- Summary: create a summary for each document and embed that along with (or instead of) the document.\n",
"- Hypothetical questions: create hypothetical questions that each document would be appropriate to answer, and embed those along with (or instead of) the document.\n",
"\n",
"\n",
"Note that this also enables another method of adding embeddings: manually. You can explicitly add questions or queries that should lead to a document being retrieved, which gives you more control."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "eed469be",
"metadata": {},
"outputs": [],
"source": [
"from langchain.retrievers.multi_vector import MultiVectorRetriever"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "18c1421a",
"metadata": {},
"outputs": [],
"source": [
"from langchain.storage import InMemoryByteStore\n",
"from langchain_chroma import Chroma\n",
"from langchain_community.document_loaders import TextLoader\n",
"from langchain_openai import OpenAIEmbeddings\n",
"from langchain_text_splitters import RecursiveCharacterTextSplitter"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6d869496",
"metadata": {},
"outputs": [],
"source": [
"loaders = [\n",
"    TextLoader(\"../../paul_graham_essay.txt\"),\n",
"    TextLoader(\"../../state_of_the_union.txt\"),\n",
"]\n",
"docs = []\n",
"for loader in loaders:\n",
"    docs.extend(loader.load())\n",
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)\n",
"docs = text_splitter.split_documents(docs)"
]
},
{
"cell_type": "markdown",
"id": "fa17beda",
"metadata": {},
"source": [
"## Smaller chunks\n",
"\n",
"It is often useful to retrieve larger chunks of information but embed smaller chunks. This lets the embeddings capture the semantic meaning as closely as possible while still passing as much context as possible downstream. Note that this is what the `ParentDocumentRetriever` does. Here we show what is going on under the hood."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "0e7b6b45",
"metadata": {},
"outputs": [],
"source": [
"import uuid\n",
"\n",
"# The vectorstore to use to index the child chunks\n",
"vectorstore = Chroma(\n",
"    collection_name=\"full_documents\", embedding_function=OpenAIEmbeddings()\n",
")\n",
"# The storage layer for the parent documents\n",
"store = InMemoryByteStore()\n",
"id_key = \"doc_id\"\n",
"# The retriever (empty to start)\n",
"retriever = MultiVectorRetriever(\n",
"    vectorstore=vectorstore,\n",
"    byte_store=store,\n",
"    id_key=id_key,\n",
")\n",
"\n",
"doc_ids = [str(uuid.uuid4()) for _ in docs]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "72a36491",
"metadata": {},
"outputs": [],
"source": [
"# The splitter to use to create smaller chunks\n",
"child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "5d23247d",
"metadata": {},
"outputs": [],
"source": [
"sub_docs = []\n",
"for i, doc in enumerate(docs):\n",
"    _id = doc_ids[i]\n",
"    _sub_docs = child_text_splitter.split_documents([doc])\n",
"    for _doc in _sub_docs:\n",
"        _doc.metadata[id_key] = _id\n",
"    sub_docs.extend(_sub_docs)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "92ed5861",
"metadata": {},
"outputs": [],
"source": [
"retriever.vectorstore.add_documents(sub_docs)\n",
"retriever.docstore.mset(list(zip(doc_ids, docs)))"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "8afed60c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.', metadata={'doc_id': '2fd77862-9ed5-4fad-bf76-e487b747b333', 'source': '../../state_of_the_union.txt'})"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Vectorstore alone retrieves the small chunks\n",
"retriever.vectorstore.similarity_search(\"justice breyer\")[0]"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "3c9017f1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"9875"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The retriever returns the larger parent chunks\n",
"len(retriever.invoke(\"justice breyer\")[0].page_content)"
]
},
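{
"cell_type": "markdown",
"id": "7c1e5a90",
"metadata": {},
"source": [
"As a rough mental model, the retriever searches the small chunks, collects the distinct parent IDs from their metadata in ranked order, and then looks those parents up in the docstore. The cell below is a simplified pure-Python sketch of that flow, not LangChain's actual implementation: `child_chunks`, `parent_store`, `toy_search`, and `toy_retrieve` are hypothetical names invented for this illustration, and word overlap stands in for real embedding similarity."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b84d2f16",
"metadata": {},
"outputs": [],
"source": [
"# Toy stand-ins for illustration only; none of these names are LangChain APIs.\n",
"child_chunks = [\n",
"    {\"text\": \"Justice Breyer is retiring\", \"doc_id\": \"parent-1\"},\n",
"    {\"text\": \"a new justice must be nominated\", \"doc_id\": \"parent-1\"},\n",
"    {\"text\": \"the author learned Lisp\", \"doc_id\": \"parent-2\"},\n",
"]\n",
"parent_store = {\n",
"    \"parent-1\": \"FULL TEXT OF PARENT 1\",\n",
"    \"parent-2\": \"FULL TEXT OF PARENT 2\",\n",
"}\n",
"\n",
"\n",
"def toy_search(query, chunks):\n",
"    # Word overlap stands in for embedding similarity (an assumption).\n",
"    words = set(query.lower().split())\n",
"    scored = [(len(words & set(c[\"text\"].lower().split())), c) for c in chunks]\n",
"    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)\n",
"    return [c for score, c in ranked if score > 0]\n",
"\n",
"\n",
"def toy_retrieve(query):\n",
"    ids = []\n",
"    for hit in toy_search(query, child_chunks):\n",
"        # Keep each parent ID once, in the order its first chunk was ranked\n",
"        if hit[\"doc_id\"] not in ids:\n",
"            ids.append(hit[\"doc_id\"])\n",
"    # Return the parents, not the small chunks that matched\n",
"    return [parent_store[i] for i in ids if i in parent_store]\n",
"\n",
"\n",
"# Both matching chunks share one parent, so a single parent comes back\n",
"toy_retrieve(\"justice breyer\")"
]
},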
{
"cell_type": "markdown",
"id": "cdef8339-f9fa-4b3b-955f-ad9dbdf2734f",
"metadata": {},
"source": [
"By default, the retriever performs a similarity search on the vector store. LangChain vector stores also support searching via [Maximal Marginal Relevance](https://api.python.langchain.com/en/latest/vectorstores/langchain_core.vectorstores.VectorStore.html#langchain_core.vectorstores.VectorStore.max_marginal_relevance_search), so to use that instead, set the `search_type` attribute as follows:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "36739460-a737-4a8e-b70f-50bf8c8eaae7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"9875"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.retrievers.multi_vector import SearchType\n",
"\n",
"retriever.search_type = SearchType.mmr\n",
"\n",
"len(retriever.invoke(\"justice breyer\")[0].page_content)"
]
},
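{
"cell_type": "markdown",
"id": "3e9a6d47",
"metadata": {},
"source": [
"To see what MMR changes compared with plain similarity search, the cell below is a minimal pure-Python sketch of greedy MMR selection. It is an illustration under assumptions, not the code that Chroma or LangChain runs: `cosine`, `mmr`, and the 2-D vectors are made up for the example, and `lambda_mult` borrows its name from the parameter of `max_marginal_relevance_search` (values near 1 favor relevance, values near 0 favor diversity)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f02b7c58",
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"\n",
"def cosine(a, b):\n",
"    dot = sum(x * y for x, y in zip(a, b))\n",
"    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))\n",
"    return dot / norm if norm else 0.0\n",
"\n",
"\n",
"def mmr(query_vec, candidate_vecs, k=2, lambda_mult=0.5):\n",
"    selected = []  # indices already picked, in order\n",
"    remaining = list(range(len(candidate_vecs)))\n",
"    while remaining and len(selected) < k:\n",
"\n",
"        def score(i):\n",
"            relevance = cosine(query_vec, candidate_vecs[i])\n",
"            # Penalize similarity to anything already selected\n",
"            redundancy = max(\n",
"                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),\n",
"                default=0.0,\n",
"            )\n",
"            return lambda_mult * relevance - (1 - lambda_mult) * redundancy\n",
"\n",
"        best = max(remaining, key=score)\n",
"        selected.append(best)\n",
"        remaining.remove(best)\n",
"    return selected\n",
"\n",
"\n",
"# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different\n",
"candidates = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]\n",
"# With a diversity-leaning lambda_mult, the near-duplicate is skipped: [0, 2]\n",
"mmr([1.0, 0.0], candidates, k=2, lambda_mult=0.3)"
]
},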
{
"cell_type": "markdown",
"id": "d6a7ae0d",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"A summary can often distill what a chunk is about more accurately than the raw text, leading to better retrieval. Here we show how to create summaries and then embed them."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "1433dff4",
"metadata": {},
"outputs": [],
"source": [
"import uuid\n",
"\n",
"from langchain_core.documents import Document\n",
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_openai import ChatOpenAI"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "35b30390",
"metadata": {},
"outputs": [],
"source": [
"chain = (\n",
"    {\"doc\": lambda x: x.page_content}\n",
"    | ChatPromptTemplate.from_template(\"Summarize the following document:\\n\\n{doc}\")\n",
"    | ChatOpenAI(max_retries=0)\n",
"    | StrOutputParser()\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "41a2a738",
"metadata": {},
"outputs": [],
"source": [
"summaries = chain.batch(docs, {\"max_concurrency\": 5})"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "7ac5e4b1",
"metadata": {},
"outputs": [],
"source": [
"# The vectorstore to use to index the summaries\n",
"vectorstore = Chroma(collection_name=\"summaries\", embedding_function=OpenAIEmbeddings())\n",
"# The storage layer for the parent documents\n",
"store = InMemoryByteStore()\n",
"id_key = \"doc_id\"\n",
"# The retriever (empty to start)\n",
"retriever = MultiVectorRetriever(\n",
"    vectorstore=vectorstore,\n",
"    byte_store=store,\n",
"    id_key=id_key,\n",
")\n",
"doc_ids = [str(uuid.uuid4()) for _ in docs]"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "0d93309f",
"metadata": {},
"outputs": [],
"source": [
"summary_docs = [\n",
"    Document(page_content=s, metadata={id_key: doc_ids[i]})\n",
"    for i, s in enumerate(summaries)\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "6d5edf0d",
"metadata": {},
"outputs": [],
"source": [
"retriever.vectorstore.add_documents(summary_docs)\n",
"retriever.docstore.mset(list(zip(doc_ids, docs)))"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "862ae920",
"metadata": {},
"outputs": [],
"source": [
"# We can also add the original chunks to the vectorstore if we want to\n",
"# for i, doc in enumerate(docs):\n",
"#     doc.metadata[id_key] = doc_ids[i]\n",
"# retriever.vectorstore.add_documents(docs)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "299232d6",
"metadata": {},
"outputs": [],
"source": [
"sub_docs = vectorstore.similarity_search(\"justice breyer\")"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "10e404c0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content=\"The document is a speech given by President Biden addressing various issues and outlining his agenda for the nation. He highlights the importance of nominating a Supreme Court justice and introduces his nominee, Judge Ketanji Brown Jackson. He emphasizes the need to secure the border and reform the immigration system, including providing a pathway to citizenship for Dreamers and essential workers. The President also discusses the protection of women's rights, including access to healthcare and the right to choose. He calls for the passage of the Equality Act to protect LGBTQ+ rights. Additionally, President Biden discusses the need to address the opioid epidemic, improve mental health services, support veterans, and fight against cancer. He expresses optimism for the future of America and the strength of the American people.\", metadata={'doc_id': '56345bff-3ead-418c-a4ff-dff203f77474'})"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sub_docs[0]"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "e4cce5c2",
"metadata": {},
"outputs": [],
"source": [
"retrieved_docs = retriever.invoke(\"justice breyer\")"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "c8570dbb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"9194"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(retrieved_docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"id": "097a5396",
"metadata": {},
"source": [
"## Hypothetical Queries\n",
"\n",
"An LLM can also be used to generate a list of hypothetical questions that could be asked of a particular document. These questions can then be embedded along with (or instead of) the document."
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "5219b085",
"metadata": {},
"outputs": [],
"source": [
"functions = [\n",
"    {\n",
"        \"name\": \"hypothetical_questions\",\n",
"        \"description\": \"Generate hypothetical questions\",\n",
"        \"parameters\": {\n",
"            \"type\": \"object\",\n",
"            \"properties\": {\n",
"                \"questions\": {\n",
"                    \"type\": \"array\",\n",
"                    \"items\": {\"type\": \"string\"},\n",
"                },\n",
"            },\n",
"            \"required\": [\"questions\"],\n",
"        },\n",
"    }\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "523deb92",
"metadata": {},
"outputs": [],
"source": [
"from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser\n",
"\n",
"chain = (\n",
"    {\"doc\": lambda x: x.page_content}\n",
"    # Only asking for 3 hypothetical questions, but this could be adjusted\n",
"    | ChatPromptTemplate.from_template(\n",
"        \"Generate a list of exactly 3 hypothetical questions that the below document could be used to answer:\\n\\n{doc}\"\n",
"    )\n",
"    | ChatOpenAI(max_retries=0, model=\"gpt-4\").bind(\n",
"        functions=functions, function_call={\"name\": \"hypothetical_questions\"}\n",
"    )\n",
"    | JsonKeyOutputFunctionsParser(key_name=\"questions\")\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "11d30554",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[\"What was the author's first experience with programming like?\",\n",
" 'Why did the author switch their focus from AI to Lisp during their graduate studies?',\n",
" 'What led the author to contemplate a career in art instead of computer science?']"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain.invoke(docs[0])"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "3eb2e48c",
"metadata": {},
"outputs": [],
"source": [
"hypothetical_questions = chain.batch(docs, {\"max_concurrency\": 5})"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "b2cd6e75",
"metadata": {},
"outputs": [],
"source": [
"# The vectorstore to use to index the hypothetical questions\n",
"vectorstore = Chroma(\n",
"    collection_name=\"hypo-questions\", embedding_function=OpenAIEmbeddings()\n",
")\n",
"# The storage layer for the parent documents\n",
"store = InMemoryByteStore()\n",
"id_key = \"doc_id\"\n",
"# The retriever (empty to start)\n",
"retriever = MultiVectorRetriever(\n",
"    vectorstore=vectorstore,\n",
"    byte_store=store,\n",
"    id_key=id_key,\n",
")\n",
"doc_ids = [str(uuid.uuid4()) for _ in docs]"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "18831b3b",
"metadata": {},
"outputs": [],
"source": [
"question_docs = []\n",
"for i, question_list in enumerate(hypothetical_questions):\n",
"    question_docs.extend(\n",
"        [Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list]\n",
"    )"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "224b24c5",
"metadata": {},
"outputs": [],
"source": [
"retriever.vectorstore.add_documents(question_docs)\n",
"retriever.docstore.mset(list(zip(doc_ids, docs)))"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "7b442b90",
"metadata": {},
"outputs": [],
"source": [
"sub_docs = vectorstore.similarity_search(\"justice breyer\")"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "089b5ad0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Who has been nominated to serve on the United States Supreme Court?', metadata={'doc_id': '0b3a349e-c936-4e77-9c40-0a39fc3e07f0'}),\n",
" Document(page_content=\"What was the context and content of Robert Morris' advice to the document's author in 2010?\", metadata={'doc_id': 'b2b2cdca-988a-4af1-ba47-46170770bc8c'}),\n",
" Document(page_content='How did personal circumstances influence the decision to pass on the leadership of Y Combinator?', metadata={'doc_id': 'b2b2cdca-988a-4af1-ba47-46170770bc8c'}),\n",
" Document(page_content='What were the reasons for the author leaving Yahoo in the summer of 1999?', metadata={'doc_id': 'ce4f4981-ca60-4f56-86f0-89466de62325'})]"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sub_docs"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "7594b24e",
"metadata": {},
"outputs": [],
"source": [
"retrieved_docs = retriever.invoke(\"justice breyer\")"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "4c120c65",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"9194"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(retrieved_docs[0].page_content)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "005072b8",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}