From 2463c8060ccbae87207f006055869d445ba412c7 Mon Sep 17 00:00:00 2001 From: ccurme Date: Tue, 14 May 2024 09:41:36 -0400 Subject: [PATCH] docs: how-to on adding scores to retriever results (#21626) --- docs/docs/how_to/add_scores_retriever.ipynb | 446 ++++++++++++++++++++ docs/docs/how_to/index.mdx | 1 + 2 files changed, 447 insertions(+) create mode 100644 docs/docs/how_to/add_scores_retriever.ipynb diff --git a/docs/docs/how_to/add_scores_retriever.ipynb b/docs/docs/how_to/add_scores_retriever.ipynb new file mode 100644 index 00000000000..2d3851ae200 --- /dev/null +++ b/docs/docs/how_to/add_scores_retriever.ipynb @@ -0,0 +1,446 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "9d59582a-6473-4b34-929b-3e94cb443c3d", + "metadata": {}, + "source": [ + "# How to add scores to retriever results\n", + "\n", + "Retrievers will return sequences of [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects, which by default include no information about the process that retrieved them (e.g., a similarity score against a query). Here we demonstrate how to add retrieval scores to the `.metadata` of documents:\n", + "1. From [vectorstore retrievers](/docs/how_to/vectorstore_retriever);\n", + "2. From higher-order LangChain retrievers, such as [SelfQueryRetriever](/docs/how_to/self_query) or [MultiVectorRetriever](/docs/how_to/multi_vector).\n", + "\n", + "For (1), we will implement a short wrapper function around the corresponding vectorstore. For (2), we will update a method of the corresponding class.\n", + "\n", + "## Create vectorstore\n", + "\n", + "First we populate a vectorstore with some data. We will use a [PineconeVectorStore](https://api.python.langchain.com/en/latest/vectorstores/langchain_pinecone.vectorstores.PineconeVectorStore.html), but this guide is compatible with any LangChain vectorstore that implements a `.similarity_search_with_score` method." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "b8cfcb1b-64ee-4b91-8d82-ce7803834985", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_core.documents import Document\n", + "from langchain_openai import OpenAIEmbeddings\n", + "from langchain_pinecone import PineconeVectorStore\n", + "\n", + "docs = [\n", + "    Document(\n", + "        page_content=\"A bunch of scientists bring back dinosaurs and mayhem breaks loose\",\n", + "        metadata={\"year\": 1993, \"rating\": 7.7, \"genre\": \"science fiction\"},\n", + "    ),\n", + "    Document(\n", + "        page_content=\"Leo DiCaprio gets lost in a dream within a dream within a dream within a ...\",\n", + "        metadata={\"year\": 2010, \"director\": \"Christopher Nolan\", \"rating\": 8.2},\n", + "    ),\n", + "    Document(\n", + "        page_content=\"A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea\",\n", + "        metadata={\"year\": 2006, \"director\": \"Satoshi Kon\", \"rating\": 8.6},\n", + "    ),\n", + "    Document(\n", + "        page_content=\"A bunch of normal-sized women are supremely wholesome and some men pine after them\",\n", + "        metadata={\"year\": 2019, \"director\": \"Greta Gerwig\", \"rating\": 8.3},\n", + "    ),\n", + "    Document(\n", + "        page_content=\"Toys come alive and have a blast doing so\",\n", + "        metadata={\"year\": 1995, \"genre\": \"animated\"},\n", + "    ),\n", + "    Document(\n", + "        page_content=\"Three men walk into the Zone, three men walk out of the Zone\",\n", + "        metadata={\n", + "            \"year\": 1979,\n", + "            \"director\": \"Andrei Tarkovsky\",\n", + "            \"genre\": \"thriller\",\n", + "            \"rating\": 9.9,\n", + "        },\n", + "    ),\n", + "]\n", + "\n", + "vectorstore = PineconeVectorStore.from_documents(\n", + "    docs, index_name=\"sample\", embedding=OpenAIEmbeddings()\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "22ac5ef6-ce18-427f-a91c-62b38a8b41e9", + "metadata": {}, + "source": [ + "## Retriever\n", + "\n", + "To obtain scores from a
vectorstore retriever, we wrap the underlying vectorstore's `.similarity_search_with_score` method in a short function that packages scores into the associated document's metadata.\n", + "\n", + "We add a `@chain` decorator to the function to create a [Runnable](/docs/concepts/#langchain-expression-language) that can be used similarly to a typical retriever." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "7e5677c3-f6ee-4974-ab5f-a0f50c199d45", + "metadata": {}, + "outputs": [], + "source": [ + "from typing import List\n", + "\n", + "from langchain_core.documents import Document\n", + "from langchain_core.runnables import chain\n", + "\n", + "\n", + "@chain\n", + "def retriever(query: str) -> List[Document]:\n", + "    results = vectorstore.similarity_search_with_score(query)\n", + "    for doc, score in results:\n", + "        doc.metadata[\"score\"] = score\n", + "\n", + "    return [doc for doc, _ in results]" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "c9cad75e-b955-4012-989c-3c1820b49ba9", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993.0, 'score': 0.84429127}),\n", + " Document(page_content='Toys come alive and have a blast doing so', metadata={'genre': 'animated', 'year': 1995.0, 'score': 0.792038262}),\n", + " Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979.0, 'score': 0.751571238}),\n", + " Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006.0, 'score': 0.747471571})]" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source":
[ + "result = retriever.invoke(\"dinosaur\")\n", + "result" + ] + }, + { + "cell_type": "markdown", + "id": "6671308a-be8d-4c15-ae1f-5bd07b342560", + "metadata": {}, + "source": [ + "Note that similarity scores from the retrieval step are included in the metadata of the above documents." + ] + }, + { + "cell_type": "markdown", + "id": "af2e73a0-46a1-47e2-8103-68aaa637642a", + "metadata": {}, + "source": [ + "## SelfQueryRetriever\n", + "\n", + "`SelfQueryRetriever` uses an LLM to generate a query that is potentially structured; for example, it can construct filters for the retrieval on top of the usual semantic-similarity-driven selection. See [this guide](/docs/how_to/self_query) for more detail.\n", + "\n", + "`SelfQueryRetriever` includes a short (1-2 line) method `_get_docs_with_query` that executes the vectorstore search. We can subclass `SelfQueryRetriever` and override this method to propagate similarity scores.\n", + "\n", + "First, following the [how-to guide](/docs/how_to/self_query), we will need to establish some metadata on which to filter:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "8280b829-2e81-4454-8adc-9a0930047fa2", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.chains.query_constructor.base import AttributeInfo\n", + "from langchain.retrievers.self_query.base import SelfQueryRetriever\n", + "from langchain_openai import ChatOpenAI\n", + "\n", + "metadata_field_info = [\n", + "    AttributeInfo(\n", + "        name=\"genre\",\n", + "        description=\"The genre of the movie. 
One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']\",\n", + "        type=\"string\",\n", + "    ),\n", + "    AttributeInfo(\n", + "        name=\"year\",\n", + "        description=\"The year the movie was released\",\n", + "        type=\"integer\",\n", + "    ),\n", + "    AttributeInfo(\n", + "        name=\"director\",\n", + "        description=\"The name of the movie director\",\n", + "        type=\"string\",\n", + "    ),\n", + "    AttributeInfo(\n", + "        name=\"rating\", description=\"A 1-10 rating for the movie\", type=\"float\"\n", + "    ),\n", + "]\n", + "document_content_description = \"Brief summary of a movie\"\n", + "llm = ChatOpenAI(temperature=0)" + ] + }, + { + "cell_type": "markdown", + "id": "0a6c6fa8-1e2f-45ee-83e9-a6cbd82292d2", + "metadata": {}, + "source": [ + "We then override `_get_docs_with_query` to use the `similarity_search_with_score` method of the retriever's underlying vectorstore:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "62c8f3fa-8b64-4afb-87c4-ccbbf9a8bc54", + "metadata": {}, + "outputs": [], + "source": [ + "from typing import Any, Dict\n", + "\n", + "\n", + "class CustomSelfQueryRetriever(SelfQueryRetriever):\n", + "    def _get_docs_with_query(\n", + "        self, query: str, search_kwargs: Dict[str, Any]\n", + "    ) -> List[Document]:\n", + "        \"\"\"Get docs, adding score information.\"\"\"\n", + "        results = self.vectorstore.similarity_search_with_score(\n", + "            query, **search_kwargs\n", + "        )\n", + "        for doc, score in results:\n", + "            doc.metadata[\"score\"] = score\n", + "\n", + "        return [doc for doc, _ in results]" + ] + }, + { + "cell_type": "markdown", + "id": "56e40109-1db6-44c7-a6e6-6989175e267c", + "metadata": {}, + "source": [ + "Invoking this retriever now returns documents with similarity scores in their metadata. Note that the underlying structured-query capabilities of `SelfQueryRetriever` are retained." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "3359a1ee-34ff-41b6-bded-64c05785b333", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993.0, 'score': 0.84429127})]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "retriever = CustomSelfQueryRetriever.from_llm(\n", + "    llm,\n", + "    vectorstore,\n", + "    document_content_description,\n", + "    metadata_field_info,\n", + ")\n", + "\n", + "\n", + "result = retriever.invoke(\"dinosaur movie with rating less than 8\")\n", + "result" + ] + }, + { + "cell_type": "markdown", + "id": "689ab3ba-3494-448b-836e-05fbe1ffd51c", + "metadata": {}, + "source": [ + "## MultiVectorRetriever\n", + "\n", + "`MultiVectorRetriever` allows you to associate multiple vectors with a single document. This can be useful in a number of applications. For example, we can index small chunks of a larger document and run the retrieval on the chunks, but return the larger \"parent\" document when invoking the retriever. [ParentDocumentRetriever](/docs/how_to/parent_document_retriever/), a subclass of `MultiVectorRetriever`, includes convenience methods for populating a vectorstore to support this. Further applications are detailed in this [how-to guide](/docs/how_to/multi_vector/).\n", + "\n", + "To propagate similarity scores through this retriever, we can again subclass `MultiVectorRetriever` and override a method. This time we will override `_get_relevant_documents`.\n", + "\n", + "First, we prepare some fake data. We generate fake \"whole documents\" and store them in a document store; here we will use a simple [InMemoryStore](https://api.python.langchain.com/en/latest/stores/langchain_core.stores.InMemoryBaseStore.html)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "a112e545-7b53-4fcd-9c4a-7a42a5cc646d", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.storage import InMemoryStore\n", + "from langchain_text_splitters import RecursiveCharacterTextSplitter\n", + "\n", + "# The storage layer for the parent documents\n", + "docstore = InMemoryStore()\n", + "fake_whole_documents = [\n", + "    (\"fake_id_1\", Document(page_content=\"fake whole document 1\")),\n", + "    (\"fake_id_2\", Document(page_content=\"fake whole document 2\")),\n", + "]\n", + "docstore.mset(fake_whole_documents)" + ] + }, + { + "cell_type": "markdown", + "id": "453b7415-4a6d-45d4-a329-9c1d7271d1b2", + "metadata": {}, + "source": [ + "Next we will add some fake \"sub-documents\" to our vectorstore. We can link these sub-documents to the parent documents by populating the `\"doc_id\"` key in their metadata." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "314519c0-dde4-41ea-a1ab-d3cf1c17c63f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['62a85353-41ff-4346-bff7-be6c8ec2ed89',\n", + " '5d4a0e83-4cc5-40f1-bc73-ed9cbad0ee15',\n", + " '8c1d9a56-120f-45e4-ba70-a19cd19a38f4']" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "docs = [\n", + "    Document(\n", + "        page_content=\"A snippet from a larger document discussing cats.\",\n", + "        metadata={\"doc_id\": \"fake_id_1\"},\n", + "    ),\n", + "    Document(\n", + "        page_content=\"A snippet from a larger document discussing discourse.\",\n", + "        metadata={\"doc_id\": \"fake_id_1\"},\n", + "    ),\n", + "    Document(\n", + "        page_content=\"A snippet from a larger document discussing chocolate.\",\n", + "        metadata={\"doc_id\": \"fake_id_2\"},\n", + "    ),\n", + "]\n", + "\n", + "vectorstore.add_documents(docs)" + ] + }, + { + "cell_type": "markdown", + "id": "e391f7f3-5a58-40fd-89fa-a0815c5146f7", + "metadata": {}, + "source": [ + "To propagate 
the scores, we subclass `MultiVectorRetriever` and override its `_get_relevant_documents` method. Here we will make two changes:\n", + "\n", + "1. We will add similarity scores to the metadata of the corresponding \"sub-documents\" using the `similarity_search_with_score` method of the underlying vectorstore as above;\n", + "2. We will include a list of these sub-documents in the metadata of the retrieved parent document. This surfaces which snippets of text were identified by the retrieval, together with their corresponding similarity scores." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "1de61de7-1b58-41d6-9dea-939fef7d741d", + "metadata": {}, + "outputs": [], + "source": [ + "from collections import defaultdict\n", + "\n", + "from langchain.retrievers import MultiVectorRetriever\n", + "from langchain_core.callbacks import CallbackManagerForRetrieverRun\n", + "\n", + "\n", + "class CustomMultiVectorRetriever(MultiVectorRetriever):\n", + "    def _get_relevant_documents(\n", + "        self, query: str, *, run_manager: CallbackManagerForRetrieverRun\n", + "    ) -> List[Document]:\n", + "        \"\"\"Get documents relevant to a query.\n", + "        Args:\n", + "            query: String to find relevant documents for\n", + "            run_manager: The callbacks handler to use\n", + "        Returns:\n", + "            List of relevant documents\n", + "        \"\"\"\n", + "        results = self.vectorstore.similarity_search_with_score(\n", + "            query, **self.search_kwargs\n", + "        )\n", + "\n", + "        # Map doc_ids to list of sub-documents, adding scores to metadata\n", + "        id_to_doc = defaultdict(list)\n", + "        for doc, score in results:\n", + "            doc_id = doc.metadata.get(\"doc_id\")\n", + "            if doc_id:\n", + "                doc.metadata[\"score\"] = score\n", + "                id_to_doc[doc_id].append(doc)\n", + "\n", + "        # Fetch documents corresponding to doc_ids, retaining sub_docs in metadata\n", + "        docs = []\n", + "        for _id, sub_docs in id_to_doc.items():\n", + "            docstore_docs = self.docstore.mget([_id])\n", + "            if docstore_docs:\n", + "                if doc := docstore_docs[0]:\n", + "                    doc.metadata[\"sub_docs\"] = sub_docs\n", + "                    docs.append(doc)\n", + "\n", + "        return docs" + ] + }, + { + "cell_type": "markdown", + "id": "7af27b38-631c-463f-9d66-bcc985f06a4f", + "metadata": {}, + "source": [ + "Invoking this retriever, we can see that it identifies the correct parent document and includes the matching sub-document snippet along with its similarity score." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "dc42a1be-22e1-4ade-b1bd-bafb85f2424f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(page_content='fake whole document 1', metadata={'sub_docs': [Document(page_content='A snippet from a larger document discussing cats.', metadata={'doc_id': 'fake_id_1', 'score': 0.831276655})]})]" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "retriever = CustomMultiVectorRetriever(vectorstore=vectorstore, docstore=docstore)\n", + "\n", + "retriever.invoke(\"cat\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/docs/how_to/index.mdx b/docs/docs/how_to/index.mdx index 486d4835313..d0ba3154f52 100644 --- a/docs/docs/how_to/index.mdx +++ b/docs/docs/how_to/index.mdx @@ -143,6 +143,7 @@ Retrievers are responsible for taking a query and returning relevant documents. 
- [How to: generate multiple queries to retrieve data for](/docs/how_to/MultiQueryRetriever) - [How to: use contextual compression to compress the data retrieved](/docs/how_to/contextual_compression) - [How to: write a custom retriever class](/docs/how_to/custom_retriever) +- [How to: add similarity scores to retriever results](/docs/how_to/add_scores_retriever) - [How to: combine the results from multiple retrievers](/docs/how_to/ensemble_retriever) - [How to: reorder retrieved results to put most relevant documents not in the middle](/docs/how_to/long_context_reorder) - [How to: generate multiple embeddings per document](/docs/how_to/multi_vector)