mirror of
https://github.com/hwchase17/langchain.git
synced 2025-05-23 07:57:16 +00:00
"One Retriever to merge them all, One Retriever to expose them, One Retriever to bring them all in and process them with Document formatters."

Hi @dev2049! Here bothering people again!

I'm using this simple idea to merge the output of several retrievers into one. I'm aware of DocumentCompressorPipeline and ContextualCompressionRetriever, but I don't think they allow us to do something like this. I was also having trouble getting the pipeline to work. Please correct me if I'm wrong.

This allows some "retrieval" preprocessing, after which the curated results can be used anywhere you could use a retriever. My use case is to generate different indexes with different embeddings and sources for more colorful results, then filter them with one or many document formatters.

I saw some people looking for something like this here: https://github.com/hwchase17/langchain/issues/3991 and something similar here: https://github.com/hwchase17/langchain/issues/5555

This is just a proposal; I know I'm missing tests, etc. If you think this is a worthwhile idea, I can work on tests and anything you want to change. Let me know!

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
122 lines
4.4 KiB
Plaintext
{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "fc0db1bc",
   "metadata": {},
   "source": [
    "# LOTR (Merger Retriever)\n",
    "\n",
    "`Lord of the Retrievers` (LOTR), also known as `MergerRetriever`, takes a list of retrievers as input and merges the results of their `get_relevant_documents()` methods into a single list. The merged list contains the documents that each retriever judged relevant to the query, as ranked by that retriever.\n",
    "\n",
    "`MergerRetriever` can improve document retrieval in two ways. First, combining the results of several retrievers reduces the risk that a single embedding model or search strategy biases the results. Second, it can rank the results of the different retrievers, which helps ensure that the most relevant documents are returned first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9fbcc58f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import chromadb\n",
    "from langchain.retrievers.merger_retriever import MergerRetriever\n",
    "from langchain.vectorstores import Chroma\n",
    "from langchain.embeddings import HuggingFaceEmbeddings\n",
    "from langchain.embeddings import OpenAIEmbeddings\n",
    "from langchain.document_transformers import EmbeddingsRedundantFilter\n",
    "from langchain.retrievers.document_compressors import DocumentCompressorPipeline\n",
    "from langchain.retrievers import ContextualCompressionRetriever\n",
    "\n",
    "# Get 3 different embeddings.\n",
    "all_mini = HuggingFaceEmbeddings(model_name=\"all-MiniLM-L6-v2\")\n",
    "multi_qa_mini = HuggingFaceEmbeddings(model_name=\"multi-qa-MiniLM-L6-dot-v1\")\n",
    "filter_embeddings = OpenAIEmbeddings()\n",
    "\n",
    "# `__file__` is not defined inside a notebook, so use the current working directory.\n",
    "ABS_PATH = os.getcwd()\n",
    "DB_DIR = os.path.join(ABS_PATH, \"db\")\n",
    "\n",
    "# Instantiate 2 different Chroma indexes, each one with a different embedding.\n",
    "client_settings = chromadb.config.Settings(\n",
    "    chroma_db_impl=\"duckdb+parquet\",\n",
    "    persist_directory=DB_DIR,\n",
    "    anonymized_telemetry=False,\n",
    ")\n",
    "db_all = Chroma(\n",
    "    collection_name=\"project_store_all\",\n",
    "    persist_directory=DB_DIR,\n",
    "    client_settings=client_settings,\n",
    "    embedding_function=all_mini,\n",
    ")\n",
    "db_multi_qa = Chroma(\n",
    "    collection_name=\"project_store_multi\",\n",
    "    persist_directory=DB_DIR,\n",
    "    client_settings=client_settings,\n",
    "    embedding_function=multi_qa_mini,\n",
    ")\n",
    "\n",
    "# Define 2 different retrievers, each with its own embedding and search type.\n",
    "retriever_all = db_all.as_retriever(\n",
    "    search_type=\"similarity\", search_kwargs={\"k\": 5, \"include_metadata\": True}\n",
    ")\n",
    "retriever_multi_qa = db_multi_qa.as_retriever(\n",
    "    search_type=\"mmr\", search_kwargs={\"k\": 5, \"include_metadata\": True}\n",
    ")\n",
    "\n",
    "# The Lord of the Retrievers holds the output of both retrievers and can be used like any other\n",
    "# retriever in different types of chains.\n",
    "lotr = MergerRetriever(retrievers=[retriever_all, retriever_multi_qa])\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c152339d",
   "metadata": {},
   "source": [
    "## Remove redundant results from the merged retrievers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "039faea6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# We can remove redundant results from both retrievers using yet another embedding.\n",
    "# Using multiple embeddings in different steps could help reduce biases.\n",
    "# Named `redundant_filter` to avoid shadowing the built-in `filter`.\n",
    "redundant_filter = EmbeddingsRedundantFilter(embeddings=filter_embeddings)\n",
    "pipeline = DocumentCompressorPipeline(transformers=[redundant_filter])\n",
    "compression_retriever = ContextualCompressionRetriever(\n",
    "    base_compressor=pipeline, base_retriever=lotr\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
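
The notebook above describes merging several retrievers' results into one list and then removing redundant entries, but it does not show what the merged output looks like. The following is a minimal, library-free sketch of that idea, not the `MergerRetriever` implementation. It assumes the merge interleaves the ranked lists round-robin, and it uses exact-match deduplication as a simple stand-in for the embedding-based `EmbeddingsRedundantFilter`. The function names `merge_round_robin` and `drop_exact_duplicates` and the sample document strings are hypothetical.

```python
from itertools import zip_longest
from typing import List, Sequence

# Unique marker used to pad shorter result lists during interleaving.
_MISSING = object()


def merge_round_robin(result_lists: Sequence[Sequence[str]]) -> List[str]:
    """Interleave ranked result lists: every list's rank-1 hit, then rank-2, etc.

    Assumption: round-robin interleaving is one plausible merge strategy;
    the notebook text does not specify the exact order.
    """
    merged: List[str] = []
    for rank_group in zip_longest(*result_lists, fillvalue=_MISSING):
        merged.extend(doc for doc in rank_group if doc is not _MISSING)
    return merged


def drop_exact_duplicates(docs: Sequence[str]) -> List[str]:
    """Keep the first occurrence of each document, preserving order.

    Stand-in for an embedding-based redundancy filter, which would also
    drop near-duplicates rather than only identical strings.
    """
    seen = set()
    unique: List[str] = []
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    # Hypothetical ranked hits from two retrievers for the same query.
    hits_all = ["doc A", "doc B", "doc C"]
    hits_multi_qa = ["doc B", "doc D"]

    merged = merge_round_robin([hits_all, hits_multi_qa])
    print(merged)  # ['doc A', 'doc B', 'doc B', 'doc D', 'doc C']
    print(drop_exact_duplicates(merged))  # ['doc A', 'doc B', 'doc D', 'doc C']
```

Interleaving keeps each retriever's top hit near the front of the merged list, so no single retriever dominates the first positions. Deduplicating afterwards then removes documents that more than one retriever returned.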