diff --git a/docs/docs/modules/data_connection/retrievers/custom_retriever.ipynb b/docs/docs/modules/data_connection/retrievers/custom_retriever.ipynb new file mode 100644 index 00000000000..1bab3e68f2d --- /dev/null +++ b/docs/docs/modules/data_connection/retrievers/custom_retriever.ipynb @@ -0,0 +1,309 @@ +{ + "cells": [ + { + "cell_type": "raw", + "id": "b5fc1fc7-c4c5-418f-99da-006c604a7ea6", + "metadata": {}, + "source": [ + "---\n", + "title: Custom Retriever\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "ff6f3c79-0848-4956-9115-54f6b2134587", + "metadata": {}, + "source": [ + "# Custom Retriever\n", + "\n", + "## Overview\n", + "\n", + "Many LLM applications involve retrieving information from external data sources using a `Retriever`. \n", + "\n", + "A retriever is responsible for retrieving a list of relevant `Documents` to a given user `query`.\n", + "\n", + "The retrieved documents are often formatted into prompts that are fed into an LLM, allowing the LLM to use the information in the to generate an appropriate response (e.g., answering a user question based on a knowledge base).\n", + "\n", + "## Interface\n", + "\n", + "To create your own retriever, you need to extend the `BaseRetriever` class and implement the following methods:\n", + "\n", + "| Method | Description | Required/Optional |\n", + "|--------------------------------|--------------------------------------------------|-------------------|\n", + "| `_get_relevant_documents` | Get documents relevant to a query. | Required |\n", + "| `_aget_relevant_documents` | Implement to provide async native support. | Optional |\n", + "\n", + "\n", + "The logic inside of `_get_relevant_documents` can involve arbitrary calls to a database or to the web using requests.\n", + "\n", + ":::{.callout-tip}\n", + "By inherting from `BaseRetriever`, your retriever automatically becomes a LangChain [Runnable](/docs/expression_language/interface) and will gain the standard `Runnable` functionality out of the box!\n", + ":::\n", + "\n", + "\n", + ":::{.callout-info}\n", + "You can use a `RunnableLambda` or `RunnableGenerator` to implement a retriever.\n", + "\n", + "The main benefit of implementing a retriever as a `BaseRetriever` vs. a `RunnableLambda` (a custom [runnable function](/docs/expression_language/primitives/functions)) is that a `BaseRetriever` is a well\n", + "known LangChain entity so some tooling for monitoring may implement specialized behavior for retrievers. Another difference\n", + "is that a `BaseRetriever` will behave slightly differently from `RunnableLambda` in some APIs; e.g., the `start` event\n", + "in `astream_events` API will be `on_retriever_start` instead of `on_chain_start`.\n", + ":::\n" + ] + }, + { + "cell_type": "markdown", + "id": "2be9fe82-0757-41d1-a647-15bed11fd3bf", + "metadata": {}, + "source": [ + "## Example\n", + "\n", + "Let's implement a toy retriever that returns all documents whose text contains the text in the user query." + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "bdf61902-2984-493b-a002-d4fced6df590", + "metadata": {}, + "outputs": [], + "source": [ + "from typing import List\n", + "\n", + "from langchain_core.callbacks import CallbackManagerForRetrieverRun\n", + "from langchain_core.documents import Document\n", + "from langchain_core.retrievers import BaseRetriever\n", + "\n", + "\n", + "class ToyRetriever(BaseRetriever):\n", + " \"\"\"A toy retriever that contains the top k documents that contain the user query.\n", + "\n", + " This retriever only implements the sync method _get_relevant_documents.\n", + "\n", + " If the retriever were to involve file access or network access, it could benefit\n", + " from a native async implementation of `_aget_relevant_documents`.\n", + "\n", + " As usual, with Runnables, there's a default async implementation that's provided\n", + " that delegates to the sync implementation running on another thread.\n", + " \"\"\"\n", + "\n", + " documents: List[Document]\n", + " \"\"\"List of documents to retrieve from.\"\"\"\n", + " k: int\n", + " \"\"\"Number of top results to return\"\"\"\n", + "\n", + " def _get_relevant_documents(\n", + " self, query: str, *, run_manager: CallbackManagerForRetrieverRun\n", + " ) -> List[Document]:\n", + " \"\"\"Sync implementations for retriever.\"\"\"\n", + " matching_documents = []\n", + " for document in documents:\n", + " if len(matching_documents) > self.k:\n", + " return matching_documents\n", + "\n", + " if query.lower() in document.page_content.lower():\n", + " matching_documents.append(document)\n", + " return matching_documents\n", + "\n", + " # Optional: Provide a more efficient native implementation by overriding\n", + " # _aget_relevant_documents\n", + " # async def _aget_relevant_documents(\n", + " # self, query: str, *, run_manager: AsyncCallbackManagerForRetrieverRun\n", + " # ) -> List[Document]:\n", + " # \"\"\"Asynchronously get documents relevant to a query.\n", + "\n", + " # Args:\n", + " # query: String to find relevant documents for\n", + " # run_manager: The callbacks handler to use\n", + "\n", + " # Returns:\n", + " # List of relevant documents\n", + " # \"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "2eac1f28-29c1-4888-b3aa-b4fa70c73b4c", + "metadata": {}, + "source": [ + "## Test it 🧪" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "ea868db5-48cc-4ec2-9b0a-1ab94c32b302", + "metadata": {}, + "outputs": [], + "source": [ + "documents = [\n", + " Document(\n", + " page_content=\"Dogs are great companions, known for their loyalty and friendliness.\",\n", + " metadata={\"type\": \"dog\", \"trait\": \"loyalty\"},\n", + " ),\n", + " Document(\n", + " page_content=\"Cats are independent pets that often enjoy their own space.\",\n", + " metadata={\"type\": \"cat\", \"trait\": \"independence\"},\n", + " ),\n", + " Document(\n", + " page_content=\"Goldfish are popular pets for beginners, requiring relatively simple care.\",\n", + " metadata={\"type\": \"fish\", \"trait\": \"low maintenance\"},\n", + " ),\n", + " Document(\n", + " page_content=\"Parrots are intelligent birds capable of mimicking human speech.\",\n", + " metadata={\"type\": \"bird\", \"trait\": \"intelligence\"},\n", + " ),\n", + " Document(\n", + " page_content=\"Rabbits are social animals that need plenty of space to hop around.\",\n", + " metadata={\"type\": \"rabbit\", \"trait\": \"social\"},\n", + " ),\n", + "]\n", + "retriever = ToyRetriever(documents=documents, k=3)" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "18be85e9-6ef0-4ee0-ae5d-a0810c38b254", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),\n", + " Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "retriever.invoke(\"that\")" + ] + }, + { + "cell_type": "markdown", + "id": "13f76f6e-cf2b-4f67-859b-0ef8be98abbe", + "metadata": {}, + "source": [ + "It's a **runnable** so it'll benefit from the standard Runnable Interface! 🤩" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "3672e9fe-4365-4628-9d25-31924cfaf784", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'}),\n", + " Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'type': 'rabbit', 'trait': 'social'})]" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "await retriever.ainvoke(\"that\")" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "e2c96eed-6813-421c-acf2-6554839840ee", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[[Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'type': 'dog', 'trait': 'loyalty'})],\n", + " [Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'type': 'cat', 'trait': 'independence'})]]" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "retriever.batch([\"dog\", \"cat\"])" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "978b6636-bf36-42c2-969c-207718f084cf", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'event': 'on_retriever_start', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'name': 'ToyRetriever', 'tags': [], 'metadata': {}, 'data': {'input': 'bar'}}\n", + "{'event': 'on_retriever_stream', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'name': 'ToyRetriever', 'data': {'chunk': []}}\n", + "{'event': 'on_retriever_end', 'name': 'ToyRetriever', 'run_id': 'f96f268d-8383-4921-b175-ca583924d9ff', 'tags': [], 'metadata': {}, 'data': {'output': []}}\n" + ] + } + ], + "source": [ + "async for event in retriever.astream_events(\"bar\", version=\"v1\"):\n", + " print(event)" + ] + }, + { + "cell_type": "markdown", + "id": "7b45c404-37bf-4370-bb7c-26556777ff46", + "metadata": {}, + "source": [ + "## Contributing\n", + "\n", + "We appreciate contributions of interesting retrievers!\n", + "\n", + "Here's a checklist to help make sure your contribution gets added to LangChain:\n", + "\n", + "Documentation:\n", + "\n", + "* The retriever contains doc-strings for all initialization arguments, as these will be surfaced in the [API Reference](https://api.python.langchain.com/en/stable/langchain_api_reference.html).\n", + "* The class doc-string for the model contains a link to any relevant APIs used for the retriever (e.g., if the retriever is retrieving from wikipedia, it'll be good to link to the wikipedia API!)\n", + "\n", + "Tests:\n", + "\n", + "* [ ] Add unit or integration tests to verify that `invoke` and `ainvoke` work.\n", + "\n", + "Optimizations:\n", + "\n", + "If the retriever is connecting to external data sources (e.g., an API or a file), it'll almost certainly benefit from an async native optimization!\n", + " \n", + "* [ ] Provide a native async implementation of `_aget_relevant_documents` (used by `ainvoke`)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/docs/modules/data_connection/retrievers/index.mdx b/docs/docs/modules/data_connection/retrievers/index.mdx index 5ec7fbe0e42..108e5d342e7 100644 --- a/docs/docs/modules/data_connection/retrievers/index.mdx +++ b/docs/docs/modules/data_connection/retrievers/index.mdx @@ -80,23 +80,4 @@ chain.invoke("What did the president say about technology?") ## Custom Retriever -Since the retriever interface is so simple, it's pretty easy to write a custom one. - -```python -from langchain_core.retrievers import BaseRetriever -from langchain_core.callbacks import CallbackManagerForRetrieverRun -from langchain_core.documents import Document -from typing import List - - -class CustomRetriever(BaseRetriever): - - def _get_relevant_documents( - self, query: str, *, run_manager: CallbackManagerForRetrieverRun - ) -> List[Document]: - return [Document(page_content=query)] - -retriever = CustomRetriever() - -retriever.get_relevant_documents("bar") -``` \ No newline at end of file +See the [documentation here](/docs/modules/data_connection/retrievers/custom_retriever) to implement a custom retriever.