docs: update retriever template, add arxiv retriever (#24947)

2025-06-26 08:33:49 +00:00 · 2024-08-01 16:53:18 -04:00 · 2024-08-01 16:53:18 -04:00 · 9cb69a8746
commit 9cb69a8746
parent db3ceb4d0a
9 changed files with 271 additions and 226 deletions
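
Reviewer note: the chain cell added to `arxiv.ipynb` pipes the retriever's output through a small `format_docs` helper before it reaches the prompt. As a rough illustration of that one step, the sketch below runs the same join on a hypothetical `Doc` stand-in (not LangChain's `Document` class) with made-up text, so it needs no LangChain install or network access.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in exposing only the `page_content` attribute."""

    page_content: str


def format_docs(docs):
    # Same body as the helper in the diff: join document texts with blank lines.
    return "\n\n".join(doc.page_content for doc in docs)


# Made-up texts for illustration only; not real retriever output.
docs = [Doc("First abstract."), Doc("Second abstract.")]
context = format_docs(docs)
print(context)
```

In the notebook, the resulting string fills the `{context}` slot of the prompt template.
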
--- a/docs/docs/integrations/retrievers/arxiv.ipynb
+++ b/docs/docs/integrations/retrievers/arxiv.ipynb
@@ -2,14 +2,49 @@
 "cells": [
  {
   "cell_type": "markdown",
-   "id": "9fc6205b",
+   "id": "00a924a0-57e2-43fa-95dc-3ea48a56d3a5",
   "metadata": {},
   "source": [
-    "# Arxiv\n",
+    "---\n",
+    "sidebar_label: Arxiv\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0f1b8ddb-8b06-4e7e-b0bb-8786dea15e2b",
+   "metadata": {},
+   "source": [
+    "# ArxivRetriever\n",
+    "\n",
+    "## Overview\n",
    "\n",
    ">[arXiv](https://arxiv.org/) is an open-access archive for 2 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.\n",
    "\n",
-    "This notebook shows how to retrieve scientific articles from `Arxiv.org` into the Document format that is used downstream."
+    "This notebook shows how to retrieve scientific articles from Arxiv.org into the [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) format that is used downstream.\n",
+    "\n",
+    "For detailed documentation of all `ArxivRetriever` features and configurations, head to the [API reference](https://api.python.langchain.com/en/latest/retrievers/langchain_community.retrievers.arxiv.ArxivRetriever.html).\n",
+    "\n",
+    "### Integration details\n",
+    "\n",
+    "| Retriever | Source | Package |\n",
+    "| :--- | :--- | :---: |\n",
+    "| [ArxivRetriever](https://api.python.langchain.com/en/latest/retrievers/langchain_community.retrievers.arxiv.ArxivRetriever.html) | Scholarly articles on [arxiv.org](https://arxiv.org/) | langchain_community |\n",
+    "\n",
+    "## Setup\n",
+    "\n",
+    "If you want to get automated tracing from individual queries, you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "75d179b4-abc3-48db-9f8b-1cdb46d3aa77",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# os.environ[\"LANGSMITH_API_KEY\"] = getpass.getpass(\"Enter your LangSmith API key: \")\n",
+    "# os.environ[\"LANGSMITH_TRACING\"] = \"true\""
   ]
  },
  {
@@ -17,15 +52,9 @@
   "id": "51489529-5dcd-4b86-bda6-de0a39d8ffd1",
   "metadata": {},
   "source": [
-    "## Installation"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "1435c804-069d-4ade-9a7b-006b97b767c1",
-   "metadata": {},
-   "source": [
-    "First, you need to install `arxiv` python package."
+    "### Installation\n",
+    "\n",
+    "This retriever lives in the `langchain-community` package. We will also need the [arxiv](https://pypi.org/project/arxiv/) dependency:"
   ]
  },
  {
@@ -37,7 +66,7 @@
   },
   "outputs": [],
   "source": [
-    "%pip install --upgrade --quiet  arxiv"
+    "%pip install -qU langchain-community arxiv"
   ]
  },
  {
@@ -45,54 +74,44 @@
   "id": "6c15470b-a16b-4e0d-bc6a-6998bafbb5a4",
   "metadata": {},
   "source": [
-    "`ArxivRetriever` has these arguments:\n",
+    "## Instantiation\n",
+    "\n",
+    "`ArxivRetriever` parameters include:\n",
    "- optional `load_max_docs`: default=100. Use it to limit number of downloaded documents. It takes time to download all 100 documents, so use a small number for experiments. There is a hard limit of 300 for now.\n",
    "- optional `load_all_available_meta`: default=False. By default only the most important fields downloaded: `Published` (date when document was published/last updated), `Title`, `Authors`, `Summary`. If True, other fields also downloaded.\n",
+    "- `get_full_documents`: boolean, default False. Determines whether to fetch the full text of documents rather than only their summaries.\n",
    "\n",
-    "`get_relevant_documents()` has one argument, `query`: free text which used to find documents in `Arxiv.org`"
+    "See [API reference](https://api.python.langchain.com/en/latest/retrievers/langchain_community.retrievers.arxiv.ArxivRetriever.html) for more detail."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "a13f9e92-24b3-4cea-8541-2584c1cdb2d1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_community.retrievers import ArxivRetriever\n",
+    "\n",
+    "retriever = ArxivRetriever(\n",
+    "    load_max_docs=2,\n",
+    "    get_full_documents=True,\n",
+    ")"
   ]
  },
  {
   "cell_type": "markdown",
-   "id": "ae3c3d16",
+   "id": "30c27047-16cf-46b5-bb29-754f1696f2bb",
   "metadata": {},
   "source": [
-    "## Examples"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "6fafb73b-d6ec-4822-b161-edf0aaf5224a",
-   "metadata": {},
-   "source": [
-    "### Running retriever"
+    "## Usage\n",
+    "\n",
+    "`ArxivRetriever` supports retrieval by article identifier:"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "id": "d0e6f506",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [],
-   "source": [
-    "from langchain_community.retrievers import ArxivRetriever"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 18,
-   "id": "f381f642",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "retriever = ArxivRetriever(load_max_docs=2)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 2,
   "id": "20ae1a74",
   "metadata": {},
   "outputs": [],
@@ -102,20 +121,20 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 3,
   "id": "1d5a5088",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
-       "{'Published': '2016-05-26',\n",
+       "{'Entry ID': 'http://arxiv.org/abs/1605.08386v1',\n",
+       " 'Published': datetime.date(2016, 5, 26),\n",
       " 'Title': 'Heat-bath random walks with Markov bases',\n",
-       " 'Authors': 'Caprice Stanley, Tobias Windisch',\n",
-       " 'Summary': 'Graphs on lattice points are studied whose edges come from a finite set of\\nallowed moves of arbitrary length. We show that the diameter of these graphs on\\nfibers of a fixed integer matrix can be bounded from above by a constant. We\\nthen study the mixing behaviour of heat-bath random walks on these graphs. We\\nalso state explicit conditions on the set of moves so that the heat-bath random\\nwalk, a generalization of the Glauber dynamics, is an expander in fixed\\ndimension.'}"
+       " 'Authors': 'Caprice Stanley, Tobias Windisch'}"
      ]
     },
-     "execution_count": 9,
+     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -126,17 +145,17 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 4,
   "id": "c0ccd0c7-f6a6-43e7-b842-5f57afb94224",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
-       "'arXiv:1605.08386v1  [math.CO]  26 May 2016\\nHEAT-BATH RANDOM WALKS WITH MARKOV BASES\\nCAPRICE STANLEY AND TOBIAS WINDISCH\\nAbstract. Graphs on lattice points are studied whose edges come from a ﬁnite set of\\nallowed moves of arbitrary length. We show that the diameter of these graphs on ﬁbers of a\\nﬁxed integer matrix can be bounded from above by a constant. We then study the mixing\\nbehaviour of heat-b'"
+       "'Graphs on lattice points are studied whose edges come from a finite set of\\nallowed moves of arbitrary length. We show that the diameter of these graphs on\\nfibers of a fixed integer matrix can be bounded from above by a constant. We\\nthen study the mixing behaviour of heat-bath random walks on these graphs. We\\nalso state explicit conditions on the set of moves so that the heat-bath random\\nwalk, a ge'"
      ]
     },
-     "execution_count": 10,
+     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -147,159 +166,143 @@
  },
  {
   "cell_type": "markdown",
-   "id": "2670363b-3806-4c7e-b14d-90a4d5d2a200",
+   "id": "c525c5c2-0961-4f4c-a208-dd6ceed76ea1",
   "metadata": {},
   "source": [
-    "### Question Answering on facts"
+    "`ArxivRetriever` also supports retrieval based on natural language text:"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 5,
+   "id": "4cd3d079-4496-4ab8-adff-b86e6418bc74",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "docs = retriever.invoke(\"What is the ImageBind model?\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "9318c790-d388-45da-8d5c-57256619e2a1",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'Entry ID': 'http://arxiv.org/abs/2305.05665v2',\n",
+       " 'Published': datetime.date(2023, 5, 31),\n",
+       " 'Title': 'ImageBind: One Embedding Space To Bind Them All',\n",
+       " 'Authors': 'Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra'}"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "docs[0].metadata"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2670363b-3806-4c7e-b14d-90a4d5d2a200",
+   "metadata": {},
+   "source": [
+    "## Use within a chain\n",
+    "\n",
+    "Like other retrievers, `ArxivRetriever` can be incorporated into LLM applications via [chains](/docs/how_to/sequence/).\n",
+    "\n",
+    "We will need an LLM or chat model:\n",
+    "\n",
+    "```{=mdx}\n",
+    "import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
+    "\n",
+    "<ChatModelTabs customVarName=\"llm\" />\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "bcbeeaf5-79d1-4e29-8589-11dfb26761af",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# | output: false\n",
+    "# | echo: false\n",
+    "\n",
+    "from langchain_openai import ChatOpenAI\n",
+    "\n",
+    "llm = ChatOpenAI(temperature=0)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
   "id": "bb3601df-53ea-4826-bdbe-554387bc3ad4",
   "metadata": {
    "tags": []
   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      " ········\n"
-     ]
-    }
-   ],
-   "source": [
-    "# get a token: https://platform.openai.com/account/api-keys\n",
-    "\n",
-    "from getpass import getpass\n",
-    "\n",
-    "OPENAI_API_KEY = getpass()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 12,
-   "id": "e9c1a114-0410-4804-be30-05f34a9760f9",
-   "metadata": {
-    "tags": []
-   },
   "outputs": [],
   "source": [
-    "import os\n",
+    "from langchain_core.output_parsers import StrOutputParser\n",
+    "from langchain_core.prompts import ChatPromptTemplate\n",
+    "from langchain_core.runnables import RunnablePassthrough\n",
    "\n",
-    "os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY"
+    "prompt = ChatPromptTemplate.from_template(\n",
+    "    \"\"\"Answer the question based only on the context provided.\n",
+    "\n",
+    "Context: {context}\n",
+    "\n",
+    "Question: {question}\"\"\"\n",
+    ")\n",
+    "\n",
+    "\n",
+    "def format_docs(docs):\n",
+    "    return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
+    "\n",
+    "\n",
+    "chain = (\n",
+    "    {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
+    "    | prompt\n",
+    "    | llm\n",
+    "    | StrOutputParser()\n",
+    ")"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 19,
-   "id": "51a33cc9-ec42-4afc-8a2d-3bfff476aa59",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [],
-   "source": [
-    "from langchain.chains import ConversationalRetrievalChain\n",
-    "from langchain_openai import ChatOpenAI\n",
-    "\n",
-    "model = ChatOpenAI(model=\"gpt-3.5-turbo\")  # switch to 'gpt-4'\n",
-    "qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 20,
-   "id": "ea537767-a8bf-4adf-ae03-b353c9145d58",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "-> **Question**: What are Heat-bath random walks with Markov base? \n",
-      "\n",
-      "**Answer**: I'm not sure, as I don't have enough context to provide a definitive answer. The term \"Heat-bath random walks with Markov base\" is not mentioned in the given text. Could you provide more information or context about where you encountered this term? \n",
-      "\n",
-      "-> **Question**: What is the ImageBind model? \n",
-      "\n",
-      "**Answer**: ImageBind is an approach developed by Facebook AI Research to learn a joint embedding across six different modalities, including images, text, audio, depth, thermal, and IMU data. The approach uses the binding property of images to align each modality's embedding to image embeddings and achieve an emergent alignment across all modalities. This enables novel multimodal capabilities, including cross-modal retrieval, embedding-space arithmetic, and audio-to-image generation, among others. The approach sets a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Additionally, it shows strong few-shot recognition results and serves as a new way to evaluate vision models for visual and non-visual tasks. \n",
-      "\n",
-      "-> **Question**: How does Compositional Reasoning with Large Language Models works? \n",
-      "\n",
-      "**Answer**: Compositional reasoning with large language models refers to the ability of these models to correctly identify and represent complex concepts by breaking them down into smaller, more basic parts and combining them in a structured way. This involves understanding the syntax and semantics of language and using that understanding to build up more complex meanings from simpler ones. \n",
-      "\n",
-      "In the context of the paper \"Does CLIP Bind Concepts? Probing Compositionality in Large Image Models\", the authors focus specifically on the ability of a large pretrained vision and language model (CLIP) to encode compositional concepts and to bind variables in a structure-sensitive way. They examine CLIP's ability to compose concepts in a single-object setting, as well as in situations where concept binding is needed. \n",
-      "\n",
-      "The authors situate their work within the tradition of research on compositional distributional semantics models (CDSMs), which seek to bridge the gap between distributional models and formal semantics by building architectures which operate over vectors yet still obey traditional theories of linguistic composition. They compare the performance of CLIP with several architectures from research on CDSMs to evaluate its ability to encode and reason about compositional concepts. \n",
-      "\n"
-     ]
-    }
-   ],
-   "source": [
-    "questions = [\n",
-    "    \"What are Heat-bath random walks with Markov base?\",\n",
-    "    \"What is the ImageBind model?\",\n",
-    "    \"How does Compositional Reasoning with Large Language Models works?\",\n",
-    "]\n",
-    "chat_history = []\n",
-    "\n",
-    "for question in questions:\n",
-    "    result = qa({\"question\": question, \"chat_history\": chat_history})\n",
-    "    chat_history.append((question, result[\"answer\"]))\n",
-    "    print(f\"-> **Question**: {question} \\n\")\n",
-    "    print(f\"**Answer**: {result['answer']} \\n\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 22,
-   "id": "8e0c3fc6-ae62-4036-a885-dc60176a7745",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "-> **Question**: What are Heat-bath random walks with Markov base? Include references to answer. \n",
-      "\n",
-      "**Answer**: Heat-bath random walks with Markov base (HB-MB) is a class of stochastic processes that have been studied in the field of statistical mechanics and condensed matter physics. In these processes, a particle moves in a lattice by making a transition to a neighboring site, which is chosen according to a probability distribution that depends on the energy of the particle and the energy of its surroundings.\n",
-      "\n",
-      "The HB-MB process was introduced by Bortz, Kalos, and Lebowitz in 1975 as a way to simulate the dynamics of interacting particles in a lattice at thermal equilibrium. The method has been used to study a variety of physical phenomena, including phase transitions, critical behavior, and transport properties.\n",
-      "\n",
-      "References:\n",
-      "\n",
-      "Bortz, A. B., Kalos, M. H., & Lebowitz, J. L. (1975). A new algorithm for Monte Carlo simulation of Ising spin systems. Journal of Computational Physics, 17(1), 10-18.\n",
-      "\n",
-      "Binder, K., & Heermann, D. W. (2010). Monte Carlo simulation in statistical physics: an introduction. Springer Science & Business Media. \n",
-      "\n"
-     ]
-    }
-   ],
-   "source": [
-    "questions = [\n",
-    "    \"What are Heat-bath random walks with Markov base? Include references to answer.\",\n",
-    "]\n",
-    "chat_history = []\n",
-    "\n",
-    "for question in questions:\n",
-    "    result = qa({\"question\": question, \"chat_history\": chat_history})\n",
-    "    chat_history.append((question, result[\"answer\"]))\n",
-    "    print(f\"-> **Question**: {question} \\n\")\n",
-    "    print(f\"**Answer**: {result['answer']} \\n\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "09794ab5-759c-4b56-95d4-2454d4d86da1",
+   "execution_count": 9,
+   "id": "62889c3c-8a49-4c76-9141-d777311af1f4",
   "metadata": {},
-   "outputs": [],
-   "source": []
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'The ImageBind model is an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. It shows that only image-paired data is sufficient to bind the modalities together and can leverage large scale vision-language models for zero-shot capabilities and emergent applications such as cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation.'"
+      ]
+     },
+     "execution_count": 9,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "chain.invoke(\"What is the ImageBind model?\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e419acb8-d7ac-42a1-916f-c796f23dce9b",
+   "metadata": {},
+   "source": [
+    "## API reference\n",
+    "\n",
+    "For detailed documentation of all `ArxivRetriever` features and configurations, head to the [API reference](https://api.python.langchain.com/en/latest/retrievers/langchain_community.retrievers.arxiv.ArxivRetriever.html)."
+   ]
  }
 ],
 "metadata": {
@@ -318,7 +321,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.10.12"
+   "version": "3.10.4"
  }
 },
 "nbformat": 4,
--- a/docs/docs/integrations/retrievers/azure_ai_search.ipynb
+++ b/docs/docs/integrations/retrievers/azure_ai_search.ipynb
@@ -28,9 +28,9 @@
    "\n",
    "### Integration details\n",
    "\n",
-    "| Retriever | Bring your own docs | Self-host | Cloud offering | Package |\n",
-    "| :--- | :--- | :---: | :---: | :---: |\n",
-    "[AzureAISearchRetriever](https://api.python.langchain.com/en/latest/retrievers/langchain_community.retrievers.azure_ai_search.AzureAISearchRetriever.html) | ✅ | ❌ | ✅ | langchain_community.retrievers |\n",
+    "| Retriever | Self-host | Cloud offering | Package |\n",
+    "| :--- | :---: | :---: | :---: |\n",
+    "| [AzureAISearchRetriever](https://api.python.langchain.com/en/latest/retrievers/langchain_community.retrievers.azure_ai_search.AzureAISearchRetriever.html) | ❌ | ✅ | langchain_community |\n",
    "\n",
    "\n",
    "## Setup\n",
--- a/docs/docs/integrations/retrievers/bedrock.ipynb
+++ b/docs/docs/integrations/retrievers/bedrock.ipynb
@@ -29,9 +29,9 @@
    "\n",
    "### Integration details\n",
    "\n",
-    "| Retriever | Bring your own docs | Self-host | Cloud offering | Package |\n",
-    "| :--- | :--- | :---: | :---: | :---: |\n",
-    "[AmazonKnowledgeBasesRetriever](https://api.python.langchain.com/en/latest/retrievers/langchain_aws.retrievers.bedrock.AmazonKnowledgeBasesRetriever.html) | ✅ | ❌ | ✅ | langchain_aws.retrievers |\n"
+    "| Retriever | Self-host | Cloud offering | Package |\n",
+    "| :--- | :---: | :---: | :---: |\n",
+    "| [AmazonKnowledgeBasesRetriever](https://api.python.langchain.com/en/latest/retrievers/langchain_aws.retrievers.bedrock.AmazonKnowledgeBasesRetriever.html) | ❌ | ✅ | langchain_aws |\n"
   ]
  },
  {
--- a/docs/docs/integrations/retrievers/elasticsearch_retriever.ipynb
+++ b/docs/docs/integrations/retrievers/elasticsearch_retriever.ipynb
@@ -26,9 +26,9 @@
    "\n",
    "### Integration details\n",
    "\n",
-    "| Retriever | Bring your own docs | Self-host | Cloud offering | Package |\n",
-    "| :--- | :--- | :---: | :---: | :---: |\n",
-    "[ElasticsearchRetriever](https://api.python.langchain.com/en/latest/retrievers/langchain_elasticsearch.retrievers.ElasticsearchRetriever.html) | ✅ | ✅ | ✅ | langchain_elasticsearch |\n",
+    "| Retriever | Self-host | Cloud offering | Package |\n",
+    "| :--- | :---: | :---: | :---: |\n",
+    "| [ElasticsearchRetriever](https://api.python.langchain.com/en/latest/retrievers/langchain_elasticsearch.retrievers.ElasticsearchRetriever.html) | ✅ | ✅ | langchain_elasticsearch |\n",
    "\n",
    "\n",
    "## Setup\n",
--- a/docs/docs/integrations/retrievers/google_vertex_ai_search.ipynb
+++ b/docs/docs/integrations/retrievers/google_vertex_ai_search.ipynb
@@ -29,9 +29,9 @@
    "\n",
    "### Integration details\n",
    "\n",
-    "| Retriever | Bring your own docs | Self-host | Cloud offering | Package |\n",
-    "| :--- | :--- | :---: | :---: | :---: |\n",
-    "[VertexAISearchRetriever](https://api.python.langchain.com/en/latest/vertex_ai_search/langchain_google_community.vertex_ai_search.VertexAISearchRetriever.html) | ✅ | ❌ | ✅ | langchain_google_community.vertex_ai_search |\n",
+    "| Retriever | Self-host | Cloud offering | Package |\n",
+    "| :--- | :---: | :---: | :---: |\n",
+    "| [VertexAISearchRetriever](https://api.python.langchain.com/en/latest/vertex_ai_search/langchain_google_community.vertex_ai_search.VertexAISearchRetriever.html) | ❌ | ✅ | langchain_google_community |\n",
    "\n",
    "\n",
    "## Setup\n",
--- a/docs/docs/integrations/retrievers/index.mdx
+++ b/docs/docs/integrations/retrievers/index.mdx
@@ -16,14 +16,25 @@ For specifics on how to use retrievers, see the [relevant how-to guides here](/d

 Note that all [vector stores](/docs/concepts/#vector-stores) can be [cast to retrievers](/docs/how_to/vectorstore_retriever/).
 Refer to the vector store [integration docs](/docs/integrations/vectorstores/) for available vector stores.
-This table lists custom retrievers, implemented via subclassing [BaseRetriever](/docs/how_to/custom_retriever/).
+This page lists custom retrievers, implemented via subclassing [BaseRetriever](/docs/how_to/custom_retriever/).

+## Bring-your-own documents

-| Retriever | Bring your own docs | Self-host | Cloud offering | Package |
-|-----------|---------------------|-----------|----------------|---------|
-| [AmazonKnowledgeBasesRetriever](/docs/integrations/retrievers/bedrock) | ✅ | ❌ | ✅ | [langchain_aws](https://api.python.langchain.com/en/latest/retrievers/langchain_aws.retrievers.bedrock.AmazonKnowledgeBasesRetriever.html) |
-| [AzureAISearchRetriever](/docs/integrations/retrievers/azure_ai_search) | ✅ | ❌ | ✅ | [langchain_community](https://api.python.langchain.com/en/latest/retrievers/langchain_community.retrievers.azure_ai_search.AzureAISearchRetriever.html) |
-| [ElasticsearchRetriever](/docs/integrations/retrievers/elasticsearch_retriever) | ✅ | ✅ | ✅ | [langchain_elasticsearch](https://api.python.langchain.com/en/latest/retrievers/langchain_elasticsearch.retrievers.ElasticsearchRetriever.html) |
-| [MilvusCollectionHybridSearchRetriever](/docs/integrations/retrievers/milvus_hybrid_search) | ✅ | ❌ | ✅ | [langchain_milvus](https://api.python.langchain.com/en/latest/retrievers/langchain_milvus.retrievers.milvus_hybrid_search.MilvusCollectionHybridSearchRetriever.html) |
-| [TavilySearchAPIRetriever](/docs/integrations/retrievers/tavily) | ❌ | ❌ | ❌ | [langchain_community](https://api.python.langchain.com/en/latest/retrievers/langchain_community.retrievers.tavily_search_api.TavilySearchAPIRetriever.html) |
-| [VertexAISearchRetriever](/docs/integrations/retrievers/google_vertex_ai_search) | ✅ | ❌ | ✅ | [langchain_google_community](https://api.python.langchain.com/en/latest/vertex_ai_search/langchain_google_community.vertex_ai_search.VertexAISearchRetriever.html) |
+The retrievers below allow you to index and search a custom corpus of documents.
+
+| Retriever | Self-host | Cloud offering | Package |
+|-----------|-----------|----------------|---------|
+| [AmazonKnowledgeBasesRetriever](/docs/integrations/retrievers/bedrock) | ❌ | ✅ | [langchain_aws](https://api.python.langchain.com/en/latest/retrievers/langchain_aws.retrievers.bedrock.AmazonKnowledgeBasesRetriever.html) |
+| [AzureAISearchRetriever](/docs/integrations/retrievers/azure_ai_search) | ❌ | ✅ | [langchain_community](https://api.python.langchain.com/en/latest/retrievers/langchain_community.retrievers.azure_ai_search.AzureAISearchRetriever.html) |
+| [ElasticsearchRetriever](/docs/integrations/retrievers/elasticsearch_retriever) | ✅ | ✅ | [langchain_elasticsearch](https://api.python.langchain.com/en/latest/retrievers/langchain_elasticsearch.retrievers.ElasticsearchRetriever.html) |
+| [MilvusCollectionHybridSearchRetriever](/docs/integrations/retrievers/milvus_hybrid_search) | ✅ | ❌ | [langchain_milvus](https://api.python.langchain.com/en/latest/retrievers/langchain_milvus.retrievers.milvus_hybrid_search.MilvusCollectionHybridSearchRetriever.html) |
+| [VertexAISearchRetriever](/docs/integrations/retrievers/google_vertex_ai_search) | ❌ | ✅ | [langchain_google_community](https://api.python.langchain.com/en/latest/vertex_ai_search/langchain_google_community.vertex_ai_search.VertexAISearchRetriever.html) |
+
+## External index
+
+The retrievers below search over an external index (e.g., constructed from Internet data or similar).
+
+| Retriever | Source | Package |
+|-----------|--------|---------|
+| [ArxivRetriever](/docs/integrations/retrievers/arxiv) | Scholarly articles on [arxiv.org](https://arxiv.org/) | [langchain_community](https://api.python.langchain.com/en/latest/retrievers/langchain_community.retrievers.arxiv.ArxivRetriever.html) |
+| [TavilySearchAPIRetriever](/docs/integrations/retrievers/tavily) | Internet search | [langchain_community](https://api.python.langchain.com/en/latest/retrievers/langchain_community.retrievers.tavily_search_api.TavilySearchAPIRetriever.html) |
--- a/docs/docs/integrations/retrievers/milvus_hybrid_search.ipynb
+++ b/docs/docs/integrations/retrievers/milvus_hybrid_search.ipynb
@@ -25,9 +25,9 @@
    "\n",
    "### Integration details\n",
    "\n",
-    "| Retriever | Bring your own docs | Self-host | Cloud offering | Package |\n",
-    "| :--- | :--- | :---: | :---: | :---: |\n",
-    "[MilvusCollectionHybridSearchRetriever](https://api.python.langchain.com/en/latest/retrievers/langchain_milvus.retrievers.milvus_hybrid_search.MilvusCollectionHybridSearchRetriever.html) | ✅ | ❌ | ✅ | langchain_milvus |\n",
+    "| Retriever | Self-host | Cloud offering | Package |\n",
+    "| :--- | :---: | :---: | :---: |\n",
+    "| [MilvusCollectionHybridSearchRetriever](https://api.python.langchain.com/en/latest/retrievers/langchain_milvus.retrievers.milvus_hybrid_search.MilvusCollectionHybridSearchRetriever.html) | ✅ | ❌ | langchain_milvus |\n",
    "\n",
    "\n",
    "\n",
--- a/docs/docs/integrations/retrievers/tavily.ipynb
+++ b/docs/docs/integrations/retrievers/tavily.ipynb
@@ -22,9 +22,9 @@
    "\n",
    "### Integration details\n",
    "\n",
-    "| Retriever | Bring your own docs | Self-host | Cloud offering | Package |\n",
-    "| :--- | :--- | :---: | :---: | :---: |\n",
-    "[TavilySearchAPIRetriever](https://api.python.langchain.com/en/latest/retrievers/langchain_community.retrievers.tavily_search_api.TavilySearchAPIRetriever.html) | ❌ | ❌ | ❌ | langchain_community.retrievers |\n",
+    "| Retriever | Source | Package |\n",
+    "| :--- | :--- | :---: |\n",
+    "| [TavilySearchAPIRetriever](https://api.python.langchain.com/en/latest/retrievers/langchain_community.retrievers.tavily_search_api.TavilySearchAPIRetriever.html) | Internet search | langchain_community |\n",
    "\n",
    "## Setup"
   ]
--- a/libs/cli/langchain_cli/integration_template/docs/retrievers.ipynb
+++ b/libs/cli/langchain_cli/integration_template/docs/retrievers.ipynb
@@ -24,10 +24,19 @@
    "\n",
    "### Integration details\n",
    "\n",
-    "| Retriever | Bring your own docs | Self-host | Cloud offering | Package |\n",
-    "| :--- | :--- | :---: | :---: | :---: |\n",
-    "[__ModuleName__Retriever](https://api.python.langchain.com/en/latest/retrievers/__package_name__.retrievers.__module_name__.__ModuleName__Retriever.html) | ❌ | ❌ | ❌ | __package_name__ |\n",
+    "TODO: Select one of the tables below, as appropriate.\n",
    "\n",
+    "1: Bring-your-own data (i.e., index and search a custom corpus of documents):\n",
+    "\n",
+    "| Retriever | Self-host | Cloud offering | Package |\n",
+    "| :--- | :---: | :---: | :---: |\n",
+    "| [__ModuleName__Retriever](https://api.python.langchain.com/en/latest/retrievers/__package_name__.retrievers.__module_name__.__ModuleName__Retriever.html) | ❌ | ❌ | __package_name__ |\n",
+    "\n",
+    "2: External index (e.g., constructed from Internet data or similar):\n",
+    "\n",
+    "| Retriever | Source | Package |\n",
+    "| :--- | :--- | :---: |\n",
+    "| [__ModuleName__Retriever](https://api.python.langchain.com/en/latest/retrievers/__package_name__.retrievers.__module_name__.__ModuleName__Retriever.html) | Source description | __package_name__ |\n",
    "\n",
    "## Setup\n",
    "\n",
@@ -124,7 +133,32 @@
   "id": "dfe8aad4-8626-4330-98a9-7ea1ca5d2e0e",
   "metadata": {},
   "source": [
-    "## Use within a chain"
+    "## Use within a chain\n",
+    "\n",
+    "Like other retrievers, __ModuleName__Retriever can be incorporated into LLM applications via [chains](/docs/how_to/sequence/).\n",
+    "\n",
+    "We will need an LLM or chat model:\n",
+    "\n",
+    "```{=mdx}\n",
+    "import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
+    "\n",
+    "<ChatModelTabs customVarName=\"llm\" />\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "25b647a3-f8f2-4541-a289-7a241e43f9df",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# | output: false\n",
+    "# | echo: false\n",
+    "\n",
+    "from langchain_openai import ChatOpenAI\n",
+    "\n",
+    "llm = ChatOpenAI(model=\"gpt-3.5-turbo-0125\", temperature=0)"
   ]
  },
  {
@@ -137,7 +171,6 @@
    "from langchain_core.output_parsers import StrOutputParser\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "from langchain_core.runnables import RunnablePassthrough\n",
-    "from langchain_openai import ChatOpenAI\n",
    "\n",
    "prompt = ChatPromptTemplate.from_template(\n",
    "    \"\"\"Answer the question based only on the context provided.\n",
@@ -147,8 +180,6 @@
    "Question: {question}\"\"\"\n",
    ")\n",
    "\n",
-    "llm = ChatOpenAI(model=\"gpt-3.5-turbo-0125\")\n",
-    "\n",
    "\n",
    "def format_docs(docs):\n",
    "    return \"\\n\\n\".join(doc.page_content for doc in docs)\n",