Update intro

This commit is contained in:
Lance Martin
2023-11-07 08:54:18 -08:00
parent ee30dd0865
commit 21bb184ecf


@@ -5,25 +5,52 @@
"id": "67f41c2c-869d-44c3-b243-a7ba105d504f",
"metadata": {},
"source": [
"# Biomedical RAG DB Introduction\n",
"\n",
"This notebook provides some general context for RAG on different types of documents relevant in biomedical research:\n",
"# Biomedical RAG Introduction\n",
"\n",
"This notebook provides context for RAG on various types of documents that are relevant in biomedical research:\n",
" \n",
"* URLs\n",
"* Academic papers\n",
"* Clinical Trial Data\n",
"* Academic papers (PDF)\n",
"* Clinical Trial Data (JSON)\n",
"\n",
"## Dependencies "
"## Dependencies \n",
"\n",
"### Packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "27507adc-aeb4-4ed0-a4c3-593e03b01cb7",
"id": "aad3785d-4ee3-4675-8fb4-f393d4d0b8c9",
"metadata": {},
"outputs": [],
"source": [
"! pip install langchain unstructured[all-docs] pypdf langchainhub anthropic openai chromadb gpt4all"
"! pip install chromadb tiktoken pypdf langchainhub anthropic openai gpt4all pandas"
]
},
{
"cell_type": "markdown",
"id": "b5e09ba1-a99c-417a-857a-90d081fd7a07",
"metadata": {},
"source": [
"### API keys\n",
"\n",
"Required (to access OpenAI LLMs):\n",
"* `OPENAI_API_KEY`\n",
"\n",
"Optional: [Anthropic](https://docs.anthropic.com/claude/reference/getting-started-with-the-api) (for Anthropic LLMs) and [LangSmith](https://docs.smith.langchain.com/) (for tracing):\n",
"* `ANTHROPIC_API_KEY`\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b17fe580-11fc-4c35-934c-db259e35c969",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"# os.environ['OPENAI_API_KEY'] = 'xxx'"
]
},
{
@@ -61,19 +88,19 @@
"\n",
"#### URLs\n",
"\n",
"We can load from urls or [CSVs](https://python.langchain.com/docs/integrations/document_loaders/csv) (e.g., from [Clinical Trials](https://clinicaltrials.gov/study/NCT04296890)). "
"We can load from [urls](https://python.langchain.com/docs/integrations/document_loaders/url). "
]
},
{
"cell_type": "code",
"execution_count": 24,
"execution_count": 3,
"id": "729327fc-6710-4c2f-8ae5-05c0c1b6ce91",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import WebBaseLoader\n",
"# Define loader for URL\n",
"loader = WebBaseLoader(\"https://lilianweng.github.io/posts/2023-06-23-agent/\")\n",
"loader = WebBaseLoader(\"https://berthub.eu/articles/posts/amazing-dna/\")\n",
"blog = loader.load()"
]
},
@@ -84,30 +111,14 @@
"source": [
"#### JSON\n",
"\n",
"Cliical trial data can be downloaded [as a database](https://classic.clinicaltrials.gov/api/gui/ref/download_all) of JSONs.\n",
"Clinical trial data can be downloaded [as a database](https://classic.clinicaltrials.gov/api/gui/ref/download_all) of JSONs from clinicaltrials.gov.\n",
"\n",
"You can download the full zip file (`~2GB`) with the above link.\n",
" \n",
"We can [load each JSON record](https://python.langchain.com/docs/modules/data_connection/document_loaders/json).\n",
"\n",
"With `extract_metadata`, the parsed JSON data (a dictionary) is provided to the JSONLoader. \n",
"\n",
"Here, it corresponds to the `ProtocolSection` of the study.\n",
"\n",
"The function first attempts to access the `IdentificationModule` within the sample dictionary using the get method. \n",
"\n",
"If IdentificationModule is present, it'll return its value; otherwise, it'll return an empty dictionary ({}).\n",
"\n",
"Next, the function attempts to access NCTId from the previously fetched value. \n",
"\n",
"If NCTId is present, its value is returned; otherwise, None is returned.\n",
"\n",
"We can perform this for each desired metadata field."
"The full zip file is `~2GB` and we can [load each JSON record](https://python.langchain.com/docs/modules/data_connection/document_loaders/json) with `JSONLoader`."
]
},
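To make the record layout concrete, here is a minimal sketch in plain Python of the nesting used by the download (the exact field names are assumptions based on the classic clinicaltrials.gov export; real records contain many more modules):

```python
# Minimal stand-in for one clinical-trial JSON record (field names are
# illustrative of the classic clinicaltrials.gov export, abbreviated here).
record = {
    "FullStudy": {
        "Study": {
            "ProtocolSection": {
                "IdentificationModule": {
                    "NCTId": "NCT04296890",
                    "BriefTitle": "Example study",
                }
            }
        }
    }
}

# JSONLoader's jq_schema walks this same path to select the sub-dictionary
# that becomes the document content.
protocol = record["FullStudy"]["Study"]["ProtocolSection"]
print(protocol["IdentificationModule"]["NCTId"])  # NCT04296890
```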
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 5,
"id": "b7aafb2e-42ba-46f2-879d-44a73bca7c67",
"metadata": {},
"outputs": [],
@@ -131,7 +142,7 @@
"id": "c20f516a-0c44-422f-9fc4-3d13a9854ec4",
"metadata": {},
"source": [
"Let's look at the structure of each record."
"We can look at the structure of each record."
]
},
{
@@ -207,6 +218,28 @@
"print(df.columns)"
]
},
{
"cell_type": "markdown",
"id": "bd7ae843-62dc-45f1-b233-ba44c27ff593",
"metadata": {},
"source": [
"To build a vectorstore, we should think about what we want as `metadata` and what we want to `embed` for semantic retrieval.\n",
"\n",
"`JSONLoader` accepts a field that we can use to supply metadata.\n",
"\n",
"The `jq_schema` selects a dictionary that corresponds to the `ProtocolSection` of the study.\n",
" \n",
"With `extract_metadata`, the function first attempts to access the `IdentificationModule` within the sample dictionary using the `get` method. \n",
"\n",
"If `IdentificationModule` is present, it'll return its value; otherwise, it'll return an empty dictionary (`{}`).\n",
"\n",
"Next, the function attempts to access `NCTId` from the previously fetched value.\n",
"\n",
"If `NCTId` is present, its value is returned; otherwise, `None` is returned.\n",
"\n",
"We can perform this for each desired metadata field."
]
},
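The extraction logic described above can be sketched as a small helper in plain Python (the function name `extract_metadata` follows the text; the record shape and the `BriefTitle` field are assumptions for illustration):

```python
def extract_metadata(record: dict, metadata: dict) -> dict:
    """Pull selected fields from a ProtocolSection dictionary into metadata."""
    # Safely fetch IdentificationModule; fall back to {} if absent.
    identification = record.get("IdentificationModule", {})
    # Safely fetch NCTId; fall back to None if absent.
    metadata["NCTId"] = identification.get("NCTId")
    # Repeat the same pattern for each desired metadata field,
    # e.g. the study title (field name assumed here).
    metadata["BriefTitle"] = identification.get("BriefTitle")
    return metadata
```

A function with this shape can be supplied to `JSONLoader` through its `metadata_func` argument.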
{
"cell_type": "code",
"execution_count": 1,
@@ -262,17 +295,6 @@
"* See list of OpenAI models [here](https://platform.openai.com/docs/models/gpt-3-5)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d0784c5-ad6e-4142-b4ab-0cea0801cb19",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"# os.environ['OPENAI_API_KEY'] = 'xxx'"
]
},
{
"cell_type": "code",
"execution_count": 26,
@@ -300,9 +322,11 @@
"\n",
"# LLM\n",
"from langchain.chat_models import ChatOpenAI\n",
"llm_openai = ChatOpenAI(model=\"gpt-3.5-turbo-16k\",temperature=0)\n",
"llm_openai = ChatOpenAI(model=\"gpt-3.5-turbo-16k\",\n",
" temperature=0)\n",
"\n",
"# Chain\n",
"# The StrOutputParser is a built-in output parser that converts the output of a language model into a string.\n",
"from langchain.schema.output_parser import StrOutputParser\n",
"chain = prompt | llm_openai | StrOutputParser()\n",
"\n",
@@ -343,7 +367,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 6,
"id": "e7a8bbfa-9e85-41f5-a2f9-9ac2b8435ebd",
"metadata": {},
"outputs": [],
@@ -352,16 +376,6 @@
"from langchain.chat_models import ChatAnthropic"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "07fc90c2-70b7-4bda-963a-505934208913",
"metadata": {},
"outputs": [],
"source": [
"# os.environ['ANTHROPIC_API_KEY'] = 'xxx'"
]
},
{
"cell_type": "markdown",
"id": "52e821f8-2a9b-48e6-9fa7-3ca1329d6d56",
@@ -370,6 +384,17 @@
"The summarization prompt is a fun, silly example to highlight the flexibility of Claude2."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2dfe47ad-cb4f-48ae-b18f-fa7d9a62a44a",
"metadata": {},
"outputs": [],
"source": [
"# Prompt from the Hub\n",
"prompt = hub.pull(\"hwchase17/anthropic-paper-qa\")"
]
},
{
"cell_type": "code",
"execution_count": 22,
@@ -388,9 +413,6 @@
}
],
"source": [
"# Prompt from the Hub\n",
"prompt = hub.pull(\"hwchase17/anthropic-paper-qa\")\n",
"\n",
"# LLM\n",
"llm_anthropic = ChatAnthropic(temperature=0, model='claude-2', max_tokens=10000)\n",
"\n",