diff --git a/templates/rag-biomedical/rag_biomedical/biomedical_rag_introduction.ipynb b/templates/rag-biomedical/rag_biomedical/biomedical_rag_introduction.ipynb index 189653fdec2..4202f724892 100644 --- a/templates/rag-biomedical/rag_biomedical/biomedical_rag_introduction.ipynb +++ b/templates/rag-biomedical/rag_biomedical/biomedical_rag_introduction.ipynb @@ -5,25 +5,52 @@ "id": "67f41c2c-869d-44c3-b243-a7ba105d504f", "metadata": {}, "source": [ - "# Biomedical RAG DB Introduction\n", - "\n", - "This notebook provides some general context for RAG on different types of documents relevant in biomedical research:\n", + "# Biomedical RAG Introduction\n", "\n", + "This notebook provides context for RAG on various types of documents relevant to biomedical research:\n", + " \n", "* URLs\n", - "* Academic papers\n", - "* Clinical Trial Data\n", + "* Academic papers (PDF)\n", + "* Clinical Trial Data (JSON)\n", "\n", - "## Dependencies " + "## Dependencies \n", + "\n", + "### Packages" ] }, { "cell_type": "code", "execution_count": null, - "id": "27507adc-aeb4-4ed0-a4c3-593e03b01cb7", + "id": "aad3785d-4ee3-4675-8fb4-f393d4d0b8c9", "metadata": {}, "outputs": [], "source": [ - "! pip install langchain unstructured[all-docs] pypdf langchainhub anthropic openai chromadb gpt4all" + "! 
pip install chromadb tiktoken pypdf langchainhub anthropic openai gpt4all pandas" ] }, { "cell_type": "markdown", "id": "b5e09ba1-a99c-417a-857a-90d081fd7a07", "metadata": {}, "source": [ "### API keys\n", "\n", "Required (to access OpenAI LLMs):\n", "* `OPENAI_API_KEY`\n", "\n", "Optional [Anthropic](https://docs.anthropic.com/claude/reference/getting-started-with-the-api) and [LangSmith](https://docs.smith.langchain.com/) for Anthropic LLMs and tracing:\n", "* `ANTHROPIC_API_KEY` \n" ] }, { "cell_type": "code", "execution_count": null, "id": "b17fe580-11fc-4c35-934c-db259e35c969", "metadata": {}, "outputs": [], "source": [ "import os\n", "# os.environ['OPENAI_API_KEY'] = 'xxx'" ] }, { @@ -61,19 +88,19 @@ "\n", "#### URLs\n", "\n", - "We can load from urls or [CSVs](https://python.langchain.com/docs/integrations/document_loaders/csv) (e.g., from [Clinical Trials](https://clinicaltrials.gov/study/NCT04296890)). " + "We can load from [URLs](https://python.langchain.com/docs/integrations/document_loaders/url). 
" ] }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 3, "id": "729327fc-6710-4c2f-8ae5-05c0c1b6ce91", "metadata": {}, "outputs": [], "source": [ "from langchain.document_loaders import WebBaseLoader\n", "# Define loader for URL\n", - "loader = WebBaseLoader(\"https://lilianweng.github.io/posts/2023-06-23-agent/\")\n", + "loader = WebBaseLoader(\"https://berthub.eu/articles/posts/amazing-dna/\")\n", "blog = loader.load()" ] }, @@ -84,30 +111,14 @@ "source": [ "#### JSON\n", "\n", - "Cliical trial data can be downloaded [as a database](https://classic.clinicaltrials.gov/api/gui/ref/download_all) of JSONs.\n", + "Cliical trial data can be downloaded [as a database](https://classic.clinicaltrials.gov/api/gui/ref/download_all) of JSONs from clinicaltrials.gov.\n", "\n", - "You can download the full zip file (`~2GB`) with the above link.\n", - " \n", - "We can [load each JSON record](https://python.langchain.com/docs/modules/data_connection/document_loaders/json).\n", - "\n", - "With `extract_metadata`, the parsed JSON data (a dictionary) is provided to the JSONLoader. \n", - "\n", - "Here, it corresponds to the `ProtocolSection` of the study.\n", - "\n", - "The function first attempts to access the `IdentificationModule` within the sample dictionary using the get method. \n", - "\n", - "If IdentificationModule is present, it'll return its value; otherwise, it'll return an empty dictionary ({}).\n", - "\n", - "Next, the function attempts to access NCTId from the previously fetched value. \n", - "\n", - "If NCTId is present, its value is returned; otherwise, None is returned.\n", - "\n", - "We can perform this for each desired metadata field." + "The full zip file is `~2GB` and we can [load each JSON record](https://python.langchain.com/docs/modules/data_connection/document_loaders/json) with `JSONLoader`." 
] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 5, "id": "b7aafb2e-42ba-46f2-879d-44a73bca7c67", "metadata": {}, "outputs": [], "source": [ @@ -131,7 +142,7 @@ "id": "c20f516a-0c44-422f-9fc4-3d13a9854ec4", "metadata": {}, "source": [ - "Let's look at the structure of each record." + "We can look at the structure of each record." ] }, { @@ -207,6 +218,28 @@ "print(df.columns)" ] }, + { + "cell_type": "markdown", + "id": "bd7ae843-62dc-45f1-b233-ba44c27ff593", + "metadata": {}, + "source": [ + "To build a vectorstore, we should think about what we want as `metadata` and what we want to `embed` and semantically retrieve.\n", + "\n", + "`JSONLoader` accepts a `metadata_func` argument that we can use to supply metadata.\n", + "\n", + "The `jq_schema` selects the dictionary that corresponds to the `ProtocolSection` of the study.\n", + " \n", + "With `extract_metadata`, the function first attempts to access the `IdentificationModule` within the sample dictionary using the `get` method. \n", + "\n", + "If `IdentificationModule` is present, `get` returns its value; otherwise, it returns an empty dictionary (`{}`).\n", + "\n", + "Next, the function attempts to access `NCTId` from the previously fetched value. \n", + "\n", + "If `NCTId` is present, its value is returned; otherwise, `None` is returned.\n", + "\n", + "We can perform this for each desired metadata field." + ] + }, { "cell_type": "code", "execution_count": 1, @@ -262,17 +295,6 @@ "* See list of OpenAI models [here](https://platform.openai.com/docs/models/gpt-3-5)." 
] }, - { - "cell_type": "code", - "execution_count": null, - "id": "6d0784c5-ad6e-4142-b4ab-0cea0801cb19", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "# os.environ['OPENAI_API_KEY'] = 'xxx'" - ] - }, { "cell_type": "code", "execution_count": 26, @@ -300,9 +322,11 @@ "\n", "# LLM\n", "from langchain.chat_models import ChatOpenAI\n", - "llm_openai = ChatOpenAI(model=\"gpt-3.5-turbo-16k\",temperature=0)\n", + "llm_openai = ChatOpenAI(model=\"gpt-3.5-turbo-16k\",\n", + " temperature=0)\n", "\n", "# Chain\n", + "# StrOutputParser converts the model's chat message output into a plain string.\n", "from langchain.schema.output_parser import StrOutputParser\n", "chain = prompt | llm_openai | StrOutputParser()\n", "\n", @@ -343,7 +367,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 6, "id": "e7a8bbfa-9e85-41f5-a2f9-9ac2b8435ebd", "metadata": {}, "outputs": [], "source": [ "from langchain.chat_models import ChatAnthropic" ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "07fc90c2-70b7-4bda-963a-505934208913", - "metadata": {}, - "outputs": [], - "source": [ - "# os.environ['ANTHROPIC_API_KEY'] = 'xxx'" - ] - }, { "cell_type": "markdown", "id": "52e821f8-2a9b-48e6-9fa7-3ca1329d6d56", "metadata": {}, "source": [ "The summarization prompt is a fun, silly example to highlight the flexibility of Claude2." ] }, + { + "cell_type": "code", + "execution_count": null, + "id": "2dfe47ad-cb4f-48ae-b18f-fa7d9a62a44a", + "metadata": {}, + "outputs": [], + "source": [ + "# Prompt from the Hub\n", + "prompt = hub.pull(\"hwchase17/anthropic-paper-qa\")" + ] + }, { "cell_type": "code", "execution_count": 22, @@ -388,9 +413,6 @@ } ], "source": [ - "# Prompt from the Hub\n", - "prompt = hub.pull(\"hwchase17/anthropic-paper-qa\")\n", - "\n", "# LLM\n", "llm_anthropic = ChatAnthropic(temperature=0, model='claude-2', max_tokens=10000)\n", "\n",
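
The `extract_metadata` behavior described in the new markdown cell above (fetch `IdentificationModule` with `get`, falling back to `{}`; then fetch `NCTId`, falling back to `None`) can be sketched in plain Python. This is a sketch based on the prose, not the notebook's actual cell: the exact record layout is an assumption, and in practice a function like this would be passed to LangChain's `JSONLoader` as `metadata_func`.

```python
# Sketch of the metadata-extraction pattern described in the notebook text.
# The record layout (IdentificationModule -> NCTId inside the ProtocolSection)
# is assumed from the notebook's description of clinicaltrials.gov JSON.

def extract_metadata(record: dict, metadata: dict) -> dict:
    # Returns {} if IdentificationModule is missing, so the next get is safe.
    identification = record.get("IdentificationModule", {})
    # Returns None if NCTId is missing.
    metadata["NCTId"] = identification.get("NCTId")
    # Repeat the same pattern for each desired metadata field.
    return metadata

# A sample record shaped like the ProtocolSection of a study
sample = {"IdentificationModule": {"NCTId": "NCT04296890"}}
print(extract_metadata(sample, {}))  # NCTId populated
print(extract_metadata({}, {}))      # degrades gracefully to None
```

The two-level `get` chain is what keeps the loader robust to partially populated study records: missing modules produce `None` metadata values instead of raising `KeyError`.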