Update intro

This commit is contained in:
Lance Martin
2023-11-07 08:54:18 -08:00
parent ee30dd0865
commit 21bb184ecf


@@ -5,25 +5,52 @@
"id": "67f41c2c-869d-44c3-b243-a7ba105d504f",
"metadata": {},
"source": [
"# Biomedical RAG DB Introduction\n",
"\n",
"This notebook provides some general context for RAG on different types of documents relevant in biomedical research:\n",
"# Biomedical RAG Introduction\n",
"\n",
"This notebook provides context for RAG on various types of documents that are relevant in biomedical research:\n",
" \n",
"* URLs\n",
"* Academic papers\n",
"* Clinical Trial Data\n",
"* Academic papers (PDF)\n",
"* Clinical Trial Data (JSON)\n",
"\n",
"## Dependencies "
"## Dependencies \n",
"\n",
"### Packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "27507adc-aeb4-4ed0-a4c3-593e03b01cb7",
"id": "aad3785d-4ee3-4675-8fb4-f393d4d0b8c9",
"metadata": {},
"outputs": [],
"source": [
"! pip install langchain unstructured[all-docs] pypdf langchainhub anthropic openai chromadb gpt4all"
"! pip install chromadb tiktoken pypdf langchainhub anthropic openai gpt4all pandas"
]
},
{
"cell_type": "markdown",
"id": "b5e09ba1-a99c-417a-857a-90d081fd7a07",
"metadata": {},
"source": [
"### API keys\n",
"\n",
"Required (to access OpenAI LLMs):\n",
"* `OPENAI_API_KEY`\n",
"\n",
"Optional: [Anthropic](https://docs.anthropic.com/claude/reference/getting-started-with-the-api) (for Anthropic LLMs) and [LangSmith](https://docs.smith.langchain.com/) (for tracing):\n",
"* `ANTHROPIC_API_KEY`\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b17fe580-11fc-4c35-934c-db259e35c969",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"# os.environ['OPENAI_API_KEY'] = 'xxx'"
]
},
{
@@ -61,19 +88,19 @@
"\n",
"#### URLs\n",
"\n",
"We can load from urls or [CSVs](https://python.langchain.com/docs/integrations/document_loaders/csv) (e.g., from [Clinical Trials](https://clinicaltrials.gov/study/NCT04296890)). "
"We can load from [urls](https://python.langchain.com/docs/integrations/document_loaders/url). "
]
},
{
"cell_type": "code",
"execution_count": 24,
"execution_count": 3,
"id": "729327fc-6710-4c2f-8ae5-05c0c1b6ce91",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import WebBaseLoader\n",
"# Define loader for URL\n",
"loader = WebBaseLoader(\"https://lilianweng.github.io/posts/2023-06-23-agent/\")\n",
"loader = WebBaseLoader(\"https://berthub.eu/articles/posts/amazing-dna/\")\n",
"blog = loader.load()"
]
},
@@ -84,30 +111,14 @@
"source": [
"#### JSON\n",
"\n",
"Cliical trial data can be downloaded [as a database](https://classic.clinicaltrials.gov/api/gui/ref/download_all) of JSONs.\n",
"Clinical trial data can be downloaded [as a database](https://classic.clinicaltrials.gov/api/gui/ref/download_all) of JSONs from clinicaltrials.gov.\n",
"\n",
"You can download the full zip file (`~2GB`) with the above link.\n",
" \n",
"We can [load each JSON record](https://python.langchain.com/docs/modules/data_connection/document_loaders/json).\n",
"\n",
"With `extract_metadata`, the parsed JSON data (a dictionary) is provided to the JSONLoader. \n",
"\n",
"Here, it corresponds to the `ProtocolSection` of the study.\n",
"\n",
"The function first attempts to access the `IdentificationModule` within the sample dictionary using the get method. \n",
"\n",
"If IdentificationModule is present, it'll return its value; otherwise, it'll return an empty dictionary ({}).\n",
"\n",
"Next, the function attempts to access NCTId from the previously fetched value. \n",
"\n",
"If NCTId is present, its value is returned; otherwise, None is returned.\n",
"\n",
"We can perform this for each desired metadata field."
"The full zip file is `~2GB` and we can [load each JSON record](https://python.langchain.com/docs/modules/data_connection/document_loaders/json) with `JSONLoader`."
]
},
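To make the record layout concrete, here is a minimal sketch in plain Python of the nesting used by the download (the exact field names are assumptions based on the classic clinicaltrials.gov export; real records contain many more modules):

```python
# Minimal stand-in for one clinical-trial JSON record (field names are
# illustrative of the classic clinicaltrials.gov export, abbreviated here).
record = {
    "FullStudy": {
        "Study": {
            "ProtocolSection": {
                "IdentificationModule": {
                    "NCTId": "NCT04296890",
                    "BriefTitle": "Example study",
                }
            }
        }
    }
}

# JSONLoader's jq_schema walks this same path to select the sub-dictionary
# that becomes the document content.
protocol = record["FullStudy"]["Study"]["ProtocolSection"]
print(protocol["IdentificationModule"]["NCTId"])  # NCT04296890
```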
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 5,
"id": "b7aafb2e-42ba-46f2-879d-44a73bca7c67",
"metadata": {},
"outputs": [],
@@ -131,7 +142,7 @@
"id": "c20f516a-0c44-422f-9fc4-3d13a9854ec4",
"metadata": {},
"source": [
"Let's look at the structure of each record."
"We can look at the structure of each record."
]
},
{
@@ -207,6 +218,28 @@
"print(df.columns)"
]
},
{
"cell_type": "markdown",
"id": "bd7ae843-62dc-45f1-b233-ba44c27ff593",
"metadata": {},
"source": [
"To build a vectorstore, we should think about what we want as `metadata` and what we want to `embed` for semantic retrieval.\n",
"\n",
"`JSONLoader` accepts a field that we can use to supply metadata.\n",
"\n",
"The `jq_schema` selects a dictionary that corresponds to the `ProtocolSection` of the study.\n",
" \n",
"With `extract_metadata`, the function first attempts to access the `IdentificationModule` within the sample dictionary using the `get` method. \n",
"\n",
"If `IdentificationModule` is present, it'll return its value; otherwise, it'll return an empty dictionary (`{}`).\n",
"\n",
"Next, the function attempts to access `NCTId` from the previously fetched value.\n",
"\n",
"If `NCTId` is present, its value is returned; otherwise, `None` is returned.\n",
"\n",
"We can perform this for each desired metadata field."
]
},
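The extraction logic described above can be sketched as a small helper in plain Python (the function name `extract_metadata` follows the text; the record shape and the `BriefTitle` field are assumptions for illustration):

```python
def extract_metadata(record: dict, metadata: dict) -> dict:
    """Pull selected fields from a ProtocolSection dictionary into metadata."""
    # Safely fetch IdentificationModule; fall back to {} if absent.
    identification = record.get("IdentificationModule", {})
    # Safely fetch NCTId; fall back to None if absent.
    metadata["NCTId"] = identification.get("NCTId")
    # Repeat the same pattern for each desired metadata field,
    # e.g. the study title (field name assumed here).
    metadata["BriefTitle"] = identification.get("BriefTitle")
    return metadata
```

A function with this shape can be supplied to `JSONLoader` through its `metadata_func` argument.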
{
"cell_type": "code",
"execution_count": 1,
@@ -262,17 +295,6 @@
"* See list of OpenAI models [here](https://platform.openai.com/docs/models/gpt-3-5)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d0784c5-ad6e-4142-b4ab-0cea0801cb19",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"# os.environ['OPENAI_API_KEY'] = 'xxx'"
]
},
{
"cell_type": "code",
"execution_count": 26,
@@ -300,9 +322,11 @@
"\n",
"# LLM\n",
"from langchain.chat_models import ChatOpenAI\n",
"llm_openai = ChatOpenAI(model=\"gpt-3.5-turbo-16k\",temperature=0)\n",
"llm_openai = ChatOpenAI(model=\"gpt-3.5-turbo-16k\",\n",
" temperature=0)\n",
"\n",
"# Chain\n",
"# The StrOutputParser is a built-in output parser that converts the output of a language model into a string.\n",
"from langchain.schema.output_parser import StrOutputParser\n",
"chain = prompt | llm_openai | StrOutputParser()\n",
"\n",
@@ -343,7 +367,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 6,
"id": "e7a8bbfa-9e85-41f5-a2f9-9ac2b8435ebd",
"metadata": {},
"outputs": [],
@@ -352,16 +376,6 @@
"from langchain.chat_models import ChatAnthropic"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "07fc90c2-70b7-4bda-963a-505934208913",
"metadata": {},
"outputs": [],
"source": [
"# os.environ['ANTHROPIC_API_KEY'] = 'xxx'"
]
},
{
"cell_type": "markdown",
"id": "52e821f8-2a9b-48e6-9fa7-3ca1329d6d56",
@@ -370,6 +384,17 @@
"The summarization prompt is a fun, silly example to highlight the flexibility of Claude2."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2dfe47ad-cb4f-48ae-b18f-fa7d9a62a44a",
"metadata": {},
"outputs": [],
"source": [
"# Prompt from the Hub\n",
"prompt = hub.pull(\"hwchase17/anthropic-paper-qa\")"
]
},
{
"cell_type": "code",
"execution_count": 22,
@@ -388,9 +413,6 @@
}
],
"source": [
"# Prompt from the Hub\n",
"prompt = hub.pull(\"hwchase17/anthropic-paper-qa\")\n",
"\n",
"# LLM\n",
"llm_anthropic = ChatAnthropic(temperature=0, model='claude-2', max_tokens=10000)\n",
"\n",