Compare commits

...

7 Commits

Author SHA1 Message Date
Lance Martin
21bb184ecf Update intro 2023-11-07 08:54:18 -08:00
Lance Martin
ee30dd0865 Merge branch 'master' into rlm/biomedical-rag 2023-11-07 06:43:35 -08:00
Lance Martin
75277d14e7 Split ntbks, update README 2023-11-05 12:14:37 -08:00
Lance Martin
b9efa32204 Merge branch 'master' into rlm/biomedical-rag 2023-11-04 20:06:37 -07:00
Lance Martin
265a13584e Create template 2023-11-04 20:04:12 -07:00
Lance Martin
d4e42ecea6 Merge branch 'master' into rlm/biomedical-rag 2023-11-03 15:49:06 -07:00
Lance Martin
6e0cbc18d7 rag on biomedical data, docs 2023-10-31 16:40:08 -07:00
12 changed files with 4232 additions and 0 deletions


@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2023 LangChain, Inc.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


@@ -0,0 +1,76 @@
# rag-biomedical
This template performs RAG over clinical trial data from [ClinicalTrials.gov](https://classic.clinicaltrials.gov/api/gui/ref/download_all).
It builds a vectorstore from a subset of the clinical trial data using `build_db.ipynb`, storing specified metadata fields with each document.
It then uses a [self-query retriever](https://python.langchain.com/docs/integrations/retrievers/self_query/chroma_self_query) to query the vectorstore using these metadata fields as filters.
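To make the idea of a metadata filter concrete, here is a minimal pure-Python sketch of what filtering on metadata does conceptually. It is illustrative only: `filter_by_metadata` and the sample chunks are hypothetical, and the template itself delegates filtering to the vectorstore rather than using code like this.

```python
from typing import Any, Dict, List


def filter_by_metadata(
    records: List[Dict[str, Any]], where: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Keep only records whose metadata matches every key/value pair in `where`."""
    return [
        record
        for record in records
        if all(record["metadata"].get(key) == value for key, value in where.items())
    ]


# Hypothetical chunks; only the metadata field names mirror the template.
chunks = [
    {"text": "chunk a", "metadata": {"NCTId": "NCT00000102", "StudyType": "Interventional"}},
    {"text": "chunk b", "metadata": {"NCTId": "NCT00000104", "StudyType": "Interventional"}},
    {"text": "chunk c", "metadata": {"NCTId": "NCT00000102", "StudyType": "Interventional"}},
]

matches = filter_by_metadata(chunks, {"NCTId": "NCT00000102"})
print([m["text"] for m in matches])
```

A self-query retriever asks an LLM to derive a filter such as `{"NCTId": "NCT00000102"}` from the user's question, so the semantic search only ranks chunks that already match.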
## Database
The vectorstore is created using `build_db.ipynb`.
For more general context on biomedical RAG, see `biomedical_rag_introduction.ipynb`.
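As a rough sketch of how metadata fields might be pulled out of each record, the snippet below mirrors the chained `.get` approach used in the introduction notebook. Whether `build_db.ipynb` does exactly this is an assumption, and the sample record is a hypothetical, heavily trimmed stand-in for a real `ProtocolSection`.

```python
from typing import Any, Dict


def extract_metadata(protocol_section: Dict[str, Any]) -> Dict[str, Any]:
    """Pull a few fields out of a ProtocolSection dict, tolerating missing keys."""
    identification = protocol_section.get("IdentificationModule", {})
    design = protocol_section.get("DesignModule", {})
    phases = design.get("PhaseList", {}).get("Phase")
    return {
        "NCTId": identification.get("NCTId"),
        "StudyType": design.get("StudyType"),
        # Stored as a string so every metadata value is a simple scalar.
        "PhaseList": str(phases),
    }


# Hypothetical, heavily trimmed record for illustration.
sample = {
    "IdentificationModule": {"NCTId": "NCT00000102"},
    "DesignModule": {
        "StudyType": "Interventional",
        "PhaseList": {"Phase": ["Phase 1", "Phase 2"]},
    },
}

print(extract_metadata(sample))
print(extract_metadata({}))
```

Using `.get` with an empty-dict default at each level means a record that lacks a module yields `None` for that field instead of raising a `KeyError`.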
## Environment Setup
Set the `OPENAI_API_KEY` environment variable to access the OpenAI models.
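Before starting the server, it can help to confirm the variable is actually set. The helper below is a small sketch, not part of the template; `missing_env_vars` is a hypothetical name.

```python
import os
from typing import List, Mapping


def missing_env_vars(required: List[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("Environment looks ready.")
```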
## Usage
To use this package, you should first have the LangChain CLI installed:
```shell
pip install -U langchain-cli
```
To create a new LangChain project and install this as the only package, you can do:
```shell
langchain app new my-app --package rag-biomedical
```
If you want to add this to an existing project, you can just run:
```shell
langchain app add rag-biomedical
```
And add the following code to your `server.py` file:
```python
from rag_biomedical import chain as rag_biomedical_chain
add_routes(app, rag_biomedical_chain, path="/rag-biomedical")
```
(Optional) Let's now configure LangSmith.
LangSmith will help us trace, monitor, and debug LangChain applications.
LangSmith is currently in private beta; you can sign up [here](https://smith.langchain.com/).
If you don't have access, you can skip this section.
```shell
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=<your-api-key>
export LANGCHAIN_PROJECT=<your-project> # if not specified, defaults to "default"
```
If you are inside this directory, then you can spin up a LangServe instance directly by:
```shell
langchain serve
```
This will start the FastAPI app with a server running locally at
[http://localhost:8000](http://localhost:8000).
We can see all templates at [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
We can access the playground at [http://127.0.0.1:8000/rag-biomedical/playground](http://127.0.0.1:8000/rag-biomedical/playground)
We can access the template from code with:
```python
from langserve.client import RemoteRunnable
runnable = RemoteRunnable("http://localhost:8000/rag-biomedical")
```
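If `langserve` is not installed on the client side, the route can in principle be called over plain HTTP. The sketch below only builds the request; it assumes LangServe's default `/invoke` endpoint with an `{"input": ...}` body and an `{"output": ...}` response, and that the chain accepts a plain question string. Both are assumptions to check against the running server's `/docs` page. `build_invoke_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_invoke_request(base_url: str, question: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the route's invoke endpoint."""
    payload = json.dumps({"input": question}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/invoke",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_invoke_request(
    "http://localhost:8000/rag-biomedical",
    "What was the focus of trial NCT00000102?",
)
print(request.full_url)

# With the server running, the request could be sent like this
# (the "output" key is an assumption about the response shape):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["output"])
```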

2499
templates/rag-biomedical/poetry.lock generated Normal file

File diff suppressed because it is too large


@@ -0,0 +1,28 @@
[tool.poetry]
name = "rag-biomedical"
version = "0.1.0"
description = ""
authors = [
"Erick Friis <erick@langchain.dev>",
]
readme = "README.md"
[tool.poetry.dependencies]
python = ">=3.8.1,<4.0"
langchain = ">=0.0.325"
openai = ">=0.28.1"
tiktoken = ">=0.5.1"
chromadb = ">=0.4.14"
[tool.poetry.group.dev.dependencies]
langchain-cli = ">=0.0.15"
[tool.langserve]
export_module = "rag_biomedical"
export_attr = "chain"
[build-system]
requires = [
"poetry-core",
]
build-backend = "poetry.core.masonry.api"


@@ -0,0 +1,51 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "681a5d1e",
"metadata": {},
"source": [
"## Run Template\n",
"\n",
"In `server.py`, set -\n",
"```\n",
"add_routes(app, chain_rag_conv, path=\"/rag-biomedical\")\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d774be2a",
"metadata": {},
"outputs": [],
"source": [
"from langserve.client import RemoteRunnable\n",
"\n",
"rag_app = RemoteRunnable(\"http://localhost:8001/rag-biomedical\")\n",
"rag_app.invoke(\"\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,3 @@
from rag_biomedical.chain import chain
__all__ = ["chain"]


@@ -0,0 +1,840 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "67f41c2c-869d-44c3-b243-a7ba105d504f",
"metadata": {},
"source": [
"# Biomedical RAG Introduction\n",
"\n",
"This notebook provides ccontext for RAG on varios types of documents that are relevant in biomedical research:\n",
" \n",
"* URLs\n",
"* Academic papers (PDF)\n",
"* Clinical Trial Data (JSON)\n",
"\n",
"## Dependencies \n",
"\n",
"### Packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aad3785d-4ee3-4675-8fb4-f393d4d0b8c9",
"metadata": {},
"outputs": [],
"source": [
"! pip install chromadb tiktoken pypdf langchainhub anthropic openai gpt4all pandas"
]
},
{
"cell_type": "markdown",
"id": "b5e09ba1-a99c-417a-857a-90d081fd7a07",
"metadata": {},
"source": [
"### API keys\n",
"\n",
"Required (to access OpenAI LLMs)\n",
"* `OPENAI_API_KEY`\n",
"\n",
"Optional [Anthropic](https://docs.anthropic.com/claude/reference/getting-started-with-the-api) and [LangSmith](https://docs.smith.langchain.com/) for Anthropic LLMs and tracing:\n",
"* `ANTHROPIC_API_KEY` \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b17fe580-11fc-4c35-934c-db259e35c969",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"# os.environ['OPENAI_API_KEY'] = 'xxx'"
]
},
{
"cell_type": "markdown",
"id": "7d9b3675-2181-42a0-a491-4f1df69fb763",
"metadata": {},
"source": [
"## Document Loading \n",
"\n",
"### PDFs\n",
"\n",
"We can load academic papers ([e.g., Clinical trials of interest for 2023](https://www.nature.com/articles/s41591-022-02132-3)) via [PDF loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf)."
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "849a98d2-f39a-4715-965b-40168190effd",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import PyPDFLoader\n",
"# Define loader for PDF\n",
"path = \"/Users/rlm/Desktop/GENE-workshop/s41591-022-02132-3.pdf\"\n",
"loader = PyPDFLoader(path)\n",
"pdf_pages = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "3dcbc23e-ad10-4723-be6e-6a184f733439",
"metadata": {},
"source": [
"Explore a broad set of loaders [here](https://python.langchain.com/docs/integrations/document_loaders) and [here](https://integrations.langchain.com/).\n",
"\n",
"#### URLs\n",
"\n",
"We can load from [urls](https://python.langchain.com/docs/integrations/document_loaders/url). "
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "729327fc-6710-4c2f-8ae5-05c0c1b6ce91",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import WebBaseLoader\n",
"# Define loader for URL\n",
"loader = WebBaseLoader(\"https://berthub.eu/articles/posts/amazing-dna/\")\n",
"blog = loader.load()"
]
},
{
"cell_type": "markdown",
"id": "70e80761-2850-4fb2-a86e-5177d25e6954",
"metadata": {},
"source": [
"#### JSON\n",
"\n",
"Cliical trial data can be downloaded [as a database](https://classic.clinicaltrials.gov/api/gui/ref/download_all) of JSONs from clinicaltrials.gov.\n",
"\n",
"The full zip file is `~2GB` and we can [load each JSON record](https://python.langchain.com/docs/modules/data_connection/document_loaders/json) with `JSONLoader`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "b7aafb2e-42ba-46f2-879d-44a73bca7c67",
"metadata": {},
"outputs": [],
"source": [
"# Define the directory path\n",
"import os \n",
"dir_path = \"/Users/rlm/Desktop/Clinical-Trials/AllAPIJSON/NCT0000xxxx\"\n",
"\n",
"# List all files in the directory\n",
"all_files = os.listdir(dir_path)\n",
"\n",
"# Filter only JSON files\n",
"json_files = [file for file in all_files if file.endswith('.json')]\n",
"\n",
"# Sort and select the first 5 JSON files (you can customize the sorting if needed)\n",
"first_5_jsons = sorted(json_files)[:5]"
]
},
{
"cell_type": "markdown",
"id": "c20f516a-0c44-422f-9fc4-3d13a9854ec4",
"metadata": {},
"source": [
"We can look at the structure of each record."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "b07fec7b-f828-461b-a757-907052693874",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['FullStudy.Rank',\n",
" 'FullStudy.Study.ProtocolSection.IdentificationModule.NCTId',\n",
" 'FullStudy.Study.ProtocolSection.IdentificationModule.OrgStudyIdInfo.OrgStudyId',\n",
" 'FullStudy.Study.ProtocolSection.IdentificationModule.SecondaryIdInfoList.SecondaryIdInfo',\n",
" 'FullStudy.Study.ProtocolSection.IdentificationModule.Organization.OrgFullName',\n",
" 'FullStudy.Study.ProtocolSection.IdentificationModule.Organization.OrgClass',\n",
" 'FullStudy.Study.ProtocolSection.IdentificationModule.BriefTitle',\n",
" 'FullStudy.Study.ProtocolSection.StatusModule.StatusVerifiedDate',\n",
" 'FullStudy.Study.ProtocolSection.StatusModule.OverallStatus',\n",
" 'FullStudy.Study.ProtocolSection.StatusModule.ExpandedAccessInfo.HasExpandedAccess',\n",
" 'FullStudy.Study.ProtocolSection.StatusModule.StudyFirstSubmitDate',\n",
" 'FullStudy.Study.ProtocolSection.StatusModule.StudyFirstSubmitQCDate',\n",
" 'FullStudy.Study.ProtocolSection.StatusModule.StudyFirstPostDateStruct.StudyFirstPostDate',\n",
" 'FullStudy.Study.ProtocolSection.StatusModule.StudyFirstPostDateStruct.StudyFirstPostDateType',\n",
" 'FullStudy.Study.ProtocolSection.StatusModule.LastUpdateSubmitDate',\n",
" 'FullStudy.Study.ProtocolSection.StatusModule.LastUpdatePostDateStruct.LastUpdatePostDate',\n",
" 'FullStudy.Study.ProtocolSection.StatusModule.LastUpdatePostDateStruct.LastUpdatePostDateType',\n",
" 'FullStudy.Study.ProtocolSection.SponsorCollaboratorsModule.LeadSponsor.LeadSponsorName',\n",
" 'FullStudy.Study.ProtocolSection.SponsorCollaboratorsModule.LeadSponsor.LeadSponsorClass',\n",
" 'FullStudy.Study.ProtocolSection.DescriptionModule.BriefSummary',\n",
" 'FullStudy.Study.ProtocolSection.DescriptionModule.DetailedDescription',\n",
" 'FullStudy.Study.ProtocolSection.ConditionsModule.ConditionList.Condition',\n",
" 'FullStudy.Study.ProtocolSection.DesignModule.StudyType',\n",
" 'FullStudy.Study.ProtocolSection.DesignModule.PhaseList.Phase',\n",
" 'FullStudy.Study.ProtocolSection.DesignModule.DesignInfo.DesignInterventionModel',\n",
" 'FullStudy.Study.ProtocolSection.DesignModule.DesignInfo.DesignPrimaryPurpose',\n",
" 'FullStudy.Study.ProtocolSection.DesignModule.DesignInfo.DesignMaskingInfo.DesignMasking',\n",
" 'FullStudy.Study.ProtocolSection.ArmsInterventionsModule.InterventionList.Intervention',\n",
" 'FullStudy.Study.ProtocolSection.EligibilityModule.EligibilityCriteria',\n",
" 'FullStudy.Study.ProtocolSection.EligibilityModule.HealthyVolunteers',\n",
" 'FullStudy.Study.ProtocolSection.EligibilityModule.Gender',\n",
" 'FullStudy.Study.ProtocolSection.EligibilityModule.MinimumAge',\n",
" 'FullStudy.Study.ProtocolSection.EligibilityModule.MaximumAge',\n",
" 'FullStudy.Study.ProtocolSection.EligibilityModule.StdAgeList.StdAge',\n",
" 'FullStudy.Study.ProtocolSection.ContactsLocationsModule.LocationList.Location',\n",
" 'FullStudy.Study.DerivedSection.MiscInfoModule.VersionHolder',\n",
" 'FullStudy.Study.DerivedSection.ConditionBrowseModule.ConditionMeshList.ConditionMesh',\n",
" 'FullStudy.Study.DerivedSection.ConditionBrowseModule.ConditionAncestorList.ConditionAncestor',\n",
" 'FullStudy.Study.DerivedSection.ConditionBrowseModule.ConditionBrowseLeafList.ConditionBrowseLeaf',\n",
" 'FullStudy.Study.DerivedSection.ConditionBrowseModule.ConditionBrowseBranchList.ConditionBrowseBranch',\n",
" 'FullStudy.Study.DerivedSection.InterventionBrowseModule.InterventionMeshList.InterventionMesh',\n",
" 'FullStudy.Study.DerivedSection.InterventionBrowseModule.InterventionAncestorList.InterventionAncestor',\n",
" 'FullStudy.Study.DerivedSection.InterventionBrowseModule.InterventionBrowseLeafList.InterventionBrowseLeaf',\n",
" 'FullStudy.Study.DerivedSection.InterventionBrowseModule.InterventionBrowseBranchList.InterventionBrowseBranch'],\n",
" dtype='object')\n"
]
}
],
"source": [
"import json\n",
"import pandas as pd\n",
"\n",
"# Assuming first_5_jsons is a list of file paths.\n",
"with open(dir_path+'/'+first_5_jsons[0], 'r') as f:\n",
" data = json.load(f)\n",
"\n",
"# Normalize the JSON data to a DataFrame\n",
"df = pd.json_normalize(data)\n",
"\n",
"# Print the columns to understand the structure\n",
"print(df.columns)"
]
},
{
"cell_type": "markdown",
"id": "bd7ae843-62dc-45f1-b233-ba44c27ff593",
"metadata": {},
"source": [
"To build a vectostore, we should think about what we want as `metadata` and what we want to `embed` and semantically retrieve.\n",
"\n",
"`JSONLoader` accepts a field that we can use to supply metadata.\n",
"\n",
"The `jq_schema` is a dict dictionary that corresponds to the `ProtocolSection` of the study.\n",
" \n",
"With `extract_metadata`, the function first attempts to access the `IdentificationModule` within the sample dictionary using the `get` method. \n",
"\n",
"If IdentificationModule is present, it'll return its value; otherwise, it'll return an empty dictionary ({}).\n",
"\n",
"Next, the function attempts to access NCTId from the previously fetched value. \n",
"\n",
"If NCTId is present, its value is returned; otherwise, None is returned.\n",
"\n",
"We can perform this for each desired metadata field."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ecaf0a48-50a9-41ca-8ce4-d819263d3ef4",
"metadata": {},
"outputs": [],
"source": [
"from typing import Any, Dict\n",
"from langchain.docstore.document import Document\n",
"from langchain.document_loaders import JSONLoader\n",
"\n",
"def extract_metadata(sample: Dict[str, Any], default_metadata: Dict[str, Any]) -> Dict[str, Any]:\n",
" nctid = sample.get('IdentificationModule', {}).get('NCTId', None)\n",
" study_type = sample.get('DesignModule', {}).get('StudyType', None)\n",
" phase_list = sample.get('DesignModule', {}).get('PhaseList', {}).get('Phase', None)\n",
" metadata = {\n",
" **default_metadata,\n",
" 'NCTId': nctid,\n",
" 'StudyType': study_type,\n",
" 'PhaseList': str(phase_list) # Metadata needs to be str\n",
" }\n",
" return metadata\n",
"\n",
"def load_json(dir: str) -> Document:\n",
" \n",
" loader = JSONLoader(\n",
" file_path=dir,\n",
" jq_schema='.FullStudy.Study.ProtocolSection',\n",
" metadata_func=extract_metadata,\n",
" text_content=False\n",
" )\n",
" \n",
" data = loader.load()\n",
" return data[0]\n",
"\n",
"# Process each of the selected JSON files\n",
"clinical_trial_data_sample = []\n",
"for json_file in first_5_jsons:\n",
" file_path = os.path.join(dir_path, json_file)\n",
" data = load_json(file_path)\n",
" clinical_trial_data_sample.append(data)"
]
},
{
"cell_type": "markdown",
"id": "5d88873e-0449-4248-aa66-a29b0d59cbcb",
"metadata": {},
"source": [
"## Summarization\n",
"\n",
"### Prompt Definition \n",
"\n",
"* See list of OpenAI models [here](https://platform.openai.com/docs/models/gpt-3-5)."
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "f8e41ffd-08ff-49a4-a971-6ec54d61ea63",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"The article discusses 11 clinical trials that are expected to shape medicine in 2023. These trials cover a range of medical conditions, including Parkinson's disease, Alzheimer's disease, ovarian cancer, muscular dystrophy, cervical cancer, weight loss, sleeping sickness, metastatic breast cancer, and COVID-19 vaccination in individuals with HIV. Leading researchers provide insights into the significance and potential outcomes of these trials. The article highlights the challenges faced by the biopharmaceutical industry in 2022, including clinical trial failures and disruptions caused by COVID-19. Despite these challenges, the article emphasizes the potential for new advancements and breakthroughs in medicine in the coming year.\""
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Prompt \n",
"from langchain.prompts import ChatPromptTemplate\n",
"template = \"\"\"Summarize the following context:\n",
"{context}\n",
"\"\"\"\n",
"prompt = ChatPromptTemplate.from_template(template)\n",
"\n",
"# LLM\n",
"from langchain.chat_models import ChatOpenAI\n",
"llm_openai = ChatOpenAI(model=\"gpt-3.5-turbo-16k\",\n",
" temperature=0)\n",
"\n",
"# Chain\n",
"# The StrOutputParser is a built-in output parser that converts the output of a language model into a string.\n",
"from langchain.schema.output_parser import StrOutputParser\n",
"chain = prompt | llm_openai | StrOutputParser()\n",
"\n",
"# Docs\n",
"all_pages = ' '.join([p.page_content for p in pdf_pages])\n",
"\n",
"# Invoke\n",
"chain.invoke({\"context\" : all_pages})"
]
},
{
"cell_type": "markdown",
"id": "639698ba-baad-497e-87d6-747f72e1596d",
"metadata": {},
"source": [
"### LangSmith \n",
"\n",
"* Use LangSmith to view the trace [here](https://smith.langchain.com/public/8fa348b6-3aa5-494f-bb3e-4e61bdc97744/r).\n",
"* Reference on LangSmith: [here](https://docs.smith.langchain.com/)"
]
},
{
"cell_type": "markdown",
"id": "53c14cf0-b15c-40da-af9a-da3fe6c3a129",
"metadata": {},
"source": [
"### Vary the LLM and Prompt\n",
"\n",
"Claude2 has a [100k token context window](https://www.anthropic.com/index/claude-2).\n",
"\n",
"We can also use LangChain hub to explore different summarization prompts.\n",
" \n",
"* [LangChain Docs](https://python.langchain.com/docs/integrations/chat/anthropic)\n",
"* [LangChain Hub](https://blog.langchain.dev/langchain-prompt-hub/)\n",
"* [Review of interesting prompts](https://blog.langchain.dev/the-prompt-landscape/)\n",
"* [Example summarization prompt](https://smith.langchain.com/hub/hwchase17/anthropic-paper-qa?ref=blog.langchain.dev)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e7a8bbfa-9e85-41f5-a2f9-9ac2b8435ebd",
"metadata": {},
"outputs": [],
"source": [
"from langchain import hub\n",
"from langchain.chat_models import ChatAnthropic"
]
},
{
"cell_type": "markdown",
"id": "52e821f8-2a9b-48e6-9fa7-3ca1329d6d56",
"metadata": {},
"source": [
"The summarization prompt is a fun, silly example to highlight the flexibility of Claude2."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2dfe47ad-cb4f-48ae-b18f-fa7d9a62a44a",
"metadata": {},
"outputs": [],
"source": [
"# Prompt from the Hub\n",
"prompt = hub.pull(\"hwchase17/anthropic-paper-qa\")"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "d29f57da-2369-4ab0-9ca2-b0d52ba7af05",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"' <kindergarten_abstract>\\nThe doctors did a bunch of studies to see what medicines and tests work best. They looked at medicines for Parkinson\\'s, Alzheimer\\'s, and other diseases. They also looked at tests for cancer and COVID vaccines. The results will help doctors treat patients better next year.\\n</kindergarten_abstract>\\n\\n<moosewood_methods>\\nIngredients:\\n- 11 leading medical experts\\n- A dash of speculation\\n- A pinch of prognostication\\n\\nInstructions:\\n1. Gather the experts in a room with coffee and pastries. Make sure they are well-caffeinated. \\n2. Ask each expert to name one clinical trial they are most excited about in 2023. Let them ramble on for a bit about the details. Take notes.\\n3. Stir the trials together and look for common themes. Do the trials focus on new drug treatments, improved screening methods, or preventative measures? Categorize accordingly. \\n4. Sprinkle in some educated guesses about when results will be announced and potential impacts on medical practice.\\n5. Bake at 350 degrees for 1 news article. Let cool before serving.\\n</moosewood_methods>\\n\\n<homer_results>\\nOh Muses, sing of trials to come, \\nThat shape the fate of mortal men,\\nNew cures for ills that plague us now, \\nAnd tests to find disease again.\\n\\nOn treatments for the shaking palsied,\\nAnd medicines to still the mind, \\nThe wise doctors have pondered long,\\nHopeful results they soon shall find.\\n\\nOf cancer screens and vaccines too,\\nTo keep dread viruses at bay,\\nThe trials promise better paths, \\nTo health and wellness show the way.\\n\\nThough challenges remain ahead,\\nScience marches steadily on,\\nGuided by patient volunteers,\\nWho progress help us carry on.\\n</homer_results>\\n\\n<grouchy_critique>\\nHmph, another puff piece about \"exciting\" clinical trials. The authors interviewed a bunch of optimistic \"experts\" who picked their favorite trials without any critical assessment. 
Of course they\\'re excited - they probably run the trials or have a stake in the treatments! But most early stage trials fail, and positive results often don\\'t replicate. Wake me up when phase 3 results are published in reputable journals, not press releases. This is just hype and speculation masquerading as news. Do some actual investigative reporting on research quality, instead of asking researchers to promote their own work! Bah!\\n</grouchy_critique>'"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# LLM\n",
"llm_anthropic = ChatAnthropic(temperature=0, model='claude-2', max_tokens=10000)\n",
"\n",
"# Chain\n",
"chain = prompt | llm_anthropic | StrOutputParser()\n",
"\n",
"# Invoke the chain\n",
"chain.invoke({\"text\" : all_pages})"
]
},
{
"cell_type": "markdown",
"id": "39a42165-57a4-43c6-bed3-bf2555c39f80",
"metadata": {},
"source": [
"## RAG \n",
"\n",
"We may want to perform question-answering based on document context. \n",
"\n",
"### PDF\n",
"\n",
"We can apply this to the PDF."
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "d4eeeea4-d934-4a4f-9188-21f418185311",
"metadata": {},
"outputs": [],
"source": [
"# Split documents\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)\n",
"all_splits = text_splitter.split_documents(pdf_pages)\n",
"\n",
"# Embed and add to vectorDB\n",
"from langchain.embeddings import OpenAIEmbeddings\n",
"from langchain.vectorstores import Chroma\n",
"vectorstore = Chroma.from_documents(\n",
" documents=all_splits,\n",
" collection_name=\"rag\",\n",
" embedding=OpenAIEmbeddings(),\n",
")\n",
"retriever = vectorstore.as_retriever()\n",
"\n",
"# Prompt\n",
"template = \"\"\"Answer the question based only on the following context:\n",
"{context}\n",
"\n",
"Question: {question}\n",
"\"\"\"\n",
"rag_prompt = ChatPromptTemplate.from_template(template)\n",
"\n",
"# RAG chain\n",
"from langchain.schema.runnable import RunnableParallel, RunnablePassthrough\n",
"chain = (\n",
" RunnableParallel({\"context\": retriever, \"question\": RunnablePassthrough()})\n",
" | rag_prompt\n",
" | llm_openai\n",
" | StrOutputParser()\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "976710f7-ebfc-41c3-8525-323315b2ab35",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Based on the given context, one example of a clinical trial focused on cancer is the trial for mirvetuximab soravtan-sine from ImmunoGen for platinum-resistant ovarian cancer.'"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain.invoke(\"What are some exmaple clincal trials that are focused on cancer?\")"
]
},
{
"cell_type": "markdown",
"id": "5034e7c8-7dbb-4467-a3f7-14ffa2cfeab5",
"metadata": {},
"source": [
"Look at LangSmith trace [here](https://smith.langchain.com/public/c87b797b-78ef-42de-a5f3-3986af379943/r)."
]
},
{
"cell_type": "markdown",
"id": "15e3b622-086e-4d16-891c-f05a41c27cfc",
"metadata": {},
"source": [
"## Private RAG \n",
"\n",
"We may want to perform question-answering based on document context without passing anything to external APIs.\n",
"\n",
"We can use [Ollama.ai](https://ollama.ai/).\n",
"\n",
"Download the app, and then pull your LLM of choice:\n",
"\n",
"e.g., `ollama pull zephyr` for [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha), a fine-tuned LLM on Mistral.\n",
"\n",
"Also, we will use local embeddings from GPT4All (CPU-optimized BERT embeddings)."
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "b1ed0d54-ad4a-4a55-9858-0777ee5274f2",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOllama\n",
"from langchain.embeddings import GPT4AllEmbeddings"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "c8381170-0e30-40bc-a92c-303e7bdea0c8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Found model file at /Users/rlm/.cache/gpt4all/ggml-all-MiniLM-L6-v2-f16.bin\n"
]
}
],
"source": [
"# Add to vectorDB\n",
"vectorstore_private = Chroma.from_documents(\n",
" documents=all_splits,\n",
" collection_name=\"rag-private\",\n",
" embedding=GPT4AllEmbeddings(),\n",
")\n",
"retriever_private = vectorstore_private.as_retriever()\n",
"\n",
"# LLM\n",
"ollama_llm = \"zephyr\"\n",
"llm_private = ChatOllama(model=ollama_llm)\n",
"\n",
"# RAG chain\n",
"from langchain.schema.runnable import RunnableParallel, RunnablePassthrough\n",
"chain_private = (\n",
" RunnableParallel({\"context\": retriever, \"question\": RunnablePassthrough()})\n",
" | rag_prompt\n",
" | llm_private\n",
" | StrOutputParser()\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "afaa3d1e-a94f-41bc-bd60-b8c36adae219",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Two examples of clinical trials that are focused on cancer mentioned in the given context are:\\n1. Mirvetuximab soravtan-sin, a antibody-drug conjugate (ADC) for ovarian cancer. This trial resulted in accelerated approval by the US Food and Drug Administration based on results from a single-arm study enrolling 106 patients with platinum-resistant ovarian cancer whose tumors had high expression of a protein called folate receptor alpha (FRA). The author, Robert L. Coleman, expects this to be the most imminent and important upcoming trial result in his field in 2023.\\n2. ADCs for previously treated cervical cancer, which are currently in development. Successful approval of these trials will provide a solid framework for clinical trials evaluating novel combinations in several disease settings.\\nNote: The first example provided is specifically for ovarian cancer, while the second example is more general and encompasses other types of cancer as well.'"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain_private.invoke(\"What are some exmaple clincal trials that are focused on cancer?\")"
]
},
{
"cell_type": "markdown",
"id": "9e18308f-2c96-4ee1-a60f-b719252b291d",
"metadata": {},
"source": [
"### Clinical Trial Data\n",
"\n",
"We can apply this to the extracted clinical trial data."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a8927509-9fc1-4ab5-9d35-ce11751f719a",
"metadata": {},
"outputs": [],
"source": [
"# Split documents\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=0)\n",
"all_splits = text_splitter.split_documents(clinical_trial_data_sample)"
]
},
{
"cell_type": "markdown",
"id": "9d38a3c3-891c-46fe-b49f-1def73f3c5f6",
"metadata": {},
"source": [
"We preserve the metadata with each split."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "83934e88-568e-4ba9-8cfe-cf16464f714b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'source': '/Users/rlm/Desktop/GENE-workshop/AllAPIJSON/NCT0000xxxx/NCT00000102.json',\n",
" 'seq_num': 1,\n",
" 'NCTId': 'NCT00000102',\n",
" 'StudyType': 'Interventional',\n",
" 'PhaseList': \"['Phase 1', 'Phase 2']\"}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_splits[0].metadata"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "e1346f7e-e7cd-4d66-b818-c421b2cec7aa",
"metadata": {},
"outputs": [],
"source": [
"# Embed and add to vectorDB\n",
"from langchain.vectorstores import Chroma\n",
"from langchain.embeddings import OpenAIEmbeddings\n",
"vectorstore = Chroma.from_documents(\n",
" documents=all_splits,\n",
" collection_name=\"rag-biomedical\",\n",
" persist_directory='./vectorstore',\n",
" embedding=OpenAIEmbeddings(),\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "672c7d03-7284-4088-b96b-0fe22b670eff",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['NCT00000104', 'NCT00000104', 'NCT00000105', 'NCT00000105']"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Test\n",
"retriever = vectorstore.as_retriever()\n",
"docs = retriever.get_relevant_documents(\"What was the focus of trial NCT00000102?\")\n",
"[doc.metadata[\"NCTId\"] for doc in docs]"
]
},
{
"cell_type": "markdown",
"id": "7ec75d10-7e9d-4394-aab5-2cda268b5887",
"metadata": {},
"source": [
"We get a mix of results.\n",
"\n",
"[Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma) allows for metadata filtering. \n",
"\n",
"* [LangChain guide](https://python.langchain.com/docs/integrations/vectorstores/chroma)\n",
"* [Chroma guide](https://docs.trychroma.com/usage-guide#filtering-by-metadata)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "48839cd8-bfb3-4405-94b1-f75de7b4ddb6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs = vectorstore.get(where={\"NCTId\": \"NCT00000102\"})\n",
"len(docs)"
]
},
{
"cell_type": "markdown",
"id": "328e15c7-004b-46c7-b359-116e6d9f0eae",
"metadata": {},
"source": [
"We can build a retriever that reasons about metadata filter(s) from the user question."
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "c4d9206e-0ba2-4c94-99c9-0c3b19812a7c",
"metadata": {},
"outputs": [],
"source": [
"from langchain.llms import OpenAI\n",
"from langchain.retrievers.self_query.base import SelfQueryRetriever\n",
"from langchain.chains.query_constructor.base import AttributeInfo\n",
"\n",
"# Provide context about the metadata\n",
"metadata_field_info = [\n",
" \n",
" AttributeInfo(\n",
" name=\"NCTId\",\n",
" description=\"The unique identifier assigned to each clinical trial when registered on ClinicalTrials.gov. \",\n",
" type=\"string\",\n",
" ),\n",
" AttributeInfo(\n",
" name=\"StudyType\",\n",
" description=\"The nature of the study, indicating whether participants receive specific interventions or are merely observed for specific outcomes.\",\n",
" type=\"string\",\n",
" ),\n",
" AttributeInfo(\n",
" name=\"PhaseList\",\n",
" description=\"This pertains to the phase of the study in drug trials.\",\n",
" type=\"string\",\n",
" )\n",
"]\n",
"\n",
"# Overall context for the data\n",
"document_content_description = \"Information about clinical trial on ClinicalTrials.gov\"\n",
"\n",
"# LLM\n",
"llm = OpenAI(temperature=0,)\n",
"\n",
"# Retriever\n",
"retriever_self_query = SelfQueryRetriever.from_llm(\n",
" llm, \n",
" vectorstore, \n",
" document_content_description, \n",
" metadata_field_info, \n",
" verbose=True\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "a446e627-f6b1-4911-9722-154c7c24a906",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['NCT00000102', 'NCT00000102', 'NCT00000102', 'NCT00000102']"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs = retriever_self_query.get_relevant_documents(\"What was the focus of trial NCT00000102?\")\n",
"[doc.metadata[\"NCTId\"] for doc in docs]"
]
},
{
"cell_type": "markdown",
"id": "48803980-2ffc-4ba8-a5e5-72b6e44295eb",
"metadata": {},
"source": [
"## Other Resources\n",
"\n",
"* [Semi-Structured RAG](https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_Structured_RAG.ipynb)\n",
"* [Multi-Modal RAG](https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_structured_and_multi_modal_RAG.ipynb)\n",
"* [Plate Reader Template](https://github.com/langchain-ai/langchain/tree/master/templates/plate-chain)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
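
A note on the `vectorstore.get(where=...)` cell in the notebook above: the build notebook indexes the same kind of result with `docs['metadatas']`, which suggests that `get` returns a dictionary of parallel lists and not a list of documents. If so, `len(docs)` counts the dictionary's keys, not the matching chunks. The sketch below uses a hand-built stand-in dictionary to show the difference. Only the `metadatas` key is confirmed by the notebooks; the other key names and all values are assumptions for illustration, not captured Chroma output.

```python
# Stand-in for the dictionary returned by `vectorstore.get(...)`.
# Assumption: only the `metadatas` key is confirmed by the notebooks; the other
# keys and every value here are illustrative.
docs = {
    "ids": ["a", "b", "c"],
    "embeddings": None,
    "documents": ["chunk 1", "chunk 2", "chunk 3"],
    "metadatas": [{"NCTId": "NCT00000102"}] * 3,
}

num_keys = len(docs)                  # counts dictionary keys
num_matches = len(docs["metadatas"])  # counts matching chunks

print(num_keys, num_matches)
```

Under that assumption, counting `docs["metadatas"]` (or `docs["ids"]`) is the more direct way to check how many chunks a filter matched.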


@@ -0,0 +1,453 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "dea8122d-2dce-4ca1-9115-07df0ecc039d",
"metadata": {},
"source": [
"# RAG on clinical trial data\n",
"\n",
"ClinicalTrials.gov is a publicly accessible database maintained by the U.S. National Library of Medicine (NLM) at the National Institutes of Health (NIH). It provides information on both privately and publicly funded clinical studies conducted around the world. We can download the database of clinical trials [here](https://classic.clinicaltrials.gov/api/gui/ref/download_all).\n",
" \n",
"Each trial is captured as a JSON file, with a mixture of useful semantic context (trials description, etc) as well as trial metadata (such as the trial ID `NCTId` and phases). This data is well suited to semantic search with metadata filtering, with some fields indexed for semantic search and others for `metadata filtering`.\n",
" \n",
"We wil use JSONLoader (docs [here](https://python.langchain.com/docs/modules/data_connection/document_loaders/json)) to load the JSON data and specify metadata fields by passing the `extract_metadata` function to `metadata_func`.\n",
"\n",
"## Data Loding\n",
"\n",
"Download the database of clinical trials [here](https://classic.clinicaltrials.gov/api/gui/ref/download_all).\n",
"\n",
"There are `471,942` JSON files in the downloaded directory `AllAPIJSON` and its subdirectories.\n",
"\n",
"Let's grab 5 from one of the sub-directories as a sample."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7bd333b7-4cb0-4e99-9940-b2eb8b889e2f",
"metadata": {},
"outputs": [],
"source": [
"# Define the directory path, selecting one of the sub-directories to read from\n",
"import os \n",
"path_to_db = \"/Users/rlm/Desktop/Clinical-Trials/AllAPIJSON/NCT0000xxxx\"\n",
"\n",
"# List all files in the directory\n",
"all_files = os.listdir(path_to_db)\n",
"\n",
"# Filter only JSON files\n",
"json_files = [file for file in all_files if file.endswith('.json')]\n",
"\n",
"# Sort and select the first 5 JSON files (you can customize the sorting if needed)\n",
"first_5_jsons = sorted(json_files)[:5]"
]
},
{
"cell_type": "markdown",
"id": "431508bb-3cfd-4f5a-9582-3eaed191a811",
"metadata": {},
"source": [
"Let's look at the structure of each record."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "b64d3979-21dd-4053-90a6-36f14a8fc7fe",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['FullStudy.Rank',\n",
" 'FullStudy.Study.ProtocolSection.IdentificationModule.NCTId',\n",
" 'FullStudy.Study.ProtocolSection.IdentificationModule.OrgStudyIdInfo.OrgStudyId',\n",
" 'FullStudy.Study.ProtocolSection.IdentificationModule.SecondaryIdInfoList.SecondaryIdInfo',\n",
" 'FullStudy.Study.ProtocolSection.IdentificationModule.Organization.OrgFullName',\n",
" 'FullStudy.Study.ProtocolSection.IdentificationModule.Organization.OrgClass',\n",
" 'FullStudy.Study.ProtocolSection.IdentificationModule.BriefTitle',\n",
" 'FullStudy.Study.ProtocolSection.StatusModule.StatusVerifiedDate',\n",
" 'FullStudy.Study.ProtocolSection.StatusModule.OverallStatus',\n",
" 'FullStudy.Study.ProtocolSection.StatusModule.ExpandedAccessInfo.HasExpandedAccess',\n",
" 'FullStudy.Study.ProtocolSection.StatusModule.StudyFirstSubmitDate',\n",
" 'FullStudy.Study.ProtocolSection.StatusModule.StudyFirstSubmitQCDate',\n",
" 'FullStudy.Study.ProtocolSection.StatusModule.StudyFirstPostDateStruct.StudyFirstPostDate',\n",
" 'FullStudy.Study.ProtocolSection.StatusModule.StudyFirstPostDateStruct.StudyFirstPostDateType',\n",
" 'FullStudy.Study.ProtocolSection.StatusModule.LastUpdateSubmitDate',\n",
" 'FullStudy.Study.ProtocolSection.StatusModule.LastUpdatePostDateStruct.LastUpdatePostDate',\n",
" 'FullStudy.Study.ProtocolSection.StatusModule.LastUpdatePostDateStruct.LastUpdatePostDateType',\n",
" 'FullStudy.Study.ProtocolSection.SponsorCollaboratorsModule.LeadSponsor.LeadSponsorName',\n",
" 'FullStudy.Study.ProtocolSection.SponsorCollaboratorsModule.LeadSponsor.LeadSponsorClass',\n",
" 'FullStudy.Study.ProtocolSection.DescriptionModule.BriefSummary',\n",
" 'FullStudy.Study.ProtocolSection.DescriptionModule.DetailedDescription',\n",
" 'FullStudy.Study.ProtocolSection.ConditionsModule.ConditionList.Condition',\n",
" 'FullStudy.Study.ProtocolSection.DesignModule.StudyType',\n",
" 'FullStudy.Study.ProtocolSection.DesignModule.PhaseList.Phase',\n",
" 'FullStudy.Study.ProtocolSection.DesignModule.DesignInfo.DesignInterventionModel',\n",
" 'FullStudy.Study.ProtocolSection.DesignModule.DesignInfo.DesignPrimaryPurpose',\n",
" 'FullStudy.Study.ProtocolSection.DesignModule.DesignInfo.DesignMaskingInfo.DesignMasking',\n",
" 'FullStudy.Study.ProtocolSection.ArmsInterventionsModule.InterventionList.Intervention',\n",
" 'FullStudy.Study.ProtocolSection.EligibilityModule.EligibilityCriteria',\n",
" 'FullStudy.Study.ProtocolSection.EligibilityModule.HealthyVolunteers',\n",
" 'FullStudy.Study.ProtocolSection.EligibilityModule.Gender',\n",
" 'FullStudy.Study.ProtocolSection.EligibilityModule.MinimumAge',\n",
" 'FullStudy.Study.ProtocolSection.EligibilityModule.MaximumAge',\n",
" 'FullStudy.Study.ProtocolSection.EligibilityModule.StdAgeList.StdAge',\n",
" 'FullStudy.Study.ProtocolSection.ContactsLocationsModule.LocationList.Location',\n",
" 'FullStudy.Study.DerivedSection.MiscInfoModule.VersionHolder',\n",
" 'FullStudy.Study.DerivedSection.ConditionBrowseModule.ConditionMeshList.ConditionMesh',\n",
" 'FullStudy.Study.DerivedSection.ConditionBrowseModule.ConditionAncestorList.ConditionAncestor',\n",
" 'FullStudy.Study.DerivedSection.ConditionBrowseModule.ConditionBrowseLeafList.ConditionBrowseLeaf',\n",
" 'FullStudy.Study.DerivedSection.ConditionBrowseModule.ConditionBrowseBranchList.ConditionBrowseBranch',\n",
" 'FullStudy.Study.DerivedSection.InterventionBrowseModule.InterventionMeshList.InterventionMesh',\n",
" 'FullStudy.Study.DerivedSection.InterventionBrowseModule.InterventionAncestorList.InterventionAncestor',\n",
" 'FullStudy.Study.DerivedSection.InterventionBrowseModule.InterventionBrowseLeafList.InterventionBrowseLeaf',\n",
" 'FullStudy.Study.DerivedSection.InterventionBrowseModule.InterventionBrowseBranchList.InterventionBrowseBranch'],\n",
" dtype='object')\n"
]
}
],
"source": [
"import json\n",
"import pandas as pd\n",
"\n",
"# Assuming first_5_jsons is a list of file paths.\n",
"with open(path_to_db+'/'+first_5_jsons[0], 'r') as f:\n",
" data = json.load(f)\n",
"\n",
"# Normalize the JSON data to a DataFrame\n",
"df = pd.json_normalize(data)\n",
"\n",
"# Print the columns to understand the structure\n",
"print(df.columns)"
]
},
{
"cell_type": "markdown",
"id": "6ff6b543-f0f1-4f6d-9aa1-e7bc70d3050a",
"metadata": {},
"source": [
"We can [load each JSON record](https://python.langchain.com/docs/modules/data_connection/document_loaders/json).\n",
"\n",
"With `extract_metadata`, the parsed JSON data (a dictionary) is provided to the JSONLoader. \n",
"\n",
"Here, it corresponds to the `ProtocolSection` of the study.\n",
"\n",
"The function first attempts to access the `IdentificationModule` within the sample dictionary using the get method. \n",
"\n",
"If IdentificationModule is present, it'll return its value; otherwise, it'll return an empty dictionary ({}).\n",
"\n",
"Next, the function attempts to access NCTId from the previously fetched value. \n",
"\n",
"If NCTId is present, its value is returned; otherwise, None is returned.\n",
"\n",
"We can perform this for each desired metadata field."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "5014c010-c638-48d8-bd77-7a4106831323",
"metadata": {},
"outputs": [],
"source": [
"from typing import Any, Dict\n",
"from langchain.docstore.document import Document\n",
"from langchain.document_loaders import JSONLoader\n",
"\n",
"def extract_metadata(sample: Dict[str, Any], default_metadata: Dict[str, Any]) -> Dict[str, Any]:\n",
" nctid = sample.get('IdentificationModule', {}).get('NCTId', None)\n",
" study_type = sample.get('DesignModule', {}).get('StudyType', None)\n",
" phase_list = sample.get('DesignModule', {}).get('PhaseList', {}).get('Phase', None)\n",
" metadata = {\n",
" **default_metadata,\n",
" 'NCTId': nctid,\n",
" 'StudyType': study_type,\n",
" 'PhaseList': str(phase_list) # Metadata needs to be str\n",
" }\n",
" return metadata\n",
"\n",
"def load_json(dir: str) -> Document:\n",
" \n",
" loader = JSONLoader(\n",
" file_path=dir,\n",
" jq_schema='.FullStudy.Study.ProtocolSection',\n",
" metadata_func=extract_metadata,\n",
" text_content=False\n",
" )\n",
" \n",
" data = loader.load()\n",
" return data[0]"
]
},
{
"cell_type": "markdown",
"id": "a3dc5c64-7050-422e-b6e0-86bad0ae93a0",
"metadata": {},
"source": [
"Read the JSON objects."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "f611e119-e835-44bb-b36a-f3c50f3f858d",
"metadata": {},
"outputs": [],
"source": [
"# Process each of the selected JSON files\n",
"clinical_trial_data_sample = []\n",
"for json_file in first_5_jsons:\n",
" file_path = os.path.join(dir_path, json_file)\n",
" data = load_json(file_path)\n",
" clinical_trial_data_sample.append(data)"
]
},
{
"cell_type": "markdown",
"id": "5ad4e48e-550a-4296-af54-47bc5f99228c",
"metadata": {},
"source": [
"Split documents for embedding and vectorstore."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "9bcec5de-1bf8-420a-9e3c-c2fee442dd24",
"metadata": {},
"outputs": [],
"source": [
"# Split documents\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=0)\n",
"all_splits = text_splitter.split_documents(clinical_trial_data_sample)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f4223b6e-6879-4b33-acdf-bd124e77d1f5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'source': '/Users/rlm/Desktop/GENE-workshop/AllAPIJSON/NCT0000xxxx/NCT00000102.json',\n",
" 'seq_num': 1,\n",
" 'NCTId': 'NCT00000102',\n",
" 'StudyType': 'Interventional',\n",
" 'PhaseList': \"['Phase 1', 'Phase 2']\"}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_splits[0].metadata"
]
},
{
"cell_type": "markdown",
"id": "8ed2cda2-8759-45b7-b1d5-8ce7ac739f9d",
"metadata": {},
"source": [
"Embed and add to vectorDB.\n",
"\n",
"We save this vectorstore to the directory for future use."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "3e7c6c12-d785-43d4-aecf-fe9743aa1f8c",
"metadata": {},
"outputs": [],
"source": [
"from langchain.vectorstores import Chroma\n",
"from langchain.embeddings import OpenAIEmbeddings\n",
"vectorstore = Chroma.from_documents(\n",
" documents=all_splits,\n",
" collection_name=\"rag-biomedical\",\n",
" persist_directory='./vectorstore',\n",
" embedding=OpenAIEmbeddings(),\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "c2e6d62e-c591-430f-84f2-a9251a6db780",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['NCT00000104', 'NCT00000105', 'NCT00000102', 'NCT00000107']"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Test\n",
"retriever = vectorstore.as_retriever()\n",
"docs = retriever.get_relevant_documents(\"What was the focus of trial NCT00000102?\")\n",
"[doc.metadata[\"NCTId\"] for doc in docs]"
]
},
{
"cell_type": "markdown",
"id": "9176cff6-6dcf-4c76-8bf2-ebd700526fe1",
"metadata": {},
"source": [
"We get a mix of results.\n",
"\n",
"[Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma) allows for metadata filtering. \n",
"\n",
"* [LangChain guide](https://python.langchain.com/docs/integrations/vectorstores/chroma)\n",
"* [Chroma guide](https://docs.trychroma.com/usage-guide#filtering-by-metadata)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "81730b9a-4e44-4d0f-8137-584d68808db5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'NCTId': 'NCT00000102',\n",
" 'PhaseList': \"['Phase 1', 'Phase 2']\",\n",
" 'StudyType': 'Interventional',\n",
" 'seq_num': 1,\n",
" 'source': '/Users/rlm/Desktop/GENE-workshop/AllAPIJSON/NCT0000xxxx/NCT00000102.json'},\n",
" {'NCTId': 'NCT00000102',\n",
" 'PhaseList': \"['Phase 1', 'Phase 2']\",\n",
" 'StudyType': 'Interventional',\n",
" 'seq_num': 1,\n",
" 'source': '/Users/rlm/Desktop/GENE-workshop/AllAPIJSON/NCT0000xxxx/NCT00000102.json'},\n",
" {'NCTId': 'NCT00000102',\n",
" 'PhaseList': \"['Phase 1', 'Phase 2']\",\n",
" 'StudyType': 'Interventional',\n",
" 'seq_num': 1,\n",
" 'source': '/Users/rlm/Desktop/GENE-workshop/AllAPIJSON/NCT0000xxxx/NCT00000102.json'}]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs = vectorstore.get(where={\"NCTId\": \"NCT00000102\"})\n",
"docs['metadatas']"
]
},
{
"cell_type": "markdown",
"id": "62bb8e32-6a41-4808-b729-6394929e5b80",
"metadata": {},
"source": [
"We can build a retriever that reasons about metadata filter(s) from the user question."
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "e48d3eee-53bc-4854-b868-e20bf5c6a397",
"metadata": {},
"outputs": [],
"source": [
"from langchain.llms import OpenAI\n",
"from langchain.retrievers.self_query.base import SelfQueryRetriever\n",
"from langchain.chains.query_constructor.base import AttributeInfo\n",
"\n",
"# Provide context about the metadata\n",
"metadata_field_info = [\n",
" \n",
" AttributeInfo(\n",
" name=\"NCTId\",\n",
" description=\"The unique identifier assigned to each clinical trial when registered on ClinicalTrials.gov. \",\n",
" type=\"string\",\n",
" ),\n",
" AttributeInfo(\n",
" name=\"StudyType\",\n",
" description=\"The nature of the study, indicating whether participants receive specific interventions or are merely observed for specific outcomes.\",\n",
" type=\"string\",\n",
" ),\n",
" AttributeInfo(\n",
" name=\"PhaseList\",\n",
" description=\"This pertains to the phase of the study in drug trials.\",\n",
" type=\"string\",\n",
" )\n",
"]\n",
"\n",
"# Overall context for the data\n",
"document_content_description = \"Information about clinical trial on ClinicalTrials.gov\"\n",
"\n",
"# LLM\n",
"llm = OpenAI(temperature=0,)\n",
"\n",
"# Retriever\n",
"retriever_self_query = SelfQueryRetriever.from_llm(\n",
" llm, \n",
" vectorstore, \n",
" document_content_description, \n",
" metadata_field_info, \n",
" verbose=True\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "93acf5e7-1a26-4bbc-afc6-ffd7f5893fa5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['NCT00000102', 'NCT00000102', 'NCT00000102']"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs = retriever_self_query.get_relevant_documents(\"What was the focus of trial NCT00000102?\")\n",
"[doc.metadata[\"NCTId\"] for doc in docs]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
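
The chained `.get` lookups in `extract_metadata` above can be tried without LangChain. This is a minimal sketch, assuming a trimmed-down `ProtocolSection` dictionary: `nested_get` is a hypothetical helper introduced only for illustration, and the sample record is hand-built, with values that mirror the metadata shown in the notebook output for `NCT00000102`.

```python
from typing import Any, Dict, Optional


def nested_get(record: Dict[str, Any], module: str, field: str) -> Optional[Any]:
    """Return record[module][field], or None if either level is missing."""
    return record.get(module, {}).get(field, None)


# Hypothetical, trimmed-down ProtocolSection; values mirror the notebook's
# recorded metadata for NCT00000102.
sample = {
    "IdentificationModule": {"NCTId": "NCT00000102"},
    "DesignModule": {
        "StudyType": "Interventional",
        "PhaseList": {"Phase": ["Phase 1", "Phase 2"]},
    },
}

nctid = nested_get(sample, "IdentificationModule", "NCTId")
study_type = nested_get(sample, "DesignModule", "StudyType")
phases = sample.get("DesignModule", {}).get("PhaseList", {}).get("Phase", None)

# A record without a DesignModule falls back to None instead of raising KeyError.
missing = nested_get({"IdentificationModule": {"NCTId": "X"}}, "DesignModule", "StudyType")

# str() keeps the list readable, but turns a missing phase list into 'None'.
phase_str_present = str(phases)
phase_str_missing = str(None)

print(nctid, study_type, phase_str_present, missing, phase_str_missing)
```

One consequence worth keeping in mind: because `PhaseList` is stored via `str(...)`, a trial with no phase information ends up with the literal string `'None'` as its metadata value, so any filter on that field compares against strings.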


@@ -0,0 +1,78 @@
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.pydantic_v1 import BaseModel
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough
from langchain.vectorstores import Chroma
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
# Load the persisted vectorstore
vectorstore = Chroma(persist_directory="./vectorstore",
collection_name="rag-biomedical",
embedding_function=OpenAIEmbeddings())
# Provide context about the metadata
metadata_field_info = [
AttributeInfo(
name="NCTId",
description="The unique identifier assigned to each clinical trial when registered on ClinicalTrials.gov. ",
type="string",
),
AttributeInfo(
name="StudyType",
description="The nature of the study, indicating whether participants receive specific interventions or are merely observed for specific outcomes.",
type="string",
),
AttributeInfo(
name="PhaseList",
description="This pertains to the phase of the study in drug trials.",
type="string",
)
]
# Overall context for the data
document_content_description = "Information about clinical trial on ClinicalTrials.gov"
# LLM
llm = OpenAI(temperature=0)
# Retriever
retriever_self_query = SelfQueryRetriever.from_llm(
llm,
vectorstore,
document_content_description,
metadata_field_info,
verbose=True
)
# RAG prompt
template = """Answer the question based only on the following context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
# LLM
model = ChatOpenAI()
# RAG chain
chain = (
RunnableParallel({"context": retriever_self_query, "question": RunnablePassthrough()})
| prompt
| model
| StrOutputParser()
)
# Add typing for input
class Question(BaseModel):
    __root__: str
chain = chain.with_types(input_type=Question)
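
The data flow through the chain above can be traced with plain functions. This is only a sketch under stated assumptions: `fake_retriever`, `fan_out`, and `build_prompt` are hypothetical stand-ins for the self-query retriever, the `RunnableParallel` step, and the prompt template, and the retrieved chunk is canned text, not real retriever output. No LLM is called.

```python
from typing import Callable, Dict, List

# Same template text as in the chain module.
TEMPLATE = """Answer the question based only on the following context:
{context}
Question: {question}
"""


def fake_retriever(question: str) -> List[str]:
    """Hypothetical stand-in for retriever_self_query; returns canned chunks."""
    return ["chunk about NCT00000102"]


def fan_out(question: str, retriever: Callable[[str], List[str]]) -> Dict[str, object]:
    """Mimic the parallel step: the same input is handed to every branch."""
    return {"context": retriever(question), "question": question}


def build_prompt(inputs: Dict[str, object]) -> str:
    """Mimic the prompt step: fill the template from the dictionary's keys."""
    return TEMPLATE.format(**inputs)


question = "What is the focus of trial NCT00000102?"
inputs = fan_out(question, fake_retriever)
prompt_text = build_prompt(inputs)
print(prompt_text)
```

The point of the sketch is that the raw question string reaches both branches: one branch turns it into retrieved context, the other passes it through unchanged, and the prompt consumes both keys.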


@@ -0,0 +1,183 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "052f3bef-847c-43e9-aa3f-08f7b5f8e7b5",
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"vectorstore_path = Path(__file__).parent / \"vectorstore\"\n",
"rel = vectorstore_path.relative_to(Path.cwd())\n",
"rel"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "9e7e6b47-919e-4956-add5-090bad49c535",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'/Users/rlm/Desktop/Code/langchain-main/langchain/templates/rag-biomedical/rag_biomedical'"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pwd"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "51b34cee-02ba-48e9-a2ce-8263e51213f4",
"metadata": {},
"outputs": [],
"source": [
"from langchain.vectorstores import Chroma\n",
"from langchain.embeddings import OpenAIEmbeddings\n",
"vectorstore = Chroma(persist_directory=\"./vectorstore\", \n",
" collection_name=\"rag-biomedical\",\n",
" embedding_function=OpenAIEmbeddings())"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "dc148a4e-9a64-4d2c-b8f0-6c9571c2e98d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.llms import OpenAI\n",
"from langchain.retrievers.self_query.base import SelfQueryRetriever\n",
"from langchain.chains.query_constructor.base import AttributeInfo\n",
"\n",
"# Provide context about the metadata\n",
"metadata_field_info = [\n",
" \n",
" AttributeInfo(\n",
" name=\"NCTId\",\n",
" description=\"The unique identifier assigned to each clinical trial when registered on ClinicalTrials.gov. \",\n",
" type=\"string\",\n",
" ),\n",
" AttributeInfo(\n",
" name=\"StudyType\",\n",
" description=\"The nature of the study, indicating whether participants receive specific interventions or are merely observed for specific outcomes.\",\n",
" type=\"string\",\n",
" ),\n",
" AttributeInfo(\n",
" name=\"PhaseList\",\n",
" description=\"This pertains to the phase of the study in drug trials.\",\n",
" type=\"string\",\n",
" )\n",
"]\n",
"\n",
"# Overall context for the data\n",
"document_content_description = \"Information about clinical trial on ClinicalTrials.gov\"\n",
"\n",
"# LLM\n",
"llm = OpenAI(temperature=0,)\n",
"\n",
"# Retriever\n",
"retriever_self_query = SelfQueryRetriever.from_llm(\n",
" llm, \n",
" vectorstore, \n",
" document_content_description, \n",
" metadata_field_info, \n",
" verbose=True\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "fbbda2d5-af87-42a0-a2ad-8bf72f3c5658",
"metadata": {},
"outputs": [],
"source": [
"# retriever_self_query.get_relevant_documents(\"What is the focus of trial NCT00000102?\")"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "f58d91a1-ac8d-4432-b3ce-fdfc10942af0",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.prompts import ChatPromptTemplate\n",
"from langchain.schema.output_parser import StrOutputParser\n",
"from langchain.schema.runnable import RunnableParallel, RunnablePassthrough\n",
"\n",
"# RAG prompt\n",
"template = \"\"\"Answer the question based only on the following context:\n",
"{context}\n",
"\n",
"Question: {question}\n",
"\"\"\"\n",
"prompt = ChatPromptTemplate.from_template(template)\n",
"\n",
"# LLM\n",
"model = ChatOpenAI()\n",
"\n",
"# RAG chain\n",
"chain = (\n",
" RunnableParallel({\"context\": retriever_self_query, \"question\": RunnablePassthrough()})\n",
" | prompt\n",
" | model\n",
" | StrOutputParser()\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "ea413fa6-5ba4-4d1d-b8d7-04b18727468d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'The focus of trial NCT00000102 is to evaluate the effects of extended release nifedipine (Procardia XL), a blood pressure medication, on the hypothalamic-pituitary-adrenal axis in patients with congenital adrenal hyperplasia. The trial aims to assess both acute and chronic effects of nifedipine and evaluate its potential to decrease the dosage of glucocorticoid medication needed to suppress the HPA axis in order to reduce the deleterious effects of glucocorticoid treatment in CAH.'"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain.invoke(\"What is the focus of trial NCT00000102?\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}