Update intro
@@ -5,25 +5,52 @@
"id": "67f41c2c-869d-44c3-b243-a7ba105d504f",
"metadata": {},
"source": [
"# Biomedical RAG DB Introduction\n",
"\n",
"This notebook provides some general context for RAG on different types of documents relevant in biomedical research:\n",
"# Biomedical RAG Introduction\n",
"\n",
"This notebook provides context for RAG on various types of documents that are relevant in biomedical research:\n",
" \n",
"* URLs\n",
"* Academic papers\n",
"* Clinical Trial Data\n",
"* Academic papers (PDF)\n",
"* Clinical Trial Data (JSON)\n",
"\n",
"## Dependencies "
"## Dependencies \n",
"\n",
"### Packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "27507adc-aeb4-4ed0-a4c3-593e03b01cb7",
"id": "aad3785d-4ee3-4675-8fb4-f393d4d0b8c9",
"metadata": {},
"outputs": [],
"source": [
"! pip install langchain unstructured[all-docs] pypdf langchainhub anthropic openai chromadb gpt4all"
"! pip install chromadb tiktoken pypdf langchainhub anthropic openai gpt4all pandas"
]
},
{
"cell_type": "markdown",
"id": "b5e09ba1-a99c-417a-857a-90d081fd7a07",
"metadata": {},
"source": [
"### API keys\n",
"\n",
"Required (to access OpenAI LLMs)\n",
"* `OPENAI_API_KEY`\n",
"\n",
"Optional [Anthropic](https://docs.anthropic.com/claude/reference/getting-started-with-the-api) and [LangSmith](https://docs.smith.langchain.com/) for Anthropic LLMs and tracing:\n",
"* `ANTHROPIC_API_KEY` \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b17fe580-11fc-4c35-934c-db259e35c969",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"# os.environ['OPENAI_API_KEY'] = 'xxx'"
]
},
{
@@ -61,19 +88,19 @@
"\n",
"#### URLs\n",
"\n",
"We can load from urls or [CSVs](https://python.langchain.com/docs/integrations/document_loaders/csv) (e.g., from [Clinical Trials](https://clinicaltrials.gov/study/NCT04296890)). "
"We can load from [urls](https://python.langchain.com/docs/integrations/document_loaders/url). "
]
},
{
"cell_type": "code",
"execution_count": 24,
"execution_count": 3,
"id": "729327fc-6710-4c2f-8ae5-05c0c1b6ce91",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import WebBaseLoader\n",
"# Define loader for URL\n",
"loader = WebBaseLoader(\"https://lilianweng.github.io/posts/2023-06-23-agent/\")\n",
"loader = WebBaseLoader(\"https://berthub.eu/articles/posts/amazing-dna/\")\n",
"blog = loader.load()"
]
},
@@ -84,30 +111,14 @@
"source": [
"#### JSON\n",
"\n",
"Clinical trial data can be downloaded [as a database](https://classic.clinicaltrials.gov/api/gui/ref/download_all) of JSONs.\n",
"Clinical trial data can be downloaded [as a database](https://classic.clinicaltrials.gov/api/gui/ref/download_all) of JSONs from clinicaltrials.gov.\n",
"\n",
"You can download the full zip file (`~2GB`) with the above link.\n",
" \n",
"We can [load each JSON record](https://python.langchain.com/docs/modules/data_connection/document_loaders/json).\n",
"\n",
"With `extract_metadata`, the parsed JSON data (a dictionary) is provided to the JSONLoader. \n",
"\n",
"Here, it corresponds to the `ProtocolSection` of the study.\n",
"\n",
"The function first attempts to access the `IdentificationModule` within the sample dictionary using the get method. \n",
"\n",
"If IdentificationModule is present, it'll return its value; otherwise, it'll return an empty dictionary ({}).\n",
"\n",
"Next, the function attempts to access NCTId from the previously fetched value. \n",
"\n",
"If NCTId is present, its value is returned; otherwise, None is returned.\n",
"\n",
"We can perform this for each desired metadata field."
"The full zip file is `~2GB` and we can [load each JSON record](https://python.langchain.com/docs/modules/data_connection/document_loaders/json) with `JSONLoader`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 5,
"id": "b7aafb2e-42ba-46f2-879d-44a73bca7c67",
"metadata": {},
"outputs": [],
@@ -131,7 +142,7 @@
"id": "c20f516a-0c44-422f-9fc4-3d13a9854ec4",
"metadata": {},
"source": [
"Let's look at the structure of each record."
"We can look at the structure of each record."
]
},
{
@@ -207,6 +218,28 @@
"print(df.columns)"
]
},
{
"cell_type": "markdown",
"id": "bd7ae843-62dc-45f1-b233-ba44c27ff593",
"metadata": {},
"source": [
"To build a vectorstore, we should think about what we want as `metadata` and what we want to `embed` and semantically retrieve.\n",
"\n",
"`JSONLoader` accepts a field that we can use to supply metadata.\n",
"\n",
"The `jq_schema` selects the part of each record we want to parse; here, it corresponds to the `ProtocolSection` of the study.\n",
" \n",
"With `extract_metadata`, the function first attempts to access the `IdentificationModule` within the sample dictionary using the `get` method. \n",
"\n",
"If IdentificationModule is present, it'll return its value; otherwise, it'll return an empty dictionary ({}).\n",
"\n",
"Next, the function attempts to access NCTId from the previously fetched value. \n",
"\n",
"If NCTId is present, its value is returned; otherwise, None is returned.\n",
"\n",
"We can perform this for each desired metadata field."
]
},
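A minimal sketch of the metadata extraction described above, assuming the clinicaltrials.gov record layout (`FullStudy.Study.ProtocolSection`) and an illustrative file name; this is not the notebook's exact cell, and `JSONLoader` additionally requires the `jq` package:

```python
# Hedged sketch -- the file name and jq path below are assumptions about the record layout.
from langchain.document_loaders import JSONLoader

def extract_metadata(record: dict, metadata: dict) -> dict:
    # `record` is the parsed ProtocolSection dictionary for one study.
    identification = record.get("IdentificationModule", {})  # {} if the module is missing
    metadata["nct_id"] = identification.get("NCTId")          # None if NCTId is missing
    return metadata

loader = JSONLoader(
    file_path="NCT04296890.json",                  # assumed: one downloaded study record
    jq_schema=".FullStudy.Study.ProtocolSection",  # assumed path to the ProtocolSection
    metadata_func=extract_metadata,
    text_content=False,                            # the selected section is a dict, not a string
)
docs = loader.load()
```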
{
"cell_type": "code",
"execution_count": 1,
@@ -262,17 +295,6 @@
"* See list of OpenAI models [here](https://platform.openai.com/docs/models/gpt-3-5)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d0784c5-ad6e-4142-b4ab-0cea0801cb19",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"# os.environ['OPENAI_API_KEY'] = 'xxx'"
]
},
{
"cell_type": "code",
"execution_count": 26,
@@ -300,9 +322,11 @@
"\n",
"# LLM\n",
"from langchain.chat_models import ChatOpenAI\n",
"llm_openai = ChatOpenAI(model=\"gpt-3.5-turbo-16k\",temperature=0)\n",
"llm_openai = ChatOpenAI(model=\"gpt-3.5-turbo-16k\",\n",
"                     temperature=0)\n",
"\n",
"# Chain\n",
"# The StrOutputParser is a built-in output parser that converts the output of a language model into a string.\n",
"from langchain.schema.output_parser import StrOutputParser\n",
"chain = prompt | llm_openai | StrOutputParser()\n",
"\n",
@@ -343,7 +367,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 6,
"id": "e7a8bbfa-9e85-41f5-a2f9-9ac2b8435ebd",
"metadata": {},
"outputs": [],
@@ -352,16 +376,6 @@
"from langchain.chat_models import ChatAnthropic"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "07fc90c2-70b7-4bda-963a-505934208913",
"metadata": {},
"outputs": [],
"source": [
"# os.environ['ANTHROPIC_API_KEY'] = 'xxx'"
]
},
{
"cell_type": "markdown",
"id": "52e821f8-2a9b-48e6-9fa7-3ca1329d6d56",
@@ -370,6 +384,17 @@
"The summarization prompt is a fun, silly example to highlight the flexibility of Claude2."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2dfe47ad-cb4f-48ae-b18f-fa7d9a62a44a",
"metadata": {},
"outputs": [],
"source": [
"# Prompt from the Hub\n",
"prompt = hub.pull(\"hwchase17/anthropic-paper-qa\")"
]
},
{
"cell_type": "code",
"execution_count": 22,
@@ -388,9 +413,6 @@
}
],
"source": [
"# Prompt from the Hub\n",
"prompt = hub.pull(\"hwchase17/anthropic-paper-qa\")\n",
"\n",
"# LLM\n",
"llm_anthropic = ChatAnthropic(temperature=0, model='claude-2', max_tokens=10000)\n",
"\n",