Compare commits

...

3 Commits

Author        SHA1        Message                              Date
Lance Martin  cc51af26b6  Minor updates                        2024-04-17 12:52:05 -07:00
Lance Martin  16152b3cdd  Add hallucination and doc relevance  2024-04-16 16:56:41 -07:00
Lance Martin  c152fc5733  RAG guide                            2024-04-16 15:31:48 -07:00
5 changed files with 559 additions and 0 deletions

docs/docs/guides/evaluation/examples/rag.ipynb (View File)

@@ -0,0 +1,559 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "2e7db2b1-8f9c-46bd-9c50-b6cfb0a38a22",
"metadata": {},
"source": [
"# RAG Evaluation\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/guides/evaluation/examples/rag.ipynb)\n",
"\n",
"RAG (Retrieval Augmented Generation) is one of the most popular LLM applications.\n",
"\n",
"For an in-depth review, see our RAG series of notebooks and videos [here](https://github.com/langchain-ai/rag-from-scratch)).\n",
"\n",
"## Types of RAG eval\n",
"\n",
"There are at least 4 types of RAG eval that users of typically interested in:\n",
"\n",
"![](../../../../../static/img/langsmith_rag_eval.png)\n",
"\n",
"\n",
"Each of these evals has something in common: it will compare text (e.g., answer vs reference answer, etc).\n",
"\n",
"We can use various built-in `LangChainStringEvaluator` types for this (see [here](https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations#overview)).\n",
"\n",
"All `LangChainStringEvaluator` implementations can accept 3 inputs:\n",
"\n",
"```\n",
"prediction: The prediction string.\n",
"reference: The reference string.\n",
"input: The input string.\n",
"```\n",
"\n",
"Below, we will use this to perform eval.\n",
"\n",
"## RAG Chain \n",
"\n",
"To start, we build a RAG chain. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d809e9a0-44bc-4e9f-8eee-732ef077538c",
"metadata": {},
"outputs": [],
"source": [
"! pip install langchain-community langchain chromdb tiktoken"
]
},
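{
"cell_type": "markdown",
"id": "string-evaluator-sketch-md",
"metadata": {},
"source": [
"As a minimal sketch of this three-input interface (hypothetical strings; `load_evaluator` is the underlying LangChain helper that `LangChainStringEvaluator` wraps):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "string-evaluator-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"from langchain.evaluation import load_evaluator\n",
"\n",
"# Hypothetical strings, just to show the prediction / reference / input slots.\n",
"# The \"qa\" evaluator grades with an LLM under the hood, so OPENAI_API_KEY must be set.\n",
"qa_grader = load_evaluator(\"qa\")\n",
"qa_grader.evaluate_strings(\n",
"    prediction=\"LCEL is the LangChain Expression Language.\",\n",
"    reference=\"LCEL stands for LangChain Expression Language.\",\n",
"    input=\"What does LCEL stand for?\",\n",
")"
]
},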
{
"cell_type": "markdown",
"id": "760cab79-2d5e-4324-ba4a-54b6f4094cb0",
"metadata": {},
"source": [
"We build an `index` using a set of LangChain docs."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "6f7c0017-f4dd-4071-aa48-40957ffb4e9d",
"metadata": {},
"outputs": [],
"source": [
"### INDEX\n",
"\n",
"from bs4 import BeautifulSoup as Soup\n",
"from langchain_community.vectorstores import Chroma\n",
"from langchain_openai import OpenAIEmbeddings\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader\n",
"\n",
"# Load\n",
"url = \"https://python.langchain.com/docs/expression_language/\"\n",
"loader = RecursiveUrlLoader(url=url, max_depth=20, extractor=lambda x: Soup(x, \"html.parser\").text)\n",
"docs = loader.load()\n",
"\n",
"# Split\n",
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)\n",
"splits = text_splitter.split_documents(docs)\n",
"\n",
"# Embed\n",
"vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())\n",
"\n",
"# Index\n",
"retriever = vectorstore.as_retriever()"
]
},
{
"cell_type": "markdown",
"id": "c365fb82-78a6-40b6-bd59-daaa1e79d6c8",
"metadata": {},
"source": [
"Next, we build a `RAG chain` that returns an `answer` and the retrieved documents as `contexts`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "68e249d7-bc6c-4631-b099-6daaeeddf38a",
"metadata": {},
"outputs": [],
"source": [
"### RAG \n",
"\n",
"import openai\n",
"from langsmith import traceable\n",
"from langsmith.wrappers import wrap_openai\n",
"\n",
"class RagBot:\n",
" def __init__(self, retriever, model: str = \"gpt-4-turbo-preview\"):\n",
" self._retriever = retriever\n",
" # Wrapping the client instruments the LLM\n",
" self._client = wrap_openai(openai.Client())\n",
" self._model = model\n",
"\n",
" @traceable\n",
" def get_answer(self, question: str):\n",
" similar = self._retriever.invoke(question)\n",
" response = self._client.chat.completions.create(\n",
" model=self._model,\n",
" messages=[\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": \"You are a helpful AI assistant.\"\n",
" \" Use the following docs to help answer the user's question.\\n\\n\"\n",
" f\"## Docs\\n\\n{similar}\",\n",
" },\n",
" {\"role\": \"user\", \"content\": question},\n",
" ],\n",
" )\n",
" \n",
" # Evaluators will expect \"answer\" and \"contexts\"\n",
" return {\n",
" \"answer\": response.choices[0].message.content,\n",
" \"contexts\": [str(doc) for doc in similar],\n",
" }\n",
"\n",
"rag_bot = RagBot(retriever)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "6101d155-a1ab-460c-8c3e-f1f44e09a8b7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'LangChain Expression Language (LCEL) is a declarative language that simplifies the composition of chains for working with language models and related '"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"response = rag_bot.get_answer(\"What is LCEL?\")\n",
"response[\"answer\"][:150]"
]
},
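{
"cell_type": "markdown",
"id": "contexts-check-md",
"metadata": {},
"source": [
"We can also peek at the retrieved `contexts` returned alongside the answer; the hallucination and document-relevance evaluators below rely on this key. A quick check, reusing the `response` from the cell above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "contexts-check-code",
"metadata": {},
"outputs": [],
"source": [
"# Number of retrieved documents and a preview of the first one\n",
"len(response[\"contexts\"]), response[\"contexts\"][0][:100]"
]
},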
{
"cell_type": "markdown",
"id": "432e8ec7-a085-4224-ad38-0087e1d553f1",
"metadata": {},
"source": [
"## RAG Dataset \n",
"\n",
"Next, we build a dataset of QA pairs based upon the [documentation](https://python.langchain.com/docs/expression_language/) that we indexed."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "22f0daeb-6a61-4f8d-a4fc-4c7d22b6dc61",
"metadata": {},
"outputs": [],
"source": [
"os.environ['LANGCHAIN_TRACING_V2'] = 'true'\n",
"os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'\n",
"os.environ['LANGCHAIN_API_KEY'] = <your-api-key>"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "0f29304f-d79b-40e9-988a-343732102af9",
"metadata": {},
"outputs": [],
"source": [
"from langsmith import Client \n",
"\n",
"# QA\n",
"inputs = [\n",
" \"How can I directly pass a string to a runnable and use it to construct the input needed for my prompt?\",\n",
" \"How can I make the output of my LCEL chain a string?\",\n",
" \"How can I apply a custom function to one of the inputs of an LCEL chain?\"\n",
"]\n",
"\n",
"outputs = [\n",
" \"Use RunnablePassthrough. from langchain_core.runnables import RunnableParallel, RunnablePassthrough; from langchain_core.prompts import ChatPromptTemplate; from langchain_openai import ChatOpenAI; prompt = ChatPromptTemplate.from_template('Tell a joke about: {input}'); model = ChatOpenAI(); runnable = ({'input' : RunnablePassthrough()} | prompt | model); runnable.invoke('flowers')\",\n",
" \"Use StrOutputParser. from langchain_openai import ChatOpenAI; from langchain_core.prompts import ChatPromptTemplate; from langchain_core.output_parsers import StrOutputParser; prompt = ChatPromptTemplate.from_template('Tell me a short joke about {topic}'); model = ChatOpenAI(model='gpt-3.5-turbo') #gpt-4 or other LLMs can be used here; output_parser = StrOutputParser(); chain = prompt | model | output_parser\",\n",
" \"Use RunnableLambda with itemgetter to extract the relevant key. from operator import itemgetter; from langchain_core.prompts import ChatPromptTemplate; from langchain_core.runnables import RunnableLambda; from langchain_openai import ChatOpenAI; def length_function(text): return len(text); chain = ({'prompt_input': itemgetter('foo') | RunnableLambda(length_function),} | prompt | model); chain.invoke({'foo':'hello world'})\"\n",
"]\n",
"\n",
"qa_pairs = [{\"question\": q, \"answer\": a} for q, a in zip(inputs, outputs)]\n",
"\n",
"# Create dataset\n",
"client = Client()\n",
"dataset_name = \"RAG_test_LCEL\"\n",
"dataset = client.create_dataset(\n",
" dataset_name=dataset_name,\n",
" description=\"QA pairs about LCEL.\",\n",
")\n",
"client.create_examples(\n",
" inputs=[{\"question\": q} for q in inputs],\n",
" outputs=[{\"answer\": a} for a in outputs],\n",
" dataset_id=dataset.id,\n",
")"
]
},
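{
"cell_type": "markdown",
"id": "dataset-check-md",
"metadata": {},
"source": [
"As an optional sanity check, we can list the examples that were just added (this assumes the `client` and `dataset_name` defined above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dataset-check-code",
"metadata": {},
"outputs": [],
"source": [
"# Optional: confirm the dataset contents\n",
"examples = list(client.list_examples(dataset_name=dataset_name))\n",
"len(examples)"
]
},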
{
"cell_type": "markdown",
"id": "92cf3a0f-621f-468d-818d-a6f2d4b53823",
"metadata": {},
"source": [
"## RAG Evaluators\n",
"\n",
"### Type 1: Reference Answer\n",
"\n",
"First, lets consider the case in which we want to compare our RAG chain answer to a reference answer.\n",
"\n",
"This is shown on the far right (blue) in the top figure.\n",
"\n",
"#### Eval flow\n",
"\n",
"We will use a `LangChainStringEvaluator`, as mentioned above.\n",
"\n",
"For comparing questions and answers, common built-in `LangChainStringEvaluator` options are `QA` and `CoTQA` [here different evaluators](https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations).\n",
"\n",
"We will use `CoT_QA` as an LLM-as-judge evaluator, which uses the eval prompt defined [here](https://smith.langchain.com/hub/langchain-ai/cot_qa).\n",
"\n",
"But, all `LangChainStringEvaluator` expose a common interface to pass your inputs:\n",
"\n",
"1. `question` from the dataset -> `input` \n",
"2. `answer` from the dataset -> `reference` \n",
"3. `answer` from the LLM -> `prediction` \n",
"\n",
"![](../../../../../static/img/langsmith_rag_flow.png)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "1cbe0b4a-2a30-4f40-b3aa-5cc67c6a7802",
"metadata": {},
"outputs": [],
"source": [
"# RAG chain\n",
"def predict_rag_answer(example: dict):\n",
" \"\"\"Use this for answer evaluation\"\"\"\n",
" response = rag_bot.get_answer(example[\"question\"])\n",
" return {\"answer\": response[\"answer\"]}\n",
"\n",
"def predict_rag_answer_with_context(example: dict):\n",
" \"\"\"Use this for evaluation of retrieved documents and hallucinations\"\"\"\n",
" response = rag_bot.get_answer(example[\"question\"])\n",
" return {\"answer\": response[\"answer\"], \"contexts\": response[\"contexts\"]}"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "a7a3827d-a92f-4a7a-a572-5123fbd9c334",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for experiment: 'rag-qa-oai-e8604ab3' at:\n",
"https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/368734fb-7c14-4e1f-b91a-50d52cb58a07/compare?selectedSessions=a176a91c-a5f0-42ab-b2f4-fedaa1cbf17d\n",
"\n",
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "e459fbab745f4ce4bb399609910a807f",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"0it [00:00, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from langsmith.evaluation import LangChainStringEvaluator, evaluate\n",
"\n",
"# Evaluator \n",
"qa_evalulator = [LangChainStringEvaluator(\"cot_qa\", \n",
" prepare_data=lambda run, example: {\n",
" \"prediction\": run.outputs[\"answer\"], \n",
" \"reference\": run.outputs[\"contexts\"],\n",
" \"input\": example.inputs[\"question\"],\n",
" } \n",
" ))]\n",
"dataset_name = \"RAG_test_LCEL\"\n",
"experiment_results = evaluate(\n",
" predict_rag_answer,\n",
" data=dataset_name,\n",
" evaluators=qa_evalulator,\n",
" experiment_prefix=\"rag-qa-oai\",\n",
" metadata={\"variant\": \"LCEL context, gpt-3.5-turbo\"},\n",
")"
]
},
{
"cell_type": "markdown",
"id": "60ba4123-c691-4aa0-ba76-e567e8aaf09f",
"metadata": {},
"source": [
"### Type 2: Answer Hallucination\n",
"\n",
"Second, lets consider the case in which we want to compare our RAG chain answer to the retrieved documents.\n",
"\n",
"This is shown in the red in the top figure.\n",
"\n",
"#### Eval flow\n",
"\n",
"We will use a `LangChainStringEvaluator`, as mentioned above.\n",
"\n",
"For comparing documents and answers, common built-in `LangChainStringEvaluator` options are `Criteria` [here](https://python.langchain.com/docs/guides/productionization/evaluation/string/criteria_eval_chain/#using-reference-labels) because we want to supply custom criteria.\n",
"\n",
"We will use `labeled_score_string` as an LLM-as-judge evaluator, which uses the eval prompt defined [here](https://smith.langchain.com/hub/wfh/labeled-score-string).\n",
"\n",
"Here, we only need to use two inputs of the `LangChainStringEvaluator` interface:\n",
"\n",
"1. `contexts` from LLM chain -> `reference` \n",
"2. `answer` from the LLM chain -> `prediction` \n",
"\n",
"![](../../../../../static/img/langsmith_rag_flow_hallucination.png)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "7f0872a5-e989-415d-9fed-5846efaa9488",
"metadata": {},
"outputs": [],
"source": [
"from langsmith.evaluation import LangChainStringEvaluator, evaluate\n",
"\n",
"answer_hallucination_evaluator = LangChainStringEvaluator(\n",
" \"labeled_score_string\", \n",
" config={\n",
" \"criteria\": { \n",
" \"accuracy\": \"\"\"Is the Assistant's Answer grounded in the Ground Truth documentation? A score of 0 means that the\n",
" Assistant answer contains is not at all based upon / grounded in the Groun Truth documentation. A score of 5 means \n",
" that the Assistant answer contains some information (e.g., a hallucination) that is not captured in the Ground Truth \n",
" documentation. A score of 10 means that the Assistant answer is fully based upon the in the Ground Truth documentation.\"\"\"\n",
" },\n",
" # If you want the score to be saved on a scale from 0 to 1\n",
" \"normalize_by\": 10,\n",
" },\n",
" prepare_data=lambda run, example: {\n",
" \"prediction\": run.outputs[\"answer\"], \n",
" \"reference\": run.outputs[\"contexts\"],\n",
" \"input\": example.inputs[\"question\"],\n",
" } \n",
")"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "6d5bf61b-3903-4cde-9ecf-67f0e0874521",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for experiment: 'rag-qa-oai-hallucination-fad2e13c' at:\n",
"https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/368734fb-7c14-4e1f-b91a-50d52cb58a07/compare?selectedSessions=9a1e9e7d-cf87-4b89-baf6-f5498a160627\n",
"\n",
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "891904d8d44444e98c6a03faa43e147a",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"0it [00:00, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"dataset_name = \"RAG_test_LCEL\"\n",
" \n",
"experiment_results = evaluate(\n",
" predict_rag_answer_with_context,\n",
" data=dataset_name,\n",
" evaluators=[answer_hallucination_evaluator],\n",
" experiment_prefix=\"rag-qa-oai-hallucination\",\n",
" # Any experiment metadata can be specified here\n",
" metadata={\n",
" \"variant\": \"LCEL context, gpt-3.5-turbo\",\n",
" },\n",
")"
]
},
{
"cell_type": "markdown",
"id": "480a27cb-1a31-4194-b160-8cdcfbf24eea",
"metadata": {},
"source": [
"### Type 3: Document Relevance to Question\n",
"\n",
"Finally, lets consider the case in which we want to compare our RAG chain document retrieval to the question.\n",
"\n",
"This is shown in green in the top figure.\n",
"\n",
"#### Eval flow\n",
"\n",
"We will use a `LangChainStringEvaluator`, as mentioned above.\n",
"\n",
"For comparing documents and answers, common built-in `LangChainStringEvaluator` options are `Criteria` [here](https://python.langchain.com/docs/guides/productionization/evaluation/string/criteria_eval_chain/#using-reference-labels) because we want to supply custom criteria.\n",
"\n",
"We will use `labeled_score_string` as an LLM-as-judge evaluator, which uses the eval prompt defined [here](https://smith.langchain.com/hub/wfh/labeled-score-string).\n",
"\n",
"Here, we only need to use two inputs of the `LangChainStringEvaluator` interface:\n",
"\n",
"1. `question` from LLM chain -> `reference` \n",
"2. `contexts` from the LLM chain -> `prediction` \n",
"\n",
"![](../../../../../static/img/langsmith_rag_flow_doc_relevance.png)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "df247034-14ed-40b1-b313-b0fef7286546",
"metadata": {},
"outputs": [],
"source": [
"from langsmith.evaluation import LangChainStringEvaluator, evaluate\n",
"\n",
"docs_relevance_evaluator = LangChainStringEvaluator(\n",
" \"labeled_score_string\", \n",
" config={\n",
" \"criteria\": { \n",
" \"accuracy\": \"\"\"The Assistant's Answer is a set of documents retrieved from a vectorstore. The Ground Truth is a question\n",
" used for retrieval. You will score whether the Assistant's Answer (retrieved docs) are relevant to the Ground Truth \n",
" question. A score of 0 means that the Assistant answer contains documents that are not at all relevant to the \n",
" Ground Truth question. A score of 5 means that the Assistant answer contains some documents are relevant to the Ground Truth \n",
" question. A score of 10 means that all of the Assistant answer documents are all relevant to the Ground Truth question\"\"\"\n",
" },\n",
" # If you want the score to be saved on a scale from 0 to 1\n",
" \"normalize_by\": 10,\n",
" },\n",
" prepare_data=lambda run, example: {\n",
" \"prediction\": run.outputs[\"contexts\"], \n",
" \"reference\": example.inputs[\"question\"],\n",
" \"input\": example.inputs[\"question\"],\n",
" } \n",
")"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "cfe988dc-2aaa-42f4-93ff-c3c9fe6b3124",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for experiment: 'rag-qa-oai-doc-relevance-82244196' at:\n",
"https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/368734fb-7c14-4e1f-b91a-50d52cb58a07/compare?selectedSessions=3bbf09c9-69de-47ba-9d3c-7bcedf5cd48f\n",
"\n",
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "4e4091f1053b4d34871aa87428297e12",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"0it [00:00, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"experiment_results = evaluate(\n",
" predict_rag_answer_with_context,\n",
" data=dataset_name,\n",
" evaluators=[docs_relevance_evaluator],\n",
" experiment_prefix=\"rag-qa-oai-doc-relevance\",\n",
" # Any experiment metadata can be specified here\n",
" metadata={\n",
" \"variant\": \"LCEL context, gpt-3.5-turbo\",\n",
" },\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2f09b6e-667a-47fe-b3f9-8634783f7666",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

BIN  docs/static/img/langsmith_rag_eval.png (vendored, new file, 183 KiB; binary file not shown)
BIN  docs/static/img/langsmith_rag_flow.png (vendored, new file, 148 KiB; binary file not shown)
BIN  docs/static/img/langsmith_rag_flow_hallucination.png (new file, 121 KiB; binary file not shown)
BIN  docs/static/img/langsmith_rag_flow_doc_relevance.png (new file, 121 KiB; binary file not shown)