Add hallucination and doc relevance

Lance Martin
2024-04-16 16:56:41 -07:00
parent c152fc5733
commit 16152b3cdd
3 changed files with 193 additions and 27 deletions


@@ -19,15 +19,21 @@
"![](../../../../../static/img/langsmith_rag_eval.png)\n",
"\n",
"\n",
"We will discuss each below.\n",
"Each of these evals has something in common: it will compare text (e.g., answer vs reference answer, etc).\n",
"\n",
"### Reference Answer\n",
"We can use various built-in `LangChainStringEvaluator` types for this (see [here](https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations#overview)).\n",
"\n",
"First, lets consider the case in which we want to compare our RAG chain answer to a reference answer.\n",
"All `LangChainStringEvaluator` implementations can accept 3 inputs:\n",
"\n",
"This is shown on the far right (blue) above.\n",
"```\n",
"prediction: The prediction string.\n",
"reference: The reference string.\n",
"input: The input string.\n",
"```\n",
"\n",
"#### RAG Chain \n",
"Below, we will use this to perform eval.\n",
"\n",
"## RAG Chain \n",
"\n",
"To start, we build a RAG chain. "
]
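
> The RAG chain itself is built in notebook cells outside this hunk. As a rough sketch only (the loader, splitter, embedding model, prompt, and model choice below are illustrative assumptions, not the notebook's actual code), the chain and the two prediction functions referenced by the evaluators later in this diff could look like:

```python
# Hedged sketch of the RAG chain; component choices are assumptions for illustration.
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Index the LCEL docs (assumed source) into a vector store and build a retriever
docs = WebBaseLoader("https://python.langchain.com/docs/expression_language/").load()
splits = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)
retriever = Chroma.from_documents(splits, OpenAIEmbeddings()).as_retriever()

# Simple grounded-answer prompt piped into gpt-3.5-turbo
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only this context:\n\n{context}\n\nQuestion: {question}"
)
chain = prompt | ChatOpenAI(model="gpt-3.5-turbo", temperature=0) | StrOutputParser()


def predict_rag_answer(example: dict) -> dict:
    """Prediction function for the reference-answer eval: returns only the answer."""
    context = retriever.invoke(example["question"])
    answer = chain.invoke({"question": example["question"], "context": context})
    return {"answer": answer}


def predict_rag_answer_with_context(example: dict) -> dict:
    """Prediction function for the hallucination and doc-relevance evals: also returns the docs."""
    context = retriever.invoke(example["question"])
    answer = chain.invoke({"question": example["question"], "context": context})
    return {"answer": answer, "contexts": [d.page_content for d in context]}
```

> The only detail the evaluators below depend on is the output shape: `predict_rag_answer` returns an `answer` key, and `predict_rag_answer_with_context` additionally returns `contexts`.
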
@@ -161,7 +167,7 @@
"id": "432e8ec7-a085-4224-ad38-0087e1d553f1",
"metadata": {},
"source": [
"#### RAG Dataset \n",
"## RAG Dataset \n",
"\n",
"Next, we build a dataset of QA pairs based upon the [documentation](https://python.langchain.com/docs/expression_language/) that we indexed."
]
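
> The dataset-construction cells also fall outside this hunk. A minimal sketch using the LangSmith client (the QA pair shown is a placeholder, not the notebook's actual data) could be:

```python
# Hedged sketch: the real QA pairs are drawn from the LCEL documentation.
from langsmith import Client

client = Client()
dataset_name = "RAG_test_LCEL"

# Placeholder example; the input key "question" and output key "answer"
# are what the prepare_data mappings below expect.
qa_pairs = [
    ("How do I compose two runnables in LCEL?",
     "Chain them with the pipe operator, e.g. prompt | llm | parser."),
]

dataset = client.create_dataset(dataset_name=dataset_name)
client.create_examples(
    inputs=[{"question": q} for q, _ in qa_pairs],
    outputs=[{"answer": a} for _, a in qa_pairs],
    dataset_id=dataset.id,
)
```
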
@@ -207,21 +213,27 @@
"id": "92cf3a0f-621f-468d-818d-a6f2d4b53823",
"metadata": {},
"source": [
"## RAG Evaluators\n",
"\n",
"### Type 1: Reference Answer\n",
"\n",
"First, lets consider the case in which we want to compare our RAG chain answer to a reference answer.\n",
"\n",
"This is shown on the far right (blue) in the top figure.\n",
"\n",
"#### Eval flow\n",
"\n",
"There are [several different evaluators](https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations) that can be used to compare our RAG chain answer to a reference answer.\n",
"We will use a `LangChainStringEvaluator`, as mentioned above.\n",
"\n",
"< `TODO:` Update table to link to the eval prompts. > \n",
"For comparing questions and answers, common built-in `LangChainStringEvaluator` options are `QA` and `CoTQA` [here different evaluators](https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations).\n",
"\n",
"Here, we will use `CoT_QA` as an LLM-as-judge evaluator.\n",
"We will use `CoT_QA` as an LLM-as-judge evaluator, which uses the eval prompt defined [here](https://github.com/langchain-ai/langchain/blob/22da9f5f3f9fef24c5c75072b678b8a2f654b173/libs/langchain/langchain/evaluation/qa/eval_prompt.py#L43).\n",
"\n",
"[Here](https://github.com/langchain-ai/langchain/blob/22da9f5f3f9fef24c5c75072b678b8a2f654b173/libs/langchain/langchain/evaluation/qa/eval_prompt.py#L43) is the prompt used by `CoT_QA`.\n",
"But, all `LangChainStringEvaluator` expose a common interface to pass your inputs:\n",
"\n",
"Our evaluator will connect our dataset and RAG chain outputs to the evaluator prompt inputs:\n",
"\n",
"1. `question` from the dataset -> `question` in the prompt, the RAG chain input\n",
"2. `answer` from the dataset -> `context` in the prompt, the ground truth answer\n",
"3. `answer` from the LLM using `predict_rag_answer` function below -> `result` in the prompt, the RAG chain result\n",
"1. `question` from the dataset -> `input` \n",
"2. `answer` from the dataset -> `reference` \n",
"3. `answer` from the LLM -> `prediction` \n",
"\n",
"![](../../../../../static/img/langsmith_rag_flow.png)"
]
@@ -280,7 +292,13 @@
"from langsmith.evaluation import LangChainStringEvaluator, evaluate\n",
"\n",
"# Evaluator \n",
"qa_evalulator = [LangChainStringEvaluator(\"cot_qa\")]\n",
"qa_evalulator = [LangChainStringEvaluator(\"cot_qa\", \n",
" prepare_data=lambda run, example: {\n",
" \"prediction\": run.outputs[\"answer\"], \n",
" \"reference\": run.outputs[\"contexts\"],\n",
" \"input\": example.inputs[\"question\"],\n",
" } \n",
" ))]\n",
"dataset_name = \"RAG_test_LCEL\"\n",
"experiment_results = evaluate(\n",
" predict_rag_answer,\n",
@@ -296,19 +314,98 @@
"id": "60ba4123-c691-4aa0-ba76-e567e8aaf09f",
"metadata": {},
"source": [
"### Answer Hallucination\n",
"### Type 2: Answer Hallucination\n",
"\n",
"Next, lets consider the case in which we want to compare our RAG chain answer to the retrieved documents."
"Second, lets consider the case in which we want to compare our RAG chain answer to the retrieved documents.\n",
"\n",
"This is shown in the red in the top figure.\n",
"\n",
"#### Eval flow\n",
"\n",
"We will use a `LangChainStringEvaluator`, as mentioned above.\n",
"\n",
"For comparing documents and answers, common built-in `LangChainStringEvaluator` options are `Criteria` [here](https://python.langchain.com/docs/guides/productionization/evaluation/string/criteria_eval_chain/#using-reference-labels) because we want to supply custom criteria.\n",
"\n",
"We will use `labeled_score_string` as an LLM-as-judge evaluator, which uses the eval prompt defined [here](https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/evaluation/criteria/prompt.py).\n",
"\n",
"Here, we only need to use two inputs of the `LangChainStringEvaluator` interface:\n",
"\n",
"1. `contexts` from LLM chain -> `reference` \n",
"2. `answer` from the LLM chain -> `prediction` \n",
"\n",
"![](../../../../../static/img/langsmith_rag_flow_hallucination.png)"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 12,
"id": "7f0872a5-e989-415d-9fed-5846efaa9488",
"metadata": {},
"outputs": [],
"source": [
"xxx"
"from langsmith.evaluation import LangChainStringEvaluator, evaluate\n",
"\n",
"answer_hallucination_evaluator = LangChainStringEvaluator(\n",
" \"labeled_score_string\", \n",
" config={\n",
" \"criteria\": { \n",
" \"accuracy\": \"Is the prediction grounded in the reference?\"\n",
" },\n",
" # If you want the score to be saved on a scale from 0 to 1\n",
" \"normalize_by\": 10,\n",
" },\n",
" prepare_data=lambda run, example: {\n",
" \"prediction\": run.outputs[\"answer\"], \n",
" \"reference\": run.outputs[\"contexts\"],\n",
" \"input\": example.inputs[\"question\"],\n",
" } \n",
")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "6d5bf61b-3903-4cde-9ecf-67f0e0874521",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for experiment: 'rag-qa-oai-hallucination-94fa7798' at:\n",
"https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/368734fb-7c14-4e1f-b91a-50d52cb58a07/compare?selectedSessions=5d82d039-0596-40a6-b901-6fe5a2e4223b\n",
"\n",
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "72dcf5fab4f24130a72390d947f48b54",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"0it [00:00, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"dataset_name = \"RAG_test_LCEL\"\n",
" \n",
"experiment_results = evaluate(\n",
" predict_rag_answer_with_context,\n",
" data=dataset_name,\n",
" evaluators=[answer_hallucination_evaluator],\n",
" experiment_prefix=\"rag-qa-oai-hallucination\",\n",
" # Any experiment metadata can be specified here\n",
" metadata={\n",
" \"variant\": \"LCEL context, gpt-3.5-turbo\",\n",
" },\n",
")"
]
},
{
@@ -316,28 +413,97 @@
"id": "480a27cb-1a31-4194-b160-8cdcfbf24eea",
"metadata": {},
"source": [
"### Retrieval\n",
"### Type 3: Document Relevance to Question\n",
"\n",
"Finally, lets consider the case in which we want to compare our retrieved documents to the question."
"Finally, lets consider the case in which we want to compare our RAG chain document retrieval to the question.\n",
"\n",
"This is shown in green in the top figure.\n",
"\n",
"#### Eval flow\n",
"\n",
"We will use a `LangChainStringEvaluator`, as mentioned above.\n",
"\n",
"For comparing documents and answers, common built-in `LangChainStringEvaluator` options are `Criteria` [here](https://python.langchain.com/docs/guides/productionization/evaluation/string/criteria_eval_chain/#using-reference-labels) because we want to supply custom criteria.\n",
"\n",
"We will use `labeled_score_string` as an LLM-as-judge evaluator, which uses the eval prompt defined [here](https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/evaluation/criteria/prompt.py).\n",
"\n",
"Here, we only need to use two inputs of the `LangChainStringEvaluator` interface:\n",
"\n",
"1. `question` from LLM chain -> `reference` \n",
"2. `contexts` from the LLM chain -> `prediction` \n",
"\n",
"![](../../../../../static/img/langsmith_rag_flow_doc_relevance.png)"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 16,
"id": "df247034-14ed-40b1-b313-b0fef7286546",
"metadata": {},
"outputs": [],
"source": [
"xxx"
"from langsmith.evaluation import LangChainStringEvaluator, evaluate\n",
"\n",
"docs_relevance_evaluator = LangChainStringEvaluator(\n",
" \"labeled_score_string\", \n",
" config={\n",
" \"criteria\": { \n",
" \"accuracy\": \"Is the prediction relevant to the reference?\"\n",
" },\n",
" # If you want the score to be saved on a scale from 0 to 1\n",
" \"normalize_by\": 10,\n",
" },\n",
" prepare_data=lambda run, example: {\n",
" \"prediction\": run.outputs[\"contexts\"], \n",
" \"reference\": example.inputs[\"question\"],\n",
" \"input\": example.inputs[\"question\"],\n",
" } \n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 17,
"id": "cfe988dc-2aaa-42f4-93ff-c3c9fe6b3124",
"metadata": {},
"outputs": [],
"source": []
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for experiment: 'rag-qa-oai-doc-relevance-1ac405db' at:\n",
"https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/368734fb-7c14-4e1f-b91a-50d52cb58a07/compare?selectedSessions=75be8a78-e92d-4f8a-a73b-d6512903add0\n",
"\n",
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "2d70afcc5b3c49b59a3b64a952dfd14b",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"0it [00:00, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"experiment_results = evaluate(\n",
" predict_rag_answer_with_context,\n",
" data=dataset_name,\n",
" evaluators=[docs_relevance_evaluator],\n",
" experiment_prefix=\"rag-qa-oai-doc-relevance\",\n",
" # Any experiment metadata can be specified here\n",
" metadata={\n",
" \"variant\": \"LCEL context, gpt-3.5-turbo\",\n",
" },\n",
")"
]
},
{
"cell_type": "code",

Binary file not shown.


Binary file not shown.
