Add hallucination and doc relevance

Lance Martin
2024-04-16 16:56:41 -07:00
parent c152fc5733
commit 16152b3cdd
3 changed files with 193 additions and 27 deletions


@@ -19,15 +19,21 @@
"![](../../../../../static/img/langsmith_rag_eval.png)\n",
"\n",
"\n",
"We will discuss each below.\n",
"Each of these evals has something in common: it will compare text (e.g., answer vs reference answer, etc).\n",
"\n",
"### Reference Answer\n",
"We can use various built-in `LangChainStringEvaluator` types for this (see [here](https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations#overview)).\n",
"\n",
"First, lets consider the case in which we want to compare our RAG chain answer to a reference answer.\n",
"All `LangChainStringEvaluator` implementations can accept 3 inputs:\n",
"\n",
"This is shown on the far right (blue) above.\n",
"```\n",
"prediction: The prediction string.\n",
"reference: The reference string.\n",
"input: The input string.\n",
"```\n",
"\n",
"#### RAG Chain \n",
"Below, we will use this to perform eval.\n",
"\n",
"## RAG Chain \n",
"\n",
"To start, we build a RAG chain. "
]
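
> The RAG chain itself is built in notebook cells outside this hunk. As a rough sketch only (the loader, splitter, embedding model, prompt, and model choice below are illustrative assumptions, not the notebook's actual code), the chain and the two prediction functions referenced by the evaluators later in this diff could look like:

```python
# Hedged sketch of the RAG chain; component choices are assumptions for illustration.
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Index the LCEL docs (assumed source) into a vector store and build a retriever
docs = WebBaseLoader("https://python.langchain.com/docs/expression_language/").load()
splits = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)
retriever = Chroma.from_documents(splits, OpenAIEmbeddings()).as_retriever()

# Simple grounded-answer prompt piped into gpt-3.5-turbo
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only this context:\n\n{context}\n\nQuestion: {question}"
)
chain = prompt | ChatOpenAI(model="gpt-3.5-turbo", temperature=0) | StrOutputParser()


def predict_rag_answer(example: dict) -> dict:
    """Prediction function for the reference-answer eval: returns only the answer."""
    context = retriever.invoke(example["question"])
    answer = chain.invoke({"question": example["question"], "context": context})
    return {"answer": answer}


def predict_rag_answer_with_context(example: dict) -> dict:
    """Prediction function for the hallucination and doc-relevance evals: also returns the docs."""
    context = retriever.invoke(example["question"])
    answer = chain.invoke({"question": example["question"], "context": context})
    return {"answer": answer, "contexts": [d.page_content for d in context]}
```

> The only detail the evaluators below depend on is the output shape: `predict_rag_answer` returns an `answer` key, and `predict_rag_answer_with_context` additionally returns `contexts`.
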
@@ -161,7 +167,7 @@
"id": "432e8ec7-a085-4224-ad38-0087e1d553f1",
"metadata": {},
"source": [
"#### RAG Dataset \n",
"## RAG Dataset \n",
"\n",
"Next, we build a dataset of QA pairs based upon the [documentation](https://python.langchain.com/docs/expression_language/) that we indexed."
]
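
> The dataset-construction cells also fall outside this hunk. A minimal sketch using the LangSmith client (the QA pair shown is a placeholder, not the notebook's actual data) could be:

```python
# Hedged sketch: the real QA pairs are drawn from the LCEL documentation.
from langsmith import Client

client = Client()
dataset_name = "RAG_test_LCEL"

# Placeholder example; the input key "question" and output key "answer"
# are what the prepare_data mappings below expect.
qa_pairs = [
    ("How do I compose two runnables in LCEL?",
     "Chain them with the pipe operator, e.g. prompt | llm | parser."),
]

dataset = client.create_dataset(dataset_name=dataset_name)
client.create_examples(
    inputs=[{"question": q} for q, _ in qa_pairs],
    outputs=[{"answer": a} for _, a in qa_pairs],
    dataset_id=dataset.id,
)
```
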
@@ -207,21 +213,27 @@
"id": "92cf3a0f-621f-468d-818d-a6f2d4b53823",
"metadata": {},
"source": [
"## RAG Evaluators\n",
"\n",
"### Type 1: Reference Answer\n",
"\n",
"First, lets consider the case in which we want to compare our RAG chain answer to a reference answer.\n",
"\n",
"This is shown on the far right (blue) in the top figure.\n",
"\n",
"#### Eval flow\n",
"\n",
"There are [several different evaluators](https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations) that can be used to compare our RAG chain answer to a reference answer.\n",
"We will use a `LangChainStringEvaluator`, as mentioned above.\n",
"\n",
"< `TODO:` Update table to link to the eval prompts. > \n",
"For comparing questions and answers, common built-in `LangChainStringEvaluator` options are `QA` and `CoTQA` [here different evaluators](https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations).\n",
"\n",
"Here, we will use `CoT_QA` as an LLM-as-judge evaluator.\n",
"We will use `CoT_QA` as an LLM-as-judge evaluator, which uses the eval prompt defined [here](https://github.com/langchain-ai/langchain/blob/22da9f5f3f9fef24c5c75072b678b8a2f654b173/libs/langchain/langchain/evaluation/qa/eval_prompt.py#L43).\n",
"\n",
"[Here](https://github.com/langchain-ai/langchain/blob/22da9f5f3f9fef24c5c75072b678b8a2f654b173/libs/langchain/langchain/evaluation/qa/eval_prompt.py#L43) is the prompt used by `CoT_QA`.\n",
"But, all `LangChainStringEvaluator` expose a common interface to pass your inputs:\n",
"\n",
"Our evaluator will connect our dataset and RAG chain outputs to the evaluator prompt inputs:\n",
"\n",
"1. `question` from the dataset -> `question` in the prompt, the RAG chain input\n",
"2. `answer` from the dataset -> `context` in the prompt, the ground truth answer\n",
"3. `answer` from the LLM using `predict_rag_answer` function below -> `result` in the prompt, the RAG chain result\n",
"1. `question` from the dataset -> `input` \n",
"2. `answer` from the dataset -> `reference` \n",
"3. `answer` from the LLM -> `prediction` \n",
"\n",
"![](../../../../../static/img/langsmith_rag_flow.png)"
]
@@ -280,7 +292,13 @@
"from langsmith.evaluation import LangChainStringEvaluator, evaluate\n",
"\n",
"# Evaluator \n",
"qa_evalulator = [LangChainStringEvaluator(\"cot_qa\")]\n",
"qa_evalulator = [LangChainStringEvaluator(\"cot_qa\", \n",
" prepare_data=lambda run, example: {\n",
" \"prediction\": run.outputs[\"answer\"], \n",
" \"reference\": run.outputs[\"contexts\"],\n",
" \"input\": example.inputs[\"question\"],\n",
" } \n",
" ))]\n",
"dataset_name = \"RAG_test_LCEL\"\n",
"experiment_results = evaluate(\n",
" predict_rag_answer,\n",
@@ -296,19 +314,98 @@
"id": "60ba4123-c691-4aa0-ba76-e567e8aaf09f",
"metadata": {},
"source": [
"### Answer Hallucination\n",
"### Type 2: Answer Hallucination\n",
"\n",
"Next, lets consider the case in which we want to compare our RAG chain answer to the retrieved documents."
"Second, lets consider the case in which we want to compare our RAG chain answer to the retrieved documents.\n",
"\n",
"This is shown in the red in the top figure.\n",
"\n",
"#### Eval flow\n",
"\n",
"We will use a `LangChainStringEvaluator`, as mentioned above.\n",
"\n",
"For comparing documents and answers, common built-in `LangChainStringEvaluator` options are `Criteria` [here](https://python.langchain.com/docs/guides/productionization/evaluation/string/criteria_eval_chain/#using-reference-labels) because we want to supply custom criteria.\n",
"\n",
"We will use `labeled_score_string` as an LLM-as-judge evaluator, which uses the eval prompt defined [here](https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/evaluation/criteria/prompt.py).\n",
"\n",
"Here, we only need to use two inputs of the `LangChainStringEvaluator` interface:\n",
"\n",
"1. `contexts` from LLM chain -> `reference` \n",
"2. `answer` from the LLM chain -> `prediction` \n",
"\n",
"![](../../../../../static/img/langsmith_rag_flow_hallucination.png)"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 12,
"id": "7f0872a5-e989-415d-9fed-5846efaa9488",
"metadata": {},
"outputs": [],
"source": [
"xxx"
"from langsmith.evaluation import LangChainStringEvaluator, evaluate\n",
"\n",
"answer_hallucination_evaluator = LangChainStringEvaluator(\n",
" \"labeled_score_string\", \n",
" config={\n",
" \"criteria\": { \n",
" \"accuracy\": \"Is the prediction grounded in the reference?\"\n",
" },\n",
" # If you want the score to be saved on a scale from 0 to 1\n",
" \"normalize_by\": 10,\n",
" },\n",
" prepare_data=lambda run, example: {\n",
" \"prediction\": run.outputs[\"answer\"], \n",
" \"reference\": run.outputs[\"contexts\"],\n",
" \"input\": example.inputs[\"question\"],\n",
" } \n",
")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "6d5bf61b-3903-4cde-9ecf-67f0e0874521",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for experiment: 'rag-qa-oai-hallucination-94fa7798' at:\n",
"https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/368734fb-7c14-4e1f-b91a-50d52cb58a07/compare?selectedSessions=5d82d039-0596-40a6-b901-6fe5a2e4223b\n",
"\n",
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "72dcf5fab4f24130a72390d947f48b54",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"0it [00:00, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"dataset_name = \"RAG_test_LCEL\"\n",
" \n",
"experiment_results = evaluate(\n",
" predict_rag_answer_with_context,\n",
" data=dataset_name,\n",
" evaluators=[answer_hallucination_evaluator],\n",
" experiment_prefix=\"rag-qa-oai-hallucination\",\n",
" # Any experiment metadata can be specified here\n",
" metadata={\n",
" \"variant\": \"LCEL context, gpt-3.5-turbo\",\n",
" },\n",
")"
]
},
{
@@ -316,28 +413,97 @@
"id": "480a27cb-1a31-4194-b160-8cdcfbf24eea",
"metadata": {},
"source": [
"### Retrieval\n",
"### Type 3: Document Relevance to Question\n",
"\n",
"Finally, lets consider the case in which we want to compare our retrieved documents to the question."
"Finally, lets consider the case in which we want to compare our RAG chain document retrieval to the question.\n",
"\n",
"This is shown in green in the top figure.\n",
"\n",
"#### Eval flow\n",
"\n",
"We will use a `LangChainStringEvaluator`, as mentioned above.\n",
"\n",
"For comparing documents and answers, common built-in `LangChainStringEvaluator` options are `Criteria` [here](https://python.langchain.com/docs/guides/productionization/evaluation/string/criteria_eval_chain/#using-reference-labels) because we want to supply custom criteria.\n",
"\n",
"We will use `labeled_score_string` as an LLM-as-judge evaluator, which uses the eval prompt defined [here](https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/evaluation/criteria/prompt.py).\n",
"\n",
"Here, we only need to use two inputs of the `LangChainStringEvaluator` interface:\n",
"\n",
"1. `question` from LLM chain -> `reference` \n",
"2. `contexts` from the LLM chain -> `prediction` \n",
"\n",
"![](../../../../../static/img/langsmith_rag_flow_doc_relevance.png)"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 16,
"id": "df247034-14ed-40b1-b313-b0fef7286546",
"metadata": {},
"outputs": [],
"source": [
"xxx"
"from langsmith.evaluation import LangChainStringEvaluator, evaluate\n",
"\n",
"docs_relevance_evaluator = LangChainStringEvaluator(\n",
" \"labeled_score_string\", \n",
" config={\n",
" \"criteria\": { \n",
" \"accuracy\": \"Is the prediction relevant to the reference?\"\n",
" },\n",
" # If you want the score to be saved on a scale from 0 to 1\n",
" \"normalize_by\": 10,\n",
" },\n",
" prepare_data=lambda run, example: {\n",
" \"prediction\": run.outputs[\"contexts\"], \n",
" \"reference\": example.inputs[\"question\"],\n",
" \"input\": example.inputs[\"question\"],\n",
" } \n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 17,
"id": "cfe988dc-2aaa-42f4-93ff-c3c9fe6b3124",
"metadata": {},
"outputs": [],
"source": []
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for experiment: 'rag-qa-oai-doc-relevance-1ac405db' at:\n",
"https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/368734fb-7c14-4e1f-b91a-50d52cb58a07/compare?selectedSessions=75be8a78-e92d-4f8a-a73b-d6512903add0\n",
"\n",
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "2d70afcc5b3c49b59a3b64a952dfd14b",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"0it [00:00, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"experiment_results = evaluate(\n",
" predict_rag_answer_with_context,\n",
" data=dataset_name,\n",
" evaluators=[docs_relevance_evaluator],\n",
" experiment_prefix=\"rag-qa-oai-doc-relevance\",\n",
" # Any experiment metadata can be specified here\n",
" metadata={\n",
" \"variant\": \"LCEL context, gpt-3.5-turbo\",\n",
" },\n",
")"
]
},
{
"cell_type": "code",

Binary file not shown.


Binary file not shown.
