mirror of https://github.com/hwchase17/langchain.git (synced 2026-01-24 05:50:18 +00:00)

Add hallucination and doc relevance
@@ -19,15 +19,21 @@
"\n",
"\n",
"\n",
"We will discuss each below.\n",
"Each of these evals has something in common: it will compare text (e.g., answer vs. reference answer).\n",
"\n",
"### Reference Answer\n",
"\n",
"We can use various built-in `LangChainStringEvaluator` types for this (see [here](https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations#overview)).\n",
"\n",
"First, let's consider the case in which we want to compare our RAG chain answer to a reference answer. This is shown on the far right (blue) above.\n",
"\n",
"All `LangChainStringEvaluator` implementations can accept 3 inputs:\n",
"\n",
"```\n",
"prediction: The prediction string.\n",
"reference: The reference string.\n",
"input: The input string.\n",
"```\n",
"\n",
"We will use this interface below to perform each eval.\n",
"\n",
"## RAG Chain\n",
"\n",
"To start, we build a RAG chain."
]
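For reference, here is a minimal standalone sketch of that three-input interface, calling the underlying `cot_qa` evaluator directly on strings. This snippet is not part of the notebook diff; it assumes the `langchain` evaluation module is installed and an OpenAI API key is configured, since `load_evaluator` defaults to an OpenAI chat model, and the example strings are made up.

```python
from langchain.evaluation import load_evaluator

# LLM-as-judge QA grader with chain-of-thought reasoning.
evaluator = load_evaluator("cot_qa")

# Every string evaluator accepts the same three inputs.
result = evaluator.evaluate_strings(
    prediction="LCEL is the LangChain Expression Language.",      # model output
    reference="LCEL stands for LangChain Expression Language.",   # ground-truth answer
    input="What does LCEL stand for?",                            # original question
)
print(result)  # e.g. {'reasoning': '...', 'value': 'CORRECT', 'score': 1}
```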
@@ -161,7 +167,7 @@
"id": "432e8ec7-a085-4224-ad38-0087e1d553f1",
"metadata": {},
"source": [
"## RAG Dataset\n",
"\n",
"Next, we build a dataset of QA pairs based upon the [documentation](https://python.langchain.com/docs/expression_language/) that we indexed."
]
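The dataset-construction cell itself falls outside this hunk. As a rough sketch of what it could look like, using the LangSmith client and the dataset name used later in this notebook (`RAG_test_LCEL`): the `questions` and `answers` lists here are hypothetical placeholders, while the `question`/`answer` keys match the eval mappings below.

```python
from langsmith import Client

client = Client()

# Hypothetical QA pairs about the LCEL docs.
questions = ["What is LCEL?"]
answers = ["LCEL is the LangChain Expression Language, a declarative way to compose chains."]

dataset = client.create_dataset("RAG_test_LCEL", description="QA pairs about the LCEL documentation")
client.create_examples(
    inputs=[{"question": q} for q in questions],
    outputs=[{"answer": a} for a in answers],
    dataset_id=dataset.id,
)
```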
@@ -207,21 +213,27 @@
"id": "92cf3a0f-621f-468d-818d-a6f2d4b53823",
"metadata": {},
"source": [
"## RAG Evaluators\n",
"\n",
"### Type 1: Reference Answer\n",
"\n",
"First, let's consider the case in which we want to compare our RAG chain answer to a reference answer.\n",
"\n",
"This is shown on the far right (blue) in the top figure.\n",
"\n",
"#### Eval flow\n",
"\n",
"There are [several different evaluators](https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations) that can be used to compare our RAG chain answer to a reference answer.\n",
"We will use a `LangChainStringEvaluator`, as mentioned above.\n",
"\n",
"For comparing questions and answers, common built-in `LangChainStringEvaluator` options are `QA` and `CoT_QA` (see the [evaluator implementations](https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations)).\n",
"\n",
"We will use `CoT_QA` as an LLM-as-judge evaluator, which uses the eval prompt defined [here](https://github.com/langchain-ai/langchain/blob/22da9f5f3f9fef24c5c75072b678b8a2f654b173/libs/langchain/langchain/evaluation/qa/eval_prompt.py#L43).\n",
"\n",
"All `LangChainStringEvaluator` implementations expose a common interface to pass these inputs, so our evaluator will connect our dataset and RAG chain outputs to the evaluator prompt inputs:\n",
"\n",
"1. `question` from the dataset -> `input` (the `question` in the eval prompt, the RAG chain input)\n",
"2. `answer` from the dataset -> `reference` (the ground truth answer, the `context` in the eval prompt)\n",
"3. `answer` generated by the RAG chain via `predict_rag_answer` below -> `prediction` (the `result` in the eval prompt)\n",
"\n",
""
]
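The target functions passed to `evaluate` below (`predict_rag_answer` and `predict_rag_answer_with_context`) are defined alongside the RAG chain earlier in the notebook and are not shown in this diff. As a hedged sketch only, assuming a hypothetical `rag_chain` object that returns a dict with `answer` and `contexts` keys (the same keys read from `run.outputs` below):

```python
def predict_rag_answer(example: dict) -> dict:
    """Answer-only target: used for the reference-answer eval (Type 1)."""
    # `rag_chain` is a hypothetical stand-in for the chain built above.
    response = rag_chain.invoke(example["question"])
    return {"answer": response["answer"]}


def predict_rag_answer_with_context(example: dict) -> dict:
    """Target that also returns retrieved docs: used for the hallucination and doc-relevance evals."""
    response = rag_chain.invoke(example["question"])
    return {"answer": response["answer"], "contexts": response["contexts"]}
```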
@@ -280,7 +292,13 @@
"from langsmith.evaluation import LangChainStringEvaluator, evaluate\n",
"\n",
"# Evaluator\n",
"qa_evalulator = [LangChainStringEvaluator(\"cot_qa\",\n",
"                                          prepare_data=lambda run, example: {\n",
"                                              \"prediction\": run.outputs[\"answer\"],  # RAG chain answer\n",
"                                              \"reference\": example.outputs[\"answer\"],  # ground truth answer from the dataset\n",
"                                              \"input\": example.inputs[\"question\"],  # question\n",
"                                          }\n",
"                                          ))]\n",
"dataset_name = \"RAG_test_LCEL\"\n",
"experiment_results = evaluate(\n",
"    predict_rag_answer,\n",
@@ -296,19 +314,98 @@
"id": "60ba4123-c691-4aa0-ba76-e567e8aaf09f",
"metadata": {},
"source": [
"### Type 2: Answer Hallucination\n",
"\n",
"Second, let's consider the case in which we want to compare our RAG chain answer to the retrieved documents.\n",
"\n",
"This is shown in red in the top figure.\n",
"\n",
"#### Eval flow\n",
"\n",
"We will use a `LangChainStringEvaluator`, as mentioned above.\n",
"\n",
"For comparing answers to the retrieved documents, a good built-in `LangChainStringEvaluator` option is a [criteria evaluator](https://python.langchain.com/docs/guides/productionization/evaluation/string/criteria_eval_chain/#using-reference-labels), because we want to supply custom grading criteria.\n",
"\n",
"We will use `labeled_score_string` as an LLM-as-judge evaluator, which uses the eval prompt defined [here](https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/evaluation/criteria/prompt.py).\n",
"\n",
"Here, the two key inputs of the `LangChainStringEvaluator` interface are:\n",
"\n",
"1. `contexts` from the RAG chain -> `reference`\n",
"2. `answer` from the RAG chain -> `prediction`\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "7f0872a5-e989-415d-9fed-5846efaa9488",
"metadata": {},
"outputs": [],
"source": [
"from langsmith.evaluation import LangChainStringEvaluator, evaluate\n",
"\n",
"answer_hallucination_evaluator = LangChainStringEvaluator(\n",
"    \"labeled_score_string\",\n",
"    config={\n",
"        \"criteria\": {\n",
"            \"accuracy\": \"Is the prediction grounded in the reference?\"\n",
"        },\n",
"        # If you want the score to be saved on a scale from 0 to 1\n",
"        \"normalize_by\": 10,\n",
"    },\n",
"    prepare_data=lambda run, example: {\n",
"        \"prediction\": run.outputs[\"answer\"],  # RAG chain answer\n",
"        \"reference\": run.outputs[\"contexts\"],  # retrieved documents\n",
"        \"input\": example.inputs[\"question\"],  # question\n",
"    }\n",
")"
]
},
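One practical note, not part of the notebook: `run.outputs["contexts"]` is whatever the target function returns, and the criteria judge expects a string `reference`. If your chain returns a list of `Document` objects, a small hypothetical formatting helper used inside `prepare_data` keeps the judge prompt readable.

```python
def format_docs(docs) -> str:
    """Join retrieved documents (or raw strings) into one reference string. Hypothetical helper."""
    return "\n\n".join(
        d.page_content if hasattr(d, "page_content") else str(d) for d in docs
    )

# Possible use inside prepare_data (a variant of the cell above, not the notebook's code):
# "reference": format_docs(run.outputs["contexts"]),
```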
{
"cell_type": "code",
"execution_count": 13,
"id": "6d5bf61b-3903-4cde-9ecf-67f0e0874521",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for experiment: 'rag-qa-oai-hallucination-94fa7798' at:\n",
"https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/368734fb-7c14-4e1f-b91a-50d52cb58a07/compare?selectedSessions=5d82d039-0596-40a6-b901-6fe5a2e4223b\n",
"\n",
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "72dcf5fab4f24130a72390d947f48b54",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"0it [00:00, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"dataset_name = \"RAG_test_LCEL\"\n",
"\n",
"experiment_results = evaluate(\n",
"    predict_rag_answer_with_context,\n",
"    data=dataset_name,\n",
"    evaluators=[answer_hallucination_evaluator],\n",
"    experiment_prefix=\"rag-qa-oai-hallucination\",\n",
"    # Any experiment metadata can be specified here\n",
"    metadata={\n",
"        \"variant\": \"LCEL context, gpt-3.5-turbo\",\n",
"    },\n",
")"
]
},
{
@@ -316,28 +413,97 @@
"id": "480a27cb-1a31-4194-b160-8cdcfbf24eea",
"metadata": {},
"source": [
"### Type 3: Document Relevance to Question\n",
"\n",
"Finally, let's consider the case in which we want to compare our RAG chain document retrieval to the question.\n",
"\n",
"This is shown in green in the top figure.\n",
"\n",
"#### Eval flow\n",
"\n",
"We will use a `LangChainStringEvaluator`, as mentioned above.\n",
"\n",
"For comparing retrieved documents to the question, a good built-in `LangChainStringEvaluator` option is again a [criteria evaluator](https://python.langchain.com/docs/guides/productionization/evaluation/string/criteria_eval_chain/#using-reference-labels), because we want to supply custom grading criteria.\n",
"\n",
"We will use `labeled_score_string` as an LLM-as-judge evaluator, which uses the eval prompt defined [here](https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/evaluation/criteria/prompt.py).\n",
"\n",
"Here, the two key inputs of the `LangChainStringEvaluator` interface are:\n",
"\n",
"1. `question` from the dataset -> `reference`\n",
"2. `contexts` from the RAG chain -> `prediction`\n",
"\n",
""
]
},
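To see what the `labeled_score_string` judge does with such criteria before wiring it into LangSmith, here is a minimal standalone sketch (not from the notebook). It assumes an OpenAI key is configured, and the example strings are made up; the judge's 1-10 score is divided by `normalize_by`, so a 7 becomes 0.7.

```python
from langchain.evaluation import load_evaluator

# Criteria-based judge that scores the prediction against a labeled reference.
relevance_grader = load_evaluator(
    "labeled_score_string",
    criteria={"accuracy": "Is the prediction relevant to the reference?"},
    normalize_by=10,
)

result = relevance_grader.evaluate_strings(
    prediction="LCEL composes runnables with the | operator ...",  # retrieved context, as text
    reference="How do I compose two runnables in LCEL?",           # the question
    input="How do I compose two runnables in LCEL?",
)
print(result)  # e.g. {'reasoning': '...', 'score': 0.9}
```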
{
"cell_type": "code",
"execution_count": 16,
"id": "df247034-14ed-40b1-b313-b0fef7286546",
"metadata": {},
"outputs": [],
"source": [
"from langsmith.evaluation import LangChainStringEvaluator, evaluate\n",
"\n",
"docs_relevance_evaluator = LangChainStringEvaluator(\n",
"    \"labeled_score_string\",\n",
"    config={\n",
"        \"criteria\": {\n",
"            \"accuracy\": \"Is the prediction relevant to the reference?\"\n",
"        },\n",
"        # If you want the score to be saved on a scale from 0 to 1\n",
"        \"normalize_by\": 10,\n",
"    },\n",
"    prepare_data=lambda run, example: {\n",
"        \"prediction\": run.outputs[\"contexts\"],  # retrieved documents\n",
"        \"reference\": example.inputs[\"question\"],  # question\n",
"        \"input\": example.inputs[\"question\"],\n",
"    }\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "cfe988dc-2aaa-42f4-93ff-c3c9fe6b3124",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"View the evaluation results for experiment: 'rag-qa-oai-doc-relevance-1ac405db' at:\n",
"https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/368734fb-7c14-4e1f-b91a-50d52cb58a07/compare?selectedSessions=75be8a78-e92d-4f8a-a73b-d6512903add0\n",
"\n",
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "2d70afcc5b3c49b59a3b64a952dfd14b",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"0it [00:00, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"experiment_results = evaluate(\n",
"    predict_rag_answer_with_context,\n",
"    data=dataset_name,\n",
"    evaluators=[docs_relevance_evaluator],\n",
"    experiment_prefix=\"rag-qa-oai-doc-relevance\",\n",
"    # Any experiment metadata can be specified here\n",
"    metadata={\n",
"        \"variant\": \"LCEL context, gpt-3.5-turbo\",\n",
"    },\n",
")"
]
},
{
"cell_type": "code",
BIN  docs/static/img/langsmith_rag_flow_doc_relevance.png (vendored, new binary file, 121 KiB, not shown)
BIN  docs/static/img/langsmith_rag_flow_hallucination.png (vendored, new binary file, 121 KiB, not shown)