{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2da95378",
   "metadata": {},
   "source": [
    "# Pairwise String Comparison\n",
    "\n",
    "Often you will want to compare predictions of an LLM, Chain, or Agent for a given input. The `StringComparison` evaluators facilitate this so you can answer questions like:\n",
    "\n",
    "- Which LLM or prompt produces a preferred output for a given question?\n",
    "- Which examples should I include for few-shot example selection?\n",
    "- Which output is better to include for fine-tuning?\n",
    "\n",
    "The simplest and often most reliable automated way to choose a preferred prediction for a given input is to use the `pairwise_string` evaluator.\n",
    "\n",
    "Check out the reference docs for the [PairwiseStringEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.comparison.eval_chain.PairwiseStringEvalChain.html#langchain.evaluation.comparison.eval_chain.PairwiseStringEvalChain) for more info."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "f6790c46",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.evaluation import load_evaluator\n",
    "\n",
    "evaluator = load_evaluator(\"labeled_pairwise_string\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "49ad9139",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'reasoning': 'Response A is incorrect as it states there are three dogs in the park, which contradicts the reference answer of four. Response B, on the other hand, is accurate as it matches the reference answer. Although Response B is not as detailed or elaborate as Response A, it is more important that the response is accurate. \\n\\nFinal Decision: [[B]]\\n',\n",
       " 'value': 'B',\n",
       " 'score': 0}"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluator.evaluate_string_pairs(\n",
    "    prediction=\"there are three dogs\",\n",
    "    prediction_b=\"4\",\n",
    "    input=\"how many dogs are in the park?\",\n",
    "    reference=\"four\",\n",
    ")"
   ]
  },
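  {
   "cell_type": "markdown",
   "id": "3f2b9c41",
   "metadata": {},
   "source": [
    "In the result above, `reasoning` is the evaluation LLM's explanation, `value` names the preferred prediction (\"A\" for the first `prediction`, \"B\" for `prediction_b`), and `score` is the corresponding binary score: 1 when the first prediction is preferred, 0 when `prediction_b` is preferred."
   ]
  },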
  {
   "cell_type": "markdown",
   "id": "ed353b93-be71-4479-b9c0-8c97814c2e58",
   "metadata": {},
   "source": [
    "## Without References\n",
    "\n",
    "When references aren't available, you can still predict the preferred response.\n",
    "The results will reflect the evaluation model's preference, which is less reliable and may result\n",
    "in preferences that are factually incorrect."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "586320da",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.evaluation import load_evaluator\n",
    "\n",
    "evaluator = load_evaluator(\"pairwise_string\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "7f56c76e-a39b-4509-8b8a-8a2afe6c3da1",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'reasoning': \"Response A is accurate but lacks depth and detail. It simply states that addition is a mathematical operation without explaining what it does or how it works. \\n\\nResponse B, on the other hand, provides a more detailed explanation. It not only identifies addition as a mathematical operation, but also explains that it involves adding two numbers to create a third number, the 'sum'. This response is more helpful and informative, providing a clearer understanding of what addition is.\\n\\nTherefore, the better response is B.\\n\",\n",
       " 'value': 'B',\n",
       " 'score': 0}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluator.evaluate_string_pairs(\n",
    "    prediction=\"Addition is a mathematical operation.\",\n",
    "    prediction_b=\"Addition is a mathematical operation that adds two numbers to create a third number, the 'sum'.\",\n",
    "    input=\"What is addition?\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a25b60b2-627c-408a-be4b-a2e5cbc10726",
   "metadata": {},
   "source": [
    "## Customize the LLM\n",
    "\n",
    "By default, the loader uses `gpt-4` in the evaluation chain. You can customize this when loading."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "de84a958-1330-482b-b950-68bcf23f9e35",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chat_models import ChatAnthropic\n",
    "\n",
    "llm = ChatAnthropic(temperature=0)\n",
    "\n",
    "evaluator = load_evaluator(\"labeled_pairwise_string\", llm=llm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "e162153f-d50a-4a7c-a033-019dabbc954c",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'reasoning': 'Here is my assessment:\\n\\nResponse B is better because it directly answers the question by stating the number \"4\", which matches the ground truth reference answer. Response A provides an incorrect number of dogs, stating there are three dogs when the reference says there are four. \\n\\nResponse B is more helpful, relevant, accurate and provides the right level of detail by simply stating the number that was asked for. Response A provides an inaccurate number, so is less helpful and accurate.\\n\\nIn summary, Response B better followed the instructions and answered the question correctly per the reference answer.\\n\\n[[B]]',\n",
       " 'value': 'B',\n",
       " 'score': 0}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluator.evaluate_string_pairs(\n",
    "    prediction=\"there are three dogs\",\n",
    "    prediction_b=\"4\",\n",
    "    input=\"how many dogs are in the park?\",\n",
    "    reference=\"four\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0e89c13-d0ad-4f87-8fcb-814399bafa2a",
   "metadata": {},
   "source": [
    "## Customize the Evaluation Prompt\n",
    "\n",
    "You can use your own custom evaluation prompt to add more task-specific instructions or to instruct the evaluator to score the output.\n",
    "\n",
    "*Note: If you use a prompt that generates a result in a unique format, you may also have to pass in a custom output parser (`output_parser=your_parser()`) instead of the default `PairwiseStringResultOutputParser`.*"
   ]
  },
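  {
   "cell_type": "markdown",
   "id": "f2c9a7e1",
   "metadata": {},
   "source": [
    "For illustration, the cell below sketches what such a custom parser might look like. It is a minimal, hypothetical example: it assumes your prompt asks the model to finish with a line like `Preference: A` or `Preference: B` (a format the default parser does not expect), and the class name `PreferenceLineOutputParser`, the regex, and the returned dictionary keys (modeled on the default parser's output) are illustrative assumptions rather than part of the LangChain API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7d3e9b2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "from langchain.schema import BaseOutputParser\n",
    "\n",
    "\n",
    "class PreferenceLineOutputParser(BaseOutputParser[dict]):\n",
    "    \"\"\"Hypothetical parser for prompts that end with a 'Preference: A' / 'Preference: B' line.\"\"\"\n",
    "\n",
    "    @property\n",
    "    def _type(self) -> str:\n",
    "        return \"preference_line\"\n",
    "\n",
    "    def parse(self, text: str) -> dict:\n",
    "        # Pull the verdict letter off the final 'Preference: X' line.\n",
    "        match = re.search(r\"Preference:\\s*([AB])\", text)\n",
    "        if match is None:\n",
    "            raise ValueError(f\"Could not find a preference in: {text}\")\n",
    "        value = match.group(1)\n",
    "        return {\n",
    "            \"reasoning\": text.strip(),\n",
    "            \"value\": value,\n",
    "            \"score\": 1 if value == \"A\" else 0,\n",
    "        }\n",
    "\n",
    "\n",
    "# Hypothetical usage with a prompt (`my_prompt`) that asks for a 'Preference: X' line:\n",
    "# evaluator = load_evaluator(\n",
    "#     \"labeled_pairwise_string\", prompt=my_prompt, output_parser=PreferenceLineOutputParser()\n",
    "# )"
   ]
  },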
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "fb817efa-3a4d-439d-af8c-773b89d97ec9",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain.prompts import PromptTemplate\n",
    "\n",
    "prompt_template = PromptTemplate.from_template(\n",
    "    \"\"\"Given the input context, which is most similar to the reference label: A or B?\n",
    "Reason step by step and finally, respond with either [[A]] or [[B]] on its own line.\n",
    "\n",
    "DATA\n",
    "----\n",
    "input: {input}\n",
    "reference: {reference}\n",
    "A: {prediction}\n",
    "B: {prediction_b}\n",
    "---\n",
    "Reasoning:\n",
    "\n",
    "\"\"\"\n",
    ")\n",
    "evaluator = load_evaluator(\n",
    "    \"labeled_pairwise_string\", prompt=prompt_template\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "d40aa4f0-cfd5-4cb4-83c8-8d2300a04c2f",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "input_variables=['input', 'prediction', 'prediction_b', 'reference'] output_parser=None partial_variables={} template='Given the input context, which is most similar to the reference label: A or B?\\nReason step by step and finally, respond with either [[A]] or [[B]] on its own line.\\n\\nDATA\\n----\\ninput: {input}\\nreference: {reference}\\nA: {prediction}\\nB: {prediction_b}\\n---\\nReasoning:\\n\\n' template_format='f-string' validate_template=True\n"
     ]
    }
   ],
   "source": [
    "# The prompt was assigned to the evaluator\n",
    "print(evaluator.prompt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "9467bb42-7a31-4071-8f66-9ed2c6f06dcd",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'reasoning': 'Option A is more similar to the reference label because it mentions the same dog\\'s name, \"fido\". Option B mentions a different name, \"spot\". Therefore, A is more similar to the reference label. \\n',\n",
       " 'value': 'A',\n",
       " 'score': 1}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluator.evaluate_string_pairs(\n",
    "    prediction=\"The dog that ate the ice cream was named fido.\",\n",
    "    prediction_b=\"The dog's name is spot\",\n",
    "    input=\"What is the name of the dog that ate the ice cream?\",\n",
    "    reference=\"The dog's name is fido\",\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
|