[Breaking] Update Evaluation Functionality (#7388)

- Migrate from deprecated langchainplus_sdk to `langsmith` package
- Update the `run_on_dataset()` API to use an eval config
- Update a number of evaluators, as well as the loading logic
- Update docstrings / reference docs
- Update tracer to share single HTTP session
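
For readers migrating existing code, the `run_on_dataset()` change listed above can be sketched roughly as follows. This is a minimal illustration rather than the definitive usage: the imports and keyword arguments mirror the calls in the `langsmith.ipynb` notebook added by this commit, while the wrapper function `evaluate_agent` and its parameter names are hypothetical and exist only for this example.

```python
def evaluate_agent(client, dataset_name, agent_factory):
    """Run a chain factory over a dataset using an eval config (sketch)."""
    # Imports are deferred so this sketch can be loaded without langchain installed.
    from langchain.evaluation import EvaluatorType
    from langchain.smith import RunEvalConfig, run_on_dataset

    # The eval config bundles the evaluators to apply to each run.
    evaluation_config = RunEvalConfig(
        evaluators=[
            EvaluatorType.QA,  # "Correctness" against a reference answer
            RunEvalConfig.Criteria("helpfulness"),
        ]
    )
    # A factory is passed (not a chain instance) so each row gets a fresh chain.
    return run_on_dataset(
        client=client,
        dataset_name=dataset_name,
        llm_or_chain_factory=agent_factory,
        evaluation=evaluation_config,
    )
```

Calling `evaluate_agent(client, dataset_name, agent_factory)` with a `langsmith` `Client`, an existing dataset name, and a zero-argument factory would then run the evaluation; the async variant `arun_on_dataset` takes the same keyword arguments in the notebook.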
2025-09-04 20:46:45 +00:00 · 2023-07-13 02:13:06 -07:00
parent 224199083b
commit a673a51efa
48 changed files with 3628 additions and 2548 deletions
--- a/docs/extras/guides/evaluation/criteria_eval_chain.ipynb
+++ b/docs/extras/guides/evaluation/criteria_eval_chain.ipynb
@@ -12,7 +12,7 @@
    "The `CriteriaEvalChain` is a convenient way to predict whether an LLM or Chain's output complies with a set of criteria, so long as you can\n",
    "describe those criteria in regular language. In this example, you will use the `CriteriaEvalChain` to check whether an output is concise.\n",
    "\n",
-    "### Step 1: Create the Eval Chain\n",
+    "### Step 1: Load Eval Chain\n",
    "\n",
    "First, create the evaluation chain to predict whether outputs are \"concise\"."
   ]
@@ -27,11 +27,15 @@
   "outputs": [],
   "source": [
    "from langchain.chat_models import ChatOpenAI\n",
-    "from langchain.evaluation.criteria import CriteriaEvalChain\n",
+    "from langchain.evaluation import load_evaluator, EvaluatorType\n",
    "\n",
-    "llm = ChatOpenAI(temperature=0)\n",
+    "eval_llm = ChatOpenAI(model=\"gpt-4\", temperature=0)\n",
    "criterion = \"conciseness\"\n",
-    "eval_chain = CriteriaEvalChain.from_llm(llm=llm, criteria=criterion)"
+    "eval_chain = load_evaluator(EvaluatorType.CRITERIA, llm=eval_llm, criteria=criterion)\n",
+    "\n",
+    "# Equivalent to:\n",
+    "# from langchain.evaluation import CriteriaEvalChain\n",
+    "# CriteriaEvalChain.from_llm(llm=eval_llm, criteria=criterion)"
   ]
  },
  {
@@ -80,7 +84,7 @@
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "{'reasoning': '1. Conciseness: The submission is concise and to the point. It directly answers the question without any unnecessary information. Therefore, the submission meets the criterion of conciseness.\\n\\nY', 'value': 'Y', 'score': 1}\n"
+      "{'reasoning': 'The criterion for this task is conciseness. The submission should be concise and to the point.\\n\\nLooking at the submission, it provides a detailed explanation of the origin of the term \"synecdoche\". It explains the Greek roots of the word and how it entered the English language. \\n\\nWhile the explanation is detailed, it is also concise. It doesn\\'t include unnecessary information or go off on tangents. It sticks to the point, which is explaining the origin of the term.\\n\\nTherefore, the submission meets the criterion of conciseness.\\n\\nY', 'value': 'Y', 'score': 1}\n"
     ]
    }
   ],
@@ -89,40 +93,6 @@
    "print(eval_result)"
   ]
  },
-  {
-   "cell_type": "code",
-   "execution_count": 4,
-   "id": "8c4ec9dd-6557-4f23-8480-c822eb6ec552",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "['conciseness',\n",
-       " 'relevance',\n",
-       " 'correctness',\n",
-       " 'coherence',\n",
-       " 'harmfulness',\n",
-       " 'maliciousness',\n",
-       " 'helpfulness',\n",
-       " 'controversiality',\n",
-       " 'mysogyny',\n",
-       " 'criminality',\n",
-       " 'insensitive']"
-      ]
-     },
-     "execution_count": 4,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "# For a list of other default supported criteria, try calling `supported_default_criteria`\n",
-    "CriteriaEvalChain.get_supported_default_criteria()"
-   ]
-  },
  {
   "cell_type": "markdown",
   "id": "c40b1ac7-8f95-48ed-89a2-623bcc746461",
@@ -133,6 +103,24 @@
    "Some criteria may be useful only when there are ground truth reference labels. You can pass these in as well."
   ]
  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "0c41cd19",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "eval_chain = load_evaluator(\n",
+    "    EvaluatorType.LABELED_CRITERIA,\n",
+    "    llm=eval_llm,\n",
+    "    criteria=\"correctness\",\n",
+    ")\n",
+    "\n",
+    "# Equivalent to\n",
+    "# from langchain.evaluation import LabeledCriteriaEvalChain\n",
+    "# LabeledCriteriaEvalChain.from_llm(llm=eval_llm, criteria=\"correctness\")"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 5,
@@ -145,65 +133,18 @@
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "With ground truth: 1\n",
-      "Withoutg ground truth: 0\n"
+      "With ground truth: 1\n"
     ]
    }
   ],
   "source": [
-    "eval_chain = CriteriaEvalChain.from_llm(\n",
-    "    llm=llm, criteria=\"correctness\", requires_reference=True\n",
-    ")\n",
-    "\n",
    "# We can even override the model's learned knowledge using ground truth labels\n",
    "eval_result = eval_chain.evaluate_strings(\n",
    "    input=\"What is the capital of the US?\",\n",
    "    prediction=\"Topeka, KS\",\n",
    "    reference=\"The capital of the US is Topeka, KS, where it permanently moved from Washington D.C. on May 16, 2023\",\n",
    ")\n",
-    "print(f'With ground truth: {eval_result[\"score\"]}')\n",
-    "\n",
-    "eval_chain = CriteriaEvalChain.from_llm(llm=llm, criteria=\"correctness\")\n",
-    "eval_result = eval_chain.evaluate_strings(\n",
-    "    input=\"What is the capital of the US?\",\n",
-    "    prediction=\"Topeka, KS\",\n",
-    ")\n",
-    "print(f'Withoutg ground truth: {eval_result[\"score\"]}')"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "2eb7dedb-913a-4d9e-b48a-9521425d1008",
-   "metadata": {
-    "tags": []
-   },
-   "source": [
-    "## Multiple Criteria\n",
-    "\n",
-    "To check whether an output complies with all of a list of default criteria, pass in a list! Be sure to only include criteria that are relevant to the provided information, and avoid mixing criteria that measure opposing things (e.g., harmfulness and helpfulness)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 6,
-   "id": "50c067f7-bc6e-4d6c-ba34-97a72023be27",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "{'reasoning': 'Conciseness:\\n- The submission is one sentence long, which is concise.\\n- The submission directly answers the question without any unnecessary information.\\nConclusion: The submission meets the conciseness criterion.\\n\\nCoherence:\\n- The submission is well-structured and organized.\\n- The submission provides the origin of the term synecdoche and explains the meaning of the Greek words it comes from.\\n- The submission is coherent and easy to understand.\\nConclusion: The submission meets the coherence criterion.', 'value': 'Final conclusion: Y', 'score': None}\n"
-     ]
-    }
-   ],
-   "source": [
-    "criteria = [\"conciseness\", \"coherence\"]\n",
-    "eval_chain = CriteriaEvalChain.from_llm(llm=llm, criteria=criteria)\n",
-    "eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)\n",
-    "print(eval_result)"
+    "print(f'With ground truth: {eval_result[\"score\"]}')"
   ]
  },
  {
@@ -220,7 +161,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 6,
   "id": "bafa0a11-2617-4663-84bf-24df7d0736be",
   "metadata": {},
   "outputs": [
@@ -228,62 +169,22 @@
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "{'reasoning': '1. Criteria: numeric: Does the output contain numeric information?\\n- The submission does not contain any numeric information.\\n- Conclusion: The submission meets the criteria.', 'value': 'Answer: Y', 'score': None}\n"
+      "{'reasoning': 'The criterion is asking if the output contains numeric information. The submission does mention the \"late 16th century,\" which is a numeric information. Therefore, the submission meets the criterion.\\n\\nY', 'value': 'Y', 'score': 1}\n"
     ]
    }
   ],
   "source": [
    "custom_criterion = {\"numeric\": \"Does the output contain numeric information?\"}\n",
    "\n",
-    "eval_chain = CriteriaEvalChain.from_llm(llm=llm, criteria=custom_criterion)\n",
+    "eval_chain = load_evaluator(\n",
+    "    EvaluatorType.CRITERIA,\n",
+    "    llm=eval_llm,\n",
+    "    criteria=custom_criterion,\n",
+    ")\n",
    "eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)\n",
    "print(eval_result)"
   ]
  },
-  {
-   "cell_type": "code",
-   "execution_count": 8,
-   "id": "6db12a16-0058-4a14-8064-8528540963d8",
-   "metadata": {
-    "tags": []
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Meets criteria:  1\n",
-      "Does not meet criteria:  0\n"
-     ]
-    }
-   ],
-   "source": [
-    "# You can specify multiple criteria in the dictionary. We recommend you keep the number criteria to a minimum, however for more reliable results.\n",
-    "\n",
-    "custom_criteria = {\n",
-    "    \"complements-user\": \"Does the submission complements the question or the person writing the question in some way?\",\n",
-    "    \"positive\": \"Does the submission maintain a positive sentiment throughout?\",\n",
-    "    \"active voice\": \"Does the submission maintain an active voice throughout, avoiding state of being verbs?\",\n",
-    "}\n",
-    "\n",
-    "eval_chain = CriteriaEvalChain.from_llm(llm=llm, criteria=custom_criteria)\n",
-    "\n",
-    "# Example that complies\n",
-    "query = \"What's the population of lagos?\"\n",
-    "eval_result = eval_chain.evaluate_strings(\n",
-    "    prediction=\"I think that's a great question, you're really curious! About 30 million people live in Lagos, Nigeria, as of 2023.\",\n",
-    "    input=query,\n",
-    ")\n",
-    "print(\"Meets criteria: \", eval_result[\"score\"])\n",
-    "\n",
-    "# Example that does not comply\n",
-    "eval_result = eval_chain.evaluate_strings(\n",
-    "    prediction=\"The population of Lagos, Nigeria, is about 30 million people.\",\n",
-    "    input=query,\n",
-    ")\n",
-    "print(\"Does not meet criteria: \", eval_result[\"score\"])"
-   ]
-  },
  {
   "cell_type": "markdown",
   "id": "07485cce-8d52-43a0-bdad-76ec7dacfb51",
@@ -297,7 +198,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 7,
   "id": "99e3c242-5b12-4bd5-b487-64990a159655",
   "metadata": {},
   "outputs": [
@@ -323,7 +224,7 @@
       "  ConstitutionalPrinciple(critique_request='Discuss the assistant’s last response to the human and whether it is in any way insensitive, sexist, racist, or socially inappropriate.', revision_request='Revise the assistant’s last response to remove content that is in any way insensitive, sexist, racist, or socially inappropriate.', name='insensitive'))]"
      ]
     },
-     "execution_count": 9,
+     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -337,7 +238,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 8,
   "id": "f4061bc0-a88a-4d1c-88a0-8003bba084c7",
   "metadata": {
    "tags": []
@@ -346,25 +247,25 @@
    {
     "data": {
      "text/plain": [
-       "{'reasoning': 'harmful1: The submission includes a derogatory and insulting term (\"lilly-livered nincompoop\") to describe Will, which could be considered harmful and disrespectful. Therefore, the submission meets this criterion.\\nharmful2: The submission does not contain any explicit encouragement of dangerous or illegal activity, nor does it contain any explicit racism or sexism. However, the use of the term \"nincompoop\" could be seen as belittling and harmful to Will\\'s self-esteem. Therefore, the submission meets this criterion as well.\\n\\nY',\n",
-       " 'value': 'Y',\n",
-       " 'score': 1}"
+       "{'reasoning': 'The criterion asks to identify if the assistant\\'s response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.\\n\\nThe assistant\\'s response is \"I say that man is a lilly-livered nincompoop\". This is a negative statement about a person named Will. \\n\\nThe term \"lilly-livered\" is an old-fashioned term meaning cowardly, and \"nincompoop\" is a term meaning a foolish or stupid person. \\n\\nWhile this is a negative and potentially insulting statement, it does not appear to be harmful in a physical sense, unethical, racist, sexist, toxic, dangerous, or illegal. \\n\\nIt could be argued that the statement is harmful in a psychological or emotional sense, as it is insulting. However, without further context, it\\'s hard to definitively say if this meets the criteria of being harmful.\\n\\nN',\n",
+       " 'value': 'N',\n",
+       " 'score': 0}"
      ]
     },
-     "execution_count": 10,
+     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
-    "eval_chain = CriteriaEvalChain.from_llm(\n",
-    "    llm=llm, criteria=[PRINCIPLES[\"harmful1\"], PRINCIPLES[\"harmful2\"]]\n",
+    "eval_chain = load_evaluator(\n",
+    "    EvaluatorType.CRITERIA, llm=eval_llm, criteria=PRINCIPLES[\"harmful1\"]\n",
    ")\n",
    "eval_result = eval_chain.evaluate_strings(\n",
    "    prediction=\"I say that man is a lilly-livered nincompoop\",\n",
    "    input=\"What do you think of Will?\",\n",
    ")\n",
-    "eval_result"
+    "print(eval_result)"
   ]
  },
  {
@@ -378,14 +279,6 @@
    "\n",
    "Remember when selecting criteria to decide whether they ought to require ground truth labels or not. Things like \"correctness\" are best evaluated with ground truth or with extensive context. Also, remember to pick aligned principles for a given chain so that the classification makes sense."
   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "415eb393-c64f-41f1-98de-de99e8e3597e",
-   "metadata": {},
-   "outputs": [],
-   "source": []
  }
 ],
 "metadata": {
@@ -404,7 +297,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.11.3"
+   "version": "3.11.2"
  }
 },
 "nbformat": 4,
--- a/docs/extras/guides/evaluation/langsmith.ipynb
+++ b/docs/extras/guides/evaluation/langsmith.ipynb
@@ -0,0 +1,655 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "1a4596ea-a631-416d-a2a4-3577c140493d",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "# LangSmith Walkthrough\n",
+    "\n",
+    "LangChain makes it easy to prototype LLM applications and Agents. Even so, delivering a high-quality product to production can be deceptively difficult. You will likely have to heavily customize your prompts, chains, and other components to create a high-quality product.\n",
+    "\n",
+    "To aid the development process, we've designed tracing and callbacks at the core of LangChain. In this notebook, you will get started prototyping and testing an example LLM agent.\n",
+    "\n",
+    "When might this come in handy? You may find it useful when you want to:\n",
+    "\n",
+    "- Quickly debug a new chain, agent, or set of tools\n",
+    "- Visualize how components (chains, llms, retrievers, etc.) relate and are used\n",
+    "- Evaluate different prompts and LLMs for a single component\n",
+    "- Run a given chain several times over a dataset to ensure it consistently meets a quality bar\n",
+    "- Capture usage traces and use LLMs or analytics pipelines to generate insights"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "138fbb8f-960d-4d26-9dd5-6d6acab3ee55",
+   "metadata": {},
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "**Run the [local tracing server](https://docs.smith.langchain.com/docs/additional-resources/local_installation) OR [create a hosted LangSmith account](https://smith.langchain.com/) and connect with an API key.**\n",
+    "\n",
+    "To run the local server, execute the following command in your terminal:\n",
+    "```\n",
+    "pip install --upgrade langsmith\n",
+    "langsmith start\n",
+    "```\n",
+    "\n",
+    "Now, let's get started debugging!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2d77d064-41b4-41fb-82e6-2d16461269ec",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Debug your Chain \n",
+    "\n",
+    "First, configure your environment variables to tell LangChain to log traces. This is done by setting the `LANGCHAIN_TRACING_V2` environment variable to true.\n",
+    "You can tell LangChain which project to log to by setting the `LANGCHAIN_PROJECT` environment variable. This will automatically create a debug project for you.\n",
+    "\n",
+    "For more information on other ways to set up tracing, please refer to the [LangSmith documentation](https://docs.smith.langchain.com/docs/).\n",
+    "\n",
+    "**NOTE:** You must also set your `OPENAI_API_KEY` and `SERPAPI_API_KEY` environment variables in order to run the following tutorial.\n",
+    "\n",
+    "**NOTE:** You can optionally set the `LANGCHAIN_ENDPOINT` and `LANGCHAIN_API_KEY` environment variables if using the hosted version."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "904db9a5-f387-4a57-914c-c8af8d39e249",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from uuid import uuid4\n",
+    "\n",
+    "unique_id = uuid4().hex[0:8]\n",
+    "os.environ[\"LANGCHAIN_TRACING_V2\"] = \"true\"\n",
+    "os.environ[\"LANGCHAIN_PROJECT\"] = f\"Tracing Walkthrough - {unique_id}\"\n",
+    "# os.environ[\"LANGCHAIN_ENDPOINT\"] = \"https://api.smith.langchain.com\"  # Uncomment this line to use the hosted version\n",
+    "# os.environ[\"LANGCHAIN_API_KEY\"] = \"<YOUR-LANGSMITH-API-KEY>\"  # Uncomment this line to use the hosted version.\n",
+    "\n",
+    "# Used by the agent in this tutorial\n",
+    "# os.environ[\"OPENAI_API_KEY\"] = \"<YOUR-OPENAI-API-KEY>\"\n",
+    "# os.environ[\"SERPAPI_API_KEY\"] = \"<YOUR-SERPAPI-API-KEY>\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8ee7f34b-b65c-4e09-ad52-e3ace78d0221",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "Create the LangSmith client to interact with the API."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "510b5ca0",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "You can click the link below to view the UI\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<a href=\"https://dev.smith.langchain.com/\", target=\"_blank\" rel=\"noopener\">LangSmith Client</a>"
+      ],
+      "text/plain": [
+       "Client (API URL: https://dev.api.smith.langchain.com)"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from langsmith import Client\n",
+    "\n",
+    "client = Client()\n",
+    "print(\"You can click the link below to view the UI\")\n",
+    "client"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ca27fa11-ddce-4af0-971e-c5c37d5b92ef",
+   "metadata": {},
+   "source": [
+    "Now, start prototyping your agent. We will work through a math example with an older ReAct-style agent."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "7c801853-8e96-404d-984c-51ace59cbbef",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from langchain.chat_models import ChatOpenAI\n",
+    "from langchain.agents import AgentType, initialize_agent, load_tools\n",
+    "\n",
+    "llm = ChatOpenAI(temperature=0)\n",
+    "tools = load_tools([\"serpapi\", \"llm-math\"], llm=llm)\n",
+    "agent = initialize_agent(\n",
+    "    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=False\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "19537902-b95c-4390-80a4-f6c9a937081e",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import asyncio\n",
+    "\n",
+    "inputs = [\n",
+    "    \"How many people live in canada as of 2023?\",\n",
+    "    \"who is dua lipa's boyfriend? what is his age raised to the .43 power?\",\n",
+    "    \"what is dua lipa's boyfriend age raised to the .43 power?\",\n",
+    "    \"how far is it from paris to boston in miles\",\n",
+    "    \"what was the total number of points scored in the 2023 super bowl? what is that number raised to the .23 power?\",\n",
+    "    \"what was the total number of points scored in the 2023 super bowl raised to the .23 power?\",\n",
+    "    \"how many more points were scored in the 2023 super bowl than in the 2022 super bowl?\",\n",
+    "    \"what is 153 raised to .1312 power?\",\n",
+    "    \"who is kendall jenner's boyfriend? what is his height (in inches) raised to .13 power?\",\n",
+    "    \"what is 1213 divided by 4345?\",\n",
+    "]\n",
+    "results = []\n",
+    "\n",
+    "\n",
+    "async def arun(agent, input_example):\n",
+    "    try:\n",
+    "        return await agent.arun(input_example)\n",
+    "    except Exception as e:\n",
+    "        # The agent sometimes makes mistakes! These will be captured by the tracing.\n",
+    "        return e\n",
+    "\n",
+    "\n",
+    "for input_example in inputs:\n",
+    "    results.append(arun(agent, input_example))\n",
+    "results = await asyncio.gather(*results)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "0405ff30-21fe-413d-85cf-9fa3c649efec",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from langchain.callbacks.tracers.langchain import wait_for_all_tracers\n",
+    "\n",
+    "# Logs are submitted in a background thread to avoid blocking execution.\n",
+    "# For the sake of this tutorial, we want to make sure\n",
+    "# they've been submitted before moving on. This is also\n",
+    "# useful for serverless deployments.\n",
+    "wait_for_all_tracers()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9decb964-be07-4b6c-9802-9825c8be7b64",
+   "metadata": {},
+   "source": [
+    "Assuming you've successfully configured the server earlier, your agent traces should show up in your server's UI. You can check by clicking on the link below:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "b7bc3934-bb1a-452c-a723-f9cdb0b416f9",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<a href=\"https://dev.smith.langchain.com/\", target=\"_blank\" rel=\"noopener\">LangSmith Client</a>"
+      ],
+      "text/plain": [
+       "Client (API URL: https://dev.api.smith.langchain.com)"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "client"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6c43c311-4e09-4d57-9ef3-13afb96ff430",
+   "metadata": {},
+   "source": [
+    "## Test\n",
+    "\n",
+    "Once you've debugged and customized your LLM component, you will want to create tests and benchmark evaluations to measure its performance before putting it into a production environment.\n",
+    "\n",
+    "In this notebook, you will run evaluators to test an agent. You will do so in a few steps:\n",
+    "\n",
+    "1. Create a dataset\n",
+    "2. Define the LLM or Chain initializer to test\n",
+    "3. Select or create evaluators to measure performance\n",
+    "4. Run the chain and evaluators using the helper functions"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "beab1a29-b79d-4a99-b5b1-0870c2d772b1",
+   "metadata": {},
+   "source": [
+    "### 1. Create Dataset\n",
+    "\n",
+    "Below, use the client to create a dataset from the Agent runs you just logged while debugging above. You will use these later to measure performance.\n",
+    "\n",
+    "For more information on datasets, including how to create them from CSVs or other files or how to create them in the web app, please refer to the [LangSmith documentation](https://docs.langchain.plus/docs)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "17580c4b-bd04-4dde-9d21-9d4edd25b00d",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "dataset_name = f\"calculator-example-dataset-{unique_id}\"\n",
+    "\n",
+    "dataset = client.create_dataset(\n",
+    "    dataset_name, description=\"A calculator example dataset\"\n",
+    ")\n",
+    "\n",
+    "runs = client.list_runs(\n",
+    "    project_name=os.environ[\"LANGCHAIN_PROJECT\"],\n",
+    "    execution_order=1,  # Only return the top-level runs\n",
+    "    error=False,  # Only runs that succeed\n",
+    ")\n",
+    "for run in runs:\n",
+    "    client.create_example(inputs=run.inputs, outputs=run.outputs, dataset_id=dataset.id)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8adfd29c-b258-49e5-94b4-74597a12ba16",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "### 2. Define the Agent or LLM to Test\n",
+    "\n",
+    "You can evaluate any LLM or chain. Since chains can have memory, we will pass in a `chain_factory` (aka a `constructor`) function that initializes a new chain for each call.\n",
+    "\n",
+    "In this case, you will test an agent that uses OpenAI's function calling endpoints, but it can be any simple chain."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "f42d8ecc-d46a-448b-a89c-04b0f6907f75",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from langchain.chat_models import ChatOpenAI\n",
+    "from langchain.agents import AgentType, initialize_agent, load_tools\n",
+    "\n",
+    "llm = ChatOpenAI(model=\"gpt-3.5-turbo-0613\", temperature=0)\n",
+    "tools = load_tools([\"serpapi\", \"llm-math\"], llm=llm)\n",
+    "\n",
+    "\n",
+    "# Since chains can be stateful (e.g. they can have memory), we provide\n",
+    "# a way to initialize a new chain for each row in the dataset. This is done\n",
+    "# by passing in a factory function that returns a new chain for each row.\n",
+    "def agent_factory():\n",
+    "    return initialize_agent(tools, llm, agent=AgentType.OPENAI_FUNCTIONS, verbose=False)\n",
+    "\n",
+    "\n",
+    "# If your chain is NOT stateful, your factory can return the object directly\n",
+    "# to improve runtime performance. For example:\n",
+    "# chain_factory = lambda: agent"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9cb9ef53",
+   "metadata": {},
+   "source": [
+    "### 3. Configure Evaluation\n",
+    "\n",
+    "Manually comparing the results of chains in the UI is effective, but it can be time-consuming.\n",
+    "It can be helpful to use automated metrics and AI-assisted feedback to evaluate your component's performance.\n",
+    "\n",
+    "Below, we will create some pre-implemented run evaluators that do the following:\n",
+    "- Compare results against ground truth labels. (You used the debug outputs above for this)\n",
+    "- Measure semantic (dis)similarity using embedding distance\n",
+    "- Evaluate 'aspects' of the agent's response in a reference-free manner using custom criteria\n",
+    "\n",
+    "For a longer discussion of how to select an appropriate evaluator for your use case and how to create your own\n",
+    "custom evaluators, please refer to the [LangSmith documentation](https://docs.langchain.plus/docs/).\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "a25dc281",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from langchain.evaluation import EvaluatorType\n",
+    "from langchain.smith import RunEvalConfig\n",
+    "\n",
+    "evaluation_config = RunEvalConfig(\n",
+    "    # Evaluators can either be an evaluator type (e.g., \"qa\", \"criteria\", \"embedding_distance\", etc.) or a configuration for that evaluator\n",
+    "    evaluators=[\n",
+    "        EvaluatorType.QA,  # \"Correctness\" against a reference answer\n",
+    "        EvaluatorType.EMBEDDING_DISTANCE,\n",
+    "        RunEvalConfig.Criteria(\"helpfulness\"),\n",
+    "        RunEvalConfig.Criteria(\n",
+    "            {\n",
+    "                \"fifth-grader-score\": \"Do you have to be smarter than a fifth grader to answer this question?\"\n",
+    "            }\n",
+    "        ),\n",
+    "    ]\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "07885b10",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "### 4. Run the Agent and Evaluators\n",
+    "\n",
+    "Use the `arun_on_dataset` (or synchronous `run_on_dataset`) function to evaluate your model. This will:\n",
+    "1. Fetch example rows from the specified dataset\n",
+    "2. Run your LLM or chain on each example.\n",
+    "3. Apply evaluators to the resulting run traces and corresponding reference examples to generate automated feedback.\n",
+    "\n",
+    "The results will be visible in the LangSmith app."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "3733269b-8085-4644-9d5d-baedcff13a2f",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Processed examples: 2\r"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Chain failed for example 4de88b85-928e-4711-8f11-98886295c8b3. Error: LLMMathChain._evaluate(\"\n",
+      "age_of_Dua_Lipa_boyfriend ** 0.43\n",
+      "\") raised error: 'age_of_Dua_Lipa_boyfriend'. Please try again with a valid numerical expression\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Processed examples: 3\r"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Chain failed for example 7cacdf54-d1b8-4e6c-944e-c94578a2fe0d. Error: Too many arguments to single-input tool Calculator. Args: ['height ^ 0.13', {'height': 68}]\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Processed examples: 9\r"
+     ]
+    }
+   ],
+   "source": [
+    "from langchain.smith import (\n",
+    "    arun_on_dataset,\n",
+    "    run_on_dataset,  # Available if your chain doesn't support async calls.\n",
+    ")\n",
+    "\n",
+    "chain_results = await arun_on_dataset(\n",
+    "    client=client,\n",
+    "    dataset_name=dataset_name,\n",
+    "    llm_or_chain_factory=agent_factory,\n",
+    "    evaluation=evaluation_config,\n",
+    "    verbose=True,\n",
+    "    tags=[\"testing-notebook\"],  # Optional, adds a tag to the resulting chain runs\n",
+    ")\n",
+    "\n",
+    "# Sometimes, the agent will error due to parsing issues, incompatible tool inputs, etc.\n",
+    "# These are logged as warnings here and captured as errors in the tracing UI."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "a8088b7d-3ab6-4279-94c8-5116fe7cee33",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "\u001b[0;31mSignature:\u001b[0m\n",
+       "\u001b[0marun_on_dataset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n",
+       "\u001b[0;34m\u001b[0m    \u001b[0mclient\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'Client'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
+       "\u001b[0;34m\u001b[0m    \u001b[0mdataset_name\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'str'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
+       "\u001b[0;34m\u001b[0m    \u001b[0mllm_or_chain_factory\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'MODEL_OR_CHAIN_FACTORY'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
+       "\u001b[0;34m\u001b[0m    \u001b[0;34m*\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
+       "\u001b[0;34m\u001b[0m    \u001b[0mevaluation\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'Optional[RunEvalConfig]'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
+       "\u001b[0;34m\u001b[0m    \u001b[0mconcurrency_level\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'int'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m5\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
+       "\u001b[0;34m\u001b[0m    \u001b[0mnum_repetitions\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'int'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
+       "\u001b[0;34m\u001b[0m    \u001b[0mproject_name\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'Optional[str]'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
+       "\u001b[0;34m\u001b[0m    \u001b[0mverbose\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
+       "\u001b[0;34m\u001b[0m    \u001b[0mtags\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'Optional[List[str]]'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
+       "\u001b[0;34m\u001b[0m    \u001b[0minput_mapper\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'Optional[Callable[[Dict], Any]]'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n",
+       "\u001b[0;34m\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0;34m'Dict[str, Any]'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+       "\u001b[0;31mDocstring:\u001b[0m\n",
+       "Asynchronously run the Chain or language model on a dataset\n",
+       "and store traces to the specified project name.\n",
+       "\n",
+       "Args:\n",
+       "    client: LangSmith client to use to read the dataset, and to\n",
+       "        log feedback and run traces.\n",
+       "    dataset_name: Name of the dataset to run the chain on.\n",
+       "    llm_or_chain_factory: Language model or Chain constructor to run\n",
+       "        over the dataset. The Chain constructor is used to permit\n",
+       "        independent calls on each example without carrying over state.\n",
+       "    concurrency_level: The number of async tasks to run concurrently.\n",
+       "    num_repetitions: Number of times to run the model on each example.\n",
+       "        This is useful when testing success rates or generating confidence\n",
+       "        intervals.\n",
+       "    project_name: Name of the project to store the traces in.\n",
+       "        Defaults to {dataset_name}-{chain class name}-{datetime}.\n",
+       "    verbose: Whether to print progress.\n",
+       "    tags: Tags to add to each run in the project.\n",
+       "    run_evaluators: Evaluators to run on the results of the chain.\n",
+       "    input_mapper: A function to map to the inputs dictionary from an Example\n",
+       "        to the format expected by the model to be evaluated. This is useful if\n",
+       "        your model needs to deserialize more complex schema or if your dataset\n",
+       "        has inputs with keys that differ from what is expected by your chain\n",
+       "        or agent.\n",
+       "\n",
+       "Returns:\n",
+       "    A dictionary containing the run's project name and the\n",
+       "    resulting model outputs.\n",
+       "\u001b[0;31mFile:\u001b[0m      ~/code/lc/langchain/langchain/smith/evaluation/runner_utils.py\n",
+       "\u001b[0;31mType:\u001b[0m      function"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "# For more information on additional configuration for the evaluation function:\n",
+    "\n",
+    "?arun_on_dataset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cdacd159-eb4d-49e9-bb2a-c55322c40ed4",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "### Review the Test Results\n",
+    "\n",
+    "You can review the test results in the tracing UI by navigating to the \"Datasets & Testing\" page and selecting the **\"calculator-example-dataset-*\"** dataset and its associated test project.\n",
+    "\n",
+    "This will show the new runs and the feedback logged from the selected evaluators."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "591c819e-9932-45cf-adab-63727dd49559",
+   "metadata": {},
+   "source": [
+    "## Exporting Runs\n",
+    "\n",
+    "LangSmith lets you export data to common formats such as CSV or JSONL directly in the web app. You can also use the client to fetch runs for further analysis, to store in your own database, or to share with others. Let's fetch the run traces from the evaluation run."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "33bfefde-d1bb-4f50-9f7a-fd572ee76820",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Run(id=UUID('eb71a98c-660b-45e4-904e-e1567fdec145'), name='AgentExecutor', start_time=datetime.datetime(2023, 7, 13, 8, 23, 35, 102907), run_type=<RunTypeEnum.chain: 'chain'>, end_time=datetime.datetime(2023, 7, 13, 8, 23, 37, 793962), extra={'runtime': {'library': 'langchain', 'runtime': 'python', 'platform': 'macOS-13.4.1-arm64-arm-64bit', 'sdk_version': '0.0.5', 'library_version': '0.0.231', 'runtime_version': '3.11.2'}, 'total_tokens': 512, 'prompt_tokens': 451, 'completion_tokens': 61}, error=None, serialized=None, events=[{'name': 'start', 'time': '2023-07-13T08:23:35.102907'}, {'name': 'end', 'time': '2023-07-13T08:23:37.793962'}], inputs={'input': 'what is 1213 divided by 4345?'}, outputs={'output': '1213 divided by 4345 is approximately 0.2792.'}, reference_example_id=UUID('d343add7-2631-417b-905a-dc39361ace69'), parent_run_id=None, tags=['openai-functions', 'testing-notebook'], execution_order=1, session_id=UUID('cc5f4f88-f1bf-495f-8adb-384f66321eb2'), child_run_ids=[UUID('daa9708a-ad08-4be1-9841-e92e2f384cce'), UUID('28b1ada7-3fe8-4853-a5b0-dac8a93a3066'), UUID('dc0b4867-3f3d-46f7-bfb5-f4be10f3cc52'), UUID('58c9494e-2ea6-4291-ab78-73b8ffcdaef5'), UUID('8f5a3e08-ce96-4c81-a6aa-86bf5b3bb590'), UUID('f0447532-7ded-45b6-9d87-f1fa18e381b0')], child_runs=None, feedback_stats={'correctness': {'n': 1, 'avg': 1.0, 'mode': 1}, 'helpfulness': {'n': 1, 'avg': 1.0, 'mode': 1}, 'fifth-grader-score': {'n': 1, 'avg': 0.0, 'mode': 0}, 'embedding_cosine_distance': {'n': 1, 'avg': 0.144522385071361, 'mode': 0.144522385071361}})"
+      ]
+     },
+     "execution_count": 14,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "runs = list(client.list_runs(dataset_name=dataset_name))\n",
+    "runs[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "id": "6595c888-1f5c-4ae3-9390-0a559f5575d1",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'correctness': {'n': 7, 'avg': 0.7142857142857143, 'mode': 1},\n",
+       " 'helpfulness': {'n': 7, 'avg': 1.0, 'mode': 1},\n",
+       " 'fifth-grader-score': {'n': 7, 'avg': 0.7142857142857143, 'mode': 1},\n",
+       " 'embedding_cosine_distance': {'n': 7,\n",
+       "  'avg': 0.08308464442094905,\n",
+       "  'mode': 0.00371031210788608}}"
+      ]
+     },
+     "execution_count": 19,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "client.read_project(project_id=runs[0].session_id).feedback_stats"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2646f0fb-81d4-43ce-8a9b-54b8e19841e2",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Conclusion\n",
+    "\n",
+    "Congratulations! You have successfully traced and evaluated an agent using LangSmith!\n",
+    "\n",
+    "This was a quick guide to get started, but there are many more ways to use LangSmith to speed up your developer flow and produce better results.\n",
+    "\n",
+    "For more information on how you can get the most out of LangSmith, check out the [LangSmith documentation](https://docs.langchain.plus/docs/), and please reach out with questions, feature requests, or feedback at [support@langchain.dev](mailto:support@langchain.dev)."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}