mirror of https://github.com/hwchase17/langchain.git
synced 2026-01-21 21:56:38 +00:00

Compare commits: langchain...agent_eval (8 commits)

- b9669444fc
- 86985766e7
- 16ff72296f
- 58c4007654
- e4de859b71
- 3da6787a47
- eb1cd3d66c
- 22a0d50bfb
@@ -675,7 +675,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.9"
+   "version": "3.9.1"
   }
  },
  "nbformat": 4,
@@ -1,9 +1,80 @@
Evaluation
==========

Generative models are notoriously hard to evaluate with traditional metrics. One new way of evaluating them is to use language models themselves to do the evaluation, and LangChain provides some prompts/chains to assist with this.
This section of documentation covers how we approach and think about evaluation in LangChain:
both the evaluation of LangChain's internal chains/agents, and how we would recommend that people building on top of LangChain approach evaluation.

The examples here all highlight how to use language models to assist in evaluating themselves.

The Problem
-----------

It can be really hard to evaluate LangChain chains and agents.
There are two main reasons for this:

**# 1: Lack of data**

You generally don't have a ton of data to evaluate your chains/agents over before starting a project.
This is usually because Large Language Models (the core of most chains/agents) are terrific few-shot and zero-shot learners,
meaning you are almost always able to get started on a particular task (text-to-SQL, question answering, etc.) without
a large dataset of examples.
This is in stark contrast to traditional machine learning, where you had to first collect a bunch of datapoints
before even getting started using a model.

**# 2: Lack of metrics**

Most chains/agents perform tasks for which there are no good metrics for evaluating performance.
For example, one of the most common use cases is generating text of some form.
Evaluating generated text is much more complicated than evaluating a classification or numeric prediction.

The Solution
------------

LangChain attempts to tackle both of these issues.
What we have so far are initial passes at solutions - we do not think we have a perfect solution.
So we very much welcome feedback, contributions, integrations, and thoughts on this.

Here is what we have for each problem so far:

**# 1: Lack of data**

We have started `LangChainDatasets <https://huggingface.co/LangChainDatasets>`_, a Community space on Hugging Face.
We intend this to be a collection of open source datasets for evaluating common chains and agents.
We have contributed five datasets of our own to start, but we very much intend this to be a community effort.
In order to contribute a dataset, you simply need to join the community and then you will be able to upload datasets.

**# 2: Lack of metrics**

We have two solutions to the lack of metrics.

The first solution is to use no metrics, and rather just rely on looking at results by eye to get a sense for how the chain/agent is performing.
To assist in this, we have developed (and will continue to develop) `tracing <../tracing.md>`_, a UI-based visualizer of your chain and agent runs.

The second solution we recommend is to use language models themselves to evaluate outputs.
For this we have a few different chains and prompts aimed at tackling this issue.

The Examples
------------

We have created a bunch of examples combining the above two solutions to show how we internally evaluate chains and agents as we develop them.
In addition to the examples we've curated, we also highly welcome contributions here.
To facilitate that, we've included a `template notebook <./evaluation/benchmarking_template.html>`_ for community members to use to build their own examples.

The existing examples we have are:

`Question Answering (State of Union) <./evaluation/qa_benchmarking_sota.html>`_: A notebook showing evaluation of a question-answering task over a State-of-the-Union address.

`Question Answering (Paul Graham Essay) <./evaluation/qa_benchmarking_pg.html>`_: A notebook showing evaluation of a question-answering task over a Paul Graham essay.

`SQL Question Answering (Chinook) <./evaluation/sql_qa_benchmarking_chinook.html>`_: A notebook showing evaluation of a question-answering task over a SQL database (the Chinook database).

`Agent Vectorstore <./evaluation/vectordb_agent_qa_benchmarking.html>`_: A notebook showing evaluation of an agent doing question answering while routing between two different vector databases.

`Agent Search + Calculator <./evaluation/agent_benchmarking.html>`_: A notebook showing evaluation of an agent doing question answering using a search engine and a calculator as tools.

Other Examples
--------------

In addition, we also have some more generic resources for evaluation.

`Question Answering <./evaluation/question_answering.html>`_: An overview of LLMs aimed at evaluating question answering systems in general.
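To make the "use language models to evaluate outputs" idea concrete, here is a minimal sketch using LangChain's `QAEvalChain`, one of the chains referenced above (the example data, keys, and grading output here are illustrative assumptions, not taken from the docs):

```python
# Sketch: grade a predicted answer against a reference answer with an LLM.
from langchain.llms import OpenAI
from langchain.evaluation.qa import QAEvalChain

examples = [{"query": "How many employees are there?", "answer": "8"}]  # reference data
predictions = [{"result": "There are 8 employees."}]                    # model output

eval_chain = QAEvalChain.from_llm(OpenAI(temperature=0))
graded = eval_chain.evaluate(
    examples, predictions, question_key="query", prediction_key="result"
)
print(graded[0])  # e.g. {"text": " CORRECT"}
```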
docs/use_cases/evaluation/qa_agent_evaluation.ipynb (new file, 493 lines)
@@ -0,0 +1,493 @@

```python
# Cell 1: widen the notebook display
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
```

```python
# Cell 2
%load_ext autoreload
%autoreload 2
```

Evaluating a (task-oriented) agent

- for now, the task is information retrieval/QA
  - because these are the cases Harrison started with
- desiderata + challenges:
  - evaluate the goal state
  - evaluate intermediate states
  - challenges:
    - non-determinism: different trajectories may be acceptable
    - coupled/cascading errors
    - we can rely on another LM for evaluation, but this is error-prone too

# Evaluate each of these with DaVinci (quick few-shot, chain-of-thought binary classifiers)
|
||||
```python
# Cell 3: few-shot binary classifiers for grading agent answers and action plans
from distutils.util import strtobool
from typing import List

from langchain.llms import OpenAI
from langchain.schema import AgentAction


davinci_003 = OpenAI(model_name="text-davinci-003", temperature=0.0)


def extract_binary_classification(completion: str) -> bool:
    """Try to extract a binary classification from a text completion."""
    boolean_as_str = completion.strip().split('.')[0]
    boolean = False
    try:
        boolean = bool(strtobool(boolean_as_str))
    except ValueError as e:
        print(e)
    return boolean


def evaluate_candidate_action_plan(question: str, desired_action_plan: List[AgentAction], candidate_action_plan: List[AgentAction], model: 'LLM ABC', verbose: bool = False) -> bool:
    """Use a few-shot classifier to verify whether 2 action plans are "roughly" equivalent for a given question.

    This approach is itself highly error prone!
    """
    prompt_prefix = """Decide whether the Candidate action plan would give the same outcome as the Desired action plan in answering a given Question. Actions correspond to calling on a tool like a search engine, data store, calculator, etc.

Examples:

Question: How far is the Earth from the Moon?
Desired action plan: Search(distance between Earth and Moon)
Candidate action plan: Calculator(distance from Earth to Moon)
Satisfactory? No.
Explanation: The Candidate plan uses a Calculator instead of a Search engine.

Question: What is the number of kids our current president has to the power of two?
Desired action plan: Search(how many kids the president has?), Calculator(4^2)
Candidate action plan: Search(who is current president?), Search(how many kids Joe Biden has?), Calculator(4*4)
Satisfactory? Yes
Explanation: The Candidate plan reaches the same result as the Desired plan with one step broken down into two. 

Question: how long does it take to drive from Boston to New York?
Desired action plan: Search(distance from Boston to New York), Search(speed limit Boston to NewYork), Calculator(190.04/40)
Candidate action plan: Search(driving time from Boston to New York)
Satisfactory? Yes.
Explanation: The Candidate plan uses a tool to answer the question directly, rather than breaking it down like the Desired plan.
    """

    def serialize_action_plan(action_plan):
        return ', '.join([
            f"{action.tool}({action.tool_input})"
            for action in action_plan
        ])

    desired_action_plan_str = serialize_action_plan(desired_action_plan)
    candidate_action_plan_str = serialize_action_plan(candidate_action_plan)

    prompt = prompt_prefix + f"""
Question: {question}
Desired action plan: {desired_action_plan_str}
Candidate action plan: {candidate_action_plan_str}
Satisfactory?"""

    completion = model(prompt)
    if verbose:
        print("Prompt:\n", prompt)
        print("Completion:\n", completion)

    return extract_binary_classification(completion)


def evaluate_candidate_answer(question: str, answer: str, candidate_answer: str, model: 'LLM ABC', verbose: bool = False) -> bool:
    """Use a few-shot classifier to verify whether 2 answers are "roughly" equivalent for a given question.

    This approach is itself highly error prone!
    """
    prompt_prefix = """Decide whether a Candidate answer gives the same information as a Desired answer for a given Question.

Examples:

Question: What is the distance from Earth to the Moon?
Desired answer: 238,900 mi
Candidate answer: The distance is about 250k miles
Satisfactory? Yes. 
Explanation: The Candidate answer roughly gives the same information as the Desired answer.

Question: How many kids does Joe Biden have?
Desired answer: 4
Candidate answer: 42
Satisfactory? No.
Explanation: The candidate answer 42 is not the same as the Desired answer.
"""

    prompt = prompt_prefix + f"""
Question: {question}
Desired answer: {answer}
Candidate answer: {candidate_answer}
Satisfactory?"""

    completion = model(prompt)
    if verbose:
        print("Prompt:\n", prompt)
        print("Completion:\n", completion)

    return extract_binary_classification(completion)
```
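One portability caveat: `distutils.util.strtobool`, used above, was removed along with `distutils` in Python 3.12. On newer interpreters, a small hand-rolled stand-in (a sketch, not part of the original notebook) keeps the cell runnable:

```python
# Hypothetical drop-in replacement for distutils.util.strtobool (Python >= 3.12).
def strtobool(value: str) -> int:
    """Return 1 for truthy strings, 0 for falsy ones; raise ValueError otherwise."""
    v = value.strip().lower()
    if v in ("y", "yes", "t", "true", "on", "1"):
        return 1
    if v in ("n", "no", "f", "false", "off", "0"):
        return 0
    raise ValueError(f"invalid truth value {value!r}")
```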
|
||||
# A couple test cases for ReAct-agent QA

```python
# Cell 4: build a zero-shot ReAct agent with Search + Calculator tools,
# plus a small hand-written test suite of question/answer/plan triples.
from langchain import OpenAI
from langchain.agents import initialize_agent, load_tools

tools = load_tools(['serpapi', 'llm-math'], llm=OpenAI(temperature=0))
agent = initialize_agent(tools, OpenAI(temperature=0), agent="zero-shot-react-description", verbose=True, return_intermediate_steps=True)

test_cases = [
    {
        "question": "How many people live in canada as of 2023?",
        "answer": "approximately 38,625,801",
        "steps": [
            {"tool": "Search", "tool_input": "Population of Canada 2023"}
        ]
    },
    {
        "question": "who is dua lipa's boyfriend? what is his age raised to the .43 power?",
        "answer": "her boyfriend is Romain Gravas. his age raised to the .43 power is approximately 4.9373857399466665",
        "steps": [
            {"tool": "Search", "tool_input": "Dua Lipa's boyfriend"},
            {"tool": "Search", "tool_input": "Romain Gravas age"},
            {"tool": "Calculator", "tool_input": "41^.43"}
        ]
    },
    {
        "question": "what is dua lipa's boyfriend age raised to the .43 power?",
        "answer": "her boyfriend is Romain Gravas. his age raised to the .43 power is approximately 4.9373857399466665",
        "steps": [
            {"tool": "Search", "tool_input": "Dua Lipa's boyfriend"},
            {"tool": "Search", "tool_input": "Romain Gravas age"},
            {"tool": "Calculator", "tool_input": "41^.43"}
        ]
    },
    {
        "question": "how far is it from paris to boston in miles",
        "answer": "approximately 3,435 mi",
        "steps": [
            {"tool": "Search", "tool_input": "paris to boston distance"}
        ]
    },
    {
        "question": "what was the total number of points scored in the 2023 super bowl? what is that number raised to the .23 power?",
        "answer": "approximately 2.682651500990882",
        "steps": [
            {"tool": "Search", "tool_input": "2023 super bowl score"},
            {"tool": "Calculator", "tool_input": "73^.23"}
        ]
    },
    {
        "question": "what was the total number of points scored in the 2023 super bowl raised to the .23 power?",
        "answer": "approximately 2.682651500990882",
        "steps": [
            {"tool": "Search", "tool_input": "2023 super bowl score"},
            {"tool": "Calculator", "tool_input": "73^.23"}
        ]
    },
    {
        "question": "how many more points were scored in the 2023 super bowl than in the 2022 super bowl?",
        "answer": "30",
        "steps": [
            {"tool": "Search", "tool_input": "2023 super bowl score"},
            {"tool": "Search", "tool_input": "2022 super bowl score"}
        ]
    },
    {
        "question": "what is 153 raised to .1312 power?",
        "answer": "approximately 1.9347796717823205",
        "steps": [
            {"tool": "Calculator", "tool_input": "153**.1312"}
        ]
    },
    {
        "question": "who is kendall jenner's boyfriend? what is his height (in inches) raised to .13 power?",
        "answer": "approximately 1.7589107138176394",
        "steps": [
            {"tool": "Search", "tool_input": "kendall jenner boyfriend"},
            {"tool": "Search", "tool_input": "devin booker height"},
            {"tool": "Calculator", "tool_input": "77**.13"}
        ]
    },
    {
        "question": "what is 1213 divided by 4345?",
        "answer": "approximately 0.2791714614499425",
        "steps": [
            {"tool": "Calculator", "tool_input": "1213/4345"}
        ]
    },
]
```
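With the pieces above in place, scoring the whole suite is a short loop. A sketch (not a cell in the original notebook) that reuses the agent and the two classifiers defined earlier:

```python
# Sketch: run every test case and grade both the final answer and the action plan.
answer_hits = 0
plan_hits = 0
for case in test_cases:
    out = agent(case["question"])
    desired_plan = [
        AgentAction(tool=s["tool"], tool_input=s["tool_input"], log="")
        for s in case["steps"]
    ]
    candidate_plan = [action for action, _observation in out["intermediate_steps"]]
    answer_hits += evaluate_candidate_answer(
        case["question"], case["answer"], out["output"], davinci_003
    )
    plan_hits += evaluate_candidate_action_plan(
        case["question"], desired_plan, candidate_plan, davinci_003
    )

n = len(test_cases)
print(f"answer accuracy: {answer_hits}/{n}; plan accuracy: {plan_hits}/{n}")
```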
|
||||
```python
# Cell 5: run the agent on one test case
test_case = test_cases[1]

question = test_case['question']
out = agent(question)
```

Output:

```
> Entering new AgentExecutor chain...
 I need to find out who Dua Lipa's boyfriend is and then calculate his age raised to the .43 power
Action: Search
Action Input: "Dua Lipa's boyfriend"
Observation: Dua and Isaac, a model and a chef, dated on and off from 2013 to 2019. The two first split in early 2017, which is when Dua went on to date LANY ...
Thought: I need to find out Isaac's age
Action: Search
Action Input: "Isaac Carew age"
Observation: 36 years
Thought: I need to calculate 36 raised to the .43 power
Action: Calculator
Action Input: 36^.43
Observation: Answer: 4.6688516567750975

Thought: I now know the final answer
Final Answer: Isaac Carew, Dua Lipa's boyfriend, is 36 years old and his age raised to the .43 power is 4.6688516567750975.

> Finished chain.
```
|
||||
```python
# Cell 6: extract the desired vs. candidate plans and answers
desired_action_plan = [
    AgentAction(tool=step['tool'], tool_input=step['tool_input'], log=None)
    for step in test_case['steps']
]
desired_answer = test_case['answer']

candidate_action_plan = [
    action for action, observation in out['intermediate_steps']
]
candidate_answer = out['output']
```
|
||||
```python
# Cell 7: grade the final answer with the LLM classifier
evaluate_candidate_answer(question, desired_answer, candidate_answer, davinci_003, verbose=True)
```

Output:

```
Prompt:
 Decide whether a Candidate answer gives the same information as a Desired answer for a given Question.

Examples:

Question: What is the distance from Earth to the Moon?
Desired answer: 238,900 mi
Candidate answer: The distance is about 250k miles
Satisfactory? Yes. 
Explanation: The Candidate answer roughly gives the same information as the Desired answer.

Question: How many kids does Joe Biden have?
Desired answer: 4
Candidate answer: 42
Satisfactory? No.
Explanation: The candidate answer 42 is not the same as the Desired answer.

Question: who is dua lipa's boyfriend? what is his age raised to the .43 power?
Desired answer: her boyfriend is Romain Gravas. his age raised to the .43 power is approximately 4.9373857399466665
Candidate answer: Isaac Carew, Dua Lipa's boyfriend, is 36 years old and his age raised to the .43 power is 4.6688516567750975.
Satisfactory?
Completion:
 Yes.
Explanation: The Candidate answer roughly gives the same information as the Desired answer.
```

Result: `True`

**Not quite! From the CoT explanation, the model appears to attend to the 2nd question only.**
|
||||
```python
# Cell 8: grade the action plan with the LLM classifier
evaluate_candidate_action_plan(question, desired_action_plan, candidate_action_plan, davinci_003, verbose=True)
```

Output:

```
Prompt:
 Decide whether the Candidate action plan would give the same outcome as the Desired action plan in answering a given Question. Actions correspond to calling on a tool like a search engine, data store, calculator, etc.

Examples:

Question: How far is the Earth from the Moon?
Desired action plan: Search(distance between Earth and Moon)
Candidate action plan: Calculator(distance from Earth to Moon)
Satisfactory? No.
Explanation: The Candidate plan uses a Calculator instead of a Search engine.

Question: What is the number of kids our current president has to the power of two?
Desired action plan: Search(how many kids the president has?), Calculator(4^2)
Candidate action plan: Search(who is current president?), Search(how many kids Joe Biden has?), Calculator(4*4)
Satisfactory? Yes
Explanation: The Candidate plan reaches the same result as the Desired plan with one step broken down into two. 

Question: how long does it take to drive from Boston to New York?
Desired action plan: Search(distance from Boston to New York), Search(speed limit Boston to NewYork), Calculator(190.04/40)
Candidate action plan: Search(driving time from Boston to New York)
Satisfactory? Yes.
Explanation: The Candidate plan uses a tool to answer the question directly, rather than breaking it down like the Desired plan.
 
Question: who is dua lipa's boyfriend? what is his age raised to the .43 power?
Desired action plan: Search(Dua Lipa's boyfriend), Search(Romain Gravas age), Calculator(41^.43)
Candidate action plan: Search(Dua Lipa's boyfriend), Search(Isaac Carew age), Calculator(36^.43)
Satisfactory?
Completion:
 No.
Explanation: The Candidate plan uses a different boyfriend than the Desired plan, so the result would be different.
```

Result: `False`

**Evaluates as intended initially, though we'd likely want to break this out further (i.e. a trajectory should not really resolve to T/F).**
|
||||
@@ -191,7 +191,6 @@
     ]
    },
    {
-    "attachments": {},
     "cell_type": "markdown",
     "id": "782ae8c8",
     "metadata": {},
@@ -316,7 +315,7 @@
   ],
   "metadata": {
    "kernelspec": {
-    "display_name": ".venv",
+    "display_name": "Python 3 (ipykernel)",
     "language": "python",
     "name": "python3"
    },
@@ -330,7 +329,7 @@
     "name": "python",
     "nbconvert_exporter": "python",
     "pygments_lexer": "ipython3",
-    "version": "3.9.7 (default, Sep 16 2021, 08:50:36) \n[Clang 10.0.0 ]"
+    "version": "3.9.1"
    },
    "vscode": {
     "interpreter": {
docs/use_cases/evaluation/sql.ipynb (new file, 205 lines)
@@ -0,0 +1,205 @@

```python
# Cell 1
from langchain import OpenAI, SQLDatabase, SQLDatabaseChain
```

```python
# Cell 2: connect to the Chinook sample database
db = SQLDatabase.from_uri("sqlite:///../../../notebooks/Chinook.db")
llm = OpenAI(temperature=0)
```

```python
# Cell 3
db_chain = SQLDatabaseChain(llm=llm, database=db, verbose=True)
```

```python
# Cell 4: hand-written question/answer pairs over Chinook
# (note: "What is the most common media type?" appears twice, with two
# different answers)
questions = [
    {
        "question": "How many employees are there?",
        "answer": "8"
    },
    {
        "question": "What are some example tracks by composer Johann Sebastian Bach?",
        "answer": "'Concerto for 2 Violins in D Minor, BWV 1043: I. Vivace', 'Aria Mit 30 Veränderungen, BWV 988 'Goldberg Variations': Aria', and 'Suite for Solo Cello No. 1 in G Major, BWV 1007: I. Prélude'"
    },
    {
        "question": "What are some example tracks by Bach?",
        "answer": "'Concerto for 2 Violins in D Minor, BWV 1043: I. Vivace', 'Aria Mit 30 Veränderungen, BWV 988 'Goldberg Variations': Aria', and 'Suite for Solo Cello No. 1 in G Major, BWV 1007: I. Prélude'"
    },
    {
        "question": "How many employees are also customers?",
        "answer": "None"
    },
    {
        "question": "Where is Mark Telus from?",
        "answer": "Edmonton, Canada"
    },
    {
        "question": "What is the most common genre of songs?",
        "answer": "Rock"
    },
    {
        "question": "What is the most common media type?",
        "answer": "MPEG audio file"
    },
    {
        "question": "What is the most common media type?",
        "answer": "Purchased AAC audio file"
    },
    {
        "question": "How many more Protected AAC audio files are there than Protected MPEG-4 video file?",
        "answer": "23"
    },
    {
        "question": "How many albums are there",
        "answer": "347"
    }
]
```

```python
# Cell 5
len(questions)
```

Output: `10`

```python
# Cell 6: sanity-check the media-type counts directly
db.run("""SELECT
    MediaTypeID,
    COUNT(*) AS `num`
FROM
    Track
GROUP BY
    MediaTypeID""")
```

Output: `'[(1, 3034), (2, 237), (3, 214), (4, 7), (5, 11)]'`

```python
# Cell 7
db.get_table_info()
```

Output: `''`

```python
# Cell 8: sanity-check the album count directly
db.run("select count(*) from album")
```

Output: `'[(347,)]'`
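The notebook defines the dataset and spot-checks the database, but it never runs the chain over the questions. A sketch of that evaluation loop (not a cell in the original notebook):

```python
# Sketch: run SQLDatabaseChain over each question and compare answers by eye.
predictions = []
for q in questions:
    try:
        predictions.append(db_chain.run(q["question"]))
    except Exception as err:  # malformed generated SQL can raise; keep going
        predictions.append(f"ERROR: {err}")

for q, pred in zip(questions, predictions):
    print(q["question"])
    print("  expected:", q["answer"])
    print("  got:     ", pred)
```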
docs/use_cases/evaluation/vectorstore_routing.ipynb (new file, 377 lines)
@@ -0,0 +1,377 @@

```python
# Cell 1: load the two source documents
from langchain.document_loaders import TextLoader
sota_loader = TextLoader("../../modules/state_of_the_union.txt")
pg_loader = TextLoader("../../../../gpt_index/examples/paul_graham_essay/data/paul_graham_essay.txt")
```

```python
# Cell 2
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import FAISS
```

```python
# Cell 3: index the State of the Union address with FAISS
sota_index = VectorstoreIndexCreator(vectorstore_cls=FAISS).from_loaders([sota_loader])
```

```python
# Cell 4: index the Paul Graham essay with the default vector store (Chroma)
pg_index = VectorstoreIndexCreator(vectorstore_kwargs={"collection_name": "paul-graham"}).from_loaders([pg_loader])
```

Output:

```
Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.
```

```python
# Cell 5
sota_index.query("what did the president about kentaji brown jackson?")
```

Output: `" The President nominated Circuit Court of Appeals Judge Ketanji Brown Jackson to serve on the United States Supreme Court. He said she is one of the nation's top legal minds and will continue Justice Breyer's legacy of excellence."`

```python
# Cell 6: the same question against the wrong index correctly finds nothing
pg_index.query("what did the president about kentaji brown jackson?")
```

Output: `" Kentaji Brown Jackson was not mentioned in the context, so I don't know."`

```python
# Cell 7
from langchain.agents import initialize_agent, Tool
from langchain.tools import BaseTool
from langchain.llms import OpenAI
```

```python
# Cell 8: wrap each index as a tool the agent can route between
tools = [
    Tool(
        name="State of Union QA System",
        func=sota_index.query,
        description="useful for when you need to answer questions about the most recent state of the union address. Input should be a fully formed question."
    ),
    Tool(
        name="Paul Graham QA System",
        func=pg_index.query,
        description="useful for when you need to answer questions about Paul Graham. Input should be a fully formed question."
    ),
]
```

```python
# Cell 9
agent = initialize_agent(tools, OpenAI(temperature=0), agent="zero-shot-react-description", verbose=True)
```

```python
# Cell 10
import json
```

```python
# Cell 11
with open("../../../notebooks/state_of_union_qa.json") as f:
    sota_qa = json.load(f)
```

```python
# Cell 12
with open("../../../notebooks/paul_graham_qa.json") as f:
    pg_qa = json.load(f)
```

```python
# Cell 13: label each datapoint with the tool that should answer it
# (note: each "step" here is split across two dicts, unlike the single
# {"tool": ..., "tool_input": ...} dict used in the agent notebook above)
for d in sota_qa:
    d['steps'] = [{"tool": "State of Union QA System"}, {"tool_input": d["question"]}]
for d in pg_qa:
    d['steps'] = [{"tool": "Paul Graham QA System"}, {"tool_input": d["question"]}]
```

```python
# Cell 14
all_vectorstore_routing = sota_qa + pg_qa
```

```python
# Cell 15: dump the combined dataset
with open("vectorstore_sota_pg.json", "w") as f:
    json.dump(all_vectorstore_routing, f)
```

```python
# Cell 16: the combined dataset, inlined
all_vectorstore_routing = [
    {'question': 'What is the purpose of the NATO Alliance?',
     'answer': 'The purpose of the NATO Alliance is to secure peace and stability in Europe after World War 2.',
     'steps': [{'tool': 'State of Union QA System'},
               {'tool_input': 'What is the purpose of the NATO Alliance?'}]},
    {'question': 'What is the U.S. Department of Justice doing to combat the crimes of Russian oligarchs?',
     'answer': 'The U.S. Department of Justice is assembling a dedicated task force to go after the crimes of Russian oligarchs.',
     'steps': [{'tool': 'State of Union QA System'},
               {'tool_input': 'What is the U.S. Department of Justice doing to combat the crimes of Russian oligarchs?'}]},
    {'question': 'What is the American Rescue Plan and how did it help Americans?',
     'answer': 'The American Rescue Plan is a piece of legislation that provided immediate economic relief for tens of millions of Americans. It helped put food on their table, keep a roof over their heads, and cut the cost of health insurance. It created jobs and left no one behind.',
     'steps': [{'tool': 'State of Union QA System'},
               {'tool_input': 'What is the American Rescue Plan and how did it help Americans?'}]},
    {'question': 'What is the purpose of the Bipartisan Innovation Act mentioned in the text?',
     'answer': 'The Bipartisan Innovation Act will make record investments in emerging technologies and American manufacturing to level the playing field with China and other competitors.',
     'steps': [{'tool': 'State of Union QA System'},
               {'tool_input': 'What is the purpose of the Bipartisan Innovation Act mentioned in the text?'}]},
    {'question': "What is Joe Biden's plan to fight inflation?",
     'answer': "Joe Biden's plan to fight inflation is to lower costs, not wages, by making more goods in America, increasing the productive capacity of the economy, and cutting the cost of prescription drugs, energy, and child care.",
     'steps': [{'tool': 'State of Union QA System'},
               {'tool_input': "What is Joe Biden's plan to fight inflation?"}]},
    {'question': 'What is the proposed minimum tax rate for corporations under the plan?',
     'answer': 'The proposed minimum tax rate for corporations is 15%.',
     'steps': [{'tool': 'State of Union QA System'},
               {'tool_input': 'What is the proposed minimum tax rate for corporations under the plan?'}]},
    {'question': 'What are the four common sense steps that the author suggests to move forward safely?',
     'answer': 'The four common sense steps suggested by the author to move forward safely are: stay protected with vaccines and treatments, prepare for new variants, end the shutdown of schools and businesses, and stay vigilant.',
     'steps': [{'tool': 'State of Union QA System'},
               {'tool_input': 'What are the four common sense steps that the author suggests to move forward safely?'}]},
    {'question': 'What is the purpose of the American Rescue Plan?',
     'answer': 'The purpose of the American Rescue Plan is to provide $350 Billion that cities, states, and counties can use to hire more police and invest in proven strategies like community violence interruption.',
     'steps': [{'tool': 'State of Union QA System'},
               {'tool_input': 'What is the purpose of the American Rescue Plan?'}]},
    {'question': 'What measures does the speaker ask Congress to pass to reduce gun violence?',
     'answer': 'The speaker asks Congress to pass universal background checks, ban assault weapons and high-capacity magazines, and repeal the liability shield that makes gun manufacturers the only industry in America that can’t be sued.',
     'steps': [{'tool': 'State of Union QA System'},
               {'tool_input': 'What measures does the speaker ask Congress to pass to reduce gun violence?'}]},
    {'question': 'What is the Unity Agenda for the Nation that the President is offering?',
     'answer': 'The Unity Agenda for the Nation includes four big things that can be done together: beat the opioid epidemic, take on mental health, support veterans, and strengthen the Violence Against Women Act.',
     'steps': [{'tool': 'State of Union QA System'},
               {'tool_input': 'What is the Unity Agenda for the Nation that the President is offering?'}]},
    {'question': 'What is the purpose of ARPA-H?',
     'answer': 'ARPA-H will have a singular purpose—to drive breakthroughs in cancer, Alzheimer’s, diabetes, and more.',
     'steps': [{'tool': 'State of Union QA System'},
               {'tool_input': 'What is the purpose of ARPA-H?'}]},
    {'question': 'What were the two main things the author worked on before college?',
     'answer': 'The two main things the author worked on before college were writing and programming.',
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': 'What were the two main things the author worked on before college?'}]},
    {'question': 'What made the author want to work on AI?',
     'answer': "The novel 'The Moon is a Harsh Mistress' and a PBS documentary showing Terry Winograd using SHRDLU made the author want to work on AI.",
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': 'What made the author want to work on AI?'}]},
    {'question': 'What did the author realize while looking at a painting at the Carnegie Institute?',
     'answer': 'The author realized that paintings were something that could be made to last and that making them was a way to be independent and make a living.',
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': 'What did the author realize while looking at a painting at the Carnegie Institute?'}]},
    {'question': 'What did the author write their dissertation on?',
     'answer': 'The author wrote their dissertation on applications of continuations.',
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': 'What did the author write their dissertation on?'}]},
    {'question': 'What is the difference between painting still lives and painting people?',
     'answer': "Painting still lives is different from painting people because the subject, as its name suggests, can't move. People can't sit for more than about 15 minutes at a time, and when they do they don't sit very still. So the traditional m.o. for painting people is to know how to paint a generic person, which you then modify to match the specific person you're painting.",
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': 'What is the difference between painting still lives and painting people?'}]},
    {'question': 'What did the author learn while working at Interleaf?',
     'answer': "The author learned that low end software tends to eat high end software, that it's better for technology companies to be run by product people than sales people, that it leads to bugs when code is edited by too many people, that cheap office space is no bargain if it's depressing, that planned meetings are inferior to corridor conversations, that big, bureaucratic customers are a dangerous source of money, and that there's not much overlap between conventional office hours and the optimal time for hacking, or conventional offices and the optimal place for it.",
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': 'What did the author learn while working at Interleaf?'}]},
    {'question': 'What did the author do to survive during the next several years after leaving RISD?',
     'answer': 'The author did freelance work for the group that did projects for customers to survive for the next several years after leaving RISD.',
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': 'What did the author do to survive during the next several years after leaving RISD?'}]},
    {'question': "What was the author's motivation for wanting to become rich?",
     'answer': 'The author wanted to become rich so that he could work on whatever he wanted.',
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': "What was the author's motivation for wanting to become rich?"}]},
    {'question': 'What is Viaweb and how did it get its name?',
     'answer': 'Viaweb is a company that built a web app for creating online stores. It got its name from the fact that the software worked via the web.',
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': 'What is Viaweb and how did it get its name?'}]},
    {'question': 'What was the price charged by Viaweb for a small store and a big one?',
     'answer': '$100 a month for a small store and $300 a month for a big one.',
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': 'What was the price charged by Viaweb for a small store and a big one?'}]},
    {'question': 'Why did the author hire more people for his startup?',
     'answer': "The author hired more people for his startup partly because the investors wanted him to and partly because that's what startups did during the Internet Bubble.",
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': 'Why did the author hire more people for his startup?'}]},
    {'question': "What was the author's idea for a new company?",
     'answer': "The author's idea was to build a web app for making web apps, where people could edit code on their server through the browser and then host the resulting applications for them.",
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': "What was the author's idea for a new company?"}]},
    {'question': "What was the author's turning point in figuring out what to work on?",
     'answer': "The author's turning point in figuring out what to work on was when he started publishing essays online.",
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': "What was the author's turning point in figuring out what to work on?"}]},
    {'question': 'What is the danger for the ambitious according to the text?',
     'answer': 'The desire to impress people is the danger for the ambitious according to the text.',
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': 'What is the danger for the ambitious according to the text?'}]},
    {'question': 'What is the most distinctive thing about Y Combinator?',
     'answer': 'The most distinctive thing about YC is the batch model: to fund a bunch of startups all at once, twice a year, and then to spend three months focusing intensively on trying to help them.',
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': 'What is the most distinctive thing about Y Combinator?'}]},
    {'question': 'What was the Summer Founders Program and how many groups were selected for funding?',
     'answer': 'The Summer Founders Program was a program for undergrads to apply for funding for their startup ideas. 8 groups were selected for funding out of 225 applications.',
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': 'What was the Summer Founders Program and how many groups were selected for funding?'}]},
    {'question': 'What was the biggest source of stress for the author while working at YC?',
     'answer': 'HN (Hacker News)',
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': 'What was the biggest source of stress for the author while working at YC?'}]},
    {'question': 'What did the author decide to do after leaving YC?',
     'answer': 'The author decided to focus on painting.',
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': 'What did the author decide to do after leaving YC?'}]},
    {'question': 'What is the distinctive thing about Lisp?',
     'answer': 'The distinctive thing about Lisp is that its core is a language defined by writing an interpreter in itself.',
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': 'What is the distinctive thing about Lisp?'}]},
    {'question': 'Why did the author move to England?',
     'answer': 'The author moved to England to let their kids experience living in another country and because the author was a British citizen by birth.',
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': 'Why did the author move to England?'}]},
    {'question': 'What was the reason behind the change of name from Cambridge Seed to Y Combinator?',
     'answer': "They didn't want a regional name, in case someone copied them in Silicon Valley, so they renamed themselves after one of the coolest tricks in the lambda calculus, the Y combinator.",
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': 'What was the reason behind the change of name from Cambridge Seed to Y Combinator?'}]},
    {'question': 'What is the purpose of YC?',
     'answer': 'The purpose of YC is to cause startups to be founded that would not otherwise have existed.',
     'steps': [{'tool': 'Paul Graham QA System'},
               {'tool_input': 'What is the purpose of YC?'}]}]
```
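The notebook stops short of actually scoring the routing. A sketch of that last step (the `agent_with_steps` wrapper and the five-case slice are assumptions, not cells from the notebook):

```python
# Sketch: verify the agent routes each question to the expected QA tool.
agent_with_steps = initialize_agent(
    tools,
    OpenAI(temperature=0),
    agent="zero-shot-react-description",
    return_intermediate_steps=True,  # needed to inspect tool choices
)

sample = all_vectorstore_routing[:5]  # a few cases to keep API costs down
correct = 0
for case in sample:
    out = agent_with_steps(case["question"])
    first_action = out["intermediate_steps"][0][0]  # (AgentAction, observation) pairs
    correct += first_action.tool == case["steps"][0]["tool"]
print(f"routed correctly: {correct}/{len(sample)}")
```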
langchain/chains/qa_generation/__init__.py (new file, 0 lines)

langchain/chains/qa_generation/base.py (new file, 53 lines)
@@ -0,0 +1,53 @@

```python
import json
from typing import Any, Dict, List, Optional

from pydantic import Field

from langchain.chains.base import Chain
from langchain.chains.llm import LLMChain
from langchain.chains.qa_generation.prompt import PROMPT_SELECTOR
from langchain.prompts.base import BasePromptTemplate
from langchain.schema import BaseLanguageModel
from langchain.text_splitter import RecursiveCharacterTextSplitter, TextSplitter


class QAGenerationChain(Chain):
    """Chain that generates question/answer pairs from input text."""

    llm_chain: LLMChain
    text_splitter: TextSplitter = Field(
        default=RecursiveCharacterTextSplitter(chunk_overlap=500)
    )
    input_key: str = "text"
    output_key: str = "questions"
    k: Optional[int] = None

    @classmethod
    def from_llm(
        cls,
        llm: BaseLanguageModel,
        prompt: Optional[BasePromptTemplate] = None,
        **kwargs: Any,
    ) -> "QAGenerationChain":
        _prompt = prompt or PROMPT_SELECTOR.get_prompt(llm)
        chain = LLMChain(llm=llm, prompt=_prompt)
        return cls(llm_chain=chain, **kwargs)

    @property
    def _chain_type(self) -> str:
        raise NotImplementedError

    @property
    def input_keys(self) -> List[str]:
        return [self.input_key]

    @property
    def output_keys(self) -> List[str]:
        return [self.output_key]

    def _call(self, inputs: Dict[str, str]) -> Dict[str, str]:
        # Split the input text into chunks and ask the LLM for one QA pair per chunk.
        docs = self.text_splitter.create_documents([inputs[self.input_key]])
        results = self.llm_chain.generate([{"text": d.page_content} for d in docs])
        qa = [json.loads(res[0].text) for res in results.generations]
        return {self.output_key: qa}

    async def _acall(self, inputs: Dict[str, str]) -> Dict[str, str]:
        raise NotImplementedError
```
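The file ships without a usage example; a minimal sketch of driving the chain end to end (the model choice and the input file are assumptions, not from the diff):

```python
# Sketch: generate reading-comprehension QA pairs from a document.
from langchain.chat_models import ChatOpenAI  # any LLM works; a chat model shown here
from langchain.chains.qa_generation.base import QAGenerationChain

chain = QAGenerationChain.from_llm(ChatOpenAI(temperature=0))
with open("state_of_the_union.txt") as f:  # hypothetical input document
    text = f.read()
qa_pairs = chain.run(text)
print(qa_pairs[0])  # e.g. {'question': '...', 'answer': '...'}
```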
langchain/chains/qa_generation/prompt.py (new file, 49 lines)
@@ -0,0 +1,49 @@

````python
from langchain.chains.prompt_selector import ConditionalPromptSelector, is_chat_model
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain.prompts.prompt import PromptTemplate

# Chat-model variant: system message with the instructions, human message with the text.
templ1 = """You are a smart assistant designed to help high school teachers come up with reading comprehension questions.
Given a piece of text, you must come up with a question and answer pair that can be used to test a student's reading comprehension abilities.
When coming up with this question/answer pair, you must respond in the following format:
```
{{
    "question": "$YOUR_QUESTION_HERE",
    "answer": "$THE_ANSWER_HERE"
}}
```

Everything between the ``` must be valid json.
"""
templ2 = """Please come up with a question/answer pair, in the specified JSON format, for the following text:
----------------
{text}"""
CHAT_PROMPT = ChatPromptTemplate.from_messages(
    [
        SystemMessagePromptTemplate.from_template(templ1),
        HumanMessagePromptTemplate.from_template(templ2),
    ]
)

# Completion-model variant: the same instructions and text in a single prompt.
templ = """You are a smart assistant designed to help high school teachers come up with reading comprehension questions.
Given a piece of text, you must come up with a question and answer pair that can be used to test a student's reading comprehension abilities.
When coming up with this question/answer pair, you must respond in the following format:
```
{{
    "question": "$YOUR_QUESTION_HERE",
    "answer": "$THE_ANSWER_HERE"
}}
```

Everything between the ``` must be valid json.

Please come up with a question/answer pair, in the specified JSON format, for the following text:
----------------
{text}"""
PROMPT = PromptTemplate.from_template(templ)

PROMPT_SELECTOR = ConditionalPromptSelector(
    default_prompt=PROMPT, conditionals=[(is_chat_model, CHAT_PROMPT)]
)
````
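`PROMPT_SELECTOR` is what lets `QAGenerationChain.from_llm` accept either kind of model; a quick illustrative check (assuming an OpenAI completion model is configured):

```python
# Sketch: the selector hands back CHAT_PROMPT for chat models, PROMPT otherwise.
from langchain.llms import OpenAI

assert PROMPT_SELECTOR.get_prompt(OpenAI()) is PROMPT
```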
langchain/evaluation/agents/__init__.py (new file, 11 lines)
@@ -0,0 +1,11 @@

```python
from langchain.agents import AgentExecutor


def run_agent(agent: AgentExecutor, data: list) -> list:
    """Run an agent over each datapoint, recording "ERROR" for datapoints that raise."""
    results = []
    for datapoint in data:
        try:
            results.append(agent(datapoint))
        except Exception:
            results.append("ERROR")
    return results
```
langchain/evaluation/loading.py (new file, 8 lines)
@@ -0,0 +1,8 @@

```python
from typing import Dict, List


def load_dataset(uri: str) -> List[Dict]:
    """Load an evaluation dataset from the LangChainDatasets space on Hugging Face."""
    # Imported lazily inside the function so `datasets` stays an optional dependency.
    from datasets import load_dataset

    dataset = load_dataset(f"LangChainDatasets/{uri}")
    return [d for d in dataset["train"]]
```
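Taken together with `run_agent` above, loading and scoring a community dataset looks roughly like the following sketch (the dataset name is a hypothetical example, and `agent` is assumed to be built as in the notebooks earlier in this diff):

```python
# Sketch: fetch a LangChainDatasets benchmark and run an agent over it.
from langchain.evaluation.agents import run_agent
from langchain.evaluation.loading import load_dataset

data = load_dataset("agent-search-calculator")  # hypothetical dataset name
questions = [d["question"] for d in data]
results = run_agent(agent, questions)           # `agent` as built above
print(sum(r == "ERROR" for r in results), "errors out of", len(results))
```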
@@ -50,8 +50,8 @@ class VectorstoreIndexCreator(BaseModel):
     """Logic for creating indexes."""
 
     vectorstore_cls: Type[VectorStore] = Chroma
-    text_splitter: TextSplitter = Field(default_factory=_get_default_text_splitter)
     embedding: Embeddings = Field(default_factory=OpenAIEmbeddings)
+    text_splitter: TextSplitter = Field(default_factory=_get_default_text_splitter)
     vectorstore_kwargs: dict = Field(default_factory=dict)
 
     class Config: