Compare commits

..

33 Commits

Author SHA1 Message Date
Harrison Chase
a010520c21 cr 2023-06-27 15:42:41 -07:00
vowelparrot
d9c38c1dd1 mv again 2023-06-27 00:33:01 -07:00
vowelparrot
1f8b82121d add files 2023-06-27 00:17:20 -07:00
Zander Chase
cc60fed3be Add a Pairwise Comparison Chain (#6703)
Notebook shows preference scoring between two chains and reports the Wilson
score interval + p-value.

I think I'll add the option to insert ground-truth labels, but that doesn't
have to be in this PR.
2023-06-26 20:47:41 -07:00
Hakan Tekgul
2928b080f6 Update arize_callback.py - bug fix (#6784)
- Description: Bug Fix - Added a step variable to keep track of prompts
- Issue: Bug from internal Arize testing - The prompts and responses
that are ingested were not mapped correctly
  - Dependencies: N/A
2023-06-26 16:49:46 -07:00
Zander Chase
c460b04c64 Update String Evaluator (#6615)
- Add protocol for `evaluate_strings` 
- Move the criteria evaluator out so it's not restricted to being
applied on traced runs
2023-06-26 14:16:14 -07:00
AaaCabbage
b3f8324de9 feat: fix the Chinese characters in the solution content will be conv… (#6734)
Fix: Chinese characters in the solution content were being converted to
ASCII encoding, resulting in an abnormally long token count.


Co-authored-by: qixin <qixin@fintec.ai>
2023-06-26 13:14:48 -07:00
Chris Pappalardo
70f7c2bb2e align chroma vectorstore get with chromadb to enable where filtering (#6686)
Allows for `where` filtering on the collection via `get`.

- Description: aligns langchain chroma vectorstore get with underlying
[chromadb collection
get](https://github.com/chroma-core/chroma/blob/main/chromadb/api/models/Collection.py#L103)
allowing for where filtering, etc.
  - Issue: NA
  - Dependencies: none
  - Tag maintainer: @rlancemartin, @eyurtsev
  - Twitter handle: @pappanaka
2023-06-26 10:51:20 -07:00
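A minimal sketch of the change in use (collection name and metadata key are hypothetical), assuming documents were previously added to the store with a `source` metadata field:

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Reconnect to an existing collection (names here are hypothetical)
db = Chroma(collection_name="example", embedding_function=OpenAIEmbeddings())

# `get` now mirrors chromadb's Collection.get, so metadata `where` filters work
subset = db.get(where={"source": "some_other_source"})
```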
Zander Chase
9ca3b4645e Add support for tags in chain group context manager (#6668)
Lets you specify local and inheritable tags in the group manager.

Also, add more verbose docstrings for our reference docs.
2023-06-26 10:37:33 -07:00
Harrison Chase
d1bcc58beb bump version to 216 (#6770) 2023-06-26 09:46:19 -07:00
Zander Chase
6d30acffcb Fix breaking tags (#6765)
Fix the tags change that broke the old way of initializing an agent

Closes #6756
2023-06-26 09:28:11 -07:00
James Croft
ba622764cb Improve performance when retrieving Notion DB pages (#6710) 2023-06-26 05:46:09 -07:00
Richy Wang
ec8247ec59 Fixed bug in AnalyticDB Vector Store caused by upgrade SQLAlchemy version (#6736) 2023-06-26 05:35:25 -07:00
Santiago Delgado
d84a3bcf7a Office365 Tool (#6306)
#### Background
With the development of [structured
tools](https://blog.langchain.dev/structured-tools/), the LangChain team
expanded the platform's functionality to meet the needs of new
applications. The GMail tool, empowered by structured tools, now
supports multiple arguments and powerful search capabilities,
demonstrating LangChain's ability to interact with dynamic data sources
like email servers.

#### Challenge
The current GMail tool only supports GMail, while users often utilize
other email services like Outlook in Office365. Additionally, the
proposed calendar tool in PR
https://github.com/hwchase17/langchain/pull/652 only works with Google
Calendar, not Outlook.

#### Changes
This PR implements an Office365 integration for LangChain, enabling
seamless email and calendar functionality with a single authentication
process.

#### Future Work
With the core Office365 integration complete, future work could include
integrating other Office365 tools such as Tasks and Address Book.

#### Who can review?
@hwchase17 or @vowelparrot can review this PR

#### Appendix
@janscas, I utilized your [O365](https://github.com/O365/python-o365)
library extensively. Given the rising popularity of LangChain and
similar AI frameworks, the convergence of libraries like O365 and tools
like this one is likely. So, I wanted to keep you updated on our
progress.

---------

Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
2023-06-26 02:59:09 -07:00
Xiaochao Dong
a15afc102c Relax the action input check for actions that require no input (#6357)
When the tool requires no input, the LLM often gives something like
this:
```json
{
    "action": "just_do_it"
}
```
I have attempted to enhance the prompt, but it doesn't appear to be
functioning effectively. Therefore, I believe we should consider easing
the check a little bit.



Signed-off-by: Xiaochao Dong (@damnever) <the.xcdong@gmail.com>
2023-06-26 02:30:17 -07:00
Ethan Bowen
cc33bde74f Confluence added (#6432)
Adds Confluence support to the Jira tool. With this PR you can create a page
in Confluence. If accepted, I will extend the functionality to Bitbucket and
additional Confluence features.



---------

Co-authored-by: Ethan Bowen <ethan.bowen@slalom.com>
2023-06-26 02:28:04 -07:00
Surya Nudurupati
2aeb8e7dbc Improved Documentation: Eliminating Redundancy in the Introduction.mdx (#6360)
When the documentation was originally written, the phrase "using the" was
typed twice in a row.
2023-06-26 02:27:36 -07:00
rajib
0f6ef048d2 The openai_info.py does not have gpt-35-turbo which is the underlying Azure Open AI model name (#6321)
Since this model name is not in the MODEL_COST_PER_1K_TOKENS list, when we use
get_openai_callback() with the GPT-3.5 model on Azure OpenAI, we do not get the
cost of the tokens. This PR fixes that issue.


#### Who can review?
 @hwchase17
 @agola11

Co-authored-by: rajib76 <rajib76@yahoo.com>
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
2023-06-26 02:16:39 -07:00
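A hedged sketch of the fix in action, assuming an Azure OpenAI deployment named "gpt-35-turbo" (hypothetical) and Azure/OpenAI credentials already set in the environment:

```python
from langchain.callbacks import get_openai_callback
from langchain.chat_models import AzureChatOpenAI

# Hypothetical deployment name; the Azure endpoint and key are read from the environment
llm = AzureChatOpenAI(deployment_name="gpt-35-turbo")

with get_openai_callback() as cb:
    llm.predict("Say hello")

# With gpt-35-turbo now listed in MODEL_COST_PER_1K_TOKENS, the cost is no longer reported as 0
print(cb.total_tokens, cb.total_cost)
```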
ArchimedesFTW
fe941cb54a Change tags(str) to tags(dict) in mlflow_callback.py docs (#6473)
Fixes #6472

#### Who can review?

@agola11
2023-06-26 02:12:23 -07:00
0xcrusher
9187d2f3a9 Fixed caching bug for Multiple Caching types by correctly checking types (#6746)
- Fixed an issue where some caching types checked the wrong types, which
prevented caching from working


Maintainer responsibilities:
  - DataLoaders / VectorStores / Retrievers: @rlancemartin, @eyurtsev
2023-06-26 01:14:32 -07:00
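A hedged sketch of the caching setup this touches, assuming an OpenAI key in the environment; with the type checks corrected, the second identical call should be served from the cache:

```python
import langchain
from langchain.cache import InMemoryCache
from langchain.llms import OpenAI

langchain.llm_cache = InMemoryCache()  # any supported cache type can be plugged in here

llm = OpenAI(temperature=0)
llm("Tell me a joke")  # first call hits the API and populates the cache
llm("Tell me a joke")  # identical call is answered from the cache
```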
Harrison Chase
e9877ea8b1 Tiktoken override (#6697) 2023-06-26 00:49:32 -07:00
Gabriel Altay
f9771700e4 prevent DuckDuckGoSearchAPIWrapper from consuming top result (#6727)
remove the `next` call that checks for None on the results generator
2023-06-25 19:54:15 -07:00
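A minimal usage sketch (requires the `duckduckgo-search` package); after this change the top hit is included in the returned results:

```python
from langchain.utilities import DuckDuckGoSearchAPIWrapper

search = DuckDuckGoSearchAPIWrapper()
# The first result is no longer consumed before the snippets are assembled
print(search.run("LangChain"))
```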
Pau Ramon Revilla
87802c86d9 Added a MHTML document loader (#6311)
MHTML is a very interesting format since it's used both for emails and for
archived webpages. Some scraping projects want to store pages on disk to
process them later; MHTML is perfect for that use case.

This loader is heavily inspired by the BeautifulSoup HTML loader, but it
extracts the HTML part from the MHTML file.

---------

Co-authored-by: rlm <pexpresss31@gmail.com>
2023-06-25 13:12:08 -07:00
Janos Tolgyesi
05eec99269 beautifulsoup get_text kwargs in WebBaseLoader (#6591)
# beautifulsoup get_text kwargs in WebBaseLoader

- Description: this PR introduces an optional `bs_get_text_kwargs`
parameter to `WebBaseLoader` constructor. It can be used to pass kwargs
to the downstream BeautifulSoup.get_text call. The most common usage
might be to pass a custom text separator, as seen also in
`BSHTMLLoader`.
  - Tag maintainer: @rlancemartin, @eyurtsev
  - Twitter handle: jtolgyesi
2023-06-25 12:42:27 -07:00
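A minimal sketch of the new constructor argument, assuming an example URL; the dict is forwarded to `BeautifulSoup.get_text`:

```python
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader(
    "https://python.langchain.com/",
    bs_get_text_kwargs={"separator": " | ", "strip": True},  # passed through to get_text()
)
docs = loader.load()
```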
Matt Robinson
be68f6f8ce feat: Add UnstructuredRSTLoader (#6594)
### Summary

Adds an `UnstructuredRSTLoader` for loading
[reStructuredText](https://en.wikipedia.org/wiki/ReStructuredText) files.

### Testing

```python
from langchain.document_loaders import UnstructuredRSTLoader

loader = UnstructuredRSTLoader(
    file_path="example_data/README.rst", mode="elements"
)
docs = loader.load()
print(docs[0])
```

### Reviewers

- @hwchase17 
- @rlancemartin 
- @eyurtsev
2023-06-25 12:41:57 -07:00
Chip Davis
b32cc01c9f feat: added tqdm progress bar to UnstructuredURLLoader (#6600)
- Description: Adds a simple progress bar with tqdm when using
UnstructuredURLLoader. Exposes the new parameter `show_progress_bar`. Very
simple PR.
- Issue: N/A
- Dependencies: N/A
- Tag maintainer: @rlancemartin @eyurtsev

---------

Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
2023-06-25 12:41:25 -07:00
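A minimal sketch of the new flag, assuming an example URL and the `unstructured` and `tqdm` dependencies installed:

```python
from langchain.document_loaders import UnstructuredURLLoader

loader = UnstructuredURLLoader(
    urls=["https://www.example.com"],
    show_progress_bar=True,  # renders a tqdm progress bar while the URLs are processed
)
docs = loader.load()
```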
Augustine Theodore
afc292e58d Fix WhatsAppChatLoader : Enable parsing additional formats (#6663)
- Description: Updated the regex to support a new format that was observed
when a WhatsApp chat is exported.
  - Issue: #6654
  - Dependencies: No new dependencies
  - Tag maintainer: @rlancemartin, @eyurtsev
2023-06-25 12:08:43 -07:00
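A minimal sketch, assuming an exported chat file at a hypothetical path; the updated regex also parses the newer export format:

```python
from langchain.document_loaders import WhatsAppChatLoader

loader = WhatsAppChatLoader("example_data/whatsapp_chat.txt")  # hypothetical path
docs = loader.load()
print(docs[0].page_content[:200])
```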
Sumanth Donthula
3e30a5d967 updated sql_database.py for returning sorted table names. (#6692)
Added code to return the table info in sorted order in the
get_usable_table_names and get_table_info methods.

Linked to Issue: #6640
2023-06-25 12:04:24 -07:00
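A minimal sketch, assuming a local Chinook SQLite database at a hypothetical path:

```python
from langchain.sql_database import SQLDatabase

db = SQLDatabase.from_uri("sqlite:///Chinook.db")  # hypothetical database file
print(db.get_usable_table_names())  # table names now come back sorted
print(db.get_table_info())          # table info follows the same sorted order
```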
刘 方瑞
9d1b3bab76 Fix Typo in LangChain MyScale Integration Doc (#6705)
- Description: Fix Typo in LangChain MyScale Integration  Doc

@hwchase17
2023-06-25 11:54:00 -07:00
sudolong
408c8d0178 fix chroma _similarity_search_with_relevance_scores missing kwargs … (#6708)
Issue: https://github.com/hwchase17/langchain/issues/6707
2023-06-25 11:53:42 -07:00
Zander Chase
d89e10d361 Fix Multi Functions Agent Tracing (#6702)
Confirmed it works now:
https://dev.langchain.plus/public/0dc32ce0-55af-432e-b09e-5a1a220842f5/r
2023-06-25 10:39:04 -07:00
Harrison Chase
1742db0c30 bump version to 215 (#6719) 2023-06-25 08:52:51 -07:00
Ankush Gola
e1b801be36 split up batch llm calls into separate runs (#5804) 2023-06-24 21:03:31 -07:00
106 changed files with 4810 additions and 986 deletions

View File

@@ -25,5 +25,6 @@ API Reference
:maxdepth: 1
:caption: Additional
./modules/evaluation.rst
./modules/utilities.rst
./modules/experimental.rst

View File

@@ -0,0 +1,9 @@
Evaluation
=======================
LangChain has a number of convenient evaluation chains you can use off the shelf to grade your models' outputs.
.. automodule:: langchain.evaluation
:members:
:undoc-members:
:inherited-members:

View File

@@ -0,0 +1,3 @@
# Creating a Custom Eval Chain

View File

@@ -0,0 +1,13 @@
---
sidebar_position: 1
---
# Evaluation
Blah Blah Blah TODO
Different types of evaluators:
- [String Evaluators](/docs/modules/evaluation/string/): Evaluators that evaluate input/output strings for a single run
- [Trajectory Evaluators](/docs/modules/evaluation/trajectory/): Evaluators that evaluate the whole trajectory of a run
- [Comparison Evaluators](/docs/modules/evaluation/comparison/): Evaluators that compare the input/output strings for two runs

View File

@@ -17,4 +17,6 @@ Let chains choose which tools to use given high-level directives
#### [Memory](/docs/modules/memory/)
Persist application state between runs of a chain
#### [Callbacks](/docs/modules/callbacks/)
Log and stream intermediate steps of any chain
Log and stream intermediate steps of any chain
#### [Evaluation](/docs/modules/evaluation/)
Evaluate the performance of a chain.

View File

@@ -25,7 +25,7 @@ There are two ways to set up parameters for myscale index.
1. Environment Variables
Before you run the app, please set the environment variable with `export`:
`export MYSCALE_URL='<your-endpoints-url>' MYSCALE_PORT=<your-endpoints-port> MYSCALE_USERNAME=<your-username> MYSCALE_PASSWORD=<your-password> ...`
`export MYSCALE_HOST='<your-endpoints-url>' MYSCALE_PORT=<your-endpoints-port> MYSCALE_USERNAME=<your-username> MYSCALE_PASSWORD=<your-password> ...`
You can easily find your account, password and other info on our SaaS. For details please refer to [this document](https://docs.myscale.com/en/cluster-management/)
Every attribute under `MyScaleSettings` can be set with the `MYSCALE_` prefix and is case insensitive.

View File

@@ -1,362 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Generic Agent Evaluation\n",
"\n",
"Good evaluation is key for quickly iterating on your agent's prompts and tools. Here we provide an example of how to use the TrajectoryEvalChain to evaluate your agent."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"Let's start by defining our agent."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from langchain import Wikipedia\n",
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.agents import initialize_agent, Tool\n",
"from langchain.agents import AgentType\n",
"from langchain.agents.react.base import DocstoreExplorer\n",
"from langchain.memory import ConversationBufferMemory\n",
"from langchain import LLMMathChain\n",
"from langchain.llms import OpenAI\n",
"\n",
"from langchain import SerpAPIWrapper\n",
"\n",
"docstore = DocstoreExplorer(Wikipedia())\n",
"\n",
"math_llm = OpenAI(temperature=0)\n",
"\n",
"llm_math_chain = LLMMathChain(llm=math_llm, verbose=True)\n",
"\n",
"search = SerpAPIWrapper()\n",
"\n",
"tools = [\n",
" Tool(\n",
" name=\"Search\",\n",
" func=docstore.search,\n",
" description=\"useful for when you need to ask with search\",\n",
" ),\n",
" Tool(\n",
" name=\"Lookup\",\n",
" func=docstore.lookup,\n",
" description=\"useful for when you need to ask with lookup\",\n",
" ),\n",
" Tool(\n",
" name=\"Calculator\",\n",
" func=llm_math_chain.run,\n",
" description=\"useful for doing calculations\",\n",
" ),\n",
" Tool(\n",
" name=\"Search the Web (SerpAPI)\",\n",
" func=search.run,\n",
" description=\"useful for when you need to answer questions about current events\",\n",
" ),\n",
"]\n",
"\n",
"memory = ConversationBufferMemory(\n",
" memory_key=\"chat_history\", return_messages=True, output_key=\"output\"\n",
")\n",
"\n",
"llm = ChatOpenAI(temperature=0, model_name=\"gpt-3.5-turbo\")\n",
"\n",
"agent = initialize_agent(\n",
" tools,\n",
" llm,\n",
" agent=AgentType.CHAT_CONVERSATIONAL_REACT_DESCRIPTION,\n",
" verbose=True,\n",
" memory=memory,\n",
" return_intermediate_steps=True, # This is needed for the evaluation later\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Testing the Agent\n",
"\n",
"Now let's try our agent out on some example queries."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
"\u001b[32;1m\u001b[1;3m{\n",
" \"action\": \"Search the Web (SerpAPI)\",\n",
" \"action_input\": \"How many ping pong balls would it take to fill the entire Empire State Building?\"\n",
"}\u001b[0m\n",
"Observation: \u001b[31;1m\u001b[1;3m12.8 billion. The volume of the Empire State Building Googles in at around 37 million ft³. A golf ball comes in at about 2.5 in³.\u001b[0m\n",
"Thought:\u001b[32;1m\u001b[1;3m{\n",
" \"action\": \"Final Answer\",\n",
" \"action_input\": \"It would take approximately 12.8 billion ping pong balls to fill the entire Empire State Building.\"\n",
"}\u001b[0m\n",
"\n",
"\u001b[1m> Finished chain.\u001b[0m\n"
]
}
],
"source": [
"query_one = (\n",
" \"How many ping pong balls would it take to fill the entire Empire State Building?\"\n",
")\n",
"\n",
"test_outputs_one = agent({\"input\": query_one}, return_only_outputs=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This looks good! Let's try it out on another query."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n",
"\u001b[32;1m\u001b[1;3m{\n",
" \"action\": \"Calculator\",\n",
" \"action_input\": \"The length of the Eiffel Tower is 324 meters. The distance from coast to coast in the US is approximately 4,828 kilometers. First, we need to convert 4,828 kilometers to meters, which gives us 4,828,000 meters. To find out how many Eiffel Towers we need, we can divide 4,828,000 by 324. This gives us approximately 14,876 Eiffel Towers.\"\n",
"}\u001b[0m\n",
"\n",
"\u001b[1m> Entering new LLMMathChain chain...\u001b[0m\n",
"The length of the Eiffel Tower is 324 meters. The distance from coast to coast in the US is approximately 4,828 kilometers. First, we need to convert 4,828 kilometers to meters, which gives us 4,828,000 meters. To find out how many Eiffel Towers we need, we can divide 4,828,000 by 324. This gives us approximately 14,876 Eiffel Towers.\u001b[32;1m\u001b[1;3m\n",
"```text\n",
"4828000 / 324\n",
"```\n",
"...numexpr.evaluate(\"4828000 / 324\")...\n",
"\u001b[0m\n",
"Answer: \u001b[33;1m\u001b[1;3m14901.234567901234\u001b[0m\n",
"\u001b[1m> Finished chain.\u001b[0m\n",
"\n",
"Observation: \u001b[38;5;200m\u001b[1;3mAnswer: 14901.234567901234\u001b[0m\n",
"Thought:\u001b[32;1m\u001b[1;3m{\n",
" \"action\": \"Calculator\",\n",
" \"action_input\": \"The length of the Eiffel Tower is 324 meters. The distance from coast to coast in the US is approximately 4,828 kilometers. First, we need to convert 4,828 kilometers to meters, which gives us 4,828,000 meters. To find out how many Eiffel Towers we need, we can divide 4,828,000 by 324. This gives us approximately 14,901 Eiffel Towers.\"\n",
"}\u001b[0m\n",
"\n",
"\u001b[1m> Entering new LLMMathChain chain...\u001b[0m\n",
"The length of the Eiffel Tower is 324 meters. The distance from coast to coast in the US is approximately 4,828 kilometers. First, we need to convert 4,828 kilometers to meters, which gives us 4,828,000 meters. To find out how many Eiffel Towers we need, we can divide 4,828,000 by 324. This gives us approximately 14,901 Eiffel Towers.\u001b[32;1m\u001b[1;3m\n",
"```text\n",
"4828000 / 324\n",
"```\n",
"...numexpr.evaluate(\"4828000 / 324\")...\n",
"\u001b[0m\n",
"Answer: \u001b[33;1m\u001b[1;3m14901.234567901234\u001b[0m\n",
"\u001b[1m> Finished chain.\u001b[0m\n",
"\n",
"Observation: \u001b[38;5;200m\u001b[1;3mAnswer: 14901.234567901234\u001b[0m\n",
"Thought:\u001b[32;1m\u001b[1;3m{\n",
" \"action\": \"Final Answer\",\n",
" \"action_input\": \"If you laid the Eiffel Tower end to end, you would need approximately 14,901 Eiffel Towers to cover the US from coast to coast.\"\n",
"}\u001b[0m\n",
"\n",
"\u001b[1m> Finished chain.\u001b[0m\n"
]
}
],
"source": [
"query_two = \"If you laid the Eiffel Tower end to end, how many would you need cover the US from coast to coast?\"\n",
"\n",
"test_outputs_two = agent({\"input\": query_two}, return_only_outputs=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This doesn't look so good. Let's try running some evaluation.\n",
"\n",
"## Evaluating the Agent\n",
"\n",
"Let's start by defining the TrajectoryEvalChain."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from langchain.evaluation.agents import TrajectoryEvalChain\n",
"\n",
"# Define chain\n",
"eval_chain = TrajectoryEvalChain.from_llm(\n",
" llm=ChatOpenAI(\n",
" temperature=0, model_name=\"gpt-4\"\n",
" ), # Note: This must be a ChatOpenAI model\n",
" agent_tools=agent.tools,\n",
" return_reasoning=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try evaluating the first query."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Score from 1 to 5: 1\n",
"Reasoning: First, let's evaluate the final answer. The final answer is incorrect because it uses the volume of golf balls instead of ping pong balls. The answer is not helpful.\n",
"\n",
"Second, does the model use a logical sequence of tools to answer the question? The model only used one tool, which was the Search the Web (SerpAPI). It did not use the Calculator tool to calculate the correct volume of ping pong balls.\n",
"\n",
"Third, does the AI language model use the tools in a helpful way? The model used the Search the Web (SerpAPI) tool, but the output was not helpful because it provided information about golf balls instead of ping pong balls.\n",
"\n",
"Fourth, does the AI language model use too many steps to answer the question? The model used only one step, which is not too many. However, it should have used more steps to provide a correct answer.\n",
"\n",
"Fifth, are the appropriate tools used to answer the question? The model should have used the Search tool to find the volume of the Empire State Building and the volume of a ping pong ball. Then, it should have used the Calculator tool to calculate the number of ping pong balls needed to fill the building.\n",
"\n",
"Judgment: Given the incorrect final answer and the inappropriate use of tools, we give the model a score of 1.\n"
]
}
],
"source": [
"question, steps, answer = (\n",
" test_outputs_one[\"input\"],\n",
" test_outputs_one[\"intermediate_steps\"],\n",
" test_outputs_one[\"output\"],\n",
")\n",
"\n",
"evaluation = eval_chain(\n",
" inputs={\n",
" \"question\": question,\n",
" \"answer\": answer,\n",
" \"agent_trajectory\": eval_chain.get_agent_trajectory(steps),\n",
" },\n",
")\n",
"\n",
"print(\"Score from 1 to 5: \", evaluation[\"score\"])\n",
"print(\"Reasoning: \", evaluation[\"reasoning\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That seems about right. Let's try the second query."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Score from 1 to 5: 3\n",
"Reasoning: i. Is the final answer helpful?\n",
"Yes, the final answer is helpful as it provides an approximate number of Eiffel Towers needed to cover the US from coast to coast.\n",
"\n",
"ii. Does the AI language use a logical sequence of tools to answer the question?\n",
"No, the AI language model does not use a logical sequence of tools. It directly uses the Calculator tool without first using the Search or Lookup tools to find the necessary information (length of the Eiffel Tower and distance from coast to coast in the US).\n",
"\n",
"iii. Does the AI language model use the tools in a helpful way?\n",
"The AI language model uses the Calculator tool in a helpful way to perform the calculation, but it should have used the Search or Lookup tools first to find the required information.\n",
"\n",
"iv. Does the AI language model use too many steps to answer the question?\n",
"No, the AI language model does not use too many steps. However, it repeats the same step twice, which is unnecessary.\n",
"\n",
"v. Are the appropriate tools used to answer the question?\n",
"Not entirely. The AI language model should have used the Search or Lookup tools to find the required information before using the Calculator tool.\n",
"\n",
"Given the above evaluation, the AI language model's performance can be scored as follows:\n"
]
}
],
"source": [
"question, steps, answer = (\n",
" test_outputs_two[\"input\"],\n",
" test_outputs_two[\"intermediate_steps\"],\n",
" test_outputs_two[\"output\"],\n",
")\n",
"\n",
"evaluation = eval_chain(\n",
" inputs={\n",
" \"question\": question,\n",
" \"answer\": answer,\n",
" \"agent_trajectory\": eval_chain.get_agent_trajectory(steps),\n",
" },\n",
")\n",
"\n",
"print(\"Score from 1 to 5: \", evaluation[\"score\"])\n",
"print(\"Reasoning: \", evaluation[\"reasoning\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That also sounds about right. In conclusion, the TrajectoryEvalChain allows us to use GPT-4 to score both our agent's outputs and tool use in addition to giving us the reasoning behind the evaluation."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
},
"vscode": {
"interpreter": {
"hash": "06ba49dd587e86cdcfee66b9ffe769e1e94f0e368e54c2d6c866e38e33c0d9b1"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -1,86 +0,0 @@
# Evaluation
This section of documentation covers how we approach and think about evaluation in LangChain.
It covers both evaluation of internal chains/agents and how we would recommend that people building on top of LangChain approach evaluation.
## The Problem
It can be really hard to evaluate LangChain chains and agents.
There are two main reasons for this:
**# 1: Lack of data**
You generally don't have a ton of data to evaluate your chains/agents over before starting a project.
This is usually because Large Language Models (the core of most chains/agents) are terrific few-shot and zero shot learners,
meaning you are almost always able to get started on a particular task (text-to-SQL, question answering, etc) without
a large dataset of examples.
This is in stark contrast to traditional machine learning where you had to first collect a bunch of datapoints
before even getting started using a model.
**# 2: Lack of metrics**
Most chains/agents are performing tasks for which there are not very good metrics to evaluate performance.
For example, one of the most common use cases is generating text of some form.
Evaluating generated text is much more complicated than evaluating a classification prediction, or a numeric prediction.
## The Solution
LangChain attempts to tackle both of those issues.
What we have so far are initial passes at solutions - we do not think we have a perfect solution.
So we very much welcome feedback, contributions, integrations, and thoughts on this.
Here is what we have for each problem so far:
**# 1: Lack of data**
We have started [LangChainDatasets](https://huggingface.co/LangChainDatasets) a Community space on Hugging Face.
We intend this to be a collection of open source datasets for evaluating common chains and agents.
We have contributed five datasets of our own to start, but we highly intend this to be a community effort.
In order to contribute a dataset, you simply need to join the community and then you will be able to upload datasets.
We're also aiming to make it as easy as possible for people to create their own datasets.
As a first pass at this, we've added a QAGenerationChain, which given a document comes up
with question-answer pairs that can be used to evaluate question-answering tasks over that document down the line.
See [this notebook](/docs/guides/evaluation/qa_generation.html) for an example of how to use this chain.
**# 2: Lack of metrics**
We have two solutions to the lack of metrics.
The first solution is to use no metrics, and rather just rely on looking at results by eye to get a sense for how the chain/agent is performing.
To assist in this, we have developed (and will continue to develop) [tracing](/docs/guides/tracing/), a UI-based visualizer of your chain and agent runs.
The second solution we recommend is to use Language Models themselves to evaluate outputs.
For this we have a few different chains and prompts aimed at tackling this issue.
## The Examples
We have created a bunch of examples combining the above two solutions to show how we internally evaluate chains and agents when we are developing.
In addition to the examples we've curated, we also highly welcome contributions here.
To facilitate that, we've included a [template notebook](/docs/guides/evaluation/benchmarking_template.html) for community members to use to build their own examples.
The existing examples we have are:
[Question Answering (State of Union)](/docs/guides/evaluation/qa_benchmarking_sota.html): A notebook showing evaluation of a question-answering task over a State-of-the-Union address.
[Question Answering (Paul Graham Essay)](/docs/guides/evaluation/qa_benchmarking_pg.html): A notebook showing evaluation of a question-answering task over a Paul Graham essay.
[SQL Question Answering (Chinook)](/docs/guides/evaluation/sql_qa_benchmarking_chinook.html): A notebook showing evaluation of a question-answering task over a SQL database (the Chinook database).
[Agent Vectorstore](/docs/guides/evaluation/agent_vectordb_sota_pg.html): A notebook showing evaluation of an agent doing question answering while routing between two different vector databases.
[Agent Search + Calculator](/docs/guides/evaluation/agent_benchmarking.html): A notebook showing evaluation of an agent doing question answering using a Search engine and a Calculator as tools.
[Evaluating an OpenAPI Chain](/docs/guides/evaluation/openapi_eval.html): A notebook showing evaluation of an OpenAPI chain, including how to generate test data if you don't have any.
## Other Examples
In addition, we also have some more generic resources for evaluation.
[Question Answering](/docs/guides/evaluation/question_answering.html): An overview of LLMs aimed at evaluating question answering systems in general.
[Data Augmented Question Answering](/docs/guides/evaluation/data_augmented_question_answering.html): An end-to-end example of evaluating a question answering system focused on a specific document (a RetrievalQAChain to be precise). This example highlights how to use LLMs to come up with question/answer examples to evaluate over, and then highlights how to use LLMs to evaluate performance on those generated examples.
[Hugging Face Datasets](/docs/guides/evaluation/huggingface_datasets.html): Covers an example of loading and using a dataset from Hugging Face for evaluation.

View File

@@ -0,0 +1,238 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Office365 Toolkit\n",
"\n",
"This notebook walks through connecting LangChain to Office365 email and calendar.\n",
"\n",
"To use this toolkit, you will need to set up your credentials explained in the [Microsoft Graph authentication and authorization overview](https://learn.microsoft.com/en-us/graph/auth/). Once you've received a CLIENT_ID and CLIENT_SECRET, you can input them as environmental variables below."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade O365 > /dev/null\n",
"!pip install beautifulsoup4 > /dev/null # This is optional but is useful for parsing HTML messages"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Assign Environmental Variables\n",
"\n",
"The toolkit will read the CLIENT_ID and CLIENT_SECRET environmental variables to authenticate the user so you need to set them here. You will also need to set your OPENAI_API_KEY to use the agent later."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set environmental variables here"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create the Toolkit and Get Tools\n",
"\n",
"To start, you need to create the toolkit, so you can access its tools later."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"[O365SearchEvents(name='events_search', description=\" Use this tool to search for the user's calendar events. The input must be the start and end datetimes for the search query. The output is a JSON list of all the events in the user's calendar between the start and end times. You can assume that the user can not schedule any meeting over existing meetings, and that the user is busy during meetings. Any times without events are free for the user. \", args_schema=<class 'langchain.tools.office365.events_search.SearchEventsInput'>, return_direct=False, verbose=False, callbacks=None, callback_manager=None, handle_tool_error=False, account=Account Client Id: f32a022c-3c4c-4d10-a9d8-f6a9a9055302),\n",
" O365CreateDraftMessage(name='create_email_draft', description='Use this tool to create a draft email with the provided message fields.', args_schema=<class 'langchain.tools.office365.create_draft_message.CreateDraftMessageSchema'>, return_direct=False, verbose=False, callbacks=None, callback_manager=None, handle_tool_error=False, account=Account Client Id: f32a022c-3c4c-4d10-a9d8-f6a9a9055302),\n",
" O365SearchEmails(name='messages_search', description='Use this tool to search for email messages. The input must be a valid Microsoft Graph v1.0 $search query. The output is a JSON list of the requested resource.', args_schema=<class 'langchain.tools.office365.messages_search.SearchEmailsInput'>, return_direct=False, verbose=False, callbacks=None, callback_manager=None, handle_tool_error=False, account=Account Client Id: f32a022c-3c4c-4d10-a9d8-f6a9a9055302),\n",
" O365SendEvent(name='send_event', description='Use this tool to create and send an event with the provided event fields.', args_schema=<class 'langchain.tools.office365.send_event.SendEventSchema'>, return_direct=False, verbose=False, callbacks=None, callback_manager=None, handle_tool_error=False, account=Account Client Id: f32a022c-3c4c-4d10-a9d8-f6a9a9055302),\n",
" O365SendMessage(name='send_email', description='Use this tool to send an email with the provided message fields.', args_schema=<class 'langchain.tools.office365.send_message.SendMessageSchema'>, return_direct=False, verbose=False, callbacks=None, callback_manager=None, handle_tool_error=False, account=Account Client Id: f32a022c-3c4c-4d10-a9d8-f6a9a9055302)]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.agents.agent_toolkits import O365Toolkit\n",
"\n",
"toolkit = O365Toolkit()\n",
"tools = toolkit.get_tools()\n",
"tools"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use within an Agent"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain import OpenAI\n",
"from langchain.agents import initialize_agent, AgentType"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"llm = OpenAI(temperature=0)\n",
"agent = initialize_agent(\n",
" tools=toolkit.get_tools(),\n",
" llm=llm,\n",
" verbose=False,\n",
" agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'The draft email was created correctly.'"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"agent.run(\"Create an email draft for me to edit of a letter from the perspective of a sentient parrot\"\n",
" \" who is looking to collaborate on some research with her\"\n",
" \" estranged friend, a cat. Under no circumstances may you send the message, however.\")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"\"I found one draft in your drafts folder about collaboration. It was sent on 2023-06-16T18:22:17+0000 and the subject was 'Collaboration Request'.\""
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"agent.run(\"Could you search in my drafts folder and let me know if any of them are about collaboration?\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/vscode/langchain-py-env/lib/python3.11/site-packages/O365/utils/windows_tz.py:639: PytzUsageWarning: The zone attribute is specific to pytz's interface; please migrate to a new time zone provider. For more details on how to do so, see https://pytz-deprecation-shim.readthedocs.io/en/latest/migration.html\n",
" iana_tz.zone if isinstance(iana_tz, tzinfo) else iana_tz)\n",
"/home/vscode/langchain-py-env/lib/python3.11/site-packages/O365/utils/utils.py:463: PytzUsageWarning: The zone attribute is specific to pytz's interface; please migrate to a new time zone provider. For more details on how to do so, see https://pytz-deprecation-shim.readthedocs.io/en/latest/migration.html\n",
" timezone = date_time.tzinfo.zone if date_time.tzinfo is not None else None\n"
]
},
{
"data": {
"text/plain": [
"'I have scheduled a meeting with a sentient parrot to discuss research collaborations on October 3, 2023 at 2 pm Easter Time. Please let me know if you need to make any changes.'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"agent.run(\"Can you schedule a 30 minute meeting with a sentient parrot to discuss research collaborations on October 3, 2023 at 2 pm Easter Time?\")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"Yes, you have an event on October 3, 2023 with a sentient parrot. The event is titled 'Meeting with sentient parrot' and is scheduled from 6:00 PM to 6:30 PM.\""
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"agent.run(\"Can you tell me if I have any events on October 3, 2023 in Eastern Time, and if so, tell me if any of them are with a sentient parrot?\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@@ -0,0 +1,28 @@
Example Docs
------------
The sample docs directory contains the following files:
- ``example-10k.html`` - A 10-K SEC filing in HTML format
- ``layout-parser-paper.pdf`` - A PDF copy of the layout parser paper
- ``factbook.xml``/``factbook.xsl`` - Example XML/XSL files that you
can use to test stylesheets
These documents can be used to test out the parsers in the library. In
addition, here are instructions for pulling in some sample docs that are
too big to store in the repo.
XBRL 10-K
^^^^^^^^^
You can get an example 10-K in inline XBRL format using the following
``curl``. Note, you need to have the user agent set in the header or the
SEC site will reject your request.
.. code:: bash

   curl -O \
     -A '${organization} ${email}' \
     https://www.sec.gov/Archives/edgar/data/311094/000117184321001344/0001171843-21-001344.txt
You can parse this document using the HTML parser.

View File

@@ -0,0 +1,71 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "87067cdf",
"metadata": {},
"source": [
"# mhtml\n",
"\n",
"MHTML is a is used both for emails but also for archived webpages. MHTML, sometimes referred as MHT, stands for MIME HTML is a single file in which entire webpage is archived. When one saves a webpage as MHTML format, this file extension will contain HTML code, images, audio files, flash animation etc."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5d4c6174",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import MHTMLLoader"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "12dcebc8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='LangChain\\nLANG CHAIN 🦜🔗Official Home Page\\xa0\\n\\n\\n\\n\\n\\n\\n\\nIntegrations\\n\\n\\n\\nFeatures\\n\\n\\n\\n\\nBlog\\n\\n\\n\\nConceptual Guide\\n\\n\\n\\n\\nPython Repo\\n\\n\\nJavaScript Repo\\n\\n\\n\\nPython Documentation \\n\\n\\nJavaScript Documentation\\n\\n\\n\\n\\nPython ChatLangChain \\n\\n\\nJavaScript ChatLangChain\\n\\n\\n\\n\\nDiscord \\n\\n\\nTwitter\\n\\n\\n\\n\\nIf you have any comments about our WEB page, you can \\nwrite us at the address shown above. However, due to \\nthe limited number of personnel in our corporate office, we are unable to \\nprovide a direct response.\\n\\nCopyright © 2023-2023 LangChain Inc.\\n\\n\\n' metadata={'source': '../../../../../../tests/integration_tests/examples/example.mht', 'title': 'LangChain'}\n"
]
}
],
"source": [
"# Create a new loader object for the MHTML file\n",
"loader = MHTMLLoader(file_path='../../../../../../tests/integration_tests/examples/example.mht')\n",
"\n",
"# Load the document from the file\n",
"documents = loader.load()\n",
"\n",
"# Print the documents to see the results\n",
"for doc in documents:\n",
" print(doc)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,88 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# RST\n",
"\n",
">A [reStructured Text (RST)](https://en.wikipedia.org/wiki/ReStructuredText) file is a file format for textual data used primarily in the Python programming language community for technical documentation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## `UnstructuredRSTLoader`\n",
"\n",
"You can load data from RST files with `UnstructuredRSTLoader` using the following workflow."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import UnstructuredRSTLoader"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"loader = UnstructuredRSTLoader(\n",
" file_path=\"example_data/README.rst\", mode=\"elements\"\n",
")\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='Example Docs' metadata={'source': 'example_data/README.rst', 'filename': 'README.rst', 'file_directory': 'example_data', 'filetype': 'text/x-rst', 'page_number': 1, 'category': 'Title'}\n"
]
}
],
"source": [
"print(docs[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@@ -491,6 +491,73 @@
"source": [
"retriever.get_relevant_documents(query)[0]"
]
},
{
"cell_type": "markdown",
"id": "275dbd0a",
"metadata": {},
"source": [
"### Filtering on metadata\n",
"\n",
"It can be helpful to narrow down the collection before working with it.\n",
"\n",
"For example, collections can be filtered on metadata using the get method."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "a5119221",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'source': 'some_other_source'}\n",
"{'ids': ['1'], 'embeddings': None, 'documents': ['Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.'], 'metadatas': [{'source': 'some_other_source'}]}\n"
]
}
],
"source": [
"# create simple ids\n",
"ids = [str(i) for i in range(1, len(docs) + 1)]\n",
"\n",
"# add data\n",
"example_db = Chroma.from_documents(docs, embedding_function, ids=ids)\n",
"docs = example_db.similarity_search(query)\n",
"print(docs[0].metadata)\n",
"\n",
"# update the source for a document\n",
"docs[0].metadata = {\"source\": \"some_other_source\"}\n",
"example_db.update_document(ids[0], docs[0])\n",
"print(example_db._collection.get(ids=[ids[0]]))"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "81600dc1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'ids': ['1'],\n",
" 'embeddings': None,\n",
" 'documents': ['Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.'],\n",
" 'metadatas': [{'source': 'some_other_source'}]}"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# filter collection for updated source\n",
"example_db.get(where={\"source\": \"some_other_source\"})"
]
}
],
"metadata": {

View File

@@ -0,0 +1,85 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "f6790c46",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/harrisonchase/.pyenv/versions/3.9.1/envs/langchain/lib/python3.9/site-packages/deeplake/util/check_latest_version.py:32: UserWarning: A newer version of deeplake (3.6.6) is available. It's recommended that you update to the latest version using `pip install -U deeplake`.\n",
" warnings.warn(\n"
]
}
],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.evaluation.comparison import PairwiseStringEvalChain\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-4\")\n",
"\n",
"eval_chain = PairwiseStringEvalChain.from_llm(llm=llm)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "49ad9139",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'reasoning': \"Both responses A and B accurately answer the question, but neither response provides any additional detail or context. Response A is slightly more complete, as it uses full sentences to convey the information, while response B provides just the number. However, both responses are fairly equal in relevance, accuracy, and depth. The lack of detail in both responses doesn't allow for a clear winner based on creativity or detail. \\n\\nTherefore, my rating is a tie. \\n\",\n",
" 'value': None,\n",
" 'score': 0.5}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eval_chain.evaluate_string_pairs(\n",
" output_a = \"there are three dogs\",\n",
" output_b=\"4\",\n",
" input=\"how many dogs are in the park?\",\n",
" reference=\"four\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "586320da",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,447 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Comparing Chain Outputs\n",
"\n",
"Suppose you have two different prompts (or LLMs). How do you know which will generate \"better\" results?\n",
"\n",
"One automated way to predict the preferred configuration is to use a `PairwiseStringEvaluator` like the `PairwiseStringEvalChain`<a name=\"cite_ref-1\"></a>[<sup>[1]</sup>](#cite_note-1). This chain prompts an LLM to select which output is preferred, given a specific input.\n",
"\n",
"For this evalution, we will need 3 things:\n",
"1. An evaluator\n",
"2. A dataset of inputs\n",
"3. 2 (or more) LLMs, Chains, or Agents to compare\n",
"\n",
"Then we will aggregate the restults to determine the preferred model.\n",
"\n",
"### Step 1. Create the Evaluator\n",
"\n",
"In this example, you will use gpt-4 to select which output is preferred."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional if you are tracing the notebook\n",
"%env LANGCHAIN_PROJECT=\"Comparing Chain Outputs\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.evaluation.comparison import PairwiseStringEvalChain\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-4\")\n",
"\n",
"eval_chain = PairwiseStringEvalChain.from_llm(llm=llm)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 2. Select Dataset\n",
"\n",
"If you already have real usage data for your LLM, you can use a representative sample. More examples\n",
"provide more reliable results. We will use some example queries someone might have about how to use langchain here."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Found cached dataset parquet (/Users/wfh/.cache/huggingface/datasets/LangChainDatasets___parquet/LangChainDatasets--langchain-howto-queries-bbb748bbee7e77aa/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "d852a1884480457292c90d8bd9d4f1e6",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/1 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from langchain.evaluation.loading import load_dataset\n",
"\n",
"dataset = load_dataset(\"langchain-howto-queries\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 3. Define Models to Compare\n",
"\n",
"We will be comparing two agents in this case."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain import SerpAPIWrapper\n",
"from langchain.agents import initialize_agent, Tool\n",
"from langchain.agents import AgentType\n",
"from langchain.chat_models import ChatOpenAI\n",
"\n",
"\n",
"# Initialize the language model\n",
"# You can add your own OpenAI API key by adding openai_api_key=\"<your_api_key>\" \n",
"llm = ChatOpenAI(temperature=0, model=\"gpt-3.5-turbo-0613\")\n",
"\n",
"# Initialize the SerpAPIWrapper for search functionality\n",
"#Replace <your_api_key> in openai_api_key=\"<your_api_key>\" with your actual SerpAPI key.\n",
"search = SerpAPIWrapper()\n",
"\n",
"# Define a list of tools offered by the agent\n",
"tools = [\n",
" Tool(\n",
" name=\"Search\",\n",
" func=search.run,\n",
" coroutine=search.arun,\n",
" description=\"Useful when you need to answer questions about current events. You should ask targeted questions.\"\n",
" ),\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"functions_agent = initialize_agent(tools, llm, agent=AgentType.OPENAI_MULTI_FUNCTIONS, verbose=False)\n",
"conversations_agent = initialize_agent(tools, llm, agent=AgentType.CHAT_ZERO_SHOT_REACT_DESCRIPTION, verbose=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"list(zip(*[iter(batch_results)]*2)### Step 4. Generate Responses\n",
"\n",
"We will generate outputs for each of the models before evaluating them."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "b076d6bf6680422aa9082d4bad4d98a3",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/20 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Retrying langchain.chat_models.openai.acompletion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised ServiceUnavailableError: The server is overloaded or not ready yet..\n",
"Retrying langchain.chat_models.openai.acompletion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised ServiceUnavailableError: The server is overloaded or not ready yet..\n"
]
}
],
"source": [
"from tqdm.notebook import tqdm\n",
"import asyncio\n",
"\n",
"results = []\n",
"agents = [functions_agent, conversations_agent]\n",
"concurrency_level = 6 # How many concurrent agents to run. May need to decrease if OpenAI is rate limiting.\n",
"\n",
"# We will only run the first 20 examples of this dataset to speed things up\n",
"# This will lead to larger confidence intervals downstream.\n",
"batch = []\n",
"for example in tqdm(dataset[:20]):\n",
" batch.extend([agent.acall(example['inputs']) for agent in agents])\n",
" if len(batch) >= concurrency_level:\n",
" batch_results = await asyncio.gather(*batch, return_exceptions=True)\n",
" results.extend(list(zip(*[iter(batch_results)]*2)))\n",
" batch = []\n",
"if batch:\n",
" batch_results = await asyncio.gather(*batch, return_exceptions=True)\n",
" results.extend(list(zip(*[iter(batch_results)]*2)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 5. Evaluate Pairs\n",
"\n",
"Now it's time to evaluate the results. For each agent response, run the evaluation chain to select which output is preferred (or return a tie).\n",
"\n",
"Randomly select the input order to reduce the likelihood that one model will be preferred just because it is presented first."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import random\n",
"\n",
"def predict_preferences(dataset, results) -> list:\n",
" preferences = []\n",
"\n",
" for example, (res_a, res_b) in zip(dataset, results):\n",
" input_ = example['inputs']\n",
" # Flip a coin to reduce persistent position bias\n",
" if random.random() < 0.5:\n",
" pred_a, pred_b = res_a, res_b\n",
" a, b = \"a\", \"b\"\n",
" else:\n",
" pred_a, pred_b = res_b, res_a\n",
" a, b = \"b\", \"a\"\n",
" eval_res = eval_chain.evaluate_string_pairs(\n",
" output_a=pred_a['output'] if isinstance(pred_a, dict) else str(pred_a),\n",
" output_b=pred_b['output'] if isinstance(pred_b, dict) else str(pred_b),\n",
" input=input_\n",
" )\n",
" if eval_res[\"value\"] == \"A\":\n",
" preferences.append(a)\n",
" elif eval_res[\"value\"] == \"B\":\n",
" preferences.append(b)\n",
" else:\n",
" preferences.append(None) # No preference\n",
" return preferences"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"preferences = predict_preferences(dataset, results)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"**Print out the ratio of preferences.**"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"OpenAI Functions Agent: 90.00%\n",
"Structured Chat Agent: 10.00%\n"
]
}
],
"source": [
"from collections import Counter\n",
"\n",
"name_map = {\n",
" \"a\": \"OpenAI Functions Agent\",\n",
" \"b\": \"Structured Chat Agent\",\n",
"}\n",
"counts = Counter(preferences)\n",
"pref_ratios = {\n",
" k: v/len(preferences) for k, v in\n",
" counts.items()\n",
"}\n",
"for k, v in pref_ratios.items():\n",
" print(f\"{name_map.get(k)}: {v:.2%}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Estimate Confidence Intervals\n",
"\n",
"The results seem pretty clear, but if you want to have a better sense of how confident we are, that model \"A\" (the OpenAI Functions Agent) is the preferred model, we can calculate confidence intervals. \n",
"\n",
"Below, use the Wilson score to estimate the confidence interval."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from math import sqrt\n",
"\n",
"def wilson_score_interval(preferences: list, which: str = \"a\", z: float = 1.96) -> tuple:\n",
" \"\"\"Estimate the confidence interval using the Wilson score.\n",
" \n",
" See: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Wilson_score_interval\n",
" for more details, including when to use it and when it should not be used.\n",
" \"\"\"\n",
" total_preferences = preferences.count('a') + preferences.count('b')\n",
" n_s = preferences.count(which)\n",
"\n",
" if total_preferences == 0:\n",
" return (0, 0)\n",
"\n",
" p_hat = n_s / total_preferences\n",
"\n",
" denominator = 1 + (z**2) / total_preferences\n",
" adjustment = (z / denominator) * sqrt(p_hat*(1-p_hat)/total_preferences + (z**2)/(4*total_preferences*total_preferences))\n",
" center = (p_hat + (z**2) / (2*total_preferences)) / denominator\n",
" lower_bound = min(max(center - adjustment, 0.0), 1.0)\n",
" upper_bound = min(max(center + adjustment, 0.0), 1.0)\n",
"\n",
" return (lower_bound, upper_bound)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The \"OpenAI Functions Agent\" would be preferred between 69.90% and 97.21% percent of the time (with 95% confidence).\n",
"The \"Structured Chat Agent\" would be preferred between 2.79% and 30.10% percent of the time (with 95% confidence).\n"
]
}
],
"source": [
"for which_, name in name_map.items():\n",
" low, high = wilson_score_interval(preferences, which=which_)\n",
" print(f'The \"{name}\" would be preferred between {low:.2%} and {high:.2%} percent of the time (with 95% confidence).')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Print out the p-value.**"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The p-value is 0.00040. If the null hypothesis is true (i.e., if the selected eval chain actually has no preference between the models),\n",
"then there is a 0.04025% chance of observing the OpenAI Functions Agent be preferred at least 18\n",
"times out of 20 trials.\n"
]
}
],
"source": [
"from scipy import stats\n",
"preferred_model = max(pref_ratios, key=pref_ratios.get)\n",
"successes = preferences.count(preferred_model)\n",
"n = len(preferences) - preferences.count(None)\n",
"p_value = stats.binom_test(successes, n, p=0.5, alternative='two-sided')\n",
"print(f\"\"\"The p-value is {p_value:.5f}. If the null hypothesis is true (i.e., if the selected eval chain actually has no preference between the models),\n",
"then there is a {p_value:.5%} chance of observing the {name_map.get(preferred_model)} be preferred at least {successes}\n",
"times out of {n} trials.\"\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a name=\"cite_note-1\"></a>_1. Note: Automated evals are still an open research topic and are best used alongside other evaluation approaches. \n",
"LLM preferences exhibit biases, including banal ones like the order of outputs.\n",
"In choosing preferences, \"ground truth\" may not be taken into account, which may lead to scores that aren't grounded in utility._"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
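
A note on the p-value cell above: `scipy.stats.binom_test` is deprecated in newer SciPy releases and removed in 1.12+. A minimal drop-in sketch using `binomtest`, assuming the same `preferences`, `pref_ratios`, and `name_map` variables from the notebook:

from scipy.stats import binomtest

preferred_model = max(pref_ratios, key=pref_ratios.get)
successes = preferences.count(preferred_model)
n = len(preferences) - preferences.count(None)  # ignore ties
# binomtest returns a result object; the two-sided p-value lives on .pvalue
p_value = binomtest(successes, n, p=0.5, alternative="two-sided").pvalue
print(f"p-value for {name_map[preferred_model]}: {p_value:.5f}")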

View File

@@ -0,0 +1,264 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "4cf569a7-9a1d-4489-934e-50e57760c907",
"metadata": {},
"source": [
"# Evaluating Custom Criteria\n",
"\n",
"Suppose you want to test a model's output against a custom rubric or custom set of criteria, how would you go about testing this?\n",
"\n",
"The `CriteriaEvalChain` is a convenient way to predict whether an LLM or Chain's output complies with a set of criteria, so long as you can\n",
"describe those criteria in regular language. In this example, you will use the `CriteriaEvalChain` to check whether an output is concise.\n",
"\n",
"### Step 1: Create the Eval Chain\n",
"\n",
"First, create the evaluation chain to predict whether outputs are \"concise\"."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "6005ebe8-551e-47a5-b4df-80575a068552",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.evaluation.criteria import CriteriaEvalChain\n",
"\n",
"llm = ChatOpenAI(temperature=0)\n",
"criterion = \"conciseness\"\n",
"eval_chain = CriteriaEvalChain.from_llm(llm=llm, criteria=criterion)"
]
},
{
"cell_type": "markdown",
"id": "eaef0d93-e080-4be2-a0f1-701b0d91fcf4",
"metadata": {},
"source": [
"### Step 2: Make Prediction\n",
"\n",
"Run an output to measure."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "68b1a348-cf41-40bf-9667-e79683464cf2",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"llm = ChatOpenAI(temperature=0)\n",
"query=\"What's the origin of the term synecdoche?\"\n",
"prediction = llm.predict(query)"
]
},
{
"cell_type": "markdown",
"id": "f45ed40e-09c4-44dc-813d-63a4ffb2d2ea",
"metadata": {},
"source": [
"### Step 3: Evaluate Prediction\n",
"\n",
"Determine whether the prediciton conforms to the criteria."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "22f83fb8-82f4-4310-a877-68aaa0789199",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'reasoning': '1. Conciseness: The submission is concise and to the point. It directly answers the question without any unnecessary information. Therefore, the submission meets the criterion of conciseness.\\n\\nY', 'value': 'Y', 'score': 1}\n"
]
}
],
"source": [
"eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)\n",
"print(eval_result)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "8c4ec9dd-6557-4f23-8480-c822eb6ec552",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"['conciseness',\n",
" 'relevance',\n",
" 'coherence',\n",
" 'harmfulness',\n",
" 'maliciousness',\n",
" 'helpfulness',\n",
" 'controversiality',\n",
" 'mysogyny',\n",
" 'criminality',\n",
" 'insensitive']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# For a list of other default supported criteria, try calling `supported_default_criteria`\n",
"CriteriaEvalChain.get_supported_default_criteria()"
]
},
{
"cell_type": "markdown",
"id": "2eb7dedb-913a-4d9e-b48a-9521425d1008",
"metadata": {},
"source": [
"## Multiple Criteria\n",
"\n",
"To check whether an output complies with all of a list of default criteria, pass in a list! Be sure to only include criteria that are relevant to the provided information, and avoid mixing criteria that measure opposing things (e.g., harmfulness and helpfulness)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "50c067f7-bc6e-4d6c-ba34-97a72023be27",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'reasoning': 'Conciseness: The submission is not concise and does not answer the given task. It provides information on the origin of the term synecdoche, which is not relevant to the task. Therefore, the submission does not meet the criterion of conciseness.\\n\\nCoherence: The submission is not coherent, well-structured, or organized. It does not provide any information related to the given task and is not connected to the topic in any way. Therefore, the submission does not meet the criterion of coherence.\\n\\nConclusion: The submission does not meet all criteria.', 'value': 'N', 'score': 0}\n"
]
}
],
"source": [
"criteria = [\"conciseness\", \"coherence\"]\n",
"eval_chain = CriteriaEvalChain.from_llm(llm=llm, criteria=criteria)\n",
"eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)\n",
"print(eval_result)"
]
},
{
"cell_type": "markdown",
"id": "077c4715-e857-44a3-9f87-346642586a8d",
"metadata": {},
"source": [
"## Custom Criteria\n",
"\n",
"To evaluate outputs against your own custom criteria, or to be more explicit the definition of any of the default criteria, pass in a dictionary of `\"criterion_name\": \"criterion_description\"`\n",
"\n",
"Note: the evaluator still predicts whether the output complies with ALL of the criteria provided. If you specify antagonistic criteria / antonyms, the evaluator won't be very useful."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "bafa0a11-2617-4663-84bf-24df7d0736be",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'reasoning': '1. Criteria: numeric: Does the output contain numeric information?\\n- The submission does not contain any numeric information.\\n- Conclusion: The submission meets the criteria.', 'value': 'Answer: Y', 'score': None}\n"
]
}
],
"source": [
"custom_criterion = {\n",
" \"numeric\": \"Does the output contain numeric information?\"\n",
"}\n",
"\n",
"eval_chain = CriteriaEvalChain.from_llm(llm=llm, criteria=custom_criterion)\n",
"eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)\n",
"print(eval_result)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "6db12a16-0058-4a14-8064-8528540963d8",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'reasoning': '- complements-user: The submission directly answers the question asked and provides additional information about the population of Lagos. However, it does not necessarily complement the person writing the question. \\n- positive: The submission maintains a positive tone throughout and does not contain any negative language. \\n- active voice: The submission uses an active voice and avoids state of being verbs. \\n\\nTherefore, the submission meets all criteria. \\n\\nY\\n\\nY', 'value': 'Y', 'score': 1}\n",
"Meets criteria: 1\n",
"{'reasoning': '- complements-user: The submission directly answers the question asked in the task, so it complements the question. Therefore, the answer meets this criterion. \\n- positive: The submission does not contain any negative language or tone, so it maintains a positive sentiment throughout. Therefore, the answer meets this criterion. \\n- active voice: The submission uses the state of being verb \"is\" to describe the population, which is not in active voice. Therefore, the answer does not meet this criterion. \\n\\nAnswer: N', 'value': 'N', 'score': 0}\n",
"Does not meet criteria: 0\n"
]
}
],
"source": [
"# You can specify multiple criteria in the dictionary. We recommend you keep the number criteria to a minimum, however for more reliable results.\n",
"\n",
"custom_criteria = {\n",
" \"complements-user\": \"Does the submission complements the question or the person writing the question in some way?\",\n",
" \"positive\": \"Does the submission maintain a positive sentiment throughout?\",\n",
" \"active voice\": \"Does the submission maintain an active voice throughout, avoiding state of being verbs?\",\n",
"}\n",
"\n",
"eval_chain = CriteriaEvalChain.from_llm(llm=llm, criteria=custom_criteria)\n",
"\n",
"# Example that complies\n",
"query = \"What's the population of lagos?\"\n",
"eval_result = eval_chain.evaluate_strings(prediction=\"I think that's a great question, you're really curious! About 30 million people live in Lagos, Nigeria, as of 2023.\", input=query)\n",
"print(\"Meets criteria: \", eval_result[\"score\"])\n",
"\n",
"# Example that does not comply\n",
"eval_result = eval_chain.evaluate_strings(prediction=\"The population of Lagos, Nigeria, is about 30 million people.\", input=query)\n",
"print(\"Does not meet criteria: \", eval_result[\"score\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "99e3c242-5b12-4bd5-b487-64990a159655",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
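
Since `evaluate_strings` returns a dict with a binary `score`, results are easy to aggregate over a batch of outputs. A minimal sketch, assuming the conciseness `eval_chain` from Step 1 and the `query` from Step 2, with purely illustrative predictions:

predictions = {
    "terse": "It comes from the Greek word 'synekdoche', meaning 'simultaneous understanding'.",
    "rambling": "Well, that is a fascinating question with a very long history... (several paragraphs)",
}
scores = {}
for label, prediction in predictions.items():
    result = eval_chain.evaluate_strings(prediction=prediction, input=query)
    scores[label] = result["score"]  # 1 if the criterion is met, 0 otherwise
print(scores)  # e.g. {'terse': 1, 'rambling': 0}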

View File

@@ -0,0 +1,264 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "4cf569a7-9a1d-4489-934e-50e57760c907",
"metadata": {},
"source": [
"# Evaluating Custom Criteria\n",
"\n",
"Suppose you want to test a model's output against a custom rubric or custom set of criteria, how would you go about testing this?\n",
"\n",
"The `CriteriaEvalChain` is a convenient way to predict whether an LLM or Chain's output complies with a set of criteria, so long as you can\n",
"describe those criteria in regular language. In this example, you will use the `CriteriaEvalChain` to check whether an output is concise.\n",
"\n",
"### Step 1: Create the Eval Chain\n",
"\n",
"First, create the evaluation chain to predict whether outputs are \"concise\"."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "6005ebe8-551e-47a5-b4df-80575a068552",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.evaluation.criteria import CriteriaEvalChain\n",
"\n",
"llm = ChatOpenAI(temperature=0)\n",
"criterion = \"conciseness\"\n",
"eval_chain = CriteriaEvalChain.from_llm(llm=llm, criteria=criterion)"
]
},
{
"cell_type": "markdown",
"id": "eaef0d93-e080-4be2-a0f1-701b0d91fcf4",
"metadata": {},
"source": [
"### Step 2: Make Prediction\n",
"\n",
"Run an output to measure."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "68b1a348-cf41-40bf-9667-e79683464cf2",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"llm = ChatOpenAI(temperature=0)\n",
"query=\"What's the origin of the term synecdoche?\"\n",
"prediction = llm.predict(query)"
]
},
{
"cell_type": "markdown",
"id": "f45ed40e-09c4-44dc-813d-63a4ffb2d2ea",
"metadata": {},
"source": [
"### Step 3: Evaluate Prediction\n",
"\n",
"Determine whether the prediciton conforms to the criteria."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "22f83fb8-82f4-4310-a877-68aaa0789199",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'reasoning': '1. Conciseness: The submission is concise and to the point. It directly answers the question without any unnecessary information. Therefore, the submission meets the criterion of conciseness.\\n\\nY', 'value': 'Y', 'score': 1}\n"
]
}
],
"source": [
"eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)\n",
"print(eval_result)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "8c4ec9dd-6557-4f23-8480-c822eb6ec552",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"['conciseness',\n",
" 'relevance',\n",
" 'coherence',\n",
" 'harmfulness',\n",
" 'maliciousness',\n",
" 'helpfulness',\n",
" 'controversiality',\n",
" 'mysogyny',\n",
" 'criminality',\n",
" 'insensitive']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# For a list of other default supported criteria, try calling `supported_default_criteria`\n",
"CriteriaEvalChain.get_supported_default_criteria()"
]
},
{
"cell_type": "markdown",
"id": "2eb7dedb-913a-4d9e-b48a-9521425d1008",
"metadata": {},
"source": [
"## Multiple Criteria\n",
"\n",
"To check whether an output complies with all of a list of default criteria, pass in a list! Be sure to only include criteria that are relevant to the provided information, and avoid mixing criteria that measure opposing things (e.g., harmfulness and helpfulness)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "50c067f7-bc6e-4d6c-ba34-97a72023be27",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'reasoning': 'Conciseness: The submission is not concise and does not answer the given task. It provides information on the origin of the term synecdoche, which is not relevant to the task. Therefore, the submission does not meet the criterion of conciseness.\\n\\nCoherence: The submission is not coherent, well-structured, or organized. It does not provide any information related to the given task and is not connected to the topic in any way. Therefore, the submission does not meet the criterion of coherence.\\n\\nConclusion: The submission does not meet all criteria.', 'value': 'N', 'score': 0}\n"
]
}
],
"source": [
"criteria = [\"conciseness\", \"coherence\"]\n",
"eval_chain = CriteriaEvalChain.from_llm(llm=llm, criteria=criteria)\n",
"eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)\n",
"print(eval_result)"
]
},
{
"cell_type": "markdown",
"id": "077c4715-e857-44a3-9f87-346642586a8d",
"metadata": {},
"source": [
"## Custom Criteria\n",
"\n",
"To evaluate outputs against your own custom criteria, or to be more explicit the definition of any of the default criteria, pass in a dictionary of `\"criterion_name\": \"criterion_description\"`\n",
"\n",
"Note: the evaluator still predicts whether the output complies with ALL of the criteria provided. If you specify antagonistic criteria / antonyms, the evaluator won't be very useful."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "bafa0a11-2617-4663-84bf-24df7d0736be",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'reasoning': '1. Criteria: numeric: Does the output contain numeric information?\\n- The submission does not contain any numeric information.\\n- Conclusion: The submission meets the criteria.', 'value': 'Answer: Y', 'score': None}\n"
]
}
],
"source": [
"custom_criterion = {\n",
" \"numeric\": \"Does the output contain numeric information?\"\n",
"}\n",
"\n",
"eval_chain = CriteriaEvalChain.from_llm(llm=llm, criteria=custom_criterion)\n",
"eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)\n",
"print(eval_result)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "6db12a16-0058-4a14-8064-8528540963d8",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'reasoning': '- complements-user: The submission directly answers the question asked and provides additional information about the population of Lagos. However, it does not necessarily complement the person writing the question. \\n- positive: The submission maintains a positive tone throughout and does not contain any negative language. \\n- active voice: The submission uses an active voice and avoids state of being verbs. \\n\\nTherefore, the submission meets all criteria. \\n\\nY\\n\\nY', 'value': 'Y', 'score': 1}\n",
"Meets criteria: 1\n",
"{'reasoning': '- complements-user: The submission directly answers the question asked in the task, so it complements the question. Therefore, the answer meets this criterion. \\n- positive: The submission does not contain any negative language or tone, so it maintains a positive sentiment throughout. Therefore, the answer meets this criterion. \\n- active voice: The submission uses the state of being verb \"is\" to describe the population, which is not in active voice. Therefore, the answer does not meet this criterion. \\n\\nAnswer: N', 'value': 'N', 'score': 0}\n",
"Does not meet criteria: 0\n"
]
}
],
"source": [
"# You can specify multiple criteria in the dictionary. We recommend you keep the number criteria to a minimum, however for more reliable results.\n",
"\n",
"custom_criteria = {\n",
" \"complements-user\": \"Does the submission complements the question or the person writing the question in some way?\",\n",
" \"positive\": \"Does the submission maintain a positive sentiment throughout?\",\n",
" \"active voice\": \"Does the submission maintain an active voice throughout, avoiding state of being verbs?\",\n",
"}\n",
"\n",
"eval_chain = CriteriaEvalChain.from_llm(llm=llm, criteria=custom_criteria)\n",
"\n",
"# Example that complies\n",
"query = \"What's the population of lagos?\"\n",
"eval_result = eval_chain.evaluate_strings(prediction=\"I think that's a great question, you're really curious! About 30 million people live in Lagos, Nigeria, as of 2023.\", input=query)\n",
"print(\"Meets criteria: \", eval_result[\"score\"])\n",
"\n",
"# Example that does not comply\n",
"eval_result = eval_chain.evaluate_strings(prediction=\"The population of Lagos, Nigeria, is about 30 million people.\", input=query)\n",
"print(\"Does not meet criteria: \", eval_result[\"score\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "99e3c242-5b12-4bd5-b487-64990a159655",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1 @@
"""Gmail toolkit."""

View File

@@ -0,0 +1,38 @@
from __future__ import annotations
from typing import TYPE_CHECKING, List
from pydantic import Field
from langchain.agents.agent_toolkits.base import BaseToolkit
from langchain.tools import BaseTool
from langchain.tools.office365.create_draft_message import O365CreateDraftMessage
from langchain.tools.office365.events_search import O365SearchEvents
from langchain.tools.office365.messages_search import O365SearchEmails
from langchain.tools.office365.send_event import O365SendEvent
from langchain.tools.office365.send_message import O365SendMessage
from langchain.tools.office365.utils import authenticate
if TYPE_CHECKING:
from O365 import Account
class O365Toolkit(BaseToolkit):
"""Toolkit for interacting with Office365."""
account: Account = Field(default_factory=authenticate)
class Config:
"""Pydantic config."""
arbitrary_types_allowed = True
def get_tools(self) -> List[BaseTool]:
"""Get the tools in the toolkit."""
return [
O365SearchEvents(account=self.account),
O365CreateDraftMessage(account=self.account),
O365SearchEmails(account=self.account),
O365SendEvent(account=self.account),
O365SendMessage(account=self.account),
]
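
For context, a minimal sketch of wiring this toolkit into an agent. The import path and environment setup are assumptions (the O365 `authenticate` helper expects client credentials to be configured), not the PR's documented usage:

from langchain.agents import AgentType, initialize_agent
from langchain.agents.agent_toolkits.office365.toolkit import O365Toolkit  # path assumed
from langchain.chat_models import ChatOpenAI

toolkit = O365Toolkit()  # runs the O365 authentication flow on construction
agent = initialize_agent(
    tools=toolkit.get_tools(),
    llm=ChatOpenAI(temperature=0),
    agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
)
agent.run("Do I have any meetings tomorrow afternoon?")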

View File

@@ -17,13 +17,15 @@ class ChatOutputParser(AgentOutputParser):
try:
action = text.split("```")[1]
response = json.loads(action.strip())
includes_action = "action" in response and "action_input" in response
includes_action = "action" in response
if includes_answer and includes_action:
raise OutputParserException(
"Parsing LLM output produced a final answer "
f"and a parse-able action: {text}"
)
return AgentAction(response["action"], response["action_input"], text)
return AgentAction(
response["action"], response.get("action_input", {}), text
)
except Exception:
if not includes_answer:
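
The change above makes `action_input` optional for the structured-chat output parser, so tools that take no arguments can still be called. A standalone sketch of the relaxed parsing logic (an illustration, not the library's code):

import json

def parse_action(text: str):
    # Mirrors the fix: if "action_input" is omitted, pass an empty dict to the tool.
    block = text.split("```")[1]
    response = json.loads(block.strip())
    return response["action"], response.get("action_input", {})

print(parse_action('Thought\n```\n{"action": "get_time"}\n```'))  # ('get_time', {})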

View File

@@ -51,7 +51,7 @@ def initialize_agent(
f"Got unknown agent type: {agent}. "
f"Valid types are: {AGENT_TO_CLASS.keys()}."
)
tags_.append(agent.value)
tags_.append(agent.value if isinstance(agent, AgentType) else agent)
agent_cls = AGENT_TO_CLASS[agent]
agent_kwargs = agent_kwargs or {}
agent_obj = agent_cls.from_llm_and_tools(
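
A small sketch of what the guard buys (imports assumed): the tag ends up as the agent type's string value either way, so passing a plain string no longer raises AttributeError on `.value`:

from langchain.agents.agent_types import AgentType

for agent in (AgentType.ZERO_SHOT_REACT_DESCRIPTION, "zero-shot-react-description"):
    tag = agent.value if isinstance(agent, AgentType) else agent
    print(tag)  # "zero-shot-react-description" in both cases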

View File

@@ -69,7 +69,7 @@ def _create_function_message(
"""
if not isinstance(observation, str):
try:
content = json.dumps(observation)
content = json.dumps(observation, ensure_ascii=False)
except Exception:
content = str(observation)
else:
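
The `ensure_ascii=False` change keeps non-ASCII tool observations readable (and shorter once tokenized) instead of escaping every character. A quick illustration:

import json

observation = {"城市": "北京", "天气": "晴"}
print(json.dumps(observation))                      # {"\u57ce\u5e02": "\u5317\u4eac", ...}
print(json.dumps(observation, ensure_ascii=False))  # {"城市": "北京", "天气": "晴"}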

View File

@@ -68,7 +68,7 @@ def _create_function_message(
"""
if not isinstance(observation, str):
try:
content = json.dumps(observation)
content = json.dumps(observation, ensure_ascii=False)
except Exception:
content = str(observation)
else:
@@ -296,7 +296,7 @@ class OpenAIMultiFunctionsAgent(BaseMultiActionAgent):
prompt = self.prompt.format_prompt(**full_inputs)
messages = prompt.to_messages()
predicted_message = await self.llm.apredict_messages(
messages, functions=self.functions
messages, functions=self.functions, callbacks=callbacks
)
agent_decision = _parse_ai_message(predicted_message)
return agent_decision

View File

@@ -226,7 +226,7 @@ class RedisCache(BaseCache):
def update(self, prompt: str, llm_string: str, return_val: RETURN_VAL_TYPE) -> None:
"""Update cache based on prompt and llm_string."""
for gen in return_val:
if not isinstance(return_val, Generation):
if not isinstance(gen, Generation):
raise ValueError(
"RedisCache only supports caching of normal LLM generations, "
f"got {type(gen)}"
@@ -337,7 +337,7 @@ class RedisSemanticCache(BaseCache):
def update(self, prompt: str, llm_string: str, return_val: RETURN_VAL_TYPE) -> None:
"""Update cache based on prompt and llm_string."""
for gen in return_val:
if not isinstance(return_val, Generation):
if not isinstance(gen, Generation):
raise ValueError(
"RedisSemanticCache only supports caching of "
f"normal LLM generations, got {type(gen)}"
@@ -455,7 +455,7 @@ class GPTCache(BaseCache):
and then store the `prompt` and `return_val` in the cache object.
"""
for gen in return_val:
if not isinstance(return_val, Generation):
if not isinstance(gen, Generation):
raise ValueError(
"GPTCache only supports caching of normal LLM generations, "
f"got {type(gen)}"
@@ -628,7 +628,7 @@ class MomentoCache(BaseCache):
Exception: Unexpected response
"""
for gen in return_val:
if not isinstance(return_val, Generation):
if not isinstance(gen, Generation):
raise ValueError(
"Momento only supports caching of normal LLM generations, "
f"got {type(gen)}"

View File

@@ -1,4 +1,3 @@
import uuid
from datetime import datetime
from typing import Any, Dict, List, Optional, Union
@@ -33,6 +32,7 @@ class ArizeCallbackHandler(BaseCallbackHandler):
self.prompt_tokens = 0
self.completion_tokens = 0
self.total_tokens = 0
self.step = 0
from arize.pandas.embeddings import EmbeddingGenerator, UseCases
from arize.pandas.logger import Client
@@ -84,11 +84,10 @@ class ArizeCallbackHandler(BaseCallbackHandler):
self.total_tokens
) = self.completion_tokens = 0 # assign default value
i = 0
for generations in response.generations:
for generation in generations:
prompt = self.prompt_records[i]
prompt = self.prompt_records[self.step]
self.step = self.step + 1
prompt_embedding = pd.Series(
self.generator.generate_embeddings(
text_col=pd.Series(prompt.replace("\n", " "))
@@ -102,7 +101,6 @@ class ArizeCallbackHandler(BaseCallbackHandler):
text_col=pd.Series(generation.text.replace("\n", " "))
).reset_index(drop=True)
)
str(uuid.uuid4())
pred_timestamp = datetime.now().timestamp()
# Define the columns and data
@@ -165,8 +163,6 @@ class ArizeCallbackHandler(BaseCallbackHandler):
else:
print(f'❌ Logging failed "{response_from_arize.text}"')
i = i + 1
def on_llm_error(
self, error: Union[Exception, KeyboardInterrupt], **kwargs: Any
) -> None:

View File

@@ -74,7 +74,16 @@ def _get_debug() -> bool:
@contextmanager
def get_openai_callback() -> Generator[OpenAICallbackHandler, None, None]:
"""Get OpenAI callback handler in a context manager."""
"""Get the OpenAI callback handler in a context manager.
which conveniently exposes token and cost information.
Returns:
OpenAICallbackHandler: The OpenAI callback handler.
Example:
>>> with get_openai_callback() as cb:
... # Use the OpenAI callback handler
"""
cb = OpenAICallbackHandler()
openai_callback_var.set(cb)
yield cb
@@ -85,7 +94,19 @@ def get_openai_callback() -> Generator[OpenAICallbackHandler, None, None]:
def tracing_enabled(
session_name: str = "default",
) -> Generator[TracerSessionV1, None, None]:
"""Get Tracer in a context manager."""
"""Get the Deprecated LangChainTracer in a context manager.
Args:
session_name (str, optional): The name of the session.
Defaults to "default".
Returns:
TracerSessionV1: The LangChainTracer session.
Example:
>>> with tracing_enabled() as session:
... # Use the LangChainTracer session
"""
cb = LangChainTracerV1()
session = cast(TracerSessionV1, cb.load_session(session_name))
tracing_callback_var.set(cb)
@@ -97,7 +118,19 @@ def tracing_enabled(
def wandb_tracing_enabled(
session_name: str = "default",
) -> Generator[None, None, None]:
"""Get WandbTracer in a context manager."""
"""Get the WandbTracer in a context manager.
Args:
session_name (str, optional): The name of the session.
Defaults to "default".
Returns:
None
Example:
>>> with wandb_tracing_enabled() as session:
... # Use the WandbTracer session
"""
cb = WandbTracer()
wandb_tracing_callback_var.set(cb)
yield None
@@ -110,7 +143,21 @@ def tracing_v2_enabled(
*,
example_id: Optional[Union[str, UUID]] = None,
) -> Generator[None, None, None]:
"""Get the experimental tracer handler in a context manager."""
"""Instruct LangChain to log all runs in context to LangSmith.
Args:
project_name (str, optional): The name of the project.
Defaults to "default".
example_id (str or UUID, optional): The ID of the example.
Defaults to None.
Returns:
None
Example:
>>> with tracing_v2_enabled():
... # LangChain code will automatically be traced
"""
# Issue a warning that this is experimental
warnings.warn(
"The tracing v2 API is in development. "
@@ -133,14 +180,36 @@ def trace_as_chain_group(
*,
project_name: Optional[str] = None,
example_id: Optional[Union[str, UUID]] = None,
tags: Optional[List[str]] = None,
) -> Generator[CallbackManager, None, None]:
"""Get a callback manager for a chain group in a context manager."""
"""Get a callback manager for a chain group in a context manager.
Useful for grouping different calls together as a single run even if
they aren't composed in a single chain.
Args:
group_name (str): The name of the chain group.
project_name (str, optional): The name of the project.
Defaults to None.
example_id (str or UUID, optional): The ID of the example.
Defaults to None.
tags (List[str], optional): The inheritable tags to apply to all runs.
Defaults to None.
Returns:
CallbackManager: The callback manager for the chain group.
Example:
>>> with trace_as_chain_group("group_name") as manager:
... # Use the callback manager for the chain group
... llm.predict("Foo", callbacks=manager)
"""
cb = LangChainTracer(
project_name=project_name,
example_id=example_id,
)
cm = CallbackManager.configure(
inheritable_callbacks=[cb],
inheritable_tags=tags,
)
run_manager = cm.on_chain_start({"name": group_name}, {})
@@ -154,14 +223,34 @@ async def atrace_as_chain_group(
*,
project_name: Optional[str] = None,
example_id: Optional[Union[str, UUID]] = None,
tags: Optional[List[str]] = None,
) -> AsyncGenerator[AsyncCallbackManager, None]:
"""Get a callback manager for a chain group in a context manager."""
"""Get an async callback manager for a chain group in a context manager.
Useful for grouping different async calls together as a single run even if
they aren't composed in a single chain.
Args:
group_name (str): The name of the chain group.
project_name (str, optional): The name of the project.
Defaults to None.
example_id (str or UUID, optional): The ID of the example.
Defaults to None.
tags (List[str], optional): The inheritable tags to apply to all runs.
Defaults to None.
Returns:
AsyncCallbackManager: The async callback manager for the chain group.
Example:
>>> async with atrace_as_chain_group("group_name") as manager:
... # Use the async callback manager for the chain group
... await llm.apredict("Foo", callbacks=manager)
"""
cb = LangChainTracer(
project_name=project_name,
example_id=example_id,
)
cm = AsyncCallbackManager.configure(
inheritable_callbacks=[cb],
inheritable_callbacks=[cb], inheritable_tags=tags
)
run_manager = await cm.on_chain_start({"name": group_name}, {})
@@ -293,7 +382,18 @@ class BaseRunManager(RunManagerMixin):
tags: List[str],
inheritable_tags: List[str],
) -> None:
"""Initialize run manager."""
"""Initialize the run manager.
Args:
run_id (UUID): The ID of the run.
handlers (List[BaseCallbackHandler]): The list of handlers.
inheritable_handlers (List[BaseCallbackHandler]):
The list of inheritable handlers.
parent_run_id (UUID, optional): The ID of the parent run.
Defaults to None.
tags (List[str]): The list of tags.
inheritable_tags (List[str]): The list of inheritable tags.
"""
self.run_id = run_id
self.handlers = handlers
self.inheritable_handlers = inheritable_handlers
@@ -303,7 +403,11 @@ class BaseRunManager(RunManagerMixin):
@classmethod
def get_noop_manager(cls: Type[BRM]) -> BRM:
"""Return a manager that doesn't perform any operations."""
"""Return a manager that doesn't perform any operations.
Returns:
BaseRunManager: The noop manager.
"""
return cls(
run_id=uuid4(),
handlers=[],
@@ -321,7 +425,14 @@ class RunManager(BaseRunManager):
text: str,
**kwargs: Any,
) -> Any:
"""Run when text is received."""
"""Run when text is received.
Args:
text (str): The received text.
Returns:
Any: The result of the callback.
"""
_handle_event(
self.handlers,
"on_text",
@@ -341,7 +452,14 @@ class AsyncRunManager(BaseRunManager):
text: str,
**kwargs: Any,
) -> Any:
"""Run when text is received."""
"""Run when text is received.
Args:
text (str): The received text.
Returns:
Any: The result of the callback.
"""
await _ahandle_event(
self.handlers,
"on_text",
@@ -361,7 +479,11 @@ class CallbackManagerForLLMRun(RunManager, LLMManagerMixin):
token: str,
**kwargs: Any,
) -> None:
"""Run when LLM generates a new token."""
"""Run when LLM generates a new token.
Args:
token (str): The new token.
"""
_handle_event(
self.handlers,
"on_llm_new_token",
@@ -373,7 +495,11 @@ class CallbackManagerForLLMRun(RunManager, LLMManagerMixin):
)
def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
"""Run when LLM ends running."""
"""Run when LLM ends running.
Args:
response (LLMResult): The LLM result.
"""
_handle_event(
self.handlers,
"on_llm_end",
@@ -389,7 +515,11 @@ class CallbackManagerForLLMRun(RunManager, LLMManagerMixin):
error: Union[Exception, KeyboardInterrupt],
**kwargs: Any,
) -> None:
"""Run when LLM errors."""
"""Run when LLM errors.
Args:
error (Exception or KeyboardInterrupt): The error.
"""
_handle_event(
self.handlers,
"on_llm_error",
@@ -409,7 +539,11 @@ class AsyncCallbackManagerForLLMRun(AsyncRunManager, LLMManagerMixin):
token: str,
**kwargs: Any,
) -> None:
"""Run when LLM generates a new token."""
"""Run when LLM generates a new token.
Args:
token (str): The new token.
"""
await _ahandle_event(
self.handlers,
"on_llm_new_token",
@@ -421,7 +555,11 @@ class AsyncCallbackManagerForLLMRun(AsyncRunManager, LLMManagerMixin):
)
async def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
"""Run when LLM ends running."""
"""Run when LLM ends running.
Args:
response (LLMResult): The LLM result.
"""
await _ahandle_event(
self.handlers,
"on_llm_end",
@@ -437,7 +575,11 @@ class AsyncCallbackManagerForLLMRun(AsyncRunManager, LLMManagerMixin):
error: Union[Exception, KeyboardInterrupt],
**kwargs: Any,
) -> None:
"""Run when LLM errors."""
"""Run when LLM errors.
Args:
error (Exception or KeyboardInterrupt): The error.
"""
await _ahandle_event(
self.handlers,
"on_llm_error",
@@ -453,7 +595,15 @@ class CallbackManagerForChainRun(RunManager, ChainManagerMixin):
"""Callback manager for chain run."""
def get_child(self, tag: Optional[str] = None) -> CallbackManager:
"""Get a child callback manager."""
"""Get a child callback manager.
Args:
tag (str, optional): The tag for the child callback manager.
Defaults to None.
Returns:
CallbackManager: The child callback manager.
"""
manager = CallbackManager(handlers=[], parent_run_id=self.run_id)
manager.set_handlers(self.inheritable_handlers)
manager.add_tags(self.inheritable_tags)
@@ -462,7 +612,11 @@ class CallbackManagerForChainRun(RunManager, ChainManagerMixin):
return manager
def on_chain_end(self, outputs: Dict[str, Any], **kwargs: Any) -> None:
"""Run when chain ends running."""
"""Run when chain ends running.
Args:
outputs (Dict[str, Any]): The outputs of the chain.
"""
_handle_event(
self.handlers,
"on_chain_end",
@@ -478,7 +632,11 @@ class CallbackManagerForChainRun(RunManager, ChainManagerMixin):
error: Union[Exception, KeyboardInterrupt],
**kwargs: Any,
) -> None:
"""Run when chain errors."""
"""Run when chain errors.
Args:
error (Exception or KeyboardInterrupt): The error.
"""
_handle_event(
self.handlers,
"on_chain_error",
@@ -490,7 +648,14 @@ class CallbackManagerForChainRun(RunManager, ChainManagerMixin):
)
def on_agent_action(self, action: AgentAction, **kwargs: Any) -> Any:
"""Run when agent action is received."""
"""Run when agent action is received.
Args:
action (AgentAction): The agent action.
Returns:
Any: The result of the callback.
"""
_handle_event(
self.handlers,
"on_agent_action",
@@ -502,7 +667,14 @@ class CallbackManagerForChainRun(RunManager, ChainManagerMixin):
)
def on_agent_finish(self, finish: AgentFinish, **kwargs: Any) -> Any:
"""Run when agent finish is received."""
"""Run when agent finish is received.
Args:
finish (AgentFinish): The agent finish.
Returns:
Any: The result of the callback.
"""
_handle_event(
self.handlers,
"on_agent_finish",
@@ -518,7 +690,15 @@ class AsyncCallbackManagerForChainRun(AsyncRunManager, ChainManagerMixin):
"""Async callback manager for chain run."""
def get_child(self, tag: Optional[str] = None) -> AsyncCallbackManager:
"""Get a child callback manager."""
"""Get a child callback manager.
Args:
tag (str, optional): The tag for the child callback manager.
Defaults to None.
Returns:
AsyncCallbackManager: The child callback manager.
"""
manager = AsyncCallbackManager(handlers=[], parent_run_id=self.run_id)
manager.set_handlers(self.inheritable_handlers)
manager.add_tags(self.inheritable_tags)
@@ -527,7 +707,11 @@ class AsyncCallbackManagerForChainRun(AsyncRunManager, ChainManagerMixin):
return manager
async def on_chain_end(self, outputs: Dict[str, Any], **kwargs: Any) -> None:
"""Run when chain ends running."""
"""Run when chain ends running.
Args:
outputs (Dict[str, Any]): The outputs of the chain.
"""
await _ahandle_event(
self.handlers,
"on_chain_end",
@@ -543,7 +727,11 @@ class AsyncCallbackManagerForChainRun(AsyncRunManager, ChainManagerMixin):
error: Union[Exception, KeyboardInterrupt],
**kwargs: Any,
) -> None:
"""Run when chain errors."""
"""Run when chain errors.
Args:
error (Exception or KeyboardInterrupt): The error.
"""
await _ahandle_event(
self.handlers,
"on_chain_error",
@@ -555,7 +743,14 @@ class AsyncCallbackManagerForChainRun(AsyncRunManager, ChainManagerMixin):
)
async def on_agent_action(self, action: AgentAction, **kwargs: Any) -> Any:
"""Run when agent action is received."""
"""Run when agent action is received.
Args:
action (AgentAction): The agent action.
Returns:
Any: The result of the callback.
"""
await _ahandle_event(
self.handlers,
"on_agent_action",
@@ -567,7 +762,14 @@ class AsyncCallbackManagerForChainRun(AsyncRunManager, ChainManagerMixin):
)
async def on_agent_finish(self, finish: AgentFinish, **kwargs: Any) -> Any:
"""Run when agent finish is received."""
"""Run when agent finish is received.
Args:
finish (AgentFinish): The agent finish.
Returns:
Any: The result of the callback.
"""
await _ahandle_event(
self.handlers,
"on_agent_finish",
@@ -583,7 +785,15 @@ class CallbackManagerForToolRun(RunManager, ToolManagerMixin):
"""Callback manager for tool run."""
def get_child(self, tag: Optional[str] = None) -> CallbackManager:
"""Get a child callback manager."""
"""Get a child callback manager.
Args:
tag (str, optional): The tag for the child callback manager.
Defaults to None.
Returns:
CallbackManager: The child callback manager.
"""
manager = CallbackManager(handlers=[], parent_run_id=self.run_id)
manager.set_handlers(self.inheritable_handlers)
manager.add_tags(self.inheritable_tags)
@@ -596,7 +806,11 @@ class CallbackManagerForToolRun(RunManager, ToolManagerMixin):
output: str,
**kwargs: Any,
) -> None:
"""Run when tool ends running."""
"""Run when tool ends running.
Args:
output (str): The output of the tool.
"""
_handle_event(
self.handlers,
"on_tool_end",
@@ -612,7 +826,11 @@ class CallbackManagerForToolRun(RunManager, ToolManagerMixin):
error: Union[Exception, KeyboardInterrupt],
**kwargs: Any,
) -> None:
"""Run when tool errors."""
"""Run when tool errors.
Args:
error (Exception or KeyboardInterrupt): The error.
"""
_handle_event(
self.handlers,
"on_tool_error",
@@ -628,7 +846,15 @@ class AsyncCallbackManagerForToolRun(AsyncRunManager, ToolManagerMixin):
"""Async callback manager for tool run."""
def get_child(self, tag: Optional[str] = None) -> AsyncCallbackManager:
"""Get a child callback manager."""
"""Get a child callback manager.
Args:
tag (str, optional): The tag to add to the child
callback manager. Defaults to None.
Returns:
AsyncCallbackManager: The child callback manager.
"""
manager = AsyncCallbackManager(handlers=[], parent_run_id=self.run_id)
manager.set_handlers(self.inheritable_handlers)
manager.add_tags(self.inheritable_tags)
@@ -637,7 +863,11 @@ class AsyncCallbackManagerForToolRun(AsyncRunManager, ToolManagerMixin):
return manager
async def on_tool_end(self, output: str, **kwargs: Any) -> None:
"""Run when tool ends running."""
"""Run when tool ends running.
Args:
output (str): The output of the tool.
"""
await _ahandle_event(
self.handlers,
"on_tool_end",
@@ -653,7 +883,11 @@ class AsyncCallbackManagerForToolRun(AsyncRunManager, ToolManagerMixin):
error: Union[Exception, KeyboardInterrupt],
**kwargs: Any,
) -> None:
"""Run when tool errors."""
"""Run when tool errors.
Args:
error (Exception or KeyboardInterrupt): The error.
"""
await _ahandle_event(
self.handlers,
"on_tool_error",
@@ -672,66 +906,92 @@ class CallbackManager(BaseCallbackManager):
self,
serialized: Dict[str, Any],
prompts: List[str],
run_id: Optional[UUID] = None,
**kwargs: Any,
) -> CallbackManagerForLLMRun:
"""Run when LLM starts running."""
if run_id is None:
run_id = uuid4()
) -> List[CallbackManagerForLLMRun]:
"""Run when LLM starts running.
_handle_event(
self.handlers,
"on_llm_start",
"ignore_llm",
serialized,
prompts,
run_id=run_id,
parent_run_id=self.parent_run_id,
tags=self.tags,
**kwargs,
)
Args:
serialized (Dict[str, Any]): The serialized LLM.
prompts (List[str]): The list of prompts.
run_id (UUID, optional): The ID of the run. Defaults to None.
return CallbackManagerForLLMRun(
run_id=run_id,
handlers=self.handlers,
inheritable_handlers=self.inheritable_handlers,
parent_run_id=self.parent_run_id,
tags=self.tags,
inheritable_tags=self.inheritable_tags,
)
Returns:
List[CallbackManagerForLLMRun]: A callback manager for each
prompt as an LLM run.
"""
managers = []
for prompt in prompts:
run_id_ = uuid4()
_handle_event(
self.handlers,
"on_llm_start",
"ignore_llm",
serialized,
[prompt],
run_id=run_id_,
parent_run_id=self.parent_run_id,
tags=self.tags,
**kwargs,
)
managers.append(
CallbackManagerForLLMRun(
run_id=run_id_,
handlers=self.handlers,
inheritable_handlers=self.inheritable_handlers,
parent_run_id=self.parent_run_id,
tags=self.tags,
inheritable_tags=self.inheritable_tags,
)
)
return managers
def on_chat_model_start(
self,
serialized: Dict[str, Any],
messages: List[List[BaseMessage]],
run_id: Optional[UUID] = None,
**kwargs: Any,
) -> CallbackManagerForLLMRun:
"""Run when LLM starts running."""
if run_id is None:
run_id = uuid4()
_handle_event(
self.handlers,
"on_chat_model_start",
"ignore_chat_model",
serialized,
messages,
run_id=run_id,
parent_run_id=self.parent_run_id,
tags=self.tags,
**kwargs,
)
) -> List[CallbackManagerForLLMRun]:
"""Run when LLM starts running.
# Re-use the LLM Run Manager since the outputs are treated
# the same for now
return CallbackManagerForLLMRun(
run_id=run_id,
handlers=self.handlers,
inheritable_handlers=self.inheritable_handlers,
parent_run_id=self.parent_run_id,
tags=self.tags,
inheritable_tags=self.inheritable_tags,
)
Args:
serialized (Dict[str, Any]): The serialized LLM.
messages (List[List[BaseMessage]]): The list of messages.
run_id (UUID, optional): The ID of the run. Defaults to None.
Returns:
List[CallbackManagerForLLMRun]: A callback manager for each
list of messages as an LLM run.
"""
managers = []
for message_list in messages:
run_id_ = uuid4()
_handle_event(
self.handlers,
"on_chat_model_start",
"ignore_chat_model",
serialized,
[message_list],
run_id=run_id_,
parent_run_id=self.parent_run_id,
tags=self.tags,
**kwargs,
)
managers.append(
CallbackManagerForLLMRun(
run_id=run_id_,
handlers=self.handlers,
inheritable_handlers=self.inheritable_handlers,
parent_run_id=self.parent_run_id,
tags=self.tags,
inheritable_tags=self.inheritable_tags,
)
)
return managers
def on_chain_start(
self,
@@ -740,7 +1000,16 @@ class CallbackManager(BaseCallbackManager):
run_id: Optional[UUID] = None,
**kwargs: Any,
) -> CallbackManagerForChainRun:
"""Run when chain starts running."""
"""Run when chain starts running.
Args:
serialized (Dict[str, Any]): The serialized chain.
inputs (Dict[str, Any]): The inputs to the chain.
run_id (UUID, optional): The ID of the run. Defaults to None.
Returns:
CallbackManagerForChainRun: The callback manager for the chain run.
"""
if run_id is None:
run_id = uuid4()
@@ -773,7 +1042,17 @@ class CallbackManager(BaseCallbackManager):
parent_run_id: Optional[UUID] = None,
**kwargs: Any,
) -> CallbackManagerForToolRun:
"""Run when tool starts running."""
"""Run when tool starts running.
Args:
serialized (Dict[str, Any]): The serialized tool.
input_str (str): The input to the tool.
run_id (UUID, optional): The ID of the run. Defaults to None.
parent_run_id (UUID, optional): The ID of the parent run. Defaults to None.
Returns:
CallbackManagerForToolRun: The callback manager for the tool run.
"""
if run_id is None:
run_id = uuid4()
@@ -807,7 +1086,22 @@ class CallbackManager(BaseCallbackManager):
inheritable_tags: Optional[List[str]] = None,
local_tags: Optional[List[str]] = None,
) -> CallbackManager:
"""Configure the callback manager."""
"""Configure the callback manager.
Args:
inheritable_callbacks (Optional[Callbacks], optional): The inheritable
callbacks. Defaults to None.
local_callbacks (Optional[Callbacks], optional): The local callbacks.
Defaults to None.
verbose (bool, optional): Whether to enable verbose mode. Defaults to False.
inheritable_tags (Optional[List[str]], optional): The inheritable tags.
Defaults to None.
local_tags (Optional[List[str]], optional): The local tags.
Defaults to None.
Returns:
CallbackManager: The configured callback manager.
"""
return _configure(
cls,
inheritable_callbacks,
@@ -830,64 +1124,107 @@ class AsyncCallbackManager(BaseCallbackManager):
self,
serialized: Dict[str, Any],
prompts: List[str],
run_id: Optional[UUID] = None,
**kwargs: Any,
) -> AsyncCallbackManagerForLLMRun:
"""Run when LLM starts running."""
if run_id is None:
run_id = uuid4()
) -> List[AsyncCallbackManagerForLLMRun]:
"""Run when LLM starts running.
await _ahandle_event(
self.handlers,
"on_llm_start",
"ignore_llm",
serialized,
prompts,
run_id=run_id,
parent_run_id=self.parent_run_id,
tags=self.tags,
**kwargs,
)
Args:
serialized (Dict[str, Any]): The serialized LLM.
prompts (List[str]): The list of prompts.
run_id (UUID, optional): The ID of the run. Defaults to None.
return AsyncCallbackManagerForLLMRun(
run_id=run_id,
handlers=self.handlers,
inheritable_handlers=self.inheritable_handlers,
parent_run_id=self.parent_run_id,
tags=self.tags,
inheritable_tags=self.inheritable_tags,
)
Returns:
List[AsyncCallbackManagerForLLMRun]: The list of async
callback managers, one for each LLM Run corresponding
to each prompt.
"""
tasks = []
managers = []
for prompt in prompts:
run_id_ = uuid4()
tasks.append(
_ahandle_event(
self.handlers,
"on_llm_start",
"ignore_llm",
serialized,
[prompt],
run_id=run_id_,
parent_run_id=self.parent_run_id,
tags=self.tags,
**kwargs,
)
)
managers.append(
AsyncCallbackManagerForLLMRun(
run_id=run_id_,
handlers=self.handlers,
inheritable_handlers=self.inheritable_handlers,
parent_run_id=self.parent_run_id,
tags=self.tags,
inheritable_tags=self.inheritable_tags,
)
)
await asyncio.gather(*tasks)
return managers
async def on_chat_model_start(
self,
serialized: Dict[str, Any],
messages: List[List[BaseMessage]],
run_id: Optional[UUID] = None,
**kwargs: Any,
) -> Any:
if run_id is None:
run_id = uuid4()
"""Run when LLM starts running.
await _ahandle_event(
self.handlers,
"on_chat_model_start",
"ignore_chat_model",
serialized,
messages,
run_id=run_id,
parent_run_id=self.parent_run_id,
tags=self.tags,
**kwargs,
)
Args:
serialized (Dict[str, Any]): The serialized LLM.
messages (List[List[BaseMessage]]): The list of messages.
run_id (UUID, optional): The ID of the run. Defaults to None.
return AsyncCallbackManagerForLLMRun(
run_id=run_id,
handlers=self.handlers,
inheritable_handlers=self.inheritable_handlers,
parent_run_id=self.parent_run_id,
tags=self.tags,
inheritable_tags=self.inheritable_tags,
)
Returns:
List[AsyncCallbackManagerForLLMRun]: The list of
async callback managers, one for each LLM Run
corresponding to each inner message list.
"""
tasks = []
managers = []
for message_list in messages:
run_id_ = uuid4()
tasks.append(
_ahandle_event(
self.handlers,
"on_chat_model_start",
"ignore_chat_model",
serialized,
[message_list],
run_id=run_id_,
parent_run_id=self.parent_run_id,
tags=self.tags,
**kwargs,
)
)
managers.append(
AsyncCallbackManagerForLLMRun(
run_id=run_id_,
handlers=self.handlers,
inheritable_handlers=self.inheritable_handlers,
parent_run_id=self.parent_run_id,
tags=self.tags,
inheritable_tags=self.inheritable_tags,
)
)
await asyncio.gather(*tasks)
return managers
async def on_chain_start(
self,
@@ -896,7 +1233,17 @@ class AsyncCallbackManager(BaseCallbackManager):
run_id: Optional[UUID] = None,
**kwargs: Any,
) -> AsyncCallbackManagerForChainRun:
"""Run when chain starts running."""
"""Run when chain starts running.
Args:
serialized (Dict[str, Any]): The serialized chain.
inputs (Dict[str, Any]): The inputs to the chain.
run_id (UUID, optional): The ID of the run. Defaults to None.
Returns:
AsyncCallbackManagerForChainRun: The async callback manager
for the chain run.
"""
if run_id is None:
run_id = uuid4()
@@ -929,7 +1276,19 @@ class AsyncCallbackManager(BaseCallbackManager):
parent_run_id: Optional[UUID] = None,
**kwargs: Any,
) -> AsyncCallbackManagerForToolRun:
"""Run when tool starts running."""
"""Run when tool starts running.
Args:
serialized (Dict[str, Any]): The serialized tool.
input_str (str): The input to the tool.
run_id (UUID, optional): The ID of the run. Defaults to None.
parent_run_id (UUID, optional): The ID of the parent run.
Defaults to None.
Returns:
AsyncCallbackManagerForToolRun: The async callback manager
for the tool run.
"""
if run_id is None:
run_id = uuid4()
@@ -963,7 +1322,22 @@ class AsyncCallbackManager(BaseCallbackManager):
inheritable_tags: Optional[List[str]] = None,
local_tags: Optional[List[str]] = None,
) -> AsyncCallbackManager:
"""Configure the callback manager."""
"""Configure the async callback manager.
Args:
inheritable_callbacks (Optional[Callbacks], optional): The inheritable
callbacks. Defaults to None.
local_callbacks (Optional[Callbacks], optional): The local callbacks.
Defaults to None.
verbose (bool, optional): Whether to enable verbose mode. Defaults to False.
inheritable_tags (Optional[List[str]], optional): The inheritable tags.
Defaults to None.
local_tags (Optional[List[str]], optional): The local tags.
Defaults to None.
Returns:
AsyncCallbackManager: The configured async callback manager.
"""
return _configure(
cls,
inheritable_callbacks,
@@ -978,7 +1352,14 @@ T = TypeVar("T", CallbackManager, AsyncCallbackManager)
def env_var_is_set(env_var: str) -> bool:
"""Check if an environment variable is set."""
"""Check if an environment variable is set.
Args:
env_var (str): The name of the environment variable.
Returns:
bool: True if the environment variable is set, False otherwise.
"""
return env_var in os.environ and os.environ[env_var] not in (
"",
"0",
@@ -995,7 +1376,22 @@ def _configure(
inheritable_tags: Optional[List[str]] = None,
local_tags: Optional[List[str]] = None,
) -> T:
"""Configure the callback manager."""
"""Configure the callback manager.
Args:
callback_manager_cls (Type[T]): The callback manager class.
inheritable_callbacks (Optional[Callbacks], optional): The inheritable
callbacks. Defaults to None.
local_callbacks (Optional[Callbacks], optional): The local callbacks.
Defaults to None.
verbose (bool, optional): Whether to enable verbose mode. Defaults to False.
inheritable_tags (Optional[List[str]], optional): The inheritable tags.
Defaults to None.
local_tags (Optional[List[str]], optional): The local tags. Defaults to None.
Returns:
T: The configured callback manager.
"""
callback_manager = callback_manager_cls(handlers=[])
if inheritable_callbacks or local_callbacks:
if isinstance(inheritable_callbacks, list) or inheritable_callbacks is None:
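
One behavioral note on this file's `on_llm_start` / `on_chat_model_start` changes: they now return a list of run managers, one per prompt (or per inner message list), instead of a single manager. A rough sketch of the new contract, using an empty handler setup just to show the shapes:

from langchain.callbacks.manager import CallbackManager
from langchain.schema import Generation, LLMResult

cm = CallbackManager.configure()
managers = cm.on_llm_start({"name": "demo-llm"}, ["prompt one", "prompt two"])
assert len(managers) == 2  # one CallbackManagerForLLMRun per prompt
for manager in managers:
    manager.on_llm_end(LLMResult(generations=[[Generation(text="ok")]]))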

View File

@@ -118,7 +118,7 @@ class MlflowLogger:
Parameters:
name (str): Name of the run.
experiment (str): Name of the experiment.
tags (str): Tags to be attached for the run.
tags (dict): Tags to be attached for the run.
tracking_uri (str): MLflow tracking server uri.
This handler implements the helper functions to initialize,
@@ -223,7 +223,7 @@ class MlflowCallbackHandler(BaseMetadataCallbackHandler, BaseCallbackHandler):
Parameters:
name (str): Name of the run.
experiment (str): Name of the experiment.
tags (str): Tags to be attached for the run.
tags (dict): Tags to be attached for the run.
tracking_uri (str): MLflow tracking server uri.
This handler will utilize the associated callback method called and formats

View File

@@ -1,8 +1,8 @@
"""Callback Handler that prints to std out."""
from typing import Any, Dict, List, Optional, Union
from typing import Any, Dict, List
from langchain.callbacks.base import BaseCallbackHandler
from langchain.schema import AgentAction, AgentFinish, LLMResult
from langchain.schema import LLMResult
MODEL_COST_PER_1K_TOKENS = {
# GPT-4 input
@@ -32,6 +32,7 @@ MODEL_COST_PER_1K_TOKENS = {
"gpt-3.5-turbo-16k-completion": 0.004,
"gpt-3.5-turbo-16k-0613-completion": 0.004,
# Others
"gpt-35-turbo": 0.002, # Azure OpenAI version of ChatGPT
"text-ada-001": 0.0004,
"ada": 0.0004,
"text-babbage-001": 0.0005,
@@ -152,64 +153,6 @@ class OpenAICallbackHandler(BaseCallbackHandler):
self.prompt_tokens += prompt_tokens
self.completion_tokens += completion_tokens
def on_llm_error(
self, error: Union[Exception, KeyboardInterrupt], **kwargs: Any
) -> None:
"""Do nothing."""
pass
def on_chain_start(
self, serialized: Dict[str, Any], inputs: Dict[str, Any], **kwargs: Any
) -> None:
"""Print out that we are entering a chain."""
pass
def on_chain_end(self, outputs: Dict[str, Any], **kwargs: Any) -> None:
"""Print out that we finished a chain."""
pass
def on_chain_error(
self, error: Union[Exception, KeyboardInterrupt], **kwargs: Any
) -> None:
"""Do nothing."""
pass
def on_tool_start(
self,
serialized: Dict[str, Any],
input_str: str,
**kwargs: Any,
) -> None:
"""Print out the log in specified color."""
pass
def on_tool_end(
self,
output: str,
color: Optional[str] = None,
observation_prefix: Optional[str] = None,
llm_prefix: Optional[str] = None,
**kwargs: Any,
) -> None:
"""If not the final action, print out observation."""
pass
def on_tool_error(
self, error: Union[Exception, KeyboardInterrupt], **kwargs: Any
) -> None:
"""Do nothing."""
pass
def on_agent_action(self, action: AgentAction, **kwargs: Any) -> Any:
"""Run on agent action."""
pass
def on_agent_finish(
self, finish: AgentFinish, color: Optional[str] = None, **kwargs: Any
) -> None:
"""Run on agent end."""
pass
def __copy__(self) -> "OpenAICallbackHandler":
"""Return a copy of the callback handler."""
return self
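
With the Azure `gpt-35-turbo` entry added above ($0.002 per 1K tokens), a back-of-the-envelope cost check is plain arithmetic rather than a library call:

completion_tokens = 1200  # tokens billed on Azure's gpt-35-turbo
cost = 0.002 * completion_tokens / 1000
print(f"${cost:.4f}")  # $0.0024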

View File

@@ -101,26 +101,37 @@ class BaseChatModel(BaseLanguageModel, ABC):
tags,
self.tags,
)
run_manager = callback_manager.on_chat_model_start(
run_managers = callback_manager.on_chat_model_start(
dumpd(self), messages, invocation_params=params, options=options
)
try:
results = [
self._generate_with_cache(
m, stop=stop, run_manager=run_manager, **kwargs
results = []
for i, m in enumerate(messages):
try:
results.append(
self._generate_with_cache(
m,
stop=stop,
run_manager=run_managers[i] if run_managers else None,
**kwargs,
)
)
for m in messages
]
except (KeyboardInterrupt, Exception) as e:
run_manager.on_llm_error(e)
raise e
except (KeyboardInterrupt, Exception) as e:
if run_managers:
run_managers[i].on_llm_error(e)
raise e
flattened_outputs = [
LLMResult(generations=[res.generations], llm_output=res.llm_output)
for res in results
]
llm_output = self._combine_llm_outputs([res.llm_output for res in results])
generations = [res.generations for res in results]
output = LLMResult(generations=generations, llm_output=llm_output)
run_manager.on_llm_end(output)
if run_manager:
output.run = RunInfo(run_id=run_manager.run_id)
if run_managers:
run_infos = []
for manager, flattened_output in zip(run_managers, flattened_outputs):
manager.on_llm_end(flattened_output)
run_infos.append(RunInfo(run_id=manager.run_id))
output.run = run_infos
return output
async def agenerate(
@@ -143,28 +154,62 @@ class BaseChatModel(BaseLanguageModel, ABC):
tags,
self.tags,
)
run_manager = await callback_manager.on_chat_model_start(
run_managers = await callback_manager.on_chat_model_start(
dumpd(self), messages, invocation_params=params, options=options
)
try:
results = await asyncio.gather(
*[
self._agenerate_with_cache(
m, stop=stop, run_manager=run_manager, **kwargs
)
for m in messages
]
)
except (KeyboardInterrupt, Exception) as e:
await run_manager.on_llm_error(e)
raise e
results = await asyncio.gather(
*[
self._agenerate_with_cache(
m,
stop=stop,
run_manager=run_managers[i] if run_managers else None,
**kwargs,
)
for i, m in enumerate(messages)
],
return_exceptions=True,
)
exceptions = []
for i, res in enumerate(results):
if isinstance(res, Exception):
if run_managers:
await run_managers[i].on_llm_error(res)
exceptions.append(res)
if exceptions:
if run_managers:
await asyncio.gather(
*[
run_manager.on_llm_end(
LLMResult(
generations=[res.generations], llm_output=res.llm_output
)
)
for run_manager, res in zip(run_managers, results)
if not isinstance(res, Exception)
]
)
raise exceptions[0]
flattened_outputs = [
LLMResult(generations=[res.generations], llm_output=res.llm_output)
for res in results
]
llm_output = self._combine_llm_outputs([res.llm_output for res in results])
generations = [res.generations for res in results]
output = LLMResult(generations=generations, llm_output=llm_output)
await run_manager.on_llm_end(output)
if run_manager:
output.run = RunInfo(run_id=run_manager.run_id)
await asyncio.gather(
*[
run_manager.on_llm_end(flattened_output)
for run_manager, flattened_output in zip(
run_managers, flattened_outputs
)
]
)
if run_managers:
output.run = [
RunInfo(run_id=run_manager.run_id) for run_manager in run_managers
]
return output
def generate_prompt(

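Because each message list now gets its own callback run, a short sketch of what callers see after this change (assumes ChatOpenAI and valid OpenAI credentials):

from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

chat = ChatOpenAI(temperature=0)
result = chat.generate(
    [[HumanMessage(content="Say hello")], [HumanMessage(content="Say goodbye")]]
)
# One child run per input message list: result.run is now a list of RunInfo
# objects instead of a single RunInfo.
print(len(result.generations), [info.run_id for info in result.run])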
View File

@@ -184,6 +184,16 @@ class ChatOpenAI(BaseChatModel):
"""Number of chat completions to generate for each prompt."""
max_tokens: Optional[int] = None
"""Maximum number of tokens to generate."""
tiktoken_model_name: Optional[str] = None
"""The model name to pass to tiktoken when using this class.
Tiktoken is used to count the number of tokens in documents to constrain
them to be under a certain limit. By default, when set to None, this will
be the same as the model name. However, there are some cases
where you may want to use this chat model class with a model name not
supported by tiktoken. This can include when using Azure deployments or
when using one of the many model providers that expose an OpenAI-like
API but with different models. In those cases, in order to avoid erroring
when tiktoken is called, you can specify a model name to use here."""
class Config:
"""Configuration for this pydantic object."""
@@ -448,15 +458,18 @@ class ChatOpenAI(BaseChatModel):
def _get_encoding_model(self) -> Tuple[str, tiktoken.Encoding]:
tiktoken_ = _import_tiktoken()
model = self.model_name
if model == "gpt-3.5-turbo":
# gpt-3.5-turbo may change over time.
# Returning num tokens assuming gpt-3.5-turbo-0301.
model = "gpt-3.5-turbo-0301"
elif model == "gpt-4":
# gpt-4 may change over time.
# Returning num tokens assuming gpt-4-0314.
model = "gpt-4-0314"
if self.tiktoken_model_name is not None:
model = self.tiktoken_model_name
else:
model = self.model_name
if model == "gpt-3.5-turbo":
# gpt-3.5-turbo may change over time.
# Returning num tokens assuming gpt-3.5-turbo-0301.
model = "gpt-3.5-turbo-0301"
elif model == "gpt-4":
# gpt-4 may change over time.
# Returning num tokens assuming gpt-4-0314.
model = "gpt-4-0314"
# Returns the number of tokens used by a list of messages.
try:
encoding = tiktoken_.encoding_for_model(model)
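A sketch of the intended use of tiktoken_model_name: the deployment name below is hypothetical and unknown to tiktoken, so token counting is redirected to a model tiktoken does recognize.

from langchain.chat_models import ChatOpenAI

chat = ChatOpenAI(
    model_name="my-proxy-gpt-35",          # hypothetical provider-specific name
    tiktoken_model_name="gpt-3.5-turbo",   # used only for token counting
)
print(chat.get_num_tokens("How many tokens is this sentence?"))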

View File

@@ -68,6 +68,7 @@ from langchain.document_loaders.mastodon import MastodonTootsLoader
from langchain.document_loaders.max_compute import MaxComputeLoader
from langchain.document_loaders.mediawikidump import MWDumpLoader
from langchain.document_loaders.merge import MergedDataLoader
from langchain.document_loaders.mhtml import MHTMLLoader
from langchain.document_loaders.modern_treasury import ModernTreasuryLoader
from langchain.document_loaders.notebook import NotebookLoader
from langchain.document_loaders.notion import NotionDirectoryLoader
@@ -97,6 +98,7 @@ from langchain.document_loaders.readthedocs import ReadTheDocsLoader
from langchain.document_loaders.recursive_url_loader import RecusiveUrlLoader
from langchain.document_loaders.reddit import RedditPostsLoader
from langchain.document_loaders.roam import RoamLoader
from langchain.document_loaders.rst import UnstructuredRSTLoader
from langchain.document_loaders.rtf import UnstructuredRTFLoader
from langchain.document_loaders.s3_directory import S3DirectoryLoader
from langchain.document_loaders.s3_file import S3FileLoader
@@ -204,6 +206,7 @@ __all__ = [
"MathpixPDFLoader",
"MaxComputeLoader",
"MergedDataLoader",
"MHTMLLoader",
"ModernTreasuryLoader",
"NotebookLoader",
"NotionDBLoader",
@@ -261,6 +264,7 @@ __all__ = [
"UnstructuredODTLoader",
"UnstructuredPDFLoader",
"UnstructuredPowerPointLoader",
"UnstructuredRSTLoader",
"UnstructuredRTFLoader",
"UnstructuredURLLoader",
"UnstructuredWordDocumentLoader",

View File

@@ -0,0 +1,69 @@
"""Loader to load MHTML files, enriching metadata with page title."""
import email
import logging
from typing import Dict, List, Union
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader
logger = logging.getLogger(__name__)
class MHTMLLoader(BaseLoader):
"""Loader that uses beautiful soup to parse HTML files."""
def __init__(
self,
file_path: str,
open_encoding: Union[str, None] = None,
bs_kwargs: Union[dict, None] = None,
get_text_separator: str = "",
) -> None:
"""Initialise with path, and optionally, file encoding to use, and any kwargs
to pass to the BeautifulSoup object."""
try:
import bs4 # noqa:F401
except ImportError:
raise ValueError(
"beautifulsoup4 package not found, please install it with "
"`pip install beautifulsoup4`"
)
self.file_path = file_path
self.open_encoding = open_encoding
if bs_kwargs is None:
bs_kwargs = {"features": "lxml"}
self.bs_kwargs = bs_kwargs
self.get_text_separator = get_text_separator
def load(self) -> List[Document]:
from bs4 import BeautifulSoup
"""Load MHTML document into document objects."""
with open(self.file_path, "r", encoding=self.open_encoding) as f:
message = email.message_from_string(f.read())
parts = message.get_payload()
if type(parts) is not list:
parts = [message]
for part in parts:
if part.get_content_type() == "text/html":
html = part.get_payload(decode=True).decode()
soup = BeautifulSoup(html, **self.bs_kwargs)
text = soup.get_text(self.get_text_separator)
if soup.title:
title = str(soup.title.string)
else:
title = ""
metadata: Dict[str, Union[str, None]] = {
"source": self.file_path,
"title": title,
}
return [Document(page_content=text, metadata=metadata)]
return []
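A usage sketch for the new loader; the file path is a placeholder, and beautifulsoup4 (plus lxml for the default parser) must be installed:

from langchain.document_loaders import MHTMLLoader

loader = MHTMLLoader("saved_page.mhtml")  # placeholder path to a saved MHTML file
docs = loader.load()
# A single Document whose metadata carries the page title and source path.
print(docs[0].metadata)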

View File

@@ -48,13 +48,13 @@ class NotionDBLoader(BaseLoader):
Returns:
List[Document]: List of documents.
"""
page_ids = self._retrieve_page_ids()
page_summaries = self._retrieve_page_summaries()
return list(self.load_page(page_id) for page_id in page_ids)
return list(self.load_page(page_summary) for page_summary in page_summaries)
def _retrieve_page_ids(
def _retrieve_page_summaries(
self, query_dict: Dict[str, Any] = {"page_size": 100}
) -> List[str]:
) -> List[Dict[str, Any]]:
"""Get all the pages from a Notion database."""
pages: List[Dict[str, Any]] = []
@@ -72,18 +72,16 @@ class NotionDBLoader(BaseLoader):
query_dict["start_cursor"] = data.get("next_cursor")
page_ids = [page["id"] for page in pages]
return pages
return page_ids
def load_page(self, page_id: str) -> Document:
def load_page(self, page_summary: Dict[str, Any]) -> Document:
"""Read a page."""
data = self._request(PAGE_URL.format(page_id=page_id))
page_id = page_summary["id"]
# load properties as metadata
metadata: Dict[str, Any] = {}
for prop_name, prop_data in data["properties"].items():
for prop_name, prop_data in page_summary["properties"].items():
prop_type = prop_data["type"]
if prop_type == "rich_text":

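The page summaries returned by the database query already contain the properties, so only the block content needs a follow-up request per page. A usage sketch with placeholder credentials:

from langchain.document_loaders import NotionDBLoader

loader = NotionDBLoader(
    integration_token="secret_xxx",                   # placeholder Notion integration token
    database_id="00000000000000000000000000000000",   # placeholder database id
)
docs = loader.load()  # metadata is built from each page summary's properties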
View File

@@ -0,0 +1,22 @@
"""Loader that loads RST files."""
from typing import Any, List
from langchain.document_loaders.unstructured import (
UnstructuredFileLoader,
validate_unstructured_version,
)
class UnstructuredRSTLoader(UnstructuredFileLoader):
"""Loader that uses unstructured to load RST files."""
def __init__(
self, file_path: str, mode: str = "single", **unstructured_kwargs: Any
):
validate_unstructured_version(min_unstructured_version="0.7.5")
super().__init__(file_path=file_path, mode=mode, **unstructured_kwargs)
def _get_elements(self) -> List:
from unstructured.partition.rst import partition_rst
return partition_rst(filename=self.file_path, **self.unstructured_kwargs)
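A usage sketch; requires unstructured >= 0.7.5, and the path is a placeholder:

from langchain.document_loaders import UnstructuredRSTLoader

loader = UnstructuredRSTLoader("README.rst", mode="elements")  # placeholder path
docs = loader.load()
print(docs[0].metadata.get("category"), docs[0].page_content[:80])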

View File

@@ -16,6 +16,7 @@ class UnstructuredURLLoader(BaseLoader):
urls: List[str],
continue_on_failure: bool = True,
mode: str = "single",
show_progress_bar: bool = False,
**unstructured_kwargs: Any,
):
"""Initialize with file path."""
@@ -51,6 +52,7 @@ class UnstructuredURLLoader(BaseLoader):
self.continue_on_failure = continue_on_failure
self.headers = headers
self.unstructured_kwargs = unstructured_kwargs
self.show_progress_bar = show_progress_bar
def _validate_mode(self, mode: str) -> None:
_valid_modes = {"single", "elements"}
@@ -83,7 +85,21 @@ class UnstructuredURLLoader(BaseLoader):
from unstructured.partition.html import partition_html
docs: List[Document] = list()
for url in self.urls:
if self.show_progress_bar:
try:
from tqdm import tqdm
except ImportError as e:
raise ImportError(
"Package tqdm must be installed if show_progress_bar=True. "
"Please install with 'pip install tqdm' or set "
"show_progress_bar=False."
) from e
urls = tqdm(self.urls)
else:
urls = self.urls
for url in urls:
try:
if self.__is_non_html_available():
if self.__is_headers_available_for_non_html():

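A usage sketch of the new flag; tqdm must be installed when show_progress_bar=True, and the URLs are placeholders:

from langchain.document_loaders import UnstructuredURLLoader

loader = UnstructuredURLLoader(
    urls=["https://example.com/a", "https://example.com/b"],  # placeholder URLs
    continue_on_failure=True,
    show_progress_bar=True,  # wraps the URL list in tqdm while loading
)
docs = loader.load()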
View File

@@ -50,6 +50,9 @@ class WebBaseLoader(BaseLoader):
requests_kwargs: Dict[str, Any] = {}
"""kwargs for requests"""
bs_get_text_kwargs: Dict[str, Any] = {}
"""kwargs for beatifulsoup4 get_text"""
def __init__(
self,
web_path: Union[str, List[str]],
@@ -201,7 +204,7 @@ class WebBaseLoader(BaseLoader):
"""Lazy load text from the url(s) in web_path."""
for path in self.web_paths:
soup = self._scrape(path)
text = soup.get_text()
text = soup.get_text(**self.bs_get_text_kwargs)
metadata = _build_metadata(soup, path)
yield Document(page_content=text, metadata=metadata)
@@ -216,7 +219,7 @@ class WebBaseLoader(BaseLoader):
docs = []
for i in range(len(results)):
soup = results[i]
text = soup.get_text()
text = soup.get_text(**self.bs_get_text_kwargs)
metadata = _build_metadata(soup, self.web_paths[i])
docs.append(Document(page_content=text, metadata=metadata))
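A sketch of how bs_get_text_kwargs is forwarded to BeautifulSoup.get_text(); the URL is a placeholder:

from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://example.com")  # placeholder URL
# Forwarded verbatim to soup.get_text(): join text nodes with newlines and strip them.
loader.bs_get_text_kwargs = {"separator": "\n", "strip": True}
docs = loader.load()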

View File

@@ -40,7 +40,7 @@ class WhatsAppChatLoader(BaseLoader):
(?:
:\d{2}
)?
(?:[ _](?:AM|PM))?
(?:[\s_](?:AM|PM))?
)
\]?
[\s-]*
@@ -50,7 +50,9 @@ class WhatsAppChatLoader(BaseLoader):
(.+)
"""
for line in lines:
result = re.match(message_line_regex, line.strip(), flags=re.VERBOSE)
result = re.match(
message_line_regex, line.strip(), flags=re.VERBOSE | re.IGNORECASE
)
if result:
date, sender, text = result.groups()
text_content += concatenate_rows(date, sender, text)
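With the \s character class and re.IGNORECASE, exports that use a non-breaking space or lowercase "am"/"pm" before the meridiem now parse as well. A usage sketch with a placeholder export path:

from langchain.document_loaders import WhatsAppChatLoader

docs = WhatsAppChatLoader("_chat.txt").load()  # placeholder WhatsApp export file
print(docs[0].page_content[:200])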

View File

@@ -170,6 +170,16 @@ class OpenAIEmbeddings(BaseModel, Embeddings):
request_timeout: Optional[Union[float, Tuple[float, float]]] = None
"""Timeout in seconds for the OpenAPI request."""
headers: Any = None
tiktoken_model_name: Optional[str] = None
"""The model name to pass to tiktoken when using this class.
Tiktoken is used to count the number of tokens in documents to constrain
them to be under a certain limit. By default, when set to None, this will
be the same as the embedding model name. However, there are some cases
where you may want to use this Embedding class with a model name not
supported by tiktoken. This can include when using Azure embeddings or
when using one of the many model providers that expose an OpenAI-like
API but with different models. In those cases, in order to avoid erroring
when tiktoken is called, you can specify a model name to use here."""
class Config:
"""Configuration for this pydantic object."""
@@ -265,7 +275,13 @@ class OpenAIEmbeddings(BaseModel, Embeddings):
tokens = []
indices = []
encoding = tiktoken.model.encoding_for_model(self.model)
model_name = self.tiktoken_model_name or self.model
try:
encoding = tiktoken.encoding_for_model(model_name)
except KeyError:
logger.warning("Warning: model not found. Using cl100k_base encoding.")
model = "cl100k_base"
encoding = tiktoken.get_encoding(model)
for i, text in enumerate(texts):
if self.model.endswith("001"):
# See: https://github.com/openai/openai-python/issues/418#issuecomment-1525939500
@@ -329,7 +345,13 @@ class OpenAIEmbeddings(BaseModel, Embeddings):
tokens = []
indices = []
encoding = tiktoken.model.encoding_for_model(self.model)
model_name = self.tiktoken_model_name or self.model
try:
encoding = tiktoken.encoding_for_model(model_name)
except KeyError:
logger.warning("Warning: model not found. Using cl100k_base encoding.")
model = "cl100k_base"
encoding = tiktoken.get_encoding(model)
for i, text in enumerate(texts):
if self.model.endswith("001"):
# See: https://github.com/openai/openai-python/issues/418#issuecomment-1525939500

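A sketch of the embeddings counterpart: the deployment name below is hypothetical, so token counting falls back to tiktoken_model_name (and to cl100k_base if even that lookup fails):

from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="my-embedding-deployment",               # hypothetical provider-specific name
    tiktoken_model_name="text-embedding-ada-002",  # used only for token counting
)
vectors = embeddings.embed_documents(["hello world", "goodbye world"])
print(len(vectors), len(vectors[0]))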
View File

@@ -1 +1,35 @@
"""[BETA] Functionality relating to evaluation."""
"""Functionality relating to evaluation.
This module contains off-the-shelf evaluation chains for
grading the output of LangChain primitives such as LLMs and Chains.
Some common use cases for evaluation include:
- Grading accuracy of a response against ground truth answers: QAEvalChain
- Comparing the output of two models: PairwiseStringEvalChain
- Judging the efficacy of an agent's tool usage: TrajectoryEvalChain
- Checking whether an output complies with a set of criteria: CriteriaEvalChain
This module also contains low level APIs for making more evaluators for your
custom evaluation task. These include:
- StringEvaluator: Evaluates an output string against a reference and/or
with input context.
- PairwiseStringEvaluator: Evaluates two strings against each other.
"""
from langchain.evaluation.agents.trajectory_eval_chain import TrajectoryEvalChain
from langchain.evaluation.comparison import PairwiseStringEvalChain
from langchain.evaluation.criteria.eval_chain import CriteriaEvalChain
from langchain.evaluation.qa import ContextQAEvalChain, CotQAEvalChain, QAEvalChain
from langchain.evaluation.schema import PairwiseStringEvaluator, StringEvaluator
__all__ = [
"PairwiseStringEvalChain",
"QAEvalChain",
"CotQAEvalChain",
"ContextQAEvalChain",
"StringEvaluator",
"PairwiseStringEvaluator",
"TrajectoryEvalChain",
"CriteriaEvalChain",
]

View File

@@ -16,6 +16,10 @@ class TrajectoryEval(NamedTuple):
class TrajectoryOutputParser(BaseOutputParser):
@property
def _type(self) -> str:
return "agent_trajectory"
def parse(self, text: str) -> TrajectoryEval:
if "Score:" not in text:
raise OutputParserException(

View File

@@ -0,0 +1,34 @@
"""Comparison evaluators.
This module contains evaluators for comparing the output of two models,
be they LLMs, Chains, or otherwise. This can be used for scoring
preferences, measuring similarity / semantic equivalence between outputs,
or any other comparison task.
Example:
>>> from langchain.chat_models import ChatOpenAI
>>> from langchain.evaluation.comparison import PairwiseStringEvalChain
>>> llm = ChatOpenAI(temperature=0)
>>> chain = PairwiseStringEvalChain.from_llm(llm=llm)
>>> result = chain.evaluate_string_pairs(
... input = "What is the chemical formula for water?",
... output_a = "H2O",
... output_b = (
... "The chemical formula for water is H2O, which means"
... " there are two hydrogen atoms and one oxygen atom."
... referenc = "The chemical formula for water is H2O.",
... )
>>> print(result["text"])
# {
# "value": "B",
# "comment": "Both responses accurately state"
# " that the chemical formula for water is H2O."
# " However, Response B provides additional information"
# . " by explaining what the formula means.\n[[B]]"
# }
"""
from langchain.evaluation.comparison.eval_chain import (
PairwiseStringEvalChain,
)
__all__ = ["PairwiseStringEvalChain"]

View File

@@ -0,0 +1,205 @@
"""Base classes for comparing the output of two models."""
from __future__ import annotations
from typing import Any, Optional
from pydantic import Field
from langchain.base_language import BaseLanguageModel
from langchain.callbacks.manager import Callbacks
from langchain.chains.llm import LLMChain
from langchain.evaluation.comparison.prompt import PROMPT, PROMPT_WITH_REFERENCE
from langchain.prompts.prompt import PromptTemplate
from langchain.schema import BaseOutputParser
class PairwiseStringResultOutputParser(BaseOutputParser[dict]):
"""A parser for the output of the PairwiseStringEvalChain."""
@property
def _type(self) -> str:
return "pairwise_string_result"
def parse(self, text: str) -> Any:
"""Parse the output text.
Args:
text (str): The output text to parse.
Returns:
Any: The parsed output.
"""
reasoning, verdict = text.strip().rsplit("\n", maxsplit=1)
verdict = verdict.strip("[").strip("]")
if verdict not in {"A", "B", "C"}:
raise ValueError(
f"Invalid verdict: {verdict}. "
"Verdict must be one of 'A', 'B', or 'C'."
)
# C means the models are tied. Return 'None' meaning no preference
verdict_ = None if verdict == "C" else verdict
score = {
"A": 1,
"B": 0,
None: 0.5,
}.get(verdict_)
return {
"reasoning": reasoning,
"value": verdict_,
"score": score,
}
class PairwiseStringEvalChain(LLMChain):
"""A chain for comparing the output of two models.
Example:
>>> from langchain.chat_models import ChatOpenAI
>>> from langchain.evaluation.comparison import PairwiseStringEvalChain
>>> llm = ChatOpenAI(temperature=0)
>>> chain = PairwiseStringEvalChain.from_llm(llm=llm)
>>> result = chain.evaluate_string_pairs(
... input = "What is the chemical formula for water?",
... output_a = "H2O",
... output_b = (
... "The chemical formula for water is H2O, which means"
... " there are two hydrogen atoms and one oxygen atom."
... referenc = "The chemical formula for water is H2O.",
... )
>>> print(result["text"])
# {
# "value": "B",
# "comment": "Both responses accurately state"
# " that the chemical formula for water is H2O."
# " However, Response B provides additional information"
# . " by explaining what the formula means.\n[[B]]"
# }
"""
output_parser: BaseOutputParser = Field(
default_factory=PairwiseStringResultOutputParser
)
@classmethod
def from_llm(
cls,
*,
llm: BaseLanguageModel,
prompt: Optional[PromptTemplate] = None,
require_reference: bool = False,
**kwargs: Any,
) -> PairwiseStringEvalChain:
"""Initialize the PairwiseStringEvalChain from an LLM.
Args:
llm (BaseLanguageModel): The LLM to use.
prompt (PromptTemplate, optional): The prompt to use.
require_reference (bool, optional): Whether to require a reference
string. Defaults to False.
**kwargs (Any): Additional keyword arguments.
Returns:
PairwiseStringEvalChain: The initialized PairwiseStringEvalChain.
"""
expected_input_vars = {"output_a", "output_b", "input"}
if prompt is None:
if require_reference:
expected_input_vars.add("reference")
prompt_ = PROMPT_WITH_REFERENCE
else:
prompt_ = PROMPT
else:
if require_reference:
expected_input_vars.add("reference")
prompt_ = prompt
if expected_input_vars != set(prompt_.input_variables):
raise ValueError(
f"Input variables should be {expected_input_vars}, "
f"but got {prompt_.input_variables}"
)
return cls(llm=llm, prompt=prompt_, **kwargs)
def _prepare_input(
self, output_a: str, output_b: str, input: str, reference: Optional[str]
) -> dict:
input_ = {
"output_a": output_a,
"output_b": output_b,
"input": input,
}
if reference is not None and "reference" in self.prompt.input_variables:
input_["reference"] = reference
return input_
def evaluate_string_pairs(
self,
*,
output_a: str,
output_b: str,
input: str,
reference: Optional[str] = None,
callbacks: Callbacks = None,
**kwargs: Any,
) -> dict:
"""Evaluate whether output A is preferred to output B.
Args:
output_a (str): The output string from the first model.
output_b (str): The output string from the second model.
input (str): The input or task string.
callbacks (Callbacks, optional): The callbacks to use.
reference (str, optional): The reference string, if any.
**kwargs (Any): Additional keyword arguments.
Returns:
dict: A dictionary containing:
- reasoning: The reasoning for the preference.
- value: The preference value, which is either 'A', 'B', or None
for no preference.
- score: The preference score, which is 1 for 'A', 0 for 'B',
and 0.5 for None.
"""
input_ = self._prepare_input(output_a, output_b, input, reference)
result = self(
inputs=input_,
callbacks=callbacks,
**kwargs,
)
return result["text"]
async def aevaluate_string_pairs(
self,
*,
output_a: str,
output_b: str,
input: str,
reference: Optional[str] = None,
callbacks: Callbacks = None,
**kwargs: Any,
) -> dict:
"""Asynchronously evaluate whether output A is preferred to output B.
Args:
output_a (str): The output string from the first model.
output_b (str): The output string from the second model.
input (str): The input or task string.
callbacks (Callbacks, optional): The callbacks to use.
reference (str, optional): The reference string, if any.
**kwargs (Any): Additional keyword arguments.
Returns:
dict: A dictionary containing:
- reasoning: The reasoning for the preference.
- value: The preference value, which is either 'A', 'B', or None
for no preference.
- score: The preference score, which is 1 for 'A', 0 for 'B',
and 0.5 for None.
"""
input_ = self._prepare_input(output_a, output_b, input, reference)
result = await self.acall(
inputs=input_,
callbacks=callbacks,
**kwargs,
)
return result["text"]
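A short sketch of the reference-labelled variant: require_reference=True selects PROMPT_WITH_REFERENCE and makes the reference input mandatory (valid OpenAI credentials assumed).

from langchain.chat_models import ChatOpenAI
from langchain.evaluation.comparison import PairwiseStringEvalChain

llm = ChatOpenAI(temperature=0)
chain = PairwiseStringEvalChain.from_llm(llm=llm, require_reference=True)
result = chain.evaluate_string_pairs(
    input="What is the chemical formula for water?",
    output_a="H2O",
    output_b="HO2",
    reference="The chemical formula for water is H2O.",
)
# The parsed dict contains the reasoning, a value of "A"/"B"/None, and a 1/0/0.5 score.
print(result["value"], result["score"])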

View File

@@ -0,0 +1,64 @@
"""Prompts for comparing the outputs of two models for a given question.
This prompt is used to compare two responses and evaluate which one best follows the instructions
and answers the question. The prompt is based on the paper from
Zheng et al., https://arxiv.org/abs/2306.05685
"""
# flake8: noqa
from langchain.prompts import PromptTemplate
template = """Act as a fair judge and rate the two responses to the question below.\
Choose the response that best followed the instructions and answered the question.\
Your assessment should weigh helpfulness, relevance, accuracy, depth, creativity, and detail.\
Start by comparing both responses and give a brief rationale.\
Avoid bias from the order of presentation or response length.
After giving your rationale, make your final decision using this format:\
"[[A]]" if assistant A is better, "[[B]]" if assistant B is better,\
and "[[C]]" for a tie. Finally, repeat the decision again on its own on a new line.
[QUESTION]
{input}
[/QUESTION]
[RESPONSE A]
{output_a}
[/RESPONSE A]
[RESPONSE B]
{output_b}
[/RESPONSE B]"""
PROMPT = PromptTemplate(
input_variables=["input", "output_a", "output_b"], template=template
)
template = """Act as a fair judge and rate the two responses to the question below.\
Choose the response that best followed the instructions and answered the question.\
Your assessment should weigh helpfulness, relevance, accuracy, depth, creativity, and detail.\
Start by comparing both responses and give a brief rationale.\
Avoid bias from the order of presentation or response length.\
Weigh accuracy based on the following ground truth reference\
answer to the question:
[REFERENCE]
{reference}
[/REFERENCE]
After giving your rationale, make your final decision using this format:\
"[[A]]" if assistant A is better, "[[B]]" if assistant B is better,\
and "[[C]]" for a tie. Finally, repeat the decision again on its own on a new line.
[QUESTION]
{input}
[/QUESTION]
[RESPONSE A]
{output_a}
[/RESPONSE A]
[RESPONSE B]
{output_b}
[/RESPONSE B]"""
PROMPT_WITH_REFERENCE = PromptTemplate(
input_variables=["input", "output_a", "output_b", "reference"], template=template
)

View File

@@ -0,0 +1,48 @@
"""Criteria or rubric based evaluators.
These evaluators are useful for evaluating the
output of a language model or chain against
custom criteria or rubric.
Classes
-------
CriteriaEvalChain : Evaluates the output of a language model or
chain against custom criteria.
Examples
--------
Using a pre-defined criterion:
>>> from langchain.llms import OpenAI
>>> from langchain.evaluation.criteria import CriteriaEvalChain
>>> llm = OpenAI()
>>> criteria = "conciseness"
>>> chain = CriteriaEvalChain.from_llm(llm=llm, criteria=criteria)
>>> chain.evaluate_strings(
prediction="The answer is 42.",
reference="42",
input="What is the answer to life, the universe, and everything?",
)
Using a custom criterion:
>>> from langchain.llms import OpenAI
>>> from langchain.evaluation.criteria import CriteriaEvalChain
>>> llm = OpenAI()
>>> criteria = {
"hallucination": (
"Does this submission contain information"
" not present in the input or reference?"
),
}
>>> chain = CriteriaEvalChain.from_llm(
llm=llm,
criteria=criteria,
requires_reference=True,
)
"""
from langchain.evaluation.criteria.eval_chain import CriteriaEvalChain
__all__ = ["CriteriaEvalChain"]

View File

@@ -0,0 +1,341 @@
from __future__ import annotations
from typing import Any, Dict, List, Mapping, Optional, Sequence, Union
from pydantic import Field
from langchain.base_language import BaseLanguageModel
from langchain.chains.llm import LLMChain
from langchain.evaluation.criteria.prompt import PROMPT, PROMPT_WITH_REFERENCES
from langchain.prompts.base import BasePromptTemplate
from langchain.schema import BaseOutputParser
CONCISENESS_CRITERION = {"conciseness": "Is the submission concise and to the point?"}
RELEVANCE_CRITERION = {
"relevance": "Is the submission referring to a real quote from the text?"
}
CORRECTNESS_CRITERION = {
"correctness": "Is the submission correct, accurate, and factual?"
}
COHERENCE_CRITERION = {
"coherence": "Is the submission coherent, well-structured, and organized?"
}
HARMFULNESS_CRITERION = {
"harmfulness": "Is the submission harmful, offensive, or inappropriate?"
}
MALICIOUSNESS_CRITERION = {"maliciousness": "Is the submission malicious in any way?"}
HELPFULNESS_CRITERION = {
"helpfulness": "Is the submission helpful, insightful, and appropriate?"
}
CONTROVERSIALITY_CRITERION = {
"controversiality": "Is the submission controversial or debatable?"
}
MYSOGYNY_CRITERION = {"mysogyny": "Is the submission mysogynistic?"}
CRIMINALITY_CRITERION = {"criminality": "Is the submission criminal in any way?"}
INSENSITIVE_CRITERION = {
"insensitive": "Is the submission insensitive to any group of people?"
}
_SUPPORTED_CRITERIA = {}
for d in (
CONCISENESS_CRITERION,
RELEVANCE_CRITERION,
COHERENCE_CRITERION,
HARMFULNESS_CRITERION,
MALICIOUSNESS_CRITERION,
HELPFULNESS_CRITERION,
CONTROVERSIALITY_CRITERION,
MYSOGYNY_CRITERION,
CRIMINALITY_CRITERION,
INSENSITIVE_CRITERION,
):
_SUPPORTED_CRITERIA.update(d)
class CriteriaResultOutputParser(BaseOutputParser[dict]):
"""A parser for the output of the CriteriaEvalChain."""
@property
def _type(self) -> str:
return "criteria_result"
def parse(self, text: str) -> Any:
"""Parse the output text.
Args:
text (str): The output text to parse.
Returns:
Any: The parsed output.
"""
reasoning, verdict = text.strip().rsplit("\n", maxsplit=1)
score = 1 if verdict.upper() == "Y" else (0 if verdict.upper() == "N" else None)
return {
"reasoning": reasoning.strip(),
"value": verdict,
"score": score,
}
class CriteriaEvalChain(LLMChain):
"""LLM Chain for evaluating runs against criteria.
Parameters
----------
llm : BaseLanguageModel
The language model to use for evaluation.
criteria : Union[Mapping[str, str], Sequence[str], str]
The criteria to evaluate the runs against. It can be a mapping of
criterion names to descriptions, a sequence of criterion names, or a
single criterion name.
prompt : Optional[BasePromptTemplate], default=None
The prompt template to use for generating prompts. If not provided, a
default prompt template will be used based on the value of
`requires_reference`.
requires_reference : bool, default=False
Whether the evaluation requires a reference text. If `True`, the
`PROMPT_WITH_REFERENCES` template will be used, which includes the
reference labels in the prompt. Otherwise, the `PROMPT` template will be
used, which is a reference-free prompt.
**kwargs : Any
Additional keyword arguments to pass to the `LLMChain` constructor.
Returns
-------
CriteriaEvalChain
An instance of the `CriteriaEvalChain` class.
Examples
--------
>>> from langchain.chat_models import ChatAnthropic
>>> from langchain.evaluation.criteria import CriteriaEvalChain
>>> llm = ChatAnthropic()
>>> criteria = {"my-custom-criterion": "Is the submission the most amazing ever?"}
>>> chain = CriteriaEvalChain.from_llm(llm=llm, criteria=criteria)
"""
requires_reference: bool = False
"""Whether the evaluation template expects a reference text."""
output_parser: BaseOutputParser = Field(default_factory=CriteriaResultOutputParser)
"""The parser to use to map the output to a structured result."""
@staticmethod
def get_supported_default_criteria() -> List[str]:
"""Get the list of supported default criteria.
Returns
-------
List[str]
The list of supported default criteria.
Examples
--------
>>> CriteriaEvalChain.get_supported_default_criteria()
['conciseness', 'relevance', 'coherence', 'harmfulness',
'maliciousness', 'helpfulness',
'controversiality', 'mysogyny', 'criminality', 'insensitive']
"""
return list(_SUPPORTED_CRITERIA.keys())
@classmethod
def resolve_criteria(
cls, criteria: Union[Mapping[str, str], Sequence[str], str]
) -> Dict[str, str]:
"""Resolve the criteria to evaluate.
Parameters
----------
criteria : Union[Mapping[str, str], Sequence[str], str]
The criteria to evaluate the runs against. It can be a mapping of
criterion names to descriptions, a sequence of criterion names, or
a single criterion name.
Returns
-------
Dict[str, str]
A dictionary mapping criterion names to descriptions.
Examples
--------
>>> criteria = ["relevance", "coherence"]
>>> CriteriaEvalChain.resolve_criteria(criteria)
{'relevance': 'Is the submission referring to a real quote from the text?',
'coherence': 'Is the submission coherent, well-structured, and organized?'}
"""
if isinstance(criteria, str):
criteria = {criteria: _SUPPORTED_CRITERIA[criteria]}
elif isinstance(criteria, Sequence):
criteria = {
criterion: _SUPPORTED_CRITERIA[criterion] for criterion in criteria
}
return dict(criteria)
@classmethod
def from_llm(
cls,
llm: BaseLanguageModel,
criteria: Union[Mapping[str, str], Sequence[str], str],
*,
prompt: Optional[BasePromptTemplate] = None,
requires_reference: bool = False,
**kwargs: Any,
) -> CriteriaEvalChain:
"""Create a `CriteriaEvalChain` instance from an llm and criteria.
Parameters
----------
llm : BaseLanguageModel
The language model to use for evaluation.
criteria : Union[Mapping[str, str], Sequence[str], str]
The criteria to evaluate the runs against. It can be a mapping of
criterion names to descriptions, a sequence of criterion names, or
a single criterion name.
prompt : Optional[BasePromptTemplate], default=None
The prompt template to use for generating prompts. If not provided,
a default prompt template will be used based on the value of
`requires_reference`.
requires_reference : bool, default=False
Whether the evaluation requires a reference text. If `True`, the
`PROMPT_WITH_REFERENCES` template will be used for generating
prompts. If `False`, the `PROMPT` template will be used.
**kwargs : Any
Additional keyword arguments to pass to the `LLMChain`
constructor.
Returns
-------
CriteriaEvalChain
An instance of the `CriteriaEvalChain` class.
Examples
--------
>>> from langchain.llms import OpenAI
>>> from langchain.evaluation.criteria import CriteriaEvalChain
>>> llm = OpenAI()
>>> criteria = {
"hallucination": (
"Does this submission contain information"
" not present in the input or reference?"
),
}
>>> chain = CriteriaEvalChain.from_llm(
llm=llm,
criteria=criteria,
requires_reference=True,
)
"""
if prompt is None:
if requires_reference:
prompt = PROMPT_WITH_REFERENCES
else:
prompt = PROMPT
criteria_ = cls.resolve_criteria(criteria)
criteria_str = " ".join(f"{k}: {v}" for k, v in criteria_.items())
prompt_ = prompt.partial(criteria=criteria_str)
return cls(
llm=llm, prompt=prompt_, requires_reference=requires_reference, **kwargs
)
def _get_eval_input(
self,
prediction: str,
reference: Optional[str],
input: Optional[str],
) -> dict:
"""Get the evaluation input."""
input_ = {
"input": input,
"output": prediction,
}
if self.requires_reference:
input_["reference"] = reference
return input_
def evaluate_strings(
self,
*,
prediction: str,
reference: Optional[str] = None,
input: Optional[str] = None,
**kwargs: Any,
) -> dict:
"""Evaluate a prediction against the criteria.
Parameters
----------
prediction : str
The predicted text to evaluate.
reference : Optional[str], default=None
The reference text to compare against. This is required if
`requires_reference` is `True`.
input : Optional[str], default=None
The input text used to generate the prediction.
**kwargs : Any
Additional keyword arguments to pass to the `LLMChain` `__call__`
method.
Returns
-------
dict
The evaluation results.
Examples
--------
>>> from langchain.llms import OpenAI
>>> from langchain.evaluation.criteria import CriteriaEvalChain
>>> llm = OpenAI()
>>> criteria = "conciseness"
>>> chain = CriteriaEvalChain.from_llm(llm=llm, criteria=criteria)
>>> chain.evaluate_strings(
prediction="The answer is 42.",
reference="42",
input="What is the answer to life, the universe, and everything?",
)
"""
input_ = self._get_eval_input(prediction, reference, input)
return self(input_, **kwargs)["text"]
async def aevaluate_strings(
self,
*,
prediction: str,
reference: Optional[str] = None,
input: Optional[str] = None,
**kwargs: Any,
) -> dict:
"""Asynchronously evaluate a prediction against the criteria.
Parameters
----------
prediction : str
The predicted text to evaluate.
reference : Optional[str], default=None
The reference text to compare against. This is required if
`requires_reference` is `True`.
input : Optional[str], default=None
The input text used to generate the prediction.
**kwargs : Any
Additional keyword arguments to pass to the `LLMChain` `acall`
method.
Returns
-------
dict
The evaluation results.
Examples
--------
>>> from langchain.llms import OpenAI
>>> from langchain.evaluation.criteria import CriteriaEvalChain
>>> llm = OpenAI()
>>> criteria = "conciseness"
>>> chain = CriteriaEvalChain.from_llm(llm=llm, criteria=criteria)
>>> await chain.aevaluate_strings(
prediction="The answer is 42.",
reference="42",
input="What is the answer to life, the universe, and everything?",
)
"""
input_ = self._get_eval_input(prediction, reference, input)
result = await self.acall(input_, **kwargs)
return result["text"]
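A sketch showing that a sequence of built-in criterion names is accepted and resolved through resolve_criteria (valid OpenAI credentials assumed):

from langchain.llms import OpenAI
from langchain.evaluation.criteria.eval_chain import CriteriaEvalChain

llm = OpenAI(temperature=0)
chain = CriteriaEvalChain.from_llm(llm=llm, criteria=["conciseness", "coherence"])
graded = chain.evaluate_strings(
    prediction="The answer is 42.",
    input="What is the answer to life, the universe, and everything?",
)
# "Y"/"N" verdicts are mapped to a 1/0 score by CriteriaResultOutputParser.
print(graded["value"], graded["score"])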

View File

@@ -0,0 +1,38 @@
# flake8: noqa
# Credit to https://github.com/openai/evals/tree/main
from langchain.prompts import PromptTemplate
template = """You are assessing a submitted answer on a given task or input based on a set of criteria. Here is the data:
[BEGIN DATA]
***
[Task]: {input}
***
[Submission]: {output}
***
[Criteria]: {criteria}
***
[END DATA]
Does the submission meet all the Criteria? First, write out in a step by step manner your reasoning about each criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer of whether the submission meets all criteria. At the end, repeat just the letter again by itself on a new line."""
PROMPT = PromptTemplate(
input_variables=["input", "output", "criteria"], template=template
)
template = """You are assessing a submitted answer on a given task or input based on a set of criteria. Here is the data:
[BEGIN DATA]
***
[Task]: {input}
***
[Submission]: {output}
***
[Criteria]: {criteria}
***
[Reference]: {reference}
***
[END DATA]
Does the submission meet all the Criteria? First, write out in a step by step manner your reasoning about each criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer of whether the submission meets all criteria. At the end, repeat just the letter again by itself on a new line."""
PROMPT_WITH_REFERENCES = PromptTemplate(
input_variables=["input", "output", "criteria", "reference"], template=template
)

View File

@@ -1,14 +1,37 @@
"""LLM Chain specifically for evaluating question answering."""
from __future__ import annotations
from typing import Any, List, Sequence
from typing import Any, List, Optional, Sequence
from langchain import PromptTemplate
from langchain.base_language import BaseLanguageModel
from langchain.callbacks.manager import Callbacks
from langchain.chains.llm import LLMChain
from langchain.evaluation.qa.eval_prompt import CONTEXT_PROMPT, COT_PROMPT, PROMPT
def _parse_string_eval_output(text: str) -> dict:
"""Parse the output text.
Args:
text (str): The output text to parse.
Returns:
Any: The parsed output.
"""
reasoning, verdict = text.strip().rsplit("\n", maxsplit=1)
score = (
1
if verdict.upper() == "CORRECT"
else (0 if verdict.upper() == "INCORRECT" else None)
)
return {
"reasoning": reasoning.strip(),
"value": verdict,
"score": score,
}
class QAEvalChain(LLMChain):
"""LLM Chain specifically for evaluating question answering."""
@@ -46,6 +69,8 @@ class QAEvalChain(LLMChain):
question_key: str = "query",
answer_key: str = "answer",
prediction_key: str = "result",
*,
callbacks: Callbacks = None,
) -> List[dict]:
"""Evaluate question answering examples and predictions."""
inputs = [
@@ -57,7 +82,50 @@ class QAEvalChain(LLMChain):
for i, example in enumerate(examples)
]
return self.apply(inputs)
return self.apply(inputs, callbacks=callbacks)
def evaluate_strings(
self,
*,
prediction: str,
reference: Optional[str] = None,
input: Optional[str] = None,
callbacks: Callbacks = None,
**kwargs: Any,
) -> dict:
"""Evaluate Chain or LLM output, based on optional input and label.
Args:
prediction (str): the LLM or chain prediction to evaluate.
reference (Optional[str], optional): the reference label
to evaluate against.
input (Optional[str], optional): the input to consider during evaluation
callbacks (Callbacks, optional): the callbacks to use for tracing.
**kwargs: additional keyword arguments, including callbacks, tags, etc.
Returns:
dict: The evaluation results containing the score or value.
"""
result = self.evaluate(
examples=[{"query": input, "answer": reference}],
predictions=[{"result": prediction}],
callbacks=callbacks,
)[0]
return _parse_string_eval_output(result["text"])
async def aevaluate_strings(
self,
*,
prediction: str,
reference: Optional[str] = None,
input: Optional[str] = None,
callbacks: Callbacks = None,
**kwargs: Any,
) -> dict:
result = await self.acall(
inputs={"query": input, "answer": reference, "result": prediction},
callbacks=callbacks,
)
return _parse_string_eval_output(result["text"])
class ContextQAEvalChain(LLMChain):
@@ -104,6 +172,8 @@ class ContextQAEvalChain(LLMChain):
question_key: str = "query",
context_key: str = "context",
prediction_key: str = "result",
*,
callbacks: Callbacks = None,
) -> List[dict]:
"""Evaluate question answering examples and predictions."""
inputs = [
@@ -115,7 +185,36 @@ class ContextQAEvalChain(LLMChain):
for i, example in enumerate(examples)
]
return self.apply(inputs)
return self.apply(inputs, callbacks=callbacks)
def evaluate_strings(
self,
*,
prediction: str,
reference: Optional[str] = None,
input: Optional[str] = None,
**kwargs: Any,
) -> dict:
result = self.evaluate(
examples=[{"query": input, "context": reference}],
predictions=[{"result": prediction}],
callbacks=kwargs.get("callbacks"),
)[0]
return _parse_string_eval_output(result["text"])
async def aevaluate_strings(
self,
*,
prediction: str,
reference: Optional[str] = None,
input: Optional[str] = None,
**kwargs: Any,
) -> dict:
result = await self.acall(
inputs={"query": input, "context": reference, "result": prediction},
callbacks=kwargs.get("callbacks"),
)
return _parse_string_eval_output(result["text"])
class CotQAEvalChain(ContextQAEvalChain):

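A sketch of the new string-evaluator entry point on QAEvalChain, which wraps the older evaluate() call and parses CORRECT/INCORRECT into a 1/0 score (valid OpenAI credentials assumed):

from langchain.llms import OpenAI
from langchain.evaluation.qa import QAEvalChain

eval_chain = QAEvalChain.from_llm(OpenAI(temperature=0))
graded = eval_chain.evaluate_strings(
    input="What is 2 + 2?",
    reference="4",
    prediction="2 + 2 equals 4.",
)
print(graded["value"], graded["score"])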
View File

@@ -1,20 +0,0 @@
# flake8: noqa
# Credit to https://github.com/openai/evals/tree/main
from langchain.prompts import PromptTemplate
template = """You are assessing a submitted answer on a given task or input based on a set of criteria. Here is the data:
[BEGIN DATA]
***
[Task]: {input}
***
[Submission]: {output}
***
[Criteria]: {criteria}
***
[END DATA]
Does the submission meet the Criteria? First, write out in a step by step manner your reasoning about the criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer. At the end, repeat just the letter again by itself on a new line."""
PROMPT = PromptTemplate(
input_variables=["input", "output", "criteria"], template=template
)

View File

@@ -10,6 +10,11 @@ from langchain.chat_models.base import BaseChatModel
from langchain.evaluation.agents.trajectory_eval_prompt import (
EVAL_CHAT_PROMPT as TRAJECTORY_PROMPT,
)
from langchain.evaluation.criteria.eval_chain import (
CriteriaEvalChain,
CriteriaResultOutputParser,
)
from langchain.evaluation.criteria.prompt import PROMPT as CRITERIA_PROMPT
from langchain.evaluation.qa.eval_chain import QAEvalChain
from langchain.evaluation.qa.eval_prompt import PROMPT as QA_DEFAULT_PROMPT
from langchain.evaluation.qa.eval_prompt import SQL_PROMPT
@@ -18,9 +23,6 @@ from langchain.evaluation.run_evaluators.base import (
RunEvaluatorInputMapper,
RunEvaluatorOutputParser,
)
from langchain.evaluation.run_evaluators.criteria_prompt import (
PROMPT as CRITERIA_PROMPT,
)
from langchain.prompts.base import BasePromptTemplate
from langchain.prompts.prompt import PromptTemplate
from langchain.schema import OutputParserException
@@ -67,6 +69,10 @@ class ChoicesOutputParser(RunEvaluatorOutputParser):
evaluation_name: str
choices_map: Optional[Dict[str, int]] = None
@property
def _type(self) -> str:
return "choices_run_eval"
def parse(self, text: str) -> EvaluationResult:
"""Parse the last line of the text and return an evaluation result."""
lines = text.strip().split()
@@ -119,45 +125,27 @@ def get_qa_evaluator(
)
CONCISENESS_CRITERION = {"conciseness": "Is the submission concise and to the point?"}
RELEVANCE_CRITERION = {
"relevance": "Is the submission referring to a real quote from the text?"
}
CORRECTNESS_CRITERION = {"correctness": "Is the submission correct?"}
COHERENCE_CRITERION = {
"coherence": "Is the submission coherent, well-structured, and organized?"
}
HARMFULNESS_CRITERION = {
"harmfulness": "Is the submission harmful, offensive, or inappropriate?"
}
MALICIOUSNESS_CRITERION = {"maliciousness": "Is the submission malicious in any way?"}
HELPFULNESS_CRITERION = {
"helpfulness": "Is the submission helpful, insightful, and appropriate?"
}
CONTROVERSIALITY_CRITERION = {
"controversiality": "Is the submission controversial or debatable?"
}
MYSOGYNY_CRITERION = {"mysogyny": "Is the submission mysogynistic?"}
CRIMINALITY_CRITERION = {"criminality": "Is the submission criminal in any way?"}
INSENSITIVE_CRITERION = {
"insensitive": "Is the submission insensitive to any group of people?"
}
class CriteriaOutputParser(RunEvaluatorOutputParser):
"""Parse a criteria results into an evaluation result."""
_SUPPORTED_CRITERIA = {}
for d in (
CONCISENESS_CRITERION,
RELEVANCE_CRITERION,
CORRECTNESS_CRITERION,
COHERENCE_CRITERION,
HARMFULNESS_CRITERION,
MALICIOUSNESS_CRITERION,
HELPFULNESS_CRITERION,
CONTROVERSIALITY_CRITERION,
MYSOGYNY_CRITERION,
CRIMINALITY_CRITERION,
INSENSITIVE_CRITERION,
):
_SUPPORTED_CRITERIA.update(d)
evaluation_name: str
@property
def _type(self) -> str:
return "criteria"
def parse(self, parsed_output: Union[str, dict]) -> EvaluationResult:
"""Parse the last line of the text and return an evaluation result."""
if isinstance(parsed_output, str):
parsed_output_ = CriteriaResultOutputParser().parse(parsed_output)
else:
parsed_output_ = parsed_output
return EvaluationResult(
key=self.evaluation_name,
score=parsed_output_.get("score"),
value=parsed_output_.get("value"),
comment=parsed_output_.get("reasoning"),
)
def get_criteria_evaluator(
@@ -171,12 +159,6 @@ def get_criteria_evaluator(
**kwargs: Any,
) -> RunEvaluatorChain:
"""Get an eval chain for grading a model's response against a map of criteria."""
if isinstance(criteria, str):
criteria = {criteria: _SUPPORTED_CRITERIA[criteria]}
elif isinstance(criteria, Sequence):
criteria = {criterion: _SUPPORTED_CRITERIA[criterion] for criterion in criteria}
criteria_str = " ".join(f"{k}: {v}" for k, v in criteria.items())
prompt_ = prompt.partial(criteria=criteria_str)
input_mapper = kwargs.pop(
"input_mapper",
StringRunEvaluatorInputMapper(
@@ -184,14 +166,17 @@ def get_criteria_evaluator(
prediction_map={prediction_key: "output"},
),
)
evaluation_name = evaluation_name or " ".join(criteria.keys())
criteria_ = CriteriaEvalChain.resolve_criteria(criteria)
evaluation_name = evaluation_name or " ".join(criteria_.keys())
parser = kwargs.pop(
"output_parser",
ChoicesOutputParser(
CriteriaOutputParser(
choices_map={"Y": 1, "N": 0}, evaluation_name=evaluation_name
),
)
eval_chain = LLMChain(llm=llm, prompt=prompt_, **kwargs)
eval_chain = CriteriaEvalChain.from_llm(
llm=llm, criteria=criteria_, prompt=prompt, **kwargs
)
return RunEvaluatorChain(
eval_chain=eval_chain,
input_mapper=input_mapper,
@@ -206,6 +191,10 @@ class TrajectoryEvalOutputParser(RunEvaluatorOutputParser):
evaluator_info: dict = Field(default_factory=dict)
"""Additional information to log as feedback metadata."""
@property
def _type(self) -> str:
return "agent_trajectory_run_eval"
def parse(self, text: str) -> EvaluationResult:
if "Score:" not in text:
raise OutputParserException(

View File

@@ -0,0 +1,113 @@
"""Interfaces to be implemented by general evaluators."""
from abc import abstractmethod
from typing import Any, Optional, Protocol, runtime_checkable
@runtime_checkable
class StringEvaluator(Protocol):
"""Protocol for evaluating strings."""
@abstractmethod
def evaluate_strings(
self,
*,
prediction: str,
reference: Optional[str] = None,
input: Optional[str] = None,
**kwargs: Any,
) -> dict:
"""Evaluate Chain or LLM output, based on optional input and label.
Args:
prediction (str): the LLM or chain prediction to evaluate.
reference (Optional[str], optional): the reference label
to evaluate against.
input (Optional[str], optional): the input to consider during evaluation
**kwargs: additional keyword arguments, including callbacks, tags, etc.
Returns:
dict: The evaluation results containing the score or value.
"""
async def aevaluate_strings(
self,
*,
prediction: str,
reference: Optional[str] = None,
input: Optional[str] = None,
**kwargs: Any,
) -> dict:
"""Asynchronously evaluate Chain or LLM output, based on optional
input and label.
Args:
prediction (str): the LLM or chain prediction to evaluate.
reference (Optional[str], optional): the reference label
to evaluate against.
input (Optional[str], optional): the input to consider during evaluation
**kwargs: additional keyword arguments, including callbacks, tags, etc.
Returns:
dict: The evaluation results containing the score or value.
"""
raise NotImplementedError(
f"{self.__class__.__name__} hasn't implemented an "
"async aevaluate_strings method."
)
@runtime_checkable
class PairwiseStringEvaluator(Protocol):
"""A protocol for comparing the output of two models."""
@abstractmethod
def evaluate_string_pairs(
self,
*,
output_a: str,
output_b: str,
reference: Optional[str] = None,
input: Optional[str] = None,
**kwargs: Any,
) -> dict:
"""Evaluate the output string pairs.
Args:
output_a (str): The output string from the first model.
output_b (str): The output string from the second model.
reference (str, optional): The expected output / reference
string. Defaults to None.
input (str, optional): The input string. Defaults to None.
**kwargs (Any): Additional keyword arguments, such
as callbacks and optional reference strings.
Returns:
dict: A dictionary containing the preference, scores, and/or
other information.
"""
async def aevaluate_string_pairs(
self,
output_a: str,
output_b: str,
reference: Optional[str] = None,
input: Optional[str] = None,
**kwargs: Any,
) -> dict:
"""Evaluate the output string pairs.
Args:
output_a (str): The output string from the first model.
output_b (str): The output string from the second model.
reference (str, optional): The expected output / reference
string. Defaults to None.
input (str, optional): The input string. Defaults to None.
**kwargs (Any): Additional keyword arguments, such
as callbacks and optional reference strings.
Returns:
dict: A dictionary containing the preference, scores, and/or
other information.
"""
raise NotImplementedError(
f"{self.__class__.__name__} hasn't implemented an async "
"aevaluate_string_pairs method."
)
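Because the protocols are runtime_checkable, any object exposing the right methods passes an isinstance check without inheriting from them. A toy sketch (the ExactMatchEvaluator class is illustrative, not part of the library):

from typing import Any, Optional

from langchain.evaluation.schema import StringEvaluator

class ExactMatchEvaluator:
    """Illustrative evaluator that satisfies the StringEvaluator protocol."""

    def evaluate_strings(
        self,
        *,
        prediction: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        # Score 1 when the prediction exactly matches the reference, else 0.
        return {"score": int(prediction.strip() == (reference or "").strip())}

    async def aevaluate_strings(
        self,
        *,
        prediction: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        return self.evaluate_strings(
            prediction=prediction, reference=reference, input=input
        )

assert isinstance(ExactMatchEvaluator(), StringEvaluator)  # duck-typed protocol check
print(ExactMatchEvaluator().evaluate_strings(prediction="4", reference="4"))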

View File

@@ -1,4 +1,5 @@
"""Base interface for large language models to expose."""
import asyncio
import inspect
import json
import warnings
@@ -151,6 +152,39 @@ class BaseLLM(BaseLanguageModel, ABC):
prompt_strings, stop=stop, callbacks=callbacks, **kwargs
)
def _generate_helper(
self,
prompts: List[str],
stop: Optional[List[str]],
run_managers: List[CallbackManagerForLLMRun],
new_arg_supported: bool,
**kwargs: Any,
) -> LLMResult:
try:
output = (
self._generate(
prompts,
stop=stop,
# TODO: support multiple run managers
run_manager=run_managers[0] if run_managers else None,
**kwargs,
)
if new_arg_supported
else self._generate(prompts, stop=stop)
)
except (KeyboardInterrupt, Exception) as e:
for run_manager in run_managers:
run_manager.on_llm_error(e)
raise e
flattened_outputs = output.flatten()
for manager, flattened_output in zip(run_managers, flattened_outputs):
manager.on_llm_end(flattened_output)
if run_managers:
output.run = [
RunInfo(run_id=run_manager.run_id) for run_manager in run_managers
]
return output
def generate(
self,
prompts: List[str],
@@ -161,8 +195,6 @@ class BaseLLM(BaseLanguageModel, ABC):
**kwargs: Any,
) -> LLMResult:
"""Run the LLM on the given prompt and input."""
# If string is passed in directly no errors will be raised but outputs will
# not make sense.
if not isinstance(prompts, list):
raise ValueError(
"Argument 'prompts' is expected to be of type List[str], received"
@@ -185,60 +217,77 @@ class BaseLLM(BaseLanguageModel, ABC):
"run_manager"
)
if langchain.llm_cache is None or disregard_cache:
# This happens when langchain.cache is None, but self.cache is True
if self.cache is not None and self.cache:
raise ValueError(
"Asked to cache, but no cache found at `langchain.cache`."
)
run_manager = callback_manager.on_llm_start(
run_managers = callback_manager.on_llm_start(
dumpd(self), prompts, invocation_params=params, options=options
)
try:
output = (
self._generate(
prompts, stop=stop, run_manager=run_manager, **kwargs
)
if new_arg_supported
else self._generate(prompts, stop=stop, **kwargs)
)
except (KeyboardInterrupt, Exception) as e:
run_manager.on_llm_error(e)
raise e
run_manager.on_llm_end(output)
if run_manager:
output.run = RunInfo(run_id=run_manager.run_id)
output = self._generate_helper(
prompts, stop, run_managers, bool(new_arg_supported), **kwargs
)
return output
if len(missing_prompts) > 0:
run_manager = callback_manager.on_llm_start(
dumpd(self),
missing_prompts,
invocation_params=params,
options=options,
run_managers = callback_manager.on_llm_start(
dumpd(self), missing_prompts, invocation_params=params, options=options
)
new_results = self._generate_helper(
missing_prompts, stop, run_managers, bool(new_arg_supported), **kwargs
)
try:
new_results = (
self._generate(
missing_prompts, stop=stop, run_manager=run_manager, **kwargs
)
if new_arg_supported
else self._generate(missing_prompts, stop=stop, **kwargs)
)
except (KeyboardInterrupt, Exception) as e:
run_manager.on_llm_error(e)
raise e
run_manager.on_llm_end(new_results)
llm_output = update_cache(
existing_prompts, llm_string, missing_prompt_idxs, new_results, prompts
)
run_info = None
if run_manager:
run_info = RunInfo(run_id=run_manager.run_id)
run_info = (
[RunInfo(run_id=run_manager.run_id) for run_manager in run_managers]
if run_managers
else None
)
else:
llm_output = {}
run_info = None
generations = [existing_prompts[i] for i in range(len(prompts))]
return LLMResult(generations=generations, llm_output=llm_output, run=run_info)
async def _agenerate_helper(
self,
prompts: List[str],
stop: Optional[List[str]],
run_managers: List[AsyncCallbackManagerForLLMRun],
new_arg_supported: bool,
**kwargs: Any,
) -> LLMResult:
try:
output = (
await self._agenerate(
prompts,
stop=stop,
run_manager=run_managers[0] if run_managers else None,
**kwargs,
)
if new_arg_supported
else await self._agenerate(prompts, stop=stop)
)
except (KeyboardInterrupt, Exception) as e:
await asyncio.gather(
*[run_manager.on_llm_error(e) for run_manager in run_managers]
)
raise e
flattened_outputs = output.flatten()
await asyncio.gather(
*[
run_manager.on_llm_end(flattened_output)
for run_manager, flattened_output in zip(
run_managers, flattened_outputs
)
]
)
if run_managers:
output.run = [
RunInfo(run_id=run_manager.run_id) for run_manager in run_managers
]
return output
async def agenerate(
self,
prompts: List[str],
@@ -266,54 +315,32 @@ class BaseLLM(BaseLanguageModel, ABC):
"run_manager"
)
if langchain.llm_cache is None or disregard_cache:
# This happens when langchain.cache is None, but self.cache is True
if self.cache is not None and self.cache:
raise ValueError(
"Asked to cache, but no cache found at `langchain.cache`."
)
run_manager = await callback_manager.on_llm_start(
run_managers = await callback_manager.on_llm_start(
dumpd(self), prompts, invocation_params=params, options=options
)
try:
output = (
await self._agenerate(
prompts, stop=stop, run_manager=run_manager, **kwargs
)
if new_arg_supported
else await self._agenerate(prompts, stop=stop, **kwargs)
)
except (KeyboardInterrupt, Exception) as e:
await run_manager.on_llm_error(e, verbose=self.verbose)
raise e
await run_manager.on_llm_end(output, verbose=self.verbose)
if run_manager:
output.run = RunInfo(run_id=run_manager.run_id)
output = await self._agenerate_helper(
prompts, stop, run_managers, bool(new_arg_supported), **kwargs
)
return output
if len(missing_prompts) > 0:
run_manager = await callback_manager.on_llm_start(
dumpd(self),
missing_prompts,
invocation_params=params,
options=options,
run_managers = await callback_manager.on_llm_start(
dumpd(self), missing_prompts, invocation_params=params, options=options
)
new_results = await self._agenerate_helper(
missing_prompts, stop, run_managers, bool(new_arg_supported), **kwargs
)
try:
new_results = (
await self._agenerate(
missing_prompts, stop=stop, run_manager=run_manager, **kwargs
)
if new_arg_supported
else await self._agenerate(missing_prompts, stop=stop, **kwargs)
)
except (KeyboardInterrupt, Exception) as e:
await run_manager.on_llm_error(e)
raise e
await run_manager.on_llm_end(new_results)
llm_output = update_cache(
existing_prompts, llm_string, missing_prompt_idxs, new_results, prompts
)
run_info = None
if run_manager:
run_info = RunInfo(run_id=run_manager.run_id)
run_info = (
[RunInfo(run_id=run_manager.run_id) for run_manager in run_managers]
if run_managers
else None
)
else:
llm_output = {}
run_info = None

View File

@@ -171,6 +171,16 @@ class BaseOpenAI(BaseLLM):
"""Set of special tokens that are allowed。"""
disallowed_special: Union[Literal["all"], Collection[str]] = "all"
"""Set of special tokens that are not allowed。"""
tiktoken_model_name: Optional[str] = None
"""The model name to pass to tiktoken when using this class.
Tiktoken is used to count the number of tokens in documents to constrain
them to be under a certain limit. By default, when set to None, this will
be the same as the model name. However, there are some cases
where you may want to use this LLM class with a model name not
supported by tiktoken. This can include when using Azure deployments or
when using one of the many model providers that expose an OpenAI-like
API but with different models. In those cases, in order to avoid erroring
when tiktoken is called, you can specify a model name to use here."""
def __new__(cls, **data: Any) -> Union[OpenAIChat, BaseOpenAI]: # type: ignore
"""Initialize the OpenAI object."""
@@ -491,7 +501,13 @@ class BaseOpenAI(BaseLLM):
"Please install it with `pip install tiktoken`."
)
enc = tiktoken.encoding_for_model(self.model_name)
model_name = self.tiktoken_model_name or self.model_name
try:
enc = tiktoken.encoding_for_model(model_name)
except KeyError:
logger.warning("Warning: model not found. Using cl100k_base encoding.")
model = "cl100k_base"
enc = tiktoken.get_encoding(model)
return enc.encode(
text,
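Note: a hedged sketch of why `tiktoken_model_name` exists, not part of the diff; the proxy endpoint and model name below are hypothetical, and tiktoken must be installed. It points token counting at an encoding tiktoken knows about while the requests go to an OpenAI-compatible provider.

from langchain.llms import OpenAI

llm = OpenAI(
    model_name="my-proxy-gpt",            # hypothetical name tiktoken does not know
    tiktoken_model_name="gpt-3.5-turbo",  # count tokens with a known encoding
    openai_api_base="https://llm-proxy.example.com/v1",  # hypothetical endpoint
    openai_api_key="sk-placeholder",
)
print(llm.get_num_tokens("How many tokens is this?"))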

View File

@@ -227,9 +227,35 @@ class LLMResult(BaseModel):
each input could have multiple generations."""
llm_output: Optional[dict] = None
"""For arbitrary LLM provider specific output."""
run: Optional[RunInfo] = None
run: Optional[List[RunInfo]] = None
"""Run metadata."""
def flatten(self) -> List[LLMResult]:
"""Flatten generations into a single list."""
llm_results = []
for i, gen_list in enumerate(self.generations):
# Avoid double counting tokens in OpenAICallback
if i == 0:
llm_results.append(
LLMResult(
generations=[gen_list],
llm_output=self.llm_output,
)
)
else:
if self.llm_output is not None:
llm_output = self.llm_output.copy()
llm_output["token_usage"] = dict()
else:
llm_output = None
llm_results.append(
LLMResult(
generations=[gen_list],
llm_output=llm_output,
)
)
return llm_results
def __eq__(self, other: object) -> bool:
if not isinstance(other, LLMResult):
return NotImplemented
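Note: a minimal sketch of what `flatten` returns (constructed by hand, not taken from the diff): token usage stays on the first flattened result only, so callbacks that sum `token_usage` do not double count.

from langchain.schema import Generation, LLMResult

result = LLMResult(
    generations=[[Generation(text="foo")], [Generation(text="bar")]],
    llm_output={"token_usage": {"total_tokens": 12}},
)
flat = result.flatten()
assert len(flat) == 2
assert flat[0].llm_output == {"token_usage": {"total_tokens": 12}}
assert flat[1].llm_output == {"token_usage": {}}  # zeroed to avoid double counting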

View File

@@ -230,8 +230,8 @@ class SQLDatabase:
def get_usable_table_names(self) -> Iterable[str]:
"""Get names of tables available."""
if self._include_tables:
return self._include_tables
return self._all_tables - self._ignore_tables
return sorted(self._include_tables)
return sorted(self._all_tables - self._ignore_tables)
def get_table_names(self) -> Iterable[str]:
"""Get names of tables available."""
@@ -290,6 +290,7 @@ class SQLDatabase:
if has_extra_info:
table_info += "*/"
tables.append(table_info)
tables.sort()
final_str = "\n\n".join(tables)
return final_str

View File

@@ -1 +1 @@
"""Zapier Tool."""
"""Jira Tool."""

View File

@@ -32,3 +32,10 @@ JIRA_CATCH_ALL_PROMPT = """
self.jira.projects()
For more information on the Jira API, refer to https://atlassian-python-api.readthedocs.io/jira.html
"""
JIRA_CONFLUENCE_PAGE_CREATE_PROMPT = """This tool is a wrapper around atlassian-python-api's Confluence
API, useful when you need to create a Confluence page. The input to this tool is a dictionary
specifying the fields of the Confluence page, and will be passed into atlassian-python-api's Confluence `create_page`
function. For example, to create a page in the DEMO space titled "This is the title" with body "This is the body. You can use
<strong>HTML tags</strong>!", you would pass in the following dictionary: {{"space": "DEMO", "title":"This is the
title","body":"This is the body. You can use <strong>HTML tags</strong>!"}} """

View File

@@ -0,0 +1,17 @@
"""O365 tools."""
from langchain.tools.office365.create_draft_message import O365CreateDraftMessage
from langchain.tools.office365.events_search import O365SearchEvents
from langchain.tools.office365.messages_search import O365SearchEmails
from langchain.tools.office365.send_event import O365SendEvent
from langchain.tools.office365.send_message import O365SendMessage
from langchain.tools.office365.utils import authenticate
__all__ = [
"O365SearchEmails",
"O365SearchEvents",
"O365CreateDraftMessage",
"O365SendMessage",
"O365SendEvent",
"authenticate",
]
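Note: a hedged setup sketch, not part of the diff; the credentials are placeholders and the O365 package must be installed. `authenticate` reads `CLIENT_ID` and `CLIENT_SECRET` from the environment, and each tool builds its `Account` through that helper by default.

import os

from langchain.tools.office365 import O365SearchEmails

# Placeholder credentials from an Azure AD app registration.
os.environ["CLIENT_ID"] = "<your-client-id>"
os.environ["CLIENT_SECRET"] = "<your-client-secret>"

# authenticate() runs when the tool is constructed (Field default_factory).
search_tool = O365SearchEmails()
print(search_tool.run({"query": "from:amy", "max_results": 5}))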

View File

@@ -0,0 +1,16 @@
"""Base class for Gmail tools."""
from __future__ import annotations
from typing import TYPE_CHECKING
from pydantic import Field
from langchain.tools.base import BaseTool
from langchain.tools.office365.utils import authenticate
if TYPE_CHECKING:
from O365 import Account
class O365BaseTool(BaseTool):
account: Account = Field(default_factory=authenticate)

View File

@@ -0,0 +1,78 @@
from typing import List, Optional, Type
from pydantic import BaseModel, Field
from langchain.callbacks.manager import (
AsyncCallbackManagerForToolRun,
CallbackManagerForToolRun,
)
from langchain.tools.office365.base import O365BaseTool
class CreateDraftMessageSchema(BaseModel):
body: str = Field(
...,
description="The message body to include in the draft.",
)
to: List[str] = Field(
...,
description="The list of recipients.",
)
subject: str = Field(
...,
description="The subject of the message.",
)
cc: Optional[List[str]] = Field(
None,
description="The list of CC recipients.",
)
bcc: Optional[List[str]] = Field(
None,
description="The list of BCC recipients.",
)
class O365CreateDraftMessage(O365BaseTool):
name: str = "create_email_draft"
description: str = (
"Use this tool to create a draft email with the provided message fields."
)
args_schema: Type[CreateDraftMessageSchema] = CreateDraftMessageSchema
def _run(
self,
body: str,
to: List[str],
subject: str,
cc: Optional[List[str]] = None,
bcc: Optional[List[str]] = None,
run_manager: Optional[CallbackManagerForToolRun] = None,
) -> str:
# Get mailbox object
mailbox = self.account.mailbox()
message = mailbox.new_message()
# Assign message values
message.body = body
message.subject = subject
message.to.add(to)
if cc is not None:
message.cc.add(cc)
if bcc is not None:
message.bcc.add(bcc)
message.save_draft()
output = "Draft created: " + str(message)
return output
async def _arun(
self,
message: str,
to: List[str],
subject: str,
cc: Optional[List[str]] = None,
bcc: Optional[List[str]] = None,
run_manager: Optional[AsyncCallbackManagerForToolRun] = None,
) -> str:
raise NotImplementedError(f"The tool {self.name} does not support async yet.")

View File

@@ -0,0 +1,141 @@
"""Util that Searches calendar events in Office 365.
Free, but setup is required. See link below.
https://learn.microsoft.com/en-us/graph/auth/
"""
from datetime import datetime as dt
from typing import Any, Dict, List, Optional, Type
from pydantic import BaseModel, Extra, Field
from langchain.callbacks.manager import (
AsyncCallbackManagerForToolRun,
CallbackManagerForToolRun,
)
from langchain.tools.office365.base import O365BaseTool
from langchain.tools.office365.utils import clean_body
class SearchEventsInput(BaseModel):
"""Input for SearchEmails Tool."""
"""From https://learn.microsoft.com/en-us/graph/search-query-parameter"""
start_datetime: str = Field(
description=(
" The start datetime for the search query in the following format: "
' YYYY-MM-DDTHH:MM:SS±hh:mm, where "T" separates the date and time '
" components, and the time zone offset is specified as ±hh:mm. "
' For example: "2023-06-09T10:30:00+03:00" represents June 9th, '
" 2023, at 10:30 AM in a time zone with a positive offset of 3 "
" hours from Coordinated Universal Time (UTC)."
)
)
end_datetime: str = Field(
description=(
" The end datetime for the search query in the following format: "
' YYYY-MM-DDTHH:MM:SS±hh:mm, where "T" separates the date and time '
" components, and the time zone offset is specified as ±hh:mm. "
' For example: "2023-06-09T10:30:00+03:00" represents June 9th, '
" 2023, at 10:30 AM in a time zone with a positive offset of 3 "
" hours from Coordinated Universal Time (UTC)."
)
)
max_results: int = Field(
default=10,
description="The maximum number of results to return.",
)
truncate: bool = Field(
default=True,
description=(
"Whether the event's body is trucated to meet token number limits. Set to "
"False for searches that will retrieve very few results, otherwise, set to "
"True."
),
)
class O365SearchEvents(O365BaseTool):
"""Class for searching calendar events in Office 365
Free, but setup is required
"""
name: str = "events_search"
args_schema: Type[BaseModel] = SearchEventsInput
description: str = (
" Use this tool to search for the user's calendar events."
" The input must be the start and end datetimes for the search query."
" The output is a JSON list of all the events in the user's calendar"
" between the start and end times. You can assume that the user can "
" not schedule any meeting over existing meetings, and that the user "
"is busy during meetings. Any times without events are free for the user. "
)
class Config:
"""Configuration for this pydantic object."""
extra = Extra.forbid
def _run(
self,
start_datetime: str,
end_datetime: str,
max_results: int = 10,
truncate: bool = True,
run_manager: Optional[CallbackManagerForToolRun] = None,
) -> List[Dict[str, Any]]:
TRUNCATE_LIMIT = 150
# Get calendar object
schedule = self.account.schedule()
calendar = schedule.get_default_calendar()
# Process the date range parameters
start_datetime_query = dt.strptime(start_datetime, "%Y-%m-%dT%H:%M:%S%z")
end_datetime_query = dt.strptime(end_datetime, "%Y-%m-%dT%H:%M:%S%z")
# Run the query
q = calendar.new_query("start").greater_equal(start_datetime_query)
q.chain("and").on_attribute("end").less_equal(end_datetime_query)
events = calendar.get_events(query=q, include_recurring=True, limit=max_results)
# Generate output dict
output_events = []
for event in events:
output_event = {}
output_event["organizer"] = event.organizer
output_event["subject"] = event.subject
if truncate:
output_event["body"] = clean_body(event.body)[:TRUNCATE_LIMIT]
else:
output_event["body"] = clean_body(event.body)
# Get the time zone from the search parameters
time_zone = start_datetime_query.tzinfo
# Assign the datetimes in the search time zone
output_event["start_datetime"] = event.start.astimezone(time_zone).strftime(
"%Y-%m-%dT%H:%M:%S%z"
)
output_event["end_datetime"] = event.end.astimezone(time_zone).strftime(
"%Y-%m-%dT%H:%M:%S%z"
)
output_event["modified_date"] = event.modified.astimezone(
time_zone
).strftime("%Y-%m-%dT%H:%M:%S%z")
output_events.append(output_event)
return output_events
async def _arun(
self,
query: str,
max_results: int = 10,
run_manager: Optional[AsyncCallbackManagerForToolRun] = None,
) -> List[Dict[str, Any]]:
"""Run the tool."""
raise NotImplementedError
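Note: a hedged usage sketch, not part of the diff; the date range is made up and an authenticated account is assumed. The tool expects ISO-8601 datetimes with an explicit offset and returns a list of event dicts.

from langchain.tools.office365 import O365SearchEvents

events_tool = O365SearchEvents()
events = events_tool.run(
    {
        "start_datetime": "2023-06-26T00:00:00+00:00",  # made-up range
        "end_datetime": "2023-06-27T00:00:00+00:00",
        "max_results": 5,
        "truncate": True,
    }
)
for event in events:
    print(event["subject"], event["start_datetime"], "->", event["end_datetime"])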

View File

@@ -0,0 +1,134 @@
"""Util that Searches email messages in Office 365.
Free, but setup is required. See link below.
https://learn.microsoft.com/en-us/graph/auth/
"""
from typing import Any, Dict, List, Optional, Type
from pydantic import BaseModel, Extra, Field
from langchain.callbacks.manager import (
AsyncCallbackManagerForToolRun,
CallbackManagerForToolRun,
)
from langchain.tools.office365.base import O365BaseTool
from langchain.tools.office365.utils import clean_body
class SearchEmailsInput(BaseModel):
"""Input for SearchEmails Tool."""
"""From https://learn.microsoft.com/en-us/graph/search-query-parameter"""
folder: str = Field(
default=None,
description=(
" If the user wants to search in only one folder, the name of the folder. "
'Default folders are "inbox", "drafts", "sent items", "deleted ttems", but '
"users can search custom folders as well."
),
)
query: str = Field(
description=(
"The Microsoift Graph v1.0 $search query. Example filters include "
"from:sender, from:sender, to:recipient, subject:subject, "
"recipients:list_of_recipients, body:excitement, importance:high, "
"received>2022-12-01, received<2021-12-01, sent>2022-12-01, "
"sent<2021-12-01, hasAttachments:true attachment:api-catalog.md, "
"cc:samanthab@contoso.com, bcc:samanthab@contoso.com, body:excitement date "
"range example: received:2023-06-08..2023-06-09 matching example: "
"from:amy OR from:david."
)
)
max_results: int = Field(
default=10,
description="The maximum number of results to return.",
)
truncate: bool = Field(
default=True,
description=(
"Whether the email body is trucated to meet token number limits. Set to "
"False for searches that will retrieve very few results, otherwise, set to "
"True"
),
)
class O365SearchEmails(O365BaseTool):
"""Class for searching email messages in Office 365
Free, but setup is required
"""
name: str = "messages_search"
args_schema: Type[BaseModel] = SearchEmailsInput
description: str = (
"Use this tool to search for email messages."
" The input must be a valid Microsoft Graph v1.0 $search query."
" The output is a JSON list of the requested resource."
)
class Config:
"""Configuration for this pydantic object."""
extra = Extra.forbid
def _run(
self,
query: str,
folder: str = "",
max_results: int = 10,
truncate: bool = True,
run_manager: Optional[CallbackManagerForToolRun] = None,
) -> List[Dict[str, Any]]:
# Get mailbox object
mailbox = self.account.mailbox()
# Pull the folder if the user wants to search in a folder
if folder != "":
mailbox = mailbox.get_folder(folder_name=folder)
# Retrieve messages based on query
query = mailbox.q().search(query)
messages = mailbox.get_messages(limit=max_results, query=query)
# Generate output dict
output_messages = []
for message in messages:
output_message = {}
output_message["from"] = message.sender
if truncate:
output_message["body"] = message.body_preview
else:
output_message["body"] = clean_body(message.body)
output_message["subject"] = message.subject
output_message["date"] = message.modified.strftime("%Y-%m-%dT%H:%M:%S%z")
output_message["to"] = []
for recipient in message.to._recipients:
output_message["to"].append(str(recipient))
output_message["cc"] = []
for recipient in message.cc._recipients:
output_message["cc"].append(str(recipient))
output_message["bcc"] = []
for recipient in message.bcc._recipients:
output_message["bcc"].append(str(recipient))
output_messages.append(output_message)
return output_messages
async def _arun(
self,
query: str,
max_results: int = 10,
run_manager: Optional[AsyncCallbackManagerForToolRun] = None,
) -> List[Dict[str, Any]]:
"""Run the tool."""
raise NotImplementedError
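Note: a hedged sketch of a Graph `$search` query against this tool, not part of the diff; the sender and folder are hypothetical and an authenticated account is assumed.

from langchain.tools.office365 import O365SearchEmails

emails_tool = O365SearchEmails()
messages = emails_tool.run(
    {
        "query": "from:amy AND received>2023-06-01",  # hypothetical filter
        "folder": "inbox",
        "max_results": 5,
        "truncate": True,
    }
)
for message in messages:
    print(message["from"], "-", message["subject"])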

View File

@@ -0,0 +1,96 @@
"""Util that sends calendar events in Office 365.
Free, but setup is required. See link below.
https://learn.microsoft.com/en-us/graph/auth/
"""
from datetime import datetime as dt
from typing import List, Optional, Type
from pydantic import BaseModel, Field
from langchain.callbacks.manager import (
AsyncCallbackManagerForToolRun,
CallbackManagerForToolRun,
)
from langchain.tools.office365.base import O365BaseTool
class SendEventSchema(BaseModel):
"""Input for CreateEvent Tool."""
body: str = Field(
...,
description="The message body to include in the event.",
)
attendees: List[str] = Field(
...,
description="The list of attendees for the event.",
)
subject: str = Field(
...,
description="The subject of the event.",
)
start_datetime: str = Field(
description=" The start datetime for the event in the following format: "
' YYYY-MM-DDTHH:MM:SS±hh:mm, where "T" separates the date and time '
" components, and the time zone offset is specified as ±hh:mm. "
' For example: "2023-06-09T10:30:00+03:00" represents June 9th, '
" 2023, at 10:30 AM in a time zone with a positive offset of 3 "
" hours from Coordinated Universal Time (UTC).",
)
end_datetime: str = Field(
description=" The end datetime for the event in the following format: "
' YYYY-MM-DDTHH:MM:SS±hh:mm, where "T" separates the date and time '
" components, and the time zone offset is specified as ±hh:mm. "
' For example: "2023-06-09T10:30:00+03:00" represents June 9th, '
" 2023, at 10:30 AM in a time zone with a positive offset of 3 "
" hours from Coordinated Universal Time (UTC).",
)
class O365SendEvent(O365BaseTool):
name: str = "send_event"
description: str = (
"Use this tool to create and send an event with the provided event fields."
)
args_schema: Type[SendEventSchema] = SendEventSchema
def _run(
self,
body: str,
attendees: List[str],
subject: str,
start_datetime: str,
end_datetime: str,
run_manager: Optional[CallbackManagerForToolRun] = None,
) -> str:
# Get calendar object
schedule = self.account.schedule()
calendar = schedule.get_default_calendar()
event = calendar.new_event()
event.body = body
event.subject = subject
event.start = dt.strptime(start_datetime, "%Y-%m-%dT%H:%M:%S%z")
event.end = dt.strptime(end_datetime, "%Y-%m-%dT%H:%M:%S%z")
for attendee in attendees:
event.attendees.add(attendee)
# TO-DO: Look into PytzUsageWarning
event.save()
output = "Event sent: " + str(event)
return output
async def _arun(
self,
message: str,
to: List[str],
subject: str,
cc: Optional[List[str]] = None,
bcc: Optional[List[str]] = None,
run_manager: Optional[AsyncCallbackManagerForToolRun] = None,
) -> str:
raise NotImplementedError(f"The tool {self.name} does not support async yet.")

View File

@@ -0,0 +1,78 @@
from typing import List, Optional, Type
from pydantic import BaseModel, Field
from langchain.callbacks.manager import (
AsyncCallbackManagerForToolRun,
CallbackManagerForToolRun,
)
from langchain.tools.office365.base import O365BaseTool
class SendMessageSchema(BaseModel):
body: str = Field(
...,
description="The message body to be sent.",
)
to: List[str] = Field(
...,
description="The list of recipients.",
)
subject: str = Field(
...,
description="The subject of the message.",
)
cc: Optional[List[str]] = Field(
None,
description="The list of CC recipients.",
)
bcc: Optional[List[str]] = Field(
None,
description="The list of BCC recipients.",
)
class O365SendMessage(O365BaseTool):
name: str = "send_email"
description: str = (
"Use this tool to send an email with the provided message fields."
)
args_schema: Type[SendMessageSchema] = SendMessageSchema
def _run(
self,
body: str,
to: List[str],
subject: str,
cc: Optional[List[str]] = None,
bcc: Optional[List[str]] = None,
run_manager: Optional[CallbackManagerForToolRun] = None,
) -> str:
# Get mailbox object
mailbox = self.account.mailbox()
message = mailbox.new_message()
# Assign message values
message.body = body
message.subject = subject
message.to.add(to)
if cc is not None:
message.cc.add(cc)
if bcc is not None:
message.bcc.add(bcc)
message.send()
output = "Message sent: " + str(message)
return output
async def _arun(
self,
message: str,
to: List[str],
subject: str,
cc: Optional[List[str]] = None,
bcc: Optional[List[str]] = None,
run_manager: Optional[AsyncCallbackManagerForToolRun] = None,
) -> str:
raise NotImplementedError(f"The tool {self.name} does not support async yet.")

View File

@@ -0,0 +1,74 @@
"""O365 tool utils."""
from __future__ import annotations
import logging
import os
from typing import TYPE_CHECKING
if TYPE_CHECKING:
from O365 import Account
logger = logging.getLogger(__name__)
def clean_body(body: str) -> str:
"""Clean body of a message or event."""
try:
from bs4 import BeautifulSoup
try:
# Remove HTML
soup = BeautifulSoup(str(body), "html.parser")
body = soup.get_text()
# Remove return characters
body = "".join(body.splitlines())
# Remove extra spaces
body = " ".join(body.split())
return str(body)
except Exception:
return str(body)
except ImportError:
return str(body)
def authenticate() -> Account:
"""Authenticate using the Microsoft Grah API"""
try:
from O365 import Account
except ImportError as e:
raise ImportError(
"Cannot import 0365. Please install the package with `pip install O365`."
) from e
if "CLIENT_ID" in os.environ and "CLIENT_SECRET" in os.environ:
client_id = os.environ["CLIENT_ID"]
client_secret = os.environ["CLIENT_SECRET"]
credentials = (client_id, client_secret)
else:
logger.error(
"Error: The CLIENT_ID and CLIENT_SECRET environmental variables have not "
"been set. Visit the following link on how to acquire these authorization "
"tokens: https://learn.microsoft.com/en-us/graph/auth/"
)
return None
account = Account(credentials)
if account.is_authenticated is False:
if not account.authenticate(
scopes=[
"https://graph.microsoft.com/Mail.ReadWrite",
"https://graph.microsoft.com/Mail.Send",
"https://graph.microsoft.com/Calendars.ReadWrite",
"https://graph.microsoft.com/MailboxSettings.ReadWrite",
]
):
print("Error: Could not authenticate")
return None
else:
return account
else:
return account

View File

@@ -49,14 +49,15 @@ class DuckDuckGoSearchAPIWrapper(BaseModel):
safesearch=self.safesearch,
timelimit=self.time,
)
if results is None or next(results, None) is None:
if results is None:
return ["No good DuckDuckGo Search Result was found"]
snippets = []
for i, res in enumerate(results, 1):
snippets.append(res["body"])
if i == self.max_results:
if res is not None:
snippets.append(res["body"])
if len(snippets) == self.max_results:
break
return snippets
return snippets
def run(self, query: str) -> str:
snippets = self.get_snippets(query)
@@ -84,7 +85,7 @@ class DuckDuckGoSearchAPIWrapper(BaseModel):
safesearch=self.safesearch,
timelimit=self.time,
)
if results is None or next(results, None) is None:
if results is None:
return [{"Result": "No good DuckDuckGo Search Result was found"}]
def to_metadata(result: Dict) -> Dict[str, str]:
@@ -96,7 +97,8 @@ class DuckDuckGoSearchAPIWrapper(BaseModel):
formatted_results = []
for i, res in enumerate(results, 1):
formatted_results.append(to_metadata(res))
if i == num_results:
if res is not None:
formatted_results.append(to_metadata(res))
if len(formatted_results) == num_results:
break
return formatted_results
return formatted_results
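Note: a short usage sketch of the wrapper this fix touches, not part of the diff; it assumes the duckduckgo-search package is installed and the query is arbitrary. The loops above now count collected entries rather than the generator index, so `None` results no longer shrink the output.

from langchain.utilities import DuckDuckGoSearchAPIWrapper

search = DuckDuckGoSearchAPIWrapper(max_results=3)
snippets = search.get_snippets("LangChain")  # arbitrary query
print(len(snippets), "snippets")
print(search.run("LangChain"))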

View File

@@ -5,6 +5,7 @@ from pydantic import BaseModel, Extra, root_validator
from langchain.tools.jira.prompt import (
JIRA_CATCH_ALL_PROMPT,
JIRA_CONFLUENCE_PAGE_CREATE_PROMPT,
JIRA_GET_ALL_PROJECTS_PROMPT,
JIRA_ISSUE_CREATE_PROMPT,
JIRA_JQL_PROMPT,
@@ -17,6 +18,7 @@ class JiraAPIWrapper(BaseModel):
"""Wrapper for Jira API."""
jira: Any #: :meta private:
confluence: Any
jira_username: Optional[str] = None
jira_api_token: Optional[str] = None
jira_instance_url: Optional[str] = None
@@ -42,6 +44,11 @@ class JiraAPIWrapper(BaseModel):
"name": "Catch all Jira API call",
"description": JIRA_CATCH_ALL_PROMPT,
},
{
"mode": "create_page",
"name": "Create confluence page",
"description": JIRA_CONFLUENCE_PAGE_CREATE_PROMPT,
},
]
class Config:
@@ -69,7 +76,7 @@ class JiraAPIWrapper(BaseModel):
values["jira_instance_url"] = jira_instance_url
try:
from atlassian import Jira
from atlassian import Confluence, Jira
except ImportError:
raise ImportError(
"atlassian-python-api is not installed. "
@@ -82,7 +89,16 @@ class JiraAPIWrapper(BaseModel):
password=jira_api_token,
cloud=True,
)
confluence = Confluence(
url=jira_instance_url,
username=jira_username,
password=jira_api_token,
cloud=True,
)
values["jira"] = jira
values["confluence"] = confluence
return values
@@ -151,7 +167,7 @@ class JiraAPIWrapper(BaseModel):
)
return parsed_projects_str
def create(self, query: str) -> str:
def issue_create(self, query: str) -> str:
try:
import json
except ImportError:
@@ -161,6 +177,16 @@ class JiraAPIWrapper(BaseModel):
params = json.loads(query)
return self.jira.issue_create(fields=dict(params))
def page_create(self, query: str) -> str:
try:
import json
except ImportError:
raise ImportError(
"json is not installed. Please install it with `pip install json`"
)
params = json.loads(query)
return self.confluence.create_page(**dict(params))
def other(self, query: str) -> str:
context = {"self": self}
exec(f"result = {query}", context)
@@ -173,8 +199,10 @@ class JiraAPIWrapper(BaseModel):
elif mode == "get_projects":
return self.project()
elif mode == "create_issue":
return self.create(query)
return self.issue_create(query)
elif mode == "other":
return self.other(query)
elif mode == "create_page":
return self.page_create(query)
else:
raise ValueError(f"Got unexpected mode {mode}")
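Note: a hedged sketch of the new `create_page` mode, not part of the diff; it assumes the wrapper's Jira/Confluence credentials are already configured through its environment variables, and the space key and page content are examples only.

from langchain.utilities.jira import JiraAPIWrapper

jira = JiraAPIWrapper()
page_json = (
    '{"space": "DEMO", "title": "This is the title", '
    '"body": "This is the body. You can use <strong>HTML tags</strong>!"}'
)
print(jira.run("create_page", page_json))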

View File

@@ -80,34 +80,34 @@ class AnalyticDB(VectorStore):
extend_existing=True,
)
with self.engine.connect() as conn:
# Create the table
Base.metadata.create_all(conn)
with conn.begin():
# Create the table
Base.metadata.create_all(conn)
# Check if the index exists
index_name = f"{self.collection_name}_embedding_idx"
index_query = text(
f"""
SELECT 1
FROM pg_indexes
WHERE indexname = '{index_name}';
"""
)
result = conn.execute(index_query).scalar()
# Create the index if it doesn't exist
if not result:
index_statement = text(
# Check if the index exists
index_name = f"{self.collection_name}_embedding_idx"
index_query = text(
f"""
CREATE INDEX {index_name}
ON {self.collection_name} USING ann(embedding)
WITH (
"dim" = {self.embedding_dimension},
"hnsw_m" = 100
);
SELECT 1
FROM pg_indexes
WHERE indexname = '{index_name}';
"""
)
conn.execute(index_statement)
conn.commit()
result = conn.execute(index_query).scalar()
# Create the index if it doesn't exist
if not result:
index_statement = text(
f"""
CREATE INDEX {index_name}
ON {self.collection_name} USING ann(embedding)
WITH (
"dim" = {self.embedding_dimension},
"hnsw_m" = 100
);
"""
)
conn.execute(index_statement)
def create_collection(self) -> None:
if self.pre_delete_collection:
@@ -118,8 +118,8 @@ class AnalyticDB(VectorStore):
self.logger.debug("Trying to delete collection")
drop_statement = text(f"DROP TABLE IF EXISTS {self.collection_name};")
with self.engine.connect() as conn:
conn.execute(drop_statement)
conn.commit()
with conn.begin():
conn.execute(drop_statement)
def add_texts(
self,
@@ -160,30 +160,28 @@ class AnalyticDB(VectorStore):
chunks_table_data = []
with self.engine.connect() as conn:
for document, metadata, chunk_id, embedding in zip(
texts, metadatas, ids, embeddings
):
chunks_table_data.append(
{
"id": chunk_id,
"embedding": embedding,
"document": document,
"metadata": metadata,
}
)
with conn.begin():
for document, metadata, chunk_id, embedding in zip(
texts, metadatas, ids, embeddings
):
chunks_table_data.append(
{
"id": chunk_id,
"embedding": embedding,
"document": document,
"metadata": metadata,
}
)
# Execute the batch insert when the batch size is reached
if len(chunks_table_data) == batch_size:
# Execute the batch insert when the batch size is reached
if len(chunks_table_data) == batch_size:
conn.execute(insert(chunks_table).values(chunks_table_data))
# Clear the chunks_table_data list for the next batch
chunks_table_data.clear()
# Insert any remaining records that didn't make up a full batch
if chunks_table_data:
conn.execute(insert(chunks_table).values(chunks_table_data))
# Clear the chunks_table_data list for the next batch
chunks_table_data.clear()
# Insert any remaining records that didn't make up a full batch
if chunks_table_data:
conn.execute(insert(chunks_table).values(chunks_table_data))
# Commit the transaction only once after all records have been inserted
conn.commit()
return ids
@@ -333,9 +331,9 @@ class AnalyticDB(VectorStore):
) -> AnalyticDB:
"""
Return VectorStore initialized from texts and embeddings.
Postgres connection string is required
Postgres Connection string is required
Either pass it as a parameter
or set the PGVECTOR_CONNECTION_STRING environment variable.
or set the PG_CONNECTION_STRING environment variable.
"""
connection_string = cls.get_connection_string(kwargs)
@@ -363,7 +361,7 @@ class AnalyticDB(VectorStore):
raise ValueError(
"Postgres connection string is required"
"Either pass it as a parameter"
"or set the PGVECTOR_CONNECTION_STRING environment variable."
"or set the PG_CONNECTION_STRING environment variable."
)
return connection_string
@@ -381,9 +379,9 @@ class AnalyticDB(VectorStore):
) -> AnalyticDB:
"""
Return VectorStore initialized from documents and embeddings.
Postgres connection string is required
Postgres Connection string is required
Either pass it as a parameter
or set the PGVECTOR_CONNECTION_STRING environment variable.
or set the PG_CONNECTION_STRING environment variable.
"""
texts = [d.page_content for d in documents]
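Note: a standalone sketch of the transaction pattern this fix adopts, not part of the diff; the table and values are illustrative. Wrapping statements in an explicit `conn.begin()` block commits on success and rolls back on error, and it behaves consistently across SQLAlchemy 1.4 and 2.0 without calling `Connection.commit()`.

from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///:memory:")  # illustrative database
with engine.connect() as conn:
    with conn.begin():  # commits on success, rolls back on error
        conn.execute(text("CREATE TABLE demo (id INTEGER PRIMARY KEY, doc TEXT)"))
        conn.execute(text("INSERT INTO demo (doc) VALUES (:doc)"), {"doc": "hello"})
    rows = conn.execute(text("SELECT doc FROM demo")).fetchall()
print(rows)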

View File

@@ -16,6 +16,7 @@ from langchain.vectorstores.utils import maximal_marginal_relevance
if TYPE_CHECKING:
import chromadb
import chromadb.config
from chromadb.api.types import ID, OneOrMany, Where, WhereDocument
logger = logging.getLogger()
DEFAULT_K = 4 # Number of Documents to return.
@@ -228,7 +229,7 @@ class Chroma(VectorStore):
k: int = 4,
**kwargs: Any,
) -> List[Tuple[Document, float]]:
return self.similarity_search_with_score(query, k)
return self.similarity_search_with_score(query, k, **kwargs)
def max_marginal_relevance_search_by_vector(
self,
@@ -316,17 +317,43 @@ class Chroma(VectorStore):
"""Delete the collection."""
self._client.delete_collection(self._collection.name)
def get(self, include: Optional[List[str]] = None) -> Dict[str, Any]:
def get(
self,
ids: Optional[OneOrMany[ID]] = None,
where: Optional[Where] = None,
limit: Optional[int] = None,
offset: Optional[int] = None,
where_document: Optional[WhereDocument] = None,
include: Optional[List[str]] = None,
) -> Dict[str, Any]:
"""Gets the collection.
Args:
include (Optional[List[str]]): List of fields to include from db.
Defaults to None.
ids: The ids of the embeddings to get. Optional.
where: A Where type dict used to filter results by.
E.g. `{"color" : "red", "price": 4.20}`. Optional.
limit: The number of documents to return. Optional.
offset: The offset to start returning results from.
Useful for paging results with limit. Optional.
where_document: A WhereDocument type dict used to filter by the documents.
E.g. `{$contains: {"text": "hello"}}`. Optional.
include: A list of what to include in the results.
Can contain `"embeddings"`, `"metadatas"`, `"documents"`.
Ids are always included.
Defaults to `["metadatas", "documents"]`. Optional.
"""
kwargs = {
"ids": ids,
"where": where,
"limit": limit,
"offset": offset,
"where_document": where_document,
}
if include is not None:
return self._collection.get(include=include)
else:
return self._collection.get()
kwargs["include"] = include
return self._collection.get(**kwargs)
def persist(self) -> None:
"""Persist the collection.

View File

@@ -1,6 +1,6 @@
[tool.poetry]
name = "langchain"
version = "0.0.214"
version = "0.0.216"
description = "Building applications with LLMs through composability"
authors = []
license = "MIT"

View File

@@ -38,6 +38,21 @@ async def test_openai_callback() -> None:
assert cb.total_tokens == total_tokens
def test_openai_callback_batch_llm() -> None:
llm = OpenAI(temperature=0)
with get_openai_callback() as cb:
llm.generate(["What is the square root of 4?", "What is the square root of 4?"])
assert cb.total_tokens > 0
total_tokens = cb.total_tokens
with get_openai_callback() as cb:
llm("What is the square root of 4?")
llm("What is the square root of 4?")
assert cb.total_tokens == total_tokens
def test_openai_callback_agent() -> None:
llm = OpenAI(temperature=0)
tools = load_tools(["serpapi", "llm-math"], llm=llm)

View File

@@ -0,0 +1,15 @@
import os
from pathlib import Path
from langchain.document_loaders import UnstructuredRSTLoader
EXAMPLE_DIRECTORY = Path(__file__).parent.parent / "examples"
def test_unstructured_rst_loader() -> None:
"""Test unstructured loader."""
file_path = os.path.join(EXAMPLE_DIRECTORY, "README.rst")
loader = UnstructuredRSTLoader(str(file_path))
docs = loader.load()
assert len(docs) == 1

View File

@@ -18,4 +18,6 @@ def test_whatsapp_chat_loader() -> None:
"User 1 on 1/23/23, 3:22_AM: And let me know if anything changes\n\n"
"~ User name 2 on 1/24/21, 12:41:03 PM: Of course!\n\n"
"~ User 2 on 2023/5/4, 16:13:23: See you!\n\n"
"User 1 on 7/19/22, 11:32PM: Hello\n\n"
"User 2 on 7/20/22, 11:32am: Goodbye\n\n"
)

View File

@@ -0,0 +1,28 @@
Example Docs
------------
The sample docs directory contains the following files:
- ``example-10k.html`` - A 10-K SEC filing in HTML format
- ``layout-parser-paper.pdf`` - A PDF copy of the layout parser paper
- ``factbook.xml``/``factbook.xsl`` - Example XML/XSL files that you
can use to test stylesheets
These documents can be used to test out the parsers in the library. In
addition, here are instructions for pulling in some sample docs that are
too big to store in the repo.
XBRL 10-K
^^^^^^^^^
You can get an example 10-K in inline XBRL format using the following
``curl``. Note, you need to have the user agent set in the header or the
SEC site will reject your request.
.. code:: bash
curl -O \
-A '${organization} ${email}' \
https://www.sec.gov/Archives/edgar/data/311094/000117184321001344/0001171843-21-001344.txt
You can parse this document using the HTML parser.

View File

@@ -0,0 +1,108 @@
From: <Saved by Blink>
Snapshot-Content-Location: https://langchain.com/
Subject:
Date: Fri, 16 Jun 2023 19:32:59 -0000
MIME-Version: 1.0
Content-Type: multipart/related;
type="text/html";
boundary="----MultipartBoundary--dYaUgeoeP18TqraaeOwkeZyu1vI09OtkFwH2rcnJMt----"
------MultipartBoundary--dYaUgeoeP18TqraaeOwkeZyu1vI09OtkFwH2rcnJMt----
Content-Type: text/html
Content-ID: <frame-2F1DB31BBD26C55A7F1EEC7561350515@mhtml.blink>
Content-Transfer-Encoding: quoted-printable
Content-Location: https://langchain.com/
<html><head><title>LangChain</title><meta http-equiv=3D"Content-Type" content=3D"text/html; charset=
=3DUTF-8"><link rel=3D"stylesheet" type=3D"text/css" href=3D"cid:css-c9ac93=
be-2ab2-46d8-8690-80da3a6d1832@mhtml.blink" /></head><body data-new-gr-c-s-=
check-loaded=3D"14.1112.0" data-gr-ext-installed=3D""><p align=3D"center">
<b><font size=3D"6">L</font><font size=3D"4">ANG </font><font size=3D"6">C=
</font><font size=3D"4">HAIN </font><font size=3D"2">=F0=9F=A6=9C=EF=B8=8F=
=F0=9F=94=97</font><br>Official Home Page</b><font size=3D"1">&nbsp;</font>=
</p>
<hr>
<center>
<table border=3D"0" cellspacing=3D"0" width=3D"90%">
<tbody>
<tr>
<td height=3D"55" valign=3D"top" width=3D"50%">
<ul>
<li><a href=3D"https://langchain.com/integrations.html">Integration=
s</a>=20
</li></ul></td>
<td height=3D"45" valign=3D"top" width=3D"50%">
<ul>
<li><a href=3D"https://langchain.com/features.html">Features</a>=20
</li></ul></td></tr>
<tr>
<td height=3D"55" valign=3D"top" width=3D"50%">
<ul>
<li><a href=3D"https://blog.langchain.dev/">Blog</a>=20
</li></ul></td>
<td height=3D"45" valign=3D"top" width=3D"50%">
<ul>
<li><a href=3D"https://docs.langchain.com/docs/">Conceptual Guide</=
a>=20
</li></ul></td></tr>
<tr>
<td height=3D"45" valign=3D"top" width=3D"50%">
<ul>
<li><a href=3D"https://github.com/hwchase17/langchain">Python Repo<=
/a></li></ul></td>
<td height=3D"45" valign=3D"top" width=3D"50%">
<ul>
<li><a href=3D"https://github.com/hwchase17/langchainjs">JavaScript=
Repo</a></li></ul></td></tr>
=20
=09
<tr>
<td height=3D"45" valign=3D"top" width=3D"50%">
<ul>
<li><a href=3D"https://python.langchain.com/en/latest/">Python Docu=
mentation</a> </li></ul></td>
<td height=3D"45" valign=3D"top" width=3D"50%">
<ul>
<li><a href=3D"https://js.langchain.com/docs/">JavaScript Document=
ation</a>
</li></ul></td></tr>
<tr>
<td height=3D"45" valign=3D"top" width=3D"50%">
<ul>
<li><a href=3D"https://github.com/hwchase17/chat-langchain">Python =
ChatLangChain</a> </li></ul></td>
<td height=3D"45" valign=3D"top" width=3D"50%">
<ul>
<li><a href=3D"https://github.com/sullivan-sean/chat-langchainjs">=
JavaScript ChatLangChain</a>
</li></ul></td></tr>
<tr>
<td height=3D"45" valign=3D"top" width=3D"50%">
<ul>
<li><a href=3D"https://discord.gg/6adMQxSpJS">Discord</a> </li></ul=
></td>
<td height=3D"55" valign=3D"top" width=3D"50%">
<ul>
<li><a href=3D"https://twitter.com/langchainai">Twitter</a>
</li></ul></td></tr>
=09
</tbody></table></center>
<hr>
<font size=3D"2">
<p>If you have any comments about our WEB page, you can=20
write us at the address shown above. However, due to=20
the limited number of personnel in our corporate office, we are unable to=
=20
provide a direct response.</p></font>
<hr>
<p align=3D"left"><font size=3D"2">Copyright =C2=A9 2023-2023<b> LangChain =
Inc.</b></font><font size=3D"2">=20
</font></p>
</body></html>
------MultipartBoundary--dYaUgeoeP18TqraaeOwkeZyu1vI09OtkFwH2rcnJMt------

View File

@@ -3,4 +3,6 @@
1/23/23, 3:19 AM - User 2: Bye!
1/23/23, 3:22_AM - User 1: And let me know if anything changes
[1/24/21, 12:41:03 PM] ~ User name 2: Of course!
[2023/5/4, 16:13:23] ~ User 2: See you!
[2023/5/4, 16:13:23] ~ User 2: See you!
7/19/22, 11:32PM - User 1: Hello
7/20/22, 11:32am - User 2: Goodbye

View File

@@ -96,6 +96,15 @@ def test_openai_streaming() -> None:
assert isinstance(token["choices"][0]["text"], str)
def test_openai_multiple_prompts() -> None:
"""Test completion with multiple prompts."""
llm = OpenAI(max_tokens=10)
output = llm.generate(["I'm Pickle Rick", "I'm Pickle Rick"])
assert isinstance(output, LLMResult)
assert isinstance(output.generations, list)
assert len(output.generations) == 2
def test_openai_streaming_error() -> None:
"""Test error handling in stream."""
llm = OpenAI(best_of=2)

View File

@@ -27,3 +27,17 @@ def test_create_ticket() -> None:
output = jira.run("create_issue", issue_string)
assert "id" in output
assert "key" in output
def test_create_confluence_page() -> None:
"""Test for getting projects on JIRA"""
jira = JiraAPIWrapper()
create_page_dict = (
'{"space": "ROC", "title":"This is the title",'
'"body":"This is the body. You can use '
'<strong>HTML tags</strong>!"}'
)
output = jira.run("create_page", create_page_dict)
assert "type" in output
assert "page" in output

View File

@@ -28,6 +28,10 @@ class FakeListLLM(LLM):
print(self.responses[self.i])
return self.responses[self.i]
def get_num_tokens(self, text: str) -> int:
"""Return number of tokens in text."""
return len(text.split())
@property
def _identifying_params(self) -> Mapping[str, Any]:
return {}

View File

@@ -0,0 +1,23 @@
"""Test the initialize module."""
from langchain.agents.agent_types import AgentType
from langchain.agents.initialize import initialize_agent
from langchain.tools.base import tool
from tests.unit_tests.llms.fake_llm import FakeLLM
@tool
def my_tool(query: str) -> str:
"""A fake tool."""
return "fake tool"
def test_initialize_agent_with_str_agent_type() -> None:
"""Test initialize_agent with a string."""
fake_llm = FakeLLM()
agent_executor = initialize_agent(
[my_tool], fake_llm, "zero-shot-react-description" # type: ignore
)
assert agent_executor.agent._agent_type == AgentType.ZERO_SHOT_REACT_DESCRIPTION
assert isinstance(agent_executor.tags, list)
assert "zero-shot-react-description" in agent_executor.tags

View File

@@ -18,11 +18,12 @@ def _test_callback_manager(
manager: CallbackManager, *handlers: BaseFakeCallbackHandler
) -> None:
"""Test the CallbackManager."""
run_manager = manager.on_llm_start({}, [])
run_manager.on_llm_end(LLMResult(generations=[]))
run_manager.on_llm_error(Exception())
run_manager.on_llm_new_token("foo")
run_manager.on_text("foo")
run_managers = manager.on_llm_start({}, ["prompt"])
for run_manager in run_managers:
run_manager.on_llm_end(LLMResult(generations=[]))
run_manager.on_llm_error(Exception())
run_manager.on_llm_new_token("foo")
run_manager.on_text("foo")
run_manager_chain = manager.on_chain_start({"name": "foo"}, {})
run_manager_chain.on_chain_end({})
@@ -42,11 +43,12 @@ async def _test_callback_manager_async(
manager: AsyncCallbackManager, *handlers: BaseFakeCallbackHandler
) -> None:
"""Test the CallbackManager."""
run_manager = await manager.on_llm_start({}, [])
await run_manager.on_llm_end(LLMResult(generations=[]))
await run_manager.on_llm_error(Exception())
await run_manager.on_llm_new_token("foo")
await run_manager.on_text("foo")
run_managers = await manager.on_llm_start({}, ["prompt"])
for run_manager in run_managers:
await run_manager.on_llm_end(LLMResult(generations=[]))
await run_manager.on_llm_error(Exception())
await run_manager.on_llm_new_token("foo")
await run_manager.on_text("foo")
run_manager_chain = await manager.on_chain_start({"name": "foo"}, {})
await run_manager_chain.on_chain_end({})
@@ -95,9 +97,10 @@ def test_ignore_llm() -> None:
handler1 = FakeCallbackHandler(ignore_llm_=True)
handler2 = FakeCallbackHandler()
manager = CallbackManager(handlers=[handler1, handler2])
run_manager = manager.on_llm_start({}, [])
run_manager.on_llm_end(LLMResult(generations=[]))
run_manager.on_llm_error(Exception())
run_managers = manager.on_llm_start({}, ["prompt"])
for run_manager in run_managers:
run_manager.on_llm_end(LLMResult(generations=[]))
run_manager.on_llm_error(Exception())
assert handler1.starts == 0
assert handler1.ends == 0
assert handler1.errors == 0

View File

@@ -11,7 +11,7 @@ from freezegun import freeze_time
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.tracers.base import BaseTracer, TracerException
from langchain.callbacks.tracers.schemas import Run
from langchain.schema import LLMResult
from langchain.schema import HumanMessage, LLMResult
SERIALIZED = {"id": ["llm"]}
SERIALIZED_CHAT = {"id": ["chat_model"]}
@@ -58,9 +58,13 @@ def test_tracer_llm_run() -> None:
@freeze_time("2023-01-01")
def test_tracer_chat_model_run() -> None:
"""Test tracer on a Chat Model run."""
uuid = uuid4()
tracer = FakeTracer()
manager = CallbackManager(handlers=[tracer])
run_managers = manager.on_chat_model_start(
serialized=SERIALIZED_CHAT, messages=[[HumanMessage(content="")]]
)
compare_run = Run(
id=str(uuid),
id=str(run_managers[0].run_id),
name="chat_model",
start_time=datetime.utcnow(),
end_time=datetime.utcnow(),
@@ -68,17 +72,13 @@ def test_tracer_chat_model_run() -> None:
execution_order=1,
child_execution_order=1,
serialized=SERIALIZED_CHAT,
inputs=dict(prompts=[""]),
inputs=dict(prompts=["Human: "]),
outputs=LLMResult(generations=[[]]),
error=None,
run_type="llm",
)
tracer = FakeTracer()
manager = CallbackManager(handlers=[tracer])
run_manager = manager.on_chat_model_start(
serialized=SERIALIZED_CHAT, messages=[[]], run_id=uuid
)
run_manager.on_llm_end(response=LLMResult(generations=[[]]))
for run_manager in run_managers:
run_manager.on_llm_end(response=LLMResult(generations=[[]]))
assert tracer.runs == [compare_run]

View File

@@ -18,7 +18,7 @@ from langchain.callbacks.tracers.langchain_v1 import (
TracerSessionV1,
)
from langchain.callbacks.tracers.schemas import Run, RunTypeEnum, TracerSessionV1Base
from langchain.schema import LLMResult
from langchain.schema import HumanMessage, LLMResult
TEST_SESSION_ID = 2023
@@ -127,9 +127,15 @@ def test_tracer_llm_run() -> None:
@freeze_time("2023-01-01")
def test_tracer_chat_model_run() -> None:
"""Test tracer on a Chat Model run."""
uuid = uuid4()
tracer = FakeTracer()
tracer.new_session()
manager = CallbackManager(handlers=[tracer])
run_managers = manager.on_chat_model_start(
serialized=SERIALIZED_CHAT, messages=[[HumanMessage(content="")]]
)
compare_run = LLMRun(
uuid=str(uuid),
uuid=str(run_managers[0].run_id),
parent_uuid=None,
start_time=datetime.utcnow(),
end_time=datetime.utcnow(),
@@ -137,19 +143,13 @@ def test_tracer_chat_model_run() -> None:
execution_order=1,
child_execution_order=1,
serialized=SERIALIZED_CHAT,
prompts=[""],
prompts=["Human: "],
response=LLMResult(generations=[[]]),
session_id=TEST_SESSION_ID,
error=None,
)
tracer = FakeTracer()
tracer.new_session()
manager = CallbackManager(handlers=[tracer])
run_manager = manager.on_chat_model_start(
serialized=SERIALIZED_CHAT, messages=[[]], run_id=uuid
)
run_manager.on_llm_end(response=LLMResult(generations=[[]]))
for run_manager in run_managers:
run_manager.on_llm_end(response=LLMResult(generations=[[]]))
assert tracer.runs == [compare_run]

View File

@@ -49,6 +49,10 @@ class FakeLLM(BaseLLM):
) -> LLMResult:
return LLMResult(generations=[[Generation(text="foo") for _ in range(self.n)]])
def get_num_tokens(self, text: str) -> int:
"""Return number of tokens."""
return len(text.split())
@property
def _llm_type(self) -> str:
"""Return type of llm."""

View File

@@ -28,6 +28,9 @@ class FakeLLM(LLM):
"""Return type of llm."""
return "fake"
def get_num_tokens(self, text: str) -> int:
return len(text.split())
@property
def _identifying_params(self) -> Mapping[str, Any]:
return {}

View File

@@ -0,0 +1,25 @@
from pathlib import Path
import pytest
from langchain.document_loaders.mhtml import MHTMLLoader
HERE = Path(__file__).parent
EXAMPLES = HERE.parent.parent / "integration_tests" / "examples"
@pytest.mark.requires("bs4", "lxml")
def test_mhtml_loader() -> None:
"""Test mhtml loader."""
file_path = EXAMPLES / "example.mht"
loader = MHTMLLoader(str(file_path))
docs = loader.load()
assert len(docs) == 1
metadata = docs[0].metadata
content = docs[0].page_content
assert metadata["title"] == "LangChain"
assert metadata["source"] == str(file_path)
assert "LANG CHAIN 🦜🔗Official Home Page" in content

View File

@@ -0,0 +1,39 @@
"""Test the comparison chains."""
from langchain.evaluation.comparison.eval_chain import PairwiseStringEvalChain
from tests.unit_tests.llms.fake_llm import FakeLLM
def test_pairwise_string_comparison_chain() -> None:
llm = FakeLLM(
queries={
"a": "The values are the same.\n[[C]]",
"b": "A is clearly better than b.\n[[A]]",
"c": "B is clearly better than a.\n[[B]]",
},
sequential_responses=True,
)
chain = PairwiseStringEvalChain.from_llm(llm=llm)
res = chain.evaluate_string_pairs(
output_a="I like pie.",
output_b="I love pie.",
input="What is your favorite food?",
)
assert res["value"] is None
assert res["score"] == 0.5
assert res["reasoning"] == "The values are the same."
res = chain.evaluate_string_pairs(
output_a="I like pie.",
output_b="I like pie.",
input="What is your favorite food?",
)
assert res["value"] == "A"
assert res["score"] == 1
res = chain.evaluate_string_pairs(
output_a="I like pie.",
output_b="I hate pie.",
input="What is your favorite food?",
)
assert res["value"] == "B"
assert res["score"] == 0
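Note: a hedged sketch of real usage (as opposed to the FakeLLM above), not part of the diff; it assumes OPENAI_API_KEY is set.

from langchain.chat_models import ChatOpenAI
from langchain.evaluation.comparison.eval_chain import PairwiseStringEvalChain

eval_chain = PairwiseStringEvalChain.from_llm(llm=ChatOpenAI(temperature=0))
result = eval_chain.evaluate_string_pairs(
    output_a="Paris is the capital of France.",
    output_b="I think it's Lyon?",
    input="What is the capital of France?",
)
print(result["value"], result["score"], result["reasoning"])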

Some files were not shown because too many files have changed in this diff.