# Evaluating Agent Trajectories

Good evaluation is key for quickly iterating on your agent's prompts and tools.

Here we provide an example of how to use the `TrajectoryEvalChain` to evaluate the efficacy of the actions taken by your agent.

## Setup

Let's start by defining our agent.

In [2]:
from langchain import Wikipedia
from langchain.chat_models import ChatOpenAI
from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType
from langchain.agents.react.base import DocstoreExplorer
from langchain.memory import ConversationBufferMemory
from langchain import LLMMathChain
from langchain.llms import OpenAI

from langchain import SerpAPIWrapper

docstore = DocstoreExplorer(Wikipedia())

math_llm = OpenAI(temperature=0)

llm_math_chain = LLMMathChain.from_llm(llm=math_llm, verbose=True)

search = SerpAPIWrapper()

tools = [
    Tool(
        name="Search",
        func=docstore.search,
        description="useful for when you need to ask with search. Must call before lookup.",
    ),
    Tool(
        name="Lookup",
        func=docstore.lookup,
        description="useful for when you need to ask with lookup. Only call after a successful 'Search'.",
    ),
    Tool(
        name="Calculator",
        func=llm_math_chain.run,
        description="useful for arithmetic. Expects strict numeric input, no words.",
    ),
    Tool(
        name="Search-the-Web-SerpAPI",
        func=search.run,
        description="useful for when you need to answer questions about current events",
    ),
]

memory = ConversationBufferMemory(
    memory_key="chat_history", return_messages=True, output_key="output"
)

llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-0613")

agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.OPENAI_FUNCTIONS,
    verbose=True,
    memory=memory,
    return_intermediate_steps=True,  # This is needed for the evaluation later
)
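
Setting `return_intermediate_steps=True` makes the agent return its trajectory under the `intermediate_steps` key as a list of `(action, observation)` pairs, where each action records the tool that was called and the input it received. A minimal sketch for printing such a trajectory follows. `Step` is a hypothetical stand-in for LangChain's `AgentAction` (so the snippet runs without API keys), and `format_trajectory` is an illustrative helper, not part of the library.

```python
from typing import Any, List, NamedTuple, Tuple


class Step(NamedTuple):
    """Hypothetical stand-in for an agent action (assumed fields: tool, tool_input)."""

    tool: str
    tool_input: str


def format_trajectory(steps: List[Tuple[Any, str]]) -> str:
    """Render (action, observation) pairs as numbered, human-readable lines."""
    lines = []
    for i, (action, observation) in enumerate(steps, start=1):
        lines.append(f"Step {i}: {action.tool}({action.tool_input!r}) -> {observation}")
    return "\n".join(lines)


# Made-up sample data, for illustration only.
sample_steps = [(Step("Calculator", "2 + 2"), "Answer: 4")]
print(format_trajectory(sample_steps))
```

Once the agent has run, passing the `intermediate_steps` value from its result to a helper like this should work the same way, since the real action objects expose `tool` and `tool_input` attributes.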

## Test the Agent

Now let's try our agent out on some example queries.

In [3]:
query_one = (
    "How many ping pong balls would it take to fill the entire Empire State Building?"
)

test_outputs_one = agent({"input": query_one}, return_only_outputs=False)

> Entering new  chain...

Invoking: `Calculator` with `1040000 / (4/100)^3 / 1000000`
responded: {content}

> Entering new  chain...
1040000 / (4/100)^3 / 1000000
```text
1040000 / (4/100)**3 / 1000000
```
...numexpr.evaluate("1040000 / (4/100)**3 / 1000000")...

Answer: 16249.999999999998
> Finished chain.
Answer: 16249.999999999998
It would take approximately 16,250 ping pong balls to fill the entire Empire State Building.

> Finished chain.


This looks alright. Let's try it out on another query.

In [4]:
query_two = "If you laid the Eiffel Tower end to end, how many would you need to cover the US from coast to coast?"

test_outputs_two = agent({"input": query_two}, return_only_outputs=False)

> Entering new  chain...

Invoking: `Search` with `length of the US from coast to coast`

== Watercraft ==

Invoking: `Search` with `distance from coast to coast of the US`

The Oregon Coast is a coastal region of the U.S. state of Oregon. It is bordered by the Pacific Ocean to its west and the Oregon Coast Range to the east, and stretches approximately 362 miles (583 km) from the California state border in the south to the Columbia River in the north. The region is not a specific geological, environmental, or political entity, and includes the Columbia River Estuary.
The Oregon Beach Bill of 1967 allows free beach access to everyone.  In return for a pedestrian easement and relief from construction, the bill eliminates property taxes on private beach land and allows its owners to retain certain beach land rights. Traditionally, the Oregon Coast is regarded as three distinct sub–regions:
The North Coast, which s

This doesn't look so good. Let's try running some evaluation.

## Evaluating the Agent

Let's start by defining the TrajectoryEvalChain.

In [5]:
from langchain.evaluation.agents import TrajectoryEvalChain

# Define chain
eval_llm = ChatOpenAI(temperature=0, model_name="gpt-4")
eval_chain = TrajectoryEvalChain.from_llm(
    llm=eval_llm,  # Note: This must be a chat model
    agent_tools=agent.tools,
    return_reasoning=True,
)

Let's try evaluating the first query.

In [6]:
question, steps, answer = (
    test_outputs_one["input"],
    test_outputs_one["intermediate_steps"],
    test_outputs_one["output"],
)

# Reuse the values unpacked above instead of indexing the result dict again
evaluation = eval_chain.evaluate_agent_trajectory(
    input=question,
    output=answer,
    agent_trajectory=steps,
)

print("Score from 1 to 5: ", evaluation["score"])
print("Reasoning: ", evaluation["reasoning"])

Score from 1 to 5:  1
Reasoning:  i. Is the final answer helpful?
The final answer is not helpful because it is incorrect. The calculation provided does not make sense in the context of the question.

ii. Does the AI language use a logical sequence of tools to answer the question?
The AI language model does not use a logical sequence of tools. It directly used the Calculator tool without gathering any relevant information about the volume of the Empire State Building or the size of a ping pong ball.

iii. Does the AI language model use the tools in a helpful way?
The AI language model does not use the tools in a helpful way. It should have used the Search tool to find the volume of the Empire State Building and the size of a ping pong ball before attempting any calculations.

iv. Does the AI language model use too many steps to answer the question?
The AI language model used only one step, which was not enough to answer the question correctly. It should have used more steps to gather t

**That seems about right. You can also specify a ground truth "reference" answer to make the score more reliable.**

In [13]:
evaluation = eval_chain.evaluate_agent_trajectory(
    input=test_outputs_one["input"],
    output=test_outputs_one["output"],
    agent_trajectory=test_outputs_one["intermediate_steps"],
    reference=(
        "You need many more than 100,000 ping-pong balls in the empire state building."
    ),
)


print("Score from 1 to 5: ", evaluation["score"])
print("Reasoning: ", evaluation["reasoning"])

Score from 1 to 5:  1
Reasoning:  i. Is the final answer helpful?
The final answer is not helpful, as it is incorrect. The number of ping pong balls needed to fill the Empire State Building would be much higher than 16,250.

ii. Does the AI language use a logical sequence of tools to answer the question?
The AI language model does not use a logical sequence of tools. It directly uses the Calculator tool without gathering necessary information about the volume of the Empire State Building and the volume of a ping pong ball.

iii. Does the AI language model use the tools in a helpful way?
The AI language model does not use the tools in a helpful way. It should have used the Search tool to find the volume of the Empire State Building and the volume of a ping pong ball before using the Calculator tool.

iv. Does the AI language model use too many steps to answer the question?
The AI language model does not use too many steps, but it skips essential steps to answer the question correctly.
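
As a rough sanity check on the evaluator's claim that the true number is much higher than 16,250, the back-of-the-envelope estimate below may help. It is a sketch under stated assumptions, none of which come from the evaluator: the 1,040,000 figure in the agent's Calculator input is read as the building's volume in cubic meters, a ball is taken to be 40 mm across, and randomly packed spheres are assumed to fill about 64% of a container.

```python
import math

# Assumptions (not from the evaluator): the 1,040,000 figure from the agent's
# Calculator input is read as the building volume in cubic meters, a ball is
# 40 mm across, and randomly packed spheres fill about 64% of a container.
BUILDING_VOLUME_M3 = 1_040_000
BALL_DIAMETER_M = 0.04
PACKING_FRACTION = 0.64

# Volume of one ball, treated as a sphere.
ball_volume_m3 = (4 / 3) * math.pi * (BALL_DIAMETER_M / 2) ** 3
estimated_balls = PACKING_FRACTION * BUILDING_VOLUME_M3 / ball_volume_m3

# The agent's expression, including its unexplained division by 1,000,000.
agent_answer = 1_040_000 / (4 / 100) ** 3 / 1_000_000

print(f"Rough estimate: {estimated_balls:.2e} balls")
print(f"Agent's answer: {agent_answer:,.0f} balls")
```

Under these assumptions the estimate lands on the order of 10^10 balls, several orders of magnitude above both the agent's answer and the 100,000 floor given in the reference.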



**Let's try the second query. This time, use the async API. If we wanted to evaluate multiple runs at once, this would let us add some concurrency.**

In [14]:
evaluation = await eval_chain.aevaluate_agent_trajectory(
    input=test_outputs_two["input"],
    output=test_outputs_two["output"],
    agent_trajectory=test_outputs_two["intermediate_steps"],
)

print("Score from 1 to 5: ", evaluation["score"])
print("Reasoning: ", evaluation["reasoning"])

Score from 1 to 5:  2
Reasoning:  i. Is the final answer helpful?
The final answer is not helpful because it uses the wrong distance for the coast-to-coast measurement of the US. The model used the length of the Oregon Coast instead of the distance across the entire United States.

ii. Does the AI language use a logical sequence of tools to answer the question?
The sequence of tools is logical, but the information obtained from the Search tool is incorrect, leading to an incorrect final answer.

iii. Does the AI language model use the tools in a helpful way?
The AI language model uses the tools in a helpful way, but the information obtained from the Search tool is incorrect. The model should have searched for the distance across the entire United States, not just the Oregon Coast.

iv. Does the AI language model use too many steps to answer the question?
The AI language model does not use too many steps to answer the question. The number of steps is appropriate, but the information obt
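
To illustrate the concurrency point, here is a minimal sketch of evaluating several runs at once with `asyncio.gather`. `stub_evaluate` is a hypothetical stand-in that returns made-up scores so the snippet runs without API keys, and `evaluate_all` is an illustrative helper rather than a library function. The keyword arguments mirror those passed to `aevaluate_agent_trajectory` above.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Evaluator = Callable[..., Awaitable[Dict[str, Any]]]


async def stub_evaluate(input: str, output: str, agent_trajectory: list) -> Dict[str, Any]:
    """Hypothetical stand-in for an async evaluator; returns a made-up score."""
    await asyncio.sleep(0.01)  # simulate network latency
    return {"score": 1 if not agent_trajectory else 2, "reasoning": f"stub for: {input}"}


async def evaluate_all(runs: List[Dict[str, Any]], evaluate: Evaluator) -> List[Dict[str, Any]]:
    """Evaluate every run concurrently; results keep the order of `runs`."""
    tasks = [
        evaluate(
            input=run["input"],
            output=run["output"],
            agent_trajectory=run["intermediate_steps"],
        )
        for run in runs
    ]
    return list(await asyncio.gather(*tasks))


# Made-up runs shaped like the agent's result dicts.
runs = [
    {"input": "q1", "output": "a1", "intermediate_steps": []},
    {"input": "q2", "output": "a2", "intermediate_steps": [("action", "observation")]},
]
# In a script use asyncio.run; in a notebook, `await evaluate_all(...)` instead.
results = asyncio.run(evaluate_all(runs, stub_evaluate))
print([r["score"] for r in results])
```

With the real chain, passing `eval_chain.aevaluate_agent_trajectory` as `evaluate` and `[test_outputs_one, test_outputs_two]` as `runs` should follow the same pattern, though rate limits may call for capping how many evaluations run at once.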

## Conclusion

In this example, you evaluated an agent based on its entire "trajectory" using the `TrajectoryEvalChain`. You instructed GPT-4 to score both the agent's outputs and its tool use, and to give the reasoning behind each evaluation.

Agents can be complicated, and testing them thoroughly requires using multiple methodologies. Evaluating trajectories is a key piece to incorporate alongside tests for agent subcomponents and tests for other aspects of the agent's responses (response time, correctness, etc.).
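
One way to fold trajectory scores into a test suite alongside those other checks is a simple threshold gate. The sketch below assumes the evaluation is a dict with a 1-to-5 `score` key, as printed earlier. The helper name `passes_trajectory_check` and the default threshold of 4 are illustrative choices, not library conventions.

```python
from typing import Any, Dict


def passes_trajectory_check(evaluation: Dict[str, Any], min_score: int = 4) -> bool:
    """Return True when the 1-to-5 trajectory score meets the chosen threshold."""
    return int(evaluation["score"]) >= min_score


# Illustrative evaluation dicts shaped like the ones printed in this guide.
weak = {"score": 1, "reasoning": "skipped essential steps"}
strong = {"score": 5, "reasoning": "logical, efficient tool use"}

print(passes_trajectory_check(weak))    # False
print(passes_trajectory_check(strong))  # True
```

Because the grader is itself a language model, scores can vary between runs, so a gate like this is probably better suited to flagging regressions for review than to serving as a hard pass/fail criterion.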