# Evaluating an OpenAPI Chain

This notebook goes over ways to semantically evaluate an [OpenAPI Chain](/docs/modules/chains/additional/openapi.html), which calls an endpoint defined by the OpenAPI specification using purely natural language.

In [1]:
from langchain.tools import OpenAPISpec, APIOperation
from langchain.chains import OpenAPIEndpointChain, LLMChain
from langchain.requests import Requests
from langchain.llms import OpenAI

## Load the API Chain

Load a wrapper of the spec (so we can work with it more easily). You can load from a url or from a local file.

In [2]:
# Load and parse the OpenAPI Spec
spec = OpenAPISpec.from_url(
    "https://www.klarna.com/us/shopping/public/openai/v0/api-docs/"
)
# Load a single endpoint operation
operation = APIOperation.from_openapi_spec(spec, "/public/openai/v0/products", "get")
verbose = False
# Select any LangChain LLM
llm = OpenAI(temperature=0, max_tokens=1000)
# Create the endpoint chain
api_chain = OpenAPIEndpointChain.from_api_operation(
    operation,
    llm,
    requests=Requests(),
    verbose=verbose,
    return_intermediate_steps=True,  # Return request and response text
)

Attempting to load an OpenAPI 3.0.1 spec.  This may result in degraded performance. Convert your OpenAPI spec to 3.1.* spec for better support.


### *Optional*: Generate Input Questions and Request Ground Truth Queries

See [Generating Test Datasets](#Generating-Test-Datasets) at the end of this notebook for more details.

In [3]:
# import re
# from langchain.prompts import PromptTemplate

# template = """Below is a service description:

# {spec}

# Imagine you're a new user trying to use {operation} through a search bar. What are 10 different things you want to request?
# Wants/Questions:
# 1. """

# prompt = PromptTemplate.from_template(template)

# generation_chain = LLMChain(llm=llm, prompt=prompt)

# questions_ = generation_chain.run(spec=operation.to_typescript(), operation=operation.operation_id).split('\n')
# # Strip preceding numeric bullets
# questions = [re.sub(r'^\d+\. ', '', q).strip() for q in questions_]
# questions

In [4]:
# ground_truths = [
# {"q": ...} # What are the best queries for each input?
# ]

## Run the API Chain

The two simplest questions a user of the API Chain are:
- Did the chain succesfully access the endpoint?
- Did the action accomplish the correct result?


In [5]:
from collections import defaultdict

# Collect metrics to report at completion
scores = defaultdict(list)

In [6]:
from langchain.evaluation.loading import load_dataset

dataset = load_dataset("openapi-chain-klarna-products-get")

Found cached dataset json (/Users/harrisonchase/.cache/huggingface/datasets/LangChainDatasets___json/LangChainDatasets--openapi-chain-klarna-products-get-5d03362007667626/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)


  0%|          | 0/1 [00:00<?, ?it/s]

In [7]:
dataset

[{'question': 'What iPhone models are available?',
  'expected_query': {'max_price': None, 'q': 'iPhone'}},
 {'question': 'Are there any budget laptops?',
  'expected_query': {'max_price': 300, 'q': 'laptop'}},
 {'question': 'Show me the cheapest gaming PC.',
  'expected_query': {'max_price': 500, 'q': 'gaming pc'}},
 {'question': 'Are there any tablets under $400?',
  'expected_query': {'max_price': 400, 'q': 'tablet'}},
 {'question': 'What are the best headphones?',
  'expected_query': {'max_price': None, 'q': 'headphones'}},
 {'question': 'What are the top rated laptops?',
  'expected_query': {'max_price': None, 'q': 'laptop'}},
 {'question': 'I want to buy some shoes. I like Adidas and Nike.',
  'expected_query': {'max_price': None, 'q': 'shoe'}},
 {'question': 'I want to buy a new skirt',
  'expected_query': {'max_price': None, 'q': 'skirt'}},
 {'question': 'My company is asking me to get a professional Deskopt PC - money is no object.',
  'expected_query': {'max_price': 10000, 'q

In [8]:
questions = [d["question"] for d in dataset]

In [9]:
## Run the the API chain itself
raise_error = False  # Stop on first failed example - useful for development
chain_outputs = []
failed_examples = []
for question in questions:
    try:
        chain_outputs.append(api_chain(question))
        scores["completed"].append(1.0)
    except Exception as e:
        if raise_error:
            raise e
        failed_examples.append({"q": question, "error": e})
        scores["completed"].append(0.0)

In [10]:
# If the chain failed to run, show the failing examples
failed_examples

[]

In [11]:
answers = [res["output"] for res in chain_outputs]
answers

['There are currently 10 Apple iPhone models available: Apple iPhone 14 Pro Max 256GB, Apple iPhone 12 128GB, Apple iPhone 13 128GB, Apple iPhone 14 Pro 128GB, Apple iPhone 14 Pro 256GB, Apple iPhone 14 Pro Max 128GB, Apple iPhone 13 Pro Max 128GB, Apple iPhone 14 128GB, Apple iPhone 12 Pro 512GB, and Apple iPhone 12 mini 64GB.',
 'Yes, there are several budget laptops in the API response. For example, the HP 14-dq0055dx and HP 15-dw0083wm are both priced at $199.99 and $244.99 respectively.',
 'The cheapest gaming PC available is the Alarco Gaming PC (X_BLACK_GTX750) for $499.99. You can find more information about it here: https://www.klarna.com/us/shopping/pl/cl223/3203154750/Desktop-Computers/Alarco-Gaming-PC-%28X_BLACK_GTX750%29/?utm_source=openai&ref-site=openai_plugin',
 'Yes, there are several tablets under $400. These include the Apple iPad 10.2" 32GB (2019), Samsung Galaxy Tab A8 10.5 SM-X200 32GB, Samsung Galaxy Tab A7 Lite 8.7 SM-T220 32GB, Amazon Fire HD 8" 32GB (10th Gene

## Evaluate the requests chain

The API Chain has two main components:
1. Translate the user query to an API request (request synthesizer)
2. Translate the API response to a natural language response

Here, we construct an evaluation chain to grade the request synthesizer against selected human queries 

In [12]:
import json

truth_queries = [json.dumps(data["expected_query"]) for data in dataset]

In [13]:
# Collect the API queries generated by the chain
predicted_queries = [
    output["intermediate_steps"]["request_args"] for output in chain_outputs
]

In [14]:
from langchain.prompts import PromptTemplate

template = """You are trying to answer the following question by querying an API:

> Question: {question}

The query you know you should be executing against the API is:

> Query: {truth_query}

Is the following predicted query semantically the same (eg likely to produce the same answer)?

> Predicted Query: {predict_query}

Please give the Predicted Query a grade of either an A, B, C, D, or F, along with an explanation of why. End the evaluation with 'Final Grade: <the letter>'

> Explanation: Let's think step by step."""

prompt = PromptTemplate.from_template(template)

eval_chain = LLMChain(llm=llm, prompt=prompt, verbose=verbose)

In [15]:
request_eval_results = []
for question, predict_query, truth_query in list(
    zip(questions, predicted_queries, truth_queries)
):
    eval_output = eval_chain.run(
        question=question,
        truth_query=truth_query,
        predict_query=predict_query,
    )
    request_eval_results.append(eval_output)
request_eval_results

[' The original query is asking for all iPhone models, so the "q" parameter is correct. The "max_price" parameter is also correct, as it is set to null, meaning that no maximum price is set. The predicted query adds two additional parameters, "size" and "min_price". The "size" parameter is not necessary, as it is not relevant to the question being asked. The "min_price" parameter is also not necessary, as it is not relevant to the question being asked and it is set to 0, which is the default value. Therefore, the predicted query is not semantically the same as the original query and is not likely to produce the same answer. Final Grade: D',
 ' The original query is asking for laptops with a maximum price of 300. The predicted query is asking for laptops with a minimum price of 0 and a maximum price of 500. This means that the predicted query is likely to return more results than the original query, as it is asking for a wider range of prices. Therefore, the predicted query is not seman

In [16]:
import re
from typing import List


# Parse the evaluation chain responses into a rubric
def parse_eval_results(results: List[str]) -> List[float]:
    rubric = {"A": 1.0, "B": 0.75, "C": 0.5, "D": 0.25, "F": 0}
    return [rubric[re.search(r"Final Grade: (\w+)", res).group(1)] for res in results]


parsed_results = parse_eval_results(request_eval_results)
# Collect the scores for a final evaluation table
scores["request_synthesizer"].extend(parsed_results)

## Evaluate the Response Chain

The second component translated the structured API response to a natural language response.
Evaluate this against the user's original question.

In [17]:
from langchain.prompts import PromptTemplate

template = """You are trying to answer the following question by querying an API:

> Question: {question}

The API returned a response of:

> API result: {api_response}

Your response to the user: {answer}

Please evaluate the accuracy and utility of your response to the user's original question, conditioned on the information available.
Give a letter grade of either an A, B, C, D, or F, along with an explanation of why. End the evaluation with 'Final Grade: <the letter>'

> Explanation: Let's think step by step."""

prompt = PromptTemplate.from_template(template)

eval_chain = LLMChain(llm=llm, prompt=prompt, verbose=verbose)

In [18]:
# Extract the API responses from the chain
api_responses = [
    output["intermediate_steps"]["response_text"] for output in chain_outputs
]

In [19]:
# Run the grader chain
response_eval_results = []
for question, api_response, answer in list(zip(questions, api_responses, answers)):
    request_eval_results.append(
        eval_chain.run(question=question, api_response=api_response, answer=answer)
    )
request_eval_results

[' The original query is asking for all iPhone models, so the "q" parameter is correct. The "max_price" parameter is also correct, as it is set to null, meaning that no maximum price is set. The predicted query adds two additional parameters, "size" and "min_price". The "size" parameter is not necessary, as it is not relevant to the question being asked. The "min_price" parameter is also not necessary, as it is not relevant to the question being asked and it is set to 0, which is the default value. Therefore, the predicted query is not semantically the same as the original query and is not likely to produce the same answer. Final Grade: D',
 ' The original query is asking for laptops with a maximum price of 300. The predicted query is asking for laptops with a minimum price of 0 and a maximum price of 500. This means that the predicted query is likely to return more results than the original query, as it is asking for a wider range of prices. Therefore, the predicted query is not seman

In [20]:
# Reusing the rubric from above, parse the evaluation chain responses
parsed_response_results = parse_eval_results(request_eval_results)
# Collect the scores for a final evaluation table
scores["result_synthesizer"].extend(parsed_response_results)

In [21]:
# Print out Score statistics for the evaluation session
header = "{:<20}\t{:<10}\t{:<10}\t{:<10}".format("Metric", "Min", "Mean", "Max")
print(header)
for metric, metric_scores in scores.items():
    mean_scores = (
        sum(metric_scores) / len(metric_scores)
        if len(metric_scores) > 0
        else float("nan")
    )
    row = "{:<20}\t{:<10.2f}\t{:<10.2f}\t{:<10.2f}".format(
        metric, min(metric_scores), mean_scores, max(metric_scores)
    )
    print(row)

Metric              	Min       	Mean      	Max       
completed           	1.00      	1.00      	1.00      
request_synthesizer 	0.00      	0.23      	1.00      
result_synthesizer  	0.00      	0.55      	1.00      


In [22]:
# Re-show the examples for which the chain failed to complete
failed_examples

[]

## Generating Test Datasets

To evaluate a chain against your own endpoint, you'll want to generate a test dataset that's conforms to the API.

This section provides an overview of how to bootstrap the process.

First, we'll parse the OpenAPI Spec. For this example, we'll [Speak](https://www.speak.com/)'s OpenAPI specification.

In [23]:
# Load and parse the OpenAPI Spec
spec = OpenAPISpec.from_url("https://api.speak.com/openapi.yaml")

Attempting to load an OpenAPI 3.0.1 spec.  This may result in degraded performance. Convert your OpenAPI spec to 3.1.* spec for better support.
Attempting to load an OpenAPI 3.0.1 spec.  This may result in degraded performance. Convert your OpenAPI spec to 3.1.* spec for better support.


In [24]:
# List the paths in the OpenAPI Spec
paths = sorted(spec.paths.keys())
paths

['/v1/public/openai/explain-phrase',
 '/v1/public/openai/explain-task',
 '/v1/public/openai/translate']

In [25]:
# See which HTTP Methods are available for a given path
methods = spec.get_methods_for_path("/v1/public/openai/explain-task")
methods

['post']

In [26]:
# Load a single endpoint operation
operation = APIOperation.from_openapi_spec(
    spec, "/v1/public/openai/explain-task", "post"
)

# The operation can be serialized as typescript
print(operation.to_typescript())

type explainTask = (_: {
/* Description of the task that the user wants to accomplish or do. For example, "tell the waiter they messed up my order" or "compliment someone on their shirt" */
  task_description?: string,
/* The foreign language that the user is learning and asking about. The value can be inferred from question - for example, if the user asks "how do i ask a girl out in mexico city", the value should be "Spanish" because of Mexico City. Always use the full name of the language (e.g. Spanish, French). */
  learning_language?: string,
/* The user's native language. Infer this value from the language the user asked their question in. Always use the full name of the language (e.g. Spanish, French). */
  native_language?: string,
/* A description of any additional context in the user's question that could affect the explanation - e.g. setting, scenario, situation, tone, speaking style and formality, usage notes, or any other qualifiers. */
  additional_context?: string,
/* Ful

In [27]:
# Compress the service definition to avoid leaking too much input structure to the sample data
template = """In 20 words or less, what does this service accomplish?
{spec}

Function: It's designed to """
prompt = PromptTemplate.from_template(template)
generation_chain = LLMChain(llm=llm, prompt=prompt)
purpose = generation_chain.run(spec=operation.to_typescript())

In [28]:
template = """Write a list of {num_to_generate} unique messages users might send to a service designed to{purpose} They must each be completely unique.

1."""


def parse_list(text: str) -> List[str]:
    # Match lines starting with a number then period
    # Strip leading and trailing whitespace
    matches = re.findall(r"^\d+\. ", text)
    return [re.sub(r"^\d+\. ", "", q).strip().strip('"') for q in text.split("\n")]


num_to_generate = 10  # How many examples to use for this test set.
prompt = PromptTemplate.from_template(template)
generation_chain = LLMChain(llm=llm, prompt=prompt)
text = generation_chain.run(purpose=purpose, num_to_generate=num_to_generate)
# Strip preceding numeric bullets
queries = parse_list(text)
queries

["Can you explain how to say 'hello' in Spanish?",
 "I need help understanding the French word for 'goodbye'.",
 "Can you tell me how to say 'thank you' in German?",
 "I'm trying to learn the Italian word for 'please'.",
 "Can you help me with the pronunciation of 'yes' in Portuguese?",
 "I'm looking for the Dutch word for 'no'.",
 "Can you explain the meaning of 'hello' in Japanese?",
 "I need help understanding the Russian word for 'thank you'.",
 "Can you tell me how to say 'goodbye' in Chinese?",
 "I'm trying to learn the Arabic word for 'please'."]

In [29]:
# Define the generation chain to get hypotheses
api_chain = OpenAPIEndpointChain.from_api_operation(
    operation,
    llm,
    requests=Requests(),
    verbose=verbose,
    return_intermediate_steps=True,  # Return request and response text
)

predicted_outputs = [api_chain(query) for query in queries]
request_args = [
    output["intermediate_steps"]["request_args"] for output in predicted_outputs
]

# Show the generated request
request_args

['{"task_description": "say \'hello\'", "learning_language": "Spanish", "native_language": "English", "full_query": "Can you explain how to say \'hello\' in Spanish?"}',
 '{"task_description": "understanding the French word for \'goodbye\'", "learning_language": "French", "native_language": "English", "full_query": "I need help understanding the French word for \'goodbye\'."}',
 '{"task_description": "say \'thank you\'", "learning_language": "German", "native_language": "English", "full_query": "Can you tell me how to say \'thank you\' in German?"}',
 '{"task_description": "Learn the Italian word for \'please\'", "learning_language": "Italian", "native_language": "English", "full_query": "I\'m trying to learn the Italian word for \'please\'."}',
 '{"task_description": "Help with pronunciation of \'yes\' in Portuguese", "learning_language": "Portuguese", "native_language": "English", "full_query": "Can you help me with the pronunciation of \'yes\' in Portuguese?"}',
 '{"task_description

In [30]:
## AI Assisted Correction
correction_template = """Correct the following API request based on the user's feedback. If the user indicates no changes are needed, output the original without making any changes.

REQUEST: {request}

User Feedback / requested changes: {user_feedback}

Finalized Request: """

prompt = PromptTemplate.from_template(correction_template)
correction_chain = LLMChain(llm=llm, prompt=prompt)

In [31]:
ground_truth = []
for query, request_arg in list(zip(queries, request_args)):
    feedback = input(f"Query: {query}\nRequest: {request_arg}\nRequested changes: ")
    if feedback == "n" or feedback == "none" or not feedback:
        ground_truth.append(request_arg)
        continue
    resolved = correction_chain.run(request=request_arg, user_feedback=feedback)
    ground_truth.append(resolved.strip())
    print("Updated request:", resolved)

Query: Can you explain how to say 'hello' in Spanish?
Request: {"task_description": "say 'hello'", "learning_language": "Spanish", "native_language": "English", "full_query": "Can you explain how to say 'hello' in Spanish?"}
Requested changes: 
Query: I need help understanding the French word for 'goodbye'.
Request: {"task_description": "understanding the French word for 'goodbye'", "learning_language": "French", "native_language": "English", "full_query": "I need help understanding the French word for 'goodbye'."}
Requested changes: 
Query: Can you tell me how to say 'thank you' in German?
Request: {"task_description": "say 'thank you'", "learning_language": "German", "native_language": "English", "full_query": "Can you tell me how to say 'thank you' in German?"}
Requested changes: 
Query: I'm trying to learn the Italian word for 'please'.
Request: {"task_description": "Learn the Italian word for 'please'", "learning_language": "Italian", "native_language": "English", "full_query": "I

**Now you can use the `ground_truth` as shown above in [Evaluate the Requests Chain](#Evaluate-the-requests-chain)!**

In [32]:
# Now you have a new ground truth set to use as shown above!
ground_truth

['{"task_description": "say \'hello\'", "learning_language": "Spanish", "native_language": "English", "full_query": "Can you explain how to say \'hello\' in Spanish?"}',
 '{"task_description": "understanding the French word for \'goodbye\'", "learning_language": "French", "native_language": "English", "full_query": "I need help understanding the French word for \'goodbye\'."}',
 '{"task_description": "say \'thank you\'", "learning_language": "German", "native_language": "English", "full_query": "Can you tell me how to say \'thank you\' in German?"}',
 '{"task_description": "Learn the Italian word for \'please\'", "learning_language": "Italian", "native_language": "English", "full_query": "I\'m trying to learn the Italian word for \'please\'."}',
 '{"task_description": "Help with pronunciation of \'yes\' in Portuguese", "learning_language": "Portuguese", "native_language": "English", "full_query": "Can you help me with the pronunciation of \'yes\' in Portuguese?"}',
 '{"task_description