# QA Correctness

The `QAEvalChain` compares a question-answering model's response to a reference response and grades it as correct or incorrect.


In [1]:
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import QAEvalChain

# Grader model; temperature=0 keeps the verdicts as repeatable as possible.
llm = ChatOpenAI(model="gpt-4", temperature=0)
eval_chain = QAEvalChain.from_llm(llm=llm)

In [2]:
eval_chain.evaluate_strings(
    input="What's last quarter's sales numbers?",
    prediction="Last quarter we sold 600,000 total units of product.",
    reference="Last quarter we sold 100,000 units of product A, 200,000 units of product B, and 300,000 units of product C.",
)

{'reasoning': None, 'value': 'CORRECT', 'score': 1}
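
When many examples are graded, the per-example result dictionaries can be reduced to a single accuracy figure. The following is a minimal sketch, assuming each result carries the `score` key shown above (1 for `CORRECT`; 0 is assumed here for `INCORRECT`). The `results` list is hypothetical hand-written data, not real model output, and `accuracy` is an illustrative helper rather than a LangChain function.

```python
# Hypothetical results shaped like the dictionary above; not real model output.
results = [
    {"reasoning": None, "value": "CORRECT", "score": 1},
    {"reasoning": None, "value": "INCORRECT", "score": 0},
    {"reasoning": None, "value": "CORRECT", "score": 1},
    {"reasoning": None, "value": "CORRECT", "score": 1},
]


def accuracy(results):
    """Return the fraction of results scored 1, skipping entries without a score."""
    scores = [r["score"] for r in results if r.get("score") is not None]
    return sum(scores) / len(scores) if scores else 0.0


print(accuracy(results))
```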

## SQL Correctness

You can use an LLM to check whether a SQL query is equivalent to a reference SQL query by passing `SQL_PROMPT` to the chain.

In [3]:
from langchain.evaluation.qa.eval_prompt import SQL_PROMPT

eval_chain = QAEvalChain.from_llm(llm=llm, prompt=SQL_PROMPT)

In [4]:
eval_chain.evaluate_strings(
    input="What's last quarter's sales numbers?",
    prediction="""SELECT SUM(sale_amount) AS last_quarter_sales
FROM sales
WHERE sale_date >= DATEADD(quarter, -1, GETDATE()) AND sale_date < GETDATE();
""",
    reference="""SELECT SUM(sub.sale_amount) AS last_quarter_sales
FROM (
    SELECT sale_amount
    FROM sales
    WHERE sale_date >= DATEADD(quarter, -1, GETDATE()) AND sale_date < GETDATE()
) AS sub;
""",
)

{'reasoning': 'The expert answer and the submission are very similar in their approach to solving the problem. Both queries are trying to calculate the sum of sales from the last quarter. They both use the SUM function to add up the sale_amount from the sales table. They also both use the same WHERE clause to filter the sales data to only include sales from the last quarter. The WHERE clause uses the DATEADD function to subtract 1 quarter from the current date (GETDATE()) and only includes sales where the sale_date is greater than or equal to this date and less than the current date.\n\nThe main difference between the two queries is that the expert answer uses a subquery to first select the sale_amount from the sales table with the appropriate date filter, and then sums these amounts in the outer query. The submission, on the other hand, does not use a subquery and instead sums the sale_amount directly in the main query with the same date filter.\n\nHowever, this difference does not af
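
An LLM grader's equivalence verdict can be cross-checked by running both queries against the same data and comparing the results. The sketch below is an illustration and not part of LangChain: it uses Python's built-in `sqlite3` module, a small hypothetical `sales` table, and a fixed date window in place of `DATEADD`/`GETDATE`, which SQLite does not provide.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the queries above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, sale_amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2023-01-15", 100.0), ("2023-02-20", 250.0), ("2022-06-01", 75.0)],
)

# Assumed fixed window standing in for the DATEADD/GETDATE filter.
window = ("2023-01-01", "2023-04-01")

# Direct aggregation, like the prediction.
direct = """SELECT SUM(sale_amount) FROM sales
WHERE sale_date >= ? AND sale_date < ?"""

# Aggregation over a filtered subquery, like the reference.
nested = """SELECT SUM(sub.sale_amount) FROM (
    SELECT sale_amount FROM sales
    WHERE sale_date >= ? AND sale_date < ?
) AS sub"""

direct_total = conn.execute(direct, window).fetchone()[0]
nested_total = conn.execute(nested, window).fetchone()[0]
print(direct_total == nested_total)
```

Agreement on one dataset does not prove that two queries are equivalent in general, so a check like this complements the LLM-based grade rather than replacing it.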

## Using Context

Sometimes reference labels aren't available, but you do have additional knowledge to use as context, for example from a retrieval system. That context often contains information that isn't available to the model you want to evaluate. For this scenario, you can use the `ContextQAEvalChain`, passing the context as `reference`.

In [5]:
from langchain.evaluation import ContextQAEvalChain

eval_chain = ContextQAEvalChain.from_llm(llm=llm)

eval_chain.evaluate_strings(
    input="Who won the NFC championship game in 2023?",
    prediction="Eagles",
    reference="NFC Championship Game 2023: Philadelphia Eagles 31, San Francisco 49ers 7",
)

{'reasoning': None, 'value': 'CORRECT', 'score': 1}
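
When the context comes from a retrieval system, it usually arrives as several passages that have to be combined into the single `reference` string. A minimal sketch, assuming the retriever returns a list of plain strings; `build_context_example` is a hypothetical helper, not a LangChain function:

```python
def build_context_example(question, prediction, documents):
    """Pack a question, a prediction, and retrieved passages into keyword arguments."""
    # Drop empty passages and join the rest with blank lines between them.
    context = "\n\n".join(doc.strip() for doc in documents if doc.strip())
    return {"input": question, "prediction": prediction, "reference": context}


# Assumed retriever output: the passage used above plus an empty hit.
docs = [
    "NFC Championship Game 2023: Philadelphia Eagles 31, San Francisco 49ers 7",
    "  ",
]
example = build_context_example(
    "Who won the NFC championship game in 2023?", "Eagles", docs
)
print(example["reference"])
```

The resulting dictionary matches the keyword arguments used above, so it could be passed on as `eval_chain.evaluate_strings(**example)`.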

## CoT With Context

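Prompt strategies such as chain of thought can also be applied to the grader to make the evaluation results more reliable.
The `CotQAEvalChain`'s default prompt instructs the model to reason through the context before giving its verdict, and that reasoning is returned alongside the grade.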

In [6]:
from langchain.evaluation import CotQAEvalChain

eval_chain = CotQAEvalChain.from_llm(llm=llm)

eval_chain.evaluate_strings(
    input="Who won the NFC championship game in 2023?",
    prediction="Eagles",
    reference="NFC Championship Game 2023: Philadelphia Eagles 31, San Francisco 49ers 7",
)

{'reasoning': 'The context states that the Philadelphia Eagles won the NFC championship game in 2023. The student\'s answer, "Eagles," matches the team that won according to the context. Therefore, the student\'s answer is correct.',
 'value': 'CORRECT',
 'score': 1}
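
To make the shape of this result concrete, the sketch below shows one way free-form chain-of-thought output could be split into reasoning, verdict, and score. It is an illustration under stated assumptions, not LangChain's own parser: the `GRADE: CORRECT` line format in `raw` and the `parse_verdict` helper are hypothetical.

```python
import re


def parse_verdict(text):
    """Split grader output into reasoning, a verdict, and a 0/1 score."""
    # Find standalone upper-case verdict tokens; the last one is taken as the grade.
    matches = list(re.finditer(r"\b(INCORRECT|CORRECT)\b", text))
    if not matches:
        return {"reasoning": text.strip(), "value": None, "score": None}
    last = matches[-1]
    value = last.group(1)
    return {
        # Everything before the verdict token is kept as the reasoning.
        "reasoning": text[: last.start()].strip() or None,
        "value": value,
        "score": 1 if value == "CORRECT" else 0,
    }


# Assumed raw grader output; the real prompt's output format may differ.
raw = 'The student\'s answer, "Eagles," matches the context.\nGRADE: CORRECT'
print(parse_verdict(raw))
```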