langchain/docs/extras/modules/evaluation/string/qa.ipynb
William FH 3179ee3a56
Evals docs (#7460)
Still don't have good how-tos, and the guides / examples section
could be further pruned and improved, but this PR adds a couple of
examples for each of the common evaluator interfaces.

- [x] Example docs for each implemented evaluator
- [x] "how to make a custom evalutor" notebook for each low level APIs
(comparison, string, agent)
- [x] Move docs to modules area
- [x] Link to reference docs for more information
- [x] Still need to finish the evaluation index page
- ~[ ] Don't have good data generation section~
- ~[ ] Don't have good how to section for other common scenarios / FAQs
like regression testing, testing over similar inputs to measure
sensitivity, etc.~
2023-07-18 01:00:01 -07:00

{
"cells": [
{
"cell_type": "markdown",
"id": "c701fcaf-e5dc-42a2-b8a7-027d13ff465f",
"metadata": {},
"source": [
"# QA Correctness\n",
"\n",
"The QAEvalChain compares a question-answering model's response to a reference response.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "9672fdb9-b53f-41e4-8f72-f21d11edbeac",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.evaluation import QAEvalChain\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-4\", temperature=0)\n",
"criterion = \"conciseness\"\n",
"eval_chain = QAEvalChain.from_llm(llm=llm)"
]
},
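{
"cell_type": "markdown",
"id": "1f0bdba3-2b0e-4c5a-8f8e-0a9c1d2e3f40",
"metadata": {},
"source": [
"As an aside, if your installed version of LangChain exposes the generic `load_evaluator` loader, the same chain can likely be constructed by name. This is an illustrative sketch; the `qa_evaluator` variable is not reused elsewhere in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2a1c3e5d-7b9f-4d2c-a6e8-0b1c2d3e4f50",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Illustrative alternative: load the QA evaluator by name.\n",
"# Assumes `load_evaluator` is available in your version of langchain.\n",
"from langchain.evaluation import load_evaluator\n",
"\n",
"qa_evaluator = load_evaluator(\"qa\", llm=llm)"
]
},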
{
"cell_type": "code",
"execution_count": 2,
"id": "b4db474a-9c9d-473f-81b1-55070ee584a6",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'reasoning': None, 'value': 'CORRECT', 'score': 1}"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eval_chain.evaluate_strings(\n",
" input=\"What's last quarter's sales numbers?\",\n",
" prediction=\"Last quarter we sold 600,000 total units of product.\",\n",
" reference=\"Last quarter we sold 100,000 units of product A, 200,000 units of product B, and 300,000 units of product C.\",\n",
")"
]
},
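{
"cell_type": "markdown",
"id": "3b2d4f6e-8c1a-4e3b-b7f9-1c2d3e4f5a60",
"metadata": {},
"source": [
"You can also grade a batch of examples in one call. The sketch below assumes the chain's batch `evaluate` method and spells out the dictionary keys (`query`, `answer`, `result`) explicitly; adapt the keys to match your own dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c3e5a7f-9d2b-4f4c-b8a0-2d3e4f5a6b70",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Illustrative sketch of batch grading with QAEvalChain.evaluate.\n",
"examples = [\n",
"    {\n",
"        \"query\": \"What's last quarter's sales numbers?\",\n",
"        \"answer\": \"Last quarter we sold 100,000 units of product A, 200,000 units of product B, and 300,000 units of product C.\",\n",
"    }\n",
"]\n",
"predictions = [{\"result\": \"Last quarter we sold 600,000 total units of product.\"}]\n",
"\n",
"graded_outputs = eval_chain.evaluate(\n",
"    examples,\n",
"    predictions,\n",
"    question_key=\"query\",\n",
"    answer_key=\"answer\",\n",
"    prediction_key=\"result\",\n",
")"
]
},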
{
"cell_type": "markdown",
"id": "a5b345aa-7f45-4eea-bedf-9b0d5e824be3",
"metadata": {},
"source": [
"## SQL Correctness\n",
"\n",
"You can use an LLM to check the equivalence of a SQL query against a reference SQL query. using the sql prompt."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6c803b8c-fe1f-4fb7-8ea0-d9c67b855eb3",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.evaluation.qa.eval_prompt import SQL_PROMPT\n",
"\n",
"eval_chain = QAEvalChain.from_llm(llm=llm, prompt=SQL_PROMPT)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e28b8d07-248f-405c-bcef-e0ebe3a05c3e",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'reasoning': 'The expert answer and the submission are very similar in their approach to solving the problem. Both queries are trying to calculate the sum of sales from the last quarter. They both use the SUM function to add up the sale_amount from the sales table. They also both use the same WHERE clause to filter the sales data to only include sales from the last quarter. The WHERE clause uses the DATEADD function to subtract 1 quarter from the current date (GETDATE()) and only includes sales where the sale_date is greater than or equal to this date and less than the current date.\\n\\nThe main difference between the two queries is that the expert answer uses a subquery to first select the sale_amount from the sales table with the appropriate date filter, and then sums these amounts in the outer query. The submission, on the other hand, does not use a subquery and instead sums the sale_amount directly in the main query with the same date filter.\\n\\nHowever, this difference does not affect the result of the query. Both queries will return the same result, which is the sum of sales from the last quarter.\\n\\nCORRECT',\n",
" 'value': 'CORRECT',\n",
" 'score': 1}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eval_chain.evaluate_strings(\n",
" input=\"What's last quarter's sales numbers?\",\n",
" prediction=\"\"\"SELECT SUM(sale_amount) AS last_quarter_sales\n",
"FROM sales\n",
"WHERE sale_date >= DATEADD(quarter, -1, GETDATE()) AND sale_date < GETDATE();\n",
"\"\"\",\n",
" reference=\"\"\"SELECT SUM(sub.sale_amount) AS last_quarter_sales\n",
"FROM (\n",
" SELECT sale_amount\n",
" FROM sales\n",
" WHERE sale_date >= DATEADD(quarter, -1, GETDATE()) AND sale_date < GETDATE()\n",
") AS sub;\n",
"\"\"\",\n",
")"
]
},
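{
"cell_type": "markdown",
"id": "5d4f6b80-ae3c-4a5d-9b01-3e4f5a6b7c80",
"metadata": {},
"source": [
"For contrast, here is a made-up negative case: the predicted query computes an average rather than a sum, so the grader would be expected to return INCORRECT. The output is omitted because grading depends on the LLM."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e5a7c91-bf4d-4b6e-9c12-4f5a6b7c8d90",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Illustrative negative case: the prediction is not equivalent to the reference query.\n",
"eval_chain.evaluate_strings(\n",
"    input=\"What's last quarter's sales numbers?\",\n",
"    prediction=\"\"\"SELECT AVG(sale_amount) AS last_quarter_sales\n",
"FROM sales\n",
"WHERE sale_date >= DATEADD(quarter, -1, GETDATE()) AND sale_date < GETDATE();\n",
"\"\"\",\n",
"    reference=\"\"\"SELECT SUM(sub.sale_amount) AS last_quarter_sales\n",
"FROM (\n",
"    SELECT sale_amount\n",
"    FROM sales\n",
"    WHERE sale_date >= DATEADD(quarter, -1, GETDATE()) AND sale_date < GETDATE()\n",
") AS sub;\n",
"\"\"\",\n",
")"
]
},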
{
"cell_type": "markdown",
"id": "e0c3dcad-408e-4d26-9e25-848ebacac2c4",
"metadata": {},
"source": [
"## Using Context\n",
"\n",
"Sometimes, reference labels aren't all available, but you have additional knowledge as context from a retrieval system. Often there may be additional information that isn't available to the model you want to evaluate. For this type of scenario, you can use the ContextQAEvalChain."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "9f3ae116-3a2f-461d-ba6f-7352b42c1b0c",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'reasoning': None, 'value': 'CORRECT', 'score': 1}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.evaluation import ContextQAEvalChain\n",
"\n",
"eval_chain = ContextQAEvalChain.from_llm(llm=llm)\n",
"\n",
"eval_chain.evaluate_strings(\n",
" input=\"Who won the NFC championship game in 2023?\",\n",
" prediction=\"Eagles\",\n",
" reference=\"NFC Championship Game 2023: Philadelphia Eagles 31, San Francisco 49ers 7\",\n",
")"
]
},
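{
"cell_type": "markdown",
"id": "7f6b8da2-c05e-4c7f-9d23-5a6b7c8d9ea0",
"metadata": {},
"source": [
"For contrast, a prediction that the context contradicts would be expected to be graded INCORRECT. This is an illustrative sketch; the output is omitted because it depends on the grading LLM."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a7c9eb3-d16f-4d80-8e34-6b7c8d9eafb0",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Illustrative negative case: the context names the Eagles, not the 49ers, as the winner.\n",
"eval_chain.evaluate_strings(\n",
"    input=\"Who won the NFC championship game in 2023?\",\n",
"    prediction=\"49ers\",\n",
"    reference=\"NFC Championship Game 2023: Philadelphia Eagles 31, San Francisco 49ers 7\",\n",
")"
]
},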
{
"cell_type": "markdown",
"id": "ba5eac17-08b6-4e4f-a896-79e7fc637018",
"metadata": {},
"source": [
"## CoT With Context\n",
"\n",
"The same prompt strategies such as chain of thought can be used to make the evaluation results more reliable.\n",
"The `CotQAEvalChain`'s default prompt instructs the model to do this."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "26e3b686-98f4-45a5-9854-7071ec2893f1",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{'reasoning': 'The context states that the Philadelphia Eagles won the NFC championship game in 2023. The student\\'s answer, \"Eagles,\" matches the team that won according to the context. Therefore, the student\\'s answer is correct.',\n",
" 'value': 'CORRECT',\n",
" 'score': 1}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.evaluation import CotQAEvalChain\n",
"\n",
"eval_chain = CotQAEvalChain.from_llm(llm=llm)\n",
"\n",
"eval_chain.evaluate_strings(\n",
" input=\"Who won the NFC championship game in 2023?\",\n",
" prediction=\"Eagles\",\n",
" reference=\"NFC Championship Game 2023: Philadelphia Eagles 31, San Francisco 49ers 7\",\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}