mirror of https://github.com/hwchase17/langchain.git
@@ -12,7 +12,7 @@
 "The `criteria` evaluator is a convenient way to predict whether an LLM or Chain's output complies with a set of criteria, so long as you can\n",
 "properly define those criteria.\n",
 "\n",
-"For more details, check out the reference docs for the [CriteriaEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.criteria.eval_chain.CriteriaEvalChain.html#langchain.evaluation.criteria.eval_chain.CriteriaEvalChain) on the class definition\n",
+"For more details, check out the reference docs for the [CriteriaEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.criteria.eval_chain.CriteriaEvalChain.html#langchain.evaluation.criteria.eval_chain.CriteriaEvalChain)'s class definition.\n",
 "\n",
 "### Without References\n",
 "\n",
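For orientation, here is a minimal runnable sketch of the `criteria` evaluator this hunk documents; the criterion name and sample strings are illustrative rather than taken from the notebook, and the loader falls back to an OpenAI `gpt-4` chat model unless one is supplied.

```python
from langchain.evaluation import load_evaluator

# Load the criteria evaluator with one of the built-in criteria.
evaluator = load_evaluator("criteria", criteria="conciseness")

# Grade a prediction against the criterion; no reference label is required.
result = evaluator.evaluate_strings(
    prediction="Four. Although, to be thorough, two plus two has always equaled four.",
    input="What's 2+2?",
)
print(result)  # dict with "reasoning", a "Y"/"N" "value", and a binary "score"
```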
@@ -8,12 +8,12 @@
 "source": [
 "# Embedding Distance\n",
 "\n",
-"To measure semantic similarity (or dissimilarity) between a prediction and a reference label string, you could use a vector vector distance metric the two embedded representations using the `embeding_distance` evaulator.<a name=\"cite_ref-1\"></a>[<sup>[1]</sup>](#cite_note-1)\n",
+"To measure semantic similarity (or dissimilarity) between a prediction and a reference label string, you could use a vector distance metric on the two embedded representations using the `embedding_distance` evaluator.<a name=\"cite_ref-1\"></a>[<sup>[1]</sup>](#cite_note-1)\n",
 "\n",
 "\n",
 "**Note:** This returns a **distance** score, meaning that the lower the number, the **more** similar the prediction is to the reference, according to their embedded representation.\n",
 "\n",
-"Check out the reference docs for the [PairwiseEmbeddingDistanceEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.embedding_distance.base.PairwiseEmbeddingDistanceEvalChain.html#langchain.evaluation.embedding_distance.base.PairwiseEmbeddingDistanceEvalChain) for more info."
+"Check out the reference docs for the [EmbeddingDistanceEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.embedding_distance.base.EmbeddingDistanceEvalChain.html#langchain.evaluation.embedding_distance.base.EmbeddingDistanceEvalChain) for more info."
 ]
 },
 {
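Consolidated from the cells changed below, a runnable sketch of the renamed evaluator; it assumes the default embeddings (OpenAI) and default distance metric, and the exact score will vary with the embedding model.

```python
from langchain.evaluation import load_evaluator

# Embeds both strings and returns a distance: lower means more similar.
evaluator = load_evaluator("embedding_distance")

result = evaluator.evaluate_strings(prediction="I shall go", reference="I shan't go")
print(result["score"])  # a small float distance; the exact value depends on the embedding model
```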
@@ -26,7 +26,7 @@
 "source": [
 "from langchain.evaluation import load_evaluator\n",
 "\n",
-"evaluator = load_evaluator(\"pairwise_embedding_distance\")"
+"evaluator = load_evaluator(\"embedding_distance\")"
 ]
 },
 {
@@ -48,7 +48,7 @@
 }
 ],
 "source": [
-"evaluator.evaluate_string_pairs(prediction=\"I shall go\", prediction_b=\"I shan't go\")"
+"evaluator.evaluate_strings(prediction=\"I shall go\", reference=\"I shan't go\")"
 ]
 },
 {
@@ -70,7 +70,7 @@
 }
 ],
 "source": [
-"evaluator.evaluate_string_pairs(prediction=\"I shall go\", prediction_b=\"I will go\")"
+"evaluator.evaluate_strings(prediction=\"I shall go\", reference=\"I will go\")"
 ]
 },
 {
@@ -118,8 +118,9 @@
 },
 "outputs": [],
 "source": [
 "# You can load by enum or by raw python string\n",
 "evaluator = load_evaluator(\n",
-" \"pairwise_embedding_distance\", distance_metric=EmbeddingDistance.EUCLIDEAN\n",
+" \"embedding_distance\", distance_metric=EmbeddingDistance.EUCLIDEAN\n",
 ")"
 ]
 },
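If you want to compare metrics without hard-coding them, you can iterate over the enum; this sketch assumes `EmbeddingDistance` is importable from `langchain.evaluation`, as the notebook's earlier cells do.

```python
from langchain.evaluation import EmbeddingDistance, load_evaluator

# Score the same pair under every distance metric the enum exposes.
for metric in EmbeddingDistance:
    evaluator = load_evaluator("embedding_distance", distance_metric=metric)
    result = evaluator.evaluate_strings(prediction="I shall go", reference="I will go")
    print(f"{metric.value}: {result['score']:.4f}")
```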
@@ -134,7 +135,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 8,
+"execution_count": 6,
 "metadata": {
 "tags": []
 },
@@ -143,12 +144,12 @@
 "from langchain.embeddings import HuggingFaceEmbeddings\n",
 "\n",
 "embedding_model = HuggingFaceEmbeddings()\n",
-"hf_evaluator = load_evaluator(\"pairwise_embedding_distance\", embeddings=embedding_model)"
+"hf_evaluator = load_evaluator(\"embedding_distance\", embeddings=embedding_model)"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 10,
+"execution_count": 7,
 "metadata": {
 "tags": []
 },
@@ -159,18 +160,18 @@
 "{'score': 0.5486443280477362}"
 ]
 },
-"execution_count": 10,
+"execution_count": 7,
 "metadata": {},
 "output_type": "execute_result"
 }
 ],
 "source": [
-"hf_evaluator.evaluate_string_pairs(prediction=\"I shall go\", prediction_b=\"I shan't go\")"
+"hf_evaluator.evaluate_strings(prediction=\"I shall go\", reference=\"I shan't go\")"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 12,
+"execution_count": 8,
 "metadata": {
 "tags": []
 },
@@ -181,20 +182,20 @@
 "{'score': 0.21018880025138598}"
 ]
 },
-"execution_count": 12,
+"execution_count": 8,
 "metadata": {},
 "output_type": "execute_result"
 }
 ],
 "source": [
-"hf_evaluator.evaluate_string_pairs(prediction=\"I shall go\", prediction_b=\"I will go\")"
+"hf_evaluator.evaluate_strings(prediction=\"I shall go\", reference=\"I will go\")"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"<a name=\"cite_note-1\"></a><i>1. Note: When it comes to semantic similarity, this often gives better results than older string distance metrics (such as those in the `PairwiseStringDistanceEvalChain`), though it tends to be less reliable than evaluators that use the LLM directly (such as the `PairwiseStringEvalChain`) </i>"
+"<a name=\"cite_note-1\"></a><i>1. Note: When it comes to semantic similarity, this often gives better results than older string distance metrics (such as those in the [StringDistanceEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.string_distance.base.StringDistanceEvalChain.html#langchain.evaluation.string_distance.base.StringDistanceEvalChain)), though it tends to be less reliable than evaluators that use the LLM directly (such as the [QAEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.qa.eval_chain.QAEvalChain.html#langchain.evaluation.qa.eval_chain.QAEvalChain) or [LabeledCriteriaEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.criteria.eval_chain.LabeledCriteriaEvalChain.html#langchain.evaluation.criteria.eval_chain.LabeledCriteriaEvalChain)) </i>"
 ]
 }
 ],
@@ -2,12 +2,16 @@
 "cells": [
 {
 "cell_type": "markdown",
-"id": "c701fcaf-e5dc-42a2-b8a7-027d13ff465f",
-"metadata": {},
+"id": "d63696a8-d035-4cf7-9605-c3210f0b551d",
+"metadata": {
+"tags": []
+},
 "source": [
 "# QA Correctness\n",
 "\n",
-"The QAEvalChain compares a question-answering model's response to a reference response.\n"
+"When thinking about a QA system, one of the most important questions to ask is whether the final generated result is correct. The `\"qa\"` evaluator compares a question-answering model's response to a reference answer to provide this level of information. If you are able to annotate a test dataset, this evaluator will be useful.\n",
+"\n",
+"For more details, check out the reference docs for the [QAEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.qa.eval_chain.QAEvalChain.html#langchain.evaluation.qa.eval_chain.QAEvalChain)'s class definition."
 ]
 },
 {
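The hunks below wire this up cell by cell; pulled together, a sketch of the `"qa"` evaluator using the strings from this diff (the output keys noted in the comment are indicative, not verbatim).

```python
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import load_evaluator

# eval_llm is optional; load_evaluator falls back to a gpt-4 model if it is omitted.
llm = ChatOpenAI(model="gpt-4", temperature=0)
evaluator = load_evaluator("qa", eval_llm=llm)

result = evaluator.evaluate_strings(
    input="What's last quarter's sales numbers?",
    prediction="Last quarter we sold 600,000 total units of product.",
    reference="Last quarter we sold 100,000 units of product A, 210,000 units of product B, and 300,000 units of product C.",
)
print(result)  # a CORRECT/INCORRECT "value" and a matching binary "score"
```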
@@ -20,11 +24,12 @@
 "outputs": [],
 "source": [
 "from langchain.chat_models import ChatOpenAI\n",
-"from langchain.evaluation import QAEvalChain\n",
+"from langchain.evaluation import load_evaluator\n",
 "\n",
 "llm = ChatOpenAI(model=\"gpt-4\", temperature=0)\n",
-"criterion = \"conciseness\"\n",
-"eval_chain = QAEvalChain.from_llm(llm=llm)"
+"\n",
+"# Note: the eval_llm is optional. A gpt-4 model will be provided by default if not specified\n",
+"evaluator = load_evaluator(\"qa\", eval_llm=llm)"
 ]
 },
 {
@@ -47,10 +52,10 @@
 }
 ],
 "source": [
-"eval_chain.evaluate_strings(\n",
+"evaluator.evaluate_strings(\n",
 " input=\"What's last quarter's sales numbers?\",\n",
 " prediction=\"Last quarter we sold 600,000 total units of product.\",\n",
-" reference=\"Last quarter we sold 100,000 units of product A, 200,000 units of product B, and 300,000 units of product C.\",\n",
+" reference=\"Last quarter we sold 100,000 units of product A, 210,000 units of product B, and 300,000 units of product C.\",\n",
 ")"
 ]
 },
@@ -61,7 +66,7 @@
 "source": [
 "## SQL Correctness\n",
 "\n",
-"You can use an LLM to check the equivalence of a SQL query against a reference SQL query. using the sql prompt."
+"You can use an LLM to check the equivalence of a SQL query against a reference SQL query using the SQL prompt."
 ]
 },
 {
@@ -75,7 +80,7 @@
 "source": [
 "from langchain.evaluation.qa.eval_prompt import SQL_PROMPT\n",
 "\n",
-"eval_chain = QAEvalChain.from_llm(llm=llm, prompt=SQL_PROMPT)"
+"eval_chain = load_evaluator(\"qa\", eval_llm=llm, prompt=SQL_PROMPT)"
 ]
 },
 {
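The diff shows the loader but not the call that produced the graded output below; a hedged sketch of such a comparison, with both SQL queries written here for illustration (they mirror the subquery-versus-direct-sum difference described in the grader's reasoning).

```python
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import load_evaluator
from langchain.evaluation.qa.eval_prompt import SQL_PROMPT

llm = ChatOpenAI(model="gpt-4", temperature=0)
eval_chain = load_evaluator("qa", eval_llm=llm, prompt=SQL_PROMPT)

# Compare a submitted query against a reference ("expert") query for semantic equivalence.
result = eval_chain.evaluate_strings(
    input="What's last quarter's sales numbers?",
    prediction=(
        "SELECT SUM(sale_amount) AS last_quarter_sales FROM sales "
        "WHERE sale_date >= DATEADD(quarter, -1, GETDATE()) AND sale_date < GETDATE();"
    ),
    reference=(
        "SELECT SUM(sub.sale_amount) FROM ("
        "SELECT sale_amount FROM sales "
        "WHERE sale_date >= DATEADD(quarter, -1, GETDATE()) AND sale_date < GETDATE()"
        ") AS sub;"
    ),
)
print(result["value"], result["score"])
```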
@@ -89,7 +94,7 @@
 {
 "data": {
 "text/plain": [
-"{'reasoning': 'The expert answer and the submission are very similar in their approach to solving the problem. Both queries are trying to calculate the sum of sales from the last quarter. They both use the SUM function to add up the sale_amount from the sales table. They also both use the same WHERE clause to filter the sales data to only include sales from the last quarter. The WHERE clause uses the DATEADD function to subtract 1 quarter from the current date (GETDATE()) and only includes sales where the sale_date is greater than or equal to this date and less than the current date.\\n\\nThe main difference between the two queries is that the expert answer uses a subquery to first select the sale_amount from the sales table with the appropriate date filter, and then sums these amounts in the outer query. The submission, on the other hand, does not use a subquery and instead sums the sale_amount directly in the main query with the same date filter.\\n\\nHowever, this difference does not affect the result of the query. Both queries will return the same result, which is the sum of sales from the last quarter.\\n\\nCORRECT',\n",
+"{'reasoning': 'The expert answer and the submission are very similar in their structure and logic. Both queries are trying to calculate the sum of sales amounts for the last quarter. They both use the SUM function to add up the sale_amount from the sales table. They also both use the same WHERE clause to filter the sales data to only include sales from the last quarter. The WHERE clause uses the DATEADD function to subtract 1 quarter from the current date (GETDATE()) and only includes sales where the sale_date is greater than or equal to this date and less than the current date.\\n\\nThe main difference between the two queries is that the expert answer uses a subquery to first select the sale_amount from the sales table with the appropriate date filter, and then sums these amounts in the outer query. The submission, on the other hand, does not use a subquery and instead sums the sale_amount directly in the main query with the same date filter.\\n\\nHowever, this difference does not affect the result of the query. Both queries will return the same result, which is the sum of the sales amounts for the last quarter.\\n\\nCORRECT',\n",
 " 'value': 'CORRECT',\n",
 " 'score': 1}"
 ]
@@ -123,7 +128,7 @@
 "source": [
 "## Using Context\n",
 "\n",
-"Sometimes, reference labels aren't all available, but you have additional knowledge as context from a retrieval system. Often there may be additional information that isn't available to the model you want to evaluate. For this type of scenario, you can use the ContextQAEvalChain."
+"Sometimes, reference labels aren't all available, but you have additional knowledge as context from a retrieval system. Often there may be additional information that isn't available to the model you want to evaluate. For this type of scenario, you can use the [ContextQAEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.qa.eval_chain.ContextQAEvalChain.html#langchain.evaluation.qa.eval_chain.ContextQAEvalChain)."
 ]
 },
 {
@@ -146,9 +151,7 @@
 }
 ],
 "source": [
-"from langchain.evaluation import ContextQAEvalChain\n",
-"\n",
-"eval_chain = ContextQAEvalChain.from_llm(llm=llm)\n",
+"eval_chain = load_evaluator(\"context_qa\", eval_llm=llm)\n",
 "\n",
 "eval_chain.evaluate_strings(\n",
 " input=\"Who won the NFC championship game in 2023?\",\n",
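The hunk above truncates the call; a complete sketch of the `context_qa` evaluator, where the retrieved context is supplied through the `reference` field (the context string here is illustrative).

```python
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import load_evaluator

llm = ChatOpenAI(model="gpt-4", temperature=0)
eval_chain = load_evaluator("context_qa", eval_llm=llm)

result = eval_chain.evaluate_strings(
    input="Who won the NFC championship game in 2023?",
    prediction="Eagles",
    # For context_qa, reference carries the retrieved context rather than a gold answer.
    reference="The NFC championship game was played on January 29, 2023, and was won by the Philadelphia Eagles.",
)
print(result)
```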
@@ -165,7 +168,7 @@
 "## CoT With Context\n",
 "\n",
 "The same prompt strategies such as chain of thought can be used to make the evaluation results more reliable.\n",
-"The `CotQAEvalChain`'s default prompt instructs the model to do this."
+"The [CotQAEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.qa.eval_chain.CotQAEvalChain.html#langchain.evaluation.qa.eval_chain.CotQAEvalChain)'s default prompt instructs the model to do this."
 ]
 },
 {
@@ -179,8 +182,8 @@
 {
 "data": {
 "text/plain": [
-"{'reasoning': 'The context states that the Philadelphia Eagles won the NFC championship game in 2023. The student\\'s answer, \"Eagles,\" matches the team that won according to the context. Therefore, the student\\'s answer is correct.',\n",
-" 'value': 'CORRECT',\n",
+"{'reasoning': 'The student\\'s answer is \"Eagles\". The context states that the Philadelphia Eagles won the NFC championship game in 2023. Therefore, the student\\'s answer matches the information provided in the context.',\n",
+" 'value': 'GRADE: CORRECT',\n",
 " 'score': 1}"
 ]
 },
@@ -190,9 +193,7 @@
 }
 ],
 "source": [
-"from langchain.evaluation import CotQAEvalChain\n",
-"\n",
-"eval_chain = CotQAEvalChain.from_llm(llm=llm)\n",
+"eval_chain = load_evaluator(\"cot_qa\", eval_llm=llm)\n",
 "\n",
 "eval_chain.evaluate_strings(\n",
 " input=\"Who won the NFC championship game in 2023?\",\n",
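Likewise for the chain-of-thought grader; a short sketch under the same assumptions, reusing the illustrative context from above.

```python
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import load_evaluator

llm = ChatOpenAI(model="gpt-4", temperature=0)
eval_chain = load_evaluator("cot_qa", eval_llm=llm)

# The CoT prompt asks the grader to reason step by step before emitting a grade.
result = eval_chain.evaluate_strings(
    input="Who won the NFC championship game in 2023?",
    prediction="Eagles",
    reference="The NFC championship game was played on January 29, 2023, and was won by the Philadelphia Eagles.",
)
print(result)
```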
@@ -7,7 +7,7 @@
 "source": [
 "# Custom Trajectory Evaluator\n",
 "\n",
-"You can make your own custom trajectory evaluators by inheriting from the `AgentTrajectoryEvaluator` class and overwriting the `_evaluate_agent_trajectory` (and `_aevaluate_agent_action`) method.\n",
+"You can make your own custom trajectory evaluators by inheriting from the [AgentTrajectoryEvaluator](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.schema.AgentTrajectoryEvaluator.html#langchain.evaluation.schema.AgentTrajectoryEvaluator) class and overriding the `_evaluate_agent_trajectory` (and `_aevaluate_agent_trajectory`) method.\n",
 "\n",
 "\n",
 "In this example, you will make a simple trajectory evaluator that uses an LLM to determine if any actions were unnecessary."
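For orientation, a minimal non-LLM sketch of the subclassing pattern; this is not the notebook's own evaluator (which uses an LLM to flag unnecessary actions), and the keyword-only signature shown is an assumption based on the `AgentTrajectoryEvaluator` interface linked above.

```python
from typing import Any, Optional, Sequence, Tuple

from langchain.evaluation import AgentTrajectoryEvaluator
from langchain.schema import AgentAction


class StepCountEvaluator(AgentTrajectoryEvaluator):
    """Toy evaluator: fewer intermediate steps -> higher score."""

    def _evaluate_agent_trajectory(
        self,
        *,
        prediction: str,
        input: str,
        agent_trajectory: Sequence[Tuple[AgentAction, str]],
        reference: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        # Score 1.0 for a single-step trajectory, decaying as more steps are taken.
        steps = len(agent_trajectory)
        return {"score": 1.0 / max(steps, 1), "reasoning": f"Agent used {steps} step(s)."}
```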
@@ -67,7 +67,9 @@
 "id": "297dea4b-fb28-4292-b6e0-1c769cfb9cbd",
 "metadata": {},
 "source": [
-"The example above will return a score of 1 if the language model predicts that any of the actions were unnecessary, and it returns a score of 0 if all of them were predicted to be necessary."
+"The example above will return a score of 1 if the language model predicts that any of the actions were unnecessary, and it returns a score of 0 if all of them were predicted to be necessary.\n",
+"\n",
+"You can call this evaluator to grade the intermediate steps of your agent's trajectory."
 ]
 },
 {
@@ -107,6 +109,12 @@
 " ],\n",
 ")"
 ]
 },
+{
+"cell_type": "markdown",
+"id": "77353528-723e-4075-939e-aebdb17c1e4f",
+"metadata": {},
+"source": []
+}
 ],
 "metadata": {
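The hunk above only shows the tail of the evaluator call; spelled out, invoking a custom trajectory evaluator looks roughly like this (the tool name and observation are hypothetical, and `StepCountEvaluator` is the toy class sketched earlier, not the notebook's LLM-based evaluator).

```python
from langchain.schema import AgentAction

# Any AgentTrajectoryEvaluator subclass is called the same way.
evaluator = StepCountEvaluator()

result = evaluator.evaluate_agent_trajectory(
    input="What's the latest version of langchain?",
    prediction="The latest version is 0.0.250.",
    agent_trajectory=[
        (
            AgentAction(
                tool="pypi_search",
                tool_input="langchain",
                log="I should look up the package on PyPI.",
            ),
            "langchain 0.0.250 is the most recent release.",
        ),
    ],
)
print(result)
```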
@@ -100,11 +100,13 @@
 {
 "cell_type": "markdown",
 "id": "2df34eed-45a5-4f91-88d3-9aa55f28391a",
-"metadata": {},
+"metadata": {
+"tags": []
+},
 "source": [
 "## Evaluate Trajectory\n",
 "\n",
-"Pass the input, trajectory, and output to the `evaluate_agent_trajectory` function."
+"Pass the input, trajectory, and output to the [evaluate_agent_trajectory](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.schema.AgentTrajectoryEvaluator.html#langchain.evaluation.schema.AgentTrajectoryEvaluator.evaluate_agent_trajectory) method."
 ]
 },
 {
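A hedged sketch of that call for the built-in `"trajectory"` evaluator; the trajectory below is a hand-written stand-in for an agent's `intermediate_steps`, and the tool name is made up.

```python
from langchain.evaluation import load_evaluator
from langchain.schema import AgentAction

evaluator = load_evaluator("trajectory")

# intermediate_steps from an agent run: a list of (AgentAction, observation) pairs.
trajectory = [
    (
        AgentAction(
            tool="ping",
            tool_input="https://langchain.com",
            log="I should check whether the site responds.",
        ),
        "https://langchain.com responded in 184 ms.",
    ),
]

result = evaluator.evaluate_agent_trajectory(
    input="Is langchain.com up, and how fast does it respond?",
    prediction="Yes, langchain.com is up; it responded in about 184 ms.",
    agent_trajectory=trajectory,
)
print(result)  # includes a score (higher is better) and the grader's reasoning
```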
@@ -149,7 +151,7 @@
 "source": [
 "## Configuring the Evaluation LLM\n",
 "\n",
-"If you don't select an LLM to use for evaluation, the `load_evaluator` function will use `gpt-4` to power the evaluation chain. You can select any chat model for the agent trajectory evaluator as below."
+"If you don't select an LLM to use for evaluation, the [load_evaluator](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.loading.load_evaluator.html#langchain.evaluation.loading.load_evaluator) function will use `gpt-4` to power the evaluation chain. You can select any chat model for the agent trajectory evaluator as below."
 ]
 },
 {
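A sketch of swapping in a different chat model; it assumes Anthropic credentials are configured and uses `load_evaluator`'s optional `llm` argument to override the default `gpt-4` grader.

```python
from langchain.chat_models import ChatAnthropic
from langchain.evaluation import load_evaluator

# Pass any chat model via llm; otherwise load_evaluator defaults to gpt-4.
eval_llm = ChatAnthropic(temperature=0)
evaluator = load_evaluator("trajectory", llm=eval_llm)
```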