Mirror of https://github.com/hpcaitech/ColossalAI.git (synced 2025-09-21 01:24:04 +00:00)
[feature] ColossalEval: Evaluation Pipeline for LLMs (#4786)
* Add ColossalEval
* Delete evaluate in Chat

Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
Co-authored-by: Tong Li <tong.li352711588@gmail.com>
@@ -1,396 +0,0 @@
# Evaluation

In this directory, we introduce how you can evaluate your model with our pipeline. The pipeline currently supports evaluation of both Chinese and English capability.

## Installation

To start model evaluation, you need to install the required packages listed in `requirements.txt` under the `evaluate` folder.

```shell
pip install -r requirements.txt
```

## Evaluation Pipeline

The whole evaluation pipeline consists of three methods:

1. `GPT Evaluation`: evaluates model predictions using GPT models.
   - Compare the performance of two different models (battle).
   - Rate the model according to pre-defined metrics using prompting design.
   - Rate the model according to pre-defined metrics with an additional reference answer using prompting design.
2. `Automatic Evaluation`: evaluates model predictions using automatic metrics.
3. `UniEval`: evaluates model predictions using UniEval models (English only).

### Evaluation Category

Our evaluation pipeline examines the model's capability using 10 categories of questions. The following table introduces each category:

| Evaluation Category | Description |
| :-----------------: | :---------- |
| Brainstorming | Models are asked to generate a range of creative and diverse ideas according to the question. The capability of creativity is required. |
| Chat | Models are asked to continue a multi-round dialogue given the roles involved. The capability of understanding, memorizing previous rounds of the dialogue and answering according to the persona provided is required. |
| Classification | Models are asked to do classification tasks. The capability of accurate classification is required. |
| Closed QA | Models are asked to answer a closed QA question. The capability of answering questions with limited scope (such as single/multiple choice questions) is required. |
| Extraction | Models are asked to extract information from a given material. The capability of extracting the required information is required. |
| Generation | Models are asked to generate an email, letter, article, etc. The capability of generating high-quality, human-like text is required. |
| Open QA | Models are asked to answer an open QA question (without context provided). The capability of answering questions with the model's own knowledge base is required. |
| Roleplay | Models are asked to play the role provided. The capability of engaging in the scenario and effectively interacting with the user is required. |
| Rewriting | Models are asked to do rewriting tasks such as translation and grammar correction. The capability of rewriting according to different instructions is required. |
| Summarization | Models are asked to summarize the given paragraph or passage. The capability of summarization is required. |

To better understand each evaluation category, here are some example questions.

| Evaluation Category | Chinese Example | English Example |
| :-----------------: | :-------------- | :-------------- |
| Brainstorming | **Example 1:**<br/>请介绍一下人工智能的多个领域。<br/><br/>**Example 2:**<br/>请给出管理家庭财务的 3 个小技巧。<br/> | **Example 1:**<br/>How can I improve my memory? Any useful techniques you can suggest?<br/><br/>**Example 2:**<br/>What are some ways to increase productivity while working from home? |
| Chat | **Example 1:**<br/>基于以下角色信息完成一段对话。小张是一名新手爱好者,对养鸡有浓厚的兴趣。老李是一名有丰富经验的养鸡大师。<br/>小张:您好,老李,我最近开始对养鸡感兴趣了,想请教您一些问题。 <br/>老李:你好,小张,我很乐意帮助你。你想问些什么? <br/>小张:我想知道如何确定鸡的品种和性别? <br/>老李:确切的品种可以通过鸡的外貌特征来确定,而性别一般是通过鸡卵的大小和形状来判断。还有什么问题吗?<br/> 小张:<br/><br/>**Example 2:**<br/>基于以下角色信息完成一段对话。小明是一名医生,一位老年病患者想要停药,但他对病情有所忽视并有担忧;王叔叔是老年病患者的儿子,希望能够听取医生的建议。<br/>小明:你好,王叔叔,我了解你想要让你父亲停药。<br/>王叔叔:是的,我父亲已经吃了那么久的药,我担心药物对他的身体会有副作用。<br/>小明: | **Example 1:**<br/>Complete a conversation based on the following character information. Amy is a 30-year-old chef who runs her own restaurant. Jack is a food blogger who specializes in reviewing local restaurants.<br/>Amy: Hi Jack, I heard that you're a food blogger. Nice to meet you. <br/>Jack: Hi Amy, yes I am. Your restaurant has been receiving a lot of good reviews lately. <br/>Amy: Yes, we use only fresh and quality ingredients, and every dish is carefully crafted. <br/>Jack: <br/><br/>**Example 2:**<br/>Complete a dialogue based on the following role information. A: Elementary student B: Teacher<br/>B: Good morning, Student A. Today we're going to learn about addition and subtraction.<br/>A: Teacher, I already know this very well. Why do I need to learn it again?<br/>B: |
| Classification | **Example 1:**<br/>新闻标题:今日立夏,有一上联,立夏万物并秀,下联怎么对?<br/>请根据以上新闻标题判断新闻所属的分类,你需要从文化,娱乐,体育,财经,房产,教育,科技,旅游,游戏,军事这十类中选择一个答案。<br/><br/> **Example 2:**<br/>新闻标题:赵丽颖很久没有登上微博热搜了,但你们别急,她只是在憋大招而已。<br/>请根据新闻标题判断新闻所属的分类,你需要从文化,娱乐,体育,财经,房产,教育,科技,旅游,游戏,军事这十类中选择一个答案。 | **Example 1:**<br/>Title: Fighting for Love (2020) <br/>Description: Jasmine got obsessed with a man and now he's obsessed with her. Steamy nights, kisses and rules being broken awaits them. She turned his whole world upside down and now he's doing it to hers. In this free fall, can they survive each others love?\"<br/>Based on the above information, determine which genre the work of art belongs to. You can only choose one from \"sport\", \"horror\", \"drama\", \"history\", \"romance\", \"biography\", \"science fiction\", \"comedy\", \"animation\", \"documentary\", \"music\" and \"news\".<br/><br/>**Example2:** <br/>Title: Summer Breeze: The Isley Brothers Greatest Hits Live (2005)<br/>Description: Filmed in the US in 2005 and captured in excellent form led by Ron Isley's vocals and Ernie Isley's hard edged guitar. Virtually every track is a hit including Shout, Who's That Lady, Twist And Shout, Summer Breeze and Harvest For The World.<br/>Based on the above information, determine which genre the work of art belongs to. You can only choose one from \"sport\", \"horror\", \"drama\", \"history\", \"romance\", \"biography\", \"science fiction\", \"comedy\", \"animation\", \"documentary\", \"music\" and \"news\"." |
| Closed QA | **Example 1:**<br/>请从以下选项中选择正确答案。以下哪个是世界上最高山峰? <br/>A. 长城 <br/>B. 泰山 <br/>C. 珠穆朗玛峰 <br/>D. 黄山<br/><br/>**Example 2:**<br/>请从以下选项中选择一个最佳答案回答下面的问题。问题:非洲最高的山是哪座山?<br/> 选项: <br/>A. 麦金利山 <br/>B. 喜马拉雅山 <br/>C. 乞力马扎罗山 | **Example 1:**<br/>Which of the following options is NOT a primary color?<br/>(a) yellow<br/>(b) blue<br/>(c) orange<br/>(d) red<br/><br/>**Example 2:**<br/>Choose the correct option to complete the following sentence: \"Harry Potter and the Chamber of Secrets\" is the **\_\_\_\_** book in the Harry Potter series.<br/>(A) first<br/>(B) second<br/>(C) third<br/>(D) fourth |
| Extraction | **Example 1:**<br/>根据以下新闻文本,提取新闻报道时间,例如回答时按照格式“新闻报道时间:2007 年 8 月 10 日”<br/>新闻文本如下:2007-4-7 中新网 4 月 7 日电据中国消防在线消息,4 月 4 日晚上 7 时 30 分左右,湖南长潭高速公路上发生一起 6 车连环相撞失火事故。长株潭三地消防部门共出动消防车 21 台,警力 100 余人。经过消防官兵近 2 个小时奋力扑救,大火被成功扑灭。据初步调查,有 1 人在此次事故中死亡。<br/><br/>**Example 2:**<br/>根据以下新闻文本,提取新闻报道时间,例如回答时按照格式“新闻报道时间:2007 年 8 月 10 日”<br/>新闻文本如下:2014 年 1 月 15 日,据外媒《俄罗斯报》报道称,位于北半球的澳大利亚现在正处于炎热的夏季,而近日也到了高温酷暑的时候,当地时间 1 月 14 日晚,澳大利亚南部一夜间发生至少 250 起火灾。受炎热天气及雷雨天气影响,澳大利亚南部一夜间发生至少 250 起火灾,灾情多集中在维多利亚州。火灾发生后,救援人员立即展开救灾行动。目前,大部分起火点火势已被控制。 | **Example 1:**<br/>Ernest Hemingway, an American literary giant known for his spare and direct writing style, has penned timeless works such as 'The Old Man and the Sea', 'For Whom the Bell Tolls', and 'A Farewell to Arms', which have made a profound impact on the literary world and continue to be widely read and admired today.<br/>Extract the name of the author mentioned above.<br/><br/>**Example 2:**<br/>In the epic fantasy series 'A Song of Ice and Fire', George R.R. Martin weaves a complex web of political intrigue, war, and magic across the fictional continents of Westeros and Essos. Martin's richly developed characters and intricate plotlines have captivated readers worldwide, much like his other acclaimed works such as 'A Clash of Kings' and 'A Storm of Swords'.<br/>Extract the name of the author in the above material. |
| Generation | **Example 1:**<br/>请撰写一篇文章,介绍如何通过改善生活习惯来预防疾病和延长寿命。<br/><br/>**Example 2:**<br/>请根据以下情节撰写一篇短篇小说:一名年轻人被困在一个荒岛上,他必须想办法生存下去直到被救援。但他很快发现自己并不孤单。 | **Example 1:**<br/>Write a descriptive paragraph about an island to relax and unwind, including details about the location and atmosphere.<br/><br/>**Example 2:**<br/>Can you help me write a persuasive email to my colleagues encouraging them to participate in a charitable fundraising event? |
| Open QA | **Example 1:**<br/>请问万有引力定律由谁提出的?<br/><br/>**Example 2:**<br/>哪些国家参与了第一次世界大战? | **Example 1:**<br/>What are the four basic tastes of the human palate?<br/><br/>**Example 2:**<br/>Who painted the The Scream? |
| Rewriting | **Example 1:**<br/>请将以下句子改为正确的语序。 <br/>生日快乐你祝他了吗?<br/><br/>**Example 2:**<br/>将以下文本翻译成英语:<br/>“这个周末我要去海边玩” | **Example 1:**<br/>Please translate the following sentences, which are a mixture of Chinese and English, into full English. <br/>我需要买一些 healthy snacks,比如 nuts 和 dried fruits,作为我的 office 的午餐.<br/><br/>**Example 2:**<br/>Please rewrite the sentence using an inverted sentence structure.<br/>We won't begin our journey until the sun sets. |
| Roleplay | **Example 1:**<br/>我想让你担任 Android 开发工程师面试官。我将成为候选人,您将向我询问 Android 开发工程师职位的面试问题。我希望你只作为面试官回答。不要一次写出所有的问题。我希望你只对我进行采访。问我问题,等待我的回答。不要写解释。像面试官一样一个一个问我,等我回答。我的第一句话是“面试官你好”。 <br/><br/>**Example 2:**<br/>我想让你扮演讲故事的角色。你会想出引人入胜、富有想象力和吸引观众的有趣故事。它可以是童话故事、教育故事或任何其他类型的有潜力的故事以吸引人们的注意力和想象力。根据目标受众,您可以为您的讲故事环节选择特定的主题或主题,例如,如果是儿童,那么您可以谈论动物;如果是成人,那么基于历史的故事可能会更好地吸引他们等。我的第一个请求是我需要一个关于毅力的有趣故事。 | **Example 1:**<br/>Assume the role of a marriage counselor. Develop a series of communication exercises for a couple who are experiencing difficulties in their relationship. These exercises should promote active listening, empathy, and effective expression of emotions. Your first assignment is to provide a set of three exercises that focus on resolving conflicts and rebuilding trust. <br/><br/>**Example 2:**<br/>I want you to act as a travel agent. I will tell you my desired destination, travel dates, and budget, and it will be your job to suggest the best travel itinerary for me. Your recommendations should include the best transportation options, hotel accommodations, and any popular tourist attractions nearby. My first request is "I want to plan a trip to Tokyo for a week, with a budget of $2000. I want to explore the culture and food of the city." |
| Summarization | **Example 1:**<br/>请简要总结概括以下段落材料。<br/>当地时间 29 日,泰国卫生部通报,新增 143 名新冠肺炎确诊病例和 1 名死亡病例。截止到当地时间 29 日上午,泰国累计确诊病例 1388 例,其中泰国籍 1172 例,非泰国籍 216 例。死亡病例累计 7 例。(原题为《泰国新增 143 例新冠肺炎确诊病例累计确诊 1388 例》)<br/><br/> **Example 2:**<br/>请简要总结概括以下段落材料。<br/>近期,参与京雄高铁站站房建设的中铁十二局,因在施工过程中存在环境违法行为被雄安新区公开通报。通报发出后,引起社会广泛关注。近日,人民网记者从雄安新区相关部门及中铁十二局获悉,新区有关部门已经集中约谈了中铁十二局等 24 个参与雄安建设的项目单位。对于约谈内容和结果,中铁十二局有关宣传负责人回应:“具体内容不清楚,最好找雄安新区相关部门了解情况。”新区有关部门负责人表示,此前涉及的环境违法行为,中铁十二局已基本整改到位,但约谈内容和结果暂不公开,接下来,将按部就班推进环境治理工作。(原题为《雄安新区:中铁十二局涉环境违法已基本整改到位》) | **Example 1:**<br/>The 21 year-old-woman was treated by paramedics after the kitchen fire in Botfield Road in Shifnal, Shropshire. West Mercia Police said it is treating Wednesday morning's incident as arson and are appealing for any witnesses to contact them.The 50-year-old man has been arrested on suspicion of arson with intent to endanger life. For more on this and other stories from Shropshire.<br/>Please briefly summarize the above material within 20 words.<br/><br/>**Example 2:**<br/>South Wales Police were called to a property in Heolgerrig, Merthyr Tydfil, at about 13:40 BST on Sunday. The child was airlifted to Prince Charles Hospital but died shortly afterwards. Police are investigating the circumstances surrounding the incident and have appealed for witnesses. The girl's family are being supported by specially trained officers.<br/>Please briefly summarize the above material within 20 words. |

### Evaluation Metrics

#### GPT Evaluation

GPT evaluation uses GPT models to score the predictions of different models, and different pre-defined evaluation metrics are applied to different categories. The following table shows the 11 pre-defined evaluation metrics in both Chinese and English:

| Evaluation Metric | Prompt Words | CoT(Chain-of-Thought) |
| :---------------: | :----------- | :-------------------- |
| 语言组织<br/>(Language organization) | 语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。</br></br>Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc. | 1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。<br/> 2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说<br/> 3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。<br/> 4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。<br/> 5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。<br/> 6. 根据以上因素综合评估答案的语言组织,并给出一个 1 到 5 的分数,其中 5 表示语言组织非常好,而 1 表示语言组织非常差。</br></br>1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.<br>2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.<br>3. Determine if the answer is relevant to the question or topic and conveys a clear message.<br>4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.<br>5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.<br>6. Evaluate the linguistic organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good linguistic organization and 1 indicates very poor linguistic organization. |
| 切题<br/>(Relevance) | 切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。</br></br>Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic. | 1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。<br/> 2. 阅读答案,确认答案是否直接回答了题目所问的问题。<br/> 3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。<br/> 4. 根据以上因素综合评估答案的切题程度,并给出一个 1 到 5 的分数,其中 5 表示答案非常切题,而 1 表示答案完全没有切题。</br></br>1. Read the question to determine what the question asks and what aspects of the question need to be answered.<br>2. Read the answers to make sure that they directly answer the question asked.<br>3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.<br>4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all. |
| 创意性<br/>(Creativity) | 创意性(1-5):某些头脑风暴问题可能需要答案具有创意,提出新的思路。</br></br>Creativity (1-5): Some brainstorming questions may require answers that are creative and suggest new ideas. | 1. 仔细阅读所提供的头脑风暴问题,确保你理解问题的要点和背景。<br/> 2. 根据你的知识和经验,判断所提供的答案是否可行。如果答案不可行,则创意性评分可能会受到影响。<br/> 3. 考虑答案中是否包含新颖的想法或独特的思路。答案可能与已知的解决方案有所重叠,但仍然可以被认为是有创意的,只要它提供了新的角度或方法来解决问题。<br/> 4. 根据答案的创意性,给出一个 1 到 5 的评分。如果答案缺乏创意,则应给出一个较低的评分。如果答案具有创意并提供了新的思路,应给出一个较高的评分。</br></br>1. Read the provided brainstorming questions carefully to make sure you understand the gist and context of the questions.<br>2. Based on your knowledge and experience, determine if the answers provided are feasible. If the answer is not feasible, the creativity score may be affected.<br>3. Consider whether the answer contains novel ideas or unique thoughts. An answer may overlap with a known solution and still be considered creative, as long as it offers a new perspective or approach to the problem.<br>4. Give a score of 1 to 5 depending on the creativity of the answer. If the answer lacks creativity, a lower score should be given. If the answer is creative and provides a new idea, a higher score should be given. |
| 实用性<br/>(Practicality) | 实用性(1-5):某些头脑风暴问题可能需要答案提出实用的建议或解决方法。</br></br>Practicality (1-5): Some brainstorming questions may require answers to suggest practical suggestions or solutions. | 1. 仔细阅读所提供的头脑风暴问题,确保你理解问题的要点和背景。<br/> 2. 根据你的知识和经验,判断所提供的答案是否可行。如果答案不可行,则实用性评分可能会受到影响。<br/> 3. 考虑答案中提出的建议或解决方法是否实用并可行。答案可能看起来很好,但如果无法实现或应用,则实用性评分可能会受到影响。<br/> 4. 根据答案的实用性,给出一个 1 到 5 的评分。如果答案缺乏实用性,则应给出一个较低的评分。如果答案提出了实用的建议或解决方法,并且可以很好地解决问题,则应给出一个较高的评分。</br></br>1. Read the provided brainstorming questions carefully to make sure you understand the gist and context of the questions.<br>2. Based on your knowledge and experience, determine if the answers provided are feasible. If the answer is not feasible, the practicality score may be affected.<br>3. Consider whether the suggestions or solutions presented in the answer are practical and workable. The answer may look good, but if it cannot be implemented or applied, the practicality score may be affected.<br>4. Give a score of 1 to 5 depending on the practicality of the answer. If the answer lacks practicality, a lower score should be given. If the answer makes a practical suggestion or solution and solves the problem well, a higher score should be given. |
| 正确性<br/>(Correctness) | 正确性(1-5):正确性(1-5):答案是否正确。</br></br> Correctness (1-5): whether the answer is correct or not. | 1. 仔细阅读题目,尝试自己回答该问题。<br/>2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的,则可以将正确性得分为 5 分。如果答案是部分正确的,则可以给予适当的得分,例如 2 分、3 分或 4 分。如果答案完全不正确,则只得 1 分。<br/><br/>1. Read the question carefully and try to answer the question yourself. <br/>2. Check the correctness of the answer. You can use known facts or research to verify that the answer is correct. If the answer is correct, you can give a score of 5 for correctness. If the answer is partially correct, an appropriate score, such as 2, 3, or 4, may be given. If the answer is completely incorrect, only 1 point is awarded. |
| 自然<br/>(Naturalness) | 自然(1-5):答案是否自然,并且符合问题给定的身份。</br></br>Naturalness (1-5): whether the answer is natural and fits the identity given by the question. | 1. 阅读题目,确定题目提供的身份信息。<br/> 2. 检查答案内容是否符合题目给定的身份。<br/> 3. 根据以上因素,对该回答的自然性进行打分,分数从 1 到 5,其中 1 表示不自然,5 表示非常自然,并符合问题给定的身份。</br></br>1. Read the question and determine the identity information provided in the question.<br>2. Check whether the content of the answer matches the identity given in the question.<br>3. Based on the above factors, score the naturalness of the response on a scale from 1 to 5, where 1 means unnatural and 5 means very natural and in accordance with the identity given in the question. |
| 参与感<br/>(Engagingness) | 参与感(1-5):答案是否对前面的对话内容做出了恰当的反应,是否理解对话的语境和背景。</br></br>Engagingness (1-5): whether the answer responds appropriately to the content of the preceding conversation and whether it understands the context and background of the conversation. | 1. 阅读题目,确定对话的语境和背景。<br/> 2. 检查答案是否充分理解对话的语境和背景,能否自然地融入到对话中而不显得突兀。<br/> 3. 根据以上因素,对该回答的参与感进行打分,分数从 1 到 5,其中 1 表示没有参与感,5 表示非常有参与感,并且恰当地理解了对话的语境和背景。</br></br>1. Read the questions to determine the context and background of the dialogue.<br>2. Check that the answer fully understands the context and background of the conversation and that it fits naturally into the conversation without seeming abrupt.<br>3. Based on the above factors, rate the response's engagement on a scale from 1 to 5, where 1 means not engaged and 5 means very engaged and appropriately understands the context and background of the conversation. |
| 合理性<br/>(Reasonableness) | 合理性(1-5):答案是否能够与前面的对话内容形成逻辑上的衔接,是否符合常理,能否在这个上下文中合理存在。</br></br>Reasonableness (1-5): Whether the answer can form a logical connection with the content of the previous dialogue, whether it is consistent with common sense, and whether it can reasonably exist in this context. | 1. 阅读题目,确定对话的主题以及问题期望的回答方向。<br/> 2. 判断答案是否能够与前面的对话内容形成逻辑上的衔接,是否符合常理,能否在这个上下文中合理存在。<br/> 3. 根据以上因素,对该回答的合理性进行打分,分数从 1 到 5,其中 1 表示不合理,5 表示非常合理,并且能够与前面的对话内容形成逻辑上的衔接,并符合常理。</br></br>1. Read the question and determine the topic of the conversation and the direction the question expects the answer to go.<br>2. Determine whether the answer can be logically connected to the preceding conversation, whether it makes common sense, and whether it can reasonably exist in this context.<br>3. Based on the above factors, rate the reasonableness of the answer on a scale from 1 to 5, where 1 means unreasonable and 5 means very reasonable and able to form a logical connection with the preceding dialogue content and consistent with common sense. |
| 多样性<br/>(Diversity) | 多样性(1-5):答案使用语言是否优美,具有有一定的创造性和想象力。然而,回答也应该保持合理和适度,不要过于夸张或离题。</br></br>Diversity (1-5): Whether the answers use beautiful language and have some creativity and imagination. However, answers should also be kept reasonable and moderate, not overly exaggerated or off-topic. | 1. 仔细阅读整个回答,确保完全理解回答所表达的内容和主题。<br/> 2. 在阅读回答的同时,注意语言的质量,例如措辞是否正确,语言是否生动等。<br/> 3. 检查回答的创造性和想象力,看看回答是否能够吸引人阅读下去。<br/> 4. 检查回答的合理性和适度,看看回答是否夸张或离题。5. 将多样性的评分打分在 1 到 5 之间,5 分表示回答的质量很好,能够吸引人阅读,1 分表示回答的内容生硬或者有离题的问题。</br></br>1. Read the entire response carefully to ensure that you fully understand the content and theme expressed in the response.<br>2. While reading the response, pay attention to the quality of the language, such as whether the wording is correct and the language is vivid.<br>3. Check the creativity and imagination of the response to see if the response is engaging to read on.<br>4. Check the reasonableness and appropriateness of the responses to see if the responses are exaggerated or off-topic.<br>5. Rate the diversity on a scale of 1 to 5, with a 5 indicating a good quality response that is engaging to read and a 1 indicating a raw response or a question that is off-topic. |
| 保真度<br/>(Fidelity) | 保真度(1-5):答案是否能够严格遵守角色的设定回答给定的请求。</br></br>Fidelity (1-5): whether the answer is able to answer the given request in strict compliance with the role setting. | 1. 仔细阅读问题,了解角色在问题中的设定和表现,包括职业、背景、观点、性格等方面。<br/> 阅读题目的请求,确认回答请求时需要注意的细节。<br/> 3. 对比提供的回答与该角色的设定,评估回答是否能够严格遵守角色的设定。<br/> 4. 结合以上评估结果给出保真度的评分,范围从 1 到 5 分,其中 1 分表示回答与角色设定完全不符,5 分表示回答完全符合角色设定且满足给定请求。</br></br>1. Read the question carefully to understand how the character is set up and represented in the question, including aspects such as occupation, background, point of view, and personality.<br>2. Read the question's request and confirm the details that need to be taken into account when answering the request.<br>3. Compare the provided answer with the setting of the role and assess whether the answer can strictly adhere to the setting of the role.<br>4. Combine the results of the above assessment to give a fidelity score ranging from 1 to 5, where a score of 1 means that the response does not match the persona at all, and a score of 5 means that the response fully complies with the persona and satisfies the given request. |
| 简明扼要<br/>(Conciseness) | 简明扼要(1-5):答案是否简明扼要,没有冗余内容。</br></br>Conciseness (1-5): answers should be concise and without redundant content. | 1. 阅读题目,提取出材料的重点。<br/> 2. 阅读该总结,并注意其中的主要观点和信息。<br/> 3. 评估总结的长度。一个简明扼要的总结通常应该在几句话或几段文字内传达关键信息,而不是冗长的段落或文章。<br/> 4. 检查总结是否包含与主要观点无关的信息或冗余信息。<br/> 5. 确定总结涵盖了材料中的关键信息,并且没有忽略任何重要细节。<br/> 6. 给总结打出 1-5 的分数,其中 5 表示总结简明扼要,没有冗余内容,而 1 表示总结冗长或包含不必要的信息,难以理解或记忆。根据您的判断,打出适当的得分。</br></br>1. Read the title and extract the main points of the material.<br>2. Read the summary and note the main ideas and messages in it.<br>3. Assess the length of the summary. A concise summary should usually convey key information within a few sentences or paragraphs, rather than lengthy paragraphs or essays.<br>4. Check that the summary does not contain information that is not relevant to the main ideas or that is redundant.<br>5. Make sure that the summary covers the key information in the material and that no important details have been omitted.<br>6. Rate the summary on a scale of 1-5, where 5 means the summary is concise and free of redundancy, and 1 means the summary is lengthy or contains unnecessary information that is difficult to understand or remember. Based on your judgment, assign the appropriate score. |

GPT models evaluate the quality of model predictions based on the given prompt words and give a score from 1 to 5.
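
To make this concrete, below is a minimal sketch of what a single scoring call could look like. It assumes the pre-1.0 `openai` Python client that `eval.py` imports; the helper name `rate_answer`, the temperature setting and the naive score parsing are illustrative assumptions, and the actual logic lives in the `Evaluator` class used by `eval.py`.

```python
# Minimal sketch (illustrative, not the pipeline's exact code): fill one category's
# evaluation prompt with the question, answer, metric description and CoT steps,
# then ask a chat model for a 1-5 score.
import os
import re

import openai  # assumes the pre-1.0 openai client, as imported in eval.py

openai.api_key = os.environ["OPENAI_API_KEY"]


def rate_answer(prompt_cfg: dict, question: str, answer: str, metric: str) -> int:
    """prompt_cfg is one category entry loaded from a file in prompt/evaluation_prompt."""
    user_prompt = prompt_cfg["prompt"].format(
        question=question,
        answer=answer,
        metric=prompt_cfg["metrics"][metric],
        steps=prompt_cfg["CoT"][metric],
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": user_prompt}],
        temperature=0,
    )
    reply = response["choices"][0]["message"]["content"]
    match = re.search(r"[1-5]", reply)  # naive parsing of the first 1-5 digit
    return int(match.group()) if match else -1
```

In practice such a call is made once per (category, metric) pair defined in your config file.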

> **NOTE 1:** Even for the same metric, the details of its prompt words and CoT(Chain-of-Thought) can differ depending on which category you want to evaluate. For example, the prompt words for the metric `correctness` shown here are "Whether the answer is correct or not." (for the category `classification`), but for the category `extraction` the prompt words can be "Answers should extract the required information accurately and should not contain any incorrect or misleading information." You can find all the prompt words and CoT(Chain-of-Thought) in `prompt/evaluation_prompt`.

> **NOTE 2:** To add customized metrics, you can refer to the [FAQ](#faq).

#### Automatic Evaluation

Automatic metrics evaluate the capability of a model by comparing model predictions with reference answers.
There are two ways to obtain reference answers:

- For instructions derived from human-designed problems (such as roleplay and chat), the reference answers are generated by GPT-3.5.
- For instructions related to classic NLP problems (such as classification, extraction and summarization), the reference answers are collected from open-source datasets that provide target answers.

There are 6 types of automatic evaluation metrics listed in the table below:

| Automatic Evaluation Metric | Description |
| :-------------------------: | :---------- |
| BLEU-n | Measures the accuracy between prediction and reference.<br/> BLEU-1 (unigram) evaluates accuracy at the word level.<br/> BLEU-n (n-gram) evaluates fluency at the sentence level. |
| ROUGE | ROUGE-N measures the number of matching n-grams between prediction and reference. <br/> ROUGE-L measures the longest common subsequence (LCS) between prediction and reference. |
| Distinct | Measures the diversity of the generated text by counting the unique n-grams. |
| BERTScore | Measures the semantic similarity between tokens of predictions and references with BERT. |
| Precision<br/> Recall<br/> F1 Score | Measure the overlap between prediction and reference (designed for the classification and extraction categories). |
| CHRF | Measures the similarity of character n-grams between prediction and reference. |
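
To make the `Distinct` metric concrete, here is a minimal sketch of the distinct-n idea (the ratio of unique n-grams to all n-grams in the generated text). It assumes simple whitespace tokenization and is not the pipeline's exact implementation.

```python
# Minimal sketch of Distinct-n: unique n-grams divided by all n-grams.
def distinct_n(text: str, n: int = 2) -> float:
    tokens = text.split()  # assumes simple whitespace tokenization
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


print(distinct_n("the cat sat on the mat", n=2))  # 1.0: every bigram is unique
```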
#### UniEval Evaluation

UniEval converts all evaluation tasks of different dimensions (metrics) into Boolean QA problems and utilizes the model to answer with "Yes" or "No". Compared with similarity-based metrics such as ROUGE and BLEU, UniEval can achieve a more comprehensive evaluation. In addition, UniEval also demonstrates its ability to transfer to unseen dimensions and tasks.
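
For example, a dimension can be phrased as a Boolean QA input roughly as sketched below. The exact question wording here is a hypothetical illustration; the real question templates live in the `add_question` function in `unieval/utils.py` (see the [FAQ](#faq)).

```python
# Hypothetical sketch of how one dimension becomes a Boolean QA input for UniEval;
# the real question templates are defined in add_question in unieval/utils.py.
def to_boolean_qa(dimension: str, output: str) -> str:
    if dimension == "naturalness":  # data2text-style input, assumed phrasing
        return "question: Is this a fluent utterance </s> utterance: " + output
    raise ValueError(f"no question template for dimension: {dimension}")


print(to_boolean_qa("naturalness", "The Eagle is a coffee shop near the river."))
```

The UniEval evaluator then answers "Yes" or "No" for this input, which is converted into a score for the dimension.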

In our evaluation pipeline, two pre-trained UniEval evaluators are used. One is [unieval-sum](https://huggingface.co/MingZhong/unieval-sum) and the other is [unieval-dialog](https://huggingface.co/MingZhong/unieval-dialog). The two models can be used for the 3 tasks, `summarization`, `dialogue` and `data2text`. Each task has different evaluation dimensions.
| UniEval Model | Task | Dimension(Metric) |
| :------------: | :------------ | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| unieval-sum | summarization | coherence: whether the summary is coherent<br/>consistency: whether the claim is consistent with the given document<br/>fluency: whether the paragraph is fluent<br/>relevance: whether the summary is relevant to the reference |
| unieval-sum | data2text | naturalness: whether the utterance is fluent<br/>informativeness: whether the utterance is informative according to the reference |
| unieval-dialog | dialogue | naturalness: whether the response is natural in the dialogue<br/>coherence: whether the response is coherent in the dialogue history<br/>understandability: whether the response is understandable in the dialogue |

> **NOTE 1:** Task "data2text" uses the same model as task "summarization".

> **NOTE 2:** In the UniEval paper, the `unieval-sum` model demonstrates the best transfer ability, so you can evaluate your customized metric with this model. Details of adding customized metrics can be found in the [FAQ](#faq).

> **NOTE 3:** We do not include all the metrics provided in UniEval in our pipeline, because the data structure and content of the instructions we want to evaluate are not suitable for direct use of some UniEval metrics.

## Evaluation Process
|
||||
|
||||
### Data Format
|
||||
|
||||
#### Target Answers / Predictions
|
||||
|
||||
A JSON file contains one list. Each element in the list is a target answer / prediction record for one instruction / question.
|
||||
An element should have the following fields:
|
||||
|
||||
- `category` (str, compulsory): The category of the instruction / question.
|
||||
- `instruction` (str, compulsory): The instruction / question for the LLM.
|
||||
- `input` (str, optional): The additional context of the instruction / question.
|
||||
- `output` (str, optional): The sample output of the instruction (default: GPT-3.5).
|
||||
- `target` (str, optional): The target answer for the instruction.
|
||||
- `id` (int, compulsory): The ID of the instruction / question.
|
||||
|
||||
If an instruction already has a target answer, the `output` field can be empty. Otherwise, we generate an answer from GPT-3.5 as the `output`, and the `target` field is left empty.
|
||||
|
||||
Example:
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"category": "brainstorming",
|
||||
"instruction": "请介绍一下人工智能的多个领域。",
|
||||
"input": "",
|
||||
"output": "{GPT-3.5 Answers}",
|
||||
"target": "",
|
||||
"id": 1
|
||||
},
|
||||
{
|
||||
"category": "classification",
|
||||
"instruction": "新闻标题:为什么电影《倩女幽魂》中燕赤霞一个道士却拿着金刚经?请根据新闻标题判断新闻所属的分类,你需要从文化,娱乐,体育,财经,房产,教育,科技,旅游,游戏,军事这十类中选择一个答案。",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "{target answer}",
|
||||
"id": 2
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
#### Model Answers / Predictions
|
||||
|
||||
A JSON file contains one list. Each element in the list is a model answer / prediction record for one instruction / question.
|
||||
|
||||
An element should have the following fields:
|
||||
|
||||
- `category` (str, compulsory): The category of the instruction / question.
|
||||
- `instruction` (str, compulsory): The instruction / question for the LLM.
|
||||
- `input` (str, optional): The additional context of the instruction / question.
|
||||
- `output` (str, compulsory): The output from the LLM.
|
||||
- `target` (str, optional): The target answer for the instruction.
|
||||
- `id` (int, compulsory): The ID of the instruction / question.
|
||||
|
||||
Example:
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"category": "brainstorming",
|
||||
"instruction": "请介绍一下人工智能的多个领域。",
|
||||
"input": "",
|
||||
"output": "{Model Answers / Predictions}",
|
||||
"target": "",
|
||||
"id": 1
|
||||
},
|
||||
{
|
||||
"category": "classification",
|
||||
"instruction": "新闻标题:为什么电影《倩女幽魂》中燕赤霞一个道士却拿着金刚经?请根据新闻标题判断新闻所属的分类,你需要从文化,娱乐,体育,财经,房产,教育,科技,旅游,游戏,军事这十类中选择一个答案。",
|
||||
"input": "",
|
||||
"output": "{Model Answers / Predictions}",
|
||||
"target": "{target answer}",
|
||||
"id": 2
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
### Prompt
|
||||
|
||||
#### Battle Prompt
|
||||
|
||||
The following is the Chinese battle prompt. In the battle prompt, the question and answers from two different models are fed into the prompt template. You can find example battle prompt files for Chinese and English in `prompt/battle_prompt`.
|
||||
|
||||
```json
|
||||
{
|
||||
"id": 1,
|
||||
"system_prompt": "你是一个检查回答质量的好助手。",
|
||||
"prompt_template": "[问题]\n{question}\n\n[1号AI助手的答案]\n{answer_1}\n\n[1号AI助手答案终止]\n\n[2号AI助手的答 案]\n{answer_2}\n\n[2号AI助手答案终止]\n\n[要求]\n{prompt}\n\n",
|
||||
"prompt": "我们需要你评价这两个AI助手回答的性能。\n请对他们的回答的有用性、相关性、准确性、详细程度进行评分。每个AI助手都会得到一个1到10分的总分,分数越高表示整体表现越好。\n请首先输出一行,该行只包含两个数值,分别表示1号和2号AI助手的分数。这两个分数之间要有一个空格。在随后的一行中,请对你的评价作出全面的解释,避免任何潜在的偏见,并确保AI助手回答的顺序不会影响您的判断。"
|
||||
}
|
||||
```
|
||||
|
||||
#### Evaluation Prompt
|
||||
|
||||
The following is an example of a Chinese GPT evaluation prompt. In an evaluation prompt, you should define your metrics in `metrics` and provide CoT(Chain-of-Thought) in `CoT`. You can find example evaluation prompt files for Chinese and English in `prompt/evaluation_prompt`.
|
||||
|
||||
```json
|
||||
{
|
||||
"brainstorming": {
|
||||
"id": 1,
|
||||
"category": "brainstorming",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。"
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织,并给出一个1到5的分数,其中5表示语言组织非常好,而1表示语言组织非常差。\n\n语言组织:"
|
||||
},
|
||||
"prompt": "你是一个好助手。请你为下面“头脑风暴”问题的答案打分。\n\n问题如下:\n\n{question}\n\n答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
`"metrics"`: the metrics that can be used in GPT evaluation. This field determines which metrics can be added to your config file.

`"CoT"`: the evaluation steps you prompt the GPT model with for each metric defined in `"metrics"`.
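
As a sanity check, a small script like the following can verify that every metric listed under `"GPT"` in your config file is actually defined in the `"metrics"` field of the corresponding category in the evaluation prompt file. The file names below are placeholder assumptions.

```python
import json

# Hypothetical consistency check: every metric listed under "GPT" in the config
# file must be defined in the "metrics" field of the evaluation prompt for that
# category. The file names below are placeholder assumptions.
with open("config/config_en.json", encoding="utf-8") as f:
    config = json.load(f)
with open("prompt/evaluation_prompt/eval_en.json", encoding="utf-8") as f:
    eval_prompt = json.load(f)

for category, settings in config["category"].items():
    defined = set(eval_prompt.get(category, {}).get("metrics", {}))
    missing = [m for m in settings.get("GPT", []) if m not in defined]
    if missing:
        print(f"{category}: metrics missing from the prompt file: {missing}")
```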
### Evaluation
|
||||
|
||||
#### Configuration
|
||||
|
||||
The following is an example of an English config file. The configuration file controls how the pipeline evaluates the model. You need to specify GPT evaluation metrics, automatic metrics and UniEval metrics in the keys `GPT`, `Metrics` and `UniEval` (English only). You can find example Chinese and English config files in `config`.
|
||||
|
||||
```json
|
||||
{
|
||||
"language": "en",
|
||||
"path_for_UniEval": {
|
||||
"summarization": "path to unieval-sum model",
|
||||
"dialogue": "path to unieval-dialog model",
|
||||
"data2text": "path to unieval-sum model"
|
||||
},
|
||||
"category": {
|
||||
"brainstorming": {
|
||||
"GPT": ["relevance", "creativity", "practicality", "reasonableness"],
|
||||
"Metrics": ["Distinct"],
|
||||
"UniEval": [
|
||||
"summarization-fluency",
|
||||
"data2text-naturalness",
|
||||
"data2text-informativeness"
|
||||
]
|
||||
},
|
||||
"chat": {
|
||||
"GPT": ["relevance", "naturalness", "engagingness", "reasonableness"],
|
||||
"Metrics": ["Distinct"],
|
||||
"UniEval": [
|
||||
"dialogue-naturalness",
|
||||
"dialogue-coherence",
|
||||
"dialogue-understandability"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
`"language"`: the language used to evaluate the model capability. We only support Chinese `"cn"` for now.
|
||||
|
||||
`"path_for_UniEval"`: path to the UniEval model.
|
||||
|
||||
`"category"`: the category/categories needed to evaluate the model capability.
|
||||
|
||||
`"GPT"`: the metrics you want to use for GPT evaluation.
|
||||
|
||||
`"Metrics"`: the metrics you want to use for automatic metrics evaluation.

`"UniEval"`: the metrics you want to use for UniEval metrics evaluation. Each metric has to be in the `"{task}-{metric}"` format because different tasks share the same metric names, such as naturalness and coherence.
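
For illustration, a hypothetical helper could split such an entry into the task (used to pick the model from `"path_for_UniEval"`) and the dimension; splitting on the first dash keeps the two parts unambiguous:

```python
# Hypothetical helper: split a "{task}-{metric}" entry from the config into the
# UniEval task and the evaluation dimension.
def parse_unieval_entry(entry: str) -> tuple[str, str]:
    task, dimension = entry.split("-", 1)  # split only on the first dash
    return task, dimension


print(parse_unieval_entry("summarization-fluency"))       # ('summarization', 'fluency')
print(parse_unieval_entry("dialogue-understandability"))  # ('dialogue', 'understandability')
```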
You can remove a key such as `"Metrics"` to skip evaluating answers with its corresponding evaluation metrics.

You can create your config file based on the available settings listed in the following table.

| "category" | "GPT" | "Metrics" | "UniEval" |
| :--------------: | :---------------------: | :---------: | :--------------------------: |
| "brainstorming" | "language organization" | "BLEU" | "dialogue-naturalness" |
| "chat" | "relevance" | "ROUGE" | "dialogue-coherence" |
| "classification" | "creativity" | "Distinct" | "dialogue-understandability" |
| "closed_qa" | "practicality" | "BERTScore" | "data2text-naturalness" |
| "extraction" | "correctness" | "Precision" | "data2text-informativeness" |
| "generation" | "naturalness" | "Recall" | "summarization-coherence" |
| "open_qa" | "engagingness" | "F1 score" | "summarization-consistency" |
| "rewriting" | "reasonableness" | "CHRF" | "summarization-fluency" |
| "roleplay" | "diversity" | | "summarization-relevance" |
| "summarization" | "fidelity" | | |
| | "conciseness" | | |

> **NOTE:** For categories which don't have standard answers, such as `brainstorming`, you should avoid similarity-based automatic metrics such as `BLEU` and `ROUGE`, and use `Distinct` instead in your config file.

#### Evaluate
|
||||
|
||||
After setting the configuration file, you can evaluate the model using `eval.py`. If you want to make comparisons between answers of two different models, you should specify two answer files in the argument `answer_file_list` and two model names in the argument `model_name_list`. If you want to evaluate one answer file, the length of both `answer_file_list` and `model_name_list` should be 1 and the program will perform evaluation using automatic metrics and GPT models.
|
||||
|
||||
An example script is provided as follows:
|
||||
|
||||
```shell
|
||||
python eval.py \
|
||||
--config_file "path to the config file" \
|
||||
--battle_prompt_file "path to the prompt file for battle" \
|
||||
--gpt_evaluation_prompt_file "path to the prompt file for gpt evaluation" \
|
||||
--target_file "path to the target answer file" \
|
||||
--answer_file_list "path to the answer files of at most 2 models" \
|
||||
--model_name_list "the names of at most 2 models" \
|
||||
--gpt_model "which GPT model to use for evaluation" \
|
||||
--save_path "path to save results" \
|
||||
--openai_key "your openai key" \
|
||||
```
|
||||
|
||||
If you want GPT evaluation with reference, you can add an argument `--gpt_with_reference`.
|
||||
|
||||
## FAQ
|
||||
|
||||
<details><summary><b>How can I add a new GPT evaluation metric?</b></summary>
|
||||
|
||||
For example, if you want to add a new metric `persuasiveness` to the category `brainstorming`, you should add the metric definition and its corresponding CoT(Chain-of-Thought) to the evaluation prompt file in `prompt/evaluation_prompt`. The CoT can be generated using ChatGPT: you can prompt ChatGPT to generate evaluation steps for the new metric.
|
||||
|
||||
```json
|
||||
{
|
||||
"brainstorming": {
|
||||
"id": 1,
|
||||
"category": "brainstorming",
|
||||
"metrics": {
|
||||
"persuasiveness": "persuasiveness(1-5):a short description for persuasiveness"
|
||||
},
|
||||
"CoT": {
|
||||
"persuasiveness": "CoT for persuasiveness\n\npersuasiveness:"
|
||||
},
|
||||
"prompt": "You are a good assistant. Please rate the given answer to the \"brainstorming\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
<details><summary><b>How can I add a new UniEval evaluation metric?</b></summary>
|
||||
|
||||
For example, if you want to add a new metric `persuasiveness` to the task `data2text`, you should add a Boolean QA question about the metric in the function `add_question` in `unieval/utils.py`. Note that how effectively the model evaluates this metric is unknown, and you may need some experiments to test whether the model is capable of evaluating it.
|
||||
|
||||
```python
if task == 'data2text':
    if dimension == 'persuasiveness':
        cur_input = 'question: Is this a persuasive utterance </s> utterance: ' + output[i]
```
|
||||
|
||||
</details>
|
||||
|
||||
## To Do
|
||||
|
||||
- [x] Add evaluation for English capability
|
||||
- [x] Support UniEval
|
||||
- [x] Support GPT-4 evaluation
|
||||
- [x] Support GPT evaluation with reference
|
||||
|
||||
## Citations
|
||||
|
||||
```bibtex
|
||||
@misc{vicuna2023,
|
||||
title = {Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\%* ChatGPT Quality},
|
||||
url = {https://vicuna.lmsys.org},
|
||||
author = {Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P.},
|
||||
month = {March},
|
||||
year = {2023}
|
||||
}
|
||||
|
||||
@misc{liu2023geval,
|
||||
title={G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment},
|
||||
author={Yang Liu and Dan Iter and Yichong Xu and Shuohang Wang and Ruochen Xu and Chenguang Zhu},
|
||||
year={2023},
|
||||
eprint={2303.16634},
|
||||
archivePrefix={arXiv},
|
||||
primaryClass={cs.CL}
|
||||
}
|
||||
|
||||
@misc{zhong2022unified,
|
||||
title={Towards a Unified Multi-Dimensional Evaluator for Text Generation},
|
||||
author={Ming Zhong and Yang Liu and Da Yin and Yuning Mao and Yizhu Jiao and Pengfei Liu and Chenguang Zhu and Heng Ji and Jiawei Han},
|
||||
year={2022},
|
||||
eprint={2210.07197},
|
||||
archivePrefix={arXiv},
|
||||
primaryClass={cs.CL}
|
||||
}
|
||||
```
|
@@ -1,204 +0,0 @@
|
||||
{
|
||||
"language": "cn",
|
||||
"category": {
|
||||
"brainstorming": {
|
||||
"GPT": [
|
||||
"language organization",
|
||||
"relevance",
|
||||
"creativity",
|
||||
"practicality",
|
||||
"reasonableness"
|
||||
],
|
||||
"Metrics": [
|
||||
"Distinct"
|
||||
]
|
||||
},
|
||||
"chat": {
|
||||
"GPT": [
|
||||
"language organization",
|
||||
"naturalness",
|
||||
"engagingness",
|
||||
"fidelity"
|
||||
],
|
||||
"Metrics": [
|
||||
"Distinct"
|
||||
]
|
||||
},
|
||||
"classification": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
"Precision",
|
||||
"Recall",
|
||||
"F1 score",
|
||||
"CHRF"
|
||||
]
|
||||
},
|
||||
"closed_qa": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
"BLEU",
|
||||
"ROUGE",
|
||||
"BERTScore",
|
||||
"CHRF"
|
||||
]
|
||||
},
|
||||
"extraction": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
"Precision",
|
||||
"Recall",
|
||||
"F1 score",
|
||||
"CHRF"
|
||||
]
|
||||
},
|
||||
"generation": {
|
||||
"GPT": [
|
||||
"language organization",
|
||||
"relevance",
|
||||
"diversity"
|
||||
],
|
||||
"Metrics": [
|
||||
"BLEU",
|
||||
"ROUGE",
|
||||
"BERTScore"
|
||||
]
|
||||
},
|
||||
"logical_reasoning": {
|
||||
"GPT": [
|
||||
"correctness",
|
||||
"relevance",
|
||||
"reasonableness"
|
||||
],
|
||||
"Metrics": [
|
||||
"BLEU",
|
||||
"ROUGE",
|
||||
"BERTScore",
|
||||
"CHRF"
|
||||
]
|
||||
},
|
||||
"open_qa": {
|
||||
"GPT": [
|
||||
"language organization",
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
"Distinct"
|
||||
]
|
||||
},
|
||||
"rewriting": {
|
||||
"GPT": [
|
||||
"language organization",
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
"BLEU",
|
||||
"ROUGE",
|
||||
"BERTScore"
|
||||
]
|
||||
},
|
||||
"roleplay": {
|
||||
"GPT": [
|
||||
"language organization",
|
||||
"relevance",
|
||||
"fidelity",
|
||||
"creativity"
|
||||
],
|
||||
"Metrics": [
|
||||
"Distinct"
|
||||
]
|
||||
},
|
||||
"summarization": {
|
||||
"GPT": [
|
||||
"language organization",
|
||||
"relevance",
|
||||
"correctness",
|
||||
"conciseness"
|
||||
],
|
||||
"Metrics": [
|
||||
]
|
||||
},
|
||||
"Finance": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
]
|
||||
},
|
||||
"Law": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
]
|
||||
},
|
||||
"Education": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
]
|
||||
},
|
||||
"Medical": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
]
|
||||
},
|
||||
"STEM": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
]
|
||||
},
|
||||
"SocialScience": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
]
|
||||
},
|
||||
"Humanity": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
]
|
||||
},
|
||||
"Other": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
]
|
||||
},
|
||||
"ethics": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
@@ -1,283 +0,0 @@
|
||||
{
|
||||
"language": "en",
|
||||
"path_for_UniEval": {
|
||||
"summarization": "path to unieval-sum",
|
||||
"dialogue": "path to unieval-dialog",
|
||||
"data2text": "path to unieval-sum"
|
||||
},
|
||||
"category": {
|
||||
"brainstorming": {
|
||||
"GPT": [
|
||||
"language organization",
|
||||
"relevance",
|
||||
"creativity",
|
||||
"practicality",
|
||||
"reasonableness"
|
||||
],
|
||||
"Metrics": [
|
||||
"Distinct"
|
||||
],
|
||||
"UniEval": [
|
||||
"summarization-fluency",
|
||||
"data2text-naturalness",
|
||||
"data2text-informativeness"
|
||||
]
|
||||
},
|
||||
"chat": {
|
||||
"GPT": [
|
||||
"language organization",
|
||||
"naturalness",
|
||||
"engagingness",
|
||||
"fidelity"
|
||||
],
|
||||
"Metrics": [
|
||||
"Distinct"
|
||||
],
|
||||
"UniEval": [
|
||||
"summarization-fluency",
|
||||
"dialogue-naturalness",
|
||||
"dialogue-coherence",
|
||||
"dialogue-understandability",
|
||||
"data2text-naturalness",
|
||||
"data2text-informativeness"
|
||||
]
|
||||
},
|
||||
"classification": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
"Precision",
|
||||
"Recall",
|
||||
"F1 score",
|
||||
"CHRF"
|
||||
],
|
||||
"UniEval": [
|
||||
"summarization-fluency",
|
||||
"data2text-naturalness",
|
||||
"data2text-informativeness"
|
||||
]
|
||||
},
|
||||
"closed_qa": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
"BLEU",
|
||||
"ROUGE",
|
||||
"BERTScore",
|
||||
"CHRF"
|
||||
],
|
||||
"UniEval": [
|
||||
"summarization-fluency",
|
||||
"data2text-naturalness",
|
||||
"data2text-informativeness"
|
||||
]
|
||||
},
|
||||
"extraction": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
"Precision",
|
||||
"Recall",
|
||||
"F1 score",
|
||||
"CHRF"
|
||||
],
|
||||
"UniEval": [
|
||||
"summarization-fluency",
|
||||
"data2text-naturalness",
|
||||
"data2text-informativeness"
|
||||
]
|
||||
},
|
||||
"generation": {
|
||||
"GPT": [
|
||||
"language organization",
|
||||
"relevance",
|
||||
"diversity"
|
||||
],
|
||||
"Metrics": [
|
||||
"BLEU",
|
||||
"ROUGE",
|
||||
"BERTScore"
|
||||
],
|
||||
"UniEval": [
|
||||
"summarization-fluency",
|
||||
"data2text-naturalness",
|
||||
"data2text-informativeness"
|
||||
]
|
||||
},
|
||||
"logical_reasoning": {
|
||||
"GPT": [
|
||||
"correctness",
|
||||
"relevance",
|
||||
"reasonableness"
|
||||
],
|
||||
"Metrics": [
|
||||
"BLEU",
|
||||
"ROUGE",
|
||||
"BERTScore",
|
||||
"CHRF"
|
||||
],
|
||||
"UniEval": [
|
||||
]
|
||||
},
|
||||
"open_qa": {
|
||||
"GPT": [
|
||||
"language organization",
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
"Distinct"
|
||||
],
|
||||
"UniEval": [
|
||||
"summarization-fluency",
|
||||
"data2text-naturalness",
|
||||
"data2text-informativeness"
|
||||
]
|
||||
},
|
||||
"rewriting": {
|
||||
"GPT": [
|
||||
"language organization",
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
"BLEU",
|
||||
"ROUGE",
|
||||
"BERTScore"
|
||||
],
|
||||
"UniEval": [
|
||||
"summarization-fluency",
|
||||
"data2text-naturalness",
|
||||
"data2text-informativeness"
|
||||
]
|
||||
},
|
||||
"roleplay": {
|
||||
"GPT": [
|
||||
"language organization",
|
||||
"relevance",
|
||||
"fidelity",
|
||||
"creativity"
|
||||
],
|
||||
"Metrics": [
|
||||
"Distinct"
|
||||
],
|
||||
"UniEval": [
|
||||
"summarization-fluency",
|
||||
"data2text-naturalness",
|
||||
"data2text-informativeness"
|
||||
]
|
||||
},
|
||||
"summarization": {
|
||||
"GPT": [
|
||||
"language organization",
|
||||
"relevance",
|
||||
"correctness",
|
||||
"conciseness"
|
||||
],
|
||||
"Metrics": [
|
||||
"BLEU",
|
||||
"ROUGE",
|
||||
"BERTScore",
|
||||
"CHRF"
|
||||
],
|
||||
"UniEval": [
|
||||
]
|
||||
},
|
||||
"Finance": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
],
|
||||
"UniEval": [
|
||||
]
|
||||
},
|
||||
"Law": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
],
|
||||
"UniEval": [
|
||||
]
|
||||
},
|
||||
"Education": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
],
|
||||
"UniEval": [
|
||||
]
|
||||
},
|
||||
"Medical": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
],
|
||||
"UniEval": [
|
||||
]
|
||||
},
|
||||
"STEM": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
],
|
||||
"UniEval": [
|
||||
]
|
||||
},
|
||||
"SocialScience": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
],
|
||||
"UniEval": [
|
||||
]
|
||||
},
|
||||
"Humanity": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
],
|
||||
"UniEval": [
|
||||
]
|
||||
},
|
||||
"Other": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
],
|
||||
"UniEval": [
|
||||
]
|
||||
},
|
||||
"ethics": {
|
||||
"GPT": [
|
||||
"relevance",
|
||||
"correctness"
|
||||
],
|
||||
"Metrics": [
|
||||
],
|
||||
"UniEval": [
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
@@ -1,120 +0,0 @@
|
||||
import argparse
|
||||
import os
|
||||
|
||||
import openai
|
||||
from evaluator import Evaluator
|
||||
from utils import jload
|
||||
|
||||
|
||||
def main(args):
|
||||
assert len(args.answer_file_list) == len(
|
||||
args.model_name_list
|
||||
), "The number of answer files and model names should be equal!"
|
||||
|
||||
# load config
|
||||
config = jload(args.config_file)
|
||||
|
||||
if config["language"] in ["cn", "en"]:
|
||||
# get metric settings for all categories
|
||||
metrics_per_category = {}
|
||||
for category in config["category"].keys():
|
||||
metrics_all = {}
|
||||
for metric_type, metrics in config["category"][category].items():
|
||||
metrics_all[metric_type] = metrics
|
||||
metrics_per_category[category] = metrics_all
|
||||
|
||||
battle_prompt = None
|
||||
if args.battle_prompt_file:
|
||||
battle_prompt = jload(args.battle_prompt_file)
|
||||
|
||||
gpt_evaluation_prompt = None
|
||||
if args.gpt_evaluation_prompt_file:
|
||||
gpt_evaluation_prompt = jload(args.gpt_evaluation_prompt_file)
|
||||
|
||||
if len(args.model_name_list) == 2 and not battle_prompt:
|
||||
raise Exception("No prompt file for battle provided. Please specify the prompt file for battle!")
|
||||
|
||||
if len(args.model_name_list) == 1 and not gpt_evaluation_prompt:
|
||||
raise Exception(
|
||||
"No prompt file for gpt evaluation provided. Please specify the prompt file for gpt evaluation!"
|
||||
)
|
||||
|
||||
if args.gpt_model == "text-davinci-003" and args.gpt_with_reference:
|
||||
raise Exception(
|
||||
"GPT evaluation with reference is not supported for text-davinci-003. You should specify chat models such as gpt-3.5-turbo or gpt-4."
|
||||
)
|
||||
|
||||
# initialize evaluator
|
||||
evaluator = Evaluator(
|
||||
metrics_per_category,
|
||||
battle_prompt,
|
||||
gpt_evaluation_prompt,
|
||||
args.gpt_model,
|
||||
config["language"],
|
||||
config.get("path_for_UniEval", None),
|
||||
args.gpt_with_reference,
|
||||
)
|
||||
if len(args.model_name_list) == 2:
|
||||
answers1 = jload(args.answer_file_list[0])
|
||||
answers2 = jload(args.answer_file_list[1])
|
||||
|
||||
assert len(answers1) == len(answers2), "The number of answers for two models should be equal!"
|
||||
|
||||
evaluator.battle(answers1=answers1, answers2=answers2)
|
||||
evaluator.save(args.save_path, args.model_name_list)
|
||||
elif len(args.model_name_list) == 1:
|
||||
targets = jload(args.target_file)
|
||||
answers = jload(args.answer_file_list[0])
|
||||
|
||||
assert len(targets) == len(answers), "The number of target answers and model answers should be equal!"
|
||||
|
||||
evaluator.evaluate(answers=answers, targets=targets)
|
||||
evaluator.save(args.save_path, args.model_name_list)
|
||||
else:
|
||||
raise ValueError("Unsupported number of answer files and model names!")
|
||||
else:
|
||||
raise ValueError(f'Unsupported language {config["language"]}!')
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(description="ColossalAI LLM evaluation pipeline.")
|
||||
parser.add_argument(
|
||||
"--config_file", type=str, default=None, required=True, help="path to the file of target results"
|
||||
)
|
||||
parser.add_argument("--battle_prompt_file", type=str, default=None, help="path to the prompt file for battle")
|
||||
parser.add_argument(
|
||||
"--gpt_evaluation_prompt_file", type=str, default=None, help="path to the prompt file for gpt evaluation"
|
||||
)
|
||||
parser.add_argument("--target_file", type=str, default=None, help="path to the target answer (ground truth) file")
|
||||
parser.add_argument(
|
||||
"--answer_file_list",
|
||||
type=str,
|
||||
nargs="+",
|
||||
default=[],
|
||||
required=True,
|
||||
help="path to the answer files of at most 2 models",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--model_name_list", type=str, nargs="+", default=[], required=True, help="the names of at most 2 models"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--gpt_model",
|
||||
default="gpt-3.5-turbo",
|
||||
choices=["text-davinci-003", "gpt-3.5-turbo", "gpt-4"],
|
||||
help="which GPT model to use for evaluation",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--gpt_with_reference",
|
||||
default=False,
|
||||
action="store_true",
|
||||
help="whether to include reference answer in gpt evaluation",
|
||||
)
|
||||
parser.add_argument("--save_path", type=str, default="results", help="path to save evaluation results")
|
||||
parser.add_argument("--openai_key", type=str, default=None, required=True, help="Your openai key")
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.openai_key is not None:
|
||||
os.environ["OPENAI_API_KEY"] = args.openai_key
|
||||
openai.api_key = os.getenv("OPENAI_API_KEY")
|
||||
|
||||
main(args)
|
@@ -1,9 +0,0 @@
|
||||
python eval.py \
|
||||
--config_file "path to the config file" \
|
||||
--battle_prompt_file "path to the prompt file for battle" \
|
||||
--gpt_evaluation_prompt_file "path to the prompt file for gpt evaluation" \
|
||||
--target_file "path to the target answer file" \
|
||||
--answer_file_list "path to the answer files of at most 2 models" \
|
||||
--model_name_list "the names of at most 2 models" \
|
||||
--save_path "path to save results" \
|
||||
--openai_key "your openai key" \
|
@@ -1,229 +0,0 @@
|
||||
import os
|
||||
from typing import Any, Dict, List
|
||||
|
||||
import gpt_evaluate
|
||||
import metrics
|
||||
import unieval
|
||||
from utils import analyze_automatic_results, get_data_per_category, save_automatic_results
|
||||
|
||||
|
||||
class Evaluator(object):
|
||||
"""
|
||||
A class named Evaluator that performs GPT-3.5/GPT-4 evaluation
|
||||
and automatic evaluation.
|
||||
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
params: Dict[str, Any],
|
||||
battle_prompt: Dict[str, Any],
|
||||
gpt_evaluation_prompt: Dict[str, Any],
|
||||
gpt_model: str,
|
||||
language: str,
|
||||
path_for_UniEval: Dict[str, str],
|
||||
gpt_with_reference: bool,
|
||||
) -> None:
|
||||
self.params = params
|
||||
self.battle_prompt = battle_prompt
|
||||
self.gpt_evaluation_prompt = gpt_evaluation_prompt
|
||||
self.gpt_model = gpt_model
|
||||
self.language = language
|
||||
self.path_for_UniEval = path_for_UniEval
|
||||
self.gpt_with_reference = gpt_with_reference
|
||||
self.automatic_metric_stats = dict()
|
||||
self.unieval_metric_stats = dict()
|
||||
self.gpt_evaluation_results = dict()
|
||||
self.battle_results = []
|
||||
|
||||
def battle(self, answers1: List[Dict], answers2: List[Dict]) -> None:
|
||||
"""
|
||||
Comparison between two models using GPT-4 as the reviewer.
|
||||
"""
|
||||
|
||||
self.battle_results = gpt_evaluate.battle(answers1, answers2, self.battle_prompt)
|
||||
|
||||
def evaluate(self, answers: List[Dict], targets: List[Dict]) -> None:
|
||||
"""
|
||||
A comprehensive evaluation of the answers from the model.
|
||||
The function evaluates the model's performance from different perspectives
|
||||
using GPT-3.5, GPT-4, and off-the-shelf evaluation metrics.
|
||||
|
||||
The metrics will be decided by the config file.
|
||||
|
||||
"""
|
||||
|
||||
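# Dispatch a metric name to its implementation in metrics.py; predicts_list and targets_list are read from the enclosing scope.
|
||||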
def switch(metric, language):
|
||||
if metric == "BLEU":
|
||||
return metrics.bleu_score(preds=predicts_list, targets=targets_list, language=language)
|
||||
elif metric == "ROUGE":
|
||||
return metrics.rouge_score(preds=predicts_list, targets=targets_list, language=language)
|
||||
elif metric == "Distinct":
|
||||
return metrics.distinct_score(preds=predicts_list, language=language)
|
||||
elif metric == "BERTScore":
|
||||
return metrics.bert_score(preds=predicts_list, targets=targets_list, language=language)
|
||||
elif metric == "Precision":
|
||||
return metrics.precision(preds=predicts_list, targets=targets_list, language=language)
|
||||
elif metric == "Recall":
|
||||
return metrics.recall(preds=predicts_list, targets=targets_list, language=language)
|
||||
elif metric == "F1 score":
|
||||
return metrics.F1_score(preds=predicts_list, targets=targets_list, language=language)
|
||||
elif metric == "CHRF":
|
||||
return metrics.chrf_score(preds=predicts_list, targets=targets_list, language=language)
|
||||
else:
|
||||
raise ValueError(f"Unexpected metric")
|
||||
|
||||
answers_per_category = get_data_per_category(answers, list(self.params.keys()))
|
||||
targets_per_category = get_data_per_category(targets, list(self.params.keys()))
|
||||
|
||||
# automatic evaluation
|
||||
for category in self.params:
|
||||
if len(answers_per_category[category]) == 0:
|
||||
print(f"Category {category} specified in your config doesn't have corresponding answers!")
|
||||
continue
|
||||
|
||||
if self.params[category].get("Metrics", None) is None:
|
||||
continue
|
||||
|
||||
category_metrics = self.params[category]["Metrics"]
|
||||
self.automatic_metric_stats[category] = {}
|
||||
|
||||
targets_list = [
|
||||
target["target"] if target["target"] else target["output"] for target in targets_per_category[category]
|
||||
]
|
||||
predicts_list = [answer["output"] for answer in answers_per_category[category]]
|
||||
|
||||
for metric in category_metrics:
|
||||
self.automatic_metric_stats[category].update(switch(metric=metric, language=self.language))
|
||||
|
||||
# UniEval evaluation
|
||||
# self.unieval_metric_stats's key is "task" instead of "category".
|
||||
# Iterating "task" first will avoid repeated loading models because one task corresponds to one UniEval model.
|
||||
# If key is "category", different models will be loaded for multiple times across categories because the user may require different task(models) to evaluate one category.
|
||||
for category in self.params:
|
||||
if len(answers_per_category[category]) == 0:
|
||||
print(f"Category {category} specified in your config doesn't have corresponding answers!")
|
||||
continue
|
||||
|
||||
if self.params[category].get("UniEval", None) is None:
|
||||
continue
|
||||
|
||||
if self.params[category]["UniEval"] and self.language == "cn":
|
||||
raise Exception(
|
||||
"UniEval doesn't support Chinese! Please remove UniEval config in your Chinese config file."
|
||||
)
|
||||
|
||||
category_metrics = self.params[category]["UniEval"]
|
||||
|
||||
for task, metric in [tuple(category_metric.split("-")) for category_metric in category_metrics]:
|
||||
if self.unieval_metric_stats.get(task, None) is None:
|
||||
self.unieval_metric_stats[task] = {category: {metric: 0}}
|
||||
elif self.unieval_metric_stats[task].get(category, None) is None:
|
||||
self.unieval_metric_stats[task][category] = {metric: 0}
|
||||
else:
|
||||
self.unieval_metric_stats[task][category][metric] = 0
|
||||
|
||||
for task in self.unieval_metric_stats:
|
||||
if self.path_for_UniEval is None:
|
||||
raise Exception(f"Please specify the path for UniEval model in the config file!")
|
||||
|
||||
if self.path_for_UniEval.get(task, None) is None:
|
||||
raise Exception(f"Please specify the model path for task {task} in the config file!")
|
||||
|
||||
print(f"Load UniEval model for task {task}.")
|
||||
|
||||
uni_evaluator = unieval.get_evaluator(task, model_name_or_path=self.path_for_UniEval[task])
|
||||
for category in self.unieval_metric_stats[task]:
|
||||
targets_list = [
|
||||
target["target"] if target["target"] else target["output"]
|
||||
for target in targets_per_category[category]
|
||||
]
|
||||
predicts_list = [answer["output"] for answer in answers_per_category[category]]
|
||||
sources_list = [answer["instruction"] + answer["input"] for answer in answers_per_category[category]]
|
||||
|
||||
data = unieval.convert_data_to_unieval_format(predicts_list, sources_list, targets_list)
|
||||
scores = uni_evaluator.evaluate(
|
||||
data, category, dims=list(self.unieval_metric_stats[task][category].keys()), overall=False
|
||||
)
|
||||
avg_scores = unieval.calculate_average_score(scores)
|
||||
|
||||
self.unieval_metric_stats[task][category].update(avg_scores)
|
||||
|
||||
# gpt evaluation
|
||||
for category in self.params:
|
||||
if len(answers_per_category[category]) == 0:
|
||||
print(f"Category {category} specified in your config doesn't have corresponding answers!")
|
||||
continue
|
||||
|
||||
if self.params[category].get("GPT", None) is None:
|
||||
continue
|
||||
|
||||
category_metrics = self.params[category]["GPT"]
|
||||
|
||||
prompt = self.gpt_evaluation_prompt.get(category, None)
|
||||
if prompt is None:
|
||||
print(f"No prompt for category {category}! Use prompt for category general now.")
|
||||
prompt = self.gpt_evaluation_prompt["general"]
|
||||
|
||||
self.gpt_evaluation_results[category] = gpt_evaluate.evaluate(
|
||||
answers_per_category[category],
|
||||
prompt,
|
||||
category_metrics,
|
||||
category,
|
||||
self.gpt_model,
|
||||
self.language,
|
||||
references=targets_per_category[category] if self.gpt_with_reference else None,
|
||||
)
|
||||
|
||||
def save(self, path: str, model_name_list: List[str]) -> None:
|
||||
"""
|
||||
Save evaluation results of GPT-3.5, GPT-4, and off-the-shelf evaluation metrics.
|
||||
|
||||
"""
|
||||
|
||||
if len(model_name_list) == 2:
|
||||
save_path = os.path.join(path, "gpt_evaluate", "battle_results")
|
||||
gpt_evaluate.save_battle_results(self.battle_results, model_name_list[0], model_name_list[1], save_path)
|
||||
else:
|
||||
if self.automatic_metric_stats:
|
||||
# Save evaluation results for automatic metrics
|
||||
automatic_base_save_path = os.path.join(path, "automatic_results")
|
||||
automatic_results_save_path = os.path.join(automatic_base_save_path, "evaluation_results")
|
||||
|
||||
save_automatic_results(model_name_list[0], self.automatic_metric_stats, automatic_results_save_path)
|
||||
|
||||
# Save charts and csv.
|
||||
automatic_analyses_save_path = os.path.join(automatic_base_save_path, "evaluation_analyses")
|
||||
analyze_automatic_results(automatic_results_save_path, automatic_analyses_save_path)
|
||||
|
||||
if self.unieval_metric_stats:
|
||||
# Save evaluation results for UniEval metrics
|
||||
unieval_base_save_path = os.path.join(path, "unieval_results")
|
||||
unieval_results_save_path = os.path.join(unieval_base_save_path, "evaluation_results")
|
||||
|
||||
unieval.save_unieval_results(model_name_list[0], self.unieval_metric_stats, unieval_results_save_path)
|
||||
|
||||
# Save charts and csv.
|
||||
unieval_analyses_save_path = os.path.join(unieval_base_save_path, "evaluation_analyses")
|
||||
unieval.analyze_unieval_results(unieval_results_save_path, unieval_analyses_save_path)
|
||||
|
||||
if self.gpt_evaluation_results:
|
||||
# Save evaluation results for GPT evaluation metrics.
|
||||
gpt_base_save_path = os.path.join(path, "gpt_evaluate", "gpt_evaluate_results")
|
||||
gpt_evaluation_results_save_path = os.path.join(gpt_base_save_path, "evaluation_results")
|
||||
|
||||
all_evaluations = gpt_evaluate.save_gpt_evaluation_results(
|
||||
model_name_list[0], self.gpt_evaluation_results, gpt_evaluation_results_save_path
|
||||
)
|
||||
|
||||
# Start to calculate scores and save statistics.
|
||||
gpt_evaluation_statistics_save_path = os.path.join(gpt_base_save_path, "evaluation_statistics")
|
||||
gpt_evaluate.save_gpt_evaluation_statistics(
|
||||
model_name_list[0], all_evaluations, gpt_evaluation_statistics_save_path
|
||||
)
|
||||
|
||||
# Save charts and csv.
|
||||
gpt_evaluation_analyses_save_path = os.path.join(gpt_base_save_path, "evaluation_analyses")
|
||||
gpt_evaluate.analyze_gpt_evaluation_statistics(
|
||||
gpt_evaluation_statistics_save_path, gpt_evaluation_analyses_save_path
|
||||
)
|
@@ -1,780 +0,0 @@
|
||||
import concurrent.futures
|
||||
import os
|
||||
import re
|
||||
import time
|
||||
from copy import deepcopy
|
||||
from typing import Any, Dict, List
|
||||
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
import openai
|
||||
import pandas as pd
|
||||
import seaborn as sns
|
||||
import tqdm
|
||||
from utils import jdump, jload
|
||||
|
||||
ref_step_template = {
|
||||
"en": "Now please compare the answer with the {adjective} answer, determine whether the answer is able to achieve the same level of {metric}.\n\n",
|
||||
"cn": "请比较答案与上面的{adjective}答案,确定答案是否可以达到与该{adjective}答案同样水平的{metric}。\n\n",
|
||||
}
|
||||
|
||||
ref_answer_template_general = {
|
||||
"en": "\nAn example answer with good quality is as follows:\n\n{answer}\n\n",
|
||||
"cn": "\n一个优质的示例答案如下:\n\n{answer}\n\n",
|
||||
}
|
||||
|
||||
ref_answer_template_correctness = {
|
||||
"en": "\nA correct answer is as follows:\n\n{answer}\n\n",
|
||||
"cn": "\n标准答案如下:\n\n{answer}\n\n",
|
||||
}
|
||||
|
||||
|
||||
def get_battle_result(sys_prompt: str, user_prompt: str, id: int, max_tokens: int = 2048) -> Dict[str, Any]:
|
||||
"""
|
||||
Get battle evaluation from GPT-4.
|
||||
|
||||
Args:
|
||||
sys_prompt: prompt for the system.
|
||||
user_prompt: prompt for the user.
|
||||
id: id of the answers for comparison.
|
||||
max_tokens: the maximum number of tokens to generate in the chat completion.
|
||||
|
||||
Returns:
|
||||
An evaluation of one comparison.
|
||||
"""
|
||||
|
||||
MAX_API_RETRY = 3
|
||||
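# Retry the request up to MAX_API_RETRY times on transient API errors.
|
||||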
for _ in range(MAX_API_RETRY):
|
||||
try:
|
||||
response = openai.ChatCompletion.create(
|
||||
model="gpt-4",
|
||||
messages=[
|
||||
{"role": "system", "content": sys_prompt},
|
||||
{
|
||||
"role": "user",
|
||||
"content": user_prompt,
|
||||
},
|
||||
],
|
||||
temperature=0.2,
|
||||
max_tokens=max_tokens,
|
||||
)
|
||||
evaluation = response["choices"][0]["message"]["content"]
|
||||
return {"evaluation": evaluation, "id": id}
|
||||
except Exception as e:
|
||||
print(e)
|
||||
time.sleep(1)
|
||||
print(f"Evaluation {id} failed after {MAX_API_RETRY} retries.")
|
||||
return {"evaluation": "", "id": id}
|
||||
|
||||
|
||||
def parse_battle_score(evaluation: str) -> List[float]:
|
||||
"""
|
||||
Parse evaluation from GPT-4 and get the scores of model 1 and 2.
|
||||
|
||||
Args:
|
||||
evaluation: evaluation from GPT-4.
|
||||
|
||||
Returns:
|
||||
A score pair of two different model answers.
|
||||
"""
|
||||
|
||||
try:
|
||||
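# Try several phrasings GPT-4 may use for the two scores ("x out of 10", "a score of x", "x/10") before falling back to parsing the first line.
|
||||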
pattern = re.compile("([0-9]|10) out of 10")
|
||||
sp = re.findall(pattern, evaluation)
|
||||
if len(re.findall(pattern, evaluation)) == 2:
|
||||
return [float(sp[0]), float(sp[1])]
|
||||
|
||||
pattern = re.compile("a score of ([0-9]|10)")
|
||||
sp = re.findall(pattern, evaluation)
|
||||
if len(re.findall(pattern, evaluation)) == 2:
|
||||
return [float(sp[0]), float(sp[1])]
|
||||
|
||||
pattern = re.compile("([0-9]|10)/10")
|
||||
sp = re.findall(pattern, evaluation)
|
||||
if len(re.findall(pattern, evaluation)) == 2:
|
||||
return [float(sp[0]), float(sp[1])]
|
||||
|
||||
score_pair = evaluation.split("\n")[0]
|
||||
score_pair = score_pair.replace(",", " ")
|
||||
sp = score_pair.split(" ")
|
||||
if len(sp) == 2:
|
||||
return [float(sp[0]), float(sp[1])]
|
||||
else:
|
||||
raise Exception(f"Invalid score pair. Got {evaluation}.")
|
||||
except Exception:
|
||||
return [-1, -1]
|
||||
|
||||
|
||||
def battle(answer1: List[Dict], answer2: List[Dict], prompt_dict: Dict[str, Any]) -> List[Dict]:
|
||||
"""
|
||||
Use GPT-4 to compare answers of two different models.
|
||||
|
||||
Args:
|
||||
answer1: answers of model 1.
|
||||
answer2: answers of model 2.
|
||||
prompt_dict: prompt for battle.
|
||||
|
||||
Returns:
|
||||
Evaluations of all comparison pairs.
|
||||
"""
|
||||
|
||||
assert len(answer1) == len(answer2)
|
||||
|
||||
total_len = len(answer1)
|
||||
question_idx_list = list(range(total_len))
|
||||
|
||||
print(f" Total number of answers: {len(answer1)}.")
|
||||
|
||||
evaluations = []
|
||||
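# Send comparison requests concurrently; each worker submits one GPT-4 chat completion.
|
||||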
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
|
||||
futures = []
|
||||
for i in question_idx_list:
|
||||
assert answer1[i]["id"] == answer2[i]["id"]
|
||||
answer_id = answer1[i]["id"]
|
||||
|
||||
ques = (
|
||||
answer1[i]["instruction"]
|
||||
if answer1[i]["input"] == ""
|
||||
else answer1[i]["instruction"] + " " + answer1[i]["input"]
|
||||
)
|
||||
answer1[i]["category"]
|
||||
ans1 = answer1[i]["output"]
|
||||
ans2 = answer2[i]["output"]
|
||||
|
||||
sys_prompt = prompt_dict["system_prompt"]
|
||||
prompt_template = prompt_dict["prompt_template"]
|
||||
prompt = prompt_template.format(
|
||||
question=ques,
|
||||
answer_1=ans1,
|
||||
answer_2=ans2,
|
||||
prompt=prompt_dict["prompt"],
|
||||
)
|
||||
|
||||
future = executor.submit(get_battle_result, sys_prompt, prompt, answer_id, 2048)
|
||||
futures.append(future)
|
||||
|
||||
for future in tqdm.tqdm(concurrent.futures.as_completed(futures), total=len(futures)):
|
||||
evaluations.append(future.result())
|
||||
|
||||
evaluations.sort(key=lambda x: x["id"])
|
||||
|
||||
return evaluations
|
||||
|
||||
|
||||
def save_battle_results(evaluations: List[Dict], name1: str, name2: str, save_path: str) -> None:
|
||||
"""
|
||||
Save evaluation results (model 1 vs model 2) from GPT-4.
|
||||
|
||||
Args:
|
||||
evaluations: evaluation results from GPT-4.
|
||||
name1: model 1's name.
|
||||
name2: model 2's name.
|
||||
save_path: path to save battle results.
|
||||
"""
|
||||
|
||||
evaluation_file = deepcopy(evaluations)
|
||||
|
||||
ans1_score = 0
|
||||
ans2_score = 0
|
||||
better_count = 0
|
||||
worse_count = 0
|
||||
tie_count = 0
|
||||
invalid_count = 0
|
||||
|
||||
better_file = []
|
||||
worse_file = []
|
||||
tie_file = []
|
||||
invalid_file = []
|
||||
|
||||
for idx, evaluation in enumerate(evaluations):
|
||||
scores = parse_battle_score(evaluation["evaluation"])
|
||||
evaluation_file[idx]["score"] = scores
|
||||
|
||||
if scores[0] == -1 and scores[1] == -1:
|
||||
invalid_count += 1
|
||||
invalid_file.append(evaluation_file[idx])
|
||||
print(f'Invalid score pair: {evaluation_file[idx]["id"]}.')
|
||||
else:
|
||||
if scores[0] > scores[1]:
|
||||
worse_count += 1
|
||||
worse_file.append(evaluation_file[idx])
|
||||
elif scores[0] < scores[1]:
|
||||
better_count += 1
|
||||
better_file.append(evaluation_file[idx])
|
||||
else:
|
||||
tie_count += 1
|
||||
tie_file.append(evaluation_file[idx])
|
||||
ans1_score += scores[0]
|
||||
ans2_score += scores[1]
|
||||
|
||||
prefix = f"{name1}_vs_{name2}"
|
||||
|
||||
if not os.path.exists(save_path):
|
||||
os.makedirs(save_path)
|
||||
|
||||
jdump(better_file, os.path.join(save_path, prefix, f"{name2}_better.json"))
|
||||
jdump(worse_file, os.path.join(save_path, prefix, f"{name2}_worse.json"))
|
||||
jdump(tie_file, os.path.join(save_path, prefix, f"{prefix}_tie.json"))
|
||||
jdump(invalid_file, os.path.join(save_path, prefix, f"{prefix}_invalid.json"))
|
||||
jdump(evaluation_file, os.path.join(save_path, prefix, f"{prefix}_evaluations.json"))
|
||||
|
||||
if os.path.exists(os.path.join(save_path, "battle_results.json")):
|
||||
results = jload(os.path.join(save_path, "battle_results.json"))
|
||||
else:
|
||||
results = {}
|
||||
|
||||
results[prefix] = {
|
||||
"model": [name1, name2],
|
||||
"better": better_count,
|
||||
"worse": worse_count,
|
||||
"tie": tie_count,
|
||||
"win_rate": better_count / (len(evaluations) - invalid_count),
|
||||
"score": [
|
||||
ans1_score / (len(evaluations) - invalid_count),
|
||||
ans2_score / (len(evaluations) - invalid_count),
|
||||
],
|
||||
}
|
||||
jdump(results, os.path.join(save_path, "battle_results.json"))
|
||||
|
||||
print(f"Total {invalid_count} invalid score pair(s).")
|
||||
print(f"Model {name2} has {better_count} better answer(s).")
|
||||
print(f"Model {name2} has {worse_count} worse answer(s).")
|
||||
print(f"{tie_count} answer(s) play(s) to a tie.")
|
||||
print(f"Win rate of model {name2}: {better_count/(len(evaluations)-invalid_count):.2f}")
|
||||
print(f"Model {name1} average score: {ans1_score/(len(evaluations)-invalid_count):.2f}")
|
||||
print(f"Model {name2} average score: {ans2_score/(len(evaluations)-invalid_count):.2f}")
|
||||
|
||||
|
||||
def reference_template(metric: str, language: str, reference: Dict[str, Any]) -> str:
|
||||
"""
|
||||
Get prompt template for GPT evaluation with reference.
|
||||
|
||||
Different languages have different prompt templates.
|
||||
|
||||
Args:
|
||||
metric: metric used in GPT evaluation with reference.
|
||||
language: language for the template.
|
||||
reference: the instruction that contains target answer.
|
||||
|
||||
Returns:
|
||||
Prompt template for GPT evaluation with reference.
|
||||
"""
|
||||
|
||||
step_to_add = ref_step_template[language]
|
||||
|
||||
for_the_given_answer = (
|
||||
"{metric} (1-5) (directly give the score for the given answer):"
|
||||
if language == "en"
|
||||
else "{metric} (1-5) (直接对给定答案打分)"
|
||||
)
|
||||
|
||||
# adjective is used to describe the word "answer" in the prompt.
|
||||
adjective = "example" if language == "en" else "示例"
|
||||
answer_to_add = ref_answer_template_general[language]
|
||||
|
||||
# Only for correctness, we will provide a correct answer and so the adjective for "answer" will be "correct". The prompt words will be "a correct answer".
|
||||
# In other cases, the prompt words will be "an example answer with good quality" by default.
|
||||
if metric.lower() == "correctness":
|
||||
adjective = "correct" if language == "en" else "标准"
|
||||
answer_to_add = ref_answer_template_correctness[language]
|
||||
|
||||
answer_to_add = answer_to_add.format(answer=reference["target"] if reference["target"] else reference["output"])
|
||||
step_to_add = step_to_add.format(metric=metric.lower(), adjective=adjective) + for_the_given_answer.format(
|
||||
metric=metric
|
||||
)
|
||||
|
||||
return answer_to_add + step_to_add
|
||||
|
||||
|
||||
def fill_in_message(role: str, content: str) -> Dict[str, str]:
|
||||
"""
|
||||
Generate one formatted message to send through chat completion.
|
||||
|
||||
Args:
|
||||
role: the role of the author of this message.
|
||||
content: the contents of the message.
|
||||
|
||||
Returns:
|
||||
One message to send through chat completion.
|
||||
"""
|
||||
|
||||
return {"role": role, "content": content}
|
||||
|
||||
|
||||
def multiturn_chat_completion(user_messages: List[str], model: str, max_tokens: int = 1, turns=2) -> Dict[str, Any]:
|
||||
"""
|
||||
Do multi-turn chat completion.
|
||||
|
||||
When turns == 1, it is a one-turn conversation for normal GPT evaluation.
|
||||
When turns == 2, it is a two-turn conversation which is used for GPT evaluation with reference answers.
|
||||
|
||||
Args:
|
||||
user_messages: messages user wants to send.
|
||||
model: the model used to evaluate answers.
|
||||
max_tokens: the maximum number of tokens to generate in the chat completion.
|
||||
turns: the number of turns for conversation.
|
||||
|
||||
Returns:
|
||||
Last turn's response.
|
||||
"""
|
||||
|
||||
if len(user_messages) != turns:
|
||||
raise Exception("The length of user messages should be equal to the turn number!")
|
||||
|
||||
assistant_responses = []
|
||||
|
||||
for i in range(turns):
|
||||
messages_to_send = []
|
||||
|
||||
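# Replay all previous turns (user message followed by assistant reply) to preserve the conversation context.
|
||||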
for j in range(i):
|
||||
messages_to_send.append(fill_in_message("user", user_messages[j]))
|
||||
messages_to_send.append(
|
||||
fill_in_message("assistant", assistant_responses[j]["choices"][0]["message"]["content"])
|
||||
)
|
||||
|
||||
# Length of user messages == Length of assistant messages + 1
|
||||
# Because we always expect the API to respond
|
||||
messages_to_send.append(fill_in_message("user", user_messages[i]))
|
||||
|
||||
response = openai.ChatCompletion.create(
|
||||
model=model,
|
||||
messages=messages_to_send,
|
||||
temperature=0,
|
||||
max_tokens=max_tokens,
|
||||
)
|
||||
|
||||
# Avoid exceeding rate limits.
|
||||
# You can comment out this line if your request doesn't contain many tokens.
|
||||
time.sleep(1)
|
||||
|
||||
assistant_responses.append(response)
|
||||
|
||||
return assistant_responses[-1]
|
||||
|
||||
|
||||
def get_gpt_evaluation_without_logprobs(
|
||||
prompt: Dict[str, Any],
|
||||
inst: Dict[str, Any],
|
||||
metrics: List[str],
|
||||
language: str,
|
||||
reference: Dict[str, Any] = None,
|
||||
model: str = "gpt-3.5-turbo",
|
||||
max_tokens: int = 2048,
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Use chat models (gpt-3.5-turbo or gpt-4) to evaluate one model answer.
|
||||
|
||||
Temperature is set to 0 to make the model more deterministic.
|
||||
|
||||
Args:
|
||||
prompt: a dictionary including prompt template, CoT and metrics.
|
||||
inst: the instruction that is needed to be evaluated.
|
||||
metrics: the metrics for evaluation.
|
||||
language: language used to change the CoT (adds one more step comparing the given answer with the reference) if reference is not None.
|
||||
reference: the reference answer.
|
||||
model: the model used to evaluate answers.
|
||||
max_tokens: the maximum number of tokens to generate in the chat completion.
|
||||
|
||||
Returns:
|
||||
An evaluation of one answer.
|
||||
"""
|
||||
|
||||
MAX_API_RETRY = 3
|
||||
|
||||
question = inst["instruction"] if inst["input"] == "" else inst["instruction"] + "\n" + inst["input"]
|
||||
answer = inst["output"]
|
||||
inst["evaluation"] = {}
|
||||
|
||||
for metric in metrics:
|
||||
if prompt["metrics"].get(metric, None) is None:
|
||||
raise Exception(
|
||||
f"Unsupported metric {metric} for category {inst['category']}! You should add this metric in the prompt file!"
|
||||
)
|
||||
for i in range(MAX_API_RETRY):
|
||||
try:
|
||||
prompt_reference = "" if reference is None else reference_template(metric, language, reference)
|
||||
|
||||
prompt_1st_round = prompt["prompt"].format(
|
||||
question=question,
|
||||
answer=answer,
|
||||
metric=prompt["metrics"][metric],
|
||||
steps=prompt["CoT"][metric],
|
||||
)
|
||||
|
||||
if prompt_reference:
|
||||
# Do a 2-round conversation
|
||||
response = multiturn_chat_completion(
|
||||
[prompt_1st_round, prompt_reference], model, max_tokens=max_tokens, turns=2
|
||||
)
|
||||
else:
|
||||
response = multiturn_chat_completion([prompt_1st_round], model, max_tokens=max_tokens, turns=1)
|
||||
|
||||
inst["evaluation"][metric] = {
|
||||
"response": response["choices"][0]["message"]["content"],
|
||||
"logprobs": None,
|
||||
}
|
||||
|
||||
# Prevent exceeding rate limits because we have multiple workers.
|
||||
# But this will slow down the evaluation process.
|
||||
# You can comment out this line if your request doesn't contain many tokens.
|
||||
time.sleep(len(metrics) * 0.5)
|
||||
|
||||
break
|
||||
except Exception as e:
|
||||
print(e)
|
||||
time.sleep(1)
|
||||
if metric not in inst["evaluation"]:
|
||||
print(f"Evaluation {inst['id']} for metric {metric} failed after {MAX_API_RETRY} retries.")
|
||||
inst["evaluation"][metric] = {}
|
||||
return inst
|
||||
|
||||
|
||||
def get_gpt_evaluation_with_logprobs(
|
||||
prompt: Dict[str, Any], inst: Dict[str, Any], metrics: List[str], max_tokens: int = 2048
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Use completion model(text-davinci-003) to evaluate one model answer.
|
||||
Only completion models can return log probabilities.
|
||||
|
||||
Temperature is set to 0 to make the model more deterministic.
|
||||
|
||||
Args:
|
||||
prompt: a dictionary including prompt template, CoT and metrics.
|
||||
inst: the instruction that is needed to be evaluated.
|
||||
metrics: the metrics for evaluation.
|
||||
max_tokens: the maximum number of tokens to generate in the completion.
|
||||
|
||||
Returns:
|
||||
An evaluation of one answer.
|
||||
"""
|
||||
|
||||
MAX_API_RETRY = 3
|
||||
|
||||
question = inst["instruction"] if inst["input"] == "" else inst["instruction"] + "\n" + inst["input"]
|
||||
answer = inst["output"]
|
||||
inst["evaluation"] = {}
|
||||
|
||||
for metric in metrics:
|
||||
if prompt["metrics"].get(metric, None) is None:
|
||||
raise Exception(
|
||||
f"Unsupported metric {metric} for category {inst['category']}! You should add this metric in the prompt file!"
|
||||
)
|
||||
for i in range(MAX_API_RETRY):
|
||||
try:
|
||||
response = openai.Completion.create(
|
||||
model="text-davinci-003",
|
||||
prompt=prompt["prompt"].format(
|
||||
question=question,
|
||||
answer=answer,
|
||||
metric=prompt["metrics"][metric],
|
||||
steps=prompt["CoT"][metric],
|
||||
),
|
||||
logprobs=5,
|
||||
temperature=0,
|
||||
max_tokens=max_tokens,
|
||||
)
|
||||
inst["evaluation"][metric] = {
|
||||
"response": response["choices"][0]["text"],
|
||||
"logprobs": response["choices"][0]["logprobs"]["top_logprobs"],
|
||||
}
|
||||
|
||||
# Prevent exceeding rate limits because we have multiple workers.
|
||||
# But this will slow down the evaluation process.
|
||||
# You can comment out this line if your request doesn't contain many tokens.
|
||||
time.sleep(len(metrics) * 0.5)
|
||||
|
||||
break
|
||||
except Exception as e:
|
||||
print(e)
|
||||
time.sleep(1)
|
||||
if metric not in inst["evaluation"]:
|
||||
print(f"Evaluation {inst['id']} for metric {metric} failed after {MAX_API_RETRY} retries.")
|
||||
inst["evaluation"][metric] = {}
|
||||
return inst
|
||||
|
||||
|
||||
def evaluate(
|
||||
answers: List[Dict],
|
||||
prompt: Dict[str, Any],
|
||||
metrics: List[str],
|
||||
category: str,
|
||||
model: str,
|
||||
language: str,
|
||||
references: List[Dict] = None,
|
||||
) -> List[Dict]:
|
||||
"""
|
||||
Use GPT models to evaluate model answers and save evaluation results.
|
||||
|
||||
Args:
|
||||
answers: model answers.
|
||||
prompt: prompt for GPT evaluation.
|
||||
metrics: metrics for GPT evaluation.
|
||||
category: the category of the model answers for evaluation.
|
||||
model: the specific GPT model used to evaluate answers.
|
||||
language: language used in GPT evaluation
|
||||
references: references for GPT evaluation
|
||||
|
||||
Returns:
|
||||
Evaluations of the given answers.
|
||||
"""
|
||||
|
||||
print(f"The number of instances of category {category}'s is {len(answers)}.")
|
||||
|
||||
evaluations = []
|
||||
|
||||
metrics_str = ", ".join(x for x in metrics)
|
||||
print(f"Category {category}'s metrics are {metrics_str}.")
|
||||
|
||||
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
|
||||
futures = []
|
||||
for idx, inst in enumerate(answers):
|
||||
# Completion models can return log probabilities.
|
||||
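# max_tokens is set to 1 in both branches so the model returns only the single score token.
|
||||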
if model == "text-davinci-003":
|
||||
future = executor.submit(get_gpt_evaluation_with_logprobs, prompt, inst, metrics, 1)
|
||||
else:
|
||||
future = executor.submit(
|
||||
get_gpt_evaluation_without_logprobs,
|
||||
prompt,
|
||||
inst,
|
||||
metrics,
|
||||
language,
|
||||
reference=None if references is None else references[idx],
|
||||
model=model,
|
||||
max_tokens=1,
|
||||
)
|
||||
|
||||
futures.append(future)
|
||||
|
||||
for future in tqdm.tqdm(
|
||||
concurrent.futures.as_completed(futures),
|
||||
desc=f"{category}: ",
|
||||
total=len(futures),
|
||||
):
|
||||
evaluations.append(future.result())
|
||||
|
||||
evaluations.sort(key=lambda x: x["id"])
|
||||
|
||||
print(f"{category} done.")
|
||||
|
||||
return evaluations
|
||||
|
||||
|
||||
def calculate_scores_form_logprobs(logprobs: Dict[str, Any]) -> float:
|
||||
"""
|
||||
Calculate the score according to log probabilities returned by text-davinci-003.
|
||||
|
||||
Calculation formula:
|
||||
score = sum(score_i * exp(value)), where score_i is the score corresponding to the key (predicted token) and value is its log probability.
|
||||
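For example (hypothetical numbers), if the top tokens and their log probabilities are {"5": ln 0.7, "4": ln 0.3}, the score is 5 * 0.7 + 4 * 0.3 = 4.7.
|
||||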
|
||||
Ref: https://arxiv.org/abs/2303.16634
|
||||
This paper proposes NLG evaluation methods using text-davinci-003 (log probabilities returned by completion models) and GPT-4 (probabilities obtained by sampling).
|
||||
|
||||
Args:
|
||||
logprobs: logprobs returned by openai.Completion.
|
||||
|
||||
Returns:
|
||||
The score of one answer.
|
||||
"""
|
||||
|
||||
# GPT-3.5 only returns score of 1 to 5.
|
||||
prob = np.zeros(5)
|
||||
|
||||
for key, value in logprobs.items():
|
||||
# Sometimes the key will be one byte of a unicode character which takes the form of "bytes:\\xe7".
|
||||
# It is meaningless, and thus we don't calculate probability.
|
||||
if "bytes" in key:
|
||||
continue
|
||||
# results[0] is the score corresponding to the key (predicted token).
|
||||
# For example, key "5" corresponds to score 5.
|
||||
results = re.findall(r"\d", key)
|
||||
if len(results) == 1:
|
||||
prob[int(results[0]) - 1] = prob[int(results[0]) - 1] + np.exp(value)
|
||||
|
||||
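# Expected score: dot the candidate scores 1-5 with their accumulated (unnormalized) probabilities.
|
||||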
score = np.dot(np.arange(1, 6), prob)
|
||||
|
||||
return score
|
||||
|
||||
|
||||
def calculate_scores_form_response(response: str, evaluation: Dict[str, Any]) -> int:
|
||||
"""
|
||||
Calculate the score from the response returned by gpt-3.5-turbo or gpt-4.
|
||||
Different from text-davinci-003, this function directly calculates the score according to the plain response returned by gpt-3.5-turbo or gpt-4.
|
||||
Although text-davinci-003 can return log probabilities, it costs ten times as much as gpt-3.5-turbo.
|
||||
|
||||
Args:
|
||||
response: the plain text response returned by gpt-3.5-turbo or gpt-4.
|
||||
evaluation: the evaluation corresponding to the question.
|
||||
|
||||
Returns:
|
||||
The score of one answer.
|
||||
"""
|
||||
|
||||
try:
|
||||
results = re.findall(r"\d", response)
|
||||
if len(results) == 1:
|
||||
return int(results[0])
|
||||
else:
|
||||
raise Exception(f"Invalid score pair. Got {evaluation}.")
|
||||
except Exception:
|
||||
return 0
|
||||
|
||||
|
||||
def save_gpt_evaluation_results(
|
||||
model_name: str, gpt_evaluation_results: Dict[str, Any], save_path: str
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Save evaluation results for different categories for one model.
|
||||
|
||||
Args:
|
||||
model_name: name of the model for saving evaluation results.
|
||||
gpt_evaluation_results: evaluations results for all the model answers.
|
||||
save_path: path to save GPT evaluation results.
|
||||
"""
|
||||
|
||||
all_evaluations = []
|
||||
for category, evaluations in gpt_evaluation_results.items():
|
||||
jdump(evaluations, os.path.join(save_path, model_name, f"{category}_evaluation_results.json"))
|
||||
all_evaluations.extend(evaluations)
|
||||
|
||||
jdump(all_evaluations, os.path.join(save_path, f"{model_name}_evaluation_results.json"))
|
||||
|
||||
return all_evaluations
|
||||
|
||||
|
||||
def save_gpt_evaluation_statistics(model_name: str, evaluations: List[Dict], save_path: str) -> None:
|
||||
"""
|
||||
Generate statistics for one model.
|
||||
|
||||
Args:
|
||||
model_name: name of the model for saving statistics.
|
||||
evaluations: evaluations for all the model answers.
|
||||
save_path: path to save GPT evaluation statistics.
|
||||
"""
|
||||
|
||||
if not os.path.exists(save_path):
|
||||
os.makedirs(save_path)
|
||||
|
||||
data_per_category = {}
|
||||
for evaluation in evaluations:
|
||||
category = evaluation["category"]
|
||||
if evaluation["category"] in data_per_category.keys():
|
||||
data_per_category[category].append(evaluation)
|
||||
else:
|
||||
data_per_category[category] = [evaluation]
|
||||
|
||||
all_statistics = {}
|
||||
for category, data in data_per_category.items():
|
||||
metrics = data[0]["evaluation"].keys()
|
||||
scores = {metric: [] for metric in metrics}
|
||||
for evaluation in data:
|
||||
for metric in metrics:
|
||||
if evaluation["evaluation"][metric] == {}:
|
||||
# This means after 3 retries, the server still returns an error, and we set the score to 0.
|
||||
scores[metric].append(0)
|
||||
elif evaluation["evaluation"][metric]["logprobs"] is not None:
|
||||
scores[metric].append(
|
||||
calculate_scores_form_logprobs(evaluation["evaluation"][metric]["logprobs"][0])
|
||||
)
|
||||
else:
|
||||
scores[metric].append(
|
||||
calculate_scores_form_response(evaluation["evaluation"][metric]["response"], evaluation)
|
||||
)
|
||||
|
||||
statistics = {}
|
||||
for metric in metrics:
|
||||
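# best_3 / worst_3 record the ids of the three highest- and lowest-scoring answers for this metric.
|
||||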
arg_sort = np.argsort(scores[metric])
|
||||
statistics[metric] = {}
|
||||
statistics[metric]["avg_score"] = sum(scores[metric]) / len(data)
|
||||
statistics[metric]["best_3"] = {data[i]["id"]: scores[metric][i] for i in arg_sort[-3:][::-1]}
|
||||
statistics[metric]["worst_3"] = {data[i]["id"]: scores[metric][i] for i in arg_sort[:3]}
|
||||
|
||||
all_statistics[category] = statistics
|
||||
|
||||
jdump(
|
||||
all_statistics,
|
||||
os.path.join(save_path, f"{model_name}_evaluation_statistics.json"),
|
||||
)
|
||||
|
||||
|
||||
def analyze_gpt_evaluation_statistics(statistics_path: str, save_path: str) -> None:
|
||||
"""
|
||||
Analyze and visualize all GPT evaluation statistics in the given directory.
|
||||
|
||||
Args:
|
||||
statistics_path: path to all the models' statistics.
|
||||
save_path: path to save table and visualization results.
|
||||
"""
|
||||
|
||||
if not os.path.exists(statistics_path):
|
||||
raise Exception(f'The given directory "{statistics_path}" doesn\'t exist! No statistics found!')
|
||||
|
||||
all_statistics = {}
|
||||
|
||||
for file_name in os.listdir(statistics_path):
|
||||
if file_name.endswith("_evaluation_statistics.json"):
|
||||
model_name = file_name.split("_evaluation_statistics.json")[0]
|
||||
all_statistics[model_name] = jload(os.path.join(statistics_path, file_name))
|
||||
|
||||
if len(list(all_statistics.keys())) == 0:
|
||||
raise Exception(f'There are no statistics in the given directory "{statistics_path}"!')
|
||||
|
||||
frame_all = {
|
||||
"model": [],
|
||||
"category": [],
|
||||
"metric": [],
|
||||
"avg_score": [],
|
||||
"best_3": [],
|
||||
"worst_3": [],
|
||||
}
|
||||
frame_per_category = {}
|
||||
for model_name, model_statistics in all_statistics.items():
|
||||
for category, category_statistics in model_statistics.items():
|
||||
if frame_per_category.get(category) is None:
|
||||
frame_per_category[category] = {
|
||||
"model": [],
|
||||
"metric": [],
|
||||
"avg_score": [],
|
||||
"best_3": [],
|
||||
"worst_3": [],
|
||||
}
|
||||
|
||||
for metric, metric_statistics in category_statistics.items():
|
||||
frame_all["model"].append(model_name)
|
||||
frame_all["category"].append(category)
|
||||
frame_all["metric"].append(metric)
|
||||
frame_all["avg_score"].append(metric_statistics["avg_score"])
|
||||
frame_all["best_3"].append(metric_statistics["best_3"])
|
||||
frame_all["worst_3"].append(metric_statistics["worst_3"])
|
||||
|
||||
frame_per_category[category]["model"].append(model_name)
|
||||
frame_per_category[category]["metric"].append(metric)
|
||||
frame_per_category[category]["avg_score"].append(metric_statistics["avg_score"])
|
||||
frame_per_category[category]["best_3"].append(metric_statistics["best_3"])
|
||||
frame_per_category[category]["worst_3"].append(metric_statistics["worst_3"])
|
||||
|
||||
if not os.path.exists(save_path):
|
||||
os.makedirs(save_path)
|
||||
|
||||
frame_all = pd.DataFrame(frame_all)
|
||||
frame_all.to_csv(os.path.join(save_path, "gpt_evaluation_statistics.csv"))
|
||||
|
||||
for category in tqdm.tqdm(
|
||||
frame_per_category.keys(),
|
||||
desc=f"GPT evaluation: ",
|
||||
total=len(frame_per_category.keys()),
|
||||
):
|
||||
data = pd.DataFrame(frame_per_category[category])
|
||||
|
||||
sns.set()
|
||||
fig = plt.figure(figsize=(16, 10))
|
||||
plt.ylim((0, 5))
|
||||
|
||||
fig = sns.barplot(x="metric", y="avg_score", hue="model", data=data, dodge=True)
|
||||
fig.set_title(f"Comparison between Different Models for Category {category.title()}")
|
||||
plt.xlabel("Evaluation Metric")
|
||||
plt.ylabel("Average Score")
|
||||
|
||||
figure = fig.get_figure()
|
||||
figure.savefig(os.path.join(save_path, f"{category}.png"), dpi=400)
|
||||
|
||||
plt.close()
|
@@ -1,254 +0,0 @@
|
||||
import statistics
|
||||
from typing import Dict, List
|
||||
|
||||
import jieba
|
||||
from bert_score import score
|
||||
from nltk.translate.bleu_score import sentence_bleu
|
||||
from nltk.translate.chrf_score import sentence_chrf
|
||||
from rouge_chinese import Rouge as Rouge_cn
|
||||
from rouge_score import rouge_scorer as Rouge_en
|
||||
from sklearn.metrics import f1_score, precision_score, recall_score
|
||||
from utils import preprocessing_text, remove_redundant_space
|
||||
|
||||
|
||||
def bleu_score(preds: List[str], targets: List[str], language: str) -> Dict[str, float]:
|
||||
"""Calculate BLEU Score Metric
|
||||
|
||||
The calculation includes BLEU-1 for unigram, BLEU-2 for bigram,
|
||||
BLEU-3 for trigram and BLEU-4 for 4-gram. Unigram evaluates
|
||||
accuracy at the word level, while the other n-grams evaluate
|
||||
fluency at the sentence level.
|
||||
"""
|
||||
bleu_scores = {"bleu1": 0, "bleu2": 0, "bleu3": 0, "bleu4": 0}
|
||||
cumulative_bleu = [0] * 4
|
||||
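# Cumulative n-gram weights for BLEU-1 through BLEU-4.
|
||||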
weights = [
|
||||
(1.0 / 1.0, 0.0, 0.0, 0.0),
|
||||
(1.0 / 2.0, 1.0 / 2.0, 0.0, 0.0),
|
||||
(1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0, 0.0),
|
||||
(1.0 / 4.0, 1.0 / 4.0, 1.0 / 4.0, 1.0 / 4.0),
|
||||
]
|
||||
|
||||
for pred, target in zip(preds, targets):
|
||||
if language == "cn":
|
||||
pred_list = " ".join(jieba.cut(preprocessing_text(pred))).split()
|
||||
target_list = [(" ".join(jieba.cut(preprocessing_text(target)))).split()]
|
||||
elif language == "en":
|
||||
pred_list = preprocessing_text(pred).split()
|
||||
target_list = [preprocessing_text(target).split()]
|
||||
|
||||
bleu = sentence_bleu(target_list, pred_list, weights=weights)
|
||||
cumulative_bleu = [a + b for a, b in zip(cumulative_bleu, bleu)]
|
||||
|
||||
for i in range(len(cumulative_bleu)):
|
||||
bleu_scores[f"bleu{i+1}"] = cumulative_bleu[i] / len(preds)
|
||||
|
||||
return bleu_scores
|
||||
|
||||
|
||||
def chrf_score(preds: List[str], targets: List[str], language: str) -> Dict[str, float]:
|
||||
"""Calculate CHRF Score Metric in sentence level."""
|
||||
chrf_score = {"chrf": 0}
|
||||
cumulative_chrf = []
|
||||
|
||||
for pred, target in zip(preds, targets):
|
||||
if language == "cn":
|
||||
pred_list = " ".join(jieba.cut(preprocessing_text(pred))).split()
|
||||
target_list = " ".join(jieba.cut(preprocessing_text(target))).split()
|
||||
elif language == "en":
|
||||
pred_list = preprocessing_text(pred).split()
|
||||
target_list = preprocessing_text(target).split()
|
||||
|
||||
cumulative_chrf.append(sentence_chrf(target_list, pred_list))
|
||||
|
||||
chrf_score["chrf"] = statistics.mean(cumulative_chrf)
|
||||
|
||||
return chrf_score
|
||||
|
||||
|
||||
def rouge_cn_score(preds: List[str], targets: List[str]) -> Dict[str, float]:
|
||||
"""Calculate Chinese ROUGE Score Metric
|
||||
|
||||
The calculation includes ROUGE-1 for unigram, ROUGE-2 for bigram
|
||||
and ROUGE-L. ROUGE-N evaluates the number of matching n-grams between
|
||||
the preds and targets. ROUGE-L measures the
|
||||
longest common subsequence (LCS) between preds and targets.
|
||||
"""
|
||||
rouge_scores = {"rouge1": 0, "rouge2": 0, "rougeL": 0}
|
||||
all_preds = []
|
||||
all_targets = []
|
||||
|
||||
for pred, target in zip(preds, targets):
|
||||
pred_list = remove_redundant_space(" ".join(jieba.cut(preprocessing_text(pred))))
|
||||
target_list = remove_redundant_space(" ".join(jieba.cut(preprocessing_text(target))))
|
||||
all_preds.append(pred_list)
|
||||
all_targets.append(target_list)
|
||||
|
||||
rouge_cn = Rouge_cn()
|
||||
rouge_avg = rouge_cn.get_scores(all_preds, all_targets, avg=True)
|
||||
|
||||
rouge_scores["rouge1"] = rouge_avg["rouge-1"]["f"]
|
||||
rouge_scores["rouge2"] = rouge_avg["rouge-2"]["f"]
|
||||
rouge_scores["rougeL"] = rouge_avg["rouge-l"]["f"]
|
||||
|
||||
return rouge_scores
|
||||
|
||||
|
||||
def rouge_en_score(preds: List[str], targets: List[str]) -> Dict[str, float]:
|
||||
"""Calculate English ROUGE Score Metric
|
||||
|
||||
The calculation includes ROUGE-1 for unigram, ROUGE-2 for bigram
|
||||
and ROUGE-L. ROUGE-N evaluates the number of matching n-grams between
|
||||
the preds and targets. ROUGE-L measures the
|
||||
longest common subsequence (LCS) between preds and targets.
|
||||
"""
|
||||
rouge_scores = {"rouge1": 0, "rouge2": 0, "rougeL": 0}
|
||||
|
||||
rouge_en = Rouge_en.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
|
||||
|
||||
for pred, target in zip(preds, targets):
|
||||
score = rouge_en.score(preprocessing_text(pred), preprocessing_text(target))
|
||||
rouge_scores["rouge1"] += score["rouge1"].fmeasure
|
||||
rouge_scores["rouge2"] += score["rouge2"].fmeasure
|
||||
rouge_scores["rougeL"] += score["rougeL"].fmeasure
|
||||
|
||||
rouge_scores["rouge1"] = rouge_scores["rouge1"] / len(preds)
|
||||
rouge_scores["rouge2"] = rouge_scores["rouge2"] / len(preds)
|
||||
rouge_scores["rougeL"] = rouge_scores["rougeL"] / len(preds)
|
||||
|
||||
return rouge_scores
|
||||
|
||||
|
||||
def rouge_score(preds: List[str], targets: List[str], language: str) -> Dict[str, float]:
|
||||
"""Calculate ROUGE Score Metric"""
|
||||
if language == "cn":
|
||||
return rouge_cn_score(preds, targets)
|
||||
elif language == "en":
|
||||
return rouge_en_score(preds, targets)
|
||||
|
||||
|
||||
def distinct_score(preds: List[str], language: str) -> Dict[str, float]:
|
||||
"""Calculate Distinct Score Metric
|
||||
|
||||
This metric refers to https://arxiv.org/abs/1510.03055.
|
||||
It evaluates the diversity of generated text by counting
|
||||
the unique n-grams.
|
||||
"""
|
||||
distinct_score = {"distinct": 0}
|
||||
cumulative_distinct = []
|
||||
|
||||
for pred in preds:
|
||||
if language == "cn":
|
||||
pred_seg_list = " ".join(jieba.cut(pred)).split()
|
||||
count_segs = len(pred_seg_list)
|
||||
unique_segs = set(pred_seg_list)
|
||||
count_unique_chars = len(unique_segs)
|
||||
# prevent denominator from being 0
|
||||
cumulative_distinct.append(count_unique_chars / (count_segs + 1e-6))
|
||||
elif language == "en":
|
||||
# calculate distinct 1-gram, 2-gram, 3-gram
|
||||
unique_ngram = [set() for _ in range(0, 3)]
|
||||
all_ngram_count = [0 for _ in range(0, 3)]
|
||||
|
||||
split_pred = preprocessing_text(pred).split()
|
||||
for n in range(0, 3):
|
||||
for i in range(0, len(split_pred) - n):
|
||||
ngram = " ".join(split_pred[i : i + n + 1])
|
||||
unique_ngram[n].add(ngram)
|
||||
all_ngram_count[n] += 1
|
||||
|
||||
# Sometimes the answer may contain only one word. For 2-gram and 3-gram, the gram count(denominator) may be zero.
|
||||
avg_distinct = [len(a) / (b + 1e-6) for a, b in zip(unique_ngram, all_ngram_count)]
|
||||
|
||||
cumulative_distinct.append(statistics.mean(avg_distinct))
|
||||
|
||||
distinct_score["distinct"] = statistics.mean(cumulative_distinct)
|
||||
|
||||
return distinct_score
|
||||
|
||||
|
||||
def bert_score(preds: List[str], targets: List[str], language: str) -> Dict[str, float]:
|
||||
"""Calculate BERTScore Metric
|
||||
|
||||
The BERTScore evaluates the semantic similarity between
|
||||
tokens of preds and targets with BERT.
|
||||
"""
|
||||
bert_score = {"bert_score": 0}
|
||||
pred_list = []
|
||||
target_list = []
|
||||
|
||||
for pred, target in zip(preds, targets):
|
||||
pred_list.append(pred)
|
||||
target_list.append(target)
|
||||
|
||||
if language == "cn":
|
||||
_, _, F = score(pred_list, target_list, lang="zh", verbose=True)
|
||||
elif language == "en":
|
||||
_, _, F = score(pred_list, target_list, lang="en", verbose=True)
|
||||
|
||||
bert_score["bert_score"] = F.mean().item()
|
||||
|
||||
return bert_score
|
||||
|
||||
|
||||
def calculate_precision_recall_f1(preds: List[str], targets: List[str], language: str) -> Dict[str, float]:
|
||||
"""Precision, Recall and F1-Score Calculation
|
||||
|
||||
The calculation of precision, recall and f1-score is realized by counting
|
||||
the number of overlaps between the preds and targets. The comparison length
|
||||
is limited by the shorter of the preds and targets.
|
||||
"""
|
||||
precision_recall_f1 = {"precision": 0, "recall": 0, "f1_score": 0}
|
||||
precision_scores = []
|
||||
recall_scores = []
|
||||
f1_scores = []
|
||||
|
||||
for pred, target in zip(preds, targets):
|
||||
if language == "cn":
|
||||
pred_list = [char for char in " ".join(jieba.cut(preprocessing_text(pred))).split()]
|
||||
target_list = [char for char in " ".join(jieba.cut(preprocessing_text(target))).split()]
|
||||
elif language == "en":
|
||||
pred_list = [char for char in preprocessing_text(pred).split()]
|
||||
target_list = [char for char in preprocessing_text(target).split()]
|
||||
|
||||
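# Compare tokens position by position over the overlapping length; a position counts as a hit when the tokens match.
|
||||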
target_labels = [1] * min(len(target_list), len(pred_list))
|
||||
pred_labels = [int(pred_list[i] == target_list[i]) for i in range(0, min(len(target_list), len(pred_list)))]
|
||||
|
||||
precision_scores.append(precision_score(target_labels, pred_labels, zero_division=0))
|
||||
recall_scores.append(recall_score(target_labels, pred_labels, zero_division=0))
|
||||
f1_scores.append(f1_score(target_labels, pred_labels, zero_division=0))
|
||||
|
||||
precision_recall_f1["precision"] = statistics.mean(precision_scores)
|
||||
precision_recall_f1["recall"] = statistics.mean(recall_scores)
|
||||
precision_recall_f1["f1_score"] = statistics.mean(f1_scores)
|
||||
|
||||
return precision_recall_f1
|
||||
|
||||
|
||||
def precision(preds: List[str], targets: List[str], language: str) -> Dict[str, float]:
|
||||
"""Calculate Precision Metric
|
||||
|
||||
Calculating precision by counting the number of overlaps between the preds and targets.
|
||||
"""
|
||||
precision = {"precision": 0}
|
||||
precision["precision"] = calculate_precision_recall_f1(preds, targets, language)["precision"]
|
||||
return precision
|
||||
|
||||
|
||||
def recall(preds: List[str], targets: List[str], language: str) -> Dict[str, float]:
|
||||
"""Calculate Recall Metric
|
||||
|
||||
Calculating recall by counting the number of overlaps between the preds and targets.
|
||||
"""
|
||||
recall = {"recall": 0}
|
||||
recall["recall"] = calculate_precision_recall_f1(preds, targets, language)["recall"]
|
||||
return recall
|
||||
|
||||
|
||||
def F1_score(preds: List[str], targets: List[str], language: str) -> Dict[str, float]:
|
||||
"""Calculate F1-score Metric
|
||||
|
||||
Calculating f1-score by counting the number of overlaps between the preds and targets.
|
||||
"""
|
||||
f1 = {"f1_score": 0}
|
||||
f1["f1_score"] = calculate_precision_recall_f1(preds, targets, language)["f1_score"]
|
||||
return f1
|
@@ -1,6 +0,0 @@
|
||||
{
|
||||
"id": 1,
|
||||
"system_prompt": "你是一个检查回答质量的好助手。",
|
||||
"prompt_template": "[问题]\n{question}\n\n[1号AI助手的答案]\n{answer_1}\n\n[1号AI助手答案终止]\n\n[2号AI助手的答案]\n{answer_2}\n\n[2号AI助手答案终止]\n\n[要求]\n{prompt}\n\n",
|
||||
"prompt": "我们需要你评价这两个AI助手回答的性能。\n请对他们的回答的有用性、相关性、准确性、详细程度进行评分。每个AI助手都会得到一个1到10分的总分,分数越高表示整体表现越好。\n请首先输出一行,该行只包含两个数值,分别表示1号和2号AI助手的分数。这两个分数之间要有一个空格。在随后的一行中,请对你的评价作出全面的解释,避免任何潜在的偏见,并确保AI助手回答的顺序不会影响您的判断。"
|
||||
}
|
@@ -1,6 +0,0 @@
|
||||
{
|
||||
"id": 1,
|
||||
"system_prompt": "You are a helpful and precise assistant for checking the quality of the answer. You will be given two different answers to the same question",
|
||||
"prompt_template": "[Question]\n{question}\n\n[The Start of AI Assistant 1's Answer]\n{answer_1}\n\n[The End of AI Assistant 1's Answer]\n\n[The Start of AI Assistant 2's Answer]\n{answer_2}\n\n[The End of AI Assistant 2's Answer]\n\n[Requirements]\n{prompt}\n\n",
|
||||
"prompt": "We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above.\nPlease rate the helpfulness, relevance, accuracy, level of details of their responses. Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance.\nPlease first output a single line containing only two values indicating the scores for Assistant 1 and 2, respectively. The two scores are separated by a space. In the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment."
|
||||
}
|
@@ -1,181 +0,0 @@
|
||||
{
|
||||
"brainstorming": {
|
||||
"id": 1,
|
||||
"category": "brainstorming",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
|
||||
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
|
||||
"creativity": "创意性(1-5):某些头脑风暴问题可能需要答案具有创意,提出新的思路。",
|
||||
"practicality": "实用性(1-5):某些头脑风暴问题可能需要答案提出实用的建议或解决方法。",
|
||||
"reasonableness": "合理性(1-5):答案应该符合常识、生活实际等等。"
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织,并给出一个1到5的分数,其中5表示语言组织非常好,而1表示语言组织非常差。\n\n语言组织:",
|
||||
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度,并给出一个1到5的分数,其中5表示答案非常切题,而1表示答案完全没有切题。\n\n切题:",
|
||||
"creativity": "1. 仔细阅读所提供的头脑风暴问题,确保你理解问题的要点和背景。\n2. 根据你的知识和经验,判断所提供的答案是否可行。如果答案不可行,则创意性评分可能会受到影响。\n3. 考虑答案中是否包含新颖的想法或独特的思路。答案可能与已知的解决方案有所重叠,但仍然可以被认为是有创意的,只要它提供了新的角度或方法来解决问题。\n4. 根据答案的创意性,给出一个1到5的评分。如果答案缺乏创意,则应给出一个较低的评分。如果答案具有创意并提供了新的思路,应给出一个较高的评分。\n\n创意性:",
|
||||
"practicality": "1. 仔细阅读所提供的头脑风暴问题,确保你理解问题的要点和背景。\n2. 根据你的知识和经验,判断所提供的答案是否可行。如果答案不可行,则实用性评分可能会受到影响。\n3. 考虑答案中提出的建议或解决方法是否实用并可行。答案可能看起来很好,但如果无法实现或应用,则实用性评分可能会受到影响。\n4. 根据答案的实用性,给出一个1到5的评分。如果答案缺乏实用性,则应给出一个较低的评分。如果答案提出了实用的建议或解决方法,并且可以很好地解决问题,则应给出一个较高的评分。\n\n实用性:",
|
||||
"reasonableness": "1. 仔细阅读所提供的头脑风暴问题,确保你理解问题的要点和背景。\n2. 根据你的知识和经验,判断所提供的答案是否可行。如果答案不可行,则合理性评分可能会受到影响。\n3. 考虑答案中所提供的信息是否合理、符合常识、生活实际等等。如果答案中存在明显的不合理之处,则合理性评分可能会受到影响。\n4. 根据答案的合理性,给出一个1到5的评分。如果答案存在明显的不合理之处,则应给出一个较低的评分。如果答案合理、符合常识、生活实际等等,则应给出一个较高的评分。\n\n合理性:"
|
||||
},
|
||||
"prompt": "你是一个好助手。请你为下面“头脑风暴”问题的答案打分。\n\n问题如下:\n\n{question}\n\n答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
},
|
||||
"chat": {
|
||||
"id": 2,
|
||||
"category": "chat",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
|
||||
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
|
||||
"naturalness": "自然(1-5):答案是否自然,并且符合问题给定的身份。",
|
||||
"engagingness": "参与感(1-5):答案是否对前面的对话内容做出了恰当的反应,是否理解对话的语境和背景。",
|
||||
"reasonableness": "合理性(1-5):答案是否能够与前面的对话内容形成逻辑上的衔接,是否符合常理,能否在这个上下文中合理存在。",
|
||||
"fidelity": "保真度(1-5):答案是否能够严格遵守角色的设定回答给定的请求。"
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织,并给出一个1到5的分数,其中5表示语言组织非常好,而1表示语言组织非常差。\n\n语言组织:",
|
||||
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度,并给出一个1到5的分数,其中5表示答案非常切题,而1表示答案完全没有切题。\n\n切题:",
|
||||
"naturalness": "1. 阅读题目,确定题目提供的身份信息。\n2. 检查答案内容是否符合题目给定的身份。\n3. 根据以上因素,对该回答的自然性进行打分,分数从1到5,其中1表示不自然,5表示非常自然,并符合问题给定的身份。\n\n自然:",
|
||||
"engagingness": "1. 阅读题目,确定对话的语境和背景。\n2. 检查答案是否充分理解对话的语境和背景,能否自然地融入到对话中而不显得突兀。\n3. 根据以上因素,对该回答的参与感进行打分,分数从1到5,其中1表示没有参与感,5表示非常有参与感,并且恰当地理解了对话的语境和背景。\n\n参与感:",
|
||||
"reasonableness": "1. 阅读题目,确定对话的主题以及问题期望的回答方向。\n2. 判断答案是否能够与前面的对话内容形成逻辑上的衔接,是否符合常理,能否在这个上下文中合理存在。\n3. 根据以上因素,对该回答的合理性进行打分,分数从1到5,其中1表示不合理,5表示非常合理,并且能够与前面的对话内容形成逻辑上的衔接,并符合常理。\n\n合理性:",
|
||||
"fidelity": "1. 仔细阅读问题,了解角色在问题中的设定和表现,包括职业、背景、观点、性格等方面。\n阅读题目的请求,确认回答请求时需要注意的细节。\n3. 对比提供的回答与该角色的设定,评估回答是否能够严格遵守角色的设定。\n4. 结合以上评估结果给出保真度的评分,范围从1到5分,其中1分表示回答与角色设定完全不符,5分表示回答完全符合角色设定且满足给定请求。\n\n保真度:"
|
||||
},
|
||||
"prompt": "你是一个好助手。请你为下面的“补全对话”问题的答案打分。\n\n问题如下:\n\n{question}\n\n答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
},
|
||||
"classification": {
|
||||
"id": 3,
|
||||
"category": "classification",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
|
||||
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
|
||||
"correctness": "正确性(1-5):答案是否正确。"
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织,并给出一个1到5的分数,其中5表示语言组织非常好,而1表示语言组织非常差。\n\n语言组织:",
|
||||
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度,并给出一个1到5的分数,其中5表示答案非常切题,而1表示答案完全没有切题。\n\n切题:",
|
||||
"correctness": "1. 仔细阅读题目,尝试自己回答该问题。\n2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的,则可以将正确性得分为5分。如果答案是部分正确的,则可以给予适当的得分,例如2分、3分或4分。如果答案完全不正确,则只得1分。\n\n正确性:"
|
||||
},
|
||||
"prompt": "你是一个好助手。请你为下面的“分类“问题的答案打分。\n\n问题如下:\n\n{question}\n\n答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
},
|
||||
"closed_qa": {
|
||||
"id": 4,
|
||||
"category": "closed_qa",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
|
||||
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
|
||||
"correctness": "正确性(1-5):答案是否正确。"
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织,并给出一个1到5的分数,其中5表示语言组织非常好,而1表示语言组织非常差。\n\n语言组织:",
|
||||
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度,并给出一个1到5的分数,其中5表示答案非常切题,而1表示答案完全没有切题。\n\n切题:",
|
||||
"correctness": "1. 仔细阅读题目,尝试自己回答该问题。\n2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的,则可以将正确性得分为5分。如果答案是部分正确的,则可以给予适当的得分,例如2分、3分或4分。如果答案完全不正确,则只得1分。\n\n正确性:"
|
||||
},
|
||||
"prompt": "你是一个好助手。请你为下面问题的答案打分。\n\n问题如下:\n\n{question}\n\n需要你评分的答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
},
|
||||
"extraction": {
|
||||
"id": 5,
|
||||
"category": "extraction",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
|
||||
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
|
||||
"correctness": "准确性(1-5):回答应该准确无误地提取出所需信息,不应该包含任何错误或误导性信息。"
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织,并给出一个1到5的分数,其中5表示语言组织非常好,而1表示语言组织非常差。\n\n语言组织:",
|
||||
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度,并给出一个1到5的分数,其中5表示答案非常切题,而1表示答案完全没有切题。\n\n切题:",
|
||||
"correctness": "1. 仔细阅读问题并确定需要从材料中提取的信息。\n2. 仔细阅读回答并确保它涵盖了所有需要提取的信息。\n3. 使用所提供的材料来验证回答的准确性。如果回答不准确或包含错误或误导性信息,则无法给出高分。\n4. 检查回答是否包含所有要求提取的信息,不要漏掉任何重要细节。\n5. 根据回答的准确性和完整性,给出一个介于1和5之间的分数,5分表示回答非常准确且完整,1分表示回答几乎没有提取出所需信息。\n\n准确性:"
|
||||
},
|
||||
"prompt": "你是一个好助手。请你为下面的“提取”问题的答案打分。\n\n问题如下:\n\n{question}\n\n答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
},
|
||||
"generation": {
|
||||
"id": 6,
|
||||
"category": "generation",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
|
||||
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
|
||||
"diversity": "多样性(1-5):答案使用语言是否优美,具有有一定的创造性和想象力。然而,回答也应该保持合理和适度,不要过于夸张或离题。"
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织,并给出一个1到5的分数,其中5表示语言组织非常好,而1表示语言组织非常差。\n\n语言组织:",
|
||||
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度,并给出一个1到5的分数,其中5表示答案非常切题,而1表示答案完全没有切题。\n\n切题:",
|
||||
"diversity": "1. 仔细阅读整个回答,确保完全理解回答所表达的内容和主题。\n2. 在阅读回答的同时,注意语言的质量,例如措辞是否正确,语言是否生动等。\n3. 检查回答的创造性和想象力,看看回答是否能够吸引人阅读下去。\n4. 检查回答的合理性和适度,看看回答是否夸张或离题。\n5. 将多样性的评分打分在1到5之间,5分表示回答的质量很好,能够吸引人阅读,1分表示回答的内容生硬或者有离题的问题。\n\n多样性:"
|
||||
},
|
||||
"prompt": "你是一个好助手。请你为下面的“生成”问题的答案打分。\n\n问题如下:\n\n{question}\n\n答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
},
|
||||
"open_qa": {
|
||||
"id": 7,
|
||||
"category": "open_qa",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
|
||||
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
|
||||
"correctness": "正确性(1-5):答案是否正确。"
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织,并给出一个1到5的分数,其中5表示语言组织非常好,而1表示语言组织非常差。\n\n语言组织:",
|
||||
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度,并给出一个1到5的分数,其中5表示答案非常切题,而1表示答案完全没有切题。\n\n切题:",
|
||||
"correctness": "1. 仔细阅读题目,尝试自己回答该问题。\n2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的,则可以将正确性得分为5分。如果答案是部分正确的,则可以给予适当的得分,例如2分、3分或4分。如果答案完全不正确,则只得1分。\n\n正确性:"
|
||||
},
|
||||
"prompt": "你是一个好助手。请你为下面的问题的答案打分。\n\n问题如下:\n\n{question}\n\n答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
},
|
||||
"rewriting": {
|
||||
"id": 8,
|
||||
"category": "rewriting",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
|
||||
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
|
||||
"correctness": "正确性(1-5):答案是否正确。"
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织,并给出一个1到5的分数,其中5表示语言组织非常好,而1表示语言组织非常差。\n\n语言组织:",
|
||||
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度,并给出一个1到5的分数,其中5表示答案非常切题,而1表示答案完全没有切题。\n\n切题:",
|
||||
"correctness": "1. 仔细阅读题目,尝试自己回答该问题。\n2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的,则可以将正确性得分为5分。如果答案是部分正确的,则可以给予适当的得分,例如2分、3分或4分。如果答案完全不正确,则只得1分。\n\n正确性:"
|
||||
},
|
||||
"prompt": "你是一个好助手。请你为下面的问题的答案打分。\n\n问题如下:\n\n{question}\n\n答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
},
|
||||
"roleplay": {
|
||||
"id": 9,
|
||||
"category": "roleplay",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
|
||||
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
|
||||
"fidelity": "保真度(1-5):答案是否能够严格遵守角色的设定回答给定的请求。",
|
||||
"creativity": "创意性(1-5):角色扮演问题的回答需要具有一定创意,但同时需要遵守角色的设定。"
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织,并给出一个1到5的分数,其中5表示语言组织非常好,而1表示语言组织非常差。\n\n语言组织:",
|
||||
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度,并给出一个1到5的分数,其中5表示答案非常切题,而1表示答案完全没有切题。\n\n切题:",
|
||||
"fidelity": "1. 仔细阅读问题,了解角色在问题中的设定和表现,包括职业、背景、观点、性格等方面。\n2. 阅读题目的请求,确认回答请求时需要注意的细节。\n3. 对比提供的回答与该角色的设定,评估回答是否能够严格遵守角色的设定。\n4. 结合以上评估结果给出保真度的评分,范围从1到5分,其中1分表示回答与角色设定完全不符,5分表示回答完全符合角色设定且满足给定请求。\n\n保真度:",
|
||||
"creativity": "1. 仔细阅读问题,了解角色在问题中的设定和表现,包括职业、背景、观点、性格等方面。\n2. 评估回答是否具有独特的思路和建议,是否能够给提问者带来新的想法和启示。\n3. 对比回答中的创意和该角色的设定,评估回答是否遵守了该角色的设定和基本特征。\n4. 对回答的质量进行总体评估,并结合以上评估结果给出创意性的评分,范围从1到5分,其中1分表示回答缺乏创意,5分表示回答具有独特的思路和建议,并且能够遵守该角色的设定。\n\n创意性:"
|
||||
},
|
||||
"prompt": "你是一个好助手。请你为下面的“角色扮演”问题的答案打分。\n\n问题如下:\n\n{question}\n\n答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
},
|
||||
"summarization": {
|
||||
"id": 10,
|
||||
"category": "summarization",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
|
||||
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
|
||||
"correctness": "准确性(1-5):回答应该准确无误地总结出材料的重点。",
|
||||
"conciseness": "简明扼要(1-5):答案是否简明扼要,没有冗余内容。"
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织,并给出一个1到5的分数,其中5表示语言组织非常好,而1表示语言组织非常差。\n\n语言组织:",
|
||||
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度,并给出一个1到5的分数,其中5表示答案非常切题,而1表示答案完全没有切题。\n\n切题:",
|
||||
"correctness": "1. 仔细阅读问题给的材料,理解其内容和要点。\n2. 评估回答是否准确地总结出原始材料的重点。\n3. 评估回答是否包含原始材料中的所有关键信息。\n4. 根据以上步骤,给出一个1-5的分数,其中1表示回答不能准确地总结出材料的重点,5表示回答完全准确地总结出材料的重点。\n\n准确性:",
|
||||
"conciseness": "1. 阅读题目,提取出材料的重点。\n2. 阅读该总结,并注意其中的主要观点和信息。\n3. 评估总结的长度。一个简明扼要的总结通常应该在几句话或几段文字内传达关键信息,而不是冗长的段落或文章。\n4. 检查总结是否包含与主要观点无关的信息或冗余信息。\n5.确定总结涵盖了材料中的关键信息,并且没有忽略任何重要细节。\n6.给总结打出1-5的分数,其中5表示总结简明扼要,没有冗余内容,而1表示总结冗长或包含不必要的信息,难以理解或记忆。根据您的判断,打出适当的得分。\n\n简明扼要:"
|
||||
},
|
||||
"prompt": "你是一个好助手。请你为下面的“总结”问题的答案打分。\n\n问题如下:\n\n{question}\n\n答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
},
|
||||
"general": {
|
||||
"id": 11,
|
||||
"category": "general",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
|
||||
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
|
||||
"correctness": "正确性(1-5):答案是否正确。"
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织,并给出一个1到5的分数,其中5表示语言组织非常好,而1表示语言组织非常差。\n\n语言组织:",
|
||||
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度,并给出一个1到5的分数,其中5表示答案非常切题,而1表示答案完全没有切题。\n\n切题:",
|
||||
"correctness": "1. 仔细阅读题目,尝试自己回答该问题。\n2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的,则可以将正确性得分为5分。如果答案是部分正确的,则可以给予适当的得分,例如2分、3分或4分。如果答案完全不正确,则只得1分。\n\n正确性:"
|
||||
},
|
||||
"prompt": "你是一个好助手。请你为下面问题的答案打分。\n\n问题如下:\n\n{question}\n\n需要你评分的答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
}
|
||||
}
|
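Each category entry above pairs metric descriptions (`metrics`) with step-by-step rating guides (`CoT`) and a `prompt` template containing `{question}`, `{answer}`, `{metric}` and `{steps}` placeholders. The sketch below shows one way such an entry could be turned into a grading prompt; the file name `config_cn.json`, the sample question and answer, and selecting a single metric at a time are illustrative assumptions, not the pipeline's actual code:

```python
import json

# Sketch only: file name, sample data and single-metric selection are assumed.
with open("config_cn.json", encoding="utf-8") as f:
    config = json.load(f)

category = config["brainstorming"]
metric_name = "creativity"
prompt = category["prompt"].format(
    question="请提出三个环保出行的新点子。",
    answer="骑共享单车、拼车上班、短途步行出行。",
    metric=category["metrics"][metric_name],  # e.g. "创意性(1-5):..."
    steps=category["CoT"][metric_name],       # the step-by-step rating guide
)
print(prompt)
```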
@@ -1,181 +0,0 @@
|
||||
{
|
||||
"brainstorming": {
|
||||
"id": 1,
|
||||
"category": "brainstorming",
|
||||
"metrics": {
|
||||
"language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
|
||||
"relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
|
||||
"creativity": "Creativity (1-5): Some brainstorming questions may require answers that are creative and suggest new ideas.",
|
||||
"practicality": "Practicality (1-5): Some brainstorming questions may require answers to suggest practical suggestions or solutions.",
|
||||
"reasonableness": "Reasonableness (1-5): The answer should be in line with common sense, life experience, etc."
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
|
||||
"relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
|
||||
"creativity": "1. Read the provided brainstorming questions carefully to make sure you understand the gist and context of the questions.\n2. Based on your knowledge and experience, determine if the answers provided are feasible. If the answer is not feasible, the creativity score may be affected.\n3. Consider whether the answer contains novel ideas or unique thoughts. An answer may overlap with a known solution and still be considered creative, as long as it offers a new perspective or approach to the problem.\n4. Give a score of 1 to 5 depending on the creativity of the answer. If the answer lacks creativity, a lower score should be given. If the answer is creative and provides a new idea, a higher score should be given.\n\nCreativity:",
|
||||
"practicality": "1. Read the provided brainstorming questions carefully to make sure you understand the gist and context of the questions.\n2. Based on your knowledge and experience, determine if the answers provided are feasible. If the answer is not feasible, the practicality score may be affected.\n3. Consider whether the suggestions or solutions presented in the answer are practical and workable. The answer may look good, but if it cannot be implemented or applied, the practicality score may be affected.\n4. Give a score of 1 to 5 depending on the practicality of the answer. If the answer lacks practicality, a lower score should be given. If the answer makes a practical suggestion or solution and solves the problem well, a higher score should be given.\n\nPracticality:",
|
||||
"reasonableness": "1. Read the provided brainstorming questions carefully to make sure you understand the gist and context of the questions.\n2. Based on your knowledge and experience, determine if the answers provided are feasible. If the answer is not feasible, the reasonableness score may be affected.\n3. Consider whether the information provided in the answer is reasonable, consistent with common sense, real life, etc. If there are obvious errors or implausibilities in the answer, the reasonableness score may be affected.\n4. Give a score of 1 to 5 depending on the reasonableness of the answer. If the answer contains obvious errors or unreasonable points, a lower score should be given. A higher score should be given if the answer is reasonable, consistent with common sense, real life, etc.\n\nReasonableness:"
|
||||
},
|
||||
"prompt": "You are a good assistant. Please rate the given answer to the \"brainstorming\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
},
|
||||
"chat": {
|
||||
"id": 2,
|
||||
"category": "chat",
|
||||
"metrics": {
|
||||
"language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
|
||||
"relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
|
||||
"naturalness": "Naturalness (1-5): whether the answer is natural and fits the identity given by the question.",
|
||||
"engagingness": "Engagingness (1-5): whether the answer responds appropriately to the content of the preceding conversation and whether it understands the context and background of the conversation.",
|
||||
"reasonableness": "Reasonableness (1-5): Whether the answer can form a logical connection with the content of the previous dialogue, whether it is consistent with common sense, and whether it can reasonably exist in this context.",
|
||||
"fidelity": "Fidelity (1-5): whether the answer is able to answer the given request in strict compliance with the role setting."
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
|
||||
"relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
|
||||
"naturalness": "1. Read the question and determine the identity information provided in the question.\n2. Check whether the content of the answer matches the identity given in the question.\n3. Based on the above factors, score the naturalness of the response on a scale from 1 to 5, where 1 means unnatural and 5 means very natural and in accordance with the identity given in the question.\n\nNaturalness:",
|
||||
"engagingness": "1. Read the questions to determine the context and background of the dialogue.\n2. Check that the answer fully understands the context and background of the conversation and that it fits naturally into the conversation without seeming abrupt.\n3. Based on the above factors, rate the response's engagement on a scale from 1 to 5, where 1 means not engaged and 5 means very engaged and appropriately understands the context and background of the conversation.\n\nEngagingness:",
|
||||
"reasonableness": "1. Read the question and determine the topic of the conversation and the direction the question expects the answer to go.\n2. Determine whether the answer can be logically connected to the preceding conversation, whether it makes common sense, and whether it can reasonably exist in this context.\n3. Based on the above factors, rate the reasonableness of the answer on a scale from 1 to 5, where 1 means unreasonable and 5 means very reasonable and able to form a logical connection with the preceding dialogue content and consistent with common sense.\n\nReasonableness:",
|
||||
"fidelity": "1. Read the question carefully to understand how the character is set up and represented in the question, including aspects such as occupation, background, point of view, and personality.\n2. Read the question's request and confirm the details that need to be taken into account when answering the request.\n3. Compare the provided answer with the setting of the role and assess whether the answer can strictly adhere to the setting of the role.\n4. Combine the results of the above assessment to give a fidelity score ranging from 1 to 5, where a score of 1 means that the response does not match the persona at all, and a score of 5 means that the response fully complies with the persona and satisfies the given request.\n\nFidelity:"
|
||||
},
|
||||
"prompt": "You are a good assistant. Please rate the given answer to the \"chat\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
},
|
||||
"classification": {
|
||||
"id": 3,
|
||||
"category": "classification",
|
||||
"metrics": {
|
||||
"language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
|
||||
"relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
|
||||
"correctness": "Correctness (1-5): whether the answer is correct or not."
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
|
||||
"relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
|
||||
"correctness": "1. Read the question carefully and try to answer the question yourself.\n2. Check the correctness of the answer. You can use known facts or research to verify that the answer is correct. If the answer is correct, you can give a score of 5 for correctness. If the answer is partially correct, an appropriate score, such as 2, 3, or 4, may be given. If the answer is completely incorrect, only 1 point is awarded.\n\nCorrectness:"
|
||||
},
|
||||
"prompt": "You are a good assistant. Please rate the given answer to the \"classification\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
},
|
||||
"closed_qa": {
|
||||
"id": 4,
|
||||
"category": "closed_qa",
|
||||
"metrics": {
|
||||
"language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
|
||||
"relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
|
||||
"correctness": "Correctness (1-5): whether the answer is correct or not."
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
|
||||
"relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
|
||||
"correctness": "1. Read the question carefully and try to answer the question by yourself.\n2. Check the correctness of the answer. You can use known facts or research to verify that the answer is correct. If the answer is correct, you can give a score of 5 for correctness. If the answer is partially correct, an appropriate score, such as 2, 3, or 4, may be assigned. If the answer is completely incorrect, only 1 point is awarded.\n\nCorrectness:"
|
||||
},
|
||||
"prompt": "You are a good assistant. Please rate the given answer to the \"closed qa\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
},
|
||||
"extraction": {
|
||||
"id": 5,
|
||||
"category": "extraction",
|
||||
"metrics": {
|
||||
"language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
|
||||
"relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
|
||||
"correctness": "correctness (1-5): Answers should extract the required information accurately and should not contain any incorrect or misleading information."
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
|
||||
"relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
|
||||
"correctness": "1. Read the questions carefully and identify the information that needs to be extracted from the material.\n2. Read the answer carefully and make sure it covers all the information that needs to be extracted.\n3. Use the material provided to verify the correctness of the response. If the response is inaccurate or contains incorrect or misleading information, a high score cannot be given.\n4. Check that the answer contains all the information required to be extracted and do not leave out any important details.\n5. Give a score between 1 and 5 based on the correctness and completeness of the response, with a score of 5 indicating a very accurate and complete response and a score of 1 indicating that the response barely extracts the required information.\n\nCorrectness:"
|
||||
},
|
||||
"prompt": "You are a good assistant. Please rate the given answer to the \"extraction\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
},
|
||||
"generation": {
|
||||
"id": 6,
|
||||
"category": "generation",
|
||||
"metrics": {
|
||||
"language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
|
||||
"relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
|
||||
"diversity": "Diversity (1-5): Whether the answers use beautiful language and have some creativity and imagination. However, answers should also be kept reasonable and moderate, not overly exaggerated or off-topic."
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
|
||||
"relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
|
||||
"diversity": "1. Read the entire response carefully to ensure that you fully understand the content and theme expressed in the response.\n2. While reading the response, pay attention to the quality of the language, such as whether the wording is correct and the language is vivid.\n3. Check the creativity and imagination of the response to see if the response is engaging to read on.\n4. Check the reasonableness and appropriateness of the responses to see if the responses are exaggerated or off-topic.\n5. Rate the diversity on a scale of 1 to 5, with a 5 indicating a good quality response that is engaging to read and a 1 indicating a raw response or a question that is off-topic.\n\nDiversity:"
|
||||
},
|
||||
"prompt": "You are a good assistant. Please rate the given answer to the \"generation\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
},
|
||||
"open_qa": {
|
||||
"id": 7,
|
||||
"category": "open_qa",
|
||||
"metrics": {
|
||||
"language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
|
||||
"relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
|
||||
"correctness": "Correctness (1-5): whether the answer is correct or not."
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
|
||||
"relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
|
||||
"correctness": "1. Read the question carefully and try to answer the question yourself.\n2. Check the correctness of the answer. You can use known facts or research to verify that the answer is correct. If the answer is correct, you can give a score of 5 for correctness. If the answer is partially correct, an appropriate score, such as 2, 3, or 4, may be given. If the answer is completely incorrect, only 1 point is awarded.\n\nCorrectness:"
|
||||
},
|
||||
"prompt": "You are a good assistant. Please rate the answers to the \"open qa\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
},
|
||||
"rewriting": {
|
||||
"id": 8,
|
||||
"category": "rewriting",
|
||||
"metrics": {
|
||||
"language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
|
||||
"relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
|
||||
"correctness": "Correctness (1-5): whether the answer is correct or not."
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
|
||||
"relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
|
||||
"correctness": "1. Read the question carefully and try to answer the question yourself.\n2. Check the correctness of the answer. You can use known facts or research to verify that the answer is correct. If the answer is correct, you can give a score of 5 for correctness. If the answer is partially correct, an appropriate score, such as 2, 3, or 4, may be assigned. If the answer is completely incorrect, only 1 point is awarded.\n\nCorrectness:"
|
||||
},
|
||||
"prompt": "You are a good assistant. Please rate the answers to the \"rewriting\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
},
|
||||
"roleplay": {
|
||||
"id": 9,
|
||||
"category": "roleplay",
|
||||
"metrics": {
|
||||
"language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
|
||||
"relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
|
||||
"fidelity": "Fidelity (1-5): whether the answer is able to answer the given request in strict compliance with the role setting.",
|
||||
"creativity": "Creativity (1-5): The answers to the role-play questions need to be somewhat creative, but at the same time they need to adhere to the setting of the role."
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
|
||||
"relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
|
||||
"fidelity": "1. Read the question carefully to understand how the character is set up and represented in the question, including aspects such as occupation, background, point of view, and personality.\n2. Read the question's request and confirm the details that need to be taken into account when answering the request.\n3. Compare the provided answer with the setting of the role and assess whether the answer can strictly adhere to the setting of the role.\n4. Combine the results of the above assessment to give a fidelity score ranging from 1 to 5, where a score of 1 means that the response does not match the persona at all, and a score of 5 means that the response fully complies with the persona and satisfies the given request.\n\nFidelity:",
|
||||
"creativity": "1. Read the question carefully to understand how the character is set up and represented in the question, including career, background, perspective, and personality.\n2. Evaluate whether the answer has unique ideas and suggestions that bring new ideas and insights to the questioner.\n3. Compare the creativity in the response to the setting of the persona and assess whether the response adheres to the setting and essential characteristics of the persona.\n4. Evaluate the quality of the responses in general and combine the results of the above assessment to give a creativity score ranging from 1 to 5, where a score of 1 indicates that the response lacks creativity and a score of 5 indicates that the response has unique ideas and suggestions and is able to adhere to the set-up of the persona.\n\nCreativity:"
|
||||
},
|
||||
"prompt": "You are a good assistant. Please rate the given answer to the \"role-play\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
},
|
||||
"summarization": {
|
||||
"id": 10,
|
||||
"category": "summarization",
|
||||
"metrics": {
|
||||
"language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
|
||||
"relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
|
||||
"correctness": "Correctness (1-5): answers should summarize the main points of the material accurately and unambiguously.",
|
||||
"conciseness": "Conciseness (1-5): answers should be concise and without redundant content."
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
|
||||
"relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
|
||||
"correctness": "1. Read the material given in the question carefully to understand its content and main points.\n2. Assess whether the answer accurately summarizes the key points of the source material.\n3. assess whether the response contains all the key information in the source material.\n4. Based on the above steps, give a score of 1-5, where 1 means that the response does not accurately summarize the main points of the material and 5 means that the response completely accurately summarizes the main points of the material.\n\nCorrectness:",
|
||||
"conciseness": "1. Read the title and extract the main points of the material.\n2. Read the summary and note the main ideas and messages in it.\n3. Assess the length of the summary. A concise summary should usually convey key information within a few sentences or paragraphs, rather than lengthy paragraphs or essays.\n4. Check that the summary does not contain information that is not relevant to the main ideas or that is redundant.\n5. Make sure that the summary covers the key information in the material and that no important details have been omitted.\n6. Rate the summary on a scale of 1-5, where 5 means the summary is concise and free of redundancy, and 1 means the summary is lengthy or contains unnecessary information that is difficult to understand or remember. Based on your judgment, assign the appropriate score.\n\nConciseness:"
|
||||
},
|
||||
"prompt": "You are a good assistant. Please rate the given answer to the \"summarization\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
},
|
||||
"general": {
|
||||
"id": 11,
|
||||
"category": "general",
|
||||
"metrics": {
|
||||
"language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
|
||||
"relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
|
||||
"correctness": "Correctness (1-5): whether the answer is correct or not."
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
|
||||
"relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
|
||||
"correctness": "1. Read the question carefully and try to answer the question yourself.\n2. Check the correctness of the answer. You can use known facts or research to verify that the answer is correct. If the answer is correct, you can give a score of 5 for correctness. If the answer is partially correct, an appropriate score, such as 2, 3, or 4, may be assigned. If the answer is completely incorrect, only 1 point is awarded.\n\nCorrectness:"
|
||||
},
|
||||
"prompt": "You are a good assistant. Please rate the given answer to the question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
}
|
||||
}
|
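Once a prompt has been assembled from the English (or Chinese) configuration above, it is sent to a GPT model for rating. The snippet below is only a hedged sketch of such a call using the `openai` package from the requirements file that follows; it assumes the pre-1.0 `openai.ChatCompletion` interface, and the model name and temperature are arbitrary choices for the sketch, not necessarily what the pipeline uses:

```python
import openai

# Sketch only: model name, temperature and error handling are simplified.
def rate_answer(prompt: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]
```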
@@ -1,12 +0,0 @@
|
||||
jieba
|
||||
bert-score
|
||||
rouge_chinese
|
||||
scikit-metrics
|
||||
nltk
|
||||
openai
|
||||
seaborn
|
||||
pandas
|
||||
matplotlib
|
||||
numpy
|
||||
zhon
|
||||
rouge_score
|
@@ -1,15 +0,0 @@
|
||||
from .evaluator import get_evaluator
|
||||
from .utils import (
|
||||
analyze_unieval_results,
|
||||
calculate_average_score,
|
||||
convert_data_to_unieval_format,
|
||||
save_unieval_results,
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
"get_evaluator",
|
||||
"convert_data_to_unieval_format",
|
||||
"calculate_average_score",
|
||||
"save_unieval_results",
|
||||
"analyze_unieval_results",
|
||||
]
|
@@ -1,329 +0,0 @@
|
||||
# MIT License
|
||||
|
||||
# Copyright (c) 2022 Ming Zhong
|
||||
|
||||
# Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
# of this software and associated documentation files (the "Software"), to deal
|
||||
# in the Software without restriction, including without limitation the rights
|
||||
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
# copies of the Software, and to permit persons to whom the Software is
|
||||
# furnished to do so, subject to the following conditions:
|
||||
|
||||
# The above copyright notice and this permission notice shall be included in all
|
||||
# copies or substantial portions of the Software.
|
||||
|
||||
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
||||
# SOFTWARE.
|
||||
|
||||
import numpy as np
|
||||
from nltk import sent_tokenize
|
||||
|
||||
from .scorer import UniEvaluator
|
||||
from .utils import add_question
|
||||
|
||||
|
||||
class SumEvaluator:
|
||||
def __init__(self, model_name_or_path, max_length=1024, device="cuda:0", cache_dir=None):
|
||||
"""Set up evaluator for text summarization"""
|
||||
self.scorer = UniEvaluator(
|
||||
model_name_or_path="MingZhong/unieval-sum" if model_name_or_path == "" else model_name_or_path,
|
||||
max_length=max_length,
|
||||
device=device,
|
||||
cache_dir=cache_dir,
|
||||
)
|
||||
self.task = "summarization"
|
||||
self.dimensions = ["coherence", "consistency", "fluency", "relevance"]
|
||||
|
||||
def evaluate(self, data, category, dims=None, overall=True):
|
||||
"""
|
||||
Get the scores of all the given dimensions
|
||||
|
||||
category: The category to be evaluated.
|
||||
|
||||
dims: A list of dimensions to be evaluated. If dims is None, SumEvaluator will evaluate
|
||||
four dimensions: coherence, consistency, fluency, relevance.
|
||||
|
||||
overall: indicates whether the overall score is to be calculated.
|
||||
Overall score can be customized to a combination of scores based on different
|
||||
dimensions. The default here is the average score of all the given dimensions.
|
||||
"""
|
||||
n_data = len(data)
|
||||
eval_scores = [{} for _ in range(n_data)]
|
||||
|
||||
if dims is None:
|
||||
eval_dims = self.dimensions
|
||||
else:
|
||||
assert isinstance(dims, list)
|
||||
eval_dims = dims
|
||||
|
||||
for dim in eval_dims:
|
||||
# Calculate average sentence-level scores for 'consistency' and 'fluency'
|
||||
if dim == "consistency" or dim == "fluency":
|
||||
src_list, output_list = [], []
|
||||
n_sents = [] # the number of sentences in each generated summary
|
||||
for i in range(n_data):
|
||||
source = data[i]["source"]
|
||||
system_outputs = sent_tokenize(data[i]["system_output"])
|
||||
n_sents.append(len(system_outputs))
|
||||
for j in range(len(system_outputs)):
|
||||
src_list.append(source)
|
||||
output_list.append(system_outputs[j])
|
||||
input_list = add_question(dimension=dim, output=output_list, src=src_list, task=self.task)
|
||||
sent_score = self.scorer.score(input_list, self.task, category, dim)
|
||||
|
||||
# Get average score for each sample
|
||||
start_idx = 0
|
||||
score = []
|
||||
for cur_n_sent in n_sents:
|
||||
# prevent denominator from being 0
|
||||
score.append(sum(sent_score[start_idx : start_idx + cur_n_sent]) / (cur_n_sent + 1e-6))
|
||||
start_idx += cur_n_sent
|
||||
|
||||
# Calculate summary-level score for 'coherence' and 'relevance'
|
||||
elif dim == "coherence" or dim == "relevance":
|
||||
src_list, output_list, ref_list = [], [], []
|
||||
for i in range(n_data):
|
||||
src_list.append(data[i]["source"])
|
||||
output_list.append(data[i]["system_output"])
|
||||
if dim == "relevance":
|
||||
ref_list.append(data[i]["reference"])
|
||||
input_list = add_question(dimension=dim, output=output_list, src=src_list, ref=ref_list, task=self.task)
|
||||
score = self.scorer.score(input_list, self.task, category, dim)
|
||||
|
||||
# Please customize other dimensions here for summarization
|
||||
else:
|
||||
raise NotImplementedError(
|
||||
"The input format for this dimension is still undefined. \
|
||||
Please customize it first."
|
||||
)
|
||||
|
||||
for i in range(n_data):
|
||||
eval_scores[i][dim] = score[i]
|
||||
|
||||
# Customize your overall score here.
|
||||
if overall:
|
||||
for i in range(n_data):
|
||||
eval_scores[i]["overall"] = np.mean(list(eval_scores[i].values()))
|
||||
|
||||
return eval_scores
|
||||
|
||||
|
||||
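# --- Usage sketch (illustrative addition, not part of the original file) ---
def _sum_evaluator_sketch():
    """A minimal sketch of the per-sample dict format that SumEvaluator.evaluate expects.
    All strings below are made-up placeholders."""
    sample = [
        {
            "source": "The quick brown fox jumps over the lazy dog near the river bank.",
            "system_output": "A fox jumps over a dog.",
            "reference": "A quick fox jumps over a lazy dog by the river.",
        }
    ]
    evaluator = SumEvaluator(model_name_or_path="")  # "" falls back to MingZhong/unieval-sum
    # Each returned dict holds one score per dimension plus "overall" (their mean).
    return evaluator.evaluate(sample, category="summarization")
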
class DialogEvaluator:
    def __init__(self, model_name_or_path, max_length=1024, device="cuda:0", cache_dir=None):
        """Set up evaluator for dialogues"""
        self.scorer = UniEvaluator(
            model_name_or_path="MingZhong/unieval-dialog" if model_name_or_path == "" else model_name_or_path,
            max_length=max_length,
            device=device,
            cache_dir=cache_dir,
        )
        self.task = "dialogue"
        self.dimensions = ["naturalness", "coherence", "engagingness", "groundedness", "understandability"]

    def evaluate(self, data, category, dims=None, overall=True):
        """
        Get the scores of all the given dimensions

        category: The category to be evaluated.

        dims: A list of dimensions to be evaluated. If dims is None, DialogEvaluator will evaluate
              five dimensions: naturalness, coherence, engagingness, groundedness and understandability.

        overall: indicates whether the overall score is to be calculated.
                 The overall score can be customized as a combination of scores based on different
                 dimensions. The default here is the average score of all the given dimensions.
        """
        n_data = len(data)
        eval_scores = [{} for _ in range(n_data)]

        if dims is None:
            eval_dims = self.dimensions
        else:
            assert isinstance(dims, list)
            eval_dims = dims

        for dim in eval_dims:
            # Calculate the summation of sentence-level scores for 'engagingness'
            if dim == "engagingness":
                src_list, output_list, context_list = [], [], []
                n_sents = []  # the number of sentences in each generated response
                for i in range(n_data):
                    source = data[i]["source"]
                    context = data[i]["context"]
                    system_outputs = sent_tokenize(data[i]["system_output"])
                    n_sents.append(len(system_outputs))
                    for j in range(len(system_outputs)):
                        src_list.append(source)
                        context_list.append(context)
                        output_list.append(system_outputs[j])
                input_list = add_question(
                    dimension=dim, output=output_list, src=src_list, context=context_list, task=self.task
                )
                sent_score = self.scorer.score(input_list, self.task, category, dim)

                # Get the summation score for each sample
                start_idx = 0
                score = []
                for cur_n_sent in n_sents:
                    score.append(sum(sent_score[start_idx : start_idx + cur_n_sent]))
                    start_idx += cur_n_sent

            # Calculate turn-level scores for other dimensions
            elif dim in ["naturalness", "coherence", "groundedness", "understandability"]:
                src_list, output_list, context_list = [], [], []
                for i in range(n_data):
                    src_list.append(data[i]["source"])
                    output_list.append(data[i]["system_output"])
                    context_list.append(data[i]["context"])
                input_list = add_question(
                    dimension=dim, output=output_list, src=src_list, context=context_list, task=self.task
                )
                score = self.scorer.score(input_list, self.task, category, dim)

            # Please customize other dimensions for dialogues here.
            else:
                raise NotImplementedError(
                    "The input format for this dimension is still undefined. Please customize it first."
                )

            for i in range(n_data):
                eval_scores[i][dim] = score[i]

        # Customize your overall score here.
        if overall:
            for i in range(n_data):
                eval_scores[i]["overall"] = np.mean(list(eval_scores[i].values()))

        return eval_scores


class D2tEvaluator:
    def __init__(self, model_name_or_path, max_length=1024, device="cuda:0", cache_dir=None):
        """Set up evaluator for data-to-text"""
        self.scorer = UniEvaluator(
            model_name_or_path="MingZhong/unieval-sum" if model_name_or_path == "" else model_name_or_path,
            max_length=max_length,
            device=device,
            cache_dir=cache_dir,
        )
        self.task = "data2text"
        self.dimensions = ["naturalness", "informativeness"]

    def evaluate(self, data, category, dims=None, overall=True):
        """
        Get the scores of all the given dimensions

        category: The category to be evaluated.

        dims: A list of dimensions to be evaluated. If dims is None, D2tEvaluator will evaluate
              two dimensions: naturalness and informativeness.

        overall: indicates whether the overall score is to be calculated.
                 The overall score can be customized as a combination of scores based on different
                 dimensions. The default here is the average score of all the given dimensions.
        """
        n_data = len(data)
        eval_scores = [{} for _ in range(n_data)]

        if dims is None:
            eval_dims = self.dimensions
        else:
            assert isinstance(dims, list)
            eval_dims = dims

        for dim in eval_dims:
            output_list, ref_list = [], []
            for i in range(n_data):
                output_list.append(data[i]["system_output"])
                ref_list.append(data[i]["reference"])

            input_list = add_question(dimension=dim, output=output_list, ref=ref_list, task=self.task)
            score = self.scorer.score(input_list, self.task, category, dim)

            for i in range(n_data):
                eval_scores[i][dim] = score[i]

        # Customize your overall score here.
        if overall:
            for i in range(n_data):
                eval_scores[i]["overall"] = np.mean(list(eval_scores[i].values()))

        return eval_scores


class FactEvaluator:
    def __init__(self, model_name_or_path, max_length=1024, device="cuda:0", cache_dir=None):
        """Set up evaluator for factual consistency detection"""
        self.scorer = UniEvaluator(
            model_name_or_path="MingZhong/unieval-fact" if model_name_or_path == "" else model_name_or_path,
            max_length=max_length,
            device=device,
            cache_dir=cache_dir,
        )
        self.task = "fact"
        self.dim = "consistency"

    def evaluate(self, data, category):
        """
        Get the factual consistency score (only 1 dimension for this task)

        category: The category to be evaluated.
        """
        n_data = len(data)
        eval_scores = [{} for _ in range(n_data)]

        # Calculate average sentence-level scores for factual consistency
        src_list, output_list = [], []
        n_sents = []  # the number of sentences in the claim
        for i in range(n_data):
            source = data[i]["source"]
            system_outputs = sent_tokenize(data[i]["system_output"])
            n_sents.append(len(system_outputs))
            for j in range(len(system_outputs)):
                src_list.append(source)
                output_list.append(system_outputs[j])
        input_list = add_question(dimension=self.dim, output=output_list, src=src_list, task=self.task)
        sent_score = self.scorer.score(input_list, self.task, category, self.dim)

        # Get the average score for each sample
        start_idx = 0
        score = []
        for cur_n_sent in n_sents:
            # prevent the denominator from being 0
            score.append(sum(sent_score[start_idx : start_idx + cur_n_sent]) / (cur_n_sent + 1e-6))
            start_idx += cur_n_sent

        for i in range(n_data):
            eval_scores[i][self.dim] = score[i]

        return eval_scores


def get_evaluator(task, model_name_or_path="", max_length=1024, device="cuda:0", cache_dir=None):
    assert task in ["summarization", "dialogue", "data2text", "fact"]
    if task == "summarization":
        return SumEvaluator(
            model_name_or_path=model_name_or_path, max_length=max_length, device=device, cache_dir=cache_dir
        )
    elif task == "dialogue":
        return DialogEvaluator(
            model_name_or_path=model_name_or_path, max_length=max_length, device=device, cache_dir=cache_dir
        )
    elif task == "data2text":
        return D2tEvaluator(
            model_name_or_path=model_name_or_path, max_length=max_length, device=device, cache_dir=cache_dir
        )
    elif task == "fact":
        return FactEvaluator(
            model_name_or_path=model_name_or_path, max_length=max_length, device=device, cache_dir=cache_dir
        )
    else:
        raise NotImplementedError(
            "Other tasks are not implemented, please customize specific tasks here."
        )

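# --- Usage sketch (illustrative addition, not part of the original file) ---
def _get_evaluator_sketch():
    """A minimal sketch of picking an evaluator through the factory above.
    The claim/document pair is a made-up placeholder."""
    fact_data = [
        {
            "source": "ColossalAI is a deep learning system for large-scale model training.",
            "system_output": "ColossalAI is a system for training large models.",
        }
    ]
    evaluator = get_evaluator(task="fact")  # empty model path falls back to MingZhong/unieval-fact
    return evaluator.evaluate(fact_data, category="fact")
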
@@ -1,96 +0,0 @@
# MIT License

# Copyright (c) 2022 Ming Zhong

# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:

# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.

# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

import torch
import torch.nn as nn
from tqdm import tqdm
from transformers import AutoConfig, AutoModelForSeq2SeqLM, AutoTokenizer


class UniEvaluator:
    def __init__(self, model_name_or_path, max_length=1024, device="cuda:0", cache_dir=None):
        """Set up the model"""
        self.device = device
        self.max_length = max_length

        self.config = AutoConfig.from_pretrained(model_name_or_path, cache_dir=cache_dir)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, cache_dir=cache_dir)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path, config=self.config, cache_dir=cache_dir)

        self.model.eval()
        self.model.to(device)

        self.softmax = nn.Softmax(dim=1)

        self.pos_id = self.tokenizer("Yes")["input_ids"][0]
        self.neg_id = self.tokenizer("No")["input_ids"][0]

    def score(self, inputs, task, category, dim, batch_size=8):
        """
        Get scores for the given samples.
        final_score = positive_score / (positive_score + negative_score)
        """

        # The implementation of "forward" in T5 still requires decoder_input_ids.
        # Therefore, we construct a random one-word target sequence.
        # The content of the target has no effect on the final scores.
        tgts = ["No" for _ in range(len(inputs))]

        pos_score_list, neg_score_list = [], []
        for i in tqdm(range(0, len(inputs), batch_size), desc=f"{category}-({dim}-{task}): "):
            src_list = inputs[i : i + batch_size]
            tgt_list = tgts[i : i + batch_size]
            try:
                with torch.no_grad():
                    encoded_src = self.tokenizer(
                        src_list, max_length=self.max_length, truncation=True, padding=True, return_tensors="pt"
                    )
                    encoded_tgt = self.tokenizer(
                        tgt_list, max_length=self.max_length, truncation=True, padding=True, return_tensors="pt"
                    )

                    src_tokens = encoded_src["input_ids"].to(self.device)
                    src_mask = encoded_src["attention_mask"].to(self.device)

                    tgt_tokens = encoded_tgt["input_ids"].to(self.device)[:, 0].unsqueeze(-1)

                    output = self.model(input_ids=src_tokens, attention_mask=src_mask, labels=tgt_tokens)
                    logits = output.logits.view(-1, self.model.config.vocab_size)

                    pos_score = self.softmax(logits)[:, self.pos_id]  # Yes
                    neg_score = self.softmax(logits)[:, self.neg_id]  # No

                    cur_pos_score = [x.item() for x in pos_score]
                    cur_neg_score = [x.item() for x in neg_score]
                    pos_score_list += cur_pos_score
                    neg_score_list += cur_neg_score

            except RuntimeError:
                print(f"source: {src_list}")
                print(f"target: {tgt_list}")
                exit(0)

        score_list = []
        for i in range(len(pos_score_list)):
            score_list.append(pos_score_list[i] / (pos_score_list[i] + neg_score_list[i]))

        return score_list

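# --- Usage sketch (illustrative addition, not part of the original file) ---
def _unieval_score_sketch():
    """A minimal sketch of scoring one Boolean-QA style prompt directly: the model assigns
    probabilities to "Yes" and "No" as the first decoded token, and each returned value is
    P("Yes") / (P("Yes") + P("No")). The paragraph text is made up; `category` only labels
    the progress bar."""
    scorer = UniEvaluator(model_name_or_path="MingZhong/unieval-sum")
    prompt = "question: Is this a fluent paragraph? </s> paragraph: The cat sat on the mat."
    return scorer.score([prompt], task="summarization", category="demo", dim="fluency")
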
@@ -1,285 +0,0 @@
# MIT License

# Copyright (c) 2022 Ming Zhong

# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:

# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.

# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

import os
from typing import Dict

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import tqdm


def add_question(dimension, output, src=None, ref=None, context=None, task=None):
    """
    Add questions to generate input in Bool-QA format for UniEval.

    dimension: specific dimension to be evaluated
    src: source input for different NLG tasks. For example, the source document for summarization
         and the dialogue history for dialogue response generation.
    output: output text generated by the models
    ref: human-annotated groundtruth
    context: the context needed to evaluate several specific dimensions. For example,
             additional factual information when evaluating engagingness and groundedness in dialogues.
    """

    input_with_question = []
    for i in range(len(output)):
        # For summarization
        if task == "summarization":
            if dimension == "fluency":
                cur_input = "question: Is this a fluent paragraph? </s> paragraph: " + output[i]
            elif dimension == "coherence":
                cur_input = (
                    "question: Is this a coherent summary to the document? </s> summary: "
                    + output[i]
                    + " </s> document: "
                    + src[i]
                )
            elif dimension == "consistency":
                cur_input = (
                    "question: Is this claim consistent with the document? </s> claim: "
                    + output[i]
                    + " </s> document: "
                    + src[i]
                )
            elif dimension == "relevance":
                cur_input = (
                    "question: Is this summary relevant to the reference? </s> summary: "
                    + output[i]
                    + " </s> reference: "
                    + ref[i]
                )
            else:
                raise NotImplementedError(
                    "The input format for this dimension is still undefined. Please customize it first."
                )
        # For dialogues
        elif task == "dialogue":
            if dimension == "naturalness":
                cur_input = "question: Is this a natural response in the dialogue? </s> response: " + output[i]
            elif dimension == "coherence":
                cur_input = (
                    "question: Is this a coherent response given the dialogue history? </s> response: "
                    + output[i]
                    + " </s> dialogue history: "
                    + src[i]
                )
            elif dimension == "engagingness":
                cur_input = (
                    "question: Is this an engaging and informative response according to the dialogue history and fact? </s> response: "
                    + output[i]
                    + " </s> dialogue history: "
                    + src[i]
                    + " </s> fact: "
                    + context[i]
                )
            elif dimension == "groundedness":
                cur_input = (
                    "question: Is this response consistent with knowledge in the fact? </s> response: "
                    + output[i]
                    + " </s> fact: "
                    + context[i]
                )
            elif dimension == "understandability":
                cur_input = "question: Is this an understandable response in the dialogue? </s> response: " + output[i]
            else:
                raise NotImplementedError(
                    "The input format for this dimension is still undefined. Please customize it first."
                )
        # For data-to-text
        elif task == "data2text":
            if dimension == "naturalness":
                cur_input = "question: Is this a fluent utterance? </s> utterance: " + output[i]
            elif dimension == "informativeness":
                cur_input = (
                    "question: Is this sentence informative according to the reference? </s> sentence: "
                    + output[i]
                    + " </s> reference: "
                    + ref[i]
                )
            else:
                raise NotImplementedError(
                    "The input format for this dimension is still undefined. Please customize it first."
                )
        # For factual consistency detection
        elif task == "fact":
            if dimension == "consistency":
                cur_input = (
                    "question: Is this claim consistent with the document? </s> claim: "
                    + output[i]
                    + " </s> document: "
                    + src[i]
                )
            else:
                raise NotImplementedError("No other dimensions for the factual consistency detection task.")
        # For new customized tasks
        else:
            raise NotImplementedError("Other tasks are not implemented, please customize specific tasks here.")
        input_with_question.append(cur_input)
    return input_with_question

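# --- Usage sketch (illustrative addition, not part of the original file) ---
def _add_question_sketch():
    """A minimal sketch of the Bool-QA prompt built for one dialogue response.
    The response text is a made-up placeholder."""
    prompts = add_question(
        dimension="naturalness",
        output=["Sure, the store opens at 9 am."],
        task="dialogue",
    )
    # prompts[0] == "question: Is this a natural response in the dialogue? </s> response: Sure, the store opens at 9 am."
    return prompts
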
def convert_data_to_unieval_format(output_list, src_list=None, ref_list=None):
    """
    Convert the data into the UniEval format.

    output_list: a list of model outputs

    src_list: source input for different NLG tasks. For example, the source document for summarization
              and the dialogue history for dialogue response generation
    ref_list: human-annotated groundtruth
    """
    json_data = []
    for i in range(len(output_list)):
        cur = {}
        cur["system_output"] = output_list[i]
        if src_list is not None:
            cur["source"] = src_list[i]
        if ref_list is not None:
            cur["reference"] = ref_list[i]
        cur["context"] = ""
        json_data.append(cur)
    return json_data

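# --- Usage sketch (illustrative addition, not part of the original file) ---
def _convert_data_sketch():
    """A minimal sketch of packing model outputs into the dict format the evaluators consume.
    All strings are made-up placeholders."""
    data = convert_data_to_unieval_format(
        output_list=["A fox jumps over a dog."],
        src_list=["The quick brown fox jumps over the lazy dog."],
        ref_list=["A quick fox jumps over a lazy dog."],
    )
    # data == [{"system_output": ..., "source": ..., "reference": ..., "context": ""}]
    return data
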
def calculate_average_score(scores):
    """
    Calculate average scores for different metrics

    scores: a list of scores for different metrics for each answer
    """
    metrics = {metric: 0 for metric in scores[0]}

    for score in scores:
        for metric in score:
            metrics[metric] += score[metric]

    for metric in metrics:
        metrics[metric] /= len(scores)

    return metrics

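# --- Usage sketch (illustrative addition, not part of the original file) ---
def _average_score_sketch():
    """A minimal sketch with made-up numbers: per-answer scores are averaged per metric."""
    per_answer = [{"coherence": 0.8, "fluency": 0.6}, {"coherence": 0.4, "fluency": 1.0}]
    return calculate_average_score(per_answer)  # -> {"coherence": 0.6, "fluency": 0.8} (up to float rounding)
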
def save_unieval_results(model_name: str, unieval_metric_stats: Dict[str, Dict], save_path: str) -> None:
    """
    Save UniEval evaluation results of different categories for one model.
    """

    if not os.path.exists(save_path):
        os.makedirs(save_path)

    unieval_metric_stats_per_category = {}
    for task, category_stat in unieval_metric_stats.items():
        for category, metric_stat in category_stat.items():
            if unieval_metric_stats_per_category.get(category, None) is None:
                unieval_metric_stats_per_category[category] = {}
            for metric, score in metric_stat.items():
                unieval_metric_stats_per_category[category][f"{metric}-{task}"] = score

    unieval_df = pd.DataFrame(unieval_metric_stats_per_category)
    unieval_df.to_csv(os.path.join(save_path, f"{model_name}_results.csv"), index=True)


def read_unieval_results(results_path: str, file_name: str) -> Dict[str, Dict]:
    """
    Read a csv file and return a dictionary which stores scores per metric.
    """

    results = pd.read_csv(os.path.join(results_path, file_name), index_col=0)

    results_dict = {metric: {} for metric in list(results.index)}
    for i, metric in enumerate(results_dict.keys()):
        for j, category in enumerate(list(results.columns)):
            if pd.isnull(results.iloc[i, j]):
                continue
            results_dict[metric][category] = results.iloc[i, j]

    return results_dict


def analyze_unieval_results(results_path: str, save_path: str) -> None:
    """
    Analyze and visualize all csv files in the given folder.
    """

    if not os.path.exists(results_path):
        raise Exception(f'The given directory "{results_path}" doesn\'t exist! No results found!')

    all_statistics = {}

    for file_name in os.listdir(results_path):
        if file_name.endswith("_results.csv"):
            model_name = file_name.split("_results.csv")[0]
            all_statistics[model_name] = read_unieval_results(results_path, file_name)

    if len(list(all_statistics.keys())) == 0:
        raise Exception(f'There are no csv files in the given directory "{results_path}"!')

    frame_all = {"model": [], "category": [], "metric": [], "score": []}
    frame_per_metric = {}
    for model_name, model_statistics in all_statistics.items():
        for metric, metric_statistics in model_statistics.items():
            if frame_per_metric.get(metric) is None:
                frame_per_metric[metric] = {"model": [], "category": [], "score": []}

            for category, category_score in metric_statistics.items():
                frame_all["model"].append(model_name)
                frame_all["category"].append(category)
                frame_all["metric"].append(metric)
                frame_all["score"].append(category_score)

                frame_per_metric[metric]["model"].append(model_name)
                frame_per_metric[metric]["category"].append(category)
                frame_per_metric[metric]["score"].append(category_score)

    if not os.path.exists(save_path):
        os.makedirs(save_path)

    frame_all = pd.DataFrame(frame_all)
    frame_all.to_csv(os.path.join(save_path, "unieval_statistics.csv"))

    for metric in tqdm.tqdm(
        frame_per_metric.keys(),
        desc="UniEval metrics: ",
        total=len(frame_per_metric.keys()),
    ):
        data = pd.DataFrame(frame_per_metric[metric])

        sns.set()
        fig = plt.figure(figsize=(16, 10))

        fig = sns.barplot(x="category", y="score", hue="model", data=data, dodge=True)
        fig.set_title(
            f"Comparison between Different Models for Metric {metric.split('-')[0].title()} in Task {metric.split('-')[1].title()}"
        )
        plt.xlabel("Evaluation Category")
        plt.ylabel("Score")

        figure = fig.get_figure()
        figure.savefig(os.path.join(save_path, f"{metric}.png"), dpi=400)

        plt.close()

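# --- Usage sketch (illustrative addition, not part of the original file) ---
def _unieval_report_sketch():
    """A minimal sketch of the reporting flow, with made-up scores and paths:
    `stats` maps task -> category -> metric -> averaged score."""
    stats = {"summarization": {"Summarization": {"coherence": 0.71, "fluency": 0.83}}}
    save_unieval_results("my_model", stats, "unieval_results")  # writes unieval_results/my_model_results.csv
    analyze_unieval_results("unieval_results", "unieval_figures")  # one bar chart per metric-task pair
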
@@ -1,206 +0,0 @@
import io
import json
import os
import string
from typing import Dict

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import tqdm
from zhon import hanzi


def _make_w_io_base(f, mode: str):
    if not isinstance(f, io.IOBase):
        f_dirname = os.path.dirname(f)
        if f_dirname != "":
            os.makedirs(f_dirname, exist_ok=True)
        f = open(f, mode=mode)
    return f


def _make_r_io_base(f, mode: str):
    if not isinstance(f, io.IOBase):
        f = open(f, mode=mode)
    return f


def jdump(obj, f, mode="w", indent=4, default=str):
    """Dump a str or dictionary to a file in json format.

    Args:
        obj: An object to be written.
        f: A string path to the location on disk.
        mode: Mode for opening the file.
        indent: Indent for storing json dictionaries.
        default: A function to handle non-serializable entries; defaults to `str`.
    """
    f = _make_w_io_base(f, mode)
    if isinstance(obj, (dict, list)):
        json.dump(obj, f, indent=indent, default=default, ensure_ascii=False)
    elif isinstance(obj, str):
        f.write(obj)
    else:
        raise ValueError(f"Unexpected type: {type(obj)}")
    f.close()


def jload(f, mode="r"):
    """Load a .json file into a dictionary."""
    f = _make_r_io_base(f, mode)
    jdict = json.load(f)
    f.close()
    return jdict

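# --- Usage sketch (illustrative addition, not part of the original file) ---
def _json_io_sketch():
    """A minimal round-trip sketch; the path is a made-up placeholder."""
    jdump({"model": "demo", "score": 0.5}, "outputs/demo.json")  # parent directory is created if missing
    return jload("outputs/demo.json")  # -> {"model": "demo", "score": 0.5}
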
def get_json_list(file_path):
    with open(file_path, "r") as f:
        json_list = []
        for line in f:
            json_list.append(json.loads(line))
        return json_list


def get_data_per_category(data, categories):
    data_per_category = {category: [] for category in categories}
    for item in data:
        category = item["category"]
        if category in categories:
            data_per_category[category].append(item)

    return data_per_category


def remove_punctuations(text: str) -> str:
    """
    Remove punctuations in the given text.
    It is used in the evaluation of automatic metrics.
    """

    punctuation = string.punctuation + hanzi.punctuation
    punctuation = set(punctuation)
    punctuation.difference_update(set("!@#$%&()<>?|,.\"'"))

    out = []
    for char in text:
        if char in punctuation:
            continue
        else:
            out.append(char)

    return "".join(out)


def remove_redundant_space(text: str) -> str:
    """
    Remove redundant spaces in the given text.
    It is used in the evaluation of automatic metrics.
    """

    return " ".join(text.split())


def preprocessing_text(text: str) -> str:
    """
    Preprocess the given text.
    It is used in the evaluation of automatic metrics.
    """

    return remove_redundant_space(remove_punctuations(text.lower()))

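# --- Usage sketch (illustrative addition, not part of the original file) ---
def _preprocessing_sketch():
    """A minimal sketch with a made-up sentence: lowercase, strip most punctuation
    (a small set such as !?,." is deliberately kept), and collapse extra spaces."""
    return preprocessing_text("Hello,   World!  【你好】")  # -> "hello, world! 你好"
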
def save_automatic_results(model_name: str, automatic_metric_stats: Dict[str, Dict], save_path: str) -> None:
    """
    Save automatic evaluation results of different categories for one model.
    """

    if not os.path.exists(save_path):
        os.makedirs(save_path)

    automatic_df = pd.DataFrame(automatic_metric_stats)
    automatic_df.to_csv(os.path.join(save_path, f"{model_name}_results.csv"), index=True)


def read_automatic_results(results_path: str, file_name: str) -> Dict[str, Dict]:
    """
    Read a csv file and return a dictionary which stores scores per metric.
    """

    results = pd.read_csv(os.path.join(results_path, file_name), index_col=0)

    results_dict = {metric: {} for metric in list(results.index)}
    for i, metric in enumerate(results_dict.keys()):
        for j, category in enumerate(list(results.columns)):
            if pd.isnull(results.iloc[i, j]):
                continue
            results_dict[metric][category] = results.iloc[i, j]

    return results_dict


def analyze_automatic_results(results_path: str, save_path: str) -> None:
    """
    Analyze and visualize all csv files in the given folder.
    """

    if not os.path.exists(results_path):
        raise Exception(f'The given directory "{results_path}" doesn\'t exist! No results found!')

    all_statistics = {}

    for file_name in os.listdir(results_path):
        if file_name.endswith("_results.csv"):
            model_name = file_name.split("_results.csv")[0]
            all_statistics[model_name] = read_automatic_results(results_path, file_name)

    if len(list(all_statistics.keys())) == 0:
        raise Exception(f'There are no csv files in the given directory "{results_path}"!')

    frame_all = {"model": [], "category": [], "metric": [], "score": []}
    frame_per_metric = {}
    for model_name, model_statistics in all_statistics.items():
        for metric, metric_statistics in model_statistics.items():
            if frame_per_metric.get(metric) is None:
                frame_per_metric[metric] = {"model": [], "category": [], "score": []}

            for category, category_score in metric_statistics.items():
                frame_all["model"].append(model_name)
                frame_all["category"].append(category)
                frame_all["metric"].append(metric)
                frame_all["score"].append(category_score)

                frame_per_metric[metric]["model"].append(model_name)
                frame_per_metric[metric]["category"].append(category)
                frame_per_metric[metric]["score"].append(category_score)

    if not os.path.exists(save_path):
        os.makedirs(save_path)

    frame_all = pd.DataFrame(frame_all)
    frame_all.to_csv(os.path.join(save_path, "automatic_evaluation_statistics.csv"))

    for metric in tqdm.tqdm(
        frame_per_metric.keys(),
        desc="automatic metrics: ",
        total=len(frame_per_metric.keys()),
    ):
        data = pd.DataFrame(frame_per_metric[metric])

        sns.set()
        fig = plt.figure(figsize=(16, 10))

        fig = sns.barplot(x="category", y="score", hue="model", data=data, dodge=True)
        fig.set_title(f"Comparison between Different Models for Metric {metric.title()}")
        plt.xlabel("Evaluation Category")
        plt.ylabel("Score")

        figure = fig.get_figure()
        figure.savefig(os.path.join(save_path, f"{metric}.png"), dpi=400)

        plt.close()