remove unnecessary step and update readme

Tong Li 2023-04-27 18:51:58 +08:00
parent 6ef7011462
commit aa77ddae33
4 changed files with 584 additions and 63 deletions

@ -5,21 +5,11 @@ In this directory we will introduce how you can evaluate your model with GPT-4.
## Evaluation Pipeline
The whole evaluation process undergoes two steps.
-1. Generate answers from different models: Use `generate_gpt35_answers.py` to generate answers of GPT 3.5 and use `generate_answers.py` to generate answers of your own models.
-2. Evaluate models using GPT 4: Use `evaluate.py` to evaluate model answers with GPT-4.
+1. Prepare the questions following the internal data structure in the data format section (described below).
+2. Generate answers from different models: Use `generate_gpt35_answers.py` to generate answers of GPT 3.5 and use `generate_answers.py` to generate answers of your own models.
+3. Evaluate models using GPT 4: Use `evaluate.py` to evaluate model answers with GPT-4.
### Generate Answers
-To generate answers, you should first format [FastChat's]([FastChat/question.jsonl at main · lm-sys/FastChat (github.com)](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl)) `question.jsonl` file. We do this formatting because we would like to add more questions later and the pipeline for generating new questions may follow that of Self-Instruct and Stanford Alpaca. An example script is given as follows.
-```shell
-python format_questions.py \
---questions_path "path to FastChat's question.jsonl" \
---save_path "path to the formatted file" \
-```
In `generate_answers.py`, the model will generate answers in a batch way and different GPU processes will do inference on different shards of the given questions. Once all GPU process generate its answers, `merge.py` will merge different shards of answers and output a single answer file. Finally, the script will also remove the answer shards. An example script is given as follows.
```shell
@ -107,16 +97,23 @@ We would like to mention that the evaluation of model answers using the GPT-3.5
## Data Format
### Questions
-We store questions in `questions.json`. The JSON file contains one list. Each element in the list is a question record.
-A question record has the following field:
-* `category` (str): The category of the question.
-* `instruction` (str): The question.
-* `input` (str): This is empty if you only use [FastChat's]([FastChat/question.jsonl at main · lm-sys/FastChat (github.com)](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl)) questions.
-* `output` (str): This is empty.
-* `id` (int): The question id.
+The file [questions.json](./sample/questions.json) shows the example questions used to evaluate the performance of the model. The current sample questions are collected from [FastChat](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl). Each question record has the following field:
+* `id` (id, compulsory): The ID of the instruction / question.
+* `instruction` (str, compulsory): The instruction / question for the LLM.
+* `input` (str, optional): The additional context of the instruction / question.
+* `output` (str, optional): The sample output of the instruction / question.
+* `category` (str, compulsory): The category of the instruction / question.
+Example:
+```
+{
+    "id": 0,
+    "instruction": "Help me summarize the following short story?",
+    "input": "{story}",
+    "output": "{summarized story}",
+    "category": "closed qa"
+}
+```
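The question fields listed above can be checked mechanically before running the pipeline. The following is a minimal sketch and not part of this commit: `validate_question`, `REQUIRED_FIELDS`, and `OPTIONAL_FIELDS` are hypothetical names, `id` is assumed to be an integer as in the sample file, and the second sample record is made up to show a failing case.

```python
import json

# Hypothetical names; which fields are compulsory follows the README list above.
# Assumption: `id` is an integer, as in the sample questions file.
REQUIRED_FIELDS = {"id": int, "instruction": str, "category": str}
OPTIONAL_FIELDS = {"input": str, "output": str}


def validate_question(record):
    """Return a list of problems found in one question record (empty if none)."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing compulsory field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"field {name} should be {expected.__name__}")
    for name, expected in OPTIONAL_FIELDS.items():
        if name in record and not isinstance(record[name], expected):
            problems.append(f"field {name} should be {expected.__name__}")
    return problems


# The first record is the README example; the second is an invented
# record that lacks the compulsory `category` field.
sample = json.loads("""
[
    {"id": 0, "instruction": "Help me summarize the following short story?",
     "input": "{story}", "output": "{summarized story}", "category": "closed qa"},
    {"id": 1, "instruction": "How can I improve my time management skills?"}
]
""")

for record in sample:
    print(record.get("id"), validate_question(record))
```

Running a check like this over the whole questions file before generating answers would surface malformed records early, rather than partway through inference.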
### Answers
@ -126,7 +123,7 @@ An answer record has the following field:
* `category` (str): The category of the question.
* `instruction` (str): The question.
-* `input` (str): This is empty if you only use [FastChat's]([FastChat/question.jsonl at main · lm-sys/FastChat (github.com)](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl)) questions.
+* `input` (str): This is empty if you only use [FastChat's]((https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl)) questions.
* `output` (str): The answer to the question.
* `id` (int): The question id.
@ -158,15 +155,11 @@ A record has the following field:
### Prompts
-The data format is the same with [FastChat's]([FastChat/prompt.jsonl at main · lm-sys/FastChat (github.com)](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/prompt.jsonl)) prompts.
+The data format is the same with [FastChat's](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/prompt.jsonl) prompts.
### Reviewer
-The data format is the same with [FastChat's]([FastChat/reviewer.jsonl at main · lm-sys/FastChat (github.com)](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/reviewer.jsonl)) reviewers.
+The data format is the same with [FastChat's](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/reviewer.jsonl) reviewers.
-## Plan
-- [ ] Extend the questions
## Citations

@ -1,31 +0,0 @@
import argparse
import os
import json
import copy

from utils import jdump, get_json_list


def format_questions(args):
    questions = get_json_list(args.questions_path)
    keys=questions[0].keys()

    formatted_questions=copy.deepcopy(questions)
    for i in range(len(formatted_questions)):
        formatted_questions[i]['instruction']=questions[i]['text']
        formatted_questions[i]['input']=""
        formatted_questions[i]['output']=""
        formatted_questions[i]['id']=questions[i]['question_id']
        for key in keys:
            if key=="category":
                continue
            del formatted_questions[i][key]

    jdump(formatted_questions, args.save_path)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--questions_path', type=str, default='table/question.jsonl')
    parser.add_argument('--save_path', type=str, default="table/questions.json")
    args = parser.parse_args()
    format_questions(args)

@ -1,3 +0,0 @@
python format_questions.py \
--questions_path "path to FastChat's question.jsonl" \
--save_path "path to the formatted file" \

@ -0,0 +1,562 @@
[
{
"category": "generic",
"instruction": "How can I improve my time management skills?",
"input": "",
"output": "",
"id": 1
},
{
"category": "generic",
"instruction": "What are the most effective ways to deal with stress?",
"input": "",
"output": "",
"id": 2
},
{
"category": "generic",
"instruction": "What are the main differences between Python and JavaScript programming languages?",
"input": "",
"output": "",
"id": 3
},
{
"category": "generic",
"instruction": "How can I increase my productivity while working from home?",
"input": "",
"output": "",
"id": 4
},
{
"category": "generic",
"instruction": "Can you explain the basics of quantum computing?",
"input": "",
"output": "",
"id": 5
},
{
"category": "generic",
"instruction": "What are the differences between plant-based and animal-based protein sources?",
"input": "",
"output": "",
"id": 6
},
{
"category": "generic",
"instruction": "How can I develop my critical thinking skills?",
"input": "",
"output": "",
"id": 7
},
{
"category": "generic",
"instruction": "What are the major challenges faced by the education sector today?",
"input": "",
"output": "",
"id": 8
},
{
"category": "generic",
"instruction": "What are the primary factors that influence consumer behavior?",
"input": "",
"output": "",
"id": 9
},
{
"category": "generic",
"instruction": "What are the most effective strategies for conflict resolution in the workplace?",
"input": "",
"output": "",
"id": 10
},
{
"category": "knowledge",
"instruction": "What are some potential implications of using a single-use plastic bottle versus a reusable bottle on both the environment and human health?",
"input": "",
"output": "",
"id": 11
},
{
"category": "knowledge",
"instruction": "What factors would you consider when designing an inclusive and accessible public transportation system?",
"input": "",
"output": "",
"id": 12
},
{
"category": "knowledge",
"instruction": "How can governments utilize fiscal and monetary policies to combat economic recessions?",
"input": "",
"output": "",
"id": 13
},
{
"category": "knowledge",
"instruction": "How do language and cultural barriers affect the way people communicate and form relationships in multicultural societies?",
"input": "",
"output": "",
"id": 14
},
{
"category": "knowledge",
"instruction": "Describe a scenario where artificial intelligence could be used to improve the quality and efficiency of healthcare delivery.",
"input": "",
"output": "",
"id": 15
},
{
"category": "knowledge",
"instruction": "Explain the process of gene editing using CRISPR-Cas9 technology, and discuss its potential applications and ethical implications.",
"input": "",
"output": "",
"id": 16
},
{
"category": "knowledge",
"instruction": "How do vaccinations work to protect individuals and communities from infectious diseases, and what is herd immunity?",
"input": "",
"output": "",
"id": 17
},
{
"category": "knowledge",
"instruction": "How do social media platforms influence the way people consume and share news, and what are the potential implications for the spread of misinformation?",
"input": "",
"output": "",
"id": 18
},
{
"category": "knowledge",
"instruction": "How do cultural, social, and economic factors influence people's food choices, and how can this knowledge be used to promote healthier diets?",
"input": "",
"output": "",
"id": 19
},
{
"category": "knowledge",
"instruction": "Explain the process of natural selection and how it contributes to the evolution and adaptation of species.",
"input": "",
"output": "",
"id": 20
},
{
"category": "roleplay",
"instruction": "How would you introduce yourself as a medieval knight at a royal banquet?",
"input": "",
"output": "",
"id": 21
},
{
"category": "roleplay",
"instruction": "As a pirate captain, what would you say to your crew to motivate them to search for hidden treasure?",
"input": "",
"output": "",
"id": 22
},
{
"category": "roleplay",
"instruction": "If you were a Shakespearean character, how would you declare your love for someone in a soliloquy?",
"input": "",
"output": "",
"id": 23
},
{
"category": "roleplay",
"instruction": "As a superhero, how would you explain your origin story to a curious child?",
"input": "",
"output": "",
"id": 24
},
{
"category": "roleplay",
"instruction": "Imagine you are a time traveler from the year 3000. What technological advancements would you tell people about?",
"input": "",
"output": "",
"id": 25
},
{
"category": "roleplay",
"instruction": "As a sports commentator, describe the winning play in the final seconds of a championship game.",
"input": "",
"output": "",
"id": 26
},
{
"category": "roleplay",
"instruction": "Pretend to be a world-famous chef. How would you describe your signature dish to a panel of judges?",
"input": "",
"output": "",
"id": 27
},
{
"category": "roleplay",
"instruction": "You are a mountain climber reaching the summit of Mount Everest. Describe your emotions and the view from the top.",
"input": "",
"output": "",
"id": 28
},
{
"category": "roleplay",
"instruction": "As a space colonist on Mars, describe your daily life and the challenges you face living on another planet.",
"input": "",
"output": "",
"id": 29
},
{
"category": "roleplay",
"instruction": "Pretend to be a character in a post-apocalyptic world. Describe how you survive and the allies you encounter.",
"input": "",
"output": "",
"id": 30
},
{
"category": "common-sense",
"instruction": "How can you determine if a restaurant is popular among locals or mainly attracts tourists, and why might this information be useful?",
"input": "",
"output": "",
"id": 31
},
{
"category": "common-sense",
"instruction": "What are some subtle clues that suggest someone is pretending to understand a topic or conversation when they are actually confused or uninformed?",
"input": "",
"output": "",
"id": 32
},
{
"category": "common-sense",
"instruction": "Why might someone choose to use a paper map or ask for directions instead of relying on a GPS device or smartphone app?",
"input": "",
"output": "",
"id": 33
},
{
"category": "common-sense",
"instruction": "How can you determine if a person is genuinely interested in a conversation or simply being polite?",
"input": "",
"output": "",
"id": 34
},
{
"category": "common-sense",
"instruction": "Why might someone prefer to shop at a small, locally-owned business instead of a large chain store, even if the prices are higher?",
"input": "",
"output": "",
"id": 35
},
{
"category": "common-sense",
"instruction": "How can you assess the credibility of a source of information, such as a news article or blog post, without relying solely on the reputation of the author or publisher?",
"input": "",
"output": "",
"id": 36
},
{
"category": "common-sense",
"instruction": "Why do some people enjoy the sensation of being scared, such as by watching horror movies or going on roller coasters, while others avoid these experiences?",
"input": "",
"output": "",
"id": 37
},
{
"category": "common-sense",
"instruction": "How can observing the behavior of other people in a social situation provide clues about cultural norms and expectations?",
"input": "",
"output": "",
"id": 38
},
{
"category": "common-sense",
"instruction": "Do we have a moral obligation to explore space, or should we focus on solving Earth's problems first?",
"input": "",
"output": "",
"id": 39
},
{
"category": "common-sense",
"instruction": "In a world where automation is becoming increasingly prevalent, is it more important to prioritize job creation or technological progress?",
"input": "",
"output": "",
"id": 40
},
{
"category": "fermi",
"instruction": "How many times does the average human blink in a lifetime? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
"input": "",
"output": "",
"id": 41
},
{
"category": "fermi",
"instruction": "How many atoms are in a grain of salt? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
"input": "",
"output": "",
"id": 42
},
{
"category": "fermi",
"instruction": "How many lightning strikes occur on Earth each day? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
"input": "",
"output": "",
"id": 43
},
{
"category": "fermi",
"instruction": "How many balloons would it take to lift a house like in the movie \"Up\"? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
"input": "",
"output": "",
"id": 44
},
{
"category": "fermi",
"instruction": "How many text messages are sent globally in a minute? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
"input": "",
"output": "",
"id": 45
},
{
"category": "fermi",
"instruction": "How many words are spoken daily on Earth? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
"input": "",
"output": "",
"id": 46
},
{
"category": "fermi",
"instruction": "How many snowflakes fall during a typical winter? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
"input": "",
"output": "",
"id": 47
},
{
"category": "fermi",
"instruction": "How many pages are in all the books ever written? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
"input": "",
"output": "",
"id": 48
},
{
"category": "fermi",
"instruction": "How many times has the Earth orbited the Sun since the beginning of life? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
"input": "",
"output": "",
"id": 49
},
{
"category": "fermi",
"instruction": "How many songs have been recorded throughout history? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
"input": "",
"output": "",
"id": 50
},
{
"category": "counterfactual",
"instruction": "What if the Internet had been invented during the Renaissance period?",
"input": "",
"output": "",
"id": 51
},
{
"category": "counterfactual",
"instruction": "What if the Aztecs had successfully repelled the Spanish conquistadors?",
"input": "",
"output": "",
"id": 52
},
{
"category": "counterfactual",
"instruction": "What if the Black Death had not occurred in the 14th century?",
"input": "",
"output": "",
"id": 53
},
{
"category": "counterfactual",
"instruction": "What if Isaac Newton had focused on biology instead of physics?",
"input": "",
"output": "",
"id": 54
},
{
"category": "counterfactual",
"instruction": "What if the Beatles had never formed as a band?",
"input": "",
"output": "",
"id": 55
},
{
"category": "counterfactual",
"instruction": "What if Alan Turing had not cracked the Enigma code during World War II?",
"input": "",
"output": "",
"id": 56
},
{
"category": "counterfactual",
"instruction": "What if the Suez Canal had never been constructed?",
"input": "",
"output": "",
"id": 57
},
{
"category": "counterfactual",
"instruction": "What if the Maya civilization had never mysteriously collapsed?",
"input": "",
"output": "",
"id": 58
},
{
"category": "counterfactual",
"instruction": "What if Christopher Columbus had not discovered the Americas?",
"input": "",
"output": "",
"id": 59
},
{
"category": "counterfactual",
"instruction": "What if Vincent van Gogh had been a successful artist during his lifetime?",
"input": "",
"output": "",
"id": 60
},
{
"category": "coding",
"instruction": "Develop a C++ program that reads a text file line by line and counts the number of occurrences of a specific word in the file.",
"input": "",
"output": "",
"id": 61
},
{
"category": "coding",
"instruction": "Implement a Python function to find the longest common subsequence of two input strings using dynamic programming.",
"input": "",
"output": "",
"id": 62
},
{
"category": "coding",
"instruction": "Implement a regular expression in Python to validate an email address.",
"input": "",
"output": "",
"id": 63
},
{
"category": "coding",
"instruction": "Write a program to find the nth Fibonacci number using dynamic programming.",
"input": "",
"output": "",
"id": 64
},
{
"category": "coding",
"instruction": "Implement a binary search algorithm to find a specific element in a sorted array.",
"input": "",
"output": "",
"id": 65
},
{
"category": "coding",
"instruction": "Implement a queue data structure using two stacks in Python.",
"input": "",
"output": "",
"id": 66
},
{
"category": "coding",
"instruction": "Implement a program to find the common elements in two arrays without using any extra data structures.",
"input": "",
"output": "",
"id": 67
},
{
"category": "math",
"instruction": "Given that f(x) = 5x^3 - 2x + 3, find the value of f(2).",
"input": "",
"output": "",
"id": 68
},
{
"category": "math",
"instruction": "Solve for x in the equation 3x + 10 = 5(x - 2).",
"input": "",
"output": "",
"id": 69
},
{
"category": "math",
"instruction": "If the endpoints of a line segment are (2, -2) and (10, 4), what is the length of the segment?",
"input": "",
"output": "",
"id": 70
},
{
"category": "writing",
"instruction": "Can you help me write a formal email to a potential business partner proposing a joint venture?",
"input": "",
"output": "",
"id": 71
},
{
"category": "writing",
"instruction": "Can you help me write a resignation letter to my current employer, while leaving on good terms and expressing gratitude for the opportunities provided?",
"input": "",
"output": "",
"id": 72
},
{
"category": "writing",
"instruction": "Use an appropriate format to structure a formal letter of recommendation for a student applying to a prestigious graduate program in computer science.",
"input": "",
"output": "",
"id": 73
},
{
"category": "writing",
"instruction": "Write a compelling product launch announcement email to inform our customers of our new software solution.",
"input": "",
"output": "",
"id": 74
},
{
"category": "writing",
"instruction": "Draft an apology email to a customer who experienced a delay in their order, and provide reassurance that the issue has been resolved.",
"input": "",
"output": "",
"id": 75
},
{
"category": "writing",
"instruction": "Write a script for a YouTube video exploring the history and cultural significance of jazz.",
"input": "",
"output": "",
"id": 76
},
{
"category": "writing",
"instruction": "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.",
"input": "",
"output": "",
"id": 77
},
{
"category": "writing",
"instruction": "Write a captivating movie review for a recently released science fiction film, discussing its plot, characters, and special effects.",
"input": "",
"output": "",
"id": 78
},
{
"category": "writing",
"instruction": "Structure a podcast script for an episode discussing the influence of streaming platforms on the music industry.",
"input": "",
"output": "",
"id": 79
},
{
"category": "writing",
"instruction": "Write a symphony concert review, discussing the orchestra's performance and overall audience experience.",
"input": "",
"output": "",
"id": 80
}
]