# Select the n GPUs with the least memory currently in use and expose them via
# CUDA_VISIBLE_DEVICES (defaults to all GPUs when no argument is given).
set_n_least_used_CUDA_VISIBLE_DEVICES() {
    local n=${1:-"9999"}
    echo "GPU Memory Usage:"
    # Query per-GPU memory usage, index the rows, sort by usage, keep the n least-used ids.
    local FIRST_N_GPU_IDS=$(nvidia-smi --query-gpu=memory.used --format=csv \
        | tail -n +2 \
        | nl -v 0 \
        | tee /dev/tty \
        | sort -g -k 2 \
        | awk '{print $1}' \
        | head -n $n)
    export CUDA_VISIBLE_DEVICES=$(echo $FIRST_N_GPU_IDS | sed 's/ /,/g')
    echo "Now CUDA_VISIBLE_DEVICES is set to:"
    echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
}

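# Usage sketch (illustrative values, not from the original script): running
#   set_n_least_used_CUDA_VISIBLE_DEVICES 4
# prints the per-GPU memory table and exports something like CUDA_VISIBLE_DEVICES=3,1,0,6,
# i.e. the ids of the four GPUs with the least memory in use.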
set_n_least_used_CUDA_VISIBLE_DEVICES 2

export RAY_NAMESPACE="admin"

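# Assumption (not stated in the original script): if the launcher below attaches to an
# already-running Ray cluster under the namespace exported above, a minimal local head
# node can be brought up first with:
#
#   ray start --head
#
# and torn down afterwards with `ray stop`.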
# Launch the detached PPO example (1 experience maker, 1 trainer) on the prompt dataset.
python 1m1t.py "/path/to/prompts.csv" \
    --trainer_strategy colossalai_zero2 --maker_strategy naive --lora_rank 2 --pretrain "facebook/opt-350m" --model 'opt' \
    --num_episodes 10 --max_timesteps 10 --update_timesteps 10 \
    --max_epochs 10 --debug