ColossalAI

mirror of https://github.com/hpcaitech/ColossalAI.git synced 2025-08-19 08:27:23 +00:00

Author	SHA1	Message	Date
Frank Lee	8bcad73677	[workflow] fixed the directory check in build (#3980)	2023-06-13 14:42:35 +08:00
Frank Lee	6718a2f285	[workflow] cancel duplicated workflow jobs (#3960)	2023-06-12 15:11:27 +08:00
digger yu	1aadeedeea	fix typo .github/workflows/scripts/ (#3946)	2023-06-09 10:30:50 +08:00
Frank Lee	5e2132dcff	[workflow] added docker latest tag for release (#3920)	2023-06-07 15:37:37 +08:00
Hongxin Liu	c25d421f3e	[devops] hotfix testmon cache clean logic (#3917)	2023-06-07 12:39:12 +08:00
Hongxin Liu	b5f0566363	[chat] add distributed PPO trainer (#3740 ) * Detached ppo (#9) * run the base * working on dist ppo * sync * detached trainer * update detached trainer. no maker update function * facing init problem * 1 maker 1 trainer detached run. but no model update * facing cuda problem * fix save functions * verified maker update * nothing * add ignore * analyize loss issue * remove some debug codes * facing 2m1t stuck issue * 2m1t verified * do not use torchrun * working on 2m2t * working on 2m2t * initialize strategy in ray actor env * facing actor's init order issue * facing ddp model update issue (need unwarp ddp) * unwrap ddp actor * checking 1m2t stuck problem * nothing * set timeout for trainer choosing. It solves the stuck problem! * delete some debug output * rename to sync with upstream * rename to sync with upstream * coati rename * nothing * I am going to detach the replaybuffer from trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronized buffer operations * experience_maker_holder performs target-revolving _send_experience() instead of length comparison. * move code to ray subfolder * working on pipeline inference * apply comments * working on pipeline strategy. in progress. * remove pipeline code. clean this branch * update remote parameters by state_dict. 
no test * nothing * state_dict sharding transfer * merge debug branch * gemini _unwrap_model fix * simplify code * simplify code & fix LoRALinear AttributeError * critic unwrapped state_dict --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] add perfomance evaluator and fix bugs (#10) * [chat] add performance evaluator for ray * [chat] refactor debug arg * [chat] support hf config * [chat] fix generation * [chat] add 1mmt dummy example * [chat] fix gemini ckpt * split experience to send (#11) Co-authored-by: csric <richcsr256@gmail.com> * [chat] refactor trainer and maker (#12) * [chat] refactor experience maker holder * [chat] refactor model init * [chat] refactor trainer args * [chat] refactor model init * [chat] refactor trainer * [chat] refactor experience sending logic and training loop args (#13) * [chat] refactor experience send logic * [chat] refactor trainer * [chat] refactor trainer * [chat] refactor experience maker * [chat] refactor pbar * [chat] refactor example folder (#14) * [chat] support quant (#15) * [chat] add quant * [chat] add quant example * prompt example (#16) * prompt example * prompt load csv data * remove legacy try --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] add mmmt dummy example and refactor experience sending (#17) * [chat] add mmmt dummy example * [chat] refactor naive strategy * [chat] fix struck problem * [chat] fix naive strategy * [chat] optimize experience maker sending logic * [chat] refactor sending assignment * [chat] refactor performance evaluator (#18) * Prompt Example & requires_grad state_dict & sharding state_dict (#19) * prompt example * prompt load csv data * remove legacy try * maker models require_grad set to False * working on zero redundancy update * mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad. * remove legacy examples * remove legacy examples * remove replay buffer tp state. 
bad design --------- Co-authored-by: csric <richcsr256@gmail.com> * state_dict sending adapts to new unwrap function (#20) * prompt example * prompt load csv data * remove legacy try * maker models require_grad set to False * working on zero redundancy update * mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad. * remove legacy examples * remove legacy examples * remove replay buffer tp state. bad design * opt benchmark * better script * nothing * [chat] strategy refactor unwrap model * [chat] strategy refactor save model * [chat] add docstr * [chat] refactor trainer save model * [chat] fix strategy typing * [chat] refactor trainer save model * [chat] update readme * [chat] fix unit test * working on lora reconstruction * state_dict sending adapts to new unwrap function * remove comments --------- Co-authored-by: csric <richcsr256@gmail.com> Co-authored-by: ver217 <lhx0217@gmail.com> * [chat-ray] add readme (#21) * add readme * transparent graph * add note background --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] get images from url (#22) * Refactor/chat ray (#23) * [chat] lora add todo * [chat] remove unused pipeline strategy * [chat] refactor example structure * [chat] setup ci for ray * [chat-ray] Support LoRA trainer. LoRA weights reconstruction. (#24) * lora support prototype * lora support * 1mmt lora & remove useless code --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] fix test ci for ray * [chat] fix test ci requirements for ray * [chat] fix ray runtime env * [chat] fix ray runtime env * [chat] fix example ci docker args * [chat] add debug info in trainer * [chat] add nccl debug info * [chat] skip ray test * [doc] fix typo --------- Co-authored-by: csric <59389055+CsRic@users.noreply.github.com> Co-authored-by: csric <richcsr256@gmail.com>	2023-06-07 10:41:16 +08:00
Hongxin Liu	41fb7236aa	[devops] hotfix CI about testmon cache (#3910) * [devops] hotfix CI about testmon cache * [devops] fix testmon cahe on pr	2023-06-06 18:58:58 +08:00
Hongxin Liu	ec9bbc0094	[devops] improving testmon cache (#3902) * [devops] improving testmon cache * [devops] fix branch name with slash * [devops] fix branch name with slash * [devops] fix edit action * [devops] fix edit action * [devops] fix edit action * [devops] fix edit action * [devops] fix edit action * [devops] fix edit action * [devops] update readme	2023-06-06 11:32:31 +08:00
Frank Lee	ae959a72a5	[workflow] fixed workflow check for docker build (#3849)	2023-05-25 16:42:34 +08:00
Frank Lee	54e97ed7ea	[workflow] supported test on CUDA 10.2 (#3841)	2023-05-25 14:14:34 +08:00
Frank Lee	84500b7799	[workflow] fixed testmon cache in build CI (#3806) * [workflow] fixed testmon cache in build CI * polish code	2023-05-24 14:59:40 +08:00
Frank Lee	05b8a8de58	[workflow] changed to doc build to be on schedule and release (#3825) * [workflow] changed to doc build to be on schedule and release * polish code	2023-05-24 10:50:19 +08:00
digger yu	7f8203af69	fix typo colossalai/auto_parallel autochunk fx/passes etc. (#3808)	2023-05-24 09:01:50 +08:00
Frank Lee	1e3b64f26c	[workflow] enblaed doc build from a forked repo (#3815)	2023-05-23 17:49:53 +08:00
Frank Lee	ad93c736ea	[workflow] enable testing for develop & feature branch (#3801)	2023-05-23 11:21:15 +08:00
Frank Lee	788e07dbc5	[workflow] fixed the docker build workflow (#3794) * [workflow] fixed the docker build workflow * polish code	2023-05-22 16:30:32 +08:00
liuzeming	4d29c0f8e0	Fix/docker action (#3266) * [docker] Add ARG VERSION to determine the Tag * [workflow] fixed the version in the release docker workflow --------- Co-authored-by: liuzeming <liuzeming@4paradigm.com>	2023-05-22 15:04:00 +08:00
Hongxin Liu	b4788d63ed	[devops] fix doc test on pr (#3782)	2023-05-19 16:28:57 +08:00
Hongxin Liu	5dd573c6b6	[devops] fix ci for document check (#3751) * [doc] add test info * [devops] update doc check ci * [devops] add debug info * [devops] add debug info * [devops] add debug info * [devops] add debug info * [devops] add debug info * [devops] add debug info * [devops] add debug info * [devops] add debug info * [devops] add debug info * [devops] add debug info * [devops] remove debug info and update invalid doc * [devops] add essential comments	2023-05-17 11:24:22 +08:00
Hongxin Liu	c03bd7c6b2	[devops] make build on PR run automatically (#3748) * [devops] make build on PR run automatically * [devops] update build on pr condition	2023-05-17 11:17:37 +08:00
Hongxin Liu	afb239bbf8	[devops] update torch version of CI (#3725) * [test] fix flop tensor test * [test] fix autochunk test * [test] fix lazyinit test * [devops] update torch version of CI * [devops] enable testmon * [devops] fix ci * [devops] fix ci * [test] fix checkpoint io test * [test] fix cluster test * [test] fix timm test * [devops] fix ci * [devops] fix ci * [devops] fix ci * [devops] fix ci * [devops] force sync to test ci * [test] skip fsdp test	2023-05-15 17:20:56 +08:00
Hongxin Liu	50793b35f4	[gemini] accelerate inference (#3641) * [gemini] support don't scatter after inference * [chat] update colossalai strategy * [chat] fix opt benchmark * [chat] update opt benchmark * [gemini] optimize inference * [test] add gemini inference test * [chat] fix unit test ci * [chat] fix ci * [chat] fix ci * [chat] skip checkpoint test	2023-04-26 16:32:40 +08:00
Hongxin Liu	179558a87a	[devops] fix chat ci (#3628)	2023-04-24 10:55:14 +08:00
digger-yu	633bac2f58	[doc] .github/workflows/README.md (#3605) Fixed several word spelling errors change "compatiblity" to "compatibility" etc.	2023-04-20 10:36:28 +08:00
Camille Zhong	36a519b49f	Update test_ci.sh update Update test_ci.sh Update test_ci.sh Update test_ci.sh Update test_ci.sh Update test_ci.sh Update test_ci.sh Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update test_ci.sh Update test_ci.sh update Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml update ci Update test_ci.sh Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml Update test_ci.sh Update test_ci.sh Update run_chatgpt_examples.yml Update test_ci.sh Update test_ci.sh Update test_ci.sh update test ci RoBERTa for RLHF Stage 2 & 3 (still in testing) Revert "Add RoBERTa for RLHF Stage 2 & 3 (test)" This reverts commit `06741d894d`. Add RoBERTa for RLHF stage 2 & 3 1. add roberta folder under model folder 2. add roberta option in train_reward_model.py 3. add some test in testci Update test_ci.sh Revert "Update test_ci.sh" This reverts commit 9c7352b81766f3177d31eeec0ec178a301df966a. Add RoBERTa for RLHF Stage 2 & 3 (test) RoBERTa for RLHF Stage 2 & 3 (still in testing) Revert "Add RoBERTa for RLHF Stage 2 & 3 (test)" This reverts commit `06741d894d`. Add RoBERTa for RLHF stage 2 & 3 1. add roberta folder under model folder 2. add roberta option in train_reward_model.py 3. add some test in testci Update test_ci.sh Revert "Update test_ci.sh" This reverts commit 9c7352b81766f3177d31eeec0ec178a301df966a. update roberta with coati chat ci update Revert "chat ci update" This reverts commit 17ae7ae01fa752bd3289fc39069868fde99cf846. 
[test]chat_update_ci Update test_ci.sh Update test_ci.sh test Update gpt_critic.py Update gpt_critic.py Update run_chatgpt_unit_tests.yml update test ci update update update update Update test_ci.sh update Update test_ci.sh Update test_ci.sh Update run_chatgpt_examples.yml Update run_chatgpt_examples.yml	2023-04-18 14:33:12 +08:00
digger-yu	6e7e43c6fe	[doc] Update .github/workflows/README.md (#3577) Optimization Code I think there were two extra $ entered here, which have been deleted	2023-04-17 16:27:38 +08:00
Frank Lee	80eba05b0a	[test] refactor tests with spawn (#3452) * [test] added spawn decorator * polish code * polish code * polish code * polish code * polish code * polish code	2023-04-06 14:51:35 +08:00
Hakjin Lee	1653063fce	[CI] Fix pre-commit workflow (#3238)	2023-03-27 09:41:08 +08:00
Frank Lee	169ed4d24e	[workflow] purged extension cache before GPT test (#3128)	2023-03-14 10:11:32 +08:00
Frank Lee	91ccf97514	[workflow] fixed doc build trigger condition (#3072)	2023-03-09 17:31:41 +08:00
Frank Lee	8fedc8766a	[workflow] supported conda package installation in doc test (#3028) * [workflow] supported conda package installation in doc test * polish code * polish code * polish code * polish code * polish code * polish code	2023-03-07 14:21:26 +08:00
Frank Lee	2cd6ba3098	[workflow] fixed the post-commit failure when no formatting needed (#3020) * [workflow] fixed the post-commit failure when no formatting needed * polish code * polish code * polish code	2023-03-07 13:35:45 +08:00
Frank Lee	77b88a3849	[workflow] added auto doc test on PR (#2929) * [workflow] added auto doc test on PR * [workflow] added doc test workflow * polish code * polish code * polish code * polish code * polish code * polish code * polish code	2023-02-28 11:10:38 +08:00
Frank Lee	e33c043dec	[workflow] moved pre-commit to post-commit (#2895)	2023-02-24 14:41:33 +08:00
LuGY	dbd0fd1522	[CI/CD] fix nightly release CD running on forked repo (#2812) * [CI/CD] fix nightly release CD running on forker repo * fix misunderstanding of dispatch * remove some build condition, enable notify even when release failed	2023-02-18 13:27:13 +08:00
ver217	9c0943ecdb	[chatgpt] optimize generation kwargs (#2717) * [chatgpt] ppo trainer use default generate args * [chatgpt] example remove generation preparing fn * [chatgpt] benchmark remove generation preparing fn * [chatgpt] fix ci	2023-02-15 13:59:58 +08:00
Frank Lee	2045d45ab7	[doc] updated documentation version list (#2715)	2023-02-15 11:24:18 +08:00
ver217	f6b4ca4e6c	[devops] add chatgpt ci (#2713)	2023-02-15 10:53:54 +08:00
Frank Lee	89f8975fb8	[workflow] fixed tensor-nvme build caching (#2711)	2023-02-15 10:12:55 +08:00
Frank Lee	5cd8cae0c9	[workflow] fixed communtity report ranking (#2680)	2023-02-13 17:04:49 +08:00
Frank Lee	c44fd0c867	[workflow] added trigger to build doc upon release (#2678)	2023-02-13 16:53:26 +08:00
Frank Lee	327bc06278	[workflow] added doc build test (#2675) * [workflow] added doc build test * polish code * polish code * polish code * polish code * polish code * polish code * polish code * polish code * polish code	2023-02-13 15:55:57 +08:00
Frank Lee	94f87f9651	[workflow] fixed gpu memory check condition (#2659)	2023-02-10 09:59:07 +08:00
Frank Lee	85b2303b55	[doc] migrate the markdown files (#2652)	2023-02-09 14:21:38 +08:00
Frank Lee	8518263b80	[test] fixed the triton version for testing (#2608)	2023-02-07 13:49:38 +08:00
Frank Lee	aa7e9e4794	[workflow] fixed the test coverage report (#2614) * [workflow] fixed the test coverage report * polish code	2023-02-07 11:50:53 +08:00
Frank Lee	b3973b995a	[workflow] fixed test coverage report (#2611)	2023-02-07 11:02:56 +08:00
Frank Lee	f566b0ce6b	[workflow] fixed broken rellease workflows (#2604)	2023-02-06 21:40:19 +08:00
Frank Lee	f7458d3ec7	[release] v0.2.1 (#2602) * [release] v0.2.1 * polish code	2023-02-06 20:46:18 +08:00
Frank Lee	719c4d5553	[doc] updated readme for CI/CD (#2600)	2023-02-06 17:42:15 +08:00

168 Commits