Commit Graph

168 Commits

Author SHA1 Message Date
Frank Lee
8bcad73677
[workflow] fixed the directory check in build (#3980) 2023-06-13 14:42:35 +08:00
Frank Lee
6718a2f285
[workflow] cancel duplicated workflow jobs (#3960) 2023-06-12 15:11:27 +08:00
digger yu
1aadeedeea
fix typo .github/workflows/scripts/ (#3946) 2023-06-09 10:30:50 +08:00
Frank Lee
5e2132dcff
[workflow] added docker latest tag for release (#3920) 2023-06-07 15:37:37 +08:00
Hongxin Liu
c25d421f3e
[devops] hotfix testmon cache clean logic (#3917) 2023-06-07 12:39:12 +08:00
Hongxin Liu
b5f0566363
[chat] add distributed PPO trainer (#3740)
* Detached ppo (#9)

* run the base

* working on dist ppo

* sync

* detached trainer

* update detached trainer. no maker update function

* facing init problem

* 1 maker 1 trainer detached run. but no model update

* facing cuda problem

* fix save functions

* verified maker update

* nothing

* add ignore

* analyze loss issue

* remove some debug codes

* facing 2m1t stuck issue

* 2m1t verified

* do not use torchrun

* working on 2m2t

* working on 2m2t

* initialize strategy in ray actor env

* facing actor's init order issue

* facing ddp model update issue (need unwrap ddp)

* unwrap ddp actor

* checking 1m2t stuck problem

* nothing

* set timeout for trainer choosing. It solves the stuck problem!

* delete some debug output

* rename to sync with upstream

* rename to sync with upstream

* coati rename

* nothing

* I am going to detach the replay buffer from the trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronous buffer operations

* experience_maker_holder performs target-revolving _send_experience() instead of length comparison.

* move code to ray subfolder

* working on pipeline inference

* apply comments

* working on pipeline strategy. in progress.

* remove pipeline code. clean this branch

* update remote parameters by state_dict. no test

* nothing

* state_dict sharding transfer

* merge debug branch

* gemini _unwrap_model fix

* simplify code

* simplify code & fix LoRALinear AttributeError

* critic unwrapped state_dict

---------

Co-authored-by: csric <richcsr256@gmail.com>

* [chat] add performance evaluator and fix bugs (#10)

* [chat] add performance evaluator for ray

* [chat] refactor debug arg

* [chat] support hf config

* [chat] fix generation

* [chat] add 1mmt dummy example

* [chat] fix gemini ckpt

* split experience to send (#11)

Co-authored-by: csric <richcsr256@gmail.com>

* [chat] refactor trainer and maker (#12)

* [chat] refactor experience maker holder

* [chat] refactor model init

* [chat] refactor trainer args

* [chat] refactor model init

* [chat] refactor trainer

* [chat] refactor experience sending logic and training loop args (#13)

* [chat] refactor experience send logic

* [chat] refactor trainer

* [chat] refactor trainer

* [chat] refactor experience maker

* [chat] refactor pbar

* [chat] refactor example folder (#14)

* [chat] support quant (#15)

* [chat] add quant

* [chat] add quant example

* prompt example (#16)

* prompt example

* prompt load csv data

* remove legacy try

---------

Co-authored-by: csric <richcsr256@gmail.com>

* [chat] add mmmt dummy example and refactor experience sending (#17)

* [chat] add mmmt dummy example

* [chat] refactor naive strategy

* [chat] fix stuck problem

* [chat] fix naive strategy

* [chat] optimize experience maker sending logic

* [chat] refactor sending assignment

* [chat] refactor performance evaluator (#18)

* Prompt Example & requires_grad state_dict & sharding state_dict (#19)

* prompt example

* prompt load csv data

* remove legacy try

* maker models require_grad set to False

* working on zero redundancy update

* mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.

* remove legacy examples

* remove legacy examples

* remove replay buffer tp state. bad design

---------

Co-authored-by: csric <richcsr256@gmail.com>

* state_dict sending adapts to new unwrap function (#20)

* prompt example

* prompt load csv data

* remove legacy try

* maker models require_grad set to False

* working on zero redundancy update

* mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.

* remove legacy examples

* remove legacy examples

* remove replay buffer tp state. bad design

* opt benchmark

* better script

* nothing

* [chat] strategy refactor unwrap model

* [chat] strategy refactor save model

* [chat] add docstr

* [chat] refactor trainer save model

* [chat] fix strategy typing

* [chat] refactor trainer save model

* [chat] update readme

* [chat] fix unit test

* working on lora reconstruction

* state_dict sending adapts to new unwrap function

* remove comments

---------

Co-authored-by: csric <richcsr256@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>

* [chat-ray] add readme (#21)

* add readme

* transparent graph

* add note background

---------

Co-authored-by: csric <richcsr256@gmail.com>

* [chat] get images from url (#22)

* Refactor/chat ray (#23)

* [chat] lora add todo

* [chat] remove unused pipeline strategy

* [chat] refactor example structure

* [chat] setup ci for ray

* [chat-ray] Support LoRA trainer. LoRA weights reconstruction. (#24)

* lora support prototype

* lora support

* 1mmt lora & remove useless code

---------

Co-authored-by: csric <richcsr256@gmail.com>

* [chat] fix test ci for ray

* [chat] fix test ci requirements for ray

* [chat] fix ray runtime env

* [chat] fix ray runtime env

* [chat] fix example ci docker args

* [chat] add debug info in trainer

* [chat] add nccl debug info

* [chat] skip ray test

* [doc] fix typo

---------

Co-authored-by: csric <59389055+CsRic@users.noreply.github.com>
Co-authored-by: csric <richcsr256@gmail.com>
2023-06-07 10:41:16 +08:00
Hongxin Liu
41fb7236aa
[devops] hotfix CI about testmon cache (#3910)
* [devops] hotfix CI about testmon cache

* [devops] fix testmon cache on pr
2023-06-06 18:58:58 +08:00
Hongxin Liu
ec9bbc0094
[devops] improving testmon cache (#3902)
* [devops] improving testmon cache

* [devops] fix branch name with slash

* [devops] fix branch name with slash

* [devops] fix edit action

* [devops] fix edit action

* [devops] fix edit action

* [devops] fix edit action

* [devops] fix edit action

* [devops] fix edit action

* [devops] update readme
2023-06-06 11:32:31 +08:00
Frank Lee
ae959a72a5
[workflow] fixed workflow check for docker build (#3849) 2023-05-25 16:42:34 +08:00
Frank Lee
54e97ed7ea
[workflow] supported test on CUDA 10.2 (#3841) 2023-05-25 14:14:34 +08:00
Frank Lee
84500b7799
[workflow] fixed testmon cache in build CI (#3806)
* [workflow] fixed testmon cache in build CI

* polish code
2023-05-24 14:59:40 +08:00
Frank Lee
05b8a8de58
[workflow] changed to doc build to be on schedule and release (#3825)
* [workflow] changed to doc build to be on schedule and release

* polish code
2023-05-24 10:50:19 +08:00
digger yu
7f8203af69
fix typo colossalai/auto_parallel autochunk fx/passes etc. (#3808) 2023-05-24 09:01:50 +08:00
Frank Lee
1e3b64f26c
[workflow] enabled doc build from a forked repo (#3815) 2023-05-23 17:49:53 +08:00
Frank Lee
ad93c736ea
[workflow] enable testing for develop & feature branch (#3801) 2023-05-23 11:21:15 +08:00
Frank Lee
788e07dbc5
[workflow] fixed the docker build workflow (#3794)
* [workflow] fixed the docker build workflow

* polish code
2023-05-22 16:30:32 +08:00
liuzeming
4d29c0f8e0
Fix/docker action (#3266)
* [docker] Add ARG VERSION to determine the Tag

* [workflow] fixed the version in the release docker workflow

---------

Co-authored-by: liuzeming <liuzeming@4paradigm.com>
2023-05-22 15:04:00 +08:00
Hongxin Liu
b4788d63ed
[devops] fix doc test on pr (#3782) 2023-05-19 16:28:57 +08:00
Hongxin Liu
5dd573c6b6
[devops] fix ci for document check (#3751)
* [doc] add test info

* [devops] update doc check ci

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] remove debug info and update invalid doc

* [devops] add essential comments
2023-05-17 11:24:22 +08:00
Hongxin Liu
c03bd7c6b2
[devops] make build on PR run automatically (#3748)
* [devops] make build on PR run automatically

* [devops] update build on pr condition
2023-05-17 11:17:37 +08:00
Hongxin Liu
afb239bbf8
[devops] update torch version of CI (#3725)
* [test] fix flop tensor test

* [test] fix autochunk test

* [test] fix lazyinit test

* [devops] update torch version of CI

* [devops] enable testmon

* [devops] fix ci

* [devops] fix ci

* [test] fix checkpoint io test

* [test] fix cluster test

* [test] fix timm test

* [devops] fix ci

* [devops] fix ci

* [devops] fix ci

* [devops] fix ci

* [devops] force sync to test ci

* [test] skip fsdp test
2023-05-15 17:20:56 +08:00
Hongxin Liu
50793b35f4
[gemini] accelerate inference (#3641)
* [gemini] support don't scatter after inference

* [chat] update colossalai strategy

* [chat] fix opt benchmark

* [chat] update opt benchmark

* [gemini] optimize inference

* [test] add gemini inference test

* [chat] fix unit test ci

* [chat] fix ci

* [chat] fix ci

* [chat] skip checkpoint test
2023-04-26 16:32:40 +08:00
Hongxin Liu
179558a87a
[devops] fix chat ci (#3628) 2023-04-24 10:55:14 +08:00
digger-yu
633bac2f58
[doc] .github/workflows/README.md (#3605)
Fixed several word spelling errors
change "compatiblity" to "compatibility" etc.
2023-04-20 10:36:28 +08:00
Camille Zhong
36a519b49f
Update test_ci.sh
update

Update test_ci.sh

Update test_ci.sh

Update test_ci.sh

Update test_ci.sh

Update test_ci.sh

Update test_ci.sh

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update test_ci.sh

Update test_ci.sh

update

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

update ci

Update test_ci.sh

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update test_ci.sh

Update test_ci.sh

Update run_chatgpt_examples.yml

Update test_ci.sh

Update test_ci.sh

Update test_ci.sh

update test ci

RoBERTa for RLHF Stage 2 & 3 (still in testing)

Revert "Add RoBERTa for RLHF Stage 2 & 3 (test)"

This reverts commit 06741d894d.

Add RoBERTa for RLHF stage 2 & 3

1. add roberta folder under model folder
2. add roberta option in train_reward_model.py
3. add some test in testci

Update test_ci.sh

Revert "Update test_ci.sh"

This reverts commit 9c7352b81766f3177d31eeec0ec178a301df966a.

Add RoBERTa for RLHF Stage 2 & 3 (test)

RoBERTa for RLHF Stage 2 & 3 (still in testing)

Revert "Add RoBERTa for RLHF Stage 2 & 3 (test)"

This reverts commit 06741d894d.

Add RoBERTa for RLHF stage 2 & 3

1. add roberta folder under model folder
2. add roberta option in train_reward_model.py
3. add some test in testci

Update test_ci.sh

Revert "Update test_ci.sh"

This reverts commit 9c7352b81766f3177d31eeec0ec178a301df966a.

update roberta with coati

chat ci update

Revert "chat ci update"

This reverts commit 17ae7ae01fa752bd3289fc39069868fde99cf846.

[test]chat_update_ci

Update test_ci.sh

Update test_ci.sh

test

Update gpt_critic.py

Update gpt_critic.py

Update run_chatgpt_unit_tests.yml

update test ci

update

update

update

update

Update test_ci.sh

update

Update test_ci.sh

Update test_ci.sh

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml
2023-04-18 14:33:12 +08:00
digger-yu
6e7e43c6fe
[doc] Update .github/workflows/README.md (#3577)
Optimization Code
I think there were two extra $ entered here, which have been deleted
2023-04-17 16:27:38 +08:00
Frank Lee
80eba05b0a
[test] refactor tests with spawn (#3452)
* [test] added spawn decorator

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code
2023-04-06 14:51:35 +08:00
Hakjin Lee
1653063fce
[CI] Fix pre-commit workflow (#3238) 2023-03-27 09:41:08 +08:00
Frank Lee
169ed4d24e
[workflow] purged extension cache before GPT test (#3128) 2023-03-14 10:11:32 +08:00
Frank Lee
91ccf97514
[workflow] fixed doc build trigger condition (#3072) 2023-03-09 17:31:41 +08:00
Frank Lee
8fedc8766a
[workflow] supported conda package installation in doc test (#3028)
* [workflow] supported conda package installation in doc test

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code
2023-03-07 14:21:26 +08:00
Frank Lee
2cd6ba3098
[workflow] fixed the post-commit failure when no formatting needed (#3020)
* [workflow] fixed the post-commit failure when no formatting needed

* polish code

* polish code

* polish code
2023-03-07 13:35:45 +08:00
Frank Lee
77b88a3849
[workflow] added auto doc test on PR (#2929)
* [workflow] added auto doc test on PR

* [workflow] added doc test workflow

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code
2023-02-28 11:10:38 +08:00
Frank Lee
e33c043dec
[workflow] moved pre-commit to post-commit (#2895) 2023-02-24 14:41:33 +08:00
LuGY
dbd0fd1522
[CI/CD] fix nightly release CD running on forked repo (#2812)
* [CI/CD] fix nightly release CD running on forked repo

* fix misunderstanding of dispatch

* remove some build condition, enable notify even when release failed
2023-02-18 13:27:13 +08:00
ver217
9c0943ecdb
[chatgpt] optimize generation kwargs (#2717)
* [chatgpt] ppo trainer use default generate args

* [chatgpt] example remove generation preparing fn

* [chatgpt] benchmark remove generation preparing fn

* [chatgpt] fix ci
2023-02-15 13:59:58 +08:00
Frank Lee
2045d45ab7
[doc] updated documentation version list (#2715) 2023-02-15 11:24:18 +08:00
ver217
f6b4ca4e6c
[devops] add chatgpt ci (#2713) 2023-02-15 10:53:54 +08:00
Frank Lee
89f8975fb8
[workflow] fixed tensor-nvme build caching (#2711) 2023-02-15 10:12:55 +08:00
Frank Lee
5cd8cae0c9
[workflow] fixed community report ranking (#2680) 2023-02-13 17:04:49 +08:00
Frank Lee
c44fd0c867
[workflow] added trigger to build doc upon release (#2678) 2023-02-13 16:53:26 +08:00
Frank Lee
327bc06278
[workflow] added doc build test (#2675)
* [workflow] added doc build test

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code
2023-02-13 15:55:57 +08:00
Frank Lee
94f87f9651
[workflow] fixed gpu memory check condition (#2659) 2023-02-10 09:59:07 +08:00
Frank Lee
85b2303b55
[doc] migrate the markdown files (#2652) 2023-02-09 14:21:38 +08:00
Frank Lee
8518263b80
[test] fixed the triton version for testing (#2608) 2023-02-07 13:49:38 +08:00
Frank Lee
aa7e9e4794
[workflow] fixed the test coverage report (#2614)
* [workflow] fixed the test coverage report

* polish code
2023-02-07 11:50:53 +08:00
Frank Lee
b3973b995a
[workflow] fixed test coverage report (#2611) 2023-02-07 11:02:56 +08:00
Frank Lee
f566b0ce6b
[workflow] fixed broken release workflows (#2604) 2023-02-06 21:40:19 +08:00
Frank Lee
f7458d3ec7
[release] v0.2.1 (#2602)
* [release] v0.2.1

* polish code
2023-02-06 20:46:18 +08:00
Frank Lee
719c4d5553
[doc] updated readme for CI/CD (#2600) 2023-02-06 17:42:15 +08:00