Commit Graph

280 Commits

Author SHA1 Message Date
flybird11111
9d83c6d715
[lazy] fix lazy cls init (#5720)
* fix

* fix

* fix

* fix

* fix

* remove kernel install

* rebase

revert

fix

* fix

* fix
2024-05-17 18:18:59 +08:00
Yuanheng Zhao
5bbab1533a
[ci] Fix example tests (#5714)
* [fix] revise timeout value on example CI

* trivial
2024-05-14 16:08:51 +08:00
Yuanheng Zhao
55cc7f3df7
[Fix] Fix Inference Example, Tests, and Requirements (#5688)
* clean requirements

* modify example inference struct

* add test ci scripts

* mark test_infer as submodule

* rm deprecated cls & deps

* fix import of HAS_FLASH_ATTN

* prune inference tests to be run

* prune triton kernel tests

* increment pytest timeout mins

* revert import path in openmoe
2024-05-08 11:30:15 +08:00
Edenzzzz
58954b2986
[misc] Add an existing issue checkbox in bug report (#5691)
Co-authored-by: Wenxuan(Eden) Tan <wtan45@wisc.edu>
2024-05-07 12:18:50 +08:00
Hongxin Liu
7f8b16635b
[misc] refactor launch API and tensor constructor (#5666)
* [misc] remove config arg from initialize

* [misc] remove old tensor constructor

* [plugin] add npu support for ddp

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [devops] fix doc test ci

* [test] fix test launch

* [doc] update launch doc

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-04-29 10:40:11 +08:00
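
The launch refactor above (#5666) drops the legacy config argument from colossalai's initialization. A minimal sketch of the resulting call pattern, assuming a torchrun-launched script; exact signatures may differ across versions:

```python
import colossalai
import torch.distributed as dist

def main():
    # After #5666, launch_from_torch reads rank/world-size from the
    # environment set by torchrun; no config dict is passed anymore.
    colossalai.launch_from_torch()
    print(f"initialized rank {dist.get_rank()} of {dist.get_world_size()}")

if __name__ == "__main__":
    main()
```

Run with, for example, `torchrun --nproc_per_node=2 script.py`.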
Hongxin Liu
c1594e4bad
[devops] fix release docker ci (#5665) 2024-04-27 19:11:57 +08:00
Hongxin Liu
641b1ee71a
[devops] remove post commit ci (#5566)
* [devops] remove post commit ci

* [misc] run pre-commit on all files

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-04-08 15:09:40 +08:00
YeAnbang
df5e9c53cf
[ColossalChat] Update RLHF V2 (#5286)
* Add dpo. Fix sft, ppo, lora. Refactor all

* fix and tested ppo

* 2nd round refactor

* add ci tests

* fix ci

* fix ci

* fix readme, style

* fix readme style

* fix style, fix benchmark

* reproduce benchmark result, remove useless files

* rename to ColossalChat

* use new image

* fix ci workflow

* fix ci

* use local model/tokenizer for ci tests

* fix ci

* fix ci

* fix ci

* fix ci timeout

* fix rm progress bar. fix ci timeout

* fix ci

* fix ci typo

* remove 3d plugin from ci temporarily

* test environment

* cannot save optimizer

* support chat template

* fix readme

* fix path

* test ci locally

* restore build_or_pr

* fix ci data path

* fix benchmark

* fix ci, move ci tests to 3080, disable fast tokenizer

* move ci to 85

* support flash attention 2

* add all-in-one data preparation script. Fix colossal-llama2-chat chat template

* add hardware requirements

* move ci test data

* fix save_model, add unwrap

* fix missing bos

* fix missing bos; support grad accumulation with gemini

* fix ci

* fix ci

* fix ci

* fix llama2 chat template config

* debug sft

* debug sft

* fix colossalai version requirement

* fix ci

* add sanity check to prevent NaN loss

* fix requirements

* add dummy data generation script

* add dummy data generation script

* add dummy data generation script

* add dummy data generation script

* update readme

* update readme

* update readme and ignore

* fix logger bug

* support parallel_output

* modify data preparation logic

* fix tokenization

* update lr

* fix inference

* run pre-commit

---------

Co-authored-by: Tong Li <tong.li352711588@gmail.com>
2024-03-29 14:12:29 +08:00
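
The RLHF V2 update above adds DPO alongside SFT, PPO, and LoRA. For orientation, a minimal sketch of the standard DPO loss such trainers implement (generic textbook form, not the repository's actual code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit reward margin of the policy relative to the frozen reference.
    logits = (policy_chosen_logp - ref_chosen_logp) \
           - (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(beta * logits).mean()

loss = dpo_loss(torch.tensor([-1.0]), torch.tensor([-2.0]),
                torch.tensor([-1.2]), torch.tensor([-1.8]))
print(loss)  # small positive scalar
```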
Hongxin Liu
19e1a5cf16
[shardformer] update colo attention to support custom mask (#5510)
* [feature] refactor colo attention (#5462)

* [extension] update api

* [feature] add colo attention

* [feature] update sdpa

* [feature] update npu attention

* [feature] update flash-attn

* [test] add flash attn test

* [test] update flash attn test

* [shardformer] update modeling to fit colo attention (#5465)

* [misc] refactor folder structure

* [shardformer] update llama flash-attn

* [shardformer] fix llama policy

* [devops] update tensornvme install

* [test] update llama test

* [shardformer] update colo attn kernel dispatch

* [shardformer] update blip2

* [shardformer] update chatglm

* [shardformer] update gpt2

* [shardformer] update gptj

* [shardformer] update opt

* [shardformer] update vit

* [shardformer] update colo attention mask prep

* [shardformer] update whisper

* [test] fix shardformer tests (#5514)

* [test] fix shardformer tests

* [test] fix shardformer tests
2024-03-27 11:19:32 +08:00
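
The colo-attention rework above dispatches between flash-attn, SDPA, and NPU kernels while accepting custom masks. As a generic illustration of the custom-mask path (plain PyTorch SDPA, not ColossalAI's dispatch code):

```python
import torch
import torch.nn.functional as F

batch, heads, seq, dim = 2, 4, 16, 32
q, k, v = (torch.randn(batch, heads, seq, dim) for _ in range(3))

# Boolean mask: True = "may attend". Here every query ignores the last
# four key positions, something a fixed causal flag cannot express.
mask = torch.ones(seq, seq, dtype=torch.bool)
mask[:, -4:] = False

out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape)  # torch.Size([2, 4, 16, 32])
```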
Hongxin Liu
a7790a92e8
[devops] fix example test ci (#5504) 2024-03-26 15:09:05 +08:00
Hongxin Liu
f2e8b9ef9f
[devops] fix compatibility (#5444)
* [devops] fix compatibility

* [hotfix] update compatibility test on pr

* [devops] fix compatibility

* [devops] record duration during comp test

* [test] decrease test duration

* fix falcon
2024-03-13 15:24:13 +08:00
Hongxin Liu
070df689e6
[devops] fix extension building (#5427) 2024-03-05 15:35:54 +08:00
flybird11111
29695cf70c
[example] add gpt2 benchmark example script. (#5295)
* benchmark gpt2

* fix

fix

fix

fix

* [doc] fix typo in Colossal-LLaMA-2/README.md (#5247)

* [workflow] fixed build CI (#5240)

* [workflow] fixed build CI

* polish

* polish

* polish

* polish

* polish

* [ci] fixed booster test (#5251)

* [ci] fixed booster test

* [ci] fixed booster test

* [ci] fixed booster test

* [ci] fixed ddp test (#5254)

* [ci] fixed ddp test

* polish

* fix typo in applications/ColossalEval/README.md (#5250)

* [ci] fix shardformer tests. (#5255)

* fix ci

fix

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>

* [doc] fix doc typo (#5256)

* [doc] fix annotation display

* [doc] fix llama2 doc

* [hotfix]: add pp sanity check and fix mbs arg (#5268)

* fix: fix misleading mbs arg

* feat: add pp sanity check

* fix: fix 1f1b sanity check

* [workflow] fixed incomplete bash command (#5272)

* [workflow] fixed oom tests (#5275)

* [workflow] fixed oom tests

* polish

* polish

* polish

* [ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276)

* fix ci

fix

* fix test

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

* fix

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>

* [shardformer] hybridparallelplugin support gradients accumulation. (#5246)

* support gradients acc

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

fix

* fix

fix

* fix

fix

fix

* [hotfix] Fix ShardFormer test execution path when using sequence parallelism (#5230)

* fix auto loading gpt2 tokenizer (#5279)

* [doc] add llama2-13B display (#5285)

* Update README.md

* fix 13b typo

---------

Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* fix llama pretrain (#5287)

* fix

* fix

* fix

fix

* fix

fix

fix

* fix

fix

* benchmark gpt2

* fix

fix

fix

fix

* [workflow] fixed build CI (#5240)

* [workflow] fixed build CI

* polish

* polish

* polish

* polish

* polish

* [ci] fixed booster test (#5251)

* [ci] fixed booster test

* [ci] fixed booster test

* [ci] fixed booster test

* fix

fix

* fix

fix

fix

* fix

* fix

fix

fix

fix

fix

* fix

* Update shardformer.py

---------

Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: Wenhao Chen <cwher@outlook.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
Co-authored-by: Michelle <97082656+MichelleMa8@users.noreply.github.com>
Co-authored-by: Desperado-Jia <502205863@qq.com>
2024-03-04 16:18:13 +08:00
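
Among the squashed PRs above, #5246 adds gradient accumulation to HybridParallelPlugin. The plain-PyTorch pattern it generalizes looks like this (illustrative only; the plugin performs the bookkeeping inside its wrapped optimizer):

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4  # one optimizer update per 4 micro-batches

for step in range(100):
    x, y = torch.randn(8, 10), torch.randn(8, 1)
    loss = F.mse_loss(model(x), y)
    (loss / accum_steps).backward()    # scale so gradients average out
    if (step + 1) % accum_steps == 0:  # update once per accumulation window
        optimizer.step()
        optimizer.zero_grad()
```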
Frank Lee
2461f37886
[workflow] added pypi channel (#5412) 2024-02-29 13:56:55 +08:00
Frank Lee
dcdd8a5ef7
[setup] fixed nightly release (#5388) 2024-02-27 15:19:13 +08:00
Frank Lee
73f4dc578e
[workflow] updated CI image (#5318) 2024-01-29 11:53:07 +08:00
Frank Lee
7cfed5f076
[feat] refactored extension module (#5298)
* [feat] refactored extension module

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish
2024-01-25 17:01:48 +08:00
ver217
148469348a Merge branch 'main' into sync/npu 2024-01-18 12:05:21 +08:00
Frank Lee
d69cd2eb89
[workflow] fixed oom tests (#5275)
* [workflow] fixed oom tests

* polish

* polish

* polish
2024-01-16 18:55:13 +08:00
Frank Lee
04244aaaf1
[workflow] fixed incomplete bash command (#5272) 2024-01-16 11:54:44 +08:00
Frank Lee
d5eeeb1416
[ci] fixed booster test (#5251)
* [ci] fixed booster test

* [ci] fixed booster test

* [ci] fixed booster test
2024-01-11 16:04:45 +08:00
Frank Lee
edf94a35c3
[workflow] fixed build CI (#5240)
* [workflow] fixed build CI

* polish

* polish

* polish

* polish

* polish
2024-01-10 22:34:16 +08:00
Hongxin Liu
d202cc28c0
[npu] change device to accelerator api (#5239)
* update accelerator

* fix timer

* fix amp

* update

* fix

* update bug

* add error raise

* fix autocast

* fix set device

* remove doc accelerator

* update doc

* update doc

* update doc

* use nullcontext

* update cpu

* update null context

* change time limit for example

* update

* update

* update

* update

* [npu] polish accelerator code

---------

Co-authored-by: Xuanlei Zhao <xuanlei.zhao@gmail.com>
Co-authored-by: zxl <43881818+oahzxl@users.noreply.github.com>
2024-01-09 10:20:05 +08:00
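
The accelerator commit above (#5239) replaces hard-coded `.cuda()` calls with a device-agnostic layer. A usage sketch assuming the `get_accelerator` entry point this PR introduces (import path and method names may vary by version):

```python
import torch
from colossalai.accelerator import get_accelerator

accelerator = get_accelerator()            # resolves to CUDA, NPU, or CPU
device = accelerator.get_current_device()  # instead of a literal "cuda"
x = torch.randn(4, 4, device=device)
print(x.device)
```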
Hongxin Liu
7f3400b560
[devops] update torch version in ci (#5217) 2024-01-03 11:46:33 +08:00
Wenhao Chen
7172459e74
[shardformer]: support gpt-j, falcon, Mistral and add interleaved pipeline for bert (#5088)
* [shardformer] implement policy for all GPT-J models and test

* [shardformer] support interleaved pipeline parallel for bert finetune

* [shardformer] shardformer support falcon (#4883)

* [shardformer]: fix interleaved pipeline for bert model (#5048)

* [hotfix]: disable seq parallel for gptj and falcon, and polish code (#5093)

* Add Mistral support for Shardformer (#5103)

* [shardformer] add tests to mistral (#5105)

---------

Co-authored-by: Pengtai Xu <henryxu880@gmail.com>
Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com>
Co-authored-by: flybird11111 <1829166702@qq.com>
Co-authored-by: eric8607242 <e0928021388@gmail.com>
2023-11-28 16:54:42 +08:00
YeAnbang
e53e729d8e
[Feature] Add document retrieval QA (#5020)
* add langchain

* add langchain

* Add files via upload

* add langchain

* fix style

* fix style: remove extra space

* add pytest; modified retriever

* add pytest; modified retriever

* add tests to build_on_pr.yml

* fix build_on_pr.yml

* fix build on pr; fix environ vars

* separate unit tests for colossalqa from build on pr

* fix container setting; fix environ vars

* commented dev code

* add incremental update

* remove stale code

* fix style

* change to sha3 224

* fix retriever; fix style; add unit test for document loader

* fix ci workflow config

* fix ci workflow config

* add set cuda visible device script in ci

* fix doc string

* fix style; update readme; refactored

* add force log info

* change build on pr, ignore colossalqa

* fix docstring, capitalize all initial letters

* fix indexing; fix text-splitter

* remove debug code, update reference

* reset previous commit

* update LICENSE update README add key-value mode, fix bugs

* add files back

* revert force push

* remove junk file

* add test files

* fix retriever bug, add intent classification

* change conversation chain design

* rewrite prompt and conversation chain

* add ui v1

* ui v1

* fix avatar

* add header

* Refactor the RAG Code and support Pangu

* Refactor the ColossalQA chain into object-oriented style and refactor the UI demo.

* resolved conversation. tested scripts under examples. web demo still buggy

* fix ci tests

* Some modifications to add ChatGPT api

* modify llm.py and remove unnecessary files

* Delete applications/ColossalQA/examples/ui/test_frontend_input.json

* Remove OpenAI api key

* add colossalqa

* move files

* move files

* move files

* move files

* fix style

* Add Readme and fix some bugs.

* Add something to readme and modify some code

* modify a directory name for clarity

* remove redundant directory

* Correct a typo in llm.py

* fix AI prefix

* fix test_memory.py

* fix conversation

* fix some errors and typos

* Fix a missing import in RAG_ChatBot.py

* add colossalcloud LLM wrapper, correct issues in code review

---------

Co-authored-by: YeAnbang <anbangy2@outlook.com>
Co-authored-by: Orion-Zheng <zheng_zian@u.nus.edu>
Co-authored-by: Zian(Andy) Zheng <62330719+Orion-Zheng@users.noreply.github.com>
Co-authored-by: Orion-Zheng <zhengzian@u.nus.edu>
2023-11-23 10:33:48 +08:00
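
One detail in the ColossalQA PR above is switching incremental index updates to SHA3-224 document hashing. A minimal sketch of hash-based change detection under that assumption (`needs_reindex` is a hypothetical helper, not the repository's code):

```python
import hashlib

index: dict[str, str] = {}  # doc_id -> digest of the last indexed text

def needs_reindex(doc_id: str, text: str) -> bool:
    digest = hashlib.sha3_224(text.encode("utf-8")).hexdigest()
    if index.get(doc_id) == digest:
        return False            # unchanged since the last indexing run
    index[doc_id] = digest      # record the new content hash
    return True

print(needs_reindex("a", "hello"))  # True: new document
print(needs_reindex("a", "hello"))  # False: content unchanged
```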
Hongxin Liu
67f5331754
[misc] add code owners (#5024) 2023-11-08 15:18:51 +08:00
Hongxin Liu
8993c8a817
[release] update version (#4995)
* [release] update version

* [hotfix] fix ci
2023-11-01 13:41:22 +08:00
binmakeswell
822051d888
[doc] update slack link (#4823) 2023-09-27 17:37:39 +08:00
Hongxin Liu
4965c0dabd
[lazy] support from_pretrained (#4801)
* [lazy] patch from pretrained

* [lazy] fix from pretrained and add tests

* [devops] update ci
2023-09-26 11:04:11 +08:00
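
The lazy-init commit above patches `from_pretrained` so checkpoint weights are not materialized eagerly. A usage sketch assuming ColossalAI's `LazyInitContext` (check your version for the exact import path):

```python
from colossalai.lazy import LazyInitContext
from transformers import AutoModelForCausalLM

with LazyInitContext():
    # Parameters are created as lazy tensors; the full weights are not
    # allocated in host memory at this point.
    model = AutoModelForCausalLM.from_pretrained("gpt2")

# A booster/plugin later materializes the shards directly on target devices.
```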
Wenhao Chen
7b9b86441f
[chat]: update rm, add wandb and fix bugs (#4471)
* feat: modify forward fn of critic and reward model

* feat: modify calc_action_log_probs

* to: add wandb in sft and rm trainer

* feat: update train_sft

* feat: update train_rm

* style: modify type annotation and add warning

* feat: pass tokenizer to ppo trainer

* to: modify trainer base and maker base

* feat: add wandb in ppo trainer

* feat: pass tokenizer to generate

* test: update generate fn tests

* test: update train tests

* fix: remove action_mask

* feat: remove unused code

* fix: fix wrong ignore_index

* fix: fix mock tokenizer

* chore: update requirements

* revert: modify make_experience

* fix: fix inference

* fix: add padding side

* style: modify _on_learn_batch_end

* test: use mock tokenizer

* fix: use bf16 to avoid overflow

* fix: fix workflow

* [chat] fix gemini strategy

* [chat] fix

* sync: update colossalai strategy

* fix: fix args and model dtype

* fix: fix checkpoint test

* fix: fix requirements

* fix: fix missing import and wrong arg

* fix: temporarily skip gemini test in stage 3

* style: apply pre-commit

* fix: temporarily skip gemini test in stage 1&2

---------

Co-authored-by: Mingyan Jiang <1829166702@qq.com>
2023-09-20 15:53:58 +08:00
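
The chat update above threads wandb logging through the SFT/RM/PPO trainers. A generic sketch of that logging pattern (project name and metric keys are placeholders, not the trainers' actual ones):

```python
import wandb

wandb.init(project="coati-rm", config={"lr": 1e-5, "batch_size": 4})
for step in range(100):
    loss = 1.0 / (step + 1)                    # stand-in for the training loss
    wandb.log({"train/loss": loss}, step=step)
wandb.finish()
```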
Hongxin Liu
079bf3cb26
[misc] update pre-commit and run all files (#4752)
* [misc] update pre-commit

* [misc] run pre-commit

* [misc] remove useless configuration files

* [misc] ignore cuda for clang-format
2023-09-19 14:20:26 +08:00
Hongxin Liu
b5f9e37c70
[legacy] clean up legacy code (#4743)
* [legacy] remove outdated codes of pipeline (#4692)

* [legacy] remove cli of benchmark and update optim (#4690)

* [legacy] remove cli of benchmark and update optim

* [doc] fix cli doc test

* [legacy] fix engine clip grad norm

* [legacy] remove outdated colo tensor (#4694)

* [legacy] remove outdated colo tensor

* [test] fix test import

* [legacy] move outdated zero to legacy (#4696)

* [legacy] clean up utils (#4700)

* [legacy] clean up utils

* [example] update examples

* [legacy] clean up amp

* [legacy] fix amp module

* [legacy] clean up gpc (#4742)

* [legacy] clean up context

* [legacy] clean core, constants and global vars

* [legacy] refactor initialize

* [example] fix examples ci

* [example] fix examples ci

* [legacy] fix tests

* [example] fix gpt example

* [example] fix examples ci

* [devops] fix ci installation

* [example] fix examples ci
2023-09-18 16:31:06 +08:00
Hongxin Liu
536397cc95
[devops] fix concurrency group (#4667) 2023-09-11 15:32:50 +08:00
Hongxin Liu
a686f9ddc8
[devops] fix concurrency group and compatibility test (#4665)
* [devops] fix concurrency group

* [devops] fix compatibility test

* [devops] fix tensornvme install

* [devops] fix tensornvme install

* [devops] fix colossalai install
2023-09-08 13:49:40 +08:00
Hongxin Liu
a39a5c66fe
Merge branch 'main' into feature/shardformer 2023-09-04 23:43:13 +08:00
yingliu-hpc
aaeb520ce3
Merge pull request #4542 from hpcaitech/chatglm
[coati] Add chatglm in coati
2023-09-04 16:09:45 +08:00
Baizhou Zhang
38ccb8b1a3
[shardformer] support from_pretrained when loading model with HybridParallelPlugin (#4575)
* hybrid plugin support huggingface from_pretrained

* add huggingface compatibility tests

* add folder cleaning

* fix bugs
2023-09-01 17:40:01 +08:00
Hongxin Liu
c7b60f7547
[devops] cancel previous runs in the PR (#4546) 2023-08-30 23:07:21 +08:00
ver217
1c43bfd54e [coati] update ci 2023-08-30 10:55:56 +08:00
Hongxin Liu
26e29d58f0
[devops] add large-scale distributed test marker (#4452)
* [test] remove cpu marker

* [test] remove gpu marker

* [test] update pytest markers

* [ci] update unit test ci
2023-08-16 18:56:52 +08:00
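
The devops commit above removes the cpu/gpu pytest markers and adds a large-scale distributed one. A sketch of how such custom markers are declared and selected (`largedist` is a hypothetical marker name):

```python
import pytest

@pytest.mark.largedist  # run only these tests with: pytest -m largedist
def test_eight_gpu_allreduce():
    ...

# Register the marker (e.g. in pyproject.toml) to silence warnings:
# [tool.pytest.ini_options]
# markers = ["largedist: large-scale distributed tests"]
```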
Wenhao Chen
da4f7b855f
[chat] fix bugs and add unit tests (#4213)
* style: rename replay buffer

Experience replay is typically for off-policy algorithms; using this name in PPO may be misleading.

* fix: fix wrong zero2 default arg

* test: update experience tests

* style: rename zero_pad fn

* fix: defer init in CycledDataLoader

* test: add benchmark test

* style: rename internal fn of generation

* style: rename internal fn of lora

* fix: remove unused loss fn

* fix: remove unused utils fn

* refactor: remove generate_with_actor fn

* fix: fix type annotation

* test: add models tests

* fix: skip llama due to long execution time

* style: modify dataset

* style: apply formatter

* perf: update reward dataset

* fix: fix wrong IGNORE_INDEX in sft dataset

* fix: remove DataCollatorForSupervisedDataset

* test: add dataset tests

* style: apply formatter

* style: rename test_ci to test_train

* feat: add llama in inference

* test: add inference tests

* test: change test scripts directory

* fix: update ci

* fix: fix typo

* fix: skip llama due to oom

* fix: fix file mod

* style: apply formatter

* refactor: remove duplicated llama_gptq

* style: apply formatter

* to: update rm test

* feat: add tokenizer arg

* feat: add download model script

* test: update train tests

* fix: modify gemini load and save pretrained

* test: update checkpoint io test

* to: modify nproc_per_node

* fix: do not remove existing dir

* fix: modify save path

* test: add random choice

* fix: fix sft path

* fix: enlarge nproc_per_node to avoid oom

* fix: add num_retry

* fix: make lora config of rm and critic consistent

* fix: add warning about lora weights

* fix: skip some gpt2 tests

* fix: remove grad ckpt in rm and critic due to errors

* refactor: directly use Actor in train_sft

* test: add more arguments

* fix: disable grad ckpt when using lora

* fix: fix save_pretrained and related tests

* test: enable zero2 tests

* revert: remove useless fn

* style: polish code

* test: modify test args
2023-08-02 10:17:36 +08:00
Hongxin Liu
806477121d
[release] update version (#4332)
* [release] update version

* [devops] hotfix cuda extension building

* [devops] pytest ignore useless folders
2023-08-01 15:01:19 +08:00
Hongxin Liu
02192a632e
[ci] support testmon core pkg change detection (#4305) 2023-07-21 18:36:35 +08:00
Frank Lee
cc3cbe9f6f
[workflow] show test duration (#4159) 2023-07-04 18:11:46 +08:00
Wenhao Chen
3d8d5d0d58
[chat] use official transformers and fix some issues (#4117)
* feat: remove on_learn_epoch fn as not used

* revert: add _on_learn_epoch fn

* feat: remove NaiveStrategy

* test: update train_prompts tests

* fix: remove prepare_llama_tokenizer_and_embedding

* test: add lora arg

* feat: remove roberta support in train_prompts due to runtime errs

* feat: remove deberta & roberta in rm as not used

* test: remove deberta and roberta tests

* feat: remove deberta and roberta models as not used

* fix: remove calls to roberta

* fix: remove prepare_llama_tokenizer_and_embedding

* chore: update transformers version

* docs: update transformers version

* fix: fix actor inference

* fix: fix ci

* feat: change llama pad token to unk

* revert: revert ddp setup_distributed

* fix: change llama pad token to unk

* revert: undo unnecessary changes

* fix: use pip to install transformers
2023-07-04 13:49:09 +08:00
Frank Lee
1ee947f617
[workflow] added status check for test coverage workflow (#4106) 2023-06-28 14:33:43 +08:00
Frank Lee
b463651f3e
[workflow] cover all public repositories in weekly report (#4069) 2023-06-22 14:41:25 +08:00
Hongxin Liu
4a81faa5f3
[devops] fix build on pr ci (#4043)
* [devops] fix build on pr ci

* [devops] fix build on pr ci
2023-06-19 17:12:56 +08:00
Frank Lee
8bcad73677
[workflow] fixed the directory check in build (#3980) 2023-06-13 14:42:35 +08:00
Frank Lee
6718a2f285 [workflow] cancel duplicated workflow jobs (#3960) 2023-06-12 15:11:27 +08:00
digger yu
1aadeedeea
fix typo .github/workflows/scripts/ (#3946) 2023-06-09 10:30:50 +08:00
Frank Lee
5e2132dcff
[workflow] added docker latest tag for release (#3920) 2023-06-07 15:37:37 +08:00
Hongxin Liu
c25d421f3e
[devops] hotfix testmon cache clean logic (#3917) 2023-06-07 12:39:12 +08:00
Hongxin Liu
b5f0566363
[chat] add distributed PPO trainer (#3740)
* Detached ppo (#9)

* run the base

* working on dist ppo

* sync

* detached trainer

* update detached trainer. no maker update function

* facing init problem

* 1 maker 1 trainer detached run. but no model update

* facing cuda problem

* fix save functions

* verified maker update

* nothing

* add ignore

* analyze loss issue

* remove some debug codes

* facing 2m1t stuck issue

* 2m1t verified

* do not use torchrun

* working on 2m2t

* working on 2m2t

* initialize strategy in ray actor env

* facing actor's init order issue

* facing ddp model update issue (need to unwrap ddp)

* unwrap ddp actor

* checking 1m2t stuck problem

* nothing

* set timeout for trainer choosing. It solves the stuck problem!

* delete some debug output

* rename to sync with upstream

* rename to sync with upstream

* coati rename

* nothing

* I am going to detach the replay buffer from the trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronous buffer operations

* experience_maker_holder performs target-revolving _send_experience() instead of length comparison.

* move code to ray subfolder

* working on pipeline inference

* apply comments

* working on pipeline strategy. in progress.

* remove pipeline code. clean this branch

* update remote parameters by state_dict. no test

* nothing

* state_dict sharding transfer

* merge debug branch

* gemini _unwrap_model fix

* simplify code

* simplify code & fix LoRALinear AttributeError

* critic unwrapped state_dict

---------

Co-authored-by: csric <richcsr256@gmail.com>

* [chat] add performance evaluator and fix bugs (#10)

* [chat] add performance evaluator for ray

* [chat] refactor debug arg

* [chat] support hf config

* [chat] fix generation

* [chat] add 1mmt dummy example

* [chat] fix gemini ckpt

* split experience to send (#11)

Co-authored-by: csric <richcsr256@gmail.com>

* [chat] refactor trainer and maker (#12)

* [chat] refactor experience maker holder

* [chat] refactor model init

* [chat] refactor trainer args

* [chat] refactor model init

* [chat] refactor trainer

* [chat] refactor experience sending logic and training loop args (#13)

* [chat] refactor experience send logic

* [chat] refactor trainer

* [chat] refactor trainer

* [chat] refactor experience maker

* [chat] refactor pbar

* [chat] refactor example folder (#14)

* [chat] support quant (#15)

* [chat] add quant

* [chat] add quant example

* prompt example (#16)

* prompt example

* prompt load csv data

* remove legacy try

---------

Co-authored-by: csric <richcsr256@gmail.com>

* [chat] add mmmt dummy example and refactor experience sending (#17)

* [chat] add mmmt dummy example

* [chat] refactor naive strategy

* [chat] fix stuck problem

* [chat] fix naive strategy

* [chat] optimize experience maker sending logic

* [chat] refactor sending assignment

* [chat] refactor performance evaluator (#18)

* Prompt Example & requires_grad state_dict & sharding state_dict (#19)

* prompt example

* prompt load csv data

* remove legacy try

* maker models require_grad set to False

* working on zero redundancy update

* mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.

* remove legacy examples

* remove legacy examples

* remove replay buffer tp state. bad design

---------

Co-authored-by: csric <richcsr256@gmail.com>

* state_dict sending adapts to new unwrap function (#20)

* prompt example

* prompt load csv data

* remove legacy try

* maker models require_grad set to False

* working on zero redundancy update

* mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.

* remove legacy examples

* remove legacy examples

* remove replay buffer tp state. bad design

* opt benchmark

* better script

* nothing

* [chat] strategy refactor unwrap model

* [chat] strategy refactor save model

* [chat] add docstr

* [chat] refactor trainer save model

* [chat] fix strategy typing

* [chat] refactor trainer save model

* [chat] update readme

* [chat] fix unit test

* working on lora reconstruction

* state_dict sending adapts to new unwrap function

* remove comments

---------

Co-authored-by: csric <richcsr256@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>

* [chat-ray] add readme (#21)

* add readme

* transparent graph

* add note background

---------

Co-authored-by: csric <richcsr256@gmail.com>

* [chat] get images from url (#22)

* Refactor/chat ray (#23)

* [chat] lora add todo

* [chat] remove unused pipeline strategy

* [chat] refactor example structure

* [chat] setup ci for ray

* [chat-ray] Support LoRA trainer. LoRA weights reconstruction. (#24)

* lora support prototype

* lora support

* 1mmt lora & remove useless code

---------

Co-authored-by: csric <richcsr256@gmail.com>

* [chat] fix test ci for ray

* [chat] fix test ci requirements for ray

* [chat] fix ray runtime env

* [chat] fix ray runtime env

* [chat] fix example ci docker args

* [chat] add debug info in trainer

* [chat] add nccl debug info

* [chat] skip ray test

* [doc] fix typo

---------

Co-authored-by: csric <59389055+CsRic@users.noreply.github.com>
Co-authored-by: csric <richcsr256@gmail.com>
2023-06-07 10:41:16 +08:00
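
The distributed-PPO PR above splits experience makers and trainers into separate Ray actors. A toy sketch of that actor topology (illustrative stand-ins only, not the coati implementation):

```python
import ray

@ray.remote
class ExperienceMaker:
    def make(self) -> list[float]:
        return [0.1, 0.2, 0.3]          # stand-in for rollout experience

@ray.remote
class Trainer:
    def learn(self, batch: list[float]) -> float:
        return sum(batch) / len(batch)  # stand-in for a PPO update

ray.init()
maker, trainer = ExperienceMaker.remote(), Trainer.remote()
batch = ray.get(maker.make.remote())         # maker produces experience...
print(ray.get(trainer.learn.remote(batch)))  # ...trainer consumes it
ray.shutdown()
```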
Hongxin Liu
41fb7236aa
[devops] hotfix CI about testmon cache (#3910)
* [devops] hotfix CI about testmon cache

* [devops] fix testmon cache on pr
2023-06-06 18:58:58 +08:00
Hongxin Liu
ec9bbc0094
[devops] improving testmon cache (#3902)
* [devops] improving testmon cache

* [devops] fix branch name with slash

* [devops] fix branch name with slash

* [devops] fix edit action

* [devops] fix edit action

* [devops] fix edit action

* [devops] fix edit action

* [devops] fix edit action

* [devops] fix edit action

* [devops] update readme
2023-06-06 11:32:31 +08:00
Frank Lee
ae959a72a5
[workflow] fixed workflow check for docker build (#3849) 2023-05-25 16:42:34 +08:00
Frank Lee
54e97ed7ea
[workflow] supported test on CUDA 10.2 (#3841) 2023-05-25 14:14:34 +08:00
Frank Lee
84500b7799
[workflow] fixed testmon cache in build CI (#3806)
* [workflow] fixed testmon cache in build CI

* polish code
2023-05-24 14:59:40 +08:00
Frank Lee
05b8a8de58
[workflow] changed doc build to be on schedule and release (#3825)
* [workflow] changed doc build to be on schedule and release

* polish code
2023-05-24 10:50:19 +08:00
digger yu
7f8203af69
fix typo colossalai/auto_parallel autochunk fx/passes etc. (#3808) 2023-05-24 09:01:50 +08:00
Frank Lee
1e3b64f26c
[workflow] enabled doc build from a forked repo (#3815) 2023-05-23 17:49:53 +08:00
Frank Lee
ad93c736ea
[workflow] enable testing for develop & feature branch (#3801) 2023-05-23 11:21:15 +08:00
Frank Lee
788e07dbc5
[workflow] fixed the docker build workflow (#3794)
* [workflow] fixed the docker build workflow

* polish code
2023-05-22 16:30:32 +08:00
liuzeming
4d29c0f8e0
Fix/docker action (#3266)
* [docker] Add ARG VERSION to determine the Tag

* [workflow] fixed the version in the release docker workflow

---------

Co-authored-by: liuzeming <liuzeming@4paradigm.com>
2023-05-22 15:04:00 +08:00
Hongxin Liu
b4788d63ed
[devops] fix doc test on pr (#3782) 2023-05-19 16:28:57 +08:00
Hongxin Liu
5dd573c6b6
[devops] fix ci for document check (#3751)
* [doc] add test info

* [devops] update doc check ci

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] add debug info

* [devops] remove debug info and update invalid doc

* [devops] add essential comments
2023-05-17 11:24:22 +08:00
Hongxin Liu
c03bd7c6b2
[devops] make build on PR run automatically (#3748)
* [devops] make build on PR run automatically

* [devops] update build on pr condition
2023-05-17 11:17:37 +08:00
Hongxin Liu
afb239bbf8
[devops] update torch version of CI (#3725)
* [test] fix flop tensor test

* [test] fix autochunk test

* [test] fix lazyinit test

* [devops] update torch version of CI

* [devops] enable testmon

* [devops] fix ci

* [devops] fix ci

* [test] fix checkpoint io test

* [test] fix cluster test

* [test] fix timm test

* [devops] fix ci

* [devops] fix ci

* [devops] fix ci

* [devops] fix ci

* [devops] force sync to test ci

* [test] skip fsdp test
2023-05-15 17:20:56 +08:00
Hongxin Liu
50793b35f4
[gemini] accelerate inference (#3641)
* [gemini] support skipping scatter after inference

* [chat] update colossalai strategy

* [chat] fix opt benchmark

* [chat] update opt benchmark

* [gemini] optimize inference

* [test] add gemini inference test

* [chat] fix unit test ci

* [chat] fix ci

* [chat] fix ci

* [chat] skip checkpoint test
2023-04-26 16:32:40 +08:00
Hongxin Liu
179558a87a
[devops] fix chat ci (#3628) 2023-04-24 10:55:14 +08:00
digger-yu
633bac2f58
[doc] .github/workflows/README.md (#3605)
Fixed several spelling errors, e.g. changed "compatiblity" to "compatibility".
2023-04-20 10:36:28 +08:00
Camille Zhong
36a519b49f Update test_ci.sh
update

Update test_ci.sh

Update test_ci.sh

Update test_ci.sh

Update test_ci.sh

Update test_ci.sh

Update test_ci.sh

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update test_ci.sh

Update test_ci.sh

update

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

update ci

Update test_ci.sh

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml

Update test_ci.sh

Update test_ci.sh

Update run_chatgpt_examples.yml

Update test_ci.sh

Update test_ci.sh

Update test_ci.sh

update test ci

RoBERTa for RLHF Stage 2 & 3 (still in testing)

Revert "Add RoBERTa for RLHF Stage 2 & 3 (test)"

This reverts commit 06741d894d.

Add RoBERTa for RLHF stage 2 & 3

1. add roberta folder under model folder
2. add  roberta option in train_reward_model.py
3. add some test in testci

Update test_ci.sh

Revert "Update test_ci.sh"

This reverts commit 9c7352b81766f3177d31eeec0ec178a301df966a.

Add RoBERTa for RLHF Stage 2 & 3 (test)

update roberta with coati

chat ci update

Revert "chat ci update"

This reverts commit 17ae7ae01fa752bd3289fc39069868fde99cf846.

[test]chat_update_ci

Update test_ci.sh

Update test_ci.sh

test

Update gpt_critic.py

Update gpt_critic.py

Update run_chatgpt_unit_tests.yml

update test ci

update

update

update

update

Update test_ci.sh

update

Update test_ci.sh

Update test_ci.sh

Update run_chatgpt_examples.yml

Update run_chatgpt_examples.yml
2023-04-18 14:33:12 +08:00
digger-yu
6e7e43c6fe
[doc] Update .github/workflows/README.md (#3577)
Code optimization: removed two extra $ characters that had been entered here.
2023-04-17 16:27:38 +08:00
Frank Lee
80eba05b0a
[test] refactor tests with spawn (#3452)
* [test] added spawn decorator

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code
2023-04-06 14:51:35 +08:00
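
The test refactor above moves multi-process tests behind a spawn decorator. A plain `torch.multiprocessing` sketch of the same idea (a generic stand-in, not ColossalAI's actual spawn helper):

```python
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    # mp.spawn passes the process rank as the first argument.
    print(f"worker {rank}/{world_size} running the test body")

def test_spawn_example():
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

if __name__ == "__main__":
    test_spawn_example()
```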
Hakjin Lee
1653063fce
[CI] Fix pre-commit workflow (#3238) 2023-03-27 09:41:08 +08:00
Frank Lee
169ed4d24e
[workflow] purged extension cache before GPT test (#3128) 2023-03-14 10:11:32 +08:00
Frank Lee
91ccf97514
[workflow] fixed doc build trigger condition (#3072) 2023-03-09 17:31:41 +08:00
Frank Lee
8fedc8766a
[workflow] supported conda package installation in doc test (#3028)
* [workflow] supported conda package installation in doc test

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code
2023-03-07 14:21:26 +08:00
Frank Lee
2cd6ba3098
[workflow] fixed the post-commit failure when no formatting needed (#3020)
* [workflow] fixed the post-commit failure when no formatting needed

* polish code

* polish code

* polish code
2023-03-07 13:35:45 +08:00
Frank Lee
2e427ddf42
[revert] recover "[refactor] restructure configuration files (#2977)" (#3022)
This reverts commit 35c8f4ce47.
2023-03-07 13:31:23 +08:00
Saurav Maheshkar
35c8f4ce47
[refactor] restructure configuration files (#2977)
* gh: move CONTRIBUTING to .github

* chore: move isort config to pyproject

* chore: move pytest config to pyproject

* chore: move yapf config to pyproject

* chore: move clang-format config to pre-commit
2023-03-05 20:29:34 +08:00
Frank Lee
77b88a3849
[workflow] added auto doc test on PR (#2929)
* [workflow] added auto doc test on PR

* [workflow] added doc test workflow

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code
2023-02-28 11:10:38 +08:00
Frank Lee
e33c043dec
[workflow] moved pre-commit to post-commit (#2895) 2023-02-24 14:41:33 +08:00
LuGY
dbd0fd1522
[CI/CD] fix nightly release CD running on forked repo (#2812)
* [CI/CD] fix nightly release CD running on forked repo

* fix misunderstanding of dispatch

* remove some build condition, enable notify even when release failed
2023-02-18 13:27:13 +08:00
ver217
9c0943ecdb
[chatgpt] optimize generation kwargs (#2717)
* [chatgpt] ppo trainer use default generate args

* [chatgpt] example remove generation preparing fn

* [chatgpt] benchmark remove generation preparing fn

* [chatgpt] fix ci
2023-02-15 13:59:58 +08:00
Frank Lee
2045d45ab7
[doc] updated documentation version list (#2715) 2023-02-15 11:24:18 +08:00
ver217
f6b4ca4e6c
[devops] add chatgpt ci (#2713) 2023-02-15 10:53:54 +08:00
Frank Lee
89f8975fb8
[workflow] fixed tensor-nvme build caching (#2711) 2023-02-15 10:12:55 +08:00
Frank Lee
5cd8cae0c9
[workflow] fixed community report ranking (#2680) 2023-02-13 17:04:49 +08:00
Frank Lee
c44fd0c867
[workflow] added trigger to build doc upon release (#2678) 2023-02-13 16:53:26 +08:00
Frank Lee
327bc06278
[workflow] added doc build test (#2675)
* [workflow] added doc build test

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code
2023-02-13 15:55:57 +08:00
Frank Lee
94f87f9651
[workflow] fixed gpu memory check condition (#2659) 2023-02-10 09:59:07 +08:00
Frank Lee
85b2303b55
[doc] migrate the markdown files (#2652) 2023-02-09 14:21:38 +08:00
Frank Lee
8518263b80
[test] fixed the triton version for testing (#2608) 2023-02-07 13:49:38 +08:00
Frank Lee
aa7e9e4794
[workflow] fixed the test coverage report (#2614)
* [workflow] fixed the test coverage report

* polish code
2023-02-07 11:50:53 +08:00
Frank Lee
b3973b995a
[workflow] fixed test coverage report (#2611) 2023-02-07 11:02:56 +08:00
Frank Lee
f566b0ce6b
[workflow] fixed broken release workflows (#2604) 2023-02-06 21:40:19 +08:00
Frank Lee
f7458d3ec7
[release] v0.2.1 (#2602)
* [release] v0.2.1

* polish code
2023-02-06 20:46:18 +08:00