Commit Graph

2998 Commits

Author SHA1 Message Date
Frank Lee
705a62a565
[doc] updated installation command (#5389) 2024-02-19 16:54:03 +08:00
yixiaoer
69e3ad01ed
[doc] Fix typo (#5361) 2024-02-19 16:53:28 +08:00
Hongxin Liu
7303801854
[llama] fix training and inference scripts (#5384)
* [llama] refactor inference example to fit sft

* [llama] fix training script to fit gemini

* [llama] fix inference script
2024-02-19 16:41:04 +08:00
Hongxin Liu
adae123df3
[release] update version (#5380) 2024-02-08 18:50:09 +08:00
Frank Lee
efef43b53c
Merge pull request #5372 from hpcaitech/exp/mixtral 2024-02-08 16:30:05 +08:00
Frank Lee
4c03347fc7
Merge pull request #5377 from hpcaitech/example/llama-npu
[llama] support npu for Colossal-LLaMA-2
2024-02-08 14:12:11 +08:00
ver217
06db94fbc9 [moe] fix tests 2024-02-08 12:46:37 +08:00
Hongxin Liu
65e5d6baa5 [moe] fix mixtral optim checkpoint (#5344) 2024-02-07 19:21:02 +08:00
Hongxin Liu
956b561b54 [moe] fix mixtral forward default value (#5329) 2024-02-07 19:21:02 +08:00
Hongxin Liu
b60be18dcc [moe] fix mixtral checkpoint io (#5314) 2024-02-07 19:21:02 +08:00
Hongxin Liu
da39d21b71 [moe] support mixtral (#5309)
* [moe] add mixtral block for single expert

* [moe] mixtral block fwd support uneven ep

* [moe] mixtral block bwd support uneven ep

* [moe] add mixtral moe layer

* [moe] simplify replace

* [moe] support save sharded mixtral

* [moe] support load sharded mixtral

* [moe] support save sharded optim

* [moe] integrate moe manager into plugin

* [moe] fix optimizer load

* [moe] fix mixtral layer
2024-02-07 19:21:02 +08:00
Hongxin Liu
c904d2ae99 [moe] update capacity computing (#5253)
* [moe] top2 allow uneven input

* [moe] update capacity computing

* [moe] remove debug info

2024-02-07 19:21:02 +08:00
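For context on #5253: top-k MoE routers bound how many tokens each expert may receive per step. Below is a minimal sketch of the usual capacity formula, with illustrative names, not the actual ColossalAI implementation:

```python
import math

def compute_capacity(num_tokens: int, num_experts: int,
                     capacity_factor: float = 1.25, top_k: int = 2,
                     min_capacity: int = 4) -> int:
    """Per-expert token capacity for a top-k MoE router.

    Each token is dispatched to top_k experts, so the average load is
    top_k * num_tokens / num_experts; the capacity factor adds headroom
    for uneven routing, and overflow tokens are dropped or rerouted.
    """
    capacity = math.ceil(top_k * num_tokens / num_experts * capacity_factor)
    return max(capacity, min_capacity)

print(compute_capacity(num_tokens=1024, num_experts=8))  # 320
```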
Xuanlei Zhao
7d8e0338a4 [moe] init mixtral impl 2024-02-07 19:21:02 +08:00
Hongxin Liu
084c91246c
[llama] fix memory issue (#5371)
* [llama] fix memory issue

* [llama] add comment
2024-02-06 19:02:37 +08:00
Hongxin Liu
c53ddda88f
[lr-scheduler] fix load state dict and add test (#5369) 2024-02-06 14:23:32 +08:00
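For context on #5369: PyTorch LR schedulers keep internal counters (e.g. `last_epoch`) that must be checkpointed and restored together with the optimizer, or the LR curve restarts from step zero on resume. A generic sketch of the pattern using standard PyTorch APIs, not the patched ColossalAI code:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=100)

# Save the scheduler's state alongside the optimizer's.
torch.save({"optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict()}, "ckpt.pt")

# Resume: restore both so the LR schedule continues where it left off.
state = torch.load("ckpt.pt")
optimizer.load_state_dict(state["optimizer"])
scheduler.load_state_dict(state["scheduler"])
```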
Hongxin Liu
eb4f2d90f9
[llama] polish training script and fix optim ckpt (#5368) 2024-02-06 11:52:17 +08:00
Camille Zhong
a5756a8720
[eval] update llama npu eval (#5366) 2024-02-06 10:53:03 +08:00
Camille Zhong
44ca61a22b
[llama] fix neftune & pbar with start_step (#5364) 2024-02-05 18:04:23 +08:00
Hongxin Liu
a4cec1715b
[llama] add flash attn patch for npu (#5362) 2024-02-05 16:48:34 +08:00
Hongxin Liu
73f9f23fc6
[llama] update training script (#5360)
* [llama] update training script

* [doc] polish docstr
2024-02-05 16:33:18 +08:00
Hongxin Liu
6c0fa7b9a8
[llama] fix dataloader for hybrid parallel (#5358)
* [plugin] refactor prepare dataloader

* [plugin] update train script
2024-02-05 15:14:56 +08:00
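For context on #5358: under hybrid parallelism only data-parallel peers should receive different data shards; tensor- and pipeline-parallel ranks inside one DP group must see identical batches. A hedged sketch of such a `prepare_dataloader` helper (the name mirrors the commit bullets; the actual plugin API may differ):

```python
from torch.utils.data import DataLoader, DistributedSampler

def prepare_dataloader(dataset, batch_size: int, dp_rank: int,
                       dp_world_size: int, seed: int = 42) -> DataLoader:
    # Shard by data-parallel rank, not global rank, so TP/PP peers in
    # the same DP group consume identical batches.
    sampler = DistributedSampler(dataset, num_replicas=dp_world_size,
                                 rank=dp_rank, shuffle=True, seed=seed)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler,
                      drop_last=True)
```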
Hongxin Liu
2dd01e3a14
[gemini] fix param op hook when output is tuple (#5355)
* [gemini] fix param op hook when output is tuple

* [gemini] fix param op hook
2024-02-04 11:58:26 +08:00
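For context on #5355: a parameter op hook that post-processes op outputs must handle ops returning a tuple of tensors, not just a single tensor. A minimal illustration of the general pattern, not the Gemini hook itself:

```python
import torch

def apply_to_tensors(fn, output):
    # Recurse into (possibly nested) tuples so the hook also fires for
    # ops whose output is a tuple of tensors; missing this case is the
    # kind of failure the commit title suggests.
    if torch.is_tensor(output):
        return fn(output)
    if isinstance(output, tuple):
        return tuple(apply_to_tensors(fn, o) for o in output)
    return output
```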
Wenhao Chen
1c790c0877
[fix] remove unnecessary dp_size assert (#5351)
* fix: remove unnecessary assert

* test: add more 3d plugin tests

* fix: add warning
2024-02-02 14:40:20 +08:00
Hongxin Liu
ffffc32dc7
[checkpointio] fix gemini and hybrid parallel optim checkpoint (#5347)
* [checkpointio] fix hybrid parallel optim checkpoint

* [extension] fix cuda extension

* [checkpointio] fix gemini optimizer checkpoint

* polish code
2024-02-01 16:13:06 +08:00
YeAnbang
c5239840e6
[Chat] fix sft loss nan (#5345)
* fix script

* fix chat nan
2024-02-01 14:25:16 +08:00
Frank Lee
abd8e77ad8
[extension] fixed exception catch (#5342) 2024-01-31 18:09:49 +08:00
digger yu
71321a07cf
fix typo: change "dosen't" to "doesn't" (#5308) 2024-01-30 09:57:38 +08:00
digger yu
6a3086a505
fix typo under extensions/ (#5330) 2024-01-30 09:55:16 +08:00
Frank Lee
febed23288
[doc] added docs for extensions (#5324)
* [doc] added docs for extensions

* polish
2024-01-29 17:39:23 +08:00
flybird11111
388179f966
[tests] fix t5 test (#5322)
* [ci] fix shardformer tests (#5255)

* fix ci

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

* fix t5 test

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>
2024-01-29 17:38:46 +08:00
Frank Lee
a6709afe66
Merge pull request #5321 from FrankLeeeee/hotfix/accelerator-api
[accelerator] fixed npu api
2024-01-29 14:29:58 +08:00
FrankLeeeee
087d0cb1fc [accelerator] fixed npu api 2024-01-29 14:27:52 +08:00
Frank Lee
8823cc4831
Merge pull request #5310 from hpcaitech/feature/npu
Feature/npu
2024-01-29 13:49:39 +08:00
Frank Lee
73f4dc578e
[workflow] updated CI image (#5318) 2024-01-29 11:53:07 +08:00
Frank Lee
7cfed5f076
[feat] refactored extension module (#5298)
* [feat] refactored extension module

* polish
2024-01-25 17:01:48 +08:00
digger yu
bce9499ed3
fix some typo (#5307) 2024-01-25 13:56:27 +08:00
李文军
ec912b1ba9
[NFC] polish applications/Colossal-LLaMA-2/colossal_llama2/tokenizer/init_tokenizer.py code style (#5228) 2024-01-25 13:14:48 +08:00
Desperado-Jia
ddf879e2db
fix bug for mixture (#5299) 2024-01-22 22:17:54 +08:00
Hongxin Liu
d7f8db8e21
[hotfix] fix 3d plugin test (#5292) 2024-01-22 15:19:04 +08:00
flybird11111
f7e3f82a7e
fix llama pretrain (#5287) 2024-01-19 17:49:02 +08:00
Desperado-Jia
6a56967855
[doc] add llama2-13B display (#5285)
* Update README.md

* fix 13b typo

---------

Co-authored-by: binmakeswell <binmakeswell@gmail.com>
2024-01-19 16:04:08 +08:00
Michelle
32cb74493a
fix auto loading gpt2 tokenizer (#5279) 2024-01-18 14:08:29 +08:00
Frank Lee
d66e6988bc
Merge pull request #5278 from ver217/sync/npu
[sync] sync npu branch with main
2024-01-18 13:11:45 +08:00
ver217
148469348a Merge branch 'main' into sync/npu 2024-01-18 12:05:21 +08:00
Zhongkai Zhao
5d9a0ae75b
[hotfix] Fix ShardFormer test execution path when using sequence parallelism (#5230) 2024-01-17 17:42:29 +08:00
flybird11111
46e091651b
[shardformer] hybrid parallel plugin supports gradient accumulation (#5246)
* support gradient accumulation

* fix
2024-01-17 15:22:33 +08:00
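For context on #5246: gradient accumulation runs several micro-batch backward passes before a single optimizer step, growing the effective batch size without extra memory. A self-contained sketch in plain PyTorch (the plugin's own API is not shown here):

```python
import torch

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.MSELoss()
accum_steps = 4

optimizer.zero_grad()
for step in range(16):
    inputs, targets = torch.randn(2, 8), torch.randn(2, 1)
    loss = criterion(model(inputs), targets)
    # Scale each micro-batch loss so the accumulated gradient matches
    # a single pass over one batch of accum_steps micro-batches.
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```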
flybird11111
2a0558d8ec
[ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276)
* fix ci

* fix test

* revert: revert p2p

* feat: add enable_metadata_cache option

* revert: enable t5 tests

* fix

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>
2024-01-17 13:38:55 +08:00
Frank Lee
d69cd2eb89
[workflow] fixed oom tests (#5275)
* [workflow] fixed oom tests

* polish
2024-01-16 18:55:13 +08:00
Frank Lee
04244aaaf1
[workflow] fixed incomplete bash command (#5272) 2024-01-16 11:54:44 +08:00
Wenhao Chen
ef4f0ee854
[hotfix]: add pp sanity check and fix mbs arg (#5268)
* fix: fix misleading mbs arg

* feat: add pp sanity check

* fix: fix 1f1b sanity check
2024-01-15 15:57:40 +08:00
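For context on #5268: a pipeline-parallel sanity check typically validates that the global batch splits evenly into micro-batches and that a 1F1B schedule has enough micro-batches to keep every stage busy. A hypothetical sketch of such a check (illustrative, not the merged code):

```python
def check_pp_config(global_batch_size: int, microbatch_size: int,
                    pp_size: int) -> int:
    # Micro-batches must tile the global batch exactly.
    if global_batch_size % microbatch_size != 0:
        raise ValueError("global batch size must be divisible by micro-batch size")
    num_microbatches = global_batch_size // microbatch_size
    # Many 1F1B implementations require at least pp_size micro-batches
    # so that all pipeline stages can be filled.
    if num_microbatches < pp_size:
        raise ValueError("1F1B needs num_microbatches >= number of pipeline stages")
    return num_microbatches

print(check_pp_config(32, 4, 4))  # 8
```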