Commit Graph

3871 Commits

Author SHA1 Message Date
duanjunwen
45f17fc6cc [fix] rm comments; 2024-09-26 06:13:56 +00:00
duanjunwen
a92e16719b [fix] fix zerobubble; support shardformer model type; 2024-09-26 06:11:56 +00:00
binmakeswell
f4daf04270
add funding news (#6072)
* add funding news

* add funding news

* add funding news
2024-09-26 12:29:27 +08:00
wangbluo
6705dad41b fix 2024-09-25 19:02:21 +08:00
wangbluo
91ed32c256 fix 2024-09-25 19:00:38 +08:00
wangbluo
6fb1322db1 fix 2024-09-25 18:56:18 +08:00
wangbluo
65c8297710 fix the attn 2024-09-25 18:51:03 +08:00
wangbluo
cfd9eda628 fix the ring attn 2024-09-25 18:34:29 +08:00
duanjunwen
83163fa70c [fix] fix traverse; traverse dict --> traverse tensor List; 2024-09-25 06:38:11 +00:00
duanjunwen
fc8b016887 [fix] fix stage_indices; 2024-09-25 06:15:45 +00:00
binmakeswell
cbaa104216
release FP8 news (#6068)
* add FP8 news

* release FP8 news

* release FP8 news
2024-09-25 11:57:16 +08:00
duanjunwen
8501202a35
Merge pull request #6065 from duanjunwen/dev/zero_bubble
[Feat] Support zero bubble with shardformer input
2024-09-24 19:17:37 +08:00
duanjunwen
7e6f793c51 [fix] fix detach_output_obj clone; 2024-09-24 08:08:32 +00:00
duanjunwen
6c1e1550ae [fix] fix dumb clone; 2024-09-23 06:43:49 +00:00
duanjunwen
a875212a42 [fix] fix ci --> oom in 4096 hidden dim; 2024-09-23 05:55:16 +00:00
duanjunwen
c114d1429a [fix] fix detach clone release order; 2024-09-23 04:00:24 +00:00
duanjunwen
da3220f48c [fix] fix pipeline util func deallocate --> release_tensor_data; fix bwd_b loss bwd branch; 2024-09-20 09:48:35 +00:00
duanjunwen
1739df423c [fix] fix fwd branch, fwd pass both micro_batch & internal_inputs; 2024-09-20 07:34:43 +00:00
duanjunwen
b6616f544e [fix] rm comments; 2024-09-20 07:29:41 +00:00
duanjunwen
c6d6ee39bd [fix] use tree_flatten replace dict traverse; 2024-09-20 07:18:49 +00:00
duanjunwen
26783776f1 [fix] fix input_tensors buffer append input_obj(dict) --> Tuple (microbatch, input_obj) , and all bwd b related cal logic; 2024-09-20 06:41:19 +00:00
duanjunwen
4753bf7add [fix] fix mem assert; 2024-09-19 08:27:47 +00:00
duanjunwen
a115106f8d [fix] fix bwd w input; 2024-09-19 08:10:05 +00:00
duanjunwen
349272c71f [fix] update bwd b&w input; dict --> list[torch.Tensor] 2024-09-19 07:47:01 +00:00
duanjunwen
6ee9584b9a [fix] fix require_grad & deallocate call; 2024-09-19 05:53:03 +00:00
duanjunwen
1f5c7258aa Merge remote-tracking branch 'upstream/feature/zerobubble' into dev/zero_bubble 2024-09-19 03:52:13 +00:00
Hongxin Liu
dabc2e7430
[release] update version (#6062) 2024-09-19 10:45:32 +08:00
Camille Zhong
f9546ba0be
[ColossalEval] support for vllm (#6056)
* support vllm

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* modify vllm and update readme

* run pre-commit

* remove duplicated lines and refine code

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update param name

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine code

* update readme

* refine code

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-09-18 17:09:45 +08:00
duanjunwen
af2c2f8092 [feat] add more test; 2024-09-18 07:51:54 +00:00
duanjunwen
3dbad102cf [fix] fix zerobubble pp for shardformer type input; 2024-09-18 07:14:34 +00:00
botbw
4fa6b9509c
[moe] add parallel strategy for shared_expert && fix test for deepseek (#6063) 2024-09-18 10:09:01 +08:00
Wang Binluo
63314ce4e4
Merge pull request #6064 from wangbluo/fix_attn
[sp] : fix the attention kernel for sp
2024-09-18 10:08:15 +08:00
wangbluo
10e4f7da72 fix 2024-09-16 13:45:04 +08:00
Wang Binluo
37e35230ff
Merge pull request #6061 from wangbluo/sp_fix
[sp] : fix the attention kernel for sp
2024-09-14 20:54:35 +08:00
wangbluo
827ef3ee9a fix 2024-09-14 10:40:35 +00:00
Guangyao Zhang
bdb125f83f
[doc] FP8 training and communication document (#6050)
* Add FP8 training and communication document

* add fp8 docstring for plugins

* fix typo

* fix typo
2024-09-14 11:01:05 +08:00
Guangyao Zhang
f20b066c59
[fp8] Disable all_gather intranode. Disable Redundant all_gather fp8 (#6059)
* all_gather only internode, fix pytest

* fix cuda arch <89 compile pytest error

* fix pytest failure

* disable all_gather_into_tensor_flat_fp8

* fix fp8 format

* fix pytest

* fix conversations

* fix chunk tuple to list
2024-09-14 10:40:01 +08:00
wangbluo
b582319273 fix 2024-09-13 10:24:41 +00:00
wangbluo
0ad3129cb9 fix 2024-09-13 09:01:26 +00:00
wangbluo
0b14a5512e fix 2024-09-13 07:06:14 +00:00
botbw
696fced0d7
[fp8] fix missing fp8_comm flag in mixtral (#6057) 2024-09-13 14:30:05 +08:00
wangbluo
dc032172c3 fix 2024-09-13 06:00:58 +00:00
wangbluo
f393867cff fix 2024-09-13 05:24:52 +00:00
wangbluo
6eb8832366 fix 2024-09-13 05:06:56 +00:00
wangbluo
683179cefd fix 2024-09-13 03:40:56 +00:00
wangbluo
0a01e2a453 fix the attn 2024-09-13 03:38:35 +00:00
pre-commit-ci[bot]
216d54e374 [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
2024-09-13 02:38:40 +00:00
wangbluo
fdd84b9087 fix the sp 2024-09-13 02:32:03 +00:00
duanjunwen
9bc3b6e220 [feat] moehybrid support zerobubble; 2024-09-12 02:51:46 +00:00
flybird11111
a35a078f08
[doc] update sp doc (#6055)
* update sp doc

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* fix

* fix

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-09-11 17:25:14 +08:00