Commit Graph

1213 Commits

Author SHA1 Message Date
github-actions[bot]
c77b3b19be [format] applied code formatting on changed files in pull request 4152 (#4157)
Co-authored-by: github-actions <github-actions@github.com>
2023-07-04 16:07:47 +08:00
Frank Lee
1fb0d95df0 [shardformer] made tensor parallelism configurable (#4144)
* [shardformer] made tensor parallelism configurable

* polish code
2023-07-04 16:05:01 +08:00
Frank Lee
74257cb446 [shardformer] refactored some doc and api (#4137)
* [shardformer] refactored some doc and api

* polish code
2023-07-04 16:05:01 +08:00
Frank Lee
ae035d305d [shardformer] added embedding gradient check (#4124) 2023-07-04 16:05:01 +08:00
Frank Lee
6a88bae4ec [shardformer] integrate with data parallelism (#4103) 2023-07-04 16:05:01 +08:00
Frank Lee
f3b6aaa6b7 [shardformer] supported fused normalization (#4112) 2023-07-04 16:05:01 +08:00
Frank Lee
b1c2901530 [shardformer] supported bloom model (#4098) 2023-07-04 16:05:01 +08:00
Kun Lin
8af29ee47a [shardformer] support vision transformer (#4096)
* first v of vit shardformer

* keep vit

* update

* vit shard add vitattention vitlayer

* update num head shard para

* finish test for vit

* add new_model_class & postprocess

* add vit readme

* delete old files & fix the conflict

* fix sth
2023-07-04 16:05:01 +08:00
jiangmingyan
ac80937138 [shardformer] shardformer support opt models (#4091)
* [shardformer] shardformer support opt models

* [shardformer] shardformer support opt models, fix

* [shardformer] shardformer support opt models, fix

* [shardformer] shardformer support opt models, fix
2023-07-04 16:05:01 +08:00
Frank Lee
d33a44e8c3 [shardformer] refactored layernorm (#4086) 2023-07-04 16:05:01 +08:00
Frank Lee
c4b1b65931 [test] fixed tests failed due to dtensor change (#4082)
* [test] fixed tests failed due to dtensor change

* polish code
2023-07-04 16:05:01 +08:00
FoolPlayer
92f6791095 [shardformer] Add layernorm (#4072)
* add layernorm to bert

* add layernorm test

* add layernorm test with load state dict

* add use_mixedfusedLN in shard config

* refactor policy to support fused_layernorm
2023-07-04 16:05:01 +08:00
Frank Lee
70c58cfd4f [shardformer] supported fused qkv checkpoint (#4073) 2023-07-04 16:05:01 +08:00
FoolPlayer
0803a61412 [shardformer] add linearconv1d test (#4067)
* add linearconv1d test

* add linearconv1d test
2023-07-04 16:05:01 +08:00
Frank Lee
8eb09a4c69 [shardformer] support module saving and loading (#4062)
* [shardformer] support module saving and loading

* polish code
2023-07-04 16:05:01 +08:00
FoolPlayer
7740c55c55 support kit use for bert/gpt test (#4055)
* support kit use for bert test

* support kit test for gpt2
2023-07-04 16:05:01 +08:00
Frank Lee
f22ddacef0 [shardformer] refactored the shardformer layer structure (#4053) 2023-07-04 16:05:01 +08:00
Frank Lee
58df720570 [shardformer] adapted T5 and LLaMa test to use kit (#4049)
* [shardformer] adapted T5 and LLaMa test to use kit

* polish code
2023-07-04 16:05:01 +08:00
FoolPlayer
4021b9a8a2 [shardformer] add gpt2 test and layer class refactor (#4041)
* add gpt2 test and layer class refactor

* add dropout in gpt2 policy
2023-07-04 16:05:01 +08:00
Frank Lee
d857f3dbba [shardformer] supported T5 and its variants (#4045) 2023-07-04 16:05:01 +08:00
Frank Lee
c1d5453e9f [shardformer] adapted llama to the new API (#4036) 2023-07-04 16:05:01 +08:00
FoolPlayer
74d176c8d8 [shardformer] fix bert and gpt downstream with new api (#4024)
* fix bert downstream with new api

* remove comment line
2023-07-04 16:05:01 +08:00
FoolPlayer
507c0ad368 add vocabembedding layer 2023-07-04 16:05:01 +08:00
Frank Lee
3893fa1a8d [shardformer] refactored embedding and dropout to parallel module (#4013)
* [shardformer] refactored embedding and dropout to parallel module

* polish code
2023-07-04 16:05:01 +08:00
FoolPlayer
dfca9678fa integrate with dist layer (#4011) 2023-07-04 16:05:01 +08:00
Frank Lee
015af592f8 [shardformer] integrated linear 1D with dtensor (#3996)
* [shardformer] integrated linear 1D with dtensor

* polish code
2023-07-04 16:05:01 +08:00
Frank Lee
611971248c [device] support init device mesh from process group (#3990) 2023-07-04 16:05:01 +08:00
FoolPlayer
f7774ec0f3 [Shardformer] Downstream bert (#3979)
* add dist dropout in model

* update docstring and bert policy with dropout

* refactor basepolicy and sharded, update bert

* update format

* update gpt2 policy

* update bert policy

* remove unused code

* update readme for new policy usage

* add downstream model of bert

* remove unused code
2023-07-04 16:05:01 +08:00
wukong1992
c1c672d0f0 [shardformer] shardformer support t5 model (#3994)
test t5
2023-07-04 16:05:01 +08:00
wukong1992
6b30dfb7ce [shardformer] support llama model using shardformer (#3969)
adjust layer attr
2023-07-04 16:05:01 +08:00
FoolPlayer
a73130482d [shardformer] Unit test (#3928)
* fix bug in slicer, add slicer unit test

* add dropout test

* use pid as dropout seed

* updata dropout test with local pattern

* ad todo
2023-07-04 16:05:01 +08:00
FoolPlayer
f1cb5ac6bf [shardformer] Align bert value (#3907)
* add bert align test, fix dist loss bug

* forward and backward align

* add ignore index

* add shardformer CI

* add gather_output optional for user in shardconfig

* update readme with optional gather_ouput

* add dist crossentropy loss test, remove unused files

* remove unused file

* remove unused file

* rename the file

* polish code
2023-07-04 16:05:01 +08:00
Baizhou Zhang
0bb0b481b4 [gemini] fix argument naming during chunk configuration searching 2023-06-25 13:34:15 +08:00
github-actions[bot]
a52f62082d [format] applied code formatting on changed files in pull request 4021 (#4022)
Co-authored-by: github-actions <github-actions@github.com>
2023-06-19 11:23:24 +08:00
Frank Lee
a5883aa790 [test] fixed codefactor format report (#4026) 2023-06-16 18:23:02 +08:00
Baizhou Zhang
822c3d4d66 [checkpointio] sharded optimizer checkpoint for DDP plugin (#4002) 2023-06-16 14:14:05 +08:00
Wenhao Chen
725af3eeeb [booster] make optimizer argument optional for boost (#3993)
* feat: make optimizer optional in Booster.boost

* test: skip unet test if diffusers version > 0.10.2
2023-06-15 17:38:42 +08:00
Baizhou Zhang
c9cff7e7fa [checkpointio] General Checkpointing of Sharded Optimizers (#3984) 2023-06-15 15:21:26 +08:00
digger yu
e61ffc77c6 fix typo tests/ (#3936) 2023-06-09 09:49:41 +08:00
Frank Lee
ddcf58cacf Revert "[sync] sync feature/shardformer with develop" 2023-06-09 09:41:27 +08:00
Frank Lee
eb39154d40 [dtensor] updated api and doc (#3845) 2023-06-08 10:18:17 +08:00
Hongxin Liu
ae02d4e4f7 [bf16] add bf16 support (#3882)
* [bf16] add bf16 support for fused adam (#3844)

* [bf16] fused adam kernel support bf16

* [test] update fused adam kernel test

* [test] update fused adam test

* [bf16] cpu adam and hybrid adam optimizers support bf16 (#3860)

* [bf16] implement mixed precision mixin and add bf16 support for low level zero (#3869)

* [bf16] add mixed precision mixin

* [bf16] low level zero optim support bf16

* [text] update low level zero test

* [text] fix low level zero grad acc test

* [bf16] add bf16 support for gemini (#3872)

* [bf16] gemini support bf16

* [test] update gemini bf16 test

* [doc] update gemini docstring

* [bf16] add bf16 support for plugins (#3877)

* [bf16] add bf16 support for legacy zero (#3879)

* [zero] init context support bf16

* [zero] legacy zero support bf16

* [test] add zero bf16 test

* [doc] add bf16 related docstring for legacy zero
2023-06-05 15:58:31 +08:00
Hongxin Liu
dbb32692d2 [lazy] refactor lazy init (#3891)
* [lazy] remove old lazy init

* [lazy] refactor lazy init folder structure

* [lazy] fix lazy tensor deepcopy

* [test] update lazy init test
2023-06-05 14:20:47 +08:00
wukong1992
6b305a99d6 [booster] torch fsdp fix ckpt (#3788) 2023-05-23 16:58:45 +08:00
Frank Lee
615e2e5fc1 [test] fixed lazy init test import error (#3799) 2023-05-23 11:57:15 +08:00
Hongxin Liu
3c07a2846e [plugin] a workaround for zero plugins' optimizer checkpoint (#3780)
* [test] refactor torch ddp checkpoint test

* [plugin] update low level zero optim checkpoint

* [plugin] update gemini optim checkpoint
2023-05-19 19:42:31 +08:00
Hongxin Liu
5452df63c5 [plugin] torch ddp plugin supports sharded model checkpoint (#3775)
* [plugin] torch ddp plugin add save sharded model

* [test] fix torch ddp ckpt io test

* [test] fix torch ddp ckpt io test

* [test] fix low level zero plugin test

* [test] fix low level zero plugin test

* [test] add debug info

* [test] add debug info

* [test] add debug info

* [test] add debug info

* [test] add debug info

* [test] fix low level zero plugin test

* [test] fix low level zero plugin test

* [test] remove debug info
2023-05-18 20:05:59 +08:00
wukong1992
6050f37776 [booster] removed models that don't support fsdp (#3744)
Co-authored-by: 纪少敏 <jishaomin@jishaomindeMBP.lan>
2023-05-15 19:35:21 +08:00
Hongxin Liu
afb239bbf8 [devops] update torch version of CI (#3725)
* [test] fix flop tensor test

* [test] fix autochunk test

* [test] fix lazyinit test

* [devops] update torch version of CI

* [devops] enable testmon

* [devops] fix ci

* [devops] fix ci

* [test] fix checkpoint io test

* [test] fix cluster test

* [test] fix timm test

* [devops] fix ci

* [devops] fix ci

* [devops] fix ci

* [devops] fix ci

* [devops] force sync to test ci

* [test] skip fsdp test
2023-05-15 17:20:56 +08:00
wukong1992
b37797ed3d [booster] support torch fsdp plugin in booster (#3697)
Co-authored-by: 纪少敏 <jishaomin@jishaomindeMBP.lan>
2023-05-15 12:14:38 +08:00