Commit Graph

263 Commits

Author SHA1 Message Date
ver217
8d3250d74b [zero] ZeRO supports pipeline parallel (#477) 2022-03-21 16:55:37 +08:00
ver217
3cb3fc275e zero init ctx receives a dp process group (#471) 2022-03-21 11:18:55 +08:00
ver217
fc8e6db005 [doc] Update docstring for ZeRO (#459)
* polish sharded model docstr

* polish sharded optim docstr

* polish zero docstr

* polish shard strategy docstr
2022-03-18 16:48:20 +08:00
ver217
a241f61b34 [zero] Update initialize for ZeRO (#458)
* polish code

* shard strategy receive pg in shard() / gather()

* update zero engine

* polish code
2022-03-18 16:18:31 +08:00
ver217
642846d6f9 update sharded optim and fix zero init ctx (#457) 2022-03-18 15:44:47 +08:00
Jiarui Fang
e2e9f82588 Revert "[zero] update sharded optim and fix zero init ctx" (#456)
* Revert "polish code"

This reverts commit 8cf7ff08cf.

* Revert "rename variables"

This reverts commit e99af94ab8.

* Revert "remove surplus imports"

This reverts commit 46add4a5c5.

* Revert "update sharded optim and fix zero init ctx"

This reverts commit 57567ee768.
2022-03-18 15:22:43 +08:00
ver217
e99af94ab8 rename variables 2022-03-18 14:25:25 +08:00
ver217
57567ee768 update sharded optim and fix zero init ctx 2022-03-18 14:25:25 +08:00
Jiarui Fang
0fcfb1e00d [test] make zero engine test really work (#447) 2022-03-17 17:24:25 +08:00
Jiarui Fang
237d08e7ee [zero] hybrid cpu adam (#445) 2022-03-17 15:05:41 +08:00
Jiarui Fang
496cbb0760 [hotfix] fix initialize bug with zero (#442) 2022-03-17 13:16:22 +08:00
Jiarui Fang
640a6cd304 [refactor] refactor the initialize method for new zero design (#431) 2022-03-16 19:29:37 +08:00
ver217
fce9432f08 sync before creating empty grad 2022-03-16 14:24:09 +08:00
ver217
ea6905a898 free param.grad 2022-03-16 14:24:09 +08:00
ver217
9506a8beb2 use double buffer to handle grad 2022-03-16 14:24:09 +08:00
Jiarui Fang
adebb3e041 [zero] cuda margin space for OS (#418) 2022-03-15 12:02:19 +08:00
Jiarui Fang
56bb412e72 [polish] use GLOBAL_MODEL_DATA_TRACER (#417) 2022-03-15 11:29:46 +08:00
Jiarui Fang
23ba3fc450 [zero] refactor ShardedOptimV2 init method (#416) 2022-03-15 10:45:55 +08:00
Frank Lee
e79ea44247 [fp16] refactored fp16 optimizer (#392) 2022-03-15 10:05:38 +08:00
Jiarui Fang
21dc54e019 [zero] memtracer to record cuda memory usage of model data and overall system (#395) 2022-03-14 22:05:30 +08:00
Jiarui Fang
370f567e7d [zero] new interface for ShardedOptimv2 (#406) 2022-03-14 20:48:41 +08:00
ver217
63469c0f91 polish code 2022-03-14 15:48:55 +08:00
ver217
88804aee49 add bucket tensor shard strategy 2022-03-14 14:48:32 +08:00
HELSON
7c079d9c33 [hotfix] fixed bugs in ShardStrategy and PcieProfiler (#394) 2022-03-11 18:12:46 +08:00
Jiarui Fang
3af13a2c3e [zero] polish ShardedOptimV2 unittest (#385)
* place params on cpu after zero init context

* polish code

* bucketized cpu gpu tensor transfer

* find a bug in sharded optim unittest

* add offload unittest for ShardedOptimV2.

* polish code and make it more robust
2022-03-11 15:50:28 +08:00
Jiarui Fang
272ebfb57d [bug] shard param during initializing the ShardedModelV2 (#381) 2022-03-11 15:50:28 +08:00
Jiarui Fang
b5f43acee3 [zero] find miss code (#378) 2022-03-11 15:50:28 +08:00
Jiarui Fang
6b6002962a [zero] zero init context collect numel of model (#375) 2022-03-11 15:50:28 +08:00
jiaruifang
d9217e1960 Revert "[zero] bucketized tensor cpu gpu copy (#368)"
This reverts commit bef05489b6.
2022-03-11 15:50:28 +08:00
Jiarui Fang
00670c870e [zero] bucketized tensor cpu gpu copy (#368) 2022-03-11 15:50:28 +08:00
Jiarui Fang
44e4891f57 [zero] able to place params on cpu after zero init context (#365)
* place params on cpu after zero init context

* polish code
2022-03-11 15:50:28 +08:00
ver217
253e54d98a fix grad shape 2022-03-11 15:50:28 +08:00
Jiarui Fang
ea2872073f [zero] global model data memory tracer (#360) 2022-03-11 15:50:28 +08:00
Jiarui Fang
cb34cd384d [test] polish zero-related unittest (#351) 2022-03-11 15:50:28 +08:00
ver217
d0ae0f2215 [zero] update sharded optim v2 (#334) 2022-03-11 15:50:28 +08:00
jiaruifang
5663616921 polish code 2022-03-11 15:50:28 +08:00
jiaruifang
7977422aeb add bert for unittest and sharded model is not able to pass the bert case 2022-03-11 15:50:28 +08:00
ver217
1388671699 [zero] Update sharded model v2 using sharded param v2 (#323) 2022-03-11 15:50:28 +08:00
Jiarui Fang
11bddb6e55 [zero] update zero context init with the updated test utils (#327) 2022-03-11 15:50:28 +08:00
Jiarui Fang
de0468c7a8 [zero] zero init context (#321)
* add zero init context

* add more flags for zero init context
fix bug of repeatedly converting param to ShardedParamV2

* polish code
2022-03-11 15:50:28 +08:00
LuGY
a3269de5c9 [zero] cpu adam kernel (#288)
* Added CPU Adam

* finished the cpu adam

* updated the license

* delete useless parameters, removed resnet

* modified the method of cpu adam unittest

* deleted some useless codes

* removed useless codes

Co-authored-by: ver217 <lhx0217@gmail.com>
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: jiaruifang <fangjiarui123@gmail.com>
2022-03-11 15:50:28 +08:00
Jiarui Fang
90d3aef62c [zero] yet an improved sharded param (#311) 2022-03-11 15:50:28 +08:00
Jiarui Fang
c9e7d9582d [zero] polish shard strategy (#310)
* init shard param from shape tuple

* add more unittests for shard param

* add set_payload method for ShardedParam

* [zero] add sharded tensor class

* polish code

* add shard strategy

* move shard and gather logic to shard strategy from shard tensor.

* polish code
2022-03-11 15:50:28 +08:00
ver217
3092317b80 polish code 2022-03-11 15:50:28 +08:00
ver217
36f9a74ab2 fix sharded param hook and unit test 2022-03-11 15:50:28 +08:00
ver217
001ca624dd impl shard optim v2 and add unit test 2022-03-11 15:50:28 +08:00
Jiarui Fang
74f77e314b [zero] a shard strategy in granularity of tensor (#307) 2022-03-11 15:50:28 +08:00
Jiarui Fang
80364c7686 [zero] sharded tensor (#305)
* init shard param from shape tuple

* add more unittests for shard param

* add set_payload method for ShardedParam

* [zero] add sharded tensor class

* polish code
2022-03-11 15:50:28 +08:00
ver217
b105371ace rename shared adam to sharded optim v2 2022-03-11 15:50:28 +08:00
ver217
70814dc22f fix master params dtype 2022-03-11 15:50:28 +08:00