85 Commits

Author SHA1 Message Date
HELSON
84fd7c1d4d add moe context, moe utilities and refactor gradient handler (#455) 2022-03-18 16:38:32 +08:00
Frank Lee
af185b5519 [test] fixed amp convergence comparison test (#454) 2022-03-18 16:28:16 +08:00
ver217
a241f61b34 [zero] Update initialize for ZeRO (#458)
* polish code

* shard strategy receive pg in shard() / gather()

* update zero engine

* polish code
2022-03-18 16:18:31 +08:00
ver217
642846d6f9 update sharded optim and fix zero init ctx (#457) 2022-03-18 15:44:47 +08:00
Jiarui Fang
e2e9f82588 Revert "[zero] update sharded optim and fix zero init ctx" (#456)
* Revert "polish code"

This reverts commit 8cf7ff08cf.

* Revert "rename variables"

This reverts commit e99af94ab8.

* Revert "remove surplus imports"

This reverts commit 46add4a5c5.

* Revert "update sharded optim and fix zero init ctx"

This reverts commit 57567ee768.
2022-03-18 15:22:43 +08:00
ver217
8cf7ff08cf polish code 2022-03-18 14:25:25 +08:00
ver217
46add4a5c5 remove surplus imports 2022-03-18 14:25:25 +08:00
ver217
57567ee768 update sharded optim and fix zero init ctx 2022-03-18 14:25:25 +08:00
Frank Lee
f27d801a13 [test] optimized zero data parallel test (#452) 2022-03-18 11:35:54 +08:00
Jiarui Fang
0fcfb1e00d [test] make zero engine test really work (#447) 2022-03-17 17:24:25 +08:00
Frank Lee
bb2790cf0b optimize engine and trainer test (#448) 2022-03-17 15:44:17 +08:00
Frank Lee
b72b8445c6 optimized context test time consumption (#446) 2022-03-17 14:40:52 +08:00
Jiarui Fang
496cbb0760 [hotfix] fix initialize bug with zero (#442) 2022-03-17 13:16:22 +08:00
Jiarui Fang
17b8274f8a [unitest] polish zero config in unittest (#438) 2022-03-17 10:20:53 +08:00
Jiarui Fang
640a6cd304 [refactory] refactory the initialize method for new zero design (#431) 2022-03-16 19:29:37 +08:00
ver217
fce9432f08 sync before creating empty grad 2022-03-16 14:24:09 +08:00
Jiarui Fang
f9c762df85 [test] merge zero optim tests (#428) 2022-03-16 12:22:45 +08:00
Jiarui Fang
5d7dc3525b [hotfix] run cpu adam unittest in pytest (#424) 2022-03-16 10:39:55 +08:00
Jiarui Fang
adebb3e041 [zero] cuda margin space for OS (#418) 2022-03-15 12:02:19 +08:00
Jiarui Fang
56bb412e72 [polish] use GLOBAL_MODEL_DATA_TRACER (#417) 2022-03-15 11:29:46 +08:00
Jiarui Fang
23ba3fc450 [zero] refactory ShardedOptimV2 init method (#416) 2022-03-15 10:45:55 +08:00
Frank Lee
e79ea44247 [fp16] refactored fp16 optimizer (#392) 2022-03-15 10:05:38 +08:00
Jiarui Fang
21dc54e019 [zero] memtracer to record cuda memory usage of model data and overall system (#395) 2022-03-14 22:05:30 +08:00
Jiarui Fang
a37bf1bc42 [hotfix] rm test_tensor_detector.py (#413) 2022-03-14 21:39:48 +08:00
Jiarui Fang
370f567e7d [zero] new interface for ShardedOptimv2 (#406) 2022-03-14 20:48:41 +08:00
LuGY
a9c27be42e Added tensor detector (#393)
* Added tensor detector

* Added the - states

* Allowed changing include_cpu when calling detect()
2022-03-14 18:01:46 +08:00
ver217
54fd37f0e0 polish unit test 2022-03-14 15:06:02 +08:00
Frank Lee
1e4bf85cdb fixed bug in activation checkpointing test (#387) 2022-03-11 15:50:28 +08:00
Jiarui Fang
3af13a2c3e [zero] polish ShardedOptimV2 unittest (#385)
* place params on cpu after zero init context

* polish code

* bucketized cpu gpu tensor transfer

* find a bug in sharded optim unittest

* add offload unittest for ShardedOptimV2.

* polish code and make it more robust
2022-03-11 15:50:28 +08:00
Frank Lee
526a318032 [unit test] Refactored test cases with component func (#339)
* refactored test with component func

* fixed bug
2022-03-11 15:50:28 +08:00
LuGY
de46450461 Added activation offload (#331)
* Added activation offload

* Fixed the import bug, used the pytest
2022-03-11 15:50:28 +08:00
Jiarui Fang
b5f43acee3 [zero] find miss code (#378) 2022-03-11 15:50:28 +08:00
Jiarui Fang
6b6002962a [zero] zero init context collect numel of model (#375) 2022-03-11 15:50:28 +08:00
jiaruifang
d9217e1960 Revert "[zero] bucketized tensor cpu gpu copy (#368)"
This reverts commit bef05489b6.
2022-03-11 15:50:28 +08:00
Jiarui Fang
00670c870e [zero] bucketized tensor cpu gpu copy (#368) 2022-03-11 15:50:28 +08:00
Jiarui Fang
44e4891f57 [zero] able to place params on cpu after zero init context (#365)
* place params on cpu after zero init context

* polish code
2022-03-11 15:50:28 +08:00
Jiarui Fang
ea2872073f [zero] global model data memory tracer (#360) 2022-03-11 15:50:28 +08:00
Jiarui Fang
cb34cd384d [test] polish zero related unitest (#351) 2022-03-11 15:50:28 +08:00
ver217
532ae79cb0 add test sharded optim with cpu adam (#347) 2022-03-11 15:50:28 +08:00
HELSON
425bb0df3f Added Profiler Context to manage all profilers (#340) 2022-03-11 15:50:28 +08:00
ver217
d0ae0f2215 [zero] update sharded optim v2 (#334) 2022-03-11 15:50:28 +08:00
ver217
2b8cddd40e skip bert in test engine 2022-03-11 15:50:28 +08:00
ver217
f5f0ad266e fix bert unit test 2022-03-11 15:50:28 +08:00
jiaruifang
d271f2596b polish engine unitest 2022-03-11 15:50:28 +08:00
jiaruifang
354c0f9047 polish code 2022-03-11 15:50:28 +08:00
jiaruifang
4d94cd513e adapting bert unitest interface 2022-03-11 15:50:28 +08:00
jiaruifang
7977422aeb add bert for unitest and sharded model is not able to pass the bert case 2022-03-11 15:50:28 +08:00
ver217
1388671699 [zero] Update sharded model v2 using sharded param v2 (#323) 2022-03-11 15:50:28 +08:00
jiaruifang
799d105bb4 using pytest parametrize 2022-03-11 15:50:28 +08:00
jiaruifang
dec24561cf show pytest parameterize 2022-03-11 15:50:28 +08:00