Commit Graph

73 Commits

Author  SHA1  Message  Date
ver217  7c6c427db1  [zero] trace states of fp16/32 grad and fp32 param (#571)  2022-03-31 16:26:54 +08:00
Jiarui Fang  7675366fce  [polish] rename col_attr -> colo_attr (#558)  2022-03-31 12:25:45 +08:00
ver217  014bac0c49  [zero] hijack p.grad in sharded model (#554)  2022-03-30 18:14:50 +08:00
  * hijack p.grad in sharded model
  * polish comments
  * polish comments
Jiarui Fang  f552b11294  [zero] label state for param fp16 and grad (#551)  2022-03-30 15:57:46 +08:00
Jiarui Fang  214da761d4  [zero] add stateful tensor (#549)  2022-03-30 13:51:37 +08:00
Jiarui Fang  53b1b6e340  [zero] non model data tracing (#545)  2022-03-29 15:45:48 +08:00
ver217  1f90a3b129  [zero] polish ZeroInitContext (#540)  2022-03-29 09:09:04 +08:00
Jiarui Fang  c11ff81b15  [zero] get memory usage of sharded optim v2. (#542)  2022-03-29 09:08:18 +08:00
HELSON  a30e2b4c24  [zero] adapt for no-leaf module in zero (#535)  2022-03-28 17:42:18 +08:00
  only process module's own parameters in Zero context
  add zero hooks for all modules that contain parameters
  gather parameters only belonging to module itself
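The a30e2b4c24 entry above outlines the idea of attaching ZeRO hooks only to modules that own parameters and gathering only each module's own parameters. The snippet below is a minimal, illustrative PyTorch sketch of that idea, not the actual ColossalAI implementation; `register_zero_style_hooks` and the hook bodies are hypothetical placeholders.

```python
# Illustrative sketch only (not the ColossalAI code): register hooks on every
# module that directly owns parameters, and touch only those parameters
# (recurse=False), mirroring the approach described in commit a30e2b4c24.
import torch
import torch.nn as nn


def register_zero_style_hooks(model: nn.Module):
    """Attach pre/post forward hooks to modules that directly own parameters."""
    handles = []
    for module in model.modules():
        own_params = list(module.parameters(recurse=False))
        if not own_params:
            continue  # skip "no-leaf" wrappers that hold no parameters themselves

        def pre_hook(mod, inputs, params=own_params):
            # Placeholder: gather this module's own parameter shards before forward.
            for p in params:
                pass  # e.g. all-gather p's shard here

        def post_hook(mod, inputs, output, params=own_params):
            # Placeholder: re-shard / free the gathered parameters after forward.
            for p in params:
                pass  # e.g. release the full parameter here

        handles.append(module.register_forward_pre_hook(pre_hook))
        handles.append(module.register_forward_hook(post_hook))
    return handles


if __name__ == "__main__":
    net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
    hooks = register_zero_style_hooks(net)
    net(torch.randn(1, 4))  # hooks fire around each parameterized submodule
```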
Jiarui Fang  705f56107c  [zero] refactor model data tracing (#537)  2022-03-28 16:38:18 +08:00
Jiarui Fang  a590ed0ba3  [zero] improve the accuracy of get_memory_usage of sharded param (#538)  2022-03-28 16:19:19 +08:00
Jiarui Fang  37cb70feec  [zero] get memory usage for sharded param (#536)  2022-03-28 15:01:21 +08:00
Jiarui Fang  8d8c5407c0  [zero] refactor model data tracing (#522)  2022-03-25 18:03:32 +08:00
Frank Lee  3601b2bad0  [test] fixed rerun_on_exception and adapted test cases (#487)  2022-03-25 17:25:12 +08:00
Jiarui Fang  4d322b79da  [refactor] remove old zero code (#517)  2022-03-25 14:54:39 +08:00
Jiarui Fang  0bebda6ea5  [zero] fix init device bug in zero init context unittest (#516)  2022-03-25 12:24:18 +08:00
ver217  9ec1ce6ab1  [zero] sharded model support the reuse of fp16 shard (#495)  2022-03-23 14:59:59 +08:00
  * sharded model supports reuse fp16 shard
  * rename variable
  * polish code
  * polish code
  * polish code
ver217  62b0a8d644  [zero] sharded optim support hybrid cpu adam (#486)  2022-03-22 14:56:59 +08:00
  * sharded optim support hybrid cpu adam
  * update unit test
  * polish docstring
Jiarui Fang  b334822163  [zero] polish sharded param name (#484)  2022-03-22 14:36:16 +08:00
  * [zero] polish sharded param name
  * polish code
  * polish
  * polish code
  * polish
  * polish
  * polish
Frank Lee  af185b5519  [test] fixed amp convergence comparison test (#454)  2022-03-18 16:28:16 +08:00
ver217  a241f61b34  [zero] Update initialize for ZeRO (#458)  2022-03-18 16:18:31 +08:00
  * polish code
  * shard strategy receive pg in shard() / gather()
  * update zero engine
  * polish code
ver217  642846d6f9  update sharded optim and fix zero init ctx (#457)  2022-03-18 15:44:47 +08:00
Jiarui Fang  e2e9f82588  Revert "[zero] update sharded optim and fix zero init ctx" (#456)  2022-03-18 15:22:43 +08:00
  * Revert "polish code" (reverts commit 8cf7ff08cf)
  * Revert "rename variables" (reverts commit e99af94ab8)
  * Revert "remove surplus imports" (reverts commit 46add4a5c5)
  * Revert "update sharded optim and fix zero init ctx" (reverts commit 57567ee768)
ver217  8cf7ff08cf  polish code  2022-03-18 14:25:25 +08:00
ver217  46add4a5c5  remove surplus imports  2022-03-18 14:25:25 +08:00
ver217  57567ee768  update sharded optim and fix zero init ctx  2022-03-18 14:25:25 +08:00
Frank Lee  f27d801a13  [test] optimized zero data parallel test (#452)  2022-03-18 11:35:54 +08:00
Jiarui Fang  0fcfb1e00d  [test] make zero engine test really work (#447)  2022-03-17 17:24:25 +08:00
Jiarui Fang  496cbb0760  [hotfix] fix initialize bug with zero (#442)  2022-03-17 13:16:22 +08:00
Jiarui Fang  17b8274f8a  [unitest] polish zero config in unittest (#438)  2022-03-17 10:20:53 +08:00
Jiarui Fang  640a6cd304  [refactory] refactory the initialize method for new zero design (#431)  2022-03-16 19:29:37 +08:00
ver217  fce9432f08  sync before creating empty grad  2022-03-16 14:24:09 +08:00
Jiarui Fang  f9c762df85  [test] merge zero optim tests (#428)  2022-03-16 12:22:45 +08:00
Jiarui Fang  adebb3e041  [zero] cuda margin space for OS (#418)  2022-03-15 12:02:19 +08:00
Jiarui Fang  56bb412e72  [polish] use GLOBAL_MODEL_DATA_TRACER (#417)  2022-03-15 11:29:46 +08:00
Jiarui Fang  23ba3fc450  [zero] refactory ShardedOptimV2 init method (#416)  2022-03-15 10:45:55 +08:00
Jiarui Fang  21dc54e019  [zero] memtracer to record cuda memory usage of model data and overall system (#395)  2022-03-14 22:05:30 +08:00
Jiarui Fang  370f567e7d  [zero] new interface for ShardedOptimv2 (#406)  2022-03-14 20:48:41 +08:00
ver217  54fd37f0e0  polish unit test  2022-03-14 15:06:02 +08:00
Jiarui Fang  3af13a2c3e  [zero] polish ShardedOptimV2 unittest (#385)  2022-03-11 15:50:28 +08:00
  * place params on cpu after zero init context
  * polish code
  * bucketed cpu gpu tensor transfer
  * find a bug in sharded optim unittest
  * add offload unittest for ShardedOptimV2.
  * polish code and make it more robust
Frank Lee  526a318032  [unit test] Refactored test cases with component func (#339)  2022-03-11 15:50:28 +08:00
  * refactored test with component func
  * fixed bug
Jiarui Fang  6b6002962a  [zero] zero init context collect numel of model (#375)  2022-03-11 15:50:28 +08:00
Jiarui Fang  44e4891f57  [zero] able to place params on cpu after zero init context (#365)  2022-03-11 15:50:28 +08:00
  * place params on cpu after zero init context
  * polish code
Jiarui Fang  ea2872073f  [zero] global model data memory tracer (#360)  2022-03-11 15:50:28 +08:00
Jiarui Fang  cb34cd384d  [test] polish zero related unitest (#351)  2022-03-11 15:50:28 +08:00
ver217  532ae79cb0  add test sharded optim with cpu adam (#347)  2022-03-11 15:50:28 +08:00
ver217  d0ae0f2215  [zero] update sharded optim v2 (#334)  2022-03-11 15:50:28 +08:00
ver217  f5f0ad266e  fix bert unit test  2022-03-11 15:50:28 +08:00
jiaruifang  d271f2596b  polish engine unitest  2022-03-11 15:50:28 +08:00
jiaruifang  354c0f9047  polish code  2022-03-11 15:50:28 +08:00