Commit Graph

22 Commits

Author SHA1 Message Date
HELSON
d7ecaf362b
[zero] fix init bugs in zero context (#686)
* adapt model weight initialization for methods in Pytorch nn.init
2022-04-07 17:38:45 +08:00
Jiarui Fang
e956d93ac2
[refactor] memory utils (#577) 2022-04-01 09:22:33 +08:00
Jiarui Fang
7675366fce
[polish] rename col_attr -> colo_attr (#558) 2022-03-31 12:25:45 +08:00
Jiarui Fang
53b1b6e340
[zero] non model data tracing (#545) 2022-03-29 15:45:48 +08:00
ver217
1f90a3b129
[zero] polish ZeroInitContext (#540) 2022-03-29 09:09:04 +08:00
Jiarui Fang
705f56107c
[zero] refactor model data tracing (#537) 2022-03-28 16:38:18 +08:00
Jiarui Fang
8d8c5407c0
[zero] refactor model data tracing (#522) 2022-03-25 18:03:32 +08:00
Frank Lee
3601b2bad0
[test] fixed rerun_on_exception and adapted test cases (#487) 2022-03-25 17:25:12 +08:00
Jiarui Fang
0bebda6ea5
[zero] fix init device bug in zero init context unittest (#516) 2022-03-25 12:24:18 +08:00
Jiarui Fang
b334822163
[zero] polish sharded param name (#484)
* [zero] polish sharded param name

* polish code

* polish

* polish code

* polish

* polsih

* polish
2022-03-22 14:36:16 +08:00
ver217
a241f61b34
[zero] Update initialize for ZeRO (#458)
* polish code

* shard strategy receive pg in shard() / gather()

* update zero engine

* polish code
2022-03-18 16:18:31 +08:00
Frank Lee
f27d801a13
[test] optimized zero data parallel test (#452) 2022-03-18 11:35:54 +08:00
Jiarui Fang
56bb412e72
[polish] use GLOBAL_MODEL_DATA_TRACER (#417) 2022-03-15 11:29:46 +08:00
Jiarui Fang
21dc54e019
[zero] memtracer to record cuda memory usage of model data and overall system (#395) 2022-03-14 22:05:30 +08:00
ver217
54fd37f0e0 polish unit test 2022-03-14 15:06:02 +08:00
Jiarui Fang
6b6002962a [zero] zero init context collect numel of model (#375) 2022-03-11 15:50:28 +08:00
Jiarui Fang
44e4891f57 [zero] able to place params on cpu after zero init context (#365)
* place params on cpu after zero init context

* polish code
2022-03-11 15:50:28 +08:00
Jiarui Fang
ea2872073f [zero] global model data memory tracer (#360) 2022-03-11 15:50:28 +08:00
ver217
1388671699 [zero] Update sharded model v2 using sharded param v2 (#323) 2022-03-11 15:50:28 +08:00
jiaruifang
dec24561cf show pytest parameterize 2022-03-11 15:50:28 +08:00
Jiarui Fang
11bddb6e55 [zero] update zero context init with the updated test utils (#327) 2022-03-11 15:50:28 +08:00
Jiarui Fang
de0468c7a8 [zero] zero init context (#321)
* add zero init context

* add more flags for zero init context
fix bug of repeated converting param to ShardedParamV2

* polish code
2022-03-11 15:50:28 +08:00