Commit Graph

70 Commits

Author SHA1 Message Date
HELSON
8213f89fd2 [gemini] add fake_release_chunk for keep-gathered chunk in the inference mode (#2671) 2023-02-13 14:35:32 +08:00
HELSON
707b11d4a0 [gemini] update ddp strict mode (#2518) 2023-01-28 14:35:25 +08:00
    * [zero] add strict ddp mode for chunk init
    * [gemini] update gpt example
HELSON
5521af7877 [zero] fix state_dict and load_state_dict for ddp ignored parameters (#2443) 2023-01-11 14:55:41 +08:00
    * [ddp] add is_ddp_ignored
      [ddp] rename to is_ddp_ignored
    * [zero] fix state_dict and load_state_dict
    * fix bugs
    * [zero] update unit test for ZeroDDP
HELSON
bb4e9a311a [zero] add inference mode and its unit test (#2418) 2023-01-11 10:07:37 +08:00
HELSON
ea13a201bb [polish] polish code for get_static_torch_model (#2405) 2023-01-09 17:41:38 +08:00
    * [gemini] polish code
    * [testing] remove code
    * [gemini] make more robust
HELSON
48d33b1b17 [gemini] add get static torch model (#2356) 2023-01-06 13:41:19 +08:00
HELSON
a3100bd50d [testing] add beit model for unit testings (#2196) 2022-12-26 17:35:36 +08:00
    * [testing] add beit model
    * [beit] fix bugs
    * [beit] fix bugs
    * [testing] fix bugs
HELSON
2458659919 [zero] fix error for BEiT models (#2169) 2022-12-26 15:03:54 +08:00
    * [zero] fix error for BEiT models
    * [ColoParameter] add unpack operation for tuple arguments
    * fix bugs
    * fix chunkv2 unit testing
    * add assertion for gradient state
Jiarui Fang
27327a4c90 [example] add palm pytorch version (#2172) 2022-12-22 10:15:34 +08:00
Jiarui Fang
2827f41898 [Gemini] GeminiDPP convert to PyTorch Module. (#2151) 2022-12-20 10:19:36 +08:00
Jiarui Fang
c89c66a858 [Gemini] update API of the chunkmemstatscollector. (#2129) 2022-12-14 00:47:06 +08:00
Jiarui Fang
2938edf446 [Gemini] update the non model data record method in runtime memory tracer (#2128) 2022-12-13 17:11:31 +08:00
Jiarui Fang
deee317b0f [Gemini] test step-tensor mapping using repeated_computed_layers.py (#2127) 2022-12-13 16:34:10 +08:00
Jiarui Fang
8fac837679 [Gemini] update non model data calculation method (#2126) 2022-12-13 15:44:07 +08:00
Jiarui Fang
5efda69735 [Gemini] hotfix the unittest bugs (#2125) 2022-12-13 14:14:55 +08:00
Jiarui Fang
05bb28aacf [Gemini] mapping of preop timestep and param (#2124) 2022-12-13 12:50:24 +08:00
Jiarui Fang
9214d1fe28 [Gemini] chunk init using runtime visited param order (#2115) 2022-12-12 18:06:16 +08:00
Jiarui Fang
e5aa8333e4 [NFC] update chunk manager API (#2119) 2022-12-12 16:57:22 +08:00
Jiarui Fang
e99edfcb51 [NFC] polish comments for Chunk class (#2116) 2022-12-12 15:39:31 +08:00
HELSON
63fbba3c19 [zero] add L2 gradient clipping for ZeRO (#2112) 2022-12-09 18:09:17 +08:00
    * [zero] add L2 gradient clipping
    * [testing] add MlpModel
    * [zero] add unit test for grad clipping
    * fix atol
Jiarui Fang
70a8556946 [gemini] get the param visited order during runtime (#2108) 2022-12-09 16:13:03 +08:00
Jiarui Fang
85efb7ac2e [Gemini] gemini use the runtime memory tracer (RMT) (#2099) 2022-12-07 23:04:02 +08:00
Jiarui Fang
978242326a [Gemini] remove eval in gemini unittests! (#2092) 2022-12-07 11:58:37 +08:00
Jiarui Fang
1fca5d79ea [Gemini] remove GLOBAL_MODEL_DATA_TRACER (#2091) 2022-12-06 22:30:16 +08:00
Jiarui Fang
25abae6d7f [Gemini] use MemStats in Runtime Memory tracer (#2088) 2022-12-06 19:48:20 +08:00
Jiarui Fang
4f21c9e8d9 [Gemini] polish runtime tracer tests (#2077) 2022-12-05 16:22:49 +08:00
Jiarui Fang
a7adad9ccb [Gemini] rename hooks related to runtime mem tracer (#2076) 2022-12-05 15:00:03 +08:00
Jiarui Fang
40b7d55bf3 [Gemini] add albert in test models. (#2075) 2022-12-05 14:09:34 +08:00
Jiarui Fang
616ed91ecd [test] bert test in non-distributed way (#2074) 2022-12-05 13:32:16 +08:00
Jiarui Fang
223332ff7e [Gemini] rename ParamTracerWrapper -> RuntimeMemTracer (#2073) 2022-12-05 12:45:11 +08:00
Jiarui Fang
9f828ef36f [Gemini] remove not used MemtracerWrapper (#2072) 2022-12-05 11:57:59 +08:00
Zihao
38ea4ba1bd [Gemini] fix grad unreleased issue and param recovery issue (#2052) 2022-12-02 16:04:19 +08:00
HELSON
f6178728a0 [gemini] fix init bugs for modules (#2047) 2022-11-30 17:06:10 +08:00
    * [gemini] fix init bugs for modules
    * fix bugs
Zihao
6a9158f1fa [Gemini] free and allocate cuda memory by tensor.storage, add grad hook (#2040) 2022-11-30 15:57:45 +08:00
Jiarui Fang
1e885329f4 [test] align model name with the file name. (#2045) 2022-11-30 15:45:26 +08:00
Jiarui Fang
31c644027b [hotfix] hotfix Gemini for no leaf modules bug (#2043) 2022-11-30 14:53:41 +08:00
HELSON
384cd26314 [zero] fix testing parameters (#2042) 2022-11-30 12:09:32 +08:00
HELSON
17a3c685b0 [zero] fix unit-tests (#2039) 2022-11-30 10:40:31 +08:00
Jiarui Fang
eb7742a4bb [Gemini] more tests for Gemini (#2038) 2022-11-29 17:13:10 +08:00
    * [Gemini] more tests for Gemini
    * polish code
HELSON
537e181705 [testing] fix testing models (#2036) 2022-11-29 13:42:06 +08:00
    * [testing] fix testing models
    * roll back
Jiarui Fang
96134e7be3 [hotfix] add bert test for gemini fwd bwd (#2035) 2022-11-29 11:19:52 +08:00
Jiarui Fang
28aa9a4294 [Gemini] more rigorous unit tests for run_fwd_bwd (#2034) 2022-11-29 09:26:06 +08:00
Zihao
95c4532fff [Gemini] paramWrapper paramTracerHook unitest (#2030) 2022-11-26 13:30:24 +08:00
Jiarui Fang
8daf1b4db1 [Gemini] patch for supporting orch.add_ function for ColoTensor (#2003) 2022-11-25 20:06:35 +08:00
Jiarui Fang
2e9cbfca12 [Gemini] add unitests to check gemini correctness (#2015) 2022-11-24 16:51:45 +08:00
Jiarui Fang
0b0d8f9e17 [hotfix] revert bug PRs (#2016) 2022-11-24 15:28:58 +08:00
Zihao
0160a62a3c [Gemini] param_tracer_wrapper and test case (#2009) 2022-11-24 14:40:33 +08:00
Jiarui Fang
3d907faede [Gemini] add an inline_op_module to common test models and polish unitests. (#2004) 2022-11-23 16:55:54 +08:00
Jiarui Fang
5bec3b2168 [Gemini] open grad checkpoint when model building (#1984) 2022-11-18 16:32:54 +08:00
Jiarui Fang
3712ac7f90 [Gemini] add bert for MemtracerWrapper unintests (#1982) 2022-11-18 14:58:28 +08:00