Commit Graph

214 Commits

Hongxin Liu
f313babd11 [gemini] support save state dict in shards (#3581)
* [gemini] support state dict shard

* [gemini] add test state dict shard

* [gemini] polish docstr

* [gemini] fix merge

* [gemini] polish code
2023-04-17 17:11:09 +08:00
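
The idea behind #3581: instead of one monolithic checkpoint, the state dict is written as a sequence of size-bounded shards. A minimal sketch of the technique follows; the helper name and the `max_shard_size` parameter are illustrative assumptions, not necessarily the Gemini API introduced by this commit.

```python
# Illustrative sketch of sharded state-dict saving (not the Gemini API).
import torch
import torch.nn as nn

def shard_state_dict(model: nn.Module, max_shard_size: int = 1 << 30):
    """Yield state-dict shards whose tensors total at most max_shard_size bytes."""
    shard, size = {}, 0
    for name, tensor in model.state_dict().items():
        nbytes = tensor.numel() * tensor.element_size()
        if shard and size + nbytes > max_shard_size:
            yield shard
            shard, size = {}, 0
        shard[name] = tensor
        size += nbytes
    if shard:
        yield shard

model = nn.Linear(1024, 1024)
for i, shard in enumerate(shard_state_dict(model, max_shard_size=2 << 20)):
    torch.save(shard, f"model-shard-{i:05d}.bin")
```
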
YH
d329c294ec Add docstr for zero3 chunk search utils (#3572) 2023-04-17 12:44:17 +08:00
Hongxin Liu
173dad0562 [misc] add verbose arg for zero and op builder (#3552)
* [misc] add print verbose

* [gemini] add print verbose

* [zero] add print verbose for low level

* [misc] add print verbose for op builder
2023-04-17 11:25:35 +08:00
Hongxin Liu
152239bbfa [gemini] gemini supports lazy init (#3379)
* [gemini] fix nvme optimizer init

* [gemini] gemini supports lazy init

* [gemini] add init example

* [gemini] add fool model

* [zero] update gemini ddp

* [zero] update init example

* add chunk method

* add chunk method

* [lazyinit] fix lazy tensor tolist

* [gemini] fix buffer materialization

* [misc] remove useless file

* [booster] update gemini plugin

* [test] update gemini plugin test

* [test] fix gemini plugin test

* [gemini] fix import

* [gemini] fix import

* [lazyinit] use new metatensor

* [lazyinit] use new metatensor

* [lazyinit] fix __set__ method
2023-04-12 16:03:25 +08:00
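
Lazy init lets a huge model be constructed without allocating real parameter storage, deferring materialization until after placement decisions are made. Below is a sketch of the underlying mechanism using plain PyTorch meta tensors; it shows the general idea, not the Gemini lazy-init API itself.

```python
# The idea behind lazy init, sketched with plain PyTorch meta tensors.
import torch
import torch.nn as nn

with torch.device("meta"):                # parameters are shape-only placeholders
    model = nn.Transformer(d_model=512, nhead=8)

# ... chunking/placement can be planned here from shapes and dtypes alone ...

model = model.to_empty(device="cpu")      # allocate real, uninitialized storage
for p in model.parameters():              # then initialize (shard by shard in practice)
    nn.init.normal_(p, std=0.02)
```
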
YH
bcf0cbcbe7 [doc] Add docs for clip args in zero optim (#3504) 2023-04-10 11:11:28 +08:00
ver217
573af84184 [example] update examples related to zero/gemini (#3431)
* [zero] update legacy import

* [zero] update examples

* [example] fix opt tutorial

* [example] fix opt tutorial

* [example] fix opt tutorial

* [example] fix opt tutorial

* [example] fix import
2023-04-04 17:32:51 +08:00
ver217
26b7aac0be [zero] reorganize zero/gemini folder structure (#3424)
* [zero] refactor low-level zero folder structure

* [zero] fix legacy zero import path

* [zero] fix legacy zero import path

* [zero] remove useless import

* [zero] refactor gemini folder structure

* [zero] refactor gemini folder structure

* [zero] refactor legacy zero import path

* [zero] refactor gemini folder structure

* [zero] refactor gemini folder structure

* [zero] refactor gemini folder structure

* [zero] refactor legacy zero import path

* [zero] fix test import path

* [zero] fix test

* [zero] fix circular import

* [zero] update import
2023-04-04 13:48:16 +08:00
YH
80aed29cd3 [zero] Refactor ZeroContextConfig class using dataclass (#3186) 2023-03-21 12:36:47 +08:00
The refactor in #3186 follows the standard dataclass pattern; a generic before/after sketch is shown below (the field names are illustrative guesses, not necessarily ZeroContextConfig's real attributes).
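
```python
# Generic before/after of a dataclass refactor; field names are illustrative.
from dataclasses import dataclass

class ZeroContextConfigBefore:            # hand-written boilerplate
    def __init__(self, target_device, replicated=True, shard_param=False):
        self.target_device = target_device
        self.replicated = replicated
        self.shard_param = shard_param

@dataclass
class ZeroContextConfigAfter:             # __init__/__repr__/__eq__ generated
    target_device: str
    replicated: bool = True
    shard_param: bool = False
```
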
YH
9d644ff09f Fix docstr for zero statedict (#3185) 2023-03-21 11:48:21 +08:00
ver217
823f3b9cf4 [doc] add deepspeed citation and copyright (#2996)
* [doc] add deepspeed citation and copyright

* [doc] add deepspeed citation and copyright

* [doc] add deepspeed citation and copyright
2023-03-04 20:08:11 +08:00
YH
7b13f7db18 [zero] trivial zero optimizer refactoring (#2869)
* Fix minor grad store interface

* Apply lint
2023-02-27 14:04:53 +08:00
Boyuan Yao
8e3f66a0d1 [zero] fix wrong import (#2777) 2023-02-17 10:26:07 +08:00
Nikita Shulga
01066152f1 Don't use torch._six (#2775)
* Don't use `torch._six`

This is a private API that was removed upstream by https://github.com/pytorch/pytorch/pull/94709

* Update common.py
2023-02-17 09:22:45 +08:00
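
The fix is the usual migration away from the removed private module, e.g. as sketched below (assuming the imported symbol was `inf`; other `torch._six` symbols map to standard-library equivalents the same way):

```python
# before (breaks once torch._six is removed):
#     from torch._six import inf
# after, using the standard library:
from math import inf

def found_inf_or_nan(total_norm: float) -> bool:
    return total_norm == inf or total_norm != total_norm  # inf or NaN
```
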
YH
ae86a29e23 Refactor grad store method (#2687) 2023-02-15 22:27:58 +08:00
HELSON
df4f020ee3 [zero1&2] only append parameters with gradients (#2681) 2023-02-13 18:00:16 +08:00
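
The filtering idea behind #2681, sketched in plain PyTorch (the helper name is made up for illustration): frozen parameters are skipped when the optimizer builds its working parameter list.

```python
# Skip frozen parameters (requires_grad=False) when collecting params.
import torch.nn as nn

def trainable_params(model: nn.Module):
    return [p for p in model.parameters() if p.requires_grad]

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))
model[0].weight.requires_grad_(False)               # freeze one weight
kept = {id(p) for p in trainable_params(model)}
assert id(model[0].weight) not in kept
```
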
HELSON
b528eea0f0 [zero] add zero wrappers (#2523)
* [zero] add zero wrappers

* change names

* add wrapper functions to init
2023-01-29 17:52:58 +08:00
HELSON
077a5cdde4 [zero] fix gradient clipping in hybrid parallelism (#2521)
* [zero] fix gradient clipping in hybrid parallelism

* [testing] change model name to avoid pytest warning

* [hotfix] fix unit testing
2023-01-29 15:09:57 +08:00
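
Under hybrid (data + model) parallelism, each rank holds only gradient shards, so the global norm must be reduced across the group that shards them before clipping; clipping local norms independently gives the wrong scale. A hedged sketch of that reduction (the `shard_group` handle is a placeholder, not the actual API):

```python
# Global-norm computation when gradients are sharded across a process group.
import torch
import torch.distributed as dist

def clip_grad_norm_sharded(local_grads, max_norm: float, shard_group=None):
    device = local_grads[0].device
    local_sq = torch.zeros(1, device=device)
    for g in local_grads:                 # sum of squared norms of local shards
        local_sq += g.float().pow(2).sum()
    # reduce over every rank that holds a shard of the gradients
    dist.all_reduce(local_sq, op=dist.ReduceOp.SUM, group=shard_group)
    total_norm = local_sq.sqrt().item()
    scale = max_norm / (total_norm + 1e-6)
    if scale < 1.0:
        for g in local_grads:
            g.mul_(scale)
    return total_norm
```
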
HELSON
d565a24849 [zero] add unit testings for hybrid parallelism (#2486) 2023-01-18 10:36:10 +08:00
HELSON
a5dc4253c6 [zero] polish low level optimizer (#2473) 2023-01-13 14:56:17 +08:00
Jiarui Fang
867c8c2d3a [zero] low level optim supports ProcessGroup (#2464) 2023-01-13 10:05:58 +08:00
HELSON
7829aa094e [ddp] add is_ddp_ignored (#2434)
[ddp] rename to is_ddp_ignored
2023-01-11 12:22:45 +08:00
HELSON
62c38e3330 [zero] polish low level zero optimizer (#2275) 2023-01-03 17:22:34 +08:00
HELSON
a7d95b7024 [example] add zero1, zero2 example in GPT examples (#2146)
* [example] add zero1 and zero2 for GPT

* update readme in gpt example

* polish code

* change init value

* update readme
2022-12-20 14:30:27 +08:00
Jiarui Fang
c89c66a858 [Gemini] update API of the ChunkMemStatsCollector. (#2129) 2022-12-14 00:47:06 +08:00
Jiarui Fang
2938edf446 [Gemini] update the non model data record method in runtime memory tracer (#2128) 2022-12-13 17:11:31 +08:00
Jiarui Fang
e99edfcb51 [NFC] polish comments for Chunk class (#2116) 2022-12-12 15:39:31 +08:00
Jiarui Fang
33f4412102 [Gemini] use MemStats to store the tracing data. Separate it from the Collector. (#2084) 2022-12-06 16:43:06 +08:00
Jiarui Fang
b3b89865e2 [Gemini] ParamOpHook -> ColoParamOpHook (#2080) 2022-12-05 17:11:06 +08:00
HELSON
a1ce02d740 [zero] test gradient accumulation (#1964)
* [zero] fix memory leak for zero2

* [zero] test gradient accumulation

* [zero] remove grad clip test
2022-11-29 13:00:30 +08:00
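
The usage pattern such a test exercises, in plain-PyTorch form (illustrative, not the zero optimizer's code): several micro-batch backwards accumulate into `.grad` before one optimizer step.

```python
# Gradient accumulation: step only every accum_steps micro-batches.
import torch
import torch.nn as nn

model = nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4

optimizer.zero_grad()
for step in range(8):
    x = torch.randn(2, 16)
    loss = model(x).sum() / accum_steps   # scale so the sum matches one big batch
    loss.backward()                       # gradients accumulate in p.grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
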
Jiarui Fang
cc0ed7cf33 [Gemini] ZeROHookV2 -> GeminiZeROHook (#1972) 2022-11-17 14:43:49 +08:00
Jiarui Fang
c4739a725a [Gemini] polish memstats collector (#1962) 2022-11-16 15:45:57 +08:00
Jiarui Fang
f7e276fa71 [Gemini] add GeminiAdamOptimizer (#1960) 2022-11-16 14:44:28 +08:00
HELSON
7066dfbf82 [zero] fix memory leak for zero2 (#1955) 2022-11-16 11:43:24 +08:00
HELSON
6e51d296f0 [zero] migrate zero1&2 (#1878)
* add zero1&2 optimizer

* rename test directory

* rename test files

* change tolerance in test
2022-11-11 09:26:40 +08:00
Zihao
20e255d4e8 MemStatsCollectorStatic (#1765) 2022-11-07 16:49:03 +08:00
HELSON
c6a1a62636 [hotfix] fix zero's incompatibility with checkpoint in torch-1.12 (#1786)
* [hotfix] fix zero's incompatibility with checkpoint in torch-1.12

* [zero] add cpu shard init

* [zero] add tiny example test

* [colo_tensor] fix bugs for torch-1.11
2022-11-02 16:11:34 +08:00
CsRic
ea961d8fd1 [NFC] polish colossalai/zero/sharded_param/__init__.py code style (#1717)
Co-authored-by: ric <mkkt_bkkt@mail.ustc.edu.cn>
2022-10-19 12:20:51 +08:00
HELSON
1468e4bcfc [zero] add constant placement policy (#1705)
* fixes a memory leak when a parameter is in fp16 during ZeroDDP init.
* bans chunk release in CUDA memory; a chunk may be released only when it is about to be offloaded.
* adds a constant placement policy, letting users reserve a fixed amount of caching memory space for parameters.
2022-10-14 17:53:16 +08:00
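
A toy sketch of the constant-budget bookkeeping described above (all names hypothetical, not the real placement-policy API): chunks stay resident in CUDA memory until the reserved budget would be exceeded, and only then is one offloaded.

```python
# Hypothetical constant-budget chunk cache, for illustration only.
from collections import OrderedDict

class ConstBudgetCache:
    def __init__(self, budget_bytes: int):
        self.budget = budget_bytes
        self.used = 0
        self.resident = OrderedDict()        # chunk_id -> nbytes, LRU order

    def fetch(self, chunk_id: int, nbytes: int):
        if chunk_id in self.resident:
            self.resident.move_to_end(chunk_id)
            return
        while self.resident and self.used + nbytes > self.budget:
            _evicted, sz = self.resident.popitem(last=False)
            self.used -= sz                  # offload the evicted chunk to CPU here
        self.resident[chunk_id] = nbytes     # upload the chunk to CUDA here
        self.used += nbytes
```
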
HELSON
b28991dd0a [feature] A new ZeRO implementation (#1644) 2022-10-09 09:18:51 +08:00
Jiarui Fang
c5d39215f6 Revert "[feature] new zero implementation (#1623)" (#1643)
This reverts commit 5be118f405.
2022-09-26 10:06:03 +08:00
HELSON
5be118f405 [feature] new zero implementation (#1623) 2022-09-24 19:58:18 +08:00
HELSON
f7f2248771 [moe] fix MoE bugs (#1628)
* remove forced FP32 modules

* correct the positions of no_shard contexts
2022-09-22 13:56:30 +08:00
ver217
c9e8ce67b8 fix moving fp32 shards (#1604) 2022-09-16 17:33:16 +08:00
Fazzie-Maqianli
06dccdde44 [NFC] polish colossalai/zero/sharded_model/reduce_scatter.py code style (#1554) 2022-09-08 22:11:04 +08:00
ver217
821c6172e2 [utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442) 2022-08-11 22:58:58 +08:00
ver217
6df3e19be9 [hotfix] zero optim prevents calling inner optim.zero_grad (#1422) 2022-08-09 16:08:12 +08:00
ver217
8dced41ad0 [zero] zero optim state_dict takes only_rank_0 (#1384)
* zero optim state_dict takes only_rank_0

* fix unit test
2022-07-29 13:22:50 +08:00
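
The `only_rank_0` pattern, sketched with `torch.distributed` object collectives (this function is illustrative, not the optimizer's actual implementation): shards are gathered, but only rank 0 keeps the assembled dict, sparing memory and I/O on the other ranks.

```python
# Illustrative only_rank_0 gather; assumes an initialized process group.
import torch.distributed as dist

def gathered_state_dict(local_shard: dict, only_rank_0: bool = True):
    shards = [None] * dist.get_world_size()
    dist.all_gather_object(shards, local_shard)   # every rank receives all shards
    if only_rank_0 and dist.get_rank() != 0:
        return None                               # spare memory off rank 0
    full = {}
    for shard in shards:
        full.update(shard)
    return full
```
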
ver217
828b9e5e0d [hotfix] fix zero optim save/load state dict (#1381) 2022-07-28 17:19:39 +08:00
ver217
6b43c789fd fix zero optim backward_by_grad and save/load (#1353) 2022-07-21 16:43:58 +08:00
ver217
d068af81a3 [doc] update rst and docstring (#1351)
* update rst

* add zero docstr

* fix docstr

* remove fx.tracer.meta_patch

* fix docstr

* fix docstr

* update fx rst

* fix fx docstr

* remove useless rst
2022-07-21 15:54:53 +08:00