Commit Graph

88 Commits

ver217
823f3b9cf4 [doc] add deepspeed citation and copyright (#2996)
* [doc] add deepspeed citation and copyright
2023-03-04 20:08:11 +08:00
YH
7b13f7db18 [zero] trivial zero optimizer refactoring (#2869)
* Fix minor grad store interface

* Apply lint
2023-02-27 14:04:53 +08:00
Boyuan Yao
8e3f66a0d1 [zero] fix wrong import (#2777) 2023-02-17 10:26:07 +08:00
Nikita Shulga
01066152f1 Don't use torch._six (#2775)
* Don't use `torch._six`

This is a private API which is gone after https://github.com/pytorch/pytorch/pull/94709

* Update common.py
2023-02-17 09:22:45 +08:00
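Context for the entry above: `torch._six` was a private Python 2/3 compatibility shim, so imports such as `from torch._six import inf` break on PyTorch builds after pytorch/pytorch#94709. A minimal sketch of the kind of substitution such a fix involves; the helper below is hypothetical, not the actual `common.py` code:

```python
import math

# Before (breaks once torch._six is removed):
#   from torch._six import inf
# After: the standard-library constant serves the same purpose.
inf = math.inf

def grad_norm(parameters, norm_type: float = 2.0) -> float:
    # Hypothetical inf-aware norm helper; clip_grad_norm implementations
    # typically special-case the infinity norm like this.
    grads = [p.grad for p in parameters if p.grad is not None]
    if norm_type == inf:
        return max(g.abs().max().item() for g in grads)
    return sum(g.norm(norm_type).item() ** norm_type for g in grads) ** (1.0 / norm_type)
```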
YH
ae86a29e23 Refactor method of grad store (#2687) 2023-02-15 22:27:58 +08:00
HELSON
df4f020ee3 [zero1&2] only append parameters with gradients (#2681) 2023-02-13 18:00:16 +08:00
HELSON
b528eea0f0 [zero] add zero wrappers (#2523)
* [zero] add zero wrappers

* change names

* add wrapper functions to init
2023-01-29 17:52:58 +08:00
HELSON
077a5cdde4 [zero] fix gradient clipping in hybrid parallelism (#2521)
* [zero] fix gradient clipping in hybrid parallelism

* [testing] change model name to avoid pytest warning

* [hotfix] fix unit testing
2023-01-29 15:09:57 +08:00
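For orientation: under hybrid parallelism each rank holds only a shard of the gradients, so a correct global clip must combine per-rank norms across the process group. A generic sketch of that reduction, assuming an initialized `torch.distributed` group (illustrative, not the patched ColossalAI code):

```python
import torch
import torch.distributed as dist

def global_grad_norm(local_params, group=None, norm_type: float = 2.0) -> float:
    # Sum |g|^p over the local shard, all-reduce the partial sums,
    # then take the p-th root to recover the global norm.
    local = torch.zeros(1)
    for p in local_params:
        if p.grad is not None:
            local += p.grad.norm(norm_type) ** norm_type
    dist.all_reduce(local, op=dist.ReduceOp.SUM, group=group)
    return local.item() ** (1.0 / norm_type)
```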
HELSON
d565a24849 [zero] add unit testings for hybrid parallelism (#2486) 2023-01-18 10:36:10 +08:00
HELSON
a5dc4253c6 [zero] polish low level optimizer (#2473) 2023-01-13 14:56:17 +08:00
Jiarui Fang
867c8c2d3a [zero] low level optim supports ProcessGroup (#2464) 2023-01-13 10:05:58 +08:00
HELSON
62c38e3330 [zero] polish low level zero optimizer (#2275) 2023-01-03 17:22:34 +08:00
HELSON
a7d95b7024 [example] add zero1, zero2 example in GPT examples (#2146)
* [example] add zero1 and zero2 for GPT

* update readme in gpt example

* polish code

* change init value

* update readme
2022-12-20 14:30:27 +08:00
HELSON
a1ce02d740 [zero] test gradient accumulation (#1964)
* [zero] fix memory leak for zero2

* [zero] test gradient accumulation

* [zero] remove grad clip test
2022-11-29 13:00:30 +08:00
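The behavior under test here follows the standard gradient-accumulation pattern, which must keep working when ZeRO shards optimizer state. A self-contained generic sketch in plain PyTorch (not the ZeRO wrapper itself):

```python
import torch
from torch import nn

model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
accumulation_steps = 4  # micro-batches per optimizer step

for step in range(16):
    inputs, targets = torch.randn(2, 8), torch.randn(2, 1)
    loss = criterion(model(inputs), targets)
    # Scale so the accumulated gradient equals the full-batch average.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```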
HELSON
7066dfbf82 [zero] fix memory leak for zero2 (#1955) 2022-11-16 11:43:24 +08:00
HELSON
6e51d296f0 [zero] migrate zero1&2 (#1878)
* add zero1&2 optimizer

* rename test directory

* rename test files

* change tolerance in test
2022-11-11 09:26:40 +08:00
ver217
c9e8ce67b8 fix move fp32 shards (#1604) 2022-09-16 17:33:16 +08:00
ver217
ce470ba37e [checkpoint] sharded optim save/load grad scaler (#1350) 2022-07-21 15:21:21 +08:00
ver217
a45ddf2d5f [hotfix] fix sharded optim step and clip_grad_norm (#1226) 2022-07-08 13:34:48 +08:00
ver217
9e1daa63d2 [zero] sharded optim supports loading local state dict (#1170)
* sharded optim supports loading local state dict

* polish code

* add unit test
2022-06-24 18:05:16 +08:00
ver217
6690a61b4d [hotfix] prevent nested ZeRO (#1140) 2022-06-21 11:33:53 +08:00
ver217
c4d903e64a [gemini] accelerate adjust_layout() (#878)
* add lru cache

* polish code

* update unit test

* fix sharded optim
2022-04-26 18:08:31 +08:00
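The "add lru cache" bullet refers to memoizing a repeated layout computation so `adjust_layout()` stops recomputing identical results. A generic sketch of the technique with `functools.lru_cache`; the function below is hypothetical, not Gemini's internals:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def eviction_plan(cuda_capacity: int, demand: int) -> int:
    # Hypothetical pure function: how many bytes to evict for a given
    # memory snapshot. Because the arguments are hashable and the result
    # is deterministic, repeated calls hit the cache instead of recomputing.
    return max(0, demand - cuda_capacity)
```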
ver217
d7e0303d1e [zero] use GeminiMemoryManager when sampling model data (#850) 2022-04-24 17:17:22 +08:00
HELSON
e5ea3fdeef [gemini] add GeminiMemoryManager (#832)
* refactor StatefulTensor, tensor utilities

* add unit test for GeminiMemoryManager
2022-04-24 13:08:48 +08:00
Jiarui Fang
61c20b44bc [log] local throughput metrics (#811)
* Revert "[zero] add ZeroTensorShardStrategy (#793)"

This reverts commit 88759e289e.

* [gemini] set cpu memory capacity

* [log] local throughput collecting

* polish

* polish code
2022-04-20 10:05:39 +08:00
Jiarui Fang
4d9332b4c5 [refactor] moving memtracer to gemini (#801) 2022-04-19 10:13:08 +08:00
Jiarui Fang
8711c706f4 [hotfix] fix grad offload when enabling reuse_fp16_shard 2022-04-18 14:58:21 +08:00
ver217
f1fa1a675f fix grad offload when enabling reuse_fp16_shard 2022-04-18 14:07:39 +08:00
HELSON
4c4388c46e [hotfix] fix memory leak in zero (#781) 2022-04-18 13:57:03 +08:00
ver217
6e553748a7 polish sharded optim docstr and warning (#770) 2022-04-14 21:03:59 +08:00
Jiarui Fang
10ef8afdd2 [gemini] init gemini individual directory (#754) 2022-04-14 16:40:26 +08:00
ver217
4b048a8728 fix prepare grads in sharded optim (#749) 2022-04-13 22:36:11 +08:00
ver217
e396bb71f2 [zero] add tensor placement policies (#743)
* add tensor placement policies

* polish comments

* polish comments

* update moe unit tests
2022-04-13 15:00:48 +08:00
HELSON
22c4b88d56 [zero] refactor ShardedParamV2 for convenience (#742) 2022-04-13 14:54:26 +08:00
Jiarui Fang
4d90a7b513 [refactor] zero directory (#724) 2022-04-11 23:13:02 +08:00
HELSON
dbd96fe90a [zero] check whether gradients have inf and nan in gpu (#712) 2022-04-11 15:40:13 +08:00
HELSON
a9b8300d54 [zero] improve adaptability for unsharded parameters (#708)
* adapt post-grad hooks for unsharded parameters

* adapt optimizer for unsharded parameters

* offload gradients for non-replicated parameters
2022-04-11 13:38:51 +08:00
HELSON
ee112fe1da [zero] adapt zero hooks for unsharded module (#699) 2022-04-08 20:23:26 +08:00
ver217
3c9cd5bb5e [zero] stateful tensor manager (#687)
* [WIP] stateful tensor manager

* add eviction strategy

* polish code

* polish code

* polish comment

* add unit test

* fix sampler bug

* polish code

* fix max sampling cnt resetting bug

* fix sampler bug

* polish code

* fix bug

* fix unit test

Co-authored-by: jiaruifang <fangjiarui123@gmail.com>
2022-04-08 17:51:34 +08:00
HELSON
17e73e62cc [hotfix] fix bugs for unsharded parameters when restoring data (#664) 2022-04-03 22:02:11 +08:00
Jiarui Fang
0aab52301e [hotfix] fix a bug in model data stats tracing (#655) 2022-04-03 21:48:06 +08:00
HELSON
055fbf5be6 [zero] adapt zero for unsharded parameters (Optimizer part) (#601) 2022-04-01 20:10:47 +08:00
ver217
0ef8819c67 polish docstring of zero (#612) 2022-04-01 14:50:56 +08:00
ver217
9bee119104 [hotfix] fix sharded optim zero grad (#604)
* fix sharded optim zero grad

* polish comments
2022-04-01 12:41:20 +08:00
Jiarui Fang
e956d93ac2 [refactor] memory utils (#577) 2022-04-01 09:22:33 +08:00
ver217
7c6c427db1 [zero] trace states of fp16/32 grad and fp32 param (#571) 2022-03-31 16:26:54 +08:00
Jiarui Fang
7675366fce [polish] rename col_attr -> colo_attr (#558) 2022-03-31 12:25:45 +08:00
ver217
014bac0c49 [zero] hijack p.grad in sharded model (#554)
* hijack p.grad in sharded model

* polish comments

* polish comments
2022-03-30 18:14:50 +08:00
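"Hijacking" `p.grad` means installing a tensor the framework controls as the parameter's gradient buffer, so autograd accumulates into managed storage instead of allocating its own. A generic demonstration of that mechanism (not the ShardedModel code):

```python
import torch
from torch import nn

model = nn.Linear(4, 4)
# Install pre-allocated gradient buffers; autograd accumulates into an
# existing .grad tensor in place rather than allocating a new one.
for p in model.parameters():
    p.grad = torch.zeros_like(p)

model(torch.randn(2, 4)).sum().backward()
# Gradients now live in the buffers installed above.
```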
Jiarui Fang
f552b11294 [zero] label state for param fp16 and grad (#551) 2022-03-30 15:57:46 +08:00
Jiarui Fang
107b99ddb1 [zero] dump memory stats for sharded model (#548) 2022-03-30 09:38:44 +08:00