Commit Graph

349 Commits

ver217
1f894e033f [gemini] zero supports gemini (#1093)
* add placement policy

* add gemini mgr

* update mem stats collector

* update zero

* update zero optim

* fix bugs

* zero optim monitor os

* polish unit test

* polish unit test

* add assert
2022-06-10 14:48:28 +08:00
ver217
be01db37c8 [tensor] refactor chunk mgr and impl MemStatsCollectorV2 (#1077)
* polish chunk manager

* polish unit test

* impl add_extern_static_tensor for chunk mgr

* add mem stats collector v2

* polish code

* polish unit test

* polish code

* polish get chunks
2022-06-09 20:56:34 +08:00
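
The MemStatsCollectorV2 mentioned above samples memory usage over a training iteration. As a rough, hedged illustration only (the real collector also tracks chunk-managed model data, which this toy omits), a minimal sampler might look like:

```python
# A hedged sketch of a memory-stats sampler; MemStatsCollectorV2 itself
# also accounts for chunk-managed model data, which this toy omits.
import torch

class ToyMemStatsCollector:
    def __init__(self):
        self.samples = []

    def sample(self) -> None:
        # Record (current, peak) CUDA memory at this point in the iteration.
        self.samples.append((torch.cuda.memory_allocated(),
                             torch.cuda.max_memory_allocated()))
```
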
ver217
c5cd3b0f35 [zero] zero optim copy chunk rather than copy tensor (#1070) 2022-06-07 10:30:46 +08:00
Jiarui Fang
49832b2344 [refactor] add nn.parallel module (#1068) 2022-06-06 15:34:41 +08:00
ver217
e3fde4ee6b fix import error in sharded model v2 (#1053) 2022-06-02 13:48:22 +08:00
ver217
51b9a49655 [zero] add zero optimizer for ColoTensor (#1046)
* add zero optimizer

* torch ok

* unit test ok

* polish code

* fix bugs

* polish unit test

* polish zero optim

* polish colo ddp v2

* refactor folder structure

* add comment

* polish unit test

* polish zero optim

* polish unit test
2022-06-02 12:13:15 +08:00
ver217
9492a561c3 [tensor] ColoTensor supports ZeRo (#1015)
* impl chunk manager

* impl param op hook

* add reduce_chunk

* add zero hook v2

* add zero dp

* fix TensorInfo

* impl load balancing when using zero without chunk

* fix zero hook

* polish chunk

* fix bugs

* ddp ok

* zero ok

* polish code

* fix bugs about load balancing

* polish code

* polish code

* add end-to-end test

* polish code

* polish code

* polish code

* fix typo

* add test_chunk

* fix bugs

* fix bugs

* polish code
2022-05-31 12:00:12 +08:00
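
The "impl chunk manager" bullet above refers to packing many small parameter tensors into large fixed-size buffers, so that ZeRO communication runs on a handful of chunks instead of thousands of tensors. A toy sketch of the packing idea; the names and behavior here are assumptions, not the actual ChunkManager API:

```python
import torch

class ToyChunk:
    """Fixed-size buffer that many small parameters share."""

    def __init__(self, capacity: int, dtype=torch.float32):
        self.data = torch.zeros(capacity, dtype=dtype)
        self.offset = 0

    def can_fit(self, tensor: torch.Tensor) -> bool:
        return self.offset + tensor.numel() <= self.data.numel()

    def append(self, tensor: torch.Tensor) -> None:
        n = tensor.numel()
        self.data[self.offset:self.offset + n].copy_(tensor.detach().flatten())
        # Re-point the parameter at its slice so later updates hit the chunk.
        tensor.data = self.data[self.offset:self.offset + n].view(tensor.shape)
        self.offset += n
```

A collective like the reduce_chunk mentioned in the bullets can then run once on `chunk.data` instead of once per parameter.
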
ver217
7cfd6c827e [zero] add load_state_dict for sharded model (#894)
* add load_state_dict for sharded model

* fix bug

* fix bug

* fix ckpt dtype and device

* support load state dict in zero init ctx

* fix bugs
2022-05-27 10:25:08 +08:00
ver217
c4d903e64a [gemini] accelerate adjust_layout() (#878)
* add lru cache

* polish code

* update unit test

* fix sharded optim
2022-04-26 18:08:31 +08:00
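
The "add lru cache" bullet suggests memoizing the layout decision so that repeated calls with identical inputs skip recomputation. A minimal sketch under that reading; `_compute_layout` and its arguments are illustrative, not the real adjust_layout() internals:

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def _compute_layout(tensor_sizes: tuple, cuda_capacity: int):
    # Greedily keep tensors on CUDA until the capacity budget runs out.
    # Inputs must be hashable (hence tuples) for lru_cache to apply.
    cuda_ids, cpu_ids = [], []
    budget = cuda_capacity
    for tid, size in tensor_sizes:
        if size <= budget:
            cuda_ids.append(tid)
            budget -= size
        else:
            cpu_ids.append(tid)
    return tuple(cuda_ids), tuple(cpu_ids)
```
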
HELSON
425b4a96b8 [gemini] polish stateful_tensor_mgr (#876) 2022-04-26 15:05:03 +08:00
ver217
d7e0303d1e [zero] use GeminiMemoryManager when sampling model data (#850) 2022-04-24 17:17:22 +08:00
ver217
0f7ed8c192 fix _post_init_method of zero init ctx (#847) 2022-04-24 14:16:50 +08:00
HELSON
e5ea3fdeef [gemini] add GeminiMemoryManager (#832)
* refactor StatefulTensor, tensor utilities

* add unit test for GeminiMemoryManager
2022-04-24 13:08:48 +08:00
Jiarui Fang
595bedf767 revert zero tensors (#829) 2022-04-22 12:12:35 +08:00
Jiarui Fang
294a6060d0 [tensor] ZeRO use ColoTensor as the base class. (#828)
* [refactor] moving InsertPostInitMethodToModuleSubClasses to utils.

* [tensor] ZeRO use ColoTensor as the base class.

* polish
2022-04-22 12:00:48 +08:00
Jiarui Fang
eb1b89908c [refactor] moving InsertPostInitMethodToModuleSubClasses to utils. (#824) 2022-04-21 16:03:18 +08:00
Jiarui Fang
3ddbd1bce1 [gemini] collect cpu-gpu moving volume in each iteration (#813) 2022-04-20 11:29:48 +08:00
Jiarui Fang
61c20b44bc [log] local throughput metrics (#811)
* Revert "[zero] add ZeroTensorShardStrategy (#793)"

This reverts commit 88759e289e.

* [gemini] set cpu memory capacity

* [log] local throughput collecting

* polish

* polish

* polish

* polish code

* polish
2022-04-20 10:05:39 +08:00
ver217
dd92b90a68 [DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext (#808)
* init fp16 param directly

* polish code
2022-04-19 16:16:48 +08:00
Jiarui Fang
e761ad2cd7 Revert "[zero] add ZeroTensorShardStrategy (#793)" (#806) 2022-04-19 14:40:02 +08:00
HELSON
88759e289e [zero] add ZeroTensorShardStrategy (#793) 2022-04-19 14:32:45 +08:00
Jiarui Fang
4d9332b4c5 [refactor] moving memtracer to gemini (#801) 2022-04-19 10:13:08 +08:00
Jiarui Fang
8711c706f4 [hotfix] fix grad offload when enabling reuse_fp16_shard 2022-04-18 14:58:21 +08:00
ver217
f1fa1a675f fix grad offload when enabling reuse_fp16_shard 2022-04-18 14:07:39 +08:00
HELSON
4c4388c46e [hotfix] fix memory leak in zero (#781) 2022-04-18 13:57:03 +08:00
HELSON
a65cbb7e4e [zero] refactor shard and gather operation (#773) 2022-04-15 14:41:31 +08:00
ver217
6e553748a7 polish sharded optim docstring and warning (#770) 2022-04-14 21:03:59 +08:00
Jiarui Fang
10ef8afdd2 [gemini] init individual gemini directory (#754) 2022-04-14 16:40:26 +08:00
ver217
dcca614eee [hotfix] fix test_stateful_tensor_mgr (#762) 2022-04-14 15:50:09 +08:00
ver217
a93a7d7364 [hotfix] fix reuse_fp16_shard of sharded model (#756)
* fix reuse_fp16_shard

* disable test stm

* polish code
2022-04-14 14:56:46 +08:00
ver217
8f7ce94b8e [hotfix] fix auto tensor placement policy (#753) 2022-04-14 12:04:45 +08:00
HELSON
84c6700b2a [zero] refactor memstats_collector (#746) 2022-04-14 12:01:12 +08:00
Jiarui Fang
3d7dc46d33 [zero] use factory pattern for tensor_placement_policy (#752) 2022-04-14 11:07:29 +08:00
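
The factory pattern named in this title replaces scattered construction of concrete policy classes with a single lookup by name. A hedged sketch; the class and registry names below are illustrative, not Colossal-AI's actual API:

```python
class PlacementPolicy:
    def evict_tensors(self, required_bytes: int) -> None:
        raise NotImplementedError

class CPUPlacementPolicy(PlacementPolicy):
    def evict_tensors(self, required_bytes: int) -> None:
        pass  # everything already lives on CPU; nothing to evict

class CUDAPlacementPolicy(PlacementPolicy):
    def evict_tensors(self, required_bytes: int) -> None:
        pass  # keep everything on CUDA; rely on capacity checks elsewhere

_POLICIES = {'cpu': CPUPlacementPolicy, 'cuda': CUDAPlacementPolicy}

def placement_policy_factory(name: str) -> PlacementPolicy:
    # Callers pick a policy by name instead of importing concrete classes.
    try:
        return _POLICIES[name]()
    except KeyError:
        raise ValueError(f'unknown placement policy: {name!r}')
```
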
ver217
4b048a8728 fix prepare grads in sharded optim (#749) 2022-04-13 22:36:11 +08:00
ver217
e396bb71f2 [zero] add tensor placement policies (#743)
* add tensor placement policies

* polish comments

* polish comments

* update moe unit tests
2022-04-13 15:00:48 +08:00
HELSON
22c4b88d56 [zero] refactor ShardedParamV2 for convenience (#742) 2022-04-13 14:54:26 +08:00
ver217
e6212f56cd [hotfix] fix memory leak in backward of sharded model (#741) 2022-04-13 09:59:05 +08:00
Jiarui Fang
7db3ccc79b [hotfix] remove duplicated param register to stateful tensor manager (#728) 2022-04-12 13:55:25 +08:00
Jiarui Fang
4d90a7b513 [refactor] zero directory (#724) 2022-04-11 23:13:02 +08:00
Jiarui Fang
193dc8dacb [refactor] refactor the memory utils (#715) 2022-04-11 16:47:57 +08:00
HELSON
dbd96fe90a [zero] check whether gradients have inf and nan in gpu (#712) 2022-04-11 15:40:13 +08:00
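
Checking for inf/nan on the GPU, as this title describes, means the reduction runs on the gradient's own device instead of copying the whole tensor back to the host first. A minimal sketch; `has_inf_or_nan` is an illustrative name, not the repository's helper:

```python
import torch

def has_inf_or_nan(grad: torch.Tensor) -> torch.Tensor:
    # torch.isfinite is False for both inf and nan, so one reduction
    # covers both checks; the result is a 0-dim boolean tensor on the
    # same device as the gradient.
    return ~torch.isfinite(grad).all()
```
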
ver217
715b86eadd [hotfix] fix stm cuda model data size (#710) 2022-04-11 15:10:39 +08:00
HELSON
a9b8300d54 [zero] improve adaptability for not-shard parameters (#708)
* adapt post grad hooks for not-shard parameters

* adapt optimizer for not-shard parameters

* offload gradients for not-replicated parameters
2022-04-11 13:38:51 +08:00
ver217
ab8c6b4a0e [zero] refactor memstats collector (#706)
* refactor memstats collector

* fix disposable

* polish code
2022-04-11 10:46:08 +08:00
HELSON
ee112fe1da [zero] adapt zero hooks for unsharded module (#699) 2022-04-08 20:23:26 +08:00
ver217
3c9cd5bb5e [zero] stateful tensor manager (#687)
* [WIP] stateful tensor manager

* add eviction strategy

* polish code

* polish code

* polish comment

* add unit test

* fix sampler bug

* polish code

* fix max sampling cnt resetting bug

* fix sampler bug

* polish code

* fix bug

* fix unit test

Co-authored-by: jiaruifang <fangjiarui123@gmail.com>
2022-04-08 17:51:34 +08:00
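
The "add eviction strategy" bullet suggests moving cold stateful tensors off CUDA until a requested allocation fits. A toy eviction loop under that reading; the dict records stand in for real StatefulTensor objects:

```python
def evict_until_fits(tensors, required_bytes, cuda_capacity):
    """Move least-recently-used CUDA-resident tensors to CPU until
    `required_bytes` more can be allocated within `cuda_capacity`."""
    used = sum(t['bytes'] for t in tensors if t['device'] == 'cuda')
    for t in sorted(tensors, key=lambda t: t['last_used']):
        if used + required_bytes <= cuda_capacity:
            break
        if t['device'] == 'cuda':
            t['device'] = 'cpu'  # stand-in for a real device-to-host move
            used -= t['bytes']
```
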
HELSON
d7ecaf362b [zero] fix init bugs in zero context (#686)
* adapt model weight initialization for methods in PyTorch nn.init
2022-04-07 17:38:45 +08:00
Jiarui Fang
59bf2dc590 [zero] initialize a stateful tensor manager (#614) 2022-04-06 16:18:49 +08:00
HELSON
17e73e62cc [hotfix] fix bugs for unsharded parameters when restore data (#664) 2022-04-03 22:02:11 +08:00
Jiarui Fang
0aab52301e [hotfix] fix a bug in model data stats tracing (#655) 2022-04-03 21:48:06 +08:00