Commit Graph

213 Commits

Author SHA1 Message Date
Jiarui Fang
31c644027b [hotfix] fix Gemini no-leaf-modules bug (#2043) 2022-11-30 14:53:41 +08:00
ver217
f8a7148dec [kernel] move all symlinks of kernel to colossalai._C (#1971) 2022-11-17 13:42:33 +08:00
Jiarui Fang
7e24b9b9ee [Gemini] clean up unused MemTraceOp (#1970) 2022-11-17 13:41:54 +08:00
Jiarui Fang
52c6ad26e0 [ColoTensor] reconfig ColoInitContext, decouple default_pg and default_dist_spec. (#1953) 2022-11-15 16:24:16 +08:00
Jiarui Fang
9f4fb3f28a [ColoTensor] ColoInitContext initialize parameters in shard mode. (#1937) 2022-11-14 16:05:09 +08:00
Frank Lee
e6ec99d389 [utils] fixed lazy init context (#1867) 2022-11-10 15:17:20 +08:00
Jiarui Fang
3ce4463fe6 [utils] remove lazy_memory_allocate from ColoInitContext (#1844) 2022-11-09 11:50:33 +08:00
ver217
99870726b1 [CheckpointIO] a uniform checkpoint I/O module (#1689) 2022-11-08 15:15:13 +08:00
HELSON
1468e4bcfc [zero] add constant placement policy (#1705)
* fixes a memory leak when parameters are in fp16 during ZeroDDP init.
* bans chunk release in CUDA memory; a chunk may be released only when it is about to be offloaded.
* adds a constant placement policy, with which users can reserve a fixed amount of caching memory for parameters.
2022-10-14 17:53:16 +08:00
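A minimal sketch of the idea behind the constant placement policy above, using hypothetical names (Chunk, ConstPlacementPolicy) rather than ColossalAI's actual classes: a fixed CUDA budget is reserved for parameter chunks, and a chunk is offloaded only when that budget would overflow.

```python
# Hypothetical sketch of a "constant" placement policy: reserve a fixed
# CUDA budget for parameter chunks and offload only when it is full.
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    size: int            # bytes occupied by this chunk's parameters
    on_cuda: bool = True


class ConstPlacementPolicy:
    def __init__(self, reserved_cuda_bytes: int):
        # Users choose how much CUDA memory to keep as a parameter cache.
        self.budget = reserved_cuda_bytes
        self.cuda_chunks: List[Chunk] = []

    def cuda_used(self) -> int:
        return sum(c.size for c in self.cuda_chunks)

    def fetch(self, chunk: Chunk) -> None:
        """Bring a chunk to CUDA, offloading others only if the budget overflows."""
        while self.cuda_used() + chunk.size > self.budget and self.cuda_chunks:
            victim = self.cuda_chunks.pop(0)   # FIFO eviction for simplicity
            victim.on_cuda = False             # offload; only now may it be released
        chunk.on_cuda = True
        self.cuda_chunks.append(chunk)
```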
Kirigaya Kazuto
3b2a59b0ba [pipeline/rank_recorder] fix bug when processing data before backward | add a tool for multi-rank debugging (#1681)
* [pipeline/tuning] improve dispatch performance both time and space cost

* [pipeline/converge] add interface for testing convergence

* [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style

* Update PipelineBase.py

* [pipeline/chimera] reconstruct PipelineBase and Worker to support more flexible custom schedules | finish Chimera

* [pipeline/chimera] test chimera | fix initialization bug

* [pipeline/pytree] add pytree to process args and kwargs | provide a way to process args and kwargs after forward
2022-10-09 17:32:57 +08:00
CsRic
2ac46f7be4 [NFC] polish utils/tensor_detector/__init__.py code style (#1573)
Co-authored-by: ric <mkkt_bkkt@mail.ustc.edu.cn>
2022-09-08 22:11:04 +08:00
LuGY
c7d4932956 [NFC] polish colossalai/utils/tensor_detector/tensor_detector.py code style (#1566) 2022-09-08 22:11:04 +08:00
Kirigaya Kazuto
318fbf1145 [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style (#1559) 2022-09-08 22:04:34 +08:00
ver217
ae71036cd2 [utils] refactor parallel layers checkpoint and bcast model on loading checkpoint (#1548)
* refactor parallel layer

* broadcast the rank 0 model after loading the checkpoint
2022-09-06 20:18:35 +08:00
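The broadcast step above follows a standard pattern: rank 0 loads the checkpoint, then each parameter and buffer is broadcast so all ranks hold identical weights. A minimal sketch with plain torch.distributed, assuming the process group is already initialized:

```python
import torch
import torch.distributed as dist


def broadcast_model_from_rank0(model: torch.nn.Module) -> None:
    # Rank 0 has loaded the checkpoint; push its tensors to every rank
    # so all processes start from the same parameters and buffers.
    for tensor in list(model.parameters()) + list(model.buffers()):
        dist.broadcast(tensor.data, src=0)
```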
ver217
2bed096848 [utils] optimize partition_tensor_parallel_state_dict (#1546) 2022-09-06 17:45:31 +08:00
ver217
a203b709d5 [hotfix] fix init context (#1543)
* fix init context

* fix lazy init ctx
2022-09-06 11:45:08 +08:00
Boyuan Yao
47fd8e4a02 [utils] Add use_reentrant=False in utils.activation_checkpoint (#1460)
* [utils] Add use_reentrant=False into colossalai checkpoint

* [utils] add annotations in utils.activation_checkpoint

* [test] add reset_seed at the beginning of tests in test_activation_checkpointing.py

* [test] modify test_activation_checkpoint.py

* [test] modify test for reentrant=False
2022-08-16 15:39:20 +08:00
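use_reentrant=False selects PyTorch's non-reentrant activation checkpointing, which recomputes activations via saved-tensor hooks instead of a custom autograd Function and composes better with keyword arguments and grad-mode changes. In plain PyTorch the flag is used like this:

```python
import torch
from torch.utils.checkpoint import checkpoint


class Block(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(64, 64)

    def forward(self, x):
        # Activations inside `checkpoint` are recomputed during backward
        # instead of stored; use_reentrant=False picks the non-reentrant path.
        return checkpoint(self.linear, x, use_reentrant=False)


block = Block()
out = block(torch.randn(8, 64, requires_grad=True))
out.sum().backward()
```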
Frank Lee
5a52e21fe3 [test] fixed the activation codegen test (#1447)
* [test] fixed the activation codegen test

* polish code
2022-08-12 14:52:31 +08:00
ver217
821c6172e2 [utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442) 2022-08-11 22:58:58 +08:00
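Clipping by global norm, which this commit implements for ColoTensor and ZeroOptimizer, follows the usual formula: compute the total 2-norm over all gradients and rescale them by max_norm / total_norm when that norm exceeds max_norm. A plain-PyTorch sketch of the math, not the PR's distributed implementation:

```python
import torch


def clip_grad_norm(parameters, max_norm: float, eps: float = 1e-6) -> float:
    # Global 2-norm across all gradients, then one rescale if it is too large.
    grads = [p.grad for p in parameters if p.grad is not None]
    total_norm = torch.linalg.vector_norm(
        torch.stack([torch.linalg.vector_norm(g) for g in grads])
    ).item()
    scale = max_norm / (total_norm + eps)
    if scale < 1.0:
        for g in grads:
            g.mul_(scale)
    return total_norm
```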
HELSON
527758b2ae [hotfix] fix a running error in test_colo_checkpoint.py (#1387) 2022-07-29 15:58:06 +08:00
HELSON
b6fd165f66 [checkpoint] add kwargs for load_state_dict (#1374) 2022-07-28 15:56:52 +08:00
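Forwarding kwargs gives callers access to load_state_dict options such as strict; presumably the commit threads them through the checkpoint helpers. The underlying PyTorch flag works like this:

```python
import torch

model = torch.nn.Linear(4, 4)
state_dict = {"weight": torch.zeros(4, 4)}  # note: no "bias" entry

# strict=False tolerates missing/unexpected keys and reports them
# instead of raising; the kind of kwarg a checkpoint API forwards.
result = model.load_state_dict(state_dict, strict=False)
print(result.missing_keys)  # ['bias']
```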
Frank Lee
0c1a16ea5b [util] standard checkpoint function naming (#1377) 2022-07-28 09:29:30 +08:00
Super Daniel
be229217ce [fx] add torchaudio test (#1369)
* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test and test patches

* Delete ~

* [fx] add patches and patches test

* [fx] add patches and patches test

* [fx] fix patches

* [fx] fix rnn patches

* [fx] fix rnn patches

* [fx] fix rnn patches

* [fx] fix rnn patches

* [fx] merge upstream

* [fx] fix import errors
2022-07-27 11:03:14 +08:00
HELSON
8463290642 [checkpoint] use args, kwargs in save_checkpoint, load_checkpoint (#1368) 2022-07-26 14:41:53 +08:00
HELSON
87775a0682 [colotensor] use cpu memory to store state_dict (#1367) 2022-07-26 14:13:38 +08:00
HELSON
943a96323e [hotfix] fix no optimizer in save/load (#1363) 2022-07-26 10:53:53 +08:00
HELSON
7a8702c06d [colotensor] add Tensor.view op and its unit test (#1343)
[colotensor] add megatron initialization for gpt2
2022-07-21 10:53:15 +08:00
Frank Lee
2cc1175c76 [fx] tested the complete workflow for auto-parallel (#1336)
* [fx] tested the complete workflow for auto-parallel

* polish code

* polish code

* polish code
2022-07-20 10:45:17 +08:00
HELSON
f92c100ddd [checkpoint] use gather_tensor in checkpoint and update its unit test (#1339) 2022-07-19 14:15:28 +08:00
Frank Lee
250be4d31e [utils] integrated colotensor with lazy init context (#1324)
* [utils] integrated colotensor with lazy init context

* polish code

* polish code

* polish code
2022-07-15 17:47:12 +08:00
Jiarui Fang
9e4c6449b0 [checkpoint] add ColoOptimizer checkpointing (#1316) 2022-07-15 09:52:55 +08:00
Jiarui Fang
3ef3791a3b [checkpoint] add test for bert and hotfix save bugs (#1297) 2022-07-14 15:38:18 +08:00
Jiarui Fang
4165eabb1e [hotfix] remove potential circular import (#1307)
* make it faster

* [hotfix] remove circular import
2022-07-14 13:44:26 +08:00
Jiarui Fang
c92f84fcdb [tensor] distributed checkpointing for parameters (#1240) 2022-07-12 15:51:06 +08:00
Jiarui Fang
9bcd2fd4af [tensor] a shorter shard and replicate spec (#1245) 2022-07-11 15:51:48 +08:00
Jiarui Fang
20da6e48c8 [checkpoint] save sharded optimizer states (#1237) 2022-07-08 16:33:13 +08:00
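Saving optimizer states shard-by-shard avoids gathering the full state on a single rank. A generic sketch of the pattern; the file layout and names are illustrative, not the PR's actual format:

```python
import torch
import torch.distributed as dist


def save_sharded_optim(optimizer: torch.optim.Optimizer, prefix: str) -> None:
    # Each rank persists only its own shard of the optimizer state, so
    # peak memory stays flat as world size grows (no all-gather needed).
    rank = dist.get_rank()
    torch.save(optimizer.state_dict(), f"{prefix}.rank{rank}.pt")
```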
Jiarui Fang
3b500984b1 [tensor] fix some unittests (#1234) 2022-07-08 14:18:30 +08:00
ver217
a45ddf2d5f [hotfix] fix sharded optim step and clip_grad_norm (#1226) 2022-07-08 13:34:48 +08:00
Yi Zhao
04537bf83e [checkpoint] support generalized scheduler (#1222) 2022-07-07 18:16:38 +08:00
Jiarui Fang
52736205d9 [checkpoint] make unittest faster (#1217) 2022-07-06 17:39:46 +08:00
Jiarui Fang
f38006ea83 [checkpoint] checkpoint for ColoTensor Model (#1196) 2022-07-06 17:22:03 +08:00
Jiarui Fang
ae7d3f4927 [refactor] move process group from _DistSpec to ColoTensor. (#1203) 2022-07-06 16:15:16 +08:00
YuliangLiu0306
63d2a93878 [context] support arbitrary module materialization. (#1193)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [context] support arbitrary module materialization.

* [test] add numerical check for lazy init context.
2022-07-04 10:12:02 +08:00
YuliangLiu0306
2053e138a2 [context] use meta tensor to init model lazily. (#1187)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [context] use meta tensor to init model lazily.

* polish

* make modules with device kwargs bypass the normal init.

* change unit test to adapt updated context.
2022-06-29 21:02:30 +08:00
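Lazy init via meta tensors builds the module structure without allocating parameter storage, then materializes weights later. A minimal sketch with stock PyTorch; a real context would also intercept user modules and replay their custom init:

```python
import torch

# Construct parameters on the meta device: shapes and module structure
# exist, but no real memory is allocated yet.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024, device="meta"),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024, device="meta"),
)

# Materialize later: to_empty allocates uninitialized storage on the
# target device, after which initialization must be (re)applied.
model = model.to_empty(device="cpu")
for m in model.modules():
    if hasattr(m, "reset_parameters"):
        m.reset_parameters()
```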
YuliangLiu0306
e27645376d [hotfix] different overflow statuses lead to communication getting stuck. (#1175)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [hotfix] fix some bugs caused by the refactored schedule.

* [hotfix] different overflow statuses lead to communication getting stuck.
2022-06-27 09:53:57 +08:00
Jiarui Fang
4b9bba8116 [ColoTensor] rename APIs and add output_replicate to ComputeSpec (#1168) 2022-06-24 13:08:54 +08:00
Frank Lee
f8eec98ff5 [tensor] fixed non-serializable colo parameter during model checkpointing (#1153) 2022-06-22 11:43:38 +08:00
Frank Lee
73ad05fc8c [zero] added error message to handle on-the-fly import of torch Module class (#1135)
* [zero] added error message to handle on-the-fly import of torch Module class

* polish code
2022-06-20 11:24:27 +08:00
Frank Lee
2b2dc1c86b [pipeline] refactor the pipeline module (#1087)
* [pipeline] refactor the pipeline module

* polish code
2022-06-10 11:27:38 +08:00
Frank Lee
bad5d4c0a1 [context] support lazy init of module (#1088)
* [context] support lazy init of module

* polish code
2022-06-10 10:09:48 +08:00