Commit Graph

213 Commits

Author SHA1 Message Date
Frank Lee
bfdc5ccb7b [context] maintain the context object in with statement (#1073) 2022-06-07 10:48:45 +08:00
Jiarui Fang
49832b2344 [refactor] add nn.parallel module (#1068) 2022-06-06 15:34:41 +08:00
Jiarui Fang
a00644079e reorganize colotensor directory (#1062)
* reorganize colotensor directory

* polish code
2022-06-03 18:04:22 +08:00
Ziyue Jiang
df9dcbbff6 [Tensor] add hybrid device demo and fix bugs (#1059) 2022-06-03 12:09:49 +08:00
Ziyue Jiang
7c530b9de2 [Tensor] add Parameter inheritance for ColoParameter (#1041)
* add Parameter inheritance for ColoParameter

* remove tricks

* remove tricks

* polish

* polish
2022-05-30 17:23:44 +08:00
Ziyue Jiang
6c5996a56e [Tensor] add module check and bert test (#1031)
* add Embedding

* Add bert test

* polish

* add check module test

* polish

* polish

* polish

* polish
2022-05-26 18:15:42 +08:00
Ziyue Jiang
32291dd73f [Tensor] add module handler for linear (#1021)
* add module spec for linear

* polish

* polish

* polish
2022-05-26 11:50:44 +08:00
ver217
007ca0df92 fix colo init context (#1026) 2022-05-25 20:41:58 +08:00
ver217
ad536e308e [tensor] refactor colo-tensor (#992)
* refactor colo-tensor and update linear op

* polish code

* polish code

* update ops and unit tests

* update unit tests

* polish code

* rename dist_spec module

* polish code

* polish code

* remove unneeded import

* fix pipelinable
2022-05-19 12:44:59 +08:00
Ziyue Jiang
d73c2b1d79 [Tensor] fix init context (#931)
* change torch.Parameter to ColoParameter

* fix post assignment for init context

* polish

* polish
2022-05-11 15:48:12 +08:00
Ziyue Jiang
dfc88b85ea [Tensor] simplify named param (#928)
* simplify ColoModulize

* simplify ColoModulize

* polish

* polish
2022-05-11 10:54:19 +08:00
YuliangLiu0306
32a45cd7ef [pipelinable] use pipelinable to support GPT model. (#903)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [pipelinable] use pipelinable to support GPT model.

* fix a bug caused by ShardedModel

* polish

* fix front func list
2022-05-11 09:23:58 +08:00
Ziyue Jiang
c195d2814c [Tensor] add from_pretrained support and bert pretrained test (#921)
* add from_pretrained support and test

* polish

* polish

* polish

* polish
2022-05-09 16:11:47 +08:00
Jiarui Fang
ab95ec9aea [Tensor] init ColoParameter (#914) 2022-05-06 12:57:14 +08:00
Jiarui Fang
d16671da75 [Tensor] initialize the ColoOptimizer (#898)
* [Tensor] activation is an attr of ColoTensor

* [Tensor] add optimizer

* only detach parameters in context

* polish code
2022-04-28 15:23:40 +08:00
Jiarui Fang
676f191532 [Tensor] activation is an attr of ColoTensor (#897) 2022-04-28 14:43:22 +08:00
Jiarui Fang
26c49639d8 [Tensor] overriding parameters() for Module using ColoTensor (#889) 2022-04-27 15:28:59 +08:00
ver217
4df6471f5d fix import error (#880) 2022-04-26 19:28:40 +08:00
Jiarui Fang
d01d3b8cb0 colo init context add device attr. (#866) 2022-04-25 14:24:26 +08:00
YuliangLiu0306
c6930d8ddf [pipelinable] use ColoTensor to replace dummy tensor. (#853) 2022-04-24 18:31:22 +08:00
ver217
232142f402 [utils] refactor profiler (#837)
* add model data profiler

* add a subclass of torch.profiler.profile

* refactor folder structure

* remove redundant codes

* polish code

* use GeminiMemoryManager

* fix import path

* fix stm profiler ext

* polish comments

* remove useless file
2022-04-24 17:03:59 +08:00
Jiarui Fang
62f059251b [Tensor] init a tp network training unittest (#849) 2022-04-24 16:43:44 +08:00
ver217
0dea140760 [hotfix] add destructor for stateful tensor (#848)
* add destructor for stateful tensor

* fix colo init context
2022-04-24 15:03:04 +08:00
YuliangLiu0306
35ea6e1023 [pipelinable] use pipelinable context to initialize non-pipeline model (#816)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [pipeline] add module lazy init feature to support large model initialization.

* [pipeline] add to_layer_list and partition method to support arbitrary non-pp model

* refactor the module structure

* polish

* [pipelinable]add unit test for pipelinable

* polish

* polish

* Fix CodeFactor issues.
2022-04-24 13:03:12 +08:00
Jiarui Fang
8789850eea Init Context supports lazily allocating model memory (#842) 2022-04-22 18:03:35 +08:00
Jiarui Fang
eb1b89908c [refactor] moving InsertPostInitMethodToModuleSubClasses to utils. (#824) 2022-04-21 16:03:18 +08:00
Jiarui Fang
227d1cd4b3 [gemini] APIs to set cpu memory capacity (#809) 2022-04-19 16:05:22 +08:00
Jiarui Fang
681addb512 [refactor] moving grad acc logic to engine (#804) 2022-04-19 14:03:21 +08:00
Jiarui Fang
4d9332b4c5 [refactor] moving memtracer to gemini (#801) 2022-04-19 10:13:08 +08:00
HELSON
84c6700b2a [zero] refactor memstats_collector (#746) 2022-04-14 12:01:12 +08:00
HELSON
340e59f968 [utils] add synchronized cuda memory monitor (#740) 2022-04-13 10:50:54 +08:00
Jiarui Fang
53cb584808 [utils] correct cpu memory used and capacity in the context of multi-process (#726) 2022-04-12 14:57:54 +08:00
Frank Lee
2412429d54 [util] fixed activation checkpointing on torch 1.9 (#719) 2022-04-12 09:35:45 +08:00
Jiarui Fang
193dc8dacb [refactor] refactor the memory utils (#715) 2022-04-11 16:47:57 +08:00
LuGY
140263a394 [hotfix] fixed bugs of assigning grad states to non-leaf nodes (#711)
* fixed bugs of assigning grad states to non-leaf nodes

* use detach()
2022-04-11 14:04:58 +08:00
ver217
ab8c6b4a0e [zero] refactor memstats collector (#706)
* refactor memstats collector

* fix disposable

* polish code
2022-04-11 10:46:08 +08:00
ver217
3c9cd5bb5e [zero] stateful tensor manager (#687)
* [WIP] stateful tensor manager

* add eviction strategy

* polish code

* polish code

* polish comment

* add unit test

* fix sampler bug

* polish code

* fix max sampling cnt resetting bug

* fix sampler bug

* polish code

* fix bug

* fix unit test

Co-authored-by: jiaruifang <fangjiarui123@gmail.com>
2022-04-08 17:51:34 +08:00
Jiarui Fang
59bf2dc590 [zero] initialize a stateful tensor manager (#614) 2022-04-06 16:18:49 +08:00
Jiarui Fang
0aab52301e [hotfix] fix a bug in model data stats tracing (#655) 2022-04-03 21:48:06 +08:00
HELSON
e5d615aeee [hotfix] fix bugs in testing (#659)
* remove hybrid adam in test_moe_zero_optim

* fix activation checkpointing and its unit test
2022-04-02 21:58:47 +08:00
LuGY
1e2557e801 [zero] fixed the activation offload (#647)
* fixed the activation offload

* polish
2022-04-02 16:21:32 +08:00
ver217
f5d3a9c2b0 polish checkpoint docstring (#637) 2022-04-02 13:34:33 +08:00
HELSON
055fbf5be6 [zero] adapt zero for unsharded parameters (Optimizer part) (#601) 2022-04-01 20:10:47 +08:00
アマデウス
acae68eb04 [model checkpoint] updated checkpoint save/load utils (#592) 2022-04-01 16:49:21 +08:00
ver217
369a288bf3 polish utils docstring (#620) 2022-04-01 16:36:47 +08:00
LuGY
02b187c14f [zero] add sampling time for memstats collector (#610) 2022-04-01 14:03:00 +08:00
アマデウス
54e688b623 moved ensure_path_exists to utils.common (#591) 2022-04-01 09:46:33 +08:00
Jiarui Fang
e956d93ac2 [refactor] memory utils (#577) 2022-04-01 09:22:33 +08:00
HELSON
e6d50ec107 [zero] adapt zero for unsharded parameters (#561)
* support existing sharded and unsharded parameters in zero

* add unit test for moe-zero model init

* polish moe gradient handler
2022-03-31 18:34:11 +08:00
ver217
7c6c427db1 [zero] trace states of fp16/32 grad and fp32 param (#571) 2022-03-31 16:26:54 +08:00