Commit Graph

998 Commits

Author SHA1 Message Date
HELSON
a1ce02d740 [zero] test gradient accumulation (#1964)
* [zero] fix memory leak for zero2

* [zero] test gradient accumulation

* [zero] remove grad clip test
2022-11-29 13:00:30 +08:00
Ziyue Jiang
b0936e4a44 [rpc] split with dag (#2028)
* add DAG to split_module

* add comment

* add test case for DAG

* remove print

* add DAG middleware in scheduler

* add test case for scheduler

* remove break

* recover old lifecycle

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-11-29 11:36:28 +08:00
Jiarui Fang
96134e7be3 [hotfix] add bert test for gemini fwd bwd (#2035) 2022-11-29 11:19:52 +08:00
YuliangLiu0306
0dbcd4a6f5 [autoparallel] add split handler (#2032)
* [autoparallel] add split handler

* add numerical test and runtime passes
2022-11-29 11:03:51 +08:00
Jiarui Fang
28aa9a4294 [Gemini] more rigorous unit tests for run_fwd_bwd (#2034) 2022-11-29 09:26:06 +08:00
YuliangLiu0306
81330b0352 [autoparallel] add experimental permute handler (#2029) 2022-11-27 20:26:52 +08:00
Zihao
95c4532fff [Gemini] paramWrapper paramTracerHook unittest (#2030) 2022-11-26 13:30:24 +08:00
Jiarui Fang
8daf1b4db1 [Gemini] patch for supporting orch.add_ function for ColoTensor (#2003) 2022-11-25 20:06:35 +08:00
Ziyue Jiang
632753abbc [fx]Split partition with DAG information (#2025)
* add DAG to split_module

* add comment

* add test case for DAG

* remove print

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-11-25 17:42:48 +08:00
YuliangLiu0306
ea0f6b8df9 [autoparallel] add runtime pass and numerical test for view handler (#2018) 2022-11-25 15:50:16 +08:00
Zihao
a719b89a41 [gemini] param_trace_hook (#2020) 2022-11-24 18:08:36 +08:00
Jiarui Fang
0b0d8f9e17 [hotfix] revert bug PRs (#2016) 2022-11-24 15:28:58 +08:00
Zihao
aba3db464d [Gemini] ParamMemHook (#2008) 2022-11-24 15:22:51 +08:00
Zihao
0160a62a3c [Gemini] param_tracer_wrapper and test case (#2009) 2022-11-24 14:40:33 +08:00
YuliangLiu0306
1438993113 [autoparallel] add experimental view handler (#2011)
* [autoparallel] add experimental view handler

* polish

* polish

* polish code

* rename variables
2022-11-24 11:34:41 +08:00
Genghan Zhang
d655eea515 [autoparallel] mix gather (#1977)
* Add mix-gather

* Add comments

* Add comments

* Polish comments

* Change the global rank assumption

* Add tests

* Add two-step tests

* Fix 10 and 01

* Skip test because of the number of GPUs
2022-11-23 21:49:17 +08:00
Frank Lee
2bab6f512c [release] release v0.1.11rc4 (#2007) 2022-11-23 17:14:32 +08:00
Boyuan Yao
6cd784ffee [autoparallel] Add metainfo support for F.linear (#1987)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input

* [autoparallel] add pooling metainfo

* [autoparallel] add F.linear metainfo generator
2022-11-23 14:12:34 +08:00
Super Daniel
2edbef13cc [fx] add more meta_registry for MetaTensor execution. (#2000)
* [sc] add examples for auto checkpoint.

* merge upstream

* [fx] add more meta_registry for MetaTensor execution.
2022-11-23 10:55:46 +08:00
Jiarui Fang
a2d3266648 [hotfix] make Gemini work for conv DNN (#1998) 2022-11-22 14:52:36 +08:00
YuliangLiu0306
155891113e [autoparallel] use pytree map style to process data (#1989) 2022-11-21 10:44:22 +08:00
YuliangLiu0306
35e6b9ec82 [autoparallel] adapt handlers with attention block (#1990)
* [autoparallel] adapt handlers with attention block

* polish
2022-11-21 10:44:11 +08:00
YuliangLiu0306
05020e50d0 [autoparallel] support more flexible data type (#1967) 2022-11-18 17:01:06 +08:00
Boyuan Yao
c26f21d365 [autoparallel] add pooling metainfo (#1968)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input

* [autoparallel] add pooling metainfo
2022-11-18 15:13:03 +08:00
Jiarui Fang
3712ac7f90 [Gemini] add bert for MemtracerWrapper unittests (#1982) 2022-11-18 14:58:28 +08:00
Jiarui Fang
e481489aa6 [Gemini] MemtracerWrapper unittests (#1981) 2022-11-18 14:19:40 +08:00
Jiarui Fang
31922110ad [Gemini] memory trace hook (#1978) 2022-11-18 11:52:55 +08:00
Jiarui Fang
0529fcde06 [Gemini] independent runtime tracer (#1974) 2022-11-18 10:53:42 +08:00
YuliangLiu0306
0da1d00399 [autoparallel] support distributed dataloader option (#1906)
* [autoparallel] support distributed dataloader option

* update output handler to support ddp dataloader

* polish code
2022-11-17 20:11:53 +08:00
Genghan Zhang
6630d45546 [autoparallel] Add alpha beta (#1973)
* Add alpha beta

* Fix test

* Fix test
2022-11-17 16:01:14 +08:00
Jiarui Fang
cc0ed7cf33 [Gemini] ZeROHookV2 -> GeminiZeROHook (#1972) 2022-11-17 14:43:49 +08:00
ver217
f8a7148dec [kernel] move all symlinks of kernel to colossalai._C (#1971) 2022-11-17 13:42:33 +08:00
Jiarui Fang
7e24b9b9ee [Gemini] clean unused MemTraceOp (#1970) 2022-11-17 13:41:54 +08:00
Boyuan Yao
7c7921f71b [autoparallel] add torch.nn.ReLU metainfo (#1868)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input
2022-11-16 23:12:31 +08:00
Jiarui Fang
8c66a1d0aa [polish] remove useless file _mem_tracer_hook.py (#1963) 2022-11-16 15:55:10 +08:00
Jiarui Fang
c4739a725a [Gemini] polish memstats collector (#1962) 2022-11-16 15:45:57 +08:00
YuliangLiu0306
fea3cb661c [autoparallel] support addmm in tracer and solver (#1961)
* [fx] patch addmm

* [autoparallel] support addmm in tracer and solver
2022-11-16 14:59:18 +08:00
Jiarui Fang
f7e276fa71 [Gemini] add GeminiAdamOptimizer (#1960) 2022-11-16 14:44:28 +08:00
HELSON
7066dfbf82 [zero] fix memory leak for zero2 (#1955) 2022-11-16 11:43:24 +08:00
Jiarui Fang
52c6ad26e0 [ColoTensor] reconfig ColoInitContext, decouple default_pg and default_dist_spec. (#1953) 2022-11-15 16:24:16 +08:00
zbian
598d456d0e fixed logger 2022-11-15 16:00:07 +08:00
zbian
6877121377 updated flash attention api 2022-11-15 15:25:39 +08:00
YuliangLiu0306
36c0f3ea5b [autoparallel] remove redundancy comm node (#1893) 2022-11-15 10:53:41 +08:00
アマデウス
e52f9d9109 [tensorparallel] fixed tp layers (#1938) 2022-11-14 17:34:03 +08:00
Jiarui Fang
9f4fb3f28a [ColoTensor] ColoInitContext initialize parameters in shard mode. (#1937) 2022-11-14 16:05:09 +08:00
Boyuan Yao
d5c5bc219e [SC] add GPT example for auto checkpoint (#1889)
* [sc] SC tutorial for auto checkpoint

* [sc] polish examples

* [sc] polish readme

* [sc] polish readme and help information

* [sc] polish readme and help information
2022-11-11 23:17:25 +08:00
Junming Wu
14a0b18305 [NFC] polish colossalai/amp/naive_amp/__init__.py code style (#1905) 2022-11-11 17:49:18 +08:00
HELSON
6e51d296f0 [zero] migrate zero1&2 (#1878)
* add zero1&2 optimizer

* rename test directory

* rename test files

* change tolerance in test
2022-11-11 09:26:40 +08:00
Super Daniel
cc55ff0aa4 [autoparallel] user-friendly API for CheckpointSolver. (#1879)
Merge for SC tutorial
2022-11-10 20:59:28 +08:00
Super Daniel
448248b27c [fx] metainfo_trace as an API. (#1873)
* [fx] metainfo_trace as an API.

* [fx] add return.
2022-11-10 20:58:37 +08:00