mirror of https://github.com/hpcaitech/ColossalAI.git synced 2025-10-08 21:34:29 +00:00

19 Commits

Author SHA1 Message Date
ver217
26b7aac0be [zero] reorganize zero/gemini folder structure ()
* [zero] refactor low-level zero folder structure

* [zero] fix legacy zero import path

* [zero] fix legacy zero import path

* [zero] remove useless import

* [zero] refactor gemini folder structure

* [zero] refactor gemini folder structure

* [zero] refactor legacy zero import path

* [zero] refactor gemini folder structure

* [zero] refactor gemini folder structure

* [zero] refactor gemini folder structure

* [zero] refactor legacy zero import path

* [zero] fix test import path

* [zero] fix test

* [zero] fix circular import

* [zero] update import
2023-04-04 13:48:16 +08:00
ver217
823f3b9cf4 [doc] add deepspeed citation and copyright ()
* [doc] add deepspeed citation and copyright

* [doc] add deepspeed citation and copyright

* [doc] add deepspeed citation and copyright
2023-03-04 20:08:11 +08:00
Jiarui Fang
4165eabb1e [hotfix] remove potential circular import ()
* make it faster

* [hotfix] remove circular import
2022-07-14 13:44:26 +08:00
Frank Lee
7f2d2b2b5b [engine] fixed empty op hook check ()
* [engine] fixed empty op hook check

* polish code
2022-06-10 17:27:27 +08:00
Frank Lee
11f54c7b6b [doc] improved docstring and assertion messages for the engine module () 2022-04-26 10:00:18 +08:00
RichardoLuo
ad1e7ab2b2 '[NFC] polish <colossalai/engine/_base_engine.py> code style' ()
Co-authored-by: RichardoLuo <14049555596@qq.com>
2022-04-06 11:40:59 +08:00
YuliangLiu0306
ade05a5d83 [refactor] pipeline, put runtime schedule into engine. () 2022-04-03 20:46:45 +08:00
Liang Bowen
ec5086c49c Refactored docstring to google style 2022-03-29 17:17:47 +08:00
Jie Zhu
73d36618a6 [profiler] add MemProfiler ()
* add memory trainer hook

* fix bug

* add memory trainer hook

* fix import bug

* fix import bug

* add trainer hook

* fix git log bug

* modify `to_tensorboard` function to support better output

* remove useless output

* change the name of `MemProfiler`

* complete memory profiler

* replace error with warning

* finish trainer hook

* modify interface of MemProfiler

* modify `__init__.py` in profiler

* remove unnecessary pass statement

* add usage to doc string

* add usage to trainer hook

* new location to store temp data file
2022-03-29 12:48:34 +08:00
Frank Lee
6a3188167c set criterion as optional in colossalai initialize () 2022-03-11 15:50:28 +08:00
Jie Zhu
d344689274 [profiler] primary memory tracer 2022-03-11 15:50:28 +08:00
Jiarui Fang
569357fea0 add pytorch hooks ()
* add pytorch hooks
fix 

* remove licenses in src code

* add gpu memory tracer

* replacing print with logger in ophooks.
2022-01-25 22:20:54 +08:00
HELSON
0f8c7f9804 Fixed docstring in colossalai () 2022-01-21 10:44:30 +08:00
Frank Lee
e2089c5c15 adapted for sequence parallel () 2022-01-20 13:44:51 +08:00
ver217
96780e6ee4 Optimize pipeline schedule ()
* add pipeline shared module wrapper and update load batch

* added model parallel process group for amp and clip grad ()

* added model parallel process group for amp and clip grad

* update amp and clip with model parallel process group

* remove pipeline_prev/next group ()

* micro batch offload

* optimize pipeline gpu memory usage

* pipeline can receive tensor shape ()

* optimize pipeline gpu memory usage

* fix grad accumulation step counter

* rename classes and functions

Co-authored-by: Frank Lee <somerlee.9@gmail.com>
2021-12-30 15:56:46 +08:00
Frank Lee
35813ed3c4 update examples and sphinx docs for the new api () 2021-12-13 22:07:01 +08:00
Frank Lee
da01c234e1 Develop/experiments ()
* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapted torch amp with tensor parallel ()

* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b7699.

* improved consistency between trainer, engine and schedule ()

Co-authored-by: 1SAA <c2h214748@gmail.com>

* Split conv2d, class token, positional embedding in 2d, Fix random number in ddp
Fix convergence in cifar10, Imagenet1000

* Integrate 1d tensor parallel in Colossal-AI ()

* fixed 1D and 2D convergence ()

* optimized 2D operations

* fixed 1D ViT convergence problem

* Feature/ddp ()

* remove redundancy func in setup () ()

* use env to control the language of doc () ()

* Support TP-compatible Torch AMP and Update trainer API ()

* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapted torch amp with tensor parallel ()

* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b7699.

* improved consistency between trainer, engine and schedule ()

Co-authored-by: 1SAA <c2h214748@gmail.com>

Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>

* add an example of ViT-B/16 and remove w_norm clipping in LAMB ()

* add explanation for ViT example () ()

* support torch ddp

* fix loss accumulation

* add log for ddp

* change seed

* modify timing hook

Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* Feature/pipeline ()

* remove redundancy func in setup () ()

* use env to control the language of doc () ()

* Support TP-compatible Torch AMP and Update trainer API ()

* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapted torch amp with tensor parallel ()

* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b7699.

* improved consistency between trainer, engine and schedule ()

Co-authored-by: 1SAA <c2h214748@gmail.com>

Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>

* add an example of ViT-B/16 and remove w_norm clipping in LAMB ()

* add explanation for ViT example () ()

* optimize communication of pipeline parallel

* fix grad clip for pipeline

Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* optimized 3d layer to fix slow computation; tested imagenet performance with 3d; reworked lr_scheduler config definition; fixed launch args; fixed some printing issues; simplified apis of 3d layers ()

* Update 2.5d layer code to get a similar accuracy on imagenet-1k dataset

* update api for better usability ()

update api for better usability

Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
Co-authored-by: puck_WCR <46049915+WANG-CR@users.noreply.github.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: アマデウス <kurisusnowdeng@users.noreply.github.com>
Co-authored-by: BoxiangW <45734921+BoxiangW@users.noreply.github.com>
2021-12-09 15:08:29 +08:00
Frank Lee
3defa32aee Support TP-compatible Torch AMP and Update trainer API ()
* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapted torch amp with tensor parallel ()

* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b7699.

* improved consistency between trainer, engine and schedule ()

Co-authored-by: 1SAA <c2h214748@gmail.com>

Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
2021-11-18 19:45:06 +08:00
zbian
404ecbdcc6 Migrated project 2021-10-28 18:21:23 +02:00