Develop/experiments (#59)

* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapt torch amp to work with tensor parallel (#18)

* fixed compatibility bugs between torch amp and tensor parallel, along with some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b7699.

* improved consistency between trainer, engine and schedule (#23)

Co-authored-by: 1SAA <c2h214748@gmail.com>

* Split conv2d, class token, and positional embedding in 2D; fix random number handling in DDP
Fix convergence on CIFAR-10 and ImageNet-1000

* Integrate 1d tensor parallel in Colossal-AI (#39)

* fixed 1D and 2D convergence (#38)

* optimized 2D operations

* fixed 1D ViT convergence problem

* Feature/ddp (#49)

* remove redundant func in setup (#19) (#20)

* use env to control the language of doc (#24) (#25)

* Support TP-compatible Torch AMP and Update trainer API (#27)

* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapt torch amp to work with tensor parallel (#18)

* fixed compatibility bugs between torch amp and tensor parallel, along with some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b7699.

* improved consistency between trainer, engine and schedule (#23)

Co-authored-by: 1SAA <c2h214748@gmail.com>

Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>

* add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29)

* add explanation for ViT example (#35) (#36)

* support torch ddp

* fix loss accumulation

* add log for ddp

* change seed

* modify timing hook

Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* Feature/pipeline (#40)

* remove redundant func in setup (#19) (#20)

* use env to control the language of doc (#24) (#25)

* Support TP-compatible Torch AMP and Update trainer API (#27)

* Add gradient accumulation, fix lr scheduler

* fix FP16 optimizer and adapt torch amp to work with tensor parallel (#18)

* fixed compatibility bugs between torch amp and tensor parallel, along with some minor fixes

* fixed trainer

* Revert "fixed trainer"

This reverts commit 2e0b0b7699.

* improved consistency between trainer, engine and schedule (#23)

Co-authored-by: 1SAA <c2h214748@gmail.com>

Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>

* add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29)

* add explanation for ViT example (#35) (#36)

* optimize communication of pipeline parallel

* fix grad clip for pipeline

Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>

* optimized 3d layer to fix slow computation; tested imagenet performance with 3d; reworked lr_scheduler config definition; fixed launch args; fixed some printing issues; simplified apis of 3d layers (#51)

* Update 2.5d layer code to get a similar accuracy on imagenet-1k dataset

* update api for better usability (#58)

update api for better usability

Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
Co-authored-by: puck_WCR <46049915+WANG-CR@users.noreply.github.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: アマデウス <kurisusnowdeng@users.noreply.github.com>
Co-authored-by: BoxiangW <45734921+BoxiangW@users.noreply.github.com>
Author: Frank Lee
Committed by: GitHub
Date: 2021-12-09 15:08:29 +08:00
Commit: da01c234e1 (parent eb2f8b1f6b)
229 changed files with 6532 additions and 8741 deletions


@@ -1,8 +1,6 @@
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
import time
from .cuda import synchronize
@@ -10,7 +8,6 @@ class Timer:
'''
A timer object that helps log execution times and provides tools to assess them.
'''
def __init__(self):
self._started = False
self._start_time = time.time()
@@ -31,7 +28,6 @@ class Timer:
def stop(self, keep_in_history: bool = False):
'''Stop the timer and record the start-stop time interval.
:param keep_in_history: whether to record this start-stop interval in the history, defaults to False
:type keep_in_history: bool, optional
:return: start-stop interval
@@ -48,7 +44,6 @@ class Timer:
def get_history_mean(self):
'''Mean of all recorded start-stop time intervals.
:return: mean of time intervals
:rtype: float
'''
@@ -56,7 +51,6 @@ class Timer:
def get_history_sum(self):
'''Sum of all recorded start-stop time intervals.
:return: sum of time intervals
:rtype: float
'''
@@ -64,7 +58,6 @@ class Timer:
def get_elapsed_time(self):
'''Return the last start-stop time interval. *Use only when the timer is not running.*
:return: the last time interval
:rtype: float
'''
@@ -89,7 +82,6 @@ class MultiTimer:
def start(self, name: str):
'''Start the timer with the given name.
:param name: timer's key
:type name: str
'''
@@ -100,7 +92,6 @@ class MultiTimer:
def stop(self, name: str, keep_in_history: bool):
'''Stop the timer with the given name.
:param name: timer's key
:param keep_in_history: whether to record this start-stop interval in the history
:type keep_in_history: bool
@@ -112,7 +103,6 @@ class MultiTimer:
def get_timer(self, name):
'''Get a timer by its name.
:param name: timer's key
:return: the timer registered under the given name
:rtype: Timer
@@ -121,7 +111,6 @@ class MultiTimer:
def reset(self, name=None):
'''Reset timers.
:param name: if a name is given, only that timer is reset; otherwise all timers are reset, defaults to None
'''
if self._on:
@@ -132,7 +121,6 @@ class MultiTimer:
timer.reset()
def is_on(self):
return self._on
def set_status(self, mode: bool):
@@ -140,4 +128,4 @@ class MultiTimer:
def __iter__(self):
for name, timer in self._timers.items():
yield name, timer
yield name, timer
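
For orientation, here is a minimal usage sketch of the Timer and MultiTimer API touched by this diff. It is an illustration under assumptions, not code from this commit: the import path, the Timer() and MultiTimer() constructors, the Timer.start() call, and the set_status(True) call are filled in from context; stop(), get_history_mean(), get_history_sum(), get_elapsed_time(), and iteration over (name, timer) pairs are the methods visible in the hunks above.

import time

from colossalai.utils import MultiTimer, Timer   # hypothetical import path

# Single timer: time one block of work and keep the interval in the history.
timer = Timer()
timer.start()                                     # start() assumed from the _started/_start_time fields above
time.sleep(0.1)                                   # stand-in for real work
timer.stop(keep_in_history=True)
print(timer.get_elapsed_time())                   # last start-stop interval, in seconds
print(timer.get_history_mean())                   # mean over all recorded intervals

# Named timers managed by a MultiTimer.
timers = MultiTimer()                             # constructor arguments are not shown in the diff
timers.set_status(True)                           # ensure timing is enabled; the default state is not shown
timers.start('forward')
time.sleep(0.05)
timers.stop('forward', keep_in_history=True)

# MultiTimer.__iter__ yields (name, timer) pairs.
for name, t in timers:
    print(name, t.get_history_sum())

The keep_in_history flag controls whether an interval feeds the get_history_* aggregates; without it, only get_elapsed_time() reflects the last measurement.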