Commit Graph

292 Commits

Author SHA1 Message Date
Sze-qq
2144cbae8c [NFC] polish colossalai/nn/lr_scheduler/multistep.py code style (#1572) 2022-09-08 22:11:04 +08:00
superhao1995
e4bf7ae667 [NFC] polish colossalai/nn/lr_scheduler/torch.py code style (#1571)
Co-authored-by: Research <research@soccf-snr3-017.comp.nus.edu.sg>
2022-09-08 22:11:04 +08:00
Jiatong Han
3263cdf57f [NFC] polish colossalai/nn/parallel/data_parallel.py code style (#1570)
Co-authored-by: JThh <jiatong.han@u.nus.edu>
2022-09-08 22:11:04 +08:00
DouJS
f586887a90 [NFC] polish colossalai/nn/layer/colossalai_layer/dropout.py code style (#1568) 2022-09-08 22:11:04 +08:00
BigOneLiXiaoMing
0c4c9aa6e0 [NFC] polish colossalai/nn/_ops/embedding.py code style (#1561) 2022-09-08 22:11:04 +08:00
Ofey Chan
7cc052f6c0 [NFC] polish colossalai/nn/layer/colossalai_layer/linear.py (#1556) 2022-09-08 22:11:04 +08:00
yuxuan-lou
413f9c19f4 [NFC] polish colossalai/nn/_ops/layernorm.py code style (#1555) 2022-09-08 22:11:04 +08:00
shenggan
8edb777cc2 [NFC] polish colossalai/nn/loss/loss_2p5d.py code style (#1553) 2022-09-08 22:11:04 +08:00
Maruyama_Aya
bd2d789832 [NFC] polish colossalai/nn/_ops/embedding_bag.py code style (#1552) 2022-09-08 22:11:04 +08:00
binmakeswell
73e9eb13b7 [NFC] polish colossalai/nn/lr_scheduler/cosine.py code style 2022-09-08 22:11:04 +08:00
CsRic
a389ac4ec9 [embedding] cache_embedding small improvement (#1564) 2022-09-08 16:41:19 +08:00
ver217
10dd8226b1 add gather_output for VocabParallelClassifier1D (#1569) 2022-09-08 16:40:56 +08:00
ver217
ae71036cd2 [utils] refactor parallel layers checkpoint and bcast model on loading checkpoint (#1548)
* refactor parallel layer

* broadcast rank0 model after load ckpt
2022-09-06 20:18:35 +08:00
Jiarui Fang
64169f3e8f [embedding] polish parallel embedding tablewise (#1545) 2022-09-06 10:41:20 +08:00
CsRic
964123ae0f [embedding] freq_aware_embedding: add small functions for caller application (#1537) 2022-09-05 15:12:53 +08:00
Jiarui Fang
521078ffc9 [embedding] fix a bug in table wise sharding (#1538) 2022-09-02 15:48:35 +08:00
Jiarui Fang
87134524fd [embedding] tablewise sharding polish (#1535) 2022-09-02 11:09:37 +08:00
CsRic
5156d5b4f8 [embedding] add tablewise sharding for FAW (#1526) 2022-09-01 17:55:41 +08:00
Jiarui Fang
4537d39df9 [doc] docstring for FreqAwareEmbeddingBag (#1525) 2022-08-31 13:52:30 +08:00
Jiarui Fang
9a9ef65313 [FAW] cpu caching operations (#1520) 2022-08-30 14:50:02 +08:00
Jiarui Fang
af5438caa2 [FAW] refactor reorder() for CachedParamMgr (#1514) 2022-08-29 14:22:07 +08:00
Jiarui Fang
9feee6d06b [FAW] LFU initialize with dataset freq (#1513) 2022-08-29 12:52:53 +08:00
CsRic
1b8fee8e9c [FAW] shrink freq_cnter size (#1509) 2022-08-29 11:44:55 +08:00
Jiarui Fang
ba61109b6c [FAW] remove code related to chunk (#1501) 2022-08-26 14:23:30 +08:00
Jiarui Fang
d5085bb317 [FAW] add more docs and fix a warning (#1500) 2022-08-26 14:10:21 +08:00
CsRic
0ed2f46131 [FAW] FAW embedding use LRU as eviction strategy initialized with dataset stats (#1494) 2022-08-26 11:24:12 +08:00
CsRic
b8d0e39eaf [FAW] LFU cache for the FAW 2022-08-25 13:08:46 +08:00
Jiarui Fang
cde7b8a5b8 [FAW] init an LFU implementation for FAW (#1488) 2022-08-24 17:37:22 +08:00
Geng Zhang
0aad53c62b [FCE] update interface for frequency statistics in FreqCacheEmbedding (#1462) 2022-08-23 17:38:24 +08:00
Jiarui Fang
a1476ea882 [NFC] polish doc style for ColoTensor (#1457) 2022-08-16 09:21:05 +08:00
ver217
367c615818 fix nvme docstring (#1450) 2022-08-12 18:01:02 +08:00
Geng Zhang
9f3eed66eb [FAW] reorganize the inheritance struct of FreqCacheEmbedding (#1448) 2022-08-12 15:55:46 +08:00
Frank Lee
ae1b58cd16 [tensor] added linear implementation for the new sharding spec (#1416)
* [tensor] added linear implementation for the new sharding spec

* polish code
2022-08-12 11:33:09 +08:00
Jiarui Fang
30b4dd17c0 [FAW] export FAW in _ops (#1438) 2022-08-11 13:43:24 +08:00
Jiarui Fang
c9427a323f hotfix #1434 (#1437) 2022-08-11 13:14:25 +08:00
Jiarui Fang
10b3df65c8 [FAW] move coloparam setting in test code. (#1429) 2022-08-10 14:31:53 +08:00
Jiarui Fang
cb98cf5558 [FAW] parallel FreqAwareEmbedding (#1424) 2022-08-10 13:44:30 +08:00
Jiarui Fang
d209aff684 Add FreqAwareEmbeddingBag (#1421) 2022-08-09 16:26:12 +08:00
Jiarui Fang
504419d261 [FAW] add cache manager for the cached embedding (#1419) 2022-08-09 15:17:17 +08:00
ver217
12b4887097 [hotfix] fix CPUAdam kernel nullptr (#1410) 2022-08-05 19:45:45 +08:00
ver217
04c9a86af8 [zero] ZeroDDP supports controlling outputs' dtype (#1399) 2022-08-02 17:49:11 +08:00
HELSON
4e98e938ce [zero] alleviate memory usage in ZeRODDP state_dict (#1398) 2022-08-02 15:49:13 +08:00
HELSON
c7221cb2d4 [hotfix] adapt ProcessGroup and Optimizer to ColoTensor (#1388) 2022-07-29 19:33:24 +08:00
ver217
83328329dd [hotfix] fix zero ddp buffer cast (#1376)
* fix zero ddp buffer cast

* fix zero ddp ignore params
2022-07-28 10:54:44 +08:00
ver217
5d5031e946 fix zero ddp state dict (#1378) 2022-07-28 09:31:42 +08:00
ver217
c415240db6 [nvme] CPUAdam and HybridAdam support NVMe offload (#1360)
* impl nvme optimizer

* update cpu adam

* add unit test

* update hybrid adam

* update docstr

* add TODOs

* update CI

* fix CI

* fix CI

* fix CI path

* fix CI path

* fix CI path

* fix install tensornvme

* fix CI

* fix CI path

* fix CI env variables

* test CI

* test CI

* fix CI

* fix nvme optim __del__

* fix adam __del__

* fix nvme optim

* fix CI env variables

* fix nvme optim import

* test CI

* test CI

* fix CI
2022-07-26 17:25:24 +08:00
HELSON
87775a0682 [colotensor] use cpu memory to store state_dict (#1367) 2022-07-26 14:13:38 +08:00
ver217
d068af81a3 [doc] update rst and docstring (#1351)
* update rst

* add zero docstr

* fix docstr

* remove fx.tracer.meta_patch

* fix docstr

* fix docstr

* update fx rst

* fix fx docstr

* remove useless rst
2022-07-21 15:54:53 +08:00
HELSON
7a8702c06d [colotensor] add Tensor.view op and its unit test (#1343)
[colotensor] add megatron initialization for gpt2
2022-07-21 10:53:15 +08:00
ver217
0c51ff2c13 [hotfix] ZeroDDP use new process group (#1333)
* process group supports getting ranks in group

* chunk mgr receives a process group

* update unit test

* fix unit tests
2022-07-18 14:14:52 +08:00