Commit Graph

100 Commits

Author SHA1 Message Date
CsRic
a389ac4ec9 [embedding] cache_embedding small improvement (#1564) 2022-09-08 16:41:19 +08:00
Jiarui Fang
64169f3e8f [embedding] polish parallel embedding tablewise (#1545) 2022-09-06 10:41:20 +08:00
CsRic
964123ae0f [embedding] freq_aware_embedding: add small functions for caller application (#1537) 2022-09-05 15:12:53 +08:00
Jiarui Fang
521078ffc9 [embedding] fix a bug in table wise sharding (#1538) 2022-09-02 15:48:35 +08:00
Jiarui Fang
87134524fd [embedding] tablewise sharding polish (#1535) 2022-09-02 11:09:37 +08:00
CsRic
5156d5b4f8 [embedding] add tablewise sharding for FAW (#1526) 2022-09-01 17:55:41 +08:00
Jiarui Fang
4537d39df9 [doc] docstring for FreqAwareEmbeddingBag (#1525) 2022-08-31 13:52:30 +08:00
Jiarui Fang
9a9ef65313 [FAW] cpu caching operations (#1520) 2022-08-30 14:50:02 +08:00
Jiarui Fang
af5438caa2 [FAW] refactor reorder() for CachedParamMgr (#1514) 2022-08-29 14:22:07 +08:00
Jiarui Fang
9feee6d06b [FAW] LFU initialize with dataset freq (#1513) 2022-08-29 12:52:53 +08:00
CsRic
1b8fee8e9c [FAW] shrink freq_cnter size (#1509) 2022-08-29 11:44:55 +08:00
Jiarui Fang
ba61109b6c [FAW] remove code related to chunk (#1501) 2022-08-26 14:23:30 +08:00
Jiarui Fang
d5085bb317 [FAW] add more docs and fix a warning (#1500) 2022-08-26 14:10:21 +08:00
CsRic
0ed2f46131 [FAW] FAW embedding use LRU as eviction strategy intialized with dataset stats (#1494) 2022-08-26 11:24:12 +08:00
CsRic
b8d0e39eaf [FAW] LFU cache for the FAW 2022-08-25 13:08:46 +08:00
Jiarui Fang
cde7b8a5b8 [FAW] init an LFU implementation for FAW (#1488) 2022-08-24 17:37:22 +08:00
Geng Zhang
0aad53c62b [FCE] update interface for frequency statistics in FreqCacheEmbedding (#1462) 2022-08-23 17:38:24 +08:00
Jiarui Fang
a1476ea882 [NFC] polish doc style for ColoTensor (#1457) 2022-08-16 09:21:05 +08:00
Geng Zhang
9f3eed66eb [FAW] reorganize the inheritance struct of FreqCacheEmbedding (#1448) 2022-08-12 15:55:46 +08:00
Jiarui Fang
30b4dd17c0 [FAW] export FAW in _ops (#1438) 2022-08-11 13:43:24 +08:00
ver217
04c9a86af8 [zero] ZeroDDP supports controlling outputs' dtype (#1399) 2022-08-02 17:49:11 +08:00
HELSON
4e98e938ce [zero] alleviate memory usage in ZeRODDP state_dict (#1398) 2022-08-02 15:49:13 +08:00
ver217
83328329dd [hotfix] fix zero ddp buffer cast (#1376)
* fix zero ddp buffer cast

* fix zero ddp ignore params
2022-07-28 10:54:44 +08:00
ver217
5d5031e946 fix zero ddp state dict (#1378) 2022-07-28 09:31:42 +08:00
HELSON
87775a0682 [colotensor] use cpu memory to store state_dict (#1367) 2022-07-26 14:13:38 +08:00
ver217
d068af81a3 [doc] update rst and docstring (#1351)
* update rst

* add zero docstr

* fix docstr

* remove fx.tracer.meta_patch

* fix docstr

* fix docstr

* update fx rst

* fix fx docstr

* remove useless rst
2022-07-21 15:54:53 +08:00
ver217
0c51ff2c13 [hotfix] ZeroDDP use new process group (#1333)
* process group supports getting ranks in group

* chunk mgr receives a process group

* update unit test

* fix unit tests
2022-07-18 14:14:52 +08:00
HELSON
1b41686461 [hotfix] fix unit test test_module_spec (#1321) 2022-07-15 14:02:32 +08:00
Jiarui Fang
9bcd2fd4af [tensor] a shorter shard and replicate spec (#1245) 2022-07-11 15:51:48 +08:00
Jiarui Fang
ae7d3f4927 [refactor] move process group from _DistSpec to ColoTensor. (#1203) 2022-07-06 16:15:16 +08:00
Jiarui Fang
b5f25eb32a [Tensor] add cpu group to ddp (#1200) 2022-07-05 14:58:28 +08:00
Jiarui Fang
060b917daf [refactor] remove gpc dependency in colotensor's _ops (#1189) 2022-07-04 18:54:37 +08:00
Jiarui Fang
372f791444 [refactor] move chunk and chunkmgr to directory gemini (#1182) 2022-06-29 13:31:02 +08:00
ver217
6b2f2ab9bb [ddp] ColoDDP uses bucket all-reduce (#1177)
* add reducer

* update colo ddp with reducer

* polish unit test

* polish unit test
2022-06-29 10:34:13 +08:00
Ziyue Jiang
dd0420909f [Tensor] rename parallel_action (#1174)
* rename parallel_action

* polish
2022-06-27 10:04:45 +08:00
Jiarui Fang
4b9bba8116 [ColoTensor] rename APIs and add output_replicate to ComputeSpec (#1168) 2022-06-24 13:08:54 +08:00
Jiarui Fang
f4ef224358 [Tensor] remove ParallelAction, use ComputeSpec instread (#1166) 2022-06-23 17:34:59 +08:00
ver217
54aabb8da4 [gemini] refactor gemini mgr (#1151)
* refactor gemini mgr

* udpate __init__
2022-06-22 11:54:36 +08:00
ver217
8106d7b8c7 [ddp] refactor ColoDDP and ZeroDDP (#1146)
* ColoDDP supports overwriting default process group

* rename ColoDDPV2 to ZeroDDP

* add docstr for ZeroDDP

* polish docstr
2022-06-21 16:35:23 +08:00
Frank Lee
15aab1476e [zero] avoid zero hook spam by changing log to debug level (#1137) 2022-06-21 10:44:01 +08:00
ver217
d26902645e [ddp] add save/load state dict for ColoDDP (#1127)
* add save/load state dict for ColoDDP

* add unit test

* refactor unit test folder

* polish unit test

* rename unit test
2022-06-20 10:51:47 +08:00
ver217
f0a954f16d [ddp] add set_params_to_ignore for ColoDDP (#1122)
* add set_params_to_ignore for ColoDDP

* polish code

* fix zero hook v2

* add unit test

* polish docstr
2022-06-16 12:54:46 +08:00
ver217
e127b4375b cast colo ddp v2 inputs/outputs (#1120) 2022-06-15 15:57:04 +08:00
ver217
7d14b473f0 [gemini] gemini mgr supports "cpu" placement policy (#1118)
* update gemini mgr

* update chunk

* add docstr

* polish placement policy

* update test chunk

* update test zero

* polish unit test

* remove useless unit test
2022-06-15 15:05:19 +08:00
ver217
895c1c5ee7 [tensor] refactor param op hook (#1097)
* refactor param op hook

* add docstr

* fix bug
2022-06-13 16:11:53 +08:00
Frank Lee
cb18922c47 [doc] added documentation to chunk and chunk manager (#1094)
* [doc] added documentation to chunk and chunk manager

* polish code

* polish code

* polish code
2022-06-10 15:33:06 +08:00
ver217
1f894e033f [gemini] zero supports gemini (#1093)
* add placement policy

* add gemini mgr

* update mem stats collector

* update zero

* update zero optim

* fix bugs

* zero optim monitor os

* polish unit test

* polish unit test

* add assert
2022-06-10 14:48:28 +08:00
ver217
be01db37c8 [tensor] refactor chunk mgr and impl MemStatsCollectorV2 (#1077)
* polish chunk manager

* polish unit test

* impl add_extern_static_tensor for chunk mgr

* add mem stats collector v2

* polish code

* polish unit test

* polish code

* polish get chunks
2022-06-09 20:56:34 +08:00
Ziyue Jiang
4fc748f69b [Tensor] fix optimizer for CPU parallel (#1069) 2022-06-06 17:36:11 +08:00
Jiarui Fang
49832b2344 [refactory] add nn.parallel module (#1068) 2022-06-06 15:34:41 +08:00