Commit Graph

660 Commits

Author SHA1 Message Date
Jiarui Fang
d209aff684 Add FreqAwareEmbeddingBag (#1421) 2022-08-09 16:26:12 +08:00
ver217
6df3e19be9 [hotfix] zero optim prevents calling inner optim.zero_grad (#1422) 2022-08-09 16:08:12 +08:00
Jiarui Fang
504419d261 [FAW] add cache manager for the cached embedding (#1419) 2022-08-09 15:17:17 +08:00
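The two FAW commits above concern a software-managed cache for embedding rows. As a rough illustration of the idea (hot rows kept in a small fast buffer, cold rows evicted by access frequency), here is a minimal, hypothetical sketch — the class and parameter names are invented and are not the ColossalAI `FreqAwareEmbeddingBag` API:

```python
class CacheManager:
    """Toy frequency-aware cache: keeps the most-accessed row ids in a
    fixed-size 'fast' set, evicting the least-frequently-used id.
    Illustrative only -- not the ColossalAI FreqAwareEmbeddingBag API."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.freq = {}        # row id -> lifetime access count
        self.cached = set()   # row ids currently in the fast buffer

    def access(self, row_id):
        """Record an access; return True on cache hit, False on miss."""
        self.freq[row_id] = self.freq.get(row_id, 0) + 1
        if row_id in self.cached:
            return True
        if len(self.cached) >= self.capacity:
            # evict the least-frequently-used cached row
            victim = min(self.cached, key=lambda r: self.freq[r])
            self.cached.remove(victim)
        self.cached.add(row_id)
        return False          # miss: row would be fetched from slow memory
```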
Kirigaya Kazuto
44fd3c83ab [communication] add p2p_v2.py to support communication with List[Any] (#1407)
* support p2p communication with any type of object | pass test

* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test

* [communication] add p2p_v2.py to support communication with List[Any]

* Delete _pipeline_schedule_v2.py

* Delete test_cifar_with_data_pipeline_tensor_v2.py

* [engine/schedule] use p2p_v2 to reconstruct pipeline_schedule

* [communication] remove print code

* [communication] remove print code
2022-08-09 11:40:04 +08:00
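The p2p_v2 commit above is about sending arbitrary Python objects (e.g. `List[Any]`) between pipeline stages, not just tensors. The usual trick is to serialize the object into a length-prefixed byte frame that a byte/tensor channel can carry. A self-contained sketch of that framing, using only the standard library (the function names are illustrative, not the p2p_v2.py API):

```python
import pickle
import struct

def encode_obj(obj):
    """Serialize an arbitrary Python object (e.g. a List[Any]) into a
    length-prefixed byte frame -- the kind of payload a tensor-based
    p2p channel can ship between pipeline stages. Sketch only."""
    payload = pickle.dumps(obj)
    # 4-byte big-endian length header, then the pickled payload
    return struct.pack(">I", len(payload)) + payload

def decode_obj(frame):
    """Inverse of encode_obj: read the header, unpickle the payload."""
    (length,) = struct.unpack(">I", frame[:4])
    return pickle.loads(frame[4:4 + length])
```

In a real pipeline the frame bytes would be copied into a communication buffer and sent with a point-to-point primitive; the length header lets the receiver allocate before reading.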
YuliangLiu0306
7c96055c68 [tensor]build sharding spec to replace distspec in future. (#1405) 2022-08-08 11:15:57 +08:00
ver217
12b4887097 [hotfix] fix CPUAdam kernel nullptr (#1410) 2022-08-05 19:45:45 +08:00
YuliangLiu0306
0442f940f0 [device] add DeviceMesh class to support logical device layout (#1394)
* [device] add DeviceMesh class to support logical device layout

* polish code

* add doc string
2022-08-02 19:23:48 +08:00
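The DeviceMesh commit above introduces a logical layout over physical devices: flat ranks are arranged into a grid so that parallel groups can be addressed by mesh axis. A toy sketch of that mapping follows — the class is hypothetical and does not mirror the actual ColossalAI `DeviceMesh` interface:

```python
class DeviceMesh:
    """Toy logical device layout: arranges flat ranks 0..n-1 into a
    row-major 2D grid so parallel groups can be addressed by axis.
    Hypothetical sketch, not the ColossalAI DeviceMesh API."""

    def __init__(self, num_ranks, mesh_shape):
        rows, cols = mesh_shape
        assert rows * cols == num_ranks, "mesh shape must cover all ranks"
        self.shape = mesh_shape
        # row-major placement of ranks onto the logical grid
        self.mesh = [[r * cols + c for c in range(cols)] for r in range(rows)]

    def coordinate(self, rank):
        """Return the (row, col) logical coordinate of a flat rank."""
        return divmod(rank, self.shape[1])

    def row_group(self, rank):
        """Ranks sharing this rank's row (e.g. one tensor-parallel group)."""
        row, _ = self.coordinate(rank)
        return self.mesh[row]
```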
ver217
04c9a86af8 [zero] ZeroDDP supports controlling outputs' dtype (#1399) 2022-08-02 17:49:11 +08:00
HELSON
4e98e938ce [zero] alleviate memory usage in ZeRODDP state_dict (#1398) 2022-08-02 15:49:13 +08:00
ver217
56b8863b87 [zero] chunk manager allows filtering ex-large params (#1393) 2022-08-02 10:40:27 +08:00
Frank Lee
7d6293927f [fx] patched torch.max and data movement operator (#1391)
* [fx] patched torch.max and data movement operator

* polish code
2022-08-01 15:31:50 +08:00
Frank Lee
89e60d1505 [fx] fixed indentation error in checkpointing codegen (#1385) 2022-07-30 00:27:12 +08:00
HELSON
c7221cb2d4 [hotfix] adapt ProcessGroup and Optimizer to ColoTensor (#1388) 2022-07-29 19:33:24 +08:00
Frank Lee
ad678921db [fx] patched torch.full for huggingface opt (#1386) 2022-07-29 17:56:28 +08:00
HELSON
527758b2ae [hotfix] fix a running error in test_colo_checkpoint.py (#1387) 2022-07-29 15:58:06 +08:00
Jiarui Fang
f792507ff3 [chunk] add PG check for tensor appending (#1383) 2022-07-29 13:27:05 +08:00
ver217
8dced41ad0 [zero] zero optim state_dict takes only_rank_0 (#1384)
* zero optim state_dict takes only_rank_0

* fix unit test
2022-07-29 13:22:50 +08:00
YuliangLiu0306
df54481473 [hotfix] fix some bugs during gpt2 testing (#1379) 2022-07-28 17:21:07 +08:00
ver217
828b9e5e0d [hotfix] fix zero optim save/load state dict (#1381) 2022-07-28 17:19:39 +08:00
HELSON
b6fd165f66 [checkpoint] add kwargs for load_state_dict (#1374) 2022-07-28 15:56:52 +08:00
ver217
83328329dd [hotfix] fix zero ddp buffer cast (#1376)
* fix zero ddp buffer cast

* fix zero ddp ignore params
2022-07-28 10:54:44 +08:00
ver217
5d5031e946 fix zero ddp state dict (#1378) 2022-07-28 09:31:42 +08:00
Frank Lee
0c1a16ea5b [util] standard checkpoint function naming (#1377) 2022-07-28 09:29:30 +08:00
YuliangLiu0306
52bc2dc271 [fx] update split module pass and add customized policy (#1373)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx]update split module pass and add customized policy
2022-07-27 13:40:54 +08:00
Super Daniel
be229217ce [fx] add torchaudio test (#1369)
* [fx]add torchaudio test

* [fx]add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test

* [fx] add torchaudio test and test patches

* Delete ~

* [fx] add patches and patches test

* [fx] add patches and patches test

* [fx] fix patches

* [fx] fix rnn patches

* [fx] fix rnn patches

* [fx] fix rnn patches

* [fx] fix rnn patches

* [fx] merge upstream

* [fx] fix import errors
2022-07-27 11:03:14 +08:00
ver217
c415240db6 [nvme] CPUAdam and HybridAdam support NVMe offload (#1360)
* impl nvme optimizer

* update cpu adam

* add unit test

* update hybrid adam

* update docstr

* add TODOs

* update CI

* fix CI

* fix CI

* fix CI path

* fix CI path

* fix CI path

* fix install tensornvme

* fix CI

* fix CI path

* fix CI env variables

* test CI

* test CI

* fix CI

* fix nvme optim __del__

* fix adam __del__

* fix nvme optim

* fix CI env variables

* fix nvme optim import

* test CI

* test CI

* fix CI
2022-07-26 17:25:24 +08:00
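The NVMe-offload commit above lets CPUAdam/HybridAdam keep optimizer states on an NVMe drive rather than in RAM. As a rough sketch of the core idea (states live on disk and are paged in only when a step touches them), here is a stdlib-only toy — the real path uses the tensornvme library with asynchronous reads and writes, which this does not attempt to model:

```python
import os
import pickle
import tempfile

class DiskOffloadedState:
    """Toy optimizer-state offload: state values live in a file (standing
    in for an NVMe drive) and are loaded into memory only on demand.
    Sketch of the idea only -- not the CPUAdam/HybridAdam implementation."""

    def __init__(self):
        fd, self.path = tempfile.mkstemp(suffix=".state")
        os.close(fd)
        self.offsets = {}  # state name -> (offset, length) within the file

    def write(self, name, values):
        """Append a state blob to the backing file and record its span."""
        with open(self.path, "ab") as f:
            start = f.tell()
            data = pickle.dumps(values)
            f.write(data)
        self.offsets[name] = (start, len(data))

    def read(self, name):
        """Page a state blob back into memory for the optimizer step."""
        start, length = self.offsets[name]
        with open(self.path, "rb") as f:
            f.seek(start)
            return pickle.loads(f.read(length))
```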
HELSON
8463290642 [checkpoint] use args, kwargs in save_checkpoint, load_checkpoint (#1368) 2022-07-26 14:41:53 +08:00
YuliangLiu0306
5542816690 [fx]add gpt2 passes for pipeline performance test (#1366)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx]add gpt2 passes for pipeline performance test
2022-07-26 14:31:00 +08:00
HELSON
87775a0682 [colotensor] use cpu memory to store state_dict (#1367) 2022-07-26 14:13:38 +08:00
HELSON
943a96323e [hotfix] fix no optimizer in save/load (#1363) 2022-07-26 10:53:53 +08:00
Frank Lee
cd063ac37f [fx] added activation checkpoint codegen support for torch < 1.12 (#1359) 2022-07-25 23:35:31 +08:00
Frank Lee
644582eee9 [fx] added activation checkpoint codegen (#1355) 2022-07-25 09:39:10 +08:00
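The two codegen commits above generate code that applies activation checkpointing: rather than caching a layer's output for the backward pass, only its input is kept and the forward is recomputed when needed, trading compute for memory. A framework-free toy illustrating that trade-off (the class is invented; the real feature emits torch code via FX codegen):

```python
class Checkpointed:
    """Toy activation checkpoint: keep only a layer's input and
    recompute the forward when the activation is needed again.
    Illustrates the compute-for-memory trade the codegen automates."""

    def __init__(self, fn):
        self.fn = fn
        self.saved_input = None
        self.recompute_count = 0

    def forward(self, x):
        self.saved_input = x       # cheap: keep input, drop activation
        return self.fn(x)

    def recompute(self):
        """Called at 'backward' time to regenerate the activation."""
        self.recompute_count += 1
        return self.fn(self.saved_input)
```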
ver217
6b43c789fd fix zero optim backward_by_grad and save/load (#1353) 2022-07-21 16:43:58 +08:00
ver217
d068af81a3 [doc] update rst and docstring (#1351)
* update rst

* add zero docstr

* fix docstr

* remove fx.tracer.meta_patch

* fix docstr

* fix docstr

* update fx rst

* fix fx docstr

* remove useless rst
2022-07-21 15:54:53 +08:00
Frank Lee
274c1a3b5f [fx] fixed apex normalization patch exception (#1352) 2022-07-21 15:29:11 +08:00
ver217
ce470ba37e [checkpoint] sharded optim save/load grad scaler (#1350) 2022-07-21 15:21:21 +08:00
Frank Lee
05fae1fd56 [fx] added activation checkpointing annotation (#1349)
* [fx] added activation checkpointing annotation

* polish code

* polish code
2022-07-21 11:14:28 +08:00
YuliangLiu0306
051592c64e [fx] update MetaInforProp pass to process more complex node.meta (#1344)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx] update MetaInforProp pass to process more complex node.meta
2022-07-21 10:57:52 +08:00
HELSON
7a8702c06d [colotensor] add Tensor.view op and its unit test (#1343)
[colotensor] add megatron initialization for gpt2
2022-07-21 10:53:15 +08:00
YuliangLiu0306
942c8cd1fb [fx] refactor tracer to trace complete graph (#1342)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx] refactor tracer to trace complete graph

* add comments and solve conflicts.
2022-07-20 11:20:38 +08:00
Frank Lee
2cc1175c76 [fx] tested the complete workflow for auto-parallel (#1336)
* [fx] tested the complete workflow for auto-parallel

* polish code

* polish code

* polish code
2022-07-20 10:45:17 +08:00
YuliangLiu0306
4631fef8a0 [fx]refactor tracer (#1335) 2022-07-19 15:50:42 +08:00
HELSON
f92c100ddd [checkpoint] use gather_tensor in checkpoint and update its unit test (#1339) 2022-07-19 14:15:28 +08:00
ver217
0c51ff2c13 [hotfix] ZeroDDP use new process group (#1333)
* process group supports getting ranks in group

* chunk mgr receives a process group

* update unit test

* fix unit tests
2022-07-18 14:14:52 +08:00
Frank Lee
75abc75c15 [fx] fixed compatibility issue with torch 1.10 (#1331) 2022-07-18 11:41:27 +08:00
ver217
7a05367101 [hotfix] shared model returns cpu state_dict (#1328) 2022-07-15 22:11:37 +08:00
Frank Lee
b2475d8c5c [fx] fixed unit tests for torch 1.12 (#1327) 2022-07-15 18:22:15 +08:00
HELSON
d49708ae43 [hotfix] fix ddp for unit test test_gpt2 (#1326) 2022-07-15 18:19:52 +08:00
Frank Lee
250be4d31e [utils] integrated colotensor with lazy init context (#1324)
* [utils] integrated colotensor with lazy init context

* polish code

* polish code

* polish code
2022-07-15 17:47:12 +08:00
YuliangLiu0306
e8acf55e8b [fx] add balanced policy v2 (#1251)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [fx] add balanced policy v2

* add unittest
2022-07-15 14:54:26 +08:00