7 Commits

Author SHA1 Message Date
Hongxin Liu
6280cb18b8 [checkpointio] support debug log (#6153)
* [checkpointio] support debug log

* [checkpointio] refactor async writer api

* fix test

* fix test
2024-12-02 11:29:19 +08:00
Hongxin Liu
a2596519fd [zero] support extra dp (#6123)
* [zero] support extra dp

* [zero] update checkpoint

* fix bugs

* fix bugs
2024-11-12 11:20:46 +08:00
Guangyao Zhang
f20b066c59 [fp8] Disable all_gather intranode. Disable Redundant all_gather fp8 (#6059)
* all_gather only internode, fix pytest

* fix cuda arch <89 compile pytest error

* fix pytest failure

* disable all_gather_into_tensor_flat_fp8

* fix fp8 format

* fix pytest

* fix conversations

* fix chunk tuple to list
2024-09-14 10:40:01 +08:00
ver217
ae486ce005 [fp8] add fp8 comm for low level zero
2024-08-02 11:12:12 +08:00
Hongxin Liu
5dfbcd7746 [zero] use bucket during allgather (#5860)
* [zero] use bucket during allgather

* [zero] rename api
2024-06-27 16:34:44 +08:00
Hongxin Liu
079bf3cb26 [misc] update pre-commit and run all files (#4752)
* [misc] update pre-commit

* [misc] run pre-commit

* [misc] remove useless configuration files

* [misc] ignore cuda for clang-format
2023-09-19 14:20:26 +08:00
ver217
26b7aac0be [zero] reorganize zero/gemini folder structure (#3424)
* [zero] refactor low-level zero folder structure

* [zero] fix legacy zero import path

* [zero] fix legacy zero import path

* [zero] remove useless import

* [zero] refactor gemini folder structure

* [zero] refactor gemini folder structure

* [zero] refactor legacy zero import path

* [zero] refactor gemini folder structure

* [zero] refactor gemini folder structure

* [zero] refactor gemini folder structure

* [zero] refactor legacy zero import path

* [zero] fix test import path

* [zero] fix test

* [zero] fix circular import

* [zero] update import
2023-04-04 13:48:16 +08:00