[misc] refactor launch API and tensor constructor (#5666)

* [misc] remove config arg from initialize

* [misc] remove old tensor constructor

* [plugin] add npu support for ddp

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [devops] fix doc test ci

* [test] fix test launch

* [doc] update launch doc

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Author: Hongxin Liu
Date: 2024-04-29 10:40:11 +08:00
Committed by: GitHub
Parent: 91fa553775
Commit: 7f8b16635b
223 changed files with 294 additions and 403 deletions

@@ -55,7 +55,7 @@ from colossalai.booster.plugin import TorchDDPPlugin
def train():
    # launch colossalai
-    colossalai.launch(config=dict(), rank=rank, world_size=world_size, port=port, host='localhost')
+    colossalai.launch(rank=rank, world_size=world_size, port=port, host='localhost')
    # create plugin and objects for training
    plugin = TorchDDPPlugin()
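
For context, a minimal sketch of how the trimmed `colossalai.launch` signature fits into a TorchDDP training setup; the Booster wiring, model, and optimizer below are illustrative assumptions, not part of this diff.

```python
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin


def train(rank: int, world_size: int, port: int):
    # launch the distributed environment without the removed `config` argument
    colossalai.launch(rank=rank, world_size=world_size, port=port, host='localhost')

    # create the plugin and booster, then let the booster wrap the training objects
    plugin = TorchDDPPlugin()
    booster = Booster(plugin=plugin)

    model = torch.nn.Linear(8, 2)                             # illustrative model
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # illustrative optimizer
    model, optimizer, *_ = booster.boost(model, optimizer)
```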

@@ -87,8 +87,7 @@ import colossalai
args = colossalai.get_default_parser().parse_args()
# launch distributed environment
-colossalai.launch(config=args.config,
-                  rank=args.rank,
+colossalai.launch(rank=args.rank,
                  world_size=args.world_size,
                  host=args.host,
                  port=args.port,
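
For illustration, a sketch of driving this manual launch path locally with `torch.multiprocessing.spawn`; the worker body, world size, and port are assumed values rather than content from the doc.

```python
import torch.multiprocessing as mp
import colossalai


def worker(rank: int, world_size: int, port: int):
    # the spawn index maps directly onto the `rank` argument of the new signature
    colossalai.launch(rank=rank, world_size=world_size, host='localhost', port=port)
    # ... build plugin / model / optimizer here ...


if __name__ == '__main__':
    world_size = 2  # assumed number of local workers
    mp.spawn(worker, args=(world_size, 29500), nprocs=world_size)
```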
@@ -106,20 +105,11 @@ First, we need to set the launch method in our code. As this is a wrapper of the
use `colossalai.launch_from_torch`. The arguments required for distributed environment such as rank, world size, host and port are all set by the PyTorch
launcher and can be read from the environment variable directly.
-config.py
-```python
-BATCH_SIZE = 512
-LEARNING_RATE = 3e-3
-WEIGHT_DECAY = 0.3
-NUM_EPOCHS = 2
-```
train.py
```python
import colossalai
-colossalai.launch_from_torch(
-    config="./config.py",
-)
+colossalai.launch_from_torch()
...
```
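
As a quick sanity check of the torchrun path, a small assumed sketch that prints the rank picked up from the environment after the bare `launch_from_torch()` call:

```python
import torch.distributed as dist
import colossalai

# rank, world size, host and port are read from the env vars set by the PyTorch launcher
colossalai.launch_from_torch()
print(f"rank {dist.get_rank()} of {dist.get_world_size()} is ready")
```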
@@ -203,7 +193,6 @@ Do this in your training script:
import colossalai
colossalai.launch_from_slurm(
-    config=<CONFIG>,
    host=args.host,
    port=args.port
)
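
A hedged sketch of a complete SLURM entry point under the new signature, reusing the default parser shown earlier in this doc:

```python
import colossalai

# rank and world size come from the SLURM environment; only host/port are passed
args = colossalai.get_default_parser().parse_args()
colossalai.launch_from_slurm(host=args.host, port=args.port)
# ... continue with plugin / booster / model setup as usual ...
```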
@@ -224,7 +213,6 @@ use them to start the distributed backend.
Do this in your train.py:
```python
colossalai.launch_from_openmpi(
-    config=<CONFIG>,
    host=args.host,
    port=args.port
)
@@ -238,3 +226,5 @@ mpirun --hostfile <my_hostfile> -np <num_process> python train.py --host <node n
- --hostfile: use this option to specify a list of hosts on which to run
- --np: set the number of processes (GPUs) to launch in total. For example, if --np 4, 4 python processes will be initialized to run train.py.
+<!-- doc-test-command: echo -->
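
Finally, a sketch of a matching `train.py` for the Open MPI path; the `OMPI_COMM_WORLD_RANK` variable is the standard Open MPI convention, assumed here rather than quoted from the doc.

```python
import os

import colossalai

args = colossalai.get_default_parser().parse_args()
# under mpirun, rank/world size are derived from the MPI environment (assumed OMPI_* vars)
print("MPI rank:", os.environ.get("OMPI_COMM_WORLD_RANK"))
colossalai.launch_from_openmpi(host=args.host, port=args.port)
```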