Mirror of https://github.com/hpcaitech/ColossalAI.git, synced 2025-09-11 05:49:55 +00:00
[doc] Fix typo under colossalai and doc(#3618)
* Fixed several spelling errors under colossalai
* Fix the spelling error in colossalai and docs directory
* Cautious Changed the spelling error under the example folder
* Update runtime_preparation_pass.py revert autograft to autograd
* Update search_chunk.py utile to until
* Update check_installation.py change misteach to mismatch in line 91
* Update 1D_tensor_parallel.md revert to perceptron
* Update 2D_tensor_parallel.md revert to perceptron in line 73
* Update 2p5D_tensor_parallel.md revert to perceptron in line 71
* Update 3D_tensor_parallel.md revert to perceptron in line 80
* Update README.md revert to resnet in line 42
* Update reorder_graph.py revert to indice in line 7
* Update p2p.py revert to megatron in line 94
* Update initialize.py revert to torchrun in line 198
* Update routers.py change to detailed in line 63
* Update routers.py change to detailed in line 146
* Update README.md revert random number in line 402
@@ -42,7 +42,7 @@ Given $P$ processors, we present the theoretical computation and memory cost, as

 ## Usage

-To enable 1D tensor parallelism for our model, e.g. on 2 GPUs, we need to configure the parallism setting as below.
+To enable 1D tensor parallelism for our model, e.g. on 2 GPUs, we need to configure the parallelism setting as below.
 ```python
 CONFIG = dict(parallel=dict(
     data=1,
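For readers of this page, the `CONFIG` in the hunk above is cut off by the diff context. A minimal sketch of a complete 1D tensor-parallel setting for 2 GPUs might look like the following; only `data` and the tensor `size`/`mode` come from the doc being edited, the `pipeline` field and overall layout are assumptions.

```python
# Sketch only: a plausible full form of the truncated CONFIG above,
# enabling 1D tensor parallelism across 2 GPUs.
CONFIG = dict(parallel=dict(
    data=1,
    pipeline=1,                      # assumed; not shown in this hunk
    tensor=dict(size=2, mode='1d'),  # 2 GPUs, 1D tensor parallelism
))
```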
@@ -60,7 +60,7 @@ Given $P=q\times q$ processors, we present the theoretical computation and memor

 ## Usage

-To enable 2D tensor parallelism for our model, e.g. on 4 GPUs, we need to configure the parallism setting as below.
+To enable 2D tensor parallelism for our model, e.g. on 4 GPUs, we need to configure the parallelism setting as below.
 ```python
 CONFIG = dict(parallel=dict(
     data=1,
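The same truncation applies here; under the same assumptions as the 1D sketch above, the 2D variant for 4 GPUs would switch the tensor mode.

```python
# Sketch only: assumed completion of the truncated CONFIG for 2D tensor
# parallelism across 4 GPUs (q x q = 2 x 2).
CONFIG = dict(parallel=dict(
    data=1,
    pipeline=1,                      # assumed; not shown in this hunk
    tensor=dict(size=4, mode='2d'),
))
```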
@@ -57,7 +57,7 @@ Given $P=q \times q \times d$ processors, we present the theoretical computation

 ## Usage

-To enable 2.5D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallism setting as below.
+To enable 2.5D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallelism setting as below.
 ```python
 CONFIG = dict(parallel=dict(
     data=1,
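For the 2.5D case on 8 GPUs, the tensor setting additionally carries a depth parameter, matching the $P = q \times q \times d$ decomposition in the hunk header; again a sketch under the same assumptions, not part of the diff.

```python
# Sketch only: assumed completion for 2.5D tensor parallelism on 8 GPUs,
# i.e. q = 2 and depth d = 2 so that size = q * q * d = 8.
CONFIG = dict(parallel=dict(
    data=1,
    pipeline=1,                      # assumed; not shown in this hunk
    tensor=dict(size=8, mode='2.5d', depth=2),
))
```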
@@ -28,7 +28,7 @@ gradient_accumulation = <int>
 ## Hands-on Practice

 We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)
-to demonstrate gradient accumulation. In this example, we set the gradinet accumulation size to be 4. You can run the script using this command:
+to demonstrate gradient accumulation. In this example, we set the gradient accumulation size to be 4. You can run the script using this command:

 ```shell
 python -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port 29500 run_resnet_cifar10_with_engine.py
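For readers unfamiliar with the feature this doc describes: with an accumulation size of 4, gradients from four micro-batches are summed before a single optimizer step. A plain-PyTorch analogue (a toy sketch, not ColossalAI's engine API) is:

```python
import torch
import torch.nn as nn

# Toy stand-in for the ResNet/CIFAR10 example referenced above.
model = nn.Linear(32, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
accum_size = 4  # corresponds to gradient_accumulation = 4 in the config

optimizer.zero_grad()
for step in range(16):
    inputs = torch.randn(8, 32)
    labels = torch.randint(0, 10, (8,))
    loss = criterion(model(inputs), labels) / accum_size  # scale so summed grads average out
    loss.backward()                                        # gradients accumulate in .grad
    if (step + 1) % accum_size == 0:
        optimizer.step()                                   # update once every 4 micro-batches
        optimizer.zero_grad()
```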
@@ -101,7 +101,7 @@ you can use `colossalai.amp.convert_to_amp`.
 ```python
 from colossalai.amp import AMP_TYPE

-# exmaple of using torch amp
+# example of using torch amp
 model, optimizer, criterion = colossalai.amp.convert_to_amp(model,
                                                              optimizer,
                                                              criterion,
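The call in this hunk is truncated by the diff context. Assuming the remaining argument is the AMP mode (which the `# example of using torch amp` comment suggests), the completed snippet would read roughly as below; the toy model, optimizer and criterion here are stand-ins for the tutorial's own components.

```python
import torch
import torch.nn as nn
import colossalai
from colossalai.amp import AMP_TYPE

# Toy components; in the doc these come from the surrounding tutorial.
model = nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# example of using torch amp: the last argument selects the AMP implementation (assumed).
model, optimizer, criterion = colossalai.amp.convert_to_amp(model,
                                                            optimizer,
                                                            criterion,
                                                            AMP_TYPE.TORCH)
```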
@@ -220,7 +220,7 @@ The default parameters of Naive AMP:
 - initial_scale(int): initial scale of gradient scaler
 - growth_factor(int): the growth rate of loss scale
 - backoff_factor(float): the decrease rate of loss scale
-- hysterisis(int): delay shift in dynamic loss scaling
+- hysteresis(int): delay shift in dynamic loss scaling
 - max_scale(int): maximum loss scale allowed
 - verbose(bool): if set to `True`, will print debug info

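For orientation, these parameters are typically passed through an `fp16` field of the ColossalAI config when selecting Naive AMP; the following is a sketch with purely illustrative values (the config key and the numbers are assumptions, the real defaults are listed in the doc being edited).

```python
from colossalai.amp import AMP_TYPE

# Illustrative values only; see the parameter list above for meanings.
fp16 = dict(
    mode=AMP_TYPE.NAIVE,
    initial_scale=2 ** 32,   # initial scale of gradient scaler
    growth_factor=2,         # growth rate of loss scale
    backoff_factor=0.5,      # decrease rate of loss scale
    hysteresis=2,            # delay shift in dynamic loss scaling
    max_scale=2 ** 32,       # maximum loss scale allowed
    verbose=False,           # print debug info if True
)
```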
@@ -292,7 +292,7 @@ colossalai.launch_from_torch(config=args.config)
 ### Step 4. Create training components

 Build your model, optimizer, loss function, lr scheduler and dataloaders. Note that the root path of the dataset is
-obtained from the environment varialbe `DATA`. You may `export DATA=/path/to/data` or change `Path(os.environ['DATA'])`
+obtained from the environment variable `DATA`. You may `export DATA=/path/to/data` or change `Path(os.environ['DATA'])`
 to a path on your machine. Data will be automatically downloaded to the root path.

 ```python
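As a small aside for this hunk, resolving the dataset root described in the corrected sentence might look like the snippet below; the `./data` fallback is only for illustration and is not part of the tutorial.

```python
import os
from pathlib import Path

# The tutorial reads the dataset root from the DATA environment variable,
# e.g. after `export DATA=/path/to/data`; the fallback here is illustrative.
DATA_ROOT = Path(os.environ.get('DATA', './data'))
print(f"Dataset will be downloaded to: {DATA_ROOT}")
```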
@@ -326,7 +326,7 @@ to a path on your machine. Data will be automatically downloaded to the root pat
 # build loss
 criterion = torch.nn.CrossEntropyLoss()

-# lr_scheduelr
+# lr_scheduler
 lr_scheduler = LinearWarmupLR(optimizer, warmup_steps=50, total_steps=gpc.config.NUM_EPOCHS)
 ```

@@ -57,7 +57,7 @@ It's compatible with all parallel methods in ColossalAI.

 Let's start from two simple examples -- training GPT with different methods. These examples relies on `transformers`.

-We should install denpendencies first:
+We should install dependencies first:

 ```shell
 pip install psutil transformers
@@ -99,7 +99,7 @@ class GPTLMLoss(nn.Module):
                             shift_labels.view(-1))
 ```

-And we define some utility functions, which generates random data, computes the number of paramters of a model and get memory usage of current process:
+And we define some utility functions, which generates random data, computes the number of parameters of a model and get memory usage of current process:

 ```python
 def get_data(batch_size: int, seq_len: int,
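The utility block this hunk edits is cut off after the `get_data` signature. A rough sketch of what such helpers look like follows; the third `get_data` argument being the vocabulary size, and the helper names other than `get_data`, are assumptions rather than content of the diff.

```python
import psutil
import torch
import torch.nn as nn

def get_data(batch_size: int, seq_len: int, vocab_size: int):
    # Random token ids plus an all-ones attention mask as synthetic GPT input.
    input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
    attention_mask = torch.ones_like(input_ids)
    return input_ids, attention_mask

def count_parameters(model: nn.Module) -> int:
    # Total number of parameter elements in the model.
    return sum(p.numel() for p in model.parameters())

def get_cpu_mem_mb() -> float:
    # Resident memory of the current process in MB (psutil is installed above).
    return psutil.Process().memory_info().rss / 1024 ** 2
```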
@@ -251,7 +251,7 @@ Time: 3.691 s
 Mem usage: 5298.344 MB
 ```

-NVME offload saves about 294 MB memory. Note that enabling `pin_memory` of Gemini can accelerate training but increase memory usage. So this result also meets our expectation. If we disable `pin_memory`, we can aslo observe a memory usage drop about 900 MB.
+NVME offload saves about 294 MB memory. Note that enabling `pin_memory` of Gemini can accelerate training but increase memory usage. So this result also meets our expectation. If we disable `pin_memory`, we can also observe a memory usage drop about 900 MB.

 ## API Reference

@@ -32,11 +32,11 @@ and the first and second momentum estimates) are partitioned across the processe

 3. **Shard Parameter**: The 16-bit model parameters are partitioned across the processes of a data parallel group.

-4. **[Gemini](../advanced_tutorials/meet_gemini.md)**: Dynamic heterogeneous memory space manager for paramters, gradients and optimizer states.
+4. **[Gemini](../advanced_tutorials/meet_gemini.md)**: Dynamic heterogeneous memory space manager for parameters, gradients and optimizer states.

 Besides, this article will introduce the Zero Redundancy Optimizer with chunk-based memory management.

-When using ZeRO, we distributed the model by sharding the parameters. The advantage of this method is that the memory of each node is load balanced. But this approach has two significiant disadvantages. First, during communication, a temporary memory buffer needs to be allocated and released afterwards, leading to the memory fragmentation problem. Secondly, using tensor as the granularity for communication will cause the network bandwidth underutilized. Generally, the longer the transmitted message length, the higher the bandwidth utilization.
+When using ZeRO, we distributed the model by sharding the parameters. The advantage of this method is that the memory of each node is load balanced. But this approach has two significant disadvantages. First, during communication, a temporary memory buffer needs to be allocated and released afterwards, leading to the memory fragmentation problem. Secondly, using tensor as the granularity for communication will cause the network bandwidth underutilized. Generally, the longer the transmitted message length, the higher the bandwidth utilization.

 Using the Chunk mechanism introduced in ColossalAI v0.1.8, we can improve the efficiency of ZeRO. We store a continuous set of parameters in initialization order into a Chunk (a chunk is a continuous memory space), and each Chunk has the same size. Organizing memory in chunks can lead to efficient use of network bandwidth between PCI-e and GPU-GPU, reduce the number of communications, and avoid potential memory fragmentation.

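To make the Chunk idea in the corrected paragraph concrete, here is a toy, framework-agnostic sketch (not ColossalAI's actual chunk manager) that packs parameters into equally sized flat buffers in initialization order:

```python
import torch
import torch.nn as nn

def pack_into_chunks(model: nn.Module, chunk_numel: int):
    """Toy illustration: copy parameters, in initialization order, into
    fixed-size flat buffers ("chunks"); oversized parameters are not handled."""
    chunks, current, used = [], torch.zeros(chunk_numel), 0
    for p in model.parameters():
        flat = p.detach().reshape(-1)
        if used + flat.numel() > chunk_numel:   # current chunk is full, seal it
            chunks.append(current)
            current, used = torch.zeros(chunk_numel), 0
        current[used:used + flat.numel()] = flat
        used += flat.numel()
    chunks.append(current)
    return chunks

# Example: a small model packed into 4096-element chunks.
chunks = pack_into_chunks(nn.Sequential(nn.Linear(64, 64), nn.Linear(64, 10)), chunk_numel=4096)
print(len(chunks), [int(c.numel()) for c in chunks])
```

Communicating whole chunks instead of individual tensors lengthens each message, which is exactly the bandwidth argument the paragraph above makes.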