[gemini] improve compatibility and add static placement policy (#4479)

* [gemini] remove distributed-related part from colotensor (#4379)

* [gemini] remove process group dependency

* [gemini] remove tp part from colo tensor

* [gemini] patch inplace op

* [gemini] fix param op hook and update tests

* [test] remove useless tests

* [test] remove useless tests

* [misc] fix requirements

* [test] fix model zoo

* [test] fix model zoo

* [test] fix model zoo

* [test] fix model zoo

* [test] fix model zoo

* [misc] update requirements

* [gemini] refactor gemini optimizer and gemini ddp (#4398)

* [gemini] update optimizer interface

* [gemini] renaming gemini optimizer

* [gemini] refactor gemini ddp class

* [example] update gemini related example

* [example] update gemini related example

* [plugin] fix gemini plugin args

* [test] update gemini ckpt tests

* [gemini] fix checkpoint io

* [example] fix opt example requirements

* [example] fix opt example

* [example] fix opt example

* [example] fix opt example

* [gemini] add static placement policy (#4443)

* [gemini] add static placement policy

* [gemini] fix param offload

* [test] update gemini tests

* [plugin] update gemini plugin

* [plugin] update gemini plugin docstr

* [misc] fix flash attn requirement

* [test] fix gemini checkpoint io test

* [example] update resnet example result (#4457)

* [example] update bert example result (#4458)

* [doc] update gemini doc (#4468)

* [example] update gemini related examples (#4473)

* [example] update gpt example

* [example] update dreambooth example

* [example] update vit

* [example] update opt

* [example] update palm

* [example] update vit and opt benchmark

* [hotfix] fix bert in model zoo (#4480)

* [hotfix] fix bert in model zoo

* [test] remove chatglm gemini test

* [test] remove sam gemini test

* [test] remove vit gemini test

* [hotfix] fix opt tutorial example (#4497)

* [hotfix] fix opt tutorial example

* [hotfix] fix opt tutorial example
Author: Hongxin Liu
Date: 2023-08-24 09:29:25 +08:00
Committed by: GitHub
Parent: 285fe7ba71
Commit: 27061426f7

82 changed files with 1008 additions and 4036 deletions
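
The headline change is the new static placement policy and the simplified `GeminiPlugin` constructor visible in the diffs below. As a quick orientation, here is a minimal sketch of driving the refactored plugin through the Booster API, assuming the post-PR interface; the launch boilerplate, model, and optimizer are illustrative, and only `initial_scale` is taken directly from the diff:

```python
# Minimal sketch of post-refactor Gemini usage (run with torchrun);
# everything besides GeminiPlugin / Booster / initial_scale is illustrative.
import torch
import torch.nn as nn

import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin

colossalai.launch_from_torch(config={})

# "static" pins parameters and optimizer states to a fixed device split
# chosen up front, instead of the dynamic "auto" policy that migrates
# chunks between CPU and GPU at runtime.
plugin = GeminiPlugin(placement_policy="static", initial_scale=2**5)
booster = Booster(plugin=plugin)

model = nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# boost() wraps the model in Gemini's DDP and the optimizer in the
# renamed Gemini optimizer mentioned in the commit list.
model, optimizer, *_ = booster.boost(model, optimizer)
```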


@@ -7,6 +7,14 @@ This directory includes two parts: Using the Booster API finetune Huggingface Be
 bash test_ci.sh
 ```
+### Results on 2-GPU
+| Plugin | Accuracy | F1-score |
+| -------------- | -------- | -------- |
+| torch_ddp | 84.4% | 88.6% |
+| torch_ddp_fp16 | 84.7% | 88.8% |
+| gemini | 84.0% | 88.4% |
 ## Benchmark
 ```
 bash benchmark.sh
@@ -14,9 +22,9 @@ bash benchmark.sh
 Now include these metrics in benchmark: CUDA mem occupy, throughput and the number of model parameters. If you have custom metrics, you can add them to benchmark_util.
-## Results
+### Results
-### Bert
+#### Bert
 | | max cuda mem | throughput(sample/s) | params |
 | :-----| -----------: | :--------: | :----: |
@@ -25,10 +33,10 @@ Now include these metrics in benchmark: CUDA mem occupy, throughput and the numb
 | gemini | 11.0 GB | 12.9 | 82M |
 | low_level_zero | 11.29 G | 14.7 | 82M |
-### AlBert
+#### AlBert
 | | max cuda mem | throughput(sample/s) | params |
 | :-----| -----------: | :--------: | :----: |
 | ddp | OOM | | |
 | ddp_fp16 | OOM | | |
 | gemini | 69.39 G | 1.3 | 208M |
-| low_level_zero | 56.89 G | 1.4 | 208M |
+| low_level_zero | 56.89 G | 1.4 | 208M |


@@ -38,8 +38,8 @@ def move_to_cuda(batch):
 @torch.no_grad()
-def evaluate_model(model: nn.Module, test_dataloader: Union[DataLoader, List[DataLoader]], num_labels: int, task_name: str,
-                   eval_splits: List[str], coordinator: DistCoordinator):
+def evaluate_model(model: nn.Module, test_dataloader: Union[DataLoader, List[DataLoader]], num_labels: int,
+                   task_name: str, eval_splits: List[str], coordinator: DistCoordinator):
     metric = evaluate.load("glue", task_name, process_id=coordinator.rank, num_process=coordinator.world_size)
     model.eval()
@@ -142,7 +142,7 @@ def main():
     if args.plugin.startswith('torch_ddp'):
         plugin = TorchDDPPlugin()
     elif args.plugin == 'gemini':
-        plugin = GeminiPlugin(placement_policy='cuda', strict_ddp_mode=True, initial_scale=2**5)
+        plugin = GeminiPlugin(initial_scale=2**5)
     elif args.plugin == 'low_level_zero':
         plugin = LowLevelZeroPlugin(initial_scale=2**5)
@@ -208,7 +208,7 @@ def main():
         train_epoch(epoch, model, optimizer, lr_scheduler, train_dataloader, booster, coordinator)
         results = evaluate_model(model, test_dataloader, data_builder.num_labels, args.task, data_builder.eval_splits,
-                                 coordinator)
+                                coordinator)
     if coordinator.is_master():
         print(results)
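
The diff above drops `placement_policy='cuda'` and `strict_ddp_mode=True` from the example, relying on the refactored plugin's defaults instead. For completeness, a hedged sketch of how the new static policy's offload fractions might be tuned; the `offload_optim_frac` and `offload_param_frac` names are assumptions based on the docstring update mentioned in the commit list, not confirmed by this diff:

```python
# Hypothetical tuning of the static placement policy's offload knobs;
# offload_optim_frac / offload_param_frac are assumed parameter names.
from colossalai.booster.plugin import GeminiPlugin

plugin = GeminiPlugin(
    placement_policy="static",  # fixed device split, no runtime chunk migration
    offload_optim_frac=0.5,     # keep half of the optimizer states on CPU
    offload_param_frac=0.0,     # keep all parameters on GPU
    initial_scale=2**5,         # same loss-scale start as in the diff above
)
```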