[Inference]Fused kv copy into rotary calculation (#5383)

* revise rotary embedding

* remove useless print

* adapt

* fix

* add

* fix

* modeling

* fix

* fix

* fix

* fused kv copy

* fused copy

* colossalai/kernel/triton/no_pad_rotary_embedding.py

* del padding llama

* del

This commit is contained in:
Jianghai
2024-02-21 11:31:48 +08:00
committed by GitHub
parent b21aac5bae
commit 730103819d
8 changed files with 391 additions and 498 deletions
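The idea behind the fused kernel is to apply the rotary embedding to the new key/value states and scatter the results straight into the pre-allocated KV cache in a single pass, instead of running a rotary kernel followed by a separate cache-copy kernel. A minimal NumPy sketch of that fusion (function name, shapes, and the `slot_ids` layout are assumptions for illustration; the actual implementation is a Triton kernel in `colossalai/kernel/triton/no_pad_rotary_embedding.py`):

```python
import numpy as np

def rotary_kv_copy_fused(k, cos, sin, k_cache, slot_ids):
    """Hypothetical sketch: rotate keys and write them directly into
    the cache, fusing what would otherwise be two kernel launches.

    k:        (num_tokens, head_dim) new key states
    cos, sin: (num_tokens, head_dim // 2) rotary tables per position
    k_cache:  (num_slots, head_dim) pre-allocated KV cache
    slot_ids: (num_tokens,) destination cache slot for each token
    """
    half = k.shape[-1] // 2
    k1, k2 = k[:, :half], k[:, half:]
    # rotate-half formulation of rotary position embedding
    rot1 = k1 * cos - k2 * sin
    rot2 = k2 * cos + k1 * sin
    # fused step: scatter the rotated keys straight into the cache,
    # so no intermediate rotated tensor is materialized and re-copied
    k_cache[slot_ids, :half] = rot1
    k_cache[slot_ids, half:] = rot2
    return k_cache
```

With `cos = 1` and `sin = 0` the rotation is the identity, so the cache slots should simply receive the input keys, which makes the fusion easy to sanity-check.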

@@ -204,7 +204,7 @@ def benchmark_inference(args):
torch.cuda.cudart().cudaProfilerStop()
if args.profile:
ctx.step()
print(f"config: batch_size {args.batch_size}, input_len {args.seq_len}, output_len {args.output_len}")
print_details_info(model.config, args, whole_end2end, total_token_num)