[Inference]Fused kv copy into rotary calculation (#5383)

* revise rotary embedding

* remove useless print

* adapt

* fix

* add

* fix

* modeling

* fix

* fix

* fix

* fused kv copy

* fused copy

* colossalai/kernel/triton/no_pad_rotary_embedding.py

* del padding llama

* del

This commit is contained in:
Jianghai
2024-02-21 11:31:48 +08:00
committed by GitHub
parent b21aac5bae
commit 730103819d
8 changed files with 391 additions and 498 deletions
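The idea behind the fused kernel is to apply the rotary embedding to the new key/value states and scatter the results straight into the pre-allocated KV cache in a single pass, instead of running a rotary kernel followed by a separate cache-copy kernel. A minimal NumPy sketch of that fusion (function name, shapes, and the `slot_ids` layout are assumptions for illustration; the actual implementation is a Triton kernel in `colossalai/kernel/triton/no_pad_rotary_embedding.py`):

```python
import numpy as np

def rotary_kv_copy_fused(k, cos, sin, k_cache, slot_ids):
    """Hypothetical sketch: rotate keys and write them directly into
    the cache, fusing what would otherwise be two kernel launches.

    k:        (num_tokens, head_dim) new key states
    cos, sin: (num_tokens, head_dim // 2) rotary tables per position
    k_cache:  (num_slots, head_dim) pre-allocated KV cache
    slot_ids: (num_tokens,) destination cache slot for each token
    """
    half = k.shape[-1] // 2
    k1, k2 = k[:, :half], k[:, half:]
    # rotate-half formulation of rotary position embedding
    rot1 = k1 * cos - k2 * sin
    rot2 = k2 * cos + k1 * sin
    # fused step: scatter the rotated keys straight into the cache,
    # so no intermediate rotated tensor is materialized and re-copied
    k_cache[slot_ids, :half] = rot1
    k_cache[slot_ids, half:] = rot2
    return k_cache
```

With `cos = 1` and `sin = 0` the rotation is the identity, so the cache slots should simply receive the input keys, which makes the fusion easy to sanity-check.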

@@ -204,7 +204,7 @@ def benchmark_inference(args):
torch.cuda.cudart().cudaProfilerStop()
if args.profile:
ctx.step()
print(f"config: batch_size {args.batch_size}, input_len {args.seq_len}, output_len {args.output_len}")
print_details_info(model.config, args, whole_end2end, total_token_num)