Mirror of https://github.com/hpcaitech/ColossalAI.git, synced 2025-09-28 21:17:08 +00:00
[inference] Optimize the usage of the mid tensors space in flash attn (#5304)
* opt flash attn
* opt tmp tensor
* fix benchmark_llama
* fix code style
* fix None logic for output tensor
* fix adapted to get_xine_cache
* add comment
* fix ci bugs
* fix some codes
* rm duplicated codes
* rm duplicated codes
* fix code style
* add _get_dtype in config.py
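Per the commit title, the headline change is reusing intermediate ("mid") tensor space across flash-attention calls instead of allocating it per call, with None handling for the output tensor when no buffer is passed. What follows is a minimal, hypothetical sketch of that buffer-reuse pattern in plain PyTorch; `AttnWorkspace`, `attention_step`, and the size parameters are illustrative names, not ColossalAI's actual API.

import torch
import torch.nn.functional as F

class AttnWorkspace:
    """Allocate the attention output buffer once and hand out views,
    so decode steps and layers stop allocating a fresh mid tensor."""

    def __init__(self, max_batch: int, num_heads: int, max_tokens: int,
                 head_dim: int, dtype=torch.float16, device="cuda"):
        # One allocation up front, sized for the worst case.
        self.mid_output = torch.empty(
            (max_batch, num_heads, max_tokens, head_dim),
            dtype=dtype, device=device,
        )

    def get(self, batch: int, num_tokens: int) -> torch.Tensor:
        # A view into the shared buffer; no per-call allocation.
        return self.mid_output[:batch, :, :num_tokens, :]

def attention_step(q, k, v, workspace: AttnWorkspace):
    # q, k, v: (batch, heads, tokens, head_dim)
    out = workspace.get(q.shape[0], q.shape[2])
    # A fused kernel (e.g. flash attention) would write into `out`
    # directly; plain PyTorch emulates that with an in-place copy.
    out.copy_(F.scaled_dot_product_attention(q, k, v))
    return out

In a real engine the attention kernel takes the buffer as its output pointer, so even the copy disappears; the point is that the allocation happens once at startup rather than at every layer and decode step.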
@@ -91,7 +91,7 @@ def benchmark_inference(args):
     config.pad_token_id = config.eos_token_id
     model = transformers.LlamaForCausalLM(config).cuda()
     model = model.eval()
-    tokenizer = AutoTokenizer.from_pretrained("/home/caidi/llama_model/")
+    tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
 
     if args.dtype == "fp16":
         model = model.half()
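The diff itself is the benchmark/CI fix rather than the optimization: the tokenizer was loaded from a developer's local path, which exists on only one machine, and switching to the public hf-internal-testing/llama-tokenizer hub ID lets benchmark_llama run anywhere. A sketch of the same idea with an optional local override follows; the `load_tokenizer` helper is hypothetical, not part of the repo.

from transformers import AutoTokenizer

def load_tokenizer(path: str = None):
    # Fall back to the small public test tokenizer so CI needs no
    # local checkpoint; pass a path to use a local model instead.
    return AutoTokenizer.from_pretrained(path or "hf-internal-testing/llama-tokenizer")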