[inference/model]Adapted to the baichuan2-7B model (#5591)

* Adapted to the baichuan2-7B model
* modified according to the review comments.
* Modified the method of obtaining random weights.
* modified according to the review comments.
* change mlp layer 'NOTE'
2025-09-02 01:28:31 +00:00 · 2024-04-15 16:53:02 +08:00
parent d4cb023b62
commit 56b222eff8
8 changed files with 354 additions and 2 deletions
--- a/examples/inference/benchmark_llama.py
+++ b/examples/inference/benchmark_llama.py
@@ -117,6 +117,7 @@ def benchmark_inference(args):
                max_output_len=args.output_len,
                prefill_ratio=1.2,
                block_size=32,
+                use_cuda_kernel=True,
            )
            engine = InferenceEngine(model, tokenizer, inference_config, verbose=True)
        elif args.mode == "vllm":