https://github.com/ggml-org/llama.cpp/blob/master/tools/llama-bench/README.md
Command reference
usage: llama-bench [options]
options:
-h, --help
--numa <distribute|isolate|numactl> numa mode (default: disabled)
-r, --repetitions <n> number of times to repeat each test (default: 5)
--prio <0|1|2|3> process/thread priority (default: 0)
--delay <0...N> (seconds) delay between each test (default: 0)
-o, --output <csv|json|jsonl|md|sql> output format printed to stdout (default: md)
-oe, --output-err <csv|json|jsonl|md|sql> output format printed to stderr (default: none)
--list-devices list available devices and exit
-v, --verbose verbose output
--progress print test progress indicators
-rpc, --rpc <rpc_servers> register RPC devices (comma separated)
test parameters:
-m, --model <filename> (default: models/7B/ggml-model-q4_0.gguf)
-p, --n-prompt <n> (default: 512)
-n, --n-gen <n> (default: 128)
-pg <pp,tg> (default: )
-d, --n-depth <n> (default: 0)
-b, --batch-size <n> (default: 2048)
-ub, --ubatch-size <n> (default: 512)
-ctk, --cache-type-k <t> (default: f16)
-ctv, --cache-type-v <t> (default: f16)
-t, --threads <n> (default: system dependent)
-C, --cpu-mask <hex,hex> (default: 0x0)
--cpu-strict <0|1> (default: 0)
--poll <0...100> (default: 50)
-ngl, --n-gpu-layers <n> (default: 99)
-ncmoe, --n-cpu-moe <n> (default: 0)
-sm, --split-mode <none|layer|row> (default: layer)
-mg, --main-gpu <i> (default: 0)
-nkvo, --no-kv-offload <0|1> (default: 0)
-fa, --flash-attn <0|1> (default: 0)
-dev, --device <dev0/dev1/...> (default: auto)
-mmp, --mmap <0|1> (default: 1)
-embd, --embeddings <0|1> (default: 0)
-ts, --tensor-split <ts0/ts1/..> (default: 0)
-ot --override-tensors <tensor name pattern>=<buffer type>;... (default: disabled)
-nopo, --no-op-offload <0|1> (default: 0)
-fitt, --fit-target <MiB> fit model to device memory with this margin per device in MiB (default: off)
-fitc, --fit-ctx <n> minimum ctx size for --fit-target (default: 4096)
Multiple values can be given for each parameter by separating them with ','
or by specifying the parameter multiple times. Ranges can be given as
'first-last' or 'first-last+step' or 'first-last*mult'.

Example - llama-bench
$ ./llama-bench -m models/7B/ggml-model-q4_0.gguf -m models/13B/ggml-model-q4_0.gguf -p 0 -n 128,256,512
| model                 | size     | params  | backend | ngl | test   | t/s           |
| --------------------- | -------: | ------: | ------- | --: | ------ | ------------: |
| llama 7B mostly Q4_0  | 3.56 GiB | 6.74 B  | CUDA    |  99 | tg 128 | 132.19 ± 0.55 |
| llama 7B mostly Q4_0  | 3.56 GiB | 6.74 B  | CUDA    |  99 | tg 256 | 129.37 ± 0.54 |
| llama 7B mostly Q4_0  | 3.56 GiB | 6.74 B  | CUDA    |  99 | tg 512 | 123.83 ± 0.25 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA    |  99 | tg 128 | 82.17 ± 0.31  |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA    |  99 | tg 256 | 80.74 ± 0.23  |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA    |  99 | tg 512 | 78.08 ± 0.07  |

(t/s is reported as mean ± standard deviation over the -r repetitions, default 5.)

Example - llama-batched-bench
./llama-batched-bench \
    -m /path/to/your/model.gguf \
    -ngl 99 \
    -c 4096 \
    -b 2048 -ub 512 \
    -npp 512 -ntg 128 \
    -npl 1,2,4,8,16,32 \
    -pps -t 16 -kvu \
    -p "your test prompt"

Flags: -m is the model being loaded; -ngl 99 offloads all layers to the GPU (match your current run); -c 4096 is the context size (match your current n_ctx); -b 2048 and -ub 512 are the logical batch and physical micro-batch sizes; -npp 512 is the prompt tokens and -ntg 128 the generated tokens per request; -npl 1,2,4,8,16,32 tests 1 through 32 parallel requests; -pps shares the prompt across sequences (reuses the KV cache, closer to real chat traffic); -t 16 is the CPU thread count (match your current -t); -kvu uses a single unified KV cache buffer; -p pins a fixed test prompt to avoid randomness.

Example - curl
curl -s -X POST http://localhost:8080/completion -d '{"model":"Qwen-27B", "prompt":"test","n_predict":128}'
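The curl call above returns a JSON body from llama-server that includes the generated text plus a timings object. A minimal Python sketch of pulling throughput out of it (field names are as I recall them from llama-server's /completion response and the sample numbers are invented; verify both against your build):

```python
import json

# Hypothetical /completion response, trimmed to the fields we read;
# a real response contains many more fields.
sample = json.dumps({
    "content": " This is a test ...",
    "tokens_predicted": 128,
    "timings": {"predicted_ms": 950.0, "predicted_per_second": 134.7},
})

resp = json.loads(sample)
tps = resp["timings"]["predicted_per_second"]   # server-reported generation speed
# Cross-check against token count and wall time:
approx = resp["tokens_predicted"] / (resp["timings"]["predicted_ms"] / 1000)
print(f"{tps:.1f} t/s (recomputed: {approx:.1f})")
```

With a live server you would replace `sample` with the body returned by the curl command.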
hey -n 3 -c 3 -H "Authorization: Bearer YOUR_API_KEY" -H "Content-Type: application/json" -m POST -d '{"model":"Qwen-27B","prompt":"test","max_tokens":64}' http://localhost:8080/v1/completions
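For reference, the 'first-last', 'first-last+step', and 'first-last*mult' range syntax accepted by the test parameters can be read as additive or multiplicative sweeps. A sketch of that expansion in Python (my interpretation of the documented semantics, not code from llama-bench; assumes a plain 'first-last' steps by 1):

```python
def expand_range(spec: str) -> list[int]:
    """Expand 'first-last', 'first-last+step', or 'first-last*mult'."""
    if "+" in spec:                                # additive: first-last+step
        body, step = spec.split("+")
        first, last = map(int, body.split("-"))
        return list(range(first, last + 1, int(step)))
    if "*" in spec:                                # multiplicative: first-last*mult
        body, mult = spec.split("*")
        first, last = map(int, body.split("-"))
        out = []
        while first <= last:
            out.append(first)
            first *= int(mult)
        return out
    first, last = map(int, spec.split("-"))        # plain: first-last, step 1
    return list(range(first, last + 1))

print(expand_range("16-64+16"))  # as in e.g. -n 16-64+16 -> [16, 32, 48, 64]
print(expand_range("1-8*2"))     # as in e.g. -t 1-8*2   -> [1, 2, 4, 8]
```

So `-t 1-8*2` runs the benchmark at 1, 2, 4, and 8 threads, one test per value.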