llama.cpp/tools/llama-bench

https://github.com/ggml-org/llama.cpp/blob/master/tools/llama-bench/README.md 

Command reference

usage: llama-bench [options]

options:
  -h, --help
  --numa <distribute|isolate|numactl>       numa mode (default: disabled)
  -r, --repetitions <n>                     number of times to repeat each test (default: 5)
  --prio <0|1|2|3>                          process/thread priority (default: 0)
  --delay <0...N> (seconds)                 delay between each test (default: 0)
  -o, --output <csv|json|jsonl|md|sql>      output format printed to stdout (default: md)
  -oe, --output-err <csv|json|jsonl|md|sql> output format printed to stderr (default: none)
  --list-devices                            list available devices and exit
  -v, --verbose                             verbose output
  --progress                                print test progress indicators
  -rpc, --rpc <rpc_servers>                 register RPC devices (comma separated)

test parameters:
  -m, --model <filename>                    (default: models/7B/ggml-model-q4_0.gguf)
  -p, --n-prompt <n>                        (default: 512)
  -n, --n-gen <n>                           (default: 128)
  -pg <pp,tg>                               (default: )
  -d, --n-depth <n>                         (default: 0)
  -b, --batch-size <n>                      (default: 2048)
  -ub, --ubatch-size <n>                    (default: 512)
  -ctk, --cache-type-k <t>                  (default: f16)
  -ctv, --cache-type-v <t>                  (default: f16)
  -t, --threads <n>                         (default: system dependent)
  -C, --cpu-mask <hex,hex>                  (default: 0x0)
  --cpu-strict <0|1>                        (default: 0)
  --poll <0...100>                          (default: 50)
  -ngl, --n-gpu-layers <n>                  (default: 99)
  -ncmoe, --n-cpu-moe <n>                   (default: 0)
  -sm, --split-mode <none|layer|row>        (default: layer)
  -mg, --main-gpu <i>                       (default: 0)
  -nkvo, --no-kv-offload <0|1>              (default: 0)
  -fa, --flash-attn <0|1>                   (default: 0)
  -dev, --device <dev0/dev1/...>            (default: auto)
  -mmp, --mmap <0|1>                        (default: 1)
  -embd, --embeddings <0|1>                 (default: 0)
  -ts, --tensor-split <ts0/ts1/..>          (default: 0)
  -ot --override-tensors <tensor name pattern>=<buffer type>;...
                                            (default: disabled)
  -nopo, --no-op-offload <0|1>              (default: 0)
  -fitt, --fit-target <MiB>                 fit model to device memory with this margin per device in MiB (default: off)
  -fitc, --fit-ctx <n>                      minimum ctx size for --fit-target (default: 4096)

Multiple values can be given for each parameter by separating them with ','
or by specifying the parameter multiple times. Ranges can be given as
'first-last' or 'first-last+step' or 'first-last*mult'.
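As an illustration of how those range forms expand, here is a small sketch in Python (a hypothetical helper, not part of llama-bench; the actual parser lives in the C++ source):

```python
def expand_range(spec: str) -> list[int]:
    """Expand a llama-bench style range spec:
    'first-last' (step 1), 'first-last+step' (additive),
    'first-last*mult' (multiplicative). Illustration only."""
    for sep in ("+", "*"):
        if sep in spec:
            range_part, step_part = spec.split(sep)
            first, last = map(int, range_part.split("-"))
            step = int(step_part)
            out, v = [], first
            while v <= last:
                out.append(v)
                v = v + step if sep == "+" else v * step
            return out
    first, last = map(int, spec.split("-"))
    return list(range(first, last + 1))

print(expand_range("1-4"))          # [1, 2, 3, 4]
print(expand_range("16-64*2"))      # [16, 32, 64]
print(expand_range("128-512+128"))  # [128, 256, 384, 512]
```

So for example `-n 128-512*2` would run the generation test at 128, 256, and 512 tokens.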

Example - llama-bench

$ ./llama-bench -m models/7B/ggml-model-q4_0.gguf -m models/13B/ggml-model-q4_0.gguf -p 0 -n 128,256,512
| model                 |       size |     params | backend | ngl | test   |            t/s |
| --------------------- | ---------: | ---------: | ------- | --: | ------ | -------------: |
| llama 7B mostly Q4_0  |   3.56 GiB |     6.74 B | CUDA    |  99 | tg 128 |  132.19 ± 0.55 |
| llama 7B mostly Q4_0  |   3.56 GiB |     6.74 B | CUDA    |  99 | tg 256 |  129.37 ± 0.54 |
| llama 7B mostly Q4_0  |   3.56 GiB |     6.74 B | CUDA    |  99 | tg 512 |  123.83 ± 0.25 |
| llama 13B mostly Q4_0 |   6.86 GiB |    13.02 B | CUDA    |  99 | tg 128 |   82.17 ± 0.31 |
| llama 13B mostly Q4_0 |   6.86 GiB |    13.02 B | CUDA    |  99 | tg 256 |   80.74 ± 0.23 |
| llama 13B mostly Q4_0 |   6.86 GiB |    13.02 B | CUDA    |  99 | tg 512 |   78.08 ± 0.07 |
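For machine processing, `-o jsonl` emits one JSON object per test instead of the md table. A minimal parsing sketch in Python (the field names `n_gen`, `avg_ts`, `stddev_ts` are assumptions about typical output; check the actual keys your build emits):

```python
import json

# One illustrative jsonl record; real output contains many more fields.
sample = '{"model_type": "7B", "n_gen": 128, "avg_ts": 132.19, "stddev_ts": 0.55}'

results = [json.loads(line) for line in sample.splitlines() if line.strip()]
for r in results:
    # Reproduce the "tg N  mean ± stddev" summary from the md table.
    print(f'tg {r["n_gen"]}: {r["avg_ts"]:.2f} ± {r["stddev_ts"]:.2f} t/s')
```

In practice you would pipe the tool's stdout into such a script, e.g. `llama-bench -o jsonl ... | python parse.py`.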

Example - llama-batched-bench

# -m    model to load
# -ngl  GPU layers; match your current run (99 = everything on the GPU)
# -c    context window; match your current n_ctx
# -b    batch size (prompt batch)
# -ub   micro-batch size
# -npp  prompt tokens per request (512)
# -ntg  generated tokens per request (128)
# -npl  parallelism: 1, 2, 4, 8, 16, 32 concurrent requests
# -pps  shared prompt (reuses the KV cache; closer to a real chat workload)
# -t    CPU threads; match your current -t
# -kvu  enable KV cache reuse
# -p    fixed test prompt, to avoid randomness
./llama-batched-bench \
  -m /path/to/your/model.gguf \
  -ngl 99 -c 4096 -b 2048 -ub 512 \
  -npp 512 -ntg 128 -npl 1,2,4,8,16,32 \
  -pps -t 16 -kvu \
  -p "your test prompt"
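A rough sketch of why -pps matters for memory: with a shared prompt the KV cache holds one copy of the prompt plus per-sequence generated tokens, while without sharing each sequence keeps its own prompt copy. The formulas below are an approximation of batched-bench's accounting, not measured numbers:

```python
npp, ntg = 512, 128  # prompt / generated tokens per request, as in the command above

for npl in (1, 2, 4, 8, 16, 32):
    kv_shared = npp + ntg * npl     # -pps: one shared prompt copy in the KV cache
    kv_private = (npp + ntg) * npl  # no sharing: prompt duplicated per sequence
    print(f"npl={npl:2d}  shared={kv_shared:5d} tokens  private={kv_private:5d} tokens")
```

At npl=32 the shared-prompt run needs roughly a quarter of the KV tokens of the unshared one, which is why -pps better matches a single long conversation fanned out to many requests.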

Example - curl

curl -s -X POST http://localhost:8080/completion -d '{"model":"Qwen-27B", "prompt":"test","n_predict":128}'
hey -n 3 -c 3 -H "Authorization: Bearer YOUR_API_KEY" -H "Content-Type: application/json" -m POST -d '{"model":"Qwen-27B","prompt":"test","max_tokens":64}' http://localhost:8080/v1/completions
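The same request can also be issued from Python; a sketch using only the standard library (the URL and field names mirror the curl call above and assume a llama-server listening locally; adjust to your deployment):

```python
import json
import urllib.request

# Same body as the curl example above.
payload = {"model": "Qwen-27B", "prompt": "test", "n_predict": 128}

def post_completion(url: str = "http://localhost:8080/completion") -> dict:
    """POST the payload to a running llama-server. Requires the server to be up."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Not invoked here, since it needs a live server; just show the request body.
print(json.dumps(payload))
```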