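A Roundup of GPU Compute Performance Benchmarking Tools

I. Mainstream Open-Source Tools for Benchmarking GPU LLM Compute (by Scenario)

(1) Low-level operator / micro-benchmark tests (theoretical GPU compute and operator efficiency)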

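byte_micro_perf (xpu-perf): open-sourced by ByteDance; benchmarks core LLM operators such as GEMM and GELU, reports TFLOPS and bandwidth, supports GPUs from multiple vendors, and helps locate low-level compute bottlenecks. URL: https://github.com/bytedance/xpu-perf
NVIDIA CUDA Microbenchmarks: official LLM operator micro-benchmarks covering FP8/FP16 matrix multiplication, FlashAttention, and similar kernels; useful for pinpointing hardware or compiler issues. URL: https://github.com/NVIDIA/cuda-samples/tree/master/Samples/5_Domain_Specific/LLM
torch-benchmark (official PyTorch): native operator benchmarks, including operators common in LLMs; supports multi-GPU and quantization, and suits GPU comparisons within the PyTorch ecosystem. URL: https://github.com/pytorch/benchmark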
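
The TFLOPS and bandwidth figures these micro-benchmarks report come from simple arithmetic on the problem shape and the measured kernel time. The sketch below shows that arithmetic for a GEMM of shape (M x K) @ (K x N). The function names and the sample timing are illustrative assumptions, not taken from any of the tools above, and the bandwidth figure assumes the minimum possible memory traffic.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOPS for one (m x k) @ (k x n) matrix multiplication.

    A GEMM performs m*n*k multiply-adds, conventionally counted as 2*m*n*k FLOPs.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * m * n * k / seconds / 1e12


def gemm_bandwidth_gbs(m: int, n: int, k: int, seconds: float,
                       bytes_per_elem: int = 2) -> float:
    """Effective bandwidth in GB/s, assuming A and B are each read once
    and C is written once (bytes_per_elem=2 corresponds to FP16/BF16)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    total_bytes = bytes_per_elem * (m * k + k * n + m * n)
    return total_bytes / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical measurement: a 4096 x 4096 x 4096 FP16 GEMM taking 1 ms.
    print(f"{gemm_tflops(4096, 4096, 4096, 1e-3):.1f} TFLOPS")
    print(f"{gemm_bandwidth_gbs(4096, 4096, 4096, 1e-3):.1f} GB/s")
```

Comparing the achieved TFLOPS against the GPU's datasheet peak gives a utilization ratio, which is the usual way to tell a compute-bound operator from a bandwidth-bound one.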

(2) End-to-end LLM inference performance tests (real models; the most commonly used category)

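vLLM Benchmark: built into vLLM; measures TTFT, tokens/s, throughput, and the effect of quantization; supports mainstream models and GPUs from multiple vendors. URL: https://github.com/vllm-project/vllm/tree/main/benchmarks; example command: python benchmarks/benchmark_throughput.py --model meta-llama/Llama-3-8B --tensor-parallel-size 1
llmperf (Ray): end-to-end load testing of a serving endpoint; measures latency, throughput, and concurrency behavior, works with multiple inference frameworks, and suits simulating production traffic. URL: https://github.com/ray-project/llmperf
trtllm-bench (built into TensorRT-LLM): official NVIDIA tool; benchmarks quantization, batching, and multi-GPU parallelism to show the best performance achievable on NVIDIA GPUs. URL: https://github.com/NVIDIA/TensorRT-LLM; example command (--model belongs to the top-level command, and the throughput subcommand additionally expects a dataset file via --dataset): trtllm-bench --model meta-llama/Llama-3-8B throughput --engine_dir ./engine
SGLang Benchmark: tests high concurrency and long contexts, and compares multiple inference frameworks. URL: https://github.com/sgl-project/sglang/tree/main/benchmark
GenAI-Perf (NVIDIA Triton): companion tool to Triton; measures end-to-end LLM/VLM performance and supports multiple frameworks and multi-GPU setups. URL: https://github.com/triton-inference-server/genai-perf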

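(3) Lightweight, quick tests for consumer GPUs (for individual users)

ollama-benchmark: built on Ollama; runs GGUF models with one command, measures speed and VRAM usage, and generates charts. URL: https://github.com/CordatusAI/ollama-benchmark
ModelBench: lightweight local benchmark; reports latency and VRAM usage and produces HTML/CSV reports. URL: https://github.com/ayinedjimi/ModelBench
CanIRunAI: quickly reads GPU specifications and suggests which models can run and at which quantization level. URL: https://github.com/canirunai/canirunai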

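(4) Comprehensive benchmarks / leaderboards (comparison across GPUs, models, and frameworks)

lm-evaluation-harness: covers accuracy alongside inference speed, supports multiple frameworks, and suits comprehensive comparisons. URL: https://github.com/EleutherAI/lm-evaluation-harness
InferenceX: automated benchmarking platform that tracks performance across hardware and frameworks over time. URL: https://github.com/gpumode/inferencex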

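(5) Other specialized tools

text-generation-inference (TGI) Benchmark: companion to HuggingFace TGI; measures concurrency and throughput. URL: https://github.com/huggingface/text-generation-inference/tree/main/benchmark
gpu_benchmark: unified cross-framework benchmark; measures inference and training performance and GPU utilization. URL: https://github.com/luckyjoy/gpu_benchmark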

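Tool Selection at a Glance

Scenario	Recommended tools	Key strengths
Peak inference performance on NVIDIA GPUs	trtllm-bench, vLLM Bench	Fully optimized; supports quantization and long contexts
Quick comparison of consumer GPUs (RTX)	ollama-benchmark, ModelBench	One-command runs; lightweight visualization
Low-level operator / compute bottleneck analysis	byte_micro_perf, CUDA Microbench	Precise TFLOPS, bandwidth, and operator latency
Production load testing / concurrency	llmperf, GenAI-Perf	Simulates real traffic; works across frameworks
Comparison across GPUs, models, and frameworks	lm-evaluation-harness, InferenceX	Authoritative data; long-term tracking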

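II. Using the Tools Against the OpenAI Interface of LLM Serving Backends (LMS): Key Points

(1) Main conclusion

Nearly all LLM inference servers (vLLM, TGI, SGLang, Ollama, LM Studio, and others) expose an OpenAI-compatible interface that the tools above can test. There are two requirements: the server must have its OpenAI-compatible mode enabled, and the tool must speak the OpenAI API protocol (/v1/chat/completions, /v1/completions).

(2) Tools with native OpenAI interface support (no code changes needed)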

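llmperf (first choice, production-grade): speaks the OpenAI API; measures TTFT, latency, throughput, and concurrency behavior, and writes its results to a results directory. The repository's entry point is the token_benchmark_ray.py script, with the endpoint supplied through environment variables. Example: export OPENAI_API_BASE=http://IP:8000/v1; export OPENAI_API_KEY=dummy; python token_benchmark_ray.py --model MODEL_NAME --llm-api openai --num-concurrent-requests 16 --max-num-completed-requests 100 --mean-input-tokens 1024 --stddev-input-tokens 0 --mean-output-tokens 512 --stddev-output-tokens 0 --results-dir result_outputs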
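GenAI-Perf: companion tool to Triton; supports the OpenAI interface and suits fine-grained tuning. Example command (token lengths are set through the synthetic-input and output-token mean options, and the tool appends the v1/chat/completions path to the URL itself): genai-perf profile -m MODEL_NAME --service-kind openai --endpoint-type chat --url http://IP:8000 --synthetic-input-tokens-mean 2048 --output-tokens-mean 1024 --concurrency 32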
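TGI Benchmark: ships with TGI and is compatible with the OpenAI interface. Command: python benchmark/benchmark.py --endpoint http://IP:8000/v1/chat/completions --model MODEL_NAME
ollama-benchmark: connects directly to Ollama's default OpenAI-compatible endpoint (http://localhost:11434/v1); lightweight and quick.
Minimal Python script: uses the openai library to measure single-request or concurrent performance (code in section (6)).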

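(3) How to enable the OpenAI-compatible interface on mainstream LMS

vLLM: add --api-key dummy --port 8000 to the launch command; the interface is then at http://IP:8000/v1. Command: python -m vllm.entrypoints.openai.api_server --model MODEL_NAME --api-key dummy --port 8000
TGI: enabled by default at http://IP:8080/v1. Command: text-generation-launcher --model-id MODEL_NAME --port 8080
SGLang: launch command: python -m sglang.launch_server --model-path MODEL_NAME --api-key dummy --port 8000
Ollama: default endpoint http://localhost:11434/v1; no extra configuration needed.
LM Studio: after launching, turn on the Local Server; the endpoint is http://localhost:1234/v1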
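
As a rough illustration of how these endpoints are addressed, the sketch below collects the base URLs listed above into a lookup table and builds a chat-completions request with the standard library only. The dictionary keys and the build_chat_request helper are hypothetical names introduced for this example; "IP" remains a placeholder host, and the SGLang entry assumes the port from the launch command above.

```python
import json
import urllib.request

# Base URLs as listed above; "IP" is a placeholder for the server host.
DEFAULT_BASE_URLS = {
    "vllm": "http://IP:8000/v1",
    "tgi": "http://IP:8080/v1",
    "sglang": "http://IP:8000/v1",   # assumes --port 8000 as in the launch command
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
}


def build_chat_request(base_url: str, model: str, prompt: str,
                       api_key: str = "dummy",
                       max_tokens: int = 16) -> urllib.request.Request:
    """Build (but do not send) a POST request to <base_url>/chat/completions."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_chat_request(DEFAULT_BASE_URLS["ollama"], "MODEL_NAME", "ping")
    print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen sends the request; a successful JSON response is a quick check that the OpenAI-compatible mode is actually enabled before a benchmark tool is pointed at the server.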

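(4) Key test parameters and pitfalls

1. Metrics to always measure

TTFT (time to first token), inter-token latency, token throughput, QPS, error rate, and GPU memory usage and utilization. Latency measured at the API layer is typically about 10-50 ms higher than latency measured at the engine layer.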
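
The latency metrics above can all be derived from the arrival timestamps of the streamed tokens. The sketch below shows one way to do that; summarize_stream and its dictionary keys are hypothetical names for this example, and the timestamps in the demo are made up.

```python
from statistics import mean


def summarize_stream(request_start: float, token_times: list) -> dict:
    """Derive per-request metrics from token arrival timestamps (in seconds).

    request_start: time the request was sent.
    token_times:   ascending arrival times of each generated token.
    """
    if not token_times:
        raise ValueError("no tokens received")
    total = token_times[-1] - request_start
    if total <= 0:
        raise ValueError("timestamps must be later than request_start")
    # Gaps between consecutive tokens; empty when only one token arrived.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return {
        "ttft_s": token_times[0] - request_start,
        "inter_token_latency_s": mean(gaps) if gaps else 0.0,
        "tokens_per_s": len(token_times) / total,
    }


if __name__ == "__main__":
    # Hypothetical trace: first token after 0.5 s, then one token every 0.1 s.
    print(summarize_stream(0.0, [0.5, 0.6, 0.7, 0.8, 0.9]))
```

Note that tokens_per_s here includes the TTFT in its denominator; dividing by the decode time alone (last token minus first token) gives a higher, decode-only figure, so it is worth stating which one a report uses.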

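2. Key parameters

The API base URL (--api-base or the tool's equivalent, such as an OPENAI_API_BASE environment variable; normally it must include /v1), --model (must match the name used when the LMS was launched), --api-key (dummy works for most local servers), the concurrency level (step up from 8 to 16 to 32 to find the point where throughput stops scaling), and the prompt and generation lengths (chosen to mirror the real workload).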
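
Finding the concurrency knee can be automated once throughput has been recorded at each level. The sketch below picks the last concurrency level that still improved throughput by a chosen margin over the previous one; find_knee, the 10% threshold, and the sample numbers are all assumptions made for illustration.

```python
def find_knee(results, min_gain: float = 0.10) -> int:
    """Return the concurrency level at which throughput stops scaling.

    results:  list of (concurrency, tokens_per_s) pairs, sorted by concurrency.
    min_gain: minimum relative throughput gain that counts as "still scaling".
    """
    if not results:
        raise ValueError("results must not be empty")
    knee = results[0][0]
    for (_, prev_tps), (cur_conc, cur_tps) in zip(results, results[1:]):
        if prev_tps > 0 and (cur_tps - prev_tps) / prev_tps >= min_gain:
            knee = cur_conc  # this step still paid off
        else:
            break  # gain fell below the threshold: previous level is the knee
    return knee


if __name__ == "__main__":
    # Hypothetical sweep results: (concurrency, tokens/s).
    sweep = [(8, 1000.0), (16, 1800.0), (32, 1900.0), (64, 1850.0)]
    print("knee at concurrency", find_knee(sweep))
```

Latency usually keeps rising past the knee while throughput stays flat, so the knee is a reasonable starting point for the production concurrency limit.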

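3. Pitfalls

The API layer adds overhead, so throughput is somewhat lower than when calling the engine directly (roughly 5%-15% lower).
The maximum concurrency is capped by the LMS configuration (--max-num-seqs and similar settings).
TTFT can only be measured with streaming requests (stream=True).
The model name must exactly match the one used when the LMS was launched; when testing locally, make sure the network is not the bottleneck.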

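(5) Tool selection for OpenAI interface scenarios

Production-grade load testing: llmperf (first choice)
Tuning a Triton deployment: GenAI-Perf
Consumer-grade LMS: ollama-benchmark
TGI deployments: TGI Benchmark
Minimal verification: a Python script using the openai library

(6) Minimal Python test script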

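import time

from openai import OpenAI

# Connect to the LMS OpenAI-compatible endpoint (replace host, port, and model name)
client = OpenAI(base_url="http://YOUR-LMS-IP:8000/v1", api_key="dummy")
start = time.perf_counter()
# Stream the response so that TTFT can be measured
stream = client.chat.completions.create(
    model="MODEL_NAME",
    messages=[{"role": "user", "content": "test prompt"}],
    stream=True,
    max_tokens=512,
)
ttft = None
token_count = 0
for chunk in stream:
    # Skip chunks that carry no text (role-only chunks or chunks with empty choices)
    if not chunk.choices or not chunk.choices[0].delta.content:
        continue
    if ttft is None:
        ttft = time.perf_counter() - start  # time until the first chunk with content
    token_count += 1  # counts content chunks, which approximates the token count
total_time = time.perf_counter() - start
if ttft is None:
    print("No content received")
else:
    print(f"TTFT: {ttft:.2f}s, generated tokens (approx.): {token_count}, throughput: {token_count / total_time:.2f} token/s")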
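
The script above measures a single request. For concurrent measurements, one possible extension is a small thread-pool driver like the sketch below. It is deliberately independent of any client library: send_request is a hypothetical callable that performs one request and returns the number of generated tokens, so the single-request logic above could be wrapped into such a function. The fake request in the demo only sleeps and is not a real server call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(send_request, num_requests: int, concurrency: int) -> dict:
    """Issue num_requests calls with at most `concurrency` in flight at once.

    send_request: zero-argument callable that performs one request and
                  returns the number of generated tokens (hypothetical interface).
    """
    def timed_call(_):
        t0 = time.perf_counter()
        try:
            tokens, ok = send_request(), True
        except Exception:
            tokens, ok = 0, False  # count the failure instead of aborting the run
        return time.perf_counter() - t0, tokens, ok

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(num_requests)))
    wall = time.perf_counter() - wall_start

    total_tokens = sum(tokens for _, tokens, _ in results)
    errors = sum(1 for _, _, ok in results if not ok)
    return {
        "requests": num_requests,
        "errors": errors,
        "error_rate": errors / num_requests,
        "total_tokens": total_tokens,
        "qps": num_requests / wall,
        "tokens_per_s": total_tokens / wall,
        "mean_latency_s": sum(lat for lat, _, _ in results) / num_requests,
    }


if __name__ == "__main__":
    def fake_request():
        time.sleep(0.01)  # stands in for network and generation time
        return 32         # pretend 32 tokens were generated

    print(run_load(fake_request, num_requests=8, concurrency=4))
```

Threads are sufficient here because each worker spends its time waiting on the network; running the driver at several concurrency values produces the throughput-versus-concurrency data needed to locate the scaling limit.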