原始方案
.\llama-server.exe
--model Qwen3.5-9B-UD-Q4_K_XL.gguf # 模型名称(GGUF格式)
--mmproj mmproj-F16-9B.gguf # 多模态视觉投影文件,让Qwen能“看图”
--alias "unsloth/Qwen3.5-9B-GGUF" # 客户端API链接模型时,显示的名称
--ctx-size 65536 # 设置64k上下文,支持更长文本处理
--temp 0.7 # 创造力调节器,控制生成随机性
--top-p 0.8 # 核心采样参数,控制多样性
--top-k 20 # 核心采样参数,限制候选词数量
--min-p 0.00 # 核心采样参数,过滤低概率词
--port 8880 # 暴露服务端口,可自定义
--n-gpu-layers 99 # 将99层模型层(本模型约48层)全部加载到GPU
--flash-attn auto # 自动启用Flash Attention,显存效率更高、文本更长
--reasoning off # 关闭思考模式(可选)
--cache-type-k q4_0 # 开启缓存量化(K缓存,Q4_0精度)
--cache-type-v q4_0 # 开启缓存量化(V缓存,Q4_0精度)
--batch-size 4096 # 增加吞吐量,默认2048
--ubatch-size 1024 # 增加吞吐量,默认512
测试 - 抖音版
---------------------- mobax cmd-------------------------
D:\softWin\ProgramFiles_AI\llama\cuda\llama-bench.exe ^
-m D:\OS\gguf\Qwen3.5-9B-Claude-4.6-Opus-Deckard-V4.2-Uncensored-Heretic-Thinking.i1-Q4_K_M.gguf ^
-p 512 -n 128 ^
-ngl 99 ^
-fa 1 ^
-ctk q4_0 -ctv q4_0 ^
-b 4096 -ub 1024
----------------------------------------------------------
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 8187 MiB):
Device 0: NVIDIA GeForce RTX 4060, compute capability 8.9, VMM: yes, VRAM: 8187 MiB
| model | size | params | backend | ngl | n_batch | n_ubatch | type_k |
type_v | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -----: |
-----: | -: | --------------: | -------------------: |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 4096 | 1024 | q4_0 |
q4_0 | 1 | pp512 | 2011.09 ± 21.36 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 4096 | 1024 | q4_0 |
q4_0 | 1 | tg128 | 42.43 ± 0.15 |
build: 5d2b52d80 (8911)
测试 - 在用版
---------------------- n-gpu-layers -------------------------
llama-bench.exe -m D:\OS\gguf\Qwopus3.5-9B-v3.5.i1-Q4_K_M.gguf --n-prompt 1000 --n-gen 50 --n-gpu-layers 10-90+10
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 8187 MiB):
Device 0: NVIDIA GeForce RTX 4060, compute capability 8.9, VMM: yes, VRAM: 8187 MiB
load_backend: loaded CUDA backend from D:\softWin\ProgramFiles_AI\llama\cuda\ggml-cuda.dll
load_backend: loaded RPC backend from D:\softWin\ProgramFiles_AI\llama\cuda\ggml-rpc.dll
load_backend: loaded CPU backend from D:\softWin\ProgramFiles_AI\llama\cuda\ggml-cpu-zen4.dll
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 10 | pp1000 | 584.60 ± 10.29 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 10 | tg50 | 7.82 ± 0.71 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 20 | pp1000 | 842.14 ± 4.67 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 20 | tg50 | 12.82 ± 0.91 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 30 | pp1000 | 1468.95 ± 15.43 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 30 | tg50 | 27.27 ± 0.21 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 40 | pp1000 | 1963.72 ± 9.43 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 40 | tg50 | 42.61 ± 0.18 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 50 | pp1000 | 1962.71 ± 10.00 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 50 | tg50 | 42.71 ± 0.08 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 60 | pp1000 | 1962.88 ± 10.13 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 60 | tg50 | 42.76 ± 0.02 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 70 | pp1000 | 1958.58 ± 6.35 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 70 | tg50 | 42.71 ± 0.11 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 80 | pp1000 | 1961.97 ± 7.95 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 80 | tg50 | 42.68 ± 0.10 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 90 | pp1000 | 1956.76 ± 12.66 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 90 | tg50 | 42.53 ± 0.26 |
build: e5f070a1d (8913)
---------------------- batch-size 512+256*n 效率最高-------------------------
llama-bench.exe -m D:\OS\gguf\Qwopus3.5-9B-v3.5.i1-Q4_K_M.gguf --n-prompt 1000 --n-gen 50 --batch-size 64-1024+64 --ubatch-size 256
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 8187 MiB):
Device 0: NVIDIA GeForce RTX 4060, compute capability 8.9, VMM: yes, VRAM: 8187 MiB
load_backend: loaded CUDA backend from D:\softWin\ProgramFiles_AI\llama\cuda\ggml-cuda.dll
load_backend: loaded RPC backend from D:\softWin\ProgramFiles_AI\llama\cuda\ggml-rpc.dll
load_backend: loaded CPU backend from D:\softWin\ProgramFiles_AI\llama\cuda\ggml-cpu-zen4.dll
| model | size | params | backend | ngl | n_batch | n_ubatch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | --------------: | -------------------: |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 64 | 256 | pp1000 | 1359.29 ± 9.24 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 64 | 256 | tg50 | 42.53 ± 0.08 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 128 | 256 | pp1000 | 1741.71 ± 8.24 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 128 | 256 | tg50 | 42.34 ± 0.12 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 192 | 256 | pp1000 | 1727.74 ± 8.95 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 192 | 256 | tg50 | 42.48 ± 0.21 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 256 | 256 | pp1000 | 1901.91 ± 8.78 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 256 | 256 | tg50 | 42.48 ± 0.09 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 320 | 256 | pp1000 | 1751.55 ± 9.39 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 320 | 256 | tg50 | 42.47 ± 0.13 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 384 | 256 | pp1000 | 1874.68 ± 11.17 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 384 | 256 | tg50 | 42.42 ± 0.11 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 448 | 256 | pp1000 | 1862.55 ± 8.53 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 448 | 256 | tg50 | 42.51 ± 0.11 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 512 | 256 | pp1000 | 1933.47 ± 7.21 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 512 | 256 | tg50 | 42.60 ± 0.14 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 576 | 256 | pp1000 | 1866.80 ± 10.74 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 576 | 256 | tg50 | 42.62 ± 0.16 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 640 | 256 | pp1000 | 1909.93 ± 8.39 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 640 | 256 | tg50 | 42.46 ± 0.11 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 704 | 256 | pp1000 | 1878.27 ± 4.38 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 704 | 256 | tg50 | 42.33 ± 0.18 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 768 | 256 | pp1000 | 1926.32 ± 6.26 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 768 | 256 | tg50 | 42.53 ± 0.07 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 832 | 256 | pp1000 | 1858.07 ± 12.06 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 832 | 256 | tg50 | 42.54 ± 0.13 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 896 | 256 | pp1000 | 1906.32 ± 7.79 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 896 | 256 | tg50 | 42.58 ± 0.09 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 960 | 256 | pp1000 | 1874.13 ± 6.15 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 960 | 256 | tg50 | 42.44 ± 0.15 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1024 | 256 | pp1000 | 1939.59 ± 9.39 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1024 | 256 | tg50 | 42.36 ± 0.24 |
build: e5f070a1d (8913)
---------------------- ubatch-size 256+128*n 效率最高-------------------------
llama-bench.exe -m D:\OS\gguf\Qwopus3.5-9B-v3.5.i1-Q4_K_M.gguf --n-prompt 1000 --n-gen 50 --batch-size 2048 --ubatch-size 16-1024+16
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 8187 MiB):
Device 0: NVIDIA GeForce RTX 4060, compute capability 8.9, VMM: yes, VRAM: 8187 MiB
load_backend: loaded CUDA backend from D:\softWin\ProgramFiles_AI\llama\cuda\ggml-cuda.dll
load_backend: loaded RPC backend from D:\softWin\ProgramFiles_AI\llama\cuda\ggml-rpc.dll
load_backend: loaded CPU backend from D:\softWin\ProgramFiles_AI\llama\cuda\ggml-cpu-zen4.dll
| model | size | params | backend | ngl | n_ubatch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | --------------: | -------------------: |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 16 | pp1000 | 586.35 ± 0.86 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 16 | tg50 | 42.55 ± 0.14 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 32 | pp1000 | 1020.18 ± 4.43 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 32 | tg50 | 42.60 ± 0.14 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 48 | pp1000 | 1318.09 ± 3.26 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 48 | tg50 | 42.43 ± 0.09 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 64 | pp1000 | 1468.59 ± 12.25 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 64 | tg50 | 42.45 ± 0.25 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 80 | pp1000 | 1580.26 ± 12.43 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 80 | tg50 | 42.45 ± 0.13 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 96 | pp1000 | 1678.51 ± 3.14 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 96 | tg50 | 42.52 ± 0.21 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 112 | pp1000 | 1767.84 ± 7.74 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 112 | tg50 | 42.55 ± 0.08 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 128 | pp1000 | 1831.16 ± 6.10 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 128 | tg50 | 42.63 ± 0.06 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 144 | pp1000 | 1606.74 ± 5.11 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 144 | tg50 | 42.52 ± 0.04 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 160 | pp1000 | 1707.84 ± 4.78 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 160 | tg50 | 42.52 ± 0.08 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 176 | pp1000 | 1732.59 ± 7.94 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 176 | tg50 | 42.59 ± 0.09 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 192 | pp1000 | 1792.50 ± 14.08 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 192 | tg50 | 42.66 ± 0.05 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 208 | pp1000 | 1798.57 ± 4.68 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 208 | tg50 | 42.54 ± 0.10 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 224 | pp1000 | 1891.70 ± 4.14 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 224 | tg50 | 42.64 ± 0.09 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 240 | pp1000 | 1842.72 ± 8.19 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 240 | tg50 | 42.68 ± 0.13 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 256 | pp1000 | 1944.21 ± 7.36 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 256 | tg50 | 42.70 ± 0.13 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 272 | pp1000 | 1818.48 ± 5.18 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 272 | tg50 | 42.67 ± 0.13 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 288 | pp1000 | 1838.53 ± 7.01 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 288 | tg50 | 42.62 ± 0.11 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 304 | pp1000 | 1807.15 ± 4.04 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 304 | tg50 | 42.64 ± 0.10 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 320 | pp1000 | 1845.77 ± 11.14 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 320 | tg50 | 42.56 ± 0.15 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 336 | pp1000 | 1945.42 ± 3.82 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 336 | tg50 | 42.64 ± 0.06 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 352 | pp1000 | 1870.62 ± 7.87 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 352 | tg50 | 42.64 ± 0.06 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 368 | pp1000 | 1904.86 ± 10.72 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 368 | tg50 | 42.65 ± 0.07 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 384 | pp1000 | 1959.55 ± 14.42 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 384 | tg50 | 42.53 ± 0.06 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 400 | pp1000 | 1794.80 ± 11.30 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 400 | tg50 | 42.40 ± 0.26 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 416 | pp1000 | 1817.33 ± 9.22 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 416 | tg50 | 42.52 ± 0.10 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 432 | pp1000 | 1839.53 ± 6.42 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 432 | tg50 | 42.65 ± 0.09 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 448 | pp1000 | 1911.06 ± 6.13 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 448 | tg50 | 42.67 ± 0.07 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 464 | pp1000 | 1829.77 ± 13.92 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 464 | tg50 | 42.63 ± 0.09 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 480 | pp1000 | 1856.59 ± 16.05 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 480 | tg50 | 42.69 ± 0.04 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 496 | pp1000 | 1821.14 ± 4.66 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 496 | tg50 | 42.71 ± 0.05 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 512 | pp1000 | 1961.17 ± 10.40 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 512 | tg50 | 42.68 ± 0.07 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 528 | pp1000 | 1860.16 ± 16.75 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 528 | tg50 | 42.63 ± 0.11 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 544 | pp1000 | 1867.96 ± 8.98 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 544 | tg50 | 42.63 ± 0.07 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 560 | pp1000 | 1916.88 ± 7.48 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 560 | tg50 | 42.71 ± 0.02 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 576 | pp1000 | 1853.73 ± 9.95 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 576 | tg50 | 42.69 ± 0.10 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 592 | pp1000 | 1851.00 ± 6.58 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 592 | tg50 | 42.68 ± 0.09 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 608 | pp1000 | 1854.81 ± 5.19 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 608 | tg50 | 42.66 ± 0.08 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 624 | pp1000 | 1951.19 ± 6.59 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 624 | tg50 | 42.55 ± 0.14 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 640 | pp1000 | 1946.45 ± 16.30 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 640 | tg50 | 42.59 ± 0.11 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 656 | pp1000 | 1885.80 ± 10.73 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 656 | tg50 | 42.63 ± 0.08 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 672 | pp1000 | 1924.92 ± 5.24 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 672 | tg50 | 42.63 ± 0.03 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 688 | pp1000 | 1852.18 ± 9.94 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 688 | tg50 | 42.65 ± 0.12 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 704 | pp1000 | 1846.43 ± 14.88 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 704 | tg50 | 42.61 ± 0.07 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 720 | pp1000 | 1882.94 ± 16.80 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 720 | tg50 | 42.62 ± 0.09 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 736 | pp1000 | 1879.84 ± 11.63 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 736 | tg50 | 42.24 ± 0.09 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 752 | pp1000 | 1938.76 ± 7.20 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 752 | tg50 | 42.36 ± 0.07 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 768 | pp1000 | 1937.19 ± 4.75 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 768 | tg50 | 42.18 ± 0.34 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 784 | pp1000 | 1887.95 ± 8.68 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 784 | tg50 | 42.35 ± 0.15 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 800 | pp1000 | 1809.33 ± 12.25 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 800 | tg50 | 42.25 ± 0.09 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 816 | pp1000 | 1824.44 ± 4.74 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 816 | tg50 | 42.23 ± 0.19 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 832 | pp1000 | 1822.04 ± 6.61 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 832 | tg50 | 42.25 ± 0.20 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 848 | pp1000 | 1824.51 ± 27.78 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 848 | tg50 | 42.16 ± 0.15 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 864 | pp1000 | 1841.82 ± 15.78 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 864 | tg50 | 42.12 ± 0.33 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 880 | pp1000 | 1897.23 ± 11.73 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 880 | tg50 | 42.20 ± 0.08 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 896 | pp1000 | 1904.40 ± 7.78 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 896 | tg50 | 42.15 ± 0.14 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 912 | pp1000 | 1778.73 ± 7.07 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 912 | tg50 | 42.00 ± 0.15 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 928 | pp1000 | 1783.65 ± 9.43 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 928 | tg50 | 41.92 ± 0.11 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 944 | pp1000 | 1796.26 ± 9.43 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 944 | tg50 | 41.97 ± 0.13 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 960 | pp1000 | 1808.56 ± 9.45 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 960 | tg50 | 42.00 ± 0.06 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 976 | pp1000 | 1815.52 ± 5.17 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 976 | tg50 | 41.92 ± 0.12 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 992 | pp1000 | 1769.97 ± 9.21 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 992 | tg50 | 41.97 ± 0.18 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1008 | pp1000 | 1908.29 ± 14.36 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1008 | tg50 | 42.05 ± 0.10 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1024 | pp1000 | 1904.56 ± 16.32 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1024 | tg50 | 41.90 ± 0.43 |
build: e5f070a1d (8913)
---------------------- batch & ubatch b要大于u-------------------------
llama-bench.exe -m D:\OS\gguf\Qwopus3.5-9B-v3.5.i1-Q4_K_M.gguf --n-prompt 5000 --n-gen 50 --batch-size 256-2048+256 --ubatch-size 256-1024+256
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 8187 MiB):
Device 0: NVIDIA GeForce RTX 4060, compute capability 8.9, VMM: yes, VRAM: 8187 MiB
load_backend: loaded CUDA backend from D:\softWin\ProgramFiles_AI\llama\cuda\ggml-cuda.dll
load_backend: loaded RPC backend from D:\softWin\ProgramFiles_AI\llama\cuda\ggml-rpc.dll
load_backend: loaded CPU backend from D:\softWin\ProgramFiles_AI\llama\cuda\ggml-cpu-zen4.dll
| model | size | params | backend | ngl | n_batch | n_ubatch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | --------------: | -------------------: |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 256 | 256 | pp5000 | 1845.00 ± 1.18 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 256 | 256 | tg50 | 42.45 ± 0.30 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 256 | 512 | pp5000 | 1840.02 ± 3.24 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 256 | 512 | tg50 | 42.67 ± 0.03 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 256 | 768 | pp5000 | 1836.64 ± 1.77 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 256 | 768 | tg50 | 42.63 ± 0.10 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 256 | 1024 | pp5000 | 1835.10 ± 1.43 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 256 | 1024 | tg50 | 42.39 ± 0.14 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 512 | 256 | pp5000 | 1863.42 ± 1.75 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 512 | 256 | tg50 | 42.58 ± 0.03 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 512 | 512 | pp5000 | 1880.21 ± 1.93 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 512 | 512 | tg50 | 42.60 ± 0.11 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 512 | 768 | pp5000 | 1880.69 ± 1.56 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 512 | 768 | tg50 | 42.66 ± 0.10 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 512 | 1024 | pp5000 | 1881.73 ± 2.29 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 512 | 1024 | tg50 | 42.58 ± 0.19 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 768 | 256 | pp5000 | 1869.49 ± 3.35 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 768 | 256 | tg50 | 42.44 ± 0.13 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 768 | 512 | pp5000 | 1879.94 ± 3.42 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 768 | 512 | tg50 | 42.26 ± 0.27 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 768 | 768 | pp5000 | 1886.64 ± 3.25 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 768 | 768 | tg50 | 42.34 ± 0.19 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 768 | 1024 | pp5000 | 1888.16 ± 2.46 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 768 | 1024 | tg50 | 42.37 ± 0.19 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1024 | 256 | pp5000 | 1869.44 ± 4.25 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1024 | 256 | tg50 | 41.93 ± 0.08 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1024 | 512 | pp5000 | 1893.26 ± 2.80 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1024 | 512 | tg50 | 42.00 ± 0.10 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1024 | 768 | pp5000 | 1890.45 ± 3.19 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1024 | 768 | tg50 | 41.99 ± 0.17 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1024 | 1024 | pp5000 | 1853.56 ± 2.69 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1024 | 1024 | tg50 | 42.10 ± 0.08 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1280 | 256 | pp5000 | 1870.20 ± 5.99 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1280 | 256 | tg50 | 41.98 ± 0.11 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1280 | 512 | pp5000 | 1890.24 ± 4.07 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1280 | 512 | tg50 | 42.04 ± 0.09 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1280 | 768 | pp5000 | 1894.81 ± 3.65 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1280 | 768 | tg50 | 42.04 ± 0.14 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1280 | 1024 | pp5000 | 1865.08 ± 3.58 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1280 | 1024 | tg50 | 41.95 ± 0.05 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1536 | 256 | pp5000 | 1876.39 ± 3.89 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1536 | 256 | tg50 | 42.41 ± 0.06 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1536 | 512 | pp5000 | 1897.34 ± 1.15 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1536 | 512 | tg50 | 41.86 ± 0.07 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1536 | 768 | pp5000 | 1893.02 ± 5.53 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1536 | 768 | tg50 | 42.46 ± 0.23 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1536 | 1024 | pp5000 | 1874.94 ± 3.15 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1536 | 1024 | tg50 | 42.37 ± 0.15 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1792 | 256 | pp5000 | 1881.56 ± 2.31 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1792 | 256 | tg50 | 42.48 ± 0.16 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1792 | 512 | pp5000 | 1895.81 ± 2.20 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1792 | 512 | tg50 | 42.41 ± 0.09 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1792 | 768 | pp5000 | 1899.87 ± 2.78 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1792 | 768 | tg50 | 42.43 ± 0.10 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1792 | 1024 | pp5000 | 1882.54 ± 1.34 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1792 | 1024 | tg50 | 42.28 ± 0.13 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 2048 | 256 | pp5000 | 1877.33 ± 3.10 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 2048 | 256 | tg50 | 42.26 ± 0.20 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 2048 | 512 | pp5000 | 1899.21 ± 1.62 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 2048 | 512 | tg50 | 42.19 ± 0.57 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 2048 | 768 | pp5000 | 1902.25 ± 2.92 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 2048 | 768 | tg50 | 42.39 ± 0.15 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 2048 | 1024 | pp5000 | 1860.29 ± 3.93 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 2048 | 1024 | tg50 | 42.29 ± 0.21 |
build: e5f070a1d (8913)
---------------------- threads -------------------------
llama-bench.exe -m D:\OS\gguf\Qwopus3.5-9B-v3.5.i1-Q4_K_M.gguf --n-prompt 1000 --n-gen 50 --threads 4-12+2
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 8187 MiB):
Device 0: NVIDIA GeForce RTX 4060, compute capability 8.9, VMM: yes, VRAM: 8187 MiB
load_backend: loaded CUDA backend from D:\softWin\ProgramFiles_AI\llama\cuda\ggml-cuda.dll
load_backend: loaded RPC backend from D:\softWin\ProgramFiles_AI\llama\cuda\ggml-rpc.dll
load_backend: loaded CPU backend from D:\softWin\ProgramFiles_AI\llama\cuda\ggml-cpu-zen4.dll
| model | size | params | backend | ngl | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 4 | pp1000 | 1964.22 ± 10.54 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 4 | tg50 | 42.42 ± 0.15 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 6 | pp1000 | 1962.20 ± 5.93 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 6 | tg50 | 42.68 ± 0.11 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 8 | pp1000 | 1955.54 ± 3.69 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 8 | tg50 | 42.29 ± 0.28 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 10 | pp1000 | 1950.89 ± 9.84 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 10 | tg50 | 42.33 ± 0.23 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 12 | pp1000 | 1955.29 ± 15.56 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 12 | tg50 | 42.61 ± 0.14 |
build: e5f070a1d (8913)
---------------------- kv cache -------------------------
---------------------- mem manager 只有no-kv-offload有用 -------------------------
llama-bench.exe -m D:\OS\gguf\Qwopus3.5-9B-v3.5.i1-Q4_K_M.gguf --n-prompt 1000 --n-gen 50 --flash-attn 0,1 --mmap 0,1 --no-kv-offload 0,1
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 8187 MiB):
Device 0: NVIDIA GeForce RTX 4060, compute capability 8.9, VMM: yes, VRAM: 8187 MiB
load_backend: loaded CUDA backend from D:\softWin\ProgramFiles_AI\llama\cuda\ggml-cuda.dll
load_backend: loaded RPC backend from D:\softWin\ProgramFiles_AI\llama\cuda\ggml-rpc.dll
load_backend: loaded CPU backend from D:\softWin\ProgramFiles_AI\llama\cuda\ggml-cpu-zen4.dll
| model | size | params | backend | ngl | nkvo | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -: | ---: | --------------: | -------------------: |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 0 | 0 | 0 | pp1000 | 1969.64 ± 2.70 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 0 | 0 | 0 | tg50 | 42.70 ± 0.13 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 0 | 1 | 0 | pp1000 | 2008.14 ± 13.90 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 0 | 1 | 0 | tg50 | 42.77 ± 0.23 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1 | 0 | 0 | pp1000 | 937.21 ± 14.13 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1 | 0 | 0 | tg50 | 17.57 ± 0.17 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1 | 1 | 0 | pp1000 | 1448.66 ± 4.71 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1 | 1 | 0 | tg50 | 17.57 ± 0.21 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 0 | 0 | 1 | pp1000 | 1966.31 ± 13.86 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 0 | 0 | 1 | tg50 | 42.51 ± 0.10 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 0 | 1 | 1 | pp1000 | 2012.92 ± 8.02 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 0 | 1 | 1 | tg50 | 42.78 ± 0.07 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1 | 0 | 1 | pp1000 | 932.81 ± 5.61 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1 | 0 | 1 | tg50 | 17.64 ± 0.06 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1 | 1 | 1 | pp1000 | 1451.72 ± 4.30 |
| qwen35 9B Q4_K - Medium | 5.23 GiB | 8.95 B | CUDA | 99 | 1 | 1 | 1 | tg50 | 17.65 ± 0.07 |
build: e5f070a1d (8913)