no-kv-offload affects inference speed

 Cline speed

  • With no-kv-offload: a request at 50k context gets only 5 t/s
  • Without it: at 50k context, 60+ t/s
[1]-  Done                    nohup llama-server --models-preset ~/gguf/0410.ini --no-kv-offload --host 0.0.0.0 --port 8080 --api-key-file ~/gguf/apikey.txt > ~/gguf/llama_server.log 2>&1
[51477] srv    load_model: loading model '/home/x99/gguf/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled.Q4_K_M.gguf'
[51477] llama_model_load_from_file_impl: using device CUDA0 (Tesla V100-SXM2-16GB) (0000:05:00.0) - 15513 MiB free
[51477] llama_model_load_from_file_impl: using device CUDA1 (Tesla V100-SXM2-16GB) (0000:06:00.0) - 15828 MiB free
[51477] srv    load_model: initializing slots, n_slots = 2
[51477] srv    load_model: speculative decoding will use checkpoints
[51477] srv    load_model: prompt cache is enabled, size limit: 8192 MiB
[51477] srv    load_model: use `--cache-ram 0` to disable the prompt cache
[51477] srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
[51477] slot update_slots: id  1 | task 0 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 483
[51477] prompt eval time =     993.56 ms /   483 tokens (    2.06 ms per token,   486.13 tokens per second)
[51477]        eval time =   11499.93 ms /   836 tokens (   13.76 ms per token,    72.70 tokens per second)
[51477] slot update_slots: id  0 | task 838 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 1551
[51477] prompt eval time =    2246.99 ms /  1551 tokens (    1.45 ms per token,   690.26 tokens per second)
[51477]        eval time =    1245.89 ms /    90 tokens (   13.84 ms per token,    72.24 tokens per second)
[51477] slot update_slots: id  1 | task 931 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 1624
[51477] prompt eval time =    2363.23 ms /  1624 tokens (    1.46 ms per token,   687.19 tokens per second)
[51477]        eval time =     209.93 ms /    16 tokens (   13.12 ms per token,    76.22 tokens per second)
[51477] slot update_slots: id  0 | task 950 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 1497
[51477] prompt eval time =    2265.03 ms /  1497 tokens (    1.51 ms per token,   660.92 tokens per second)
[51477]        eval time =     238.70 ms /    18 tokens (   13.26 ms per token,    75.41 tokens per second)
[51477] slot update_slots: id  1 | task 971 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 12204
[51477] prompt eval time =   17580.23 ms / 12204 tokens (    1.44 ms per token,   694.19 tokens per second)
[51477]        eval time =    2168.97 ms /   160 tokens (   13.56 ms per token,    73.77 tokens per second)
[51477] slot update_slots: id  1 | task 1136 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 23959
[51477] prompt eval time =   17740.57 ms / 11759 tokens (    1.51 ms per token,   662.83 tokens per second)
[51477]        eval time =   14551.12 ms /  1018 tokens (   14.29 ms per token,    69.96 tokens per second)
[51477] slot update_slots: id  1 | task 2159 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 33082
[51477] prompt eval time =   14794.25 ms /  9127 tokens (    1.62 ms per token,   616.93 tokens per second)
[51477]        eval time =    9150.81 ms /   612 tokens (   14.95 ms per token,    66.88 tokens per second)
[51477] slot update_slots: id  1 | task 2775 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 41544
[51477] prompt eval time =   14782.43 ms /  8466 tokens (    1.75 ms per token,   572.71 tokens per second)
[51477]        eval time =   14778.78 ms /   945 tokens (   15.64 ms per token,    63.94 tokens per second)
[51477] slot update_slots: id  1 | task 3724 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 50820
[51477] prompt eval time =   16430.88 ms /  9280 tokens (    1.77 ms per token,   564.79 tokens per second)
[51477]        eval time =    3849.42 ms /   240 tokens (   16.04 ms per token,    62.35 tokens per second)
[51477] slot update_slots: id  1 | task 3969 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 51652
[51477] prompt eval time =    1767.43 ms /   836 tokens (    2.11 ms per token,   473.00 tokens per second)
[51477]        eval time =    6477.86 ms /   400 tokens (   16.19 ms per token,    61.75 tokens per second)
[51477] slot update_slots: id  1 | task 4371 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 36387
[51477] prompt eval time =   38483.23 ms / 24187 tokens (    1.59 ms per token,   628.51 tokens per second)
[51477]        eval time =   80746.42 ms /  5259 tokens (   15.35 ms per token,    65.13 tokens per second)
[51477] slot update_slots: id  1 | task 9638 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 46686
[51477] prompt eval time =   18436.35 ms / 10303 tokens (    1.79 ms per token,   558.84 tokens per second)
[51477]        eval time =    7739.64 ms /   488 tokens (   15.86 ms per token,    63.05 tokens per second)
[51477] slot update_slots: id  1 | task 10131 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 47412
[51477] prompt eval time =    1599.57 ms /   730 tokens (    2.19 ms per token,   456.37 tokens per second)
[51477]        eval time =   10876.61 ms /   685 tokens (   15.88 ms per token,    62.98 tokens per second)
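The timing lines above can be condensed into one row per request to make the trend of generation speed against context size easier to read. The following is a minimal sketch, not part of llama.cpp: the helper name `summarize` and the regular expressions are assumptions based only on the line format seen in this log. It pairs each timing line with the most recent `task.n_tokens` value, which is only approximate when two slots are active at once.

```python
import re

# Patterns inferred from the log lines in these notes (an assumption, not an official format).
SLOT_RE = re.compile(r"task\.n_tokens = (\d+)")
TIMING_RE = re.compile(r"(prompt eval|eval) time =\s*([\d.]+) ms /\s*(\d+) tokens")


def summarize(log_lines):
    """Return (n_tokens, prompt_tps, gen_tps) tuples, one per completed request."""
    rows = []
    n_tokens = None    # context size of the most recent "new prompt" line
    prompt_tps = None  # prompt-processing speed of the most recent request
    for line in log_lines:
        slot = SLOT_RE.search(line)
        if slot:
            n_tokens = int(slot.group(1))
            continue
        timing = TIMING_RE.search(line)
        if not timing:
            continue
        kind = timing.group(1)
        ms = float(timing.group(2))
        tokens = int(timing.group(3))
        tps = tokens / (ms / 1000.0)  # recompute tokens per second from the raw numbers
        if kind == "prompt eval":
            prompt_tps = tps
        else:
            rows.append((n_tokens, prompt_tps, tps))
    return rows


# Three lines copied from the log above (task 3724).
sample = [
    "[51477] slot update_slots: id  1 | task 3724 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 50820",
    "[51477] prompt eval time =   16430.88 ms /  9280 tokens (    1.77 ms per token,   564.79 tokens per second)",
    "[51477]        eval time =    3849.42 ms /   240 tokens (   16.04 ms per token,    62.35 tokens per second)",
]
rows = summarize(sample)
for n_tokens, prompt_tps, gen_tps in rows:
    print(f"{n_tokens:>7} tokens  prompt {prompt_tps:7.2f} t/s  gen {gen_tps:6.2f} t/s")
```

To run it over the whole log, pass the lines of `~/gguf/llama_server.log` (the path from the launch command) to `summarize` instead of `sample`.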

OC speed

  • With no-kv-offload: a request at 50k context gets only 2 t/s
  • Without it: 60 t/s
[51477] slot update_slots: id  0 | task 15576 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 44754
[51477] slot update_slots: id  1 | task 15579 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 44927
[51477] slot update_slots: id  0 | task 15590 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 44927
[51477] prompt eval time =   18051.27 ms /  1225 tokens (   14.74 ms per token,    67.86 tokens per second)
[51477]        eval time =   57689.30 ms /  1461 tokens (   39.49 ms per token,    25.33 tokens per second)
[51477] slot update_slots: id  1 | task 17058 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 156690
[51477] slot update_slots: id  0 | task 17061 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 45082
[51477] prompt eval time =    2312.85 ms /  1207 tokens (    1.92 ms per token,   521.87 tokens per second)
[51477]        eval time =   21914.03 ms /  1373 tokens (   15.96 ms per token,    62.65 tokens per second)
[51477] slot update_slots: id  1 | task 18437 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 33328
[51477] prompt eval time =   36565.39 ms / 25136 tokens (    1.45 ms per token,   687.43 tokens per second)
[51477]        eval time =    1060.71 ms /    71 tokens (   14.94 ms per token,    66.94 tokens per second)
[51477] slot update_slots: id  1 | task 18516 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 34363
[51477] prompt eval time =    1714.95 ms /   965 tokens (    1.78 ms per token,   562.70 tokens per second)
[51477]        eval time =     933.99 ms /    63 tokens (   14.83 ms per token,    67.45 tokens per second)
[51477] slot update_slots: id  1 | task 18581 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 37273
[51477] prompt eval time =    4773.91 ms /  2848 tokens (    1.68 ms per token,   596.58 tokens per second)
[51477]        eval time =     948.46 ms /    63 tokens (   15.05 ms per token,    66.42 tokens per second)
[51477] slot update_slots: id  1 | task 18647 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 39877
[51477] prompt eval time =    4464.22 ms /  2542 tokens (    1.76 ms per token,   569.42 tokens per second)
[51477]        eval time =     963.62 ms /    63 tokens (   15.30 ms per token,    65.38 tokens per second)
[51477] slot update_slots: id  1 | task 18713 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 42448
[51477] prompt eval time =    4524.50 ms /  2509 tokens (    1.80 ms per token,   554.54 tokens per second)
[51477]        eval time =     973.05 ms /    63 tokens (   15.45 ms per token,    64.74 tokens per second)
[51477] slot update_slots: id  1 | task 18779 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 46816
[51477] prompt eval time =    7670.65 ms /  4306 tokens (    1.78 ms per token,   561.36 tokens per second)
[51477]        eval time =     935.93 ms /    59 tokens (   15.86 ms per token,    63.04 tokens per second)
[51477] slot update_slots: id  1 | task 18841 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 47066
[51477] prompt eval time =     671.89 ms /   192 tokens (    3.50 ms per token,   285.76 tokens per second)
[51477]        eval time =    1324.34 ms /    83 tokens (   15.96 ms per token,    62.67 tokens per second)
[51477] slot update_slots: id  1 | task 18926 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 47169
[51477] prompt eval time =     167.03 ms /    21 tokens (    7.95 ms per token,   125.72 tokens per second)
[51477]        eval time =    1657.65 ms /   105 tokens (   15.79 ms per token,    63.34 tokens per second)
[51477] slot update_slots: id  1 | task 19033 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 47980
[51477] prompt eval time =    1535.51 ms /   707 tokens (    2.17 ms per token,   460.43 tokens per second)
[51477]        eval time =    3405.93 ms /   212 tokens (   16.07 ms per token,    62.24 tokens per second)
[51477] slot update_slots: id  1 | task 19247 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 48299
[51477] prompt eval time =     552.24 ms /   108 tokens (    5.11 ms per token,   195.57 tokens per second)
[51477]        eval time =    1264.43 ms /    79 tokens (   16.01 ms per token,    62.48 tokens per second)
[51477] slot update_slots: id  1 | task 19328 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 49447
[51477] prompt eval time =    1993.75 ms /  1070 tokens (    1.86 ms per token,   536.68 tokens per second)
[51477]        eval time =    1703.71 ms /   106 tokens (   16.07 ms per token,    62.22 tokens per second)
[51477] slot update_slots: id  1 | task 19437 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 49813
[51477] prompt eval time =     825.79 ms /   261 tokens (    3.16 ms per token,   316.06 tokens per second)
[51477]        eval time =    1122.26 ms /    70 tokens (   16.03 ms per token,    62.37 tokens per second)
[51477] slot update_slots: id  1 | task 19509 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 49974
[51477] prompt eval time =     539.29 ms /    92 tokens (    5.86 ms per token,   170.59 tokens per second)
[51477]        eval time =    9760.58 ms /   602 tokens (   16.21 ms per token,    61.68 tokens per second)
[51477] slot update_slots: id  1 | task 20113 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 50677
[51477] prompt eval time =     534.61 ms /   102 tokens (    5.24 ms per token,   190.79 tokens per second)
[51477]        eval time =   11957.72 ms /   728 tokens (   16.43 ms per token,    60.88 tokens per second)
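For a rough sense of scale, the rounded figures from the two bullet lists (about 60 t/s without the flag, 5 t/s and 2 t/s with it at roughly 50k context) can be turned into a slowdown factor and an added per-token latency. This is only arithmetic on those rounded numbers; the function names are illustrative.

```python
def slowdown(baseline_tps, observed_tps):
    """How many times slower the observed run is than the baseline."""
    return baseline_tps / observed_tps


def extra_ms_per_token(baseline_tps, observed_tps):
    """Additional milliseconds spent on each generated token."""
    return 1000.0 / observed_tps - 1000.0 / baseline_tps


# Rounded values taken from the notes: 60 t/s without the flag, 5 t/s (Cline) and 2 t/s (OC) with it.
cline_factor = slowdown(60.0, 5.0)
oc_factor = slowdown(60.0, 2.0)
print(f"Cline: {cline_factor:.0f}x slower, +{extra_ms_per_token(60.0, 5.0):.0f} ms per token")
print(f"OC:    {oc_factor:.0f}x slower, +{extra_ms_per_token(60.0, 2.0):.0f} ms per token")
```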