Cline speed
- With --no-kv-offload: requests at ~50k tokens of context, only 5 t/s
- Without it: ~50k tokens of context, 60+ t/s
[1]- Done nohup llama-server --models-preset ~/gguf/0410.ini --no-kv-offload --host 0.0.0.0 --port 8080 --api-key-file ~/gguf/apikey.txt > ~/gguf/llama_server.log 2>&1
[51477] srv load_model: loading model '/home/x99/gguf/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled.Q4_K_M.gguf'
[51477] llama_model_load_from_file_impl: using device CUDA0 (Tesla V100-SXM2-16GB) (0000:05:00.0) - 15513 MiB free
[51477] llama_model_load_from_file_impl: using device CUDA1 (Tesla V100-SXM2-16GB) (0000:06:00.0) - 15828 MiB free
[51477] srv load_model: initializing slots, n_slots = 2
[51477] srv load_model: speculative decoding will use checkpoints
[51477] srv load_model: prompt cache is enabled, size limit: 8192 MiB
[51477] srv load_model: use `--cache-ram 0` to disable the prompt cache
[51477] srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
[51477] slot update_slots: id 1 | task 0 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 483
[51477] prompt eval time = 993.56 ms / 483 tokens ( 2.06 ms per token, 486.13 tokens per second)
[51477] eval time = 11499.93 ms / 836 tokens ( 13.76 ms per token, 72.70 tokens per second)
[51477] slot update_slots: id 0 | task 838 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 1551
[51477] prompt eval time = 2246.99 ms / 1551 tokens ( 1.45 ms per token, 690.26 tokens per second)
[51477] eval time = 1245.89 ms / 90 tokens ( 13.84 ms per token, 72.24 tokens per second)
[51477] slot update_slots: id 1 | task 931 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 1624
[51477] prompt eval time = 2363.23 ms / 1624 tokens ( 1.46 ms per token, 687.19 tokens per second)
[51477] eval time = 209.93 ms / 16 tokens ( 13.12 ms per token, 76.22 tokens per second)
[51477] slot update_slots: id 0 | task 950 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 1497
[51477] prompt eval time = 2265.03 ms / 1497 tokens ( 1.51 ms per token, 660.92 tokens per second)
[51477] eval time = 238.70 ms / 18 tokens ( 13.26 ms per token, 75.41 tokens per second)
[51477] slot update_slots: id 1 | task 971 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 12204
[51477] prompt eval time = 17580.23 ms / 12204 tokens ( 1.44 ms per token, 694.19 tokens per second)
[51477] eval time = 2168.97 ms / 160 tokens ( 13.56 ms per token, 73.77 tokens per second)
[51477] slot update_slots: id 1 | task 1136 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 23959
[51477] prompt eval time = 17740.57 ms / 11759 tokens ( 1.51 ms per token, 662.83 tokens per second)
[51477] eval time = 14551.12 ms / 1018 tokens ( 14.29 ms per token, 69.96 tokens per second)
[51477] slot update_slots: id 1 | task 2159 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 33082
[51477] prompt eval time = 14794.25 ms / 9127 tokens ( 1.62 ms per token, 616.93 tokens per second)
[51477] eval time = 9150.81 ms / 612 tokens ( 14.95 ms per token, 66.88 tokens per second)
[51477] slot update_slots: id 1 | task 2775 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 41544
[51477] prompt eval time = 14782.43 ms / 8466 tokens ( 1.75 ms per token, 572.71 tokens per second)
[51477] eval time = 14778.78 ms / 945 tokens ( 15.64 ms per token, 63.94 tokens per second)
[51477] slot update_slots: id 1 | task 3724 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 50820
[51477] prompt eval time = 16430.88 ms / 9280 tokens ( 1.77 ms per token, 564.79 tokens per second)
[51477] eval time = 3849.42 ms / 240 tokens ( 16.04 ms per token, 62.35 tokens per second)
[51477] slot update_slots: id 1 | task 3969 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 51652
[51477] prompt eval time = 1767.43 ms / 836 tokens ( 2.11 ms per token, 473.00 tokens per second)
[51477] eval time = 6477.86 ms / 400 tokens ( 16.19 ms per token, 61.75 tokens per second)
[51477] slot update_slots: id 1 | task 4371 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 36387
[51477] prompt eval time = 38483.23 ms / 24187 tokens ( 1.59 ms per token, 628.51 tokens per second)
[51477] eval time = 80746.42 ms / 5259 tokens ( 15.35 ms per token, 65.13 tokens per second)
[51477] slot update_slots: id 1 | task 9638 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 46686
[51477] prompt eval time = 18436.35 ms / 10303 tokens ( 1.79 ms per token, 558.84 tokens per second)
[51477] eval time = 7739.64 ms / 488 tokens ( 15.86 ms per token, 63.05 tokens per second)
[51477] slot update_slots: id 1 | task 10131 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 47412
[51477] prompt eval time = 1599.57 ms / 730 tokens ( 2.19 ms per token, 456.37 tokens per second)
[51477] eval time = 10876.61 ms / 685 tokens ( 15.88 ms per token, 62.98 tokens per second)
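To check the "60+ t/s at ~50k" figure against the log rather than by eye, the lines above can be reduced to (prompt size, generation speed) pairs. This is a rough sketch, not part of the original run: the function name `speed_by_context` and the two regexes are my own, and it assumes each "new prompt" line is followed by that same request's "eval time" line, which holds for the single-slot stretch above but not when two slots interleave.

```python
import re

# Two request records copied verbatim from the log above.
LOG = """\
[51477] slot update_slots: id 1 | task 0 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 483
[51477] prompt eval time = 993.56 ms / 483 tokens ( 2.06 ms per token, 486.13 tokens per second)
[51477] eval time = 11499.93 ms / 836 tokens ( 13.76 ms per token, 72.70 tokens per second)
[51477] slot update_slots: id 1 | task 3724 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 50820
[51477] prompt eval time = 16430.88 ms / 9280 tokens ( 1.77 ms per token, 564.79 tokens per second)
[51477] eval time = 3849.42 ms / 240 tokens ( 16.04 ms per token, 62.35 tokens per second)
"""

# Hypothetical patterns written for this log format only.
PROMPT_RE = re.compile(r"new prompt, .*task\.n_tokens = (\d+)")
# The leading "] " keeps this from matching "prompt eval time" lines.
EVAL_RE = re.compile(r"\] eval time = +([\d.]+) ms / +(\d+) tokens")


def speed_by_context(log_text):
    """Return (prompt_tokens, generation_tokens_per_second) pairs."""
    rows = []
    n_prompt = None
    for line in log_text.splitlines():
        m = PROMPT_RE.search(line)
        if m:
            n_prompt = int(m.group(1))
            continue
        m = EVAL_RE.search(line)
        if m and n_prompt is not None:
            ms, n_gen = float(m.group(1)), int(m.group(2))
            rows.append((n_prompt, round(n_gen / (ms / 1000.0), 2)))
            n_prompt = None  # wait for the next "new prompt" line
    return rows


if __name__ == "__main__":
    for n_prompt, tps in speed_by_context(LOG):
        print(f"{n_prompt:>7} prompt tokens -> {tps:.2f} t/s generation")
```

On these two records, generation goes from roughly 72.7 t/s at 483 prompt tokens to roughly 62.3 t/s at 50820, which matches the "60+ t/s" note for the run without --no-kv-offload.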
OC speed
- With --no-kv-offload: requests at ~50k tokens of context, only 2 t/s
- Without it: 60 t/s
[51477] slot update_slots: id 0 | task 15576 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 44754
[51477] slot update_slots: id 1 | task 15579 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 44927
[51477] slot update_slots: id 0 | task 15590 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 44927
[51477] prompt eval time = 18051.27 ms / 1225 tokens ( 14.74 ms per token, 67.86 tokens per second)
[51477] eval time = 57689.30 ms / 1461 tokens ( 39.49 ms per token, 25.33 tokens per second)
[51477] slot update_slots: id 1 | task 17058 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 156690
[51477] slot update_slots: id 0 | task 17061 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 45082
[51477] prompt eval time = 2312.85 ms / 1207 tokens ( 1.92 ms per token, 521.87 tokens per second)
[51477] eval time = 21914.03 ms / 1373 tokens ( 15.96 ms per token, 62.65 tokens per second)
[51477] slot update_slots: id 1 | task 18437 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 33328
[51477] prompt eval time = 36565.39 ms / 25136 tokens ( 1.45 ms per token, 687.43 tokens per second)
[51477] eval time = 1060.71 ms / 71 tokens ( 14.94 ms per token, 66.94 tokens per second)
[51477] slot update_slots: id 1 | task 18516 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 34363
[51477] prompt eval time = 1714.95 ms / 965 tokens ( 1.78 ms per token, 562.70 tokens per second)
[51477] eval time = 933.99 ms / 63 tokens ( 14.83 ms per token, 67.45 tokens per second)
[51477] slot update_slots: id 1 | task 18581 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 37273
[51477] prompt eval time = 4773.91 ms / 2848 tokens ( 1.68 ms per token, 596.58 tokens per second)
[51477] eval time = 948.46 ms / 63 tokens ( 15.05 ms per token, 66.42 tokens per second)
[51477] slot update_slots: id 1 | task 18647 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 39877
[51477] prompt eval time = 4464.22 ms / 2542 tokens ( 1.76 ms per token, 569.42 tokens per second)
[51477] eval time = 963.62 ms / 63 tokens ( 15.30 ms per token, 65.38 tokens per second)
[51477] slot update_slots: id 1 | task 18713 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 42448
[51477] prompt eval time = 4524.50 ms / 2509 tokens ( 1.80 ms per token, 554.54 tokens per second)
[51477] eval time = 973.05 ms / 63 tokens ( 15.45 ms per token, 64.74 tokens per second)
[51477] slot update_slots: id 1 | task 18779 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 46816
[51477] prompt eval time = 7670.65 ms / 4306 tokens ( 1.78 ms per token, 561.36 tokens per second)
[51477] eval time = 935.93 ms / 59 tokens ( 15.86 ms per token, 63.04 tokens per second)
[51477] slot update_slots: id 1 | task 18841 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 47066
[51477] prompt eval time = 671.89 ms / 192 tokens ( 3.50 ms per token, 285.76 tokens per second)
[51477] eval time = 1324.34 ms / 83 tokens ( 15.96 ms per token, 62.67 tokens per second)
[51477] slot update_slots: id 1 | task 18926 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 47169
[51477] prompt eval time = 167.03 ms / 21 tokens ( 7.95 ms per token, 125.72 tokens per second)
[51477] eval time = 1657.65 ms / 105 tokens ( 15.79 ms per token, 63.34 tokens per second)
[51477] slot update_slots: id 1 | task 19033 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 47980
[51477] prompt eval time = 1535.51 ms / 707 tokens ( 2.17 ms per token, 460.43 tokens per second)
[51477] eval time = 3405.93 ms / 212 tokens ( 16.07 ms per token, 62.24 tokens per second)
[51477] slot update_slots: id 1 | task 19247 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 48299
[51477] prompt eval time = 552.24 ms / 108 tokens ( 5.11 ms per token, 195.57 tokens per second)
[51477] eval time = 1264.43 ms / 79 tokens ( 16.01 ms per token, 62.48 tokens per second)
[51477] slot update_slots: id 1 | task 19328 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 49447
[51477] prompt eval time = 1993.75 ms / 1070 tokens ( 1.86 ms per token, 536.68 tokens per second)
[51477] eval time = 1703.71 ms / 106 tokens ( 16.07 ms per token, 62.22 tokens per second)
[51477] slot update_slots: id 1 | task 19437 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 49813
[51477] prompt eval time = 825.79 ms / 261 tokens ( 3.16 ms per token, 316.06 tokens per second)
[51477] eval time = 1122.26 ms / 70 tokens ( 16.03 ms per token, 62.37 tokens per second)
[51477] slot update_slots: id 1 | task 19509 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 49974
[51477] prompt eval time = 539.29 ms / 92 tokens ( 5.86 ms per token, 170.59 tokens per second)
[51477] eval time = 9760.58 ms / 602 tokens ( 16.21 ms per token, 61.68 tokens per second)
[51477] slot update_slots: id 1 | task 20113 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 50677
[51477] prompt eval time = 534.61 ms / 102 tokens ( 5.24 ms per token, 190.79 tokens per second)
[51477] eval time = 11957.72 ms / 728 tokens ( 16.43 ms per token, 60.88 tokens per second)
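One pattern in the log is that "prompt eval" usually covers far fewer tokens than task.n_tokens. My reading, which is an assumption and not something the log states, is that the difference is the prefix reused from the prompt cache. A minimal sketch for estimating that reuse follows. The name `cached_prefix` and the regexes are hypothetical helpers. Pairing a "new prompt" line with the next "prompt eval time" line is only reliable where a single slot is active; the interleaved id 0 / id 1 lines near the top of this section would be mispaired.

```python
import re

# Two consecutive requests copied verbatim from the log above.
LOG = """\
[51477] slot update_slots: id 1 | task 18516 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 34363
[51477] prompt eval time = 1714.95 ms / 965 tokens ( 1.78 ms per token, 562.70 tokens per second)
[51477] slot update_slots: id 1 | task 18581 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 37273
[51477] prompt eval time = 4773.91 ms / 2848 tokens ( 1.68 ms per token, 596.58 tokens per second)
"""

# Hypothetical patterns written for this log format only.
PROMPT_RE = re.compile(r"new prompt, .*task\.n_tokens = (\d+)")
PROMPT_EVAL_RE = re.compile(r"prompt eval time = +[\d.]+ ms / +(\d+) tokens")


def cached_prefix(log_text):
    """Return (total_prompt_tokens, processed_tokens, presumed_reused_tokens)."""
    rows = []
    n_total = None
    for line in log_text.splitlines():
        m = PROMPT_RE.search(line)
        if m:
            n_total = int(m.group(1))
            continue
        m = PROMPT_EVAL_RE.search(line)
        if m and n_total is not None:
            n_processed = int(m.group(1))
            # Assumption: tokens not processed were served from the cache.
            rows.append((n_total, n_processed, n_total - n_processed))
            n_total = None
    return rows


if __name__ == "__main__":
    for total, processed, reused in cached_prefix(LOG):
        print(f"{total} prompt tokens: {processed} processed, {reused} presumably reused")
```

For the second request this gives 34425 presumably reused tokens, one short of the previous request's 34363 prompt tokens plus its 63 generated tokens, which is consistent with prefix reuse.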