Problem
- opencode runs TDD (test-driven development) against 设计.md; llama.cpp goes down immediately at 65% VRAM usage, but with "exited with status 0", i.e. a normal exit rather than a crash
- open-webui conversations work fine
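The "exited with status 0" detail is worth pinning down: a process killed by the driver or the OOM killer dies on a signal, while status 0 means the server was asked to stop (or stopped itself) cleanly, e.g. by a proxy idle-timeout. A minimal sketch of how the two cases can be told apart, using Python's subprocess (the SIGSEGV child is a stand-in for a crashing server, not llama.cpp itself):

```python
import signal
import subprocess

# A clean "exited with status 0" vs. a real crash: Python reports
# signal deaths as a negative return code.
ok = subprocess.run(["python3", "-c", "raise SystemExit(0)"])
crash = subprocess.run(
    ["python3", "-c", "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
)

print(ok.returncode)     # 0  -> normal exit, matching the log here
print(crash.returncode)  # negative -> terminated by a signal (here SIGSEGV)
```

So if llama.cpp really were crashing under VRAM pressure, the supervisor should see a signal/non-zero status, not 0; status 0 points at something shutting the server down deliberately.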
Log - open-webui request - before optimization
Log
[42507] reasoning-budget: forced sequence complete, done
[42507] slot print_timing: id 1 | task 4161 |
[42507] prompt eval time = 7430.64 ms / 4378 tokens ( 1.70 ms per token, 589.18 tokens per second)
[42507] eval time = 232.68 ms / 18 tokens ( 12.93 ms per token, 77.36 tokens per second)
[42507] total time = 7663.32 ms / 4396 tokens
[42507] slot release: id 1 | task 4161 | stop processing: n_tokens = 4395, truncated = 0
[42507] srv update_slots: all slots are idle
[42507] srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv proxy_reques: proxying request to model Qwen3.5-27B on port 42507
[42507] srv params_from_: Chat format: peg-native
[42507] slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 702397047188
[42507] srv get_availabl: updating prompt cache
[42507] srv prompt_save: - saving prompt with length 4433, total state size = 87.249 MiB
[42507] srv load: - looking for better prompt, base f_keep = 0.002, sim = 0.002
[42507] srv update: - cache state: 2 prompts, 298.140 MiB (limits: 8192.000 MiB, 204800 tokens, 233719 est)
[42507] srv update: - prompt 0x5b05b9eccbf0: 4073 tokens, checkpoints: 0, 85.265 MiB
[42507] srv update: - prompt 0x5b05b77c6c70: 4433 tokens, checkpoints: 2, 212.876 MiB
[42507] srv get_availabl: prompt cache update took 141.89 ms
[42507] slot launch_slot_: id 0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
[42507] slot launch_slot_: id 0 | task 4182 | processing task, is_child = 0
[42507] slot update_slots: id 0 | task 4182 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 4251
[42507] slot update_slots: id 0 | task 4182 | n_past = 7, slot.prompt.tokens.size() = 4433, seq_id = 0, pos_min = 4432, n_swa = 0
[42507] slot update_slots: id 0 | task 4182 | Checking checkpoint with [4300, 4300] against 7...
[42507] slot update_slots: id 0 | task 4182 | Checking checkpoint with [3788, 3788] against 7...
[42507] slot update_slots: id 0 | task 4182 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
[42507] slot update_slots: id 0 | task 4182 | erased invalidated context checkpoint (pos_min = 3788, pos_max = 3788, n_tokens = 3789, n_swa = 0, pos_next = 0, size = 62.813 MiB)
[42507] slot update_slots: id 0 | task 4182 | erased invalidated context checkpoint (pos_min = 4300, pos_max = 4300, n_tokens = 4301, n_swa = 0, pos_next = 0, size = 62.813 MiB)
[42507] slot update_slots: id 0 | task 4182 | n_tokens = 0, memory_seq_rm [0, end)
[42507] slot update_slots: id 0 | task 4182 | prompt processing progress, n_tokens = 3735, batch.n_tokens = 3735, progress = 0.878617
[42507] slot update_slots: id 0 | task 4182 | n_tokens = 3735, memory_seq_rm [3735, end)
[42507] slot update_slots: id 0 | task 4182 | prompt processing progress, n_tokens = 4247, batch.n_tokens = 512, progress = 0.999059
[42507] slot create_check: id 0 | task 4182 | created context checkpoint 1 of 32 (pos_min = 3734, pos_max = 3734, n_tokens = 3735, size = 62.813 MiB)
[42507] slot update_slots: id 0 | task 4182 | n_tokens = 4247, memory_seq_rm [4247, end)
[42507] reasoning-budget: activated, budget=0 tokens
[42507] reasoning-budget: budget=0, forcing immediately
[42507] slot init_sampler: id 0 | task 4182 | init sampler, took 1.20 ms, tokens: text = 4251, total = 4251
[42507] slot update_slots: id 0 | task 4182 | prompt processing done, n_tokens = 4251, batch.n_tokens = 4
[42507] slot create_check: id 0 | task 4182 | created context checkpoint 2 of 32 (pos_min = 4246, pos_max = 4246, n_tokens = 4247, size = 62.813 MiB)
[42507] reasoning-budget: forced sequence complete, done
[42507] slot print_timing: id 0 | task 4182 |
[42507] prompt eval time = 7171.00 ms / 4251 tokens ( 1.69 ms per token, 592.80 tokens per second)
[42507] eval time = 230.19 ms / 18 tokens ( 12.79 ms per token, 78.20 tokens per second)
[42507] total time = 7401.19 ms / 4269 tokens
[42507] slot release: id 0 | task 4182 | stop processing: n_tokens = 4268, truncated = 0
[42507] srv update_slots: all slots are idle
[42507] srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
Gemini analysis - Forced recompute (Full Re-processing)
forcing full prompt re-processing due to lack of cache data (likely due to SWA...)
ctx-size was set to 200k, but Qwen 3.5 uses SWA (sliding-window attention). The new request did not quite match the previously cached checkpoints, so llama.cpp could not reuse the old cache and erased both stale checkpoints (erased invalidated context checkpoint).
Full recompute: it re-processed all 4251 tokens.
Cost: prompt eval time = 7171.00 ms, meaning the model spent about 7 seconds of heavy background computation before it started answering.
Gemini analysis - Reasoning budget conflict (Reasoning Budget)
reasoning-budget: activated, budget=0 tokens
reasoning-budget: budget=0, forcing immediately
With reasoning-budget = 0 set in the .ini and enable_thinking: false, the model is cut off by the server the instant it tries to "think". That saves VRAM and time. But if OpenCode expects deep reasoning to crack a complex bug and the budget is pinned at 0 tokens, code quality may suffer: the model is forced to emit the answer directly, without any internal working-through.
Gemini analysis - Performance Data
Throughput: 592.80 tokens per second (prompt eval). Very fast; the CUDA build and the dual-GPU tensor-split are working well.
Generation speed: 78.20 tokens per second (eval). For a 27B model, 78 tokens per second is excellent, effectively instant-reply territory.
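The per-token figures in the timing block are easy to re-derive; a quick check of the numbers quoted above from the llama.cpp log:

```python
# Re-deriving the throughput figures from the llama.cpp timing log above.
prompt_tokens, prompt_ms = 4251, 7171.00
eval_tokens, eval_ms = 18, 230.19

prompt_tps = prompt_tokens / (prompt_ms / 1000.0)
eval_tps = eval_tokens / (eval_ms / 1000.0)

print(round(prompt_tps, 1))  # prompt eval throughput, tokens/s
print(round(eval_tps, 1))    # generation throughput, tokens/s
```

Both match the log's own 592.80 and 78.20 tokens-per-second summaries.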
Log - open-webui repeated request (50k context + 1024 ubatch + reasoning off)
Log
------------------- open-webui second repeated request ------
srv proxy_reques: proxying request to model Qwen3.5-27B on port 39875
[39875] srv params_from_: Chat format: peg-native
[39875] slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 707218965739
[39875] srv get_availabl: updating prompt cache
[39875] srv prompt_save: - saving prompt with length 2580, total state size = 77.035 MiB
[39875] srv load: - looking for better prompt, base f_keep = 0.001, sim = 0.004
[39875] srv load: - found better prompt with f_keep = 0.348, sim = 1.000
[39875] srv update: - cache state: 1 prompts, 202.661 MiB (limits: 8192.000 MiB, 51200 tokens, 104289 est)
[39875] srv update: - prompt 0x650c85dfbff0: 2580 tokens, checkpoints: 2, 202.661 MiB
[39875] srv get_availabl: prompt cache update took 238.98 ms
[39875] slot launch_slot_: id 0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
[39875] slot launch_slot_: id 0 | task 1566 | processing task, is_child = 0
[39875] slot update_slots: id 0 | task 1566 | new prompt, n_ctx_slot = 51200, n_keep = 0, task.n_tokens = 789
[39875] slot update_slots: id 0 | task 1566 | n_past = 789, slot.prompt.tokens.size() = 2266, seq_id = 0, pos_min = 2265, n_swa = 0
[39875] slot update_slots: id 0 | task 1566 | Checking checkpoint with [784, 784] against 789...
[39875] slot update_slots: id 0 | task 1566 | restored context checkpoint (pos_min = 784, pos_max = 784, n_tokens = 785, n_past = 785, size = 62.813 MiB)
[39875] slot update_slots: id 0 | task 1566 | n_tokens = 785, memory_seq_rm [785, end)
[39875] slot init_sampler: id 0 | task 1566 | init sampler, took 0.29 ms, tokens: text = 789, total = 789
[39875] slot update_slots: id 0 | task 1566 | prompt processing done, n_tokens = 789, batch.n_tokens = 4
[39875] srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
[39875] slot print_timing: id 0 | task 1566 |
[39875] prompt eval time = 45.13 ms / 4 tokens ( 11.28 ms per token, 88.64 tokens per second)
[39875] eval time = 25533.13 ms / 1752 tokens ( 14.57 ms per token, 68.62 tokens per second)
[39875] total time = 25578.26 ms / 1756 tokens
[39875] slot release: id 0 | task 1566 | stop processing: n_tokens = 2540, truncated = 0
[39875] srv update_slots: all slots are idle
srv proxy_reques: proxying request to model Qwen3.5-27B on port 39875
[39875] srv params_from_: Chat format: peg-native
[39875] slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 707268226515
[39875] srv get_availabl: updating prompt cache
[39875] srv prompt_save: - saving prompt with length 2540, total state size = 76.815 MiB
[39875] srv load: - looking for better prompt, base f_keep = 0.001, sim = 0.001
[39875] srv load: - found better prompt with f_keep = 0.396, sim = 0.368
[39875] srv update: - cache state: 1 prompts, 139.628 MiB (limits: 8192.000 MiB, 51200 tokens, 149022 est)
[39875] srv update: - prompt 0x650c7fe5fda0: 2540 tokens, checkpoints: 1, 139.628 MiB
[39875] srv get_availabl: prompt cache update took 133.95 ms
[39875] slot launch_slot_: id 0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
[39875] slot launch_slot_: id 0 | task 3319 | processing task, is_child = 0
[39875] slot update_slots: id 0 | task 3319 | new prompt, n_ctx_slot = 51200, n_keep = 0, task.n_tokens = 2772
[39875] slot update_slots: id 0 | task 3319 | n_past = 1021, slot.prompt.tokens.size() = 2580, seq_id = 0, pos_min = 2579, n_swa = 0
[39875] slot update_slots: id 0 | task 3319 | Checking checkpoint with [2493, 2493] against 1021...
[39875] slot update_slots: id 0 | task 3319 | Checking checkpoint with [1469, 1469] against 1021...
[39875] slot update_slots: id 0 | task 3319 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
[39875] slot update_slots: id 0 | task 3319 | erased invalidated context checkpoint (pos_min = 1469, pos_max = 1469, n_tokens = 1470, n_swa = 0, pos_next = 0, size = 62.813 MiB)
[39875] slot update_slots: id 0 | task 3319 | erased invalidated context checkpoint (pos_min = 2493, pos_max = 2493, n_tokens = 2494, n_swa = 0, pos_next = 0, size = 62.813 MiB)
[39875] slot update_slots: id 0 | task 3319 | n_tokens = 0, memory_seq_rm [0, end)
[39875] slot update_slots: id 0 | task 3319 | prompt processing progress, n_tokens = 1744, batch.n_tokens = 1744, progress = 0.629149
[39875] slot update_slots: id 0 | task 3319 | n_tokens = 1744, memory_seq_rm [1744, end)
[39875] slot update_slots: id 0 | task 3319 | prompt processing progress, n_tokens = 2768, batch.n_tokens = 1024, progress = 0.998557
[39875] slot create_check: id 0 | task 3319 | created context checkpoint 1 of 32 (pos_min = 1743, pos_max = 1743, n_tokens = 1744, size = 62.813 MiB)
[39875] slot update_slots: id 0 | task 3319 | n_tokens = 2768, memory_seq_rm [2768, end)
[39875] slot init_sampler: id 0 | task 3319 | init sampler, took 0.88 ms, tokens: text = 2772, total = 2772
[39875] slot update_slots: id 0 | task 3319 | prompt processing done, n_tokens = 2772, batch.n_tokens = 4
[39875] slot create_check: id 0 | task 3319 | created context checkpoint 2 of 32 (pos_min = 2767, pos_max = 2767, n_tokens = 2768, size = 62.813 MiB)
[39875] slot print_timing: id 0 | task 3319 |
[39875] prompt eval time = 3682.90 ms / 2772 tokens ( 1.33 ms per token, 752.67 tokens per second)
[39875] eval time = 1357.64 ms / 98 tokens ( 13.85 ms per token, 72.18 tokens per second)
[39875] total time = 5040.54 ms / 2870 tokens
[39875] slot release: id 0 | task 3319 | stop processing: n_tokens = 2869, truncated = 0
[39875] srv update_slots: all slots are idle
[39875] srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
Gemini analysis
1. Inference speed takes off: 752.67 tokens/s (peak throughput)
Look at the prompt eval time for task 3319: processing 2772 tokens took only 3.6 seconds, i.e. 1.33 ms per token.
Conclusion: with reasoning disabled, the GPU's compute is fully devoted to prompt pre-processing. Even when a full recompute happens (cache miss), the wait drops from "painfully long" to "a short breather".
2. The cache-similarity puzzle: sim = 0.368
This is currently the core question: why is similarity so low on a repeated request?
In the second request the model is identical and the content largely overlaps, yet llama.cpp reports a similarity of only 36.8%, which means n_past could only match 1021 tokens.
Likely cause: with reasoning mode disabled, only one variable remains, namely Open WebUI's prompt-assembly logic. Check whether the WebUI injects shuffled context or a changing system prompt into the conversation history. If similarity stays low, the text stream sent to the backend must differ substantially at the character level on every request.
3. SWA takes the blame again
The log again says: forcing full prompt re-processing due to lack of cache data (likely due to SWA...).
Because similarity is low, the position n_past matched (1021) is completely out of line with the stored checkpoint positions (1469, 2493). llama.cpp is deliberately conservative here: if it cannot restore perfectly from a checkpoint, it clears the entire cache (memory_seq_rm [0, end)) and relies on the GPUs to recompute by brute force.
💡 Current status assessment
The machine is now in "final form":
Generation speed (eval): stable around 72 t/s.
Prompt processing (prompt eval): a remarkable 752 t/s.
VRAM usage: ctx-size has been relaxed to 51200 (50k) and runs stably.
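To make point 2 concrete, here is a minimal sketch of prefix matching. This is not llama.cpp's actual f_keep/sim code (the exact formula is not shown in the logs), only an illustration of the mechanism: one changed token early in the prompt caps n_past at that position, no matter how much of the rest is identical.

```python
def prefix_match(cached_tokens, new_tokens):
    """Length of the shared token prefix, i.e. how far n_past can reach."""
    n_past = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n_past += 1
    return n_past

cached = list(range(2540))                         # previously cached prompt
resent = cached[:1021] + [999999] + cached[1022:]  # one token differs at 1021

n_past = prefix_match(cached, resent)
print(n_past)                # 1021, matching the log's n_past = 1021
print(n_past / len(resent))  # at most ~40% of the prompt is reusable
```

This is why a timestamp, a reshuffled context block, or a rewritten system prompt near the top of the conversation is enough to invalidate almost the whole cache.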
Log - opencode TDD coding task
opencode.json
"limit": { "context": 49152, "input": 32768, "output": 4096 }
- Log
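As a sanity check on the limit block (the 51200 figure is n_ctx_slot from the open-webui log above; whether opencode enforces these fields exactly this way is an assumption), the budget fits inside the server slot with headroom:

```python
# Hypothetical sanity check of the opencode.json "limit" block against the
# server context seen in the logs (n_ctx_slot = 51200).
limit = {"context": 49152, "input": 32768, "output": 4096}
n_ctx_slot = 51200

headroom = limit["context"] - (limit["input"] + limit["output"])
print(headroom)                        # tokens spare inside "context"
print(limit["context"] <= n_ctx_slot)  # True: fits in the server slot
```

So the client-side limits themselves should not overflow the server context; the exit at 65% VRAM has to come from something else.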
Troubleshooting