opencode coding: llama.cpp dies instantly

Problem

  • opencode was running TDD (test-driven development) from 设计.md when llama.cpp died instantly at around 65% VRAM usage; the process ended with exited with status 0, a normal exit rather than a crash
  • conversations through open-webui work fine
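A quick way to tell a real crash (e.g. a CUDA OOM killing the process via a signal) from the clean shutdown seen here is to inspect the child's return code: on POSIX, Python reports a signal death as a negative returncode. A minimal sketch; the actual llama-server command line is a placeholder assumption:

```python
import subprocess
import sys

def run_and_classify(cmd):
    """Run a server command and report how it ended.
    POSIX convention via subprocess: returncode 0 = clean exit
    (what the proxy logged here), >0 = error exit,
    <0 = killed by that signal number (a genuine crash)."""
    proc = subprocess.run(cmd)
    rc = proc.returncode
    if rc == 0:
        return "clean exit (status 0)"
    if rc < 0:
        return f"killed by signal {-rc}"
    return f"error exit (status {rc})"

# Stand-in for the real llama-server invocation:
print(run_and_classify([sys.executable, "-c", "pass"]))
```

Since the log shows status 0, llama.cpp itself shut down deliberately; the trigger is more likely a request-side condition (e.g. a limit being exceeded) than a VRAM fault.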

Logs - open-webui request (before tuning)

  • Log

    
    [42507] reasoning-budget: forced sequence complete, done
    [42507] slot print_timing: id  1 | task 4161 |
    [42507] prompt eval time =    7430.64 ms /  4378 tokens (    1.70 ms per token,   589.18 tokens per second)
    [42507]        eval time =     232.68 ms /    18 tokens (   12.93 ms per token,    77.36 tokens per second)
    [42507]       total time =    7663.32 ms /  4396 tokens
    [42507] slot      release: id  1 | task 4161 | stop processing: n_tokens = 4395, truncated = 0
    [42507] srv  update_slots: all slots are idle
    [42507] srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
    srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
    srv  proxy_reques: proxying request to model Qwen3.5-27B on port 42507
    [42507] srv  params_from_: Chat format: peg-native
    [42507] slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 702397047188
    [42507] srv  get_availabl: updating prompt cache
    [42507] srv   prompt_save:  - saving prompt with length 4433, total state size = 87.249 MiB
    [42507] srv          load:  - looking for better prompt, base f_keep = 0.002, sim = 0.002
    [42507] srv        update:  - cache state: 2 prompts, 298.140 MiB (limits: 8192.000 MiB, 204800 tokens, 233719 est)
    [42507] srv        update:    - prompt 0x5b05b9eccbf0:    4073 tokens, checkpoints:  0,    85.265 MiB
    [42507] srv        update:    - prompt 0x5b05b77c6c70:    4433 tokens, checkpoints:  2,   212.876 MiB
    [42507] srv  get_availabl: prompt cache update took 141.89 ms
    [42507] slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
    [42507] slot launch_slot_: id  0 | task 4182 | processing task, is_child = 0
    [42507] slot update_slots: id  0 | task 4182 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 4251
    [42507] slot update_slots: id  0 | task 4182 | n_past = 7, slot.prompt.tokens.size() = 4433, seq_id = 0, pos_min = 4432, n_swa = 0
    [42507] slot update_slots: id  0 | task 4182 | Checking checkpoint with [4300, 4300] against 7...
    [42507] slot update_slots: id  0 | task 4182 | Checking checkpoint with [3788, 3788] against 7...
    [42507] slot update_slots: id  0 | task 4182 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
    [42507] slot update_slots: id  0 | task 4182 | erased invalidated context checkpoint (pos_min = 3788, pos_max = 3788, n_tokens = 3789, n_swa = 0, pos_next = 0, size = 62.813 MiB)
    [42507] slot update_slots: id  0 | task 4182 | erased invalidated context checkpoint (pos_min = 4300, pos_max = 4300, n_tokens = 4301, n_swa = 0, pos_next = 0, size = 62.813 MiB)
    [42507] slot update_slots: id  0 | task 4182 | n_tokens = 0, memory_seq_rm [0, end)
    [42507] slot update_slots: id  0 | task 4182 | prompt processing progress, n_tokens = 3735, batch.n_tokens = 3735, progress = 0.878617
    [42507] slot update_slots: id  0 | task 4182 | n_tokens = 3735, memory_seq_rm [3735, end)
    [42507] slot update_slots: id  0 | task 4182 | prompt processing progress, n_tokens = 4247, batch.n_tokens = 512, progress = 0.999059
    [42507] slot create_check: id  0 | task 4182 | created context checkpoint 1 of 32 (pos_min = 3734, pos_max = 3734, n_tokens = 3735, size = 62.813 MiB)
    [42507] slot update_slots: id  0 | task 4182 | n_tokens = 4247, memory_seq_rm [4247, end)
    [42507] reasoning-budget: activated, budget=0 tokens
    [42507] reasoning-budget: budget=0, forcing immediately
    [42507] slot init_sampler: id  0 | task 4182 | init sampler, took 1.20 ms, tokens: text = 4251, total = 4251
    [42507] slot update_slots: id  0 | task 4182 | prompt processing done, n_tokens = 4251, batch.n_tokens = 4
    [42507] slot create_check: id  0 | task 4182 | created context checkpoint 2 of 32 (pos_min = 4246, pos_max = 4246, n_tokens = 4247, size = 62.813 MiB)
    [42507] reasoning-budget: forced sequence complete, done
    [42507] slot print_timing: id  0 | task 4182 |
    [42507] prompt eval time =    7171.00 ms /  4251 tokens (    1.69 ms per token,   592.80 tokens per second)
    [42507]        eval time =     230.19 ms /    18 tokens (   12.79 ms per token,    78.20 tokens per second)
    [42507]       total time =    7401.19 ms /  4269 tokens
    [42507] slot      release: id  0 | task 4182 | stop processing: n_tokens = 4268, truncated = 0
    [42507] srv  update_slots: all slots are idle
    [42507] srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
  • Gemini analysis - forced full re-processing

    forcing full prompt re-processing due to lack of cache data (likely due to SWA...)
    
    ctx-size is set to 200k, but the Qwen 3.5 model uses SWA (sliding-window attention).
    The new request differs just slightly from the cached checkpoints, so llama.cpp cannot reuse the earlier cache.
    It erased the two stale checkpoints (erased invalidated context checkpoint).
    Full recompute: it re-processed all 4251 tokens.
    Cost: prompt eval time = 7171.00 ms, meaning the model spent 7 seconds crunching in the background before saying anything.
  • Gemini analysis - reasoning-budget conflict

    reasoning-budget: activated, budget=0 tokens
    reasoning-budget: budget=0, forcing immediately
    
    Because reasoning-budget = 0 is set in the .ini and enable_thinking: false is sent, the model is cut off by the server the instant it starts to "think".
    That saves VRAM and time.
    But if OpenCode expects deep reasoning to work through a tricky bug while the budget is pinned at 0 tokens, code quality may suffer: the model is forced to emit results directly, with no internal reasoning pass.
  • Gemini analysis - performance data

    Throughput: 592.80 tokens per second (prompt eval). This is very fast and shows the CUDA build and dual-GPU tensor-split are working well.
    Generation speed: 78.20 tokens per second (eval). For a 27B model, 78 tokens/s is excellent: effectively instant-reply territory.
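The throughput figures in the print_timing lines can be sanity-checked directly from the logged token counts and millisecond timings:

```python
def tok_per_sec(tokens, ms):
    """Throughput the way llama.cpp's print_timing reports it:
    tokens divided by elapsed seconds."""
    return tokens / (ms / 1000.0)

# prompt eval: 4251 tokens in 7171.00 ms -> close to the logged 592.80 t/s
print(round(tok_per_sec(4251, 7171.00), 2))
# generation: 18 tokens in 230.19 ms -> close to the logged 78.20 t/s
print(round(tok_per_sec(18, 230.19), 2))
```

The numbers reproduce the log exactly, so the reported speeds are per-phase wall-clock rates, not batched estimates.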

Logs - open-webui repeated request (50k context + ubatch 1024 + reasoning off)

  • Log

    
    ------------------- open-webui second repeated request ------
    srv  proxy_reques: proxying request to model Qwen3.5-27B on port 39875
    [39875] srv  params_from_: Chat format: peg-native
    [39875] slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 707218965739
    [39875] srv  get_availabl: updating prompt cache
    [39875] srv   prompt_save:  - saving prompt with length 2580, total state size = 77.035 MiB
    [39875] srv          load:  - looking for better prompt, base f_keep = 0.001, sim = 0.004
    [39875] srv          load:  - found better prompt with f_keep = 0.348, sim = 1.000
    [39875] srv        update:  - cache state: 1 prompts, 202.661 MiB (limits: 8192.000 MiB, 51200 tokens, 104289 est)
    [39875] srv        update:    - prompt 0x650c85dfbff0:    2580 tokens, checkpoints:  2,   202.661 MiB
    [39875] srv  get_availabl: prompt cache update took 238.98 ms
    [39875] slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
    [39875] slot launch_slot_: id  0 | task 1566 | processing task, is_child = 0
    [39875] slot update_slots: id  0 | task 1566 | new prompt, n_ctx_slot = 51200, n_keep = 0, task.n_tokens = 789
    [39875] slot update_slots: id  0 | task 1566 | n_past = 789, slot.prompt.tokens.size() = 2266, seq_id = 0, pos_min = 2265, n_swa = 0
    [39875] slot update_slots: id  0 | task 1566 | Checking checkpoint with [784, 784] against 789...
    [39875] slot update_slots: id  0 | task 1566 | restored context checkpoint (pos_min = 784, pos_max = 784, n_tokens = 785, n_past = 785, size = 62.813 MiB)
    [39875] slot update_slots: id  0 | task 1566 | n_tokens = 785, memory_seq_rm [785, end)
    [39875] slot init_sampler: id  0 | task 1566 | init sampler, took 0.29 ms, tokens: text = 789, total = 789
    [39875] slot update_slots: id  0 | task 1566 | prompt processing done, n_tokens = 789, batch.n_tokens = 4
    [39875] srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
    srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
    [39875] slot print_timing: id  0 | task 1566 |
    [39875] prompt eval time =      45.13 ms /     4 tokens (   11.28 ms per token,    88.64 tokens per second)
    [39875]        eval time =   25533.13 ms /  1752 tokens (   14.57 ms per token,    68.62 tokens per second)
    [39875]       total time =   25578.26 ms /  1756 tokens
    [39875] slot      release: id  0 | task 1566 | stop processing: n_tokens = 2540, truncated = 0
    [39875] srv  update_slots: all slots are idle
    srv  proxy_reques: proxying request to model Qwen3.5-27B on port 39875
    [39875] srv  params_from_: Chat format: peg-native
    [39875] slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 707268226515
    [39875] srv  get_availabl: updating prompt cache
    [39875] srv   prompt_save:  - saving prompt with length 2540, total state size = 76.815 MiB
    [39875] srv          load:  - looking for better prompt, base f_keep = 0.001, sim = 0.001
    [39875] srv          load:  - found better prompt with f_keep = 0.396, sim = 0.368
    [39875] srv        update:  - cache state: 1 prompts, 139.628 MiB (limits: 8192.000 MiB, 51200 tokens, 149022 est)
    [39875] srv        update:    - prompt 0x650c7fe5fda0:    2540 tokens, checkpoints:  1,   139.628 MiB
    [39875] srv  get_availabl: prompt cache update took 133.95 ms
    [39875] slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
    [39875] slot launch_slot_: id  0 | task 3319 | processing task, is_child = 0
    [39875] slot update_slots: id  0 | task 3319 | new prompt, n_ctx_slot = 51200, n_keep = 0, task.n_tokens = 2772
    [39875] slot update_slots: id  0 | task 3319 | n_past = 1021, slot.prompt.tokens.size() = 2580, seq_id = 0, pos_min = 2579, n_swa = 0
    [39875] slot update_slots: id  0 | task 3319 | Checking checkpoint with [2493, 2493] against 1021...
    [39875] slot update_slots: id  0 | task 3319 | Checking checkpoint with [1469, 1469] against 1021...
    [39875] slot update_slots: id  0 | task 3319 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
    [39875] slot update_slots: id  0 | task 3319 | erased invalidated context checkpoint (pos_min = 1469, pos_max = 1469, n_tokens = 1470, n_swa = 0, pos_next = 0, size = 62.813 MiB)
    [39875] slot update_slots: id  0 | task 3319 | erased invalidated context checkpoint (pos_min = 2493, pos_max = 2493, n_tokens = 2494, n_swa = 0, pos_next = 0, size = 62.813 MiB)
    [39875] slot update_slots: id  0 | task 3319 | n_tokens = 0, memory_seq_rm [0, end)
    [39875] slot update_slots: id  0 | task 3319 | prompt processing progress, n_tokens = 1744, batch.n_tokens = 1744, progress = 0.629149
    [39875] slot update_slots: id  0 | task 3319 | n_tokens = 1744, memory_seq_rm [1744, end)
    [39875] slot update_slots: id  0 | task 3319 | prompt processing progress, n_tokens = 2768, batch.n_tokens = 1024, progress = 0.998557
    [39875] slot create_check: id  0 | task 3319 | created context checkpoint 1 of 32 (pos_min = 1743, pos_max = 1743, n_tokens = 1744, size = 62.813 MiB)
    [39875] slot update_slots: id  0 | task 3319 | n_tokens = 2768, memory_seq_rm [2768, end)
    [39875] slot init_sampler: id  0 | task 3319 | init sampler, took 0.88 ms, tokens: text = 2772, total = 2772
    [39875] slot update_slots: id  0 | task 3319 | prompt processing done, n_tokens = 2772, batch.n_tokens = 4
    [39875] slot create_check: id  0 | task 3319 | created context checkpoint 2 of 32 (pos_min = 2767, pos_max = 2767, n_tokens = 2768, size = 62.813 MiB)
    [39875] slot print_timing: id  0 | task 3319 |
    [39875] prompt eval time =    3682.90 ms /  2772 tokens (    1.33 ms per token,   752.67 tokens per second)
    [39875]        eval time =    1357.64 ms /    98 tokens (   13.85 ms per token,    72.18 tokens per second)
    [39875]       total time =    5040.54 ms /  2870 tokens
    [39875] slot      release: id  0 | task 3319 | stop processing: n_tokens = 2869, truncated = 0
    [39875] srv  update_slots: all slots are idle
    [39875] srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
    srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
    
    
    
  • Gemini analysis

    1. Prompt processing takes off: 752.67 tokens/s (peak throughput)
    Look at task 3319's prompt eval time:
    2772 tokens processed in just 3.6 seconds.
    Per-token cost dropped to 1.33 ms.
    Conclusion: with reasoning disabled, the GPU's compute is fully freed up for prompt pre-processing. Even when a full recompute happens (cache miss), the wait drops from "long" to "a short breather".
    
    2. The cache-similarity puzzle: sim = 0.368
    This is currently the core question: why is similarity so low for a repeated request?
    In the second request the model is identical and the content overlaps heavily, yet llama.cpp reports only 36.8% similarity.
    That directly limits the n_past match to only 1021 tokens.
    Narrowing it down: with reasoning mode off, only one variable is left, namely Open WebUI's prompt-assembly logic.
    Check whether the WebUI injects shuffled context or a changing system prompt into the conversation history.
    If similarity never improves, the text stream sent to the backend must differ substantially at the literal character level between requests.
    
    3. SWA takes the blame again
    The log once more reports: forcing full prompt re-processing due to lack of cache data (likely due to SWA...).
    Because similarity is low, the matched n_past position (1021) falls short of both stored checkpoints (1469, 2493).
    llama.cpp's policy is deliberately conservative: if it cannot restore cleanly from a checkpoint, it clears the whole cached sequence (memory_seq_rm [0, end)) and lets the GPU brute-force a recompute.
    
    💡 Status check
    The machine is now running at full strength:
    Generation speed (eval): stable around 72 t/s.
    Prompt processing (prompt eval): an impressive 752 t/s.
    VRAM: ctx-size has been relaxed to 51200 (50k) and runs stably.
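The similarity problem above comes down to prefix reuse: KV-cache reuse is driven by the shared leading token span, so a single early divergence (a timestamp in the system prompt, reordered history) invalidates everything after it. llama.cpp's actual sim score is internal and computed differently; the sketch below uses a plain longest-common-prefix ratio purely to illustrate the effect:

```python
def prefix_similarity(cached, new):
    """Fraction of the new prompt covered by the shared leading prefix.
    Illustrative only; llama.cpp's internal `sim` metric is not this."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n / max(len(new), 1)

# Hypothetical token streams: one changing token near the front
# (e.g. a timestamp in the system prompt) kills almost all reuse.
cached = ["<sys>", "ts=10:00", "hello", "world", "how", "are", "you"]
new    = ["<sys>", "ts=10:05", "hello", "world", "how", "are", "you"]
print(prefix_similarity(cached, new))  # only the first token matches
```

This is why the suggested check targets Open WebUI's prompt assembly: the payloads must be byte-identical up to the point of genuine new content for the cache to help.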

Logs - opencode TDD coding task

  • opencode.json

    "limit": {
        "context": 49152,
        "input": 32768,
        "output": 4096
    }
  • Log
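The limit values in the opencode.json fragment above can be sanity-checked: the input window plus the output budget must fit inside the total context, otherwise a long TDD prompt could push the server past its slot size. A minimal sketch using the numbers from the fragment:

```python
# Values copied from the opencode.json "limit" fragment above.
limits = {"context": 49152, "input": 32768, "output": 4096}

def fits(l):
    """A request only fits if input tokens plus the reserved output
    budget stay within the configured context window."""
    return l["input"] + l["output"] <= l["context"]

print(fits(limits))  # True: 32768 + 4096 = 36864 <= 49152
```

Note the server slot here runs with n_ctx_slot = 51200, so the 49152 client-side context limit leaves some headroom.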

Troubleshooting