llama.cpp crashes immediately during OpenCode (oc) coding

Problem progress

  • opencode runs TDD (test-driven development) against 设计.md; llama.cpp goes down immediately at ~65% VRAM, reported as exited with status 0, i.e. a normal exit rather than a crash
  • Comparison: chatting through open-webui works fine
  • Comparison: tuning parameters on both llama.cpp and opencode does not help, it still crashes; log analysis shows a single opencode TDD coding run bursting to ~350k tokens of context
  • Comparison: switching to opencode zen's free compute (gpt-5 nano, 400k context), the task runs to completion
  • New finding: --flash-attn off may resolve it (https://github.com/ggml-org/llama.cpp/issues/21336), but this is not fully sorted out yet (launch sketch below)
  • Workaround: following @ai-sdk/openai-compatible, add a request limit in nginx
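  • Launch sketch for the --flash-attn workaround

    A hedged sketch only: the model path, port, and context size below are placeholders taken from these notes, and the exact flag spelling should be checked against llama-server --help on the build in use.

    # Relaunch the backend with flash attention disabled, per the finding above.
    # Model path, port, and context size are placeholders, not verified values.
    llama-server \
      -m /models/Qwen3.5-27B.gguf \
      --ctx-size 51200 \
      --flash-attn off \
      --port 42507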

Log - open-webui request, before tuning

  • Log

    
    [42507] reasoning-budget: forced sequence complete, done
    [42507] slot print_timing: id  1 | task 4161 |
    [42507] prompt eval time =    7430.64 ms /  4378 tokens (    1.70 ms per token,   589.18 tokens per second)
    [42507]        eval time =     232.68 ms /    18 tokens (   12.93 ms per token,    77.36 tokens per second)
    [42507]       total time =    7663.32 ms /  4396 tokens
    [42507] slot      release: id  1 | task 4161 | stop processing: n_tokens = 4395, truncated = 0
    [42507] srv  update_slots: all slots are idle
    [42507] srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
    srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
    srv  proxy_reques: proxying request to model Qwen3.5-27B on port 42507
    [42507] srv  params_from_: Chat format: peg-native
    [42507] slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 702397047188
    [42507] srv  get_availabl: updating prompt cache
    [42507] srv   prompt_save:  - saving prompt with length 4433, total state size = 87.249 MiB
    [42507] srv          load:  - looking for better prompt, base f_keep = 0.002, sim = 0.002
    [42507] srv        update:  - cache state: 2 prompts, 298.140 MiB (limits: 8192.000 MiB, 204800 tokens, 233719 est)
    [42507] srv        update:    - prompt 0x5b05b9eccbf0:    4073 tokens, checkpoints:  0,    85.265 MiB
    [42507] srv        update:    - prompt 0x5b05b77c6c70:    4433 tokens, checkpoints:  2,   212.876 MiB
    [42507] srv  get_availabl: prompt cache update took 141.89 ms
    [42507] slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
    [42507] slot launch_slot_: id  0 | task 4182 | processing task, is_child = 0
    [42507] slot update_slots: id  0 | task 4182 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 4251
    [42507] slot update_slots: id  0 | task 4182 | n_past = 7, slot.prompt.tokens.size() = 4433, seq_id = 0, pos_min = 4432, n_swa = 0
    [42507] slot update_slots: id  0 | task 4182 | Checking checkpoint with [4300, 4300] against 7...
    [42507] slot update_slots: id  0 | task 4182 | Checking checkpoint with [3788, 3788] against 7...
    [42507] slot update_slots: id  0 | task 4182 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
    [42507] slot update_slots: id  0 | task 4182 | erased invalidated context checkpoint (pos_min = 3788, pos_max = 3788, n_tokens = 3789, n_swa = 0, pos_next = 0, size = 62.813 MiB)
    [42507] slot update_slots: id  0 | task 4182 | erased invalidated context checkpoint (pos_min = 4300, pos_max = 4300, n_tokens = 4301, n_swa = 0, pos_next = 0, size = 62.813 MiB)
    [42507] slot update_slots: id  0 | task 4182 | n_tokens = 0, memory_seq_rm [0, end)
    [42507] slot update_slots: id  0 | task 4182 | prompt processing progress, n_tokens = 3735, batch.n_tokens = 3735, progress = 0.878617
    [42507] slot update_slots: id  0 | task 4182 | n_tokens = 3735, memory_seq_rm [3735, end)
    [42507] slot update_slots: id  0 | task 4182 | prompt processing progress, n_tokens = 4247, batch.n_tokens = 512, progress = 0.999059
    [42507] slot create_check: id  0 | task 4182 | created context checkpoint 1 of 32 (pos_min = 3734, pos_max = 3734, n_tokens = 3735, size = 62.813 MiB)
    [42507] slot update_slots: id  0 | task 4182 | n_tokens = 4247, memory_seq_rm [4247, end)
    [42507] reasoning-budget: activated, budget=0 tokens
    [42507] reasoning-budget: budget=0, forcing immediately
    [42507] slot init_sampler: id  0 | task 4182 | init sampler, took 1.20 ms, tokens: text = 4251, total = 4251
    [42507] slot update_slots: id  0 | task 4182 | prompt processing done, n_tokens = 4251, batch.n_tokens = 4
    [42507] slot create_check: id  0 | task 4182 | created context checkpoint 2 of 32 (pos_min = 4246, pos_max = 4246, n_tokens = 4247, size = 62.813 MiB)
    [42507] reasoning-budget: forced sequence complete, done
    [42507] slot print_timing: id  0 | task 4182 |
    [42507] prompt eval time =    7171.00 ms /  4251 tokens (    1.69 ms per token,   592.80 tokens per second)
    [42507]        eval time =     230.19 ms /    18 tokens (   12.79 ms per token,    78.20 tokens per second)
    [42507]       total time =    7401.19 ms /  4269 tokens
    [42507] slot      release: id  0 | task 4182 | stop processing: n_tokens = 4268, truncated = 0
    [42507] srv  update_slots: all slots are idle
    [42507] srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
  • Gemini analysis - forced recomputation (full re-processing)

    forcing full prompt re-processing due to lack of cache data (likely due to SWA...)
    
    ctx-size is set to 200k, but the Qwen 3.5 model uses SWA (sliding-window attention); note that the log reports n_swa = 0, so the hybrid/recurrent-memory case named in the same message may be the actual culprit.
    The new request differs slightly from the previously cached checkpoints, so llama.cpp cannot reuse the existing cache.
    It erased the two old checkpoints (erased invalidated context checkpoint).
    Full recompute: it re-processed all 4251 tokens.
    Cost: prompt eval time = 7171.00 ms, meaning the model spent about 7 seconds of background computation before saying anything.
  • Gemini analysis - reasoning budget conflict

    reasoning-budget: activated, budget=0 tokens
    reasoning-budget: budget=0, forcing immediately
    
    Because the .ini sets reasoning-budget = 0 and enable_thinking: false, the model is cut off the moment it starts to "think".
    This saves VRAM and time.
    But if OpenCode expects the model to reason deeply about a complex bug while the budget is pinned at 0 tokens, the quality of the generated code may drop (it is forced to emit the answer directly, without working through the logic internally).
  • Gemini analysis - performance data

    Throughput: 592.80 tokens per second (prompt eval). That is very fast, so the CUDA build and dual-GPU tensor-split are working well (see the scale check below).
    Generation: 78.20 tokens per second (eval). For a 27B model, 78 tokens/s is excellent, effectively instant-reply territory.
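  • Scale check - why the OpenCode bursts are different

    A rough back-of-the-envelope estimate using the measured prompt-eval rate above; the ~350k figure comes from the earlier log analysis and is approximate.

    # At ~592.8 tok/s prefill, this 4.2k-token WebUI request costs ~7 s,
    # but a single ~350k-token OpenCode burst would need roughly ten minutes
    # of prefill alone, before any cache-miss recompute is added on top.
    awk 'BEGIN { printf "%.0f s\n", 350000 / 592.8 }'   # ≈ 590 s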

Log - open-webui repeated request (50k context + 1024 ubatch + reasoning off)

  • Log

    
    -------------------open webui second repeated request------
    srv  proxy_reques: proxying request to model Qwen3.5-27B on port 39875
    [39875] srv  params_from_: Chat format: peg-native
    [39875] slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 707218965739
    [39875] srv  get_availabl: updating prompt cache
    [39875] srv   prompt_save:  - saving prompt with length 2580, total state size = 77.035 MiB
    [39875] srv          load:  - looking for better prompt, base f_keep = 0.001, sim = 0.004
    [39875] srv          load:  - found better prompt with f_keep = 0.348, sim = 1.000
    [39875] srv        update:  - cache state: 1 prompts, 202.661 MiB (limits: 8192.000 MiB, 51200 tokens, 104289 est)
    [39875] srv        update:    - prompt 0x650c85dfbff0:    2580 tokens, checkpoints:  2,   202.661 MiB
    [39875] srv  get_availabl: prompt cache update took 238.98 ms
    [39875] slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
    [39875] slot launch_slot_: id  0 | task 1566 | processing task, is_child = 0
    [39875] slot update_slots: id  0 | task 1566 | new prompt, n_ctx_slot = 51200, n_keep = 0, task.n_tokens = 789
    [39875] slot update_slots: id  0 | task 1566 | n_past = 789, slot.prompt.tokens.size() = 2266, seq_id = 0, pos_min = 2265, n_swa = 0
    [39875] slot update_slots: id  0 | task 1566 | Checking checkpoint with [784, 784] against 789...
    [39875] slot update_slots: id  0 | task 1566 | restored context checkpoint (pos_min = 784, pos_max = 784, n_tokens = 785, n_past = 785, size = 62.813 MiB)
    [39875] slot update_slots: id  0 | task 1566 | n_tokens = 785, memory_seq_rm [785, end)
    [39875] slot init_sampler: id  0 | task 1566 | init sampler, took 0.29 ms, tokens: text = 789, total = 789
    [39875] slot update_slots: id  0 | task 1566 | prompt processing done, n_tokens = 789, batch.n_tokens = 4
    [39875] srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
    srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
    [39875] slot print_timing: id  0 | task 1566 |
    [39875] prompt eval time =      45.13 ms /     4 tokens (   11.28 ms per token,    88.64 tokens per second)
    [39875]        eval time =   25533.13 ms /  1752 tokens (   14.57 ms per token,    68.62 tokens per second)
    [39875]       total time =   25578.26 ms /  1756 tokens
    [39875] slot      release: id  0 | task 1566 | stop processing: n_tokens = 2540, truncated = 0
    [39875] srv  update_slots: all slots are idle
    srv  proxy_reques: proxying request to model Qwen3.5-27B on port 39875
    [39875] srv  params_from_: Chat format: peg-native
    [39875] slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 707268226515
    [39875] srv  get_availabl: updating prompt cache
    [39875] srv   prompt_save:  - saving prompt with length 2540, total state size = 76.815 MiB
    [39875] srv          load:  - looking for better prompt, base f_keep = 0.001, sim = 0.001
    [39875] srv          load:  - found better prompt with f_keep = 0.396, sim = 0.368
    [39875] srv        update:  - cache state: 1 prompts, 139.628 MiB (limits: 8192.000 MiB, 51200 tokens, 149022 est)
    [39875] srv        update:    - prompt 0x650c7fe5fda0:    2540 tokens, checkpoints:  1,   139.628 MiB
    [39875] srv  get_availabl: prompt cache update took 133.95 ms
    [39875] slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
    [39875] slot launch_slot_: id  0 | task 3319 | processing task, is_child = 0
    [39875] slot update_slots: id  0 | task 3319 | new prompt, n_ctx_slot = 51200, n_keep = 0, task.n_tokens = 2772
    [39875] slot update_slots: id  0 | task 3319 | n_past = 1021, slot.prompt.tokens.size() = 2580, seq_id = 0, pos_min = 2579, n_swa = 0
    [39875] slot update_slots: id  0 | task 3319 | Checking checkpoint with [2493, 2493] against 1021...
    [39875] slot update_slots: id  0 | task 3319 | Checking checkpoint with [1469, 1469] against 1021...
    [39875] slot update_slots: id  0 | task 3319 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
    [39875] slot update_slots: id  0 | task 3319 | erased invalidated context checkpoint (pos_min = 1469, pos_max = 1469, n_tokens = 1470, n_swa = 0, pos_next = 0, size = 62.813 MiB)
    [39875] slot update_slots: id  0 | task 3319 | erased invalidated context checkpoint (pos_min = 2493, pos_max = 2493, n_tokens = 2494, n_swa = 0, pos_next = 0, size = 62.813 MiB)
    [39875] slot update_slots: id  0 | task 3319 | n_tokens = 0, memory_seq_rm [0, end)
    [39875] slot update_slots: id  0 | task 3319 | prompt processing progress, n_tokens = 1744, batch.n_tokens = 1744, progress = 0.629149
    [39875] slot update_slots: id  0 | task 3319 | n_tokens = 1744, memory_seq_rm [1744, end)
    [39875] slot update_slots: id  0 | task 3319 | prompt processing progress, n_tokens = 2768, batch.n_tokens = 1024, progress = 0.998557
    [39875] slot create_check: id  0 | task 3319 | created context checkpoint 1 of 32 (pos_min = 1743, pos_max = 1743, n_tokens = 1744, size = 62.813 MiB)
    [39875] slot update_slots: id  0 | task 3319 | n_tokens = 2768, memory_seq_rm [2768, end)
    [39875] slot init_sampler: id  0 | task 3319 | init sampler, took 0.88 ms, tokens: text = 2772, total = 2772
    [39875] slot update_slots: id  0 | task 3319 | prompt processing done, n_tokens = 2772, batch.n_tokens = 4
    [39875] slot create_check: id  0 | task 3319 | created context checkpoint 2 of 32 (pos_min = 2767, pos_max = 2767, n_tokens = 2768, size = 62.813 MiB)
    [39875] slot print_timing: id  0 | task 3319 |
    [39875] prompt eval time =    3682.90 ms /  2772 tokens (    1.33 ms per token,   752.67 tokens per second)
    [39875]        eval time =    1357.64 ms /    98 tokens (   13.85 ms per token,    72.18 tokens per second)
    [39875]       total time =    5040.54 ms /  2870 tokens
    [39875] slot      release: id  0 | task 3319 | stop processing: n_tokens = 2869, truncated = 0
    [39875] srv  update_slots: all slots are idle
    [39875] srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
    srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
  • Gemini analysis

    1. Inference speed takes off: 752.67 tokens/s (peak throughput)
    Look at the prompt eval time for task 3319:
    Processing 2772 tokens took only 3.6 seconds.
    Per-token cost dropped to 1.33 ms.
    Conclusion: with reasoning disabled, the GPU's compute goes entirely into prompt pre-processing. Even on a full recompute (cache miss), the wait drops from "painfully long" to "a short pause".
    
    2. The cache-similarity puzzle: sim = 0.368
    This is currently the core question: why is the similarity so low for a repeated request?
    On the second request the model is the same and the content overlaps heavily, yet llama.cpp reports only 36.8% similarity.
    That directly limits n_past to matching just 1021 tokens.
    Root-cause check: with reasoning disabled, only one variable is left, Open WebUI's prompt-assembly logic.
    Check whether the WebUI injects shuffled context or a changing system prompt into the conversation history (see the prompt-capture sketch below).
    If the similarity never improves, the text stream sent to the backend differs substantially at the raw character level from request to request.
    
    3. SWA takes the blame again
    The log again reports: forcing full prompt re-processing due to lack of cache data (likely due to SWA...).
    Because the similarity is low, the matched n_past position (1021) is completely misaligned with the stored checkpoint positions (1469, 2493).
    llama.cpp's policy is very conservative: if it cannot restore perfectly from a checkpoint, it would rather wipe the whole cache (memory_seq_rm [0, end)) and brute-force recompute on the GPU.
    
    💡 Current status assessment
    The machine is now running at "full strength":
    Generation (eval): stable at roughly 72 t/s.
    Prompt processing (prompt eval): an impressive 752 t/s.
    VRAM usage: ctx-size has been relaxed to 51200 (50k) and runs stably.
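  • Prompt-capture check for the sim = 0.368 question

    A hedged way to see whether Open WebUI really resends the same bytes: put a dump proxy in front of the backend and compare two captured request bodies. Ports are placeholders; this assumes the llama.cpp endpoint listens on 127.0.0.1:8080.

    # Dump all traffic through a local relay, then point Open WebUI's
    # OpenAI-compatible base URL at 127.0.0.1:18080 (placeholder port).
    socat -v TCP-LISTEN:18080,fork,reuseaddr TCP:127.0.0.1:8080 2> captured.log
    # Send the same chat message twice, then compare the two
    # POST /v1/chat/completions bodies in captured.log: a changing system
    # prompt, injected timestamps, or re-ordered history would explain why
    # llama.cpp only sees 36.8% similarity.
    grep -c 'POST /v1/chat/completions' captured.log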

Log - opencode TDD coding task

  • opencode.json

    "limit": {
        "context": 49152,
        "input": 32768,
        "output": 4096
    }
  • llama-server log

    ---------------opencode tdd start----------------
    srv  proxy_reques: proxying request to model Qwen3.5-27B on port 39875
    srv    operator(): http client error: Failed to read connection
    srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
    ----------------end---------------
  • opencode service log
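  • Standalone repro sketch (without OpenCode)

    To see exactly how the backend dies on this request shape, the oversized request can be replayed with OpenCode out of the loop. A hedged sketch; the URL, port, and model name are placeholders from these notes.

    # Build a ~300k-character user message and post it straight at the server.
    BIG=$(head -c 300000 /dev/urandom | base64 | tr -d '\n' | head -c 300000)
    jq -n --arg c "$BIG" \
      '{model:"Qwen3.5-27B", max_tokens:64, messages:[{role:"user",content:$c}]}' \
      > /tmp/big-request.json
    curl -s http://127.0.0.1:8080/v1/chat/completions \
      -H 'Content-Type: application/json' \
      --data @/tmp/big-request.json
    # If llama-server segfaults on this too, the trigger is the oversized prompt
    # itself rather than OpenCode's tools schema.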
  • Trae analysis

    The OpenCode log also shows one key difference: when OpenCode runs agent=compaction, it sends an extremely large request body to https://v1008080.atibm.com/v1/chat/completions, with messages carrying a very long history (the log already shows [..., 329672 chars omitted ...]) and max_tokens set to 4096. Open WebUI essentially never sends requests shaped like this.
    
    So the most plausible root causes fall into two classes (the last few lines of the backend log will settle it):
    
    - A: the child process hits a CUDA OOM / illegal access / assert while handling the huge prompt / tools schema and exits on the spot
    - B: the child process is killed by the OS (OOM killer) or deliberately reclaimed by the parent (though then the parent log would usually show an exit command / unload_all causal chain)
    
    ----------------------------comparison---------------------------------------
    With OpenCode switched to GPT-5 nano (400k context), TDD proceeds normally, so the OpenCode/TDD workflow itself is fine; the real problem is that local llama.cpp segfaults when it takes on OpenCode's heavyweight requests (long messages + tools schema + compaction/retries). The process dies outright, which is why VRAM is cleared instantly and the upper layer reports Failed to read connection.
    
    Three suggested A/B experiments (a combined launch sketch follows experiment 3)
    Experiment 1: push the batch peak down to "absolutely stable" first. The current n_batch=2048 / n_ubatch=1024 is still aggressive for huge prefills with flash-attn. Change to:
    
    - --n-batch 512
    - --n-ubatch 128 (or 256)
    - leave everything else untouched
    If OpenCode TDD no longer crashes instantly with this, the problem is class A (peak load / VRAM / driver), not the network path.
    
    Experiment 2: disable the prompt cache (to rule out the checkpoint/cache path). For models like Qwen3.5 with recurrent/hybrid memory, the prompt-cache paths are more involved; OpenCode's prompts also change more between requests, so the hit rate is poor and more branches get exercised. Verify stability first:
    
    - --cache-ram 0
    If it is much more stable with the cache disabled, the trigger lies in the "cache/checkpoint path + large prompt" combination.
    
    Experiment 3: have OpenCode stop sending the huge tools payload (best way to isolate the trigger). The plugin list currently includes oh-my-openagent@latest, which significantly inflates the tools schema. Run an A/B:
    
    - temporarily empty the plugin list in the OpenCode config
    - rerun the same TDD start
    If it no longer crashes, the cause is essentially "tools/schema triggering the child crash or the peak load".
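  • Combined launch sketch for experiments 1 and 2

    A hedged baseline for the A/B runs above. Note that in upstream llama-server the batch sizes are spelled -b/--batch-size and -ub/--ubatch-size, so the --n-batch/--n-ubatch names above may need translating; the model path is a placeholder.

    # "Absolutely stable" baseline: small batches, prompt cache off, single slot.
    llama-server \
      -m /models/Qwen3.5-27B.gguf \
      --ctx-size 51200 \
      -b 512 \
      -ub 128 \
      --cache-ram 0 \
      --parallel 1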
  • Trae's suggested checks:

    sudo dmesg -T | egrep -i 'killed process|oom|out of memory|xid|nvrm|segfault' | tail -n 200
    [二 4月 21 16:52:25 2026] llama-server[975470]: segfault at 7e7471ffcff8 ip 0000650c5507055d sp 00007e7471ffd000 error 6 in llama-server[14655d,650c54f56000+169000] likely on CPU 25 (core 12, socket 0)
    A real crash recorded by the kernel: llama-server / libllama.so segfaulted on a memory access, which is why VRAM is freed instantly and the upper layer sees Failed to read connection.
    
    Several hard conclusions follow from this dmesg output:
    
    - Repeated segfault ... in libllama.so.0.0.8742 and segfault ... in llama-server entries mean the exit was not caused by nginx/frps/client cancellation; the process crashed (killed by SIGSEGV).
    - There is no typical OOM-killer line (Killed process ... / Out of memory against that PID), so the system did not kill it either.
    
    Binary / shared-library version mismatch (high probability)
     Crashes appear both "in libllama.so" and "in llama-server", with very repetitive addresses/offsets (the recurring ...b344, ...97e6), which strongly suggests an ABI/struct mismatch or a stale library being picked up by the dynamic linker. Common scenario: after updating llama.cpp, only llama-server or only libllama.so was replaced, or several copies of libllama.so exist on the system and the wrong one loads first.
     ---------------------this possibility was ruled out with the commands below-----------------
     ldd "$(which llama-server)" | grep -E 'libllama|ggml'
     ls -l /usr/local/bin/llama-server ~/llama.cpp/build/bin/llama-server
     sha256sum /usr/local/bin/llama-server ~/llama.cpp/build/bin/llama-server
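  • Backtrace capture sketch

    To turn the next segfault into an actionable stack trace rather than just a dmesg line, a hedged sketch that assumes a systemd host with systemd-coredump installed (otherwise set kernel.core_pattern and raise LimitCORE for the service); a non-stripped llama-server build gives far more readable frames.

    # Confirm the crash was captured, then open the newest core in gdb.
    coredumpctl list llama-server
    coredumpctl gdb llama-server
    # Inside gdb, `bt full` shows whether the crash sits in the prompt-cache /
    # checkpoint path or somewhere else entirely.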
  • Trae analysis

    Fundamentally, OC + OMO's "automated coding / multi-step TDD workflow" pushes the request intensity beyond the limits of both backends, but those limits surface differently:
    
    - On the commercial compute side: the limit is platform rate limiting / concurrency quotas, so you see too_many_requests (429); retries keep the run going, just more slowly.
    - On the local llama.cpp side: the limit is implementation stability on extreme-input paths; there is already kernel-level segfault evidence, i.e. a process-level crash, visible as VRAM emptying instantly and the connection becoming unreadable.
    So nothing was "written wrong"; this framework naturally produces:
    
    - extremely long contexts (tools schema + many files + history compaction)
    - high-frequency requests (multiple agents / repeated retries / concurrency)
    - long-lived streaming connections
    Actionable conclusions
    
    - For stability: force concurrency down to 1, trim the tools, cap the output, and reduce compaction frequency.
    - For speed: add rate limiting / backoff on the commercial side; on the local side, avoid the paths that already trigger the segfault (start with --cache-ram 0 --parallel 1, and roll back or switch the llama.cpp version if necessary). See the nginx limit sketch below.
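  • Front-side limit sketch (nginx)

    For the rate-limiting/backoff point above and the nginx workaround noted at the top, a hedged sketch of the standard directives involved; all values are placeholders, and whether to reject oversized bodies or only rate-limit depends on what the @ai-sdk/openai-compatible client tolerates.

    # Candidate directives for the vhost that fronts the llama.cpp endpoint:
    #   client_max_body_size 2m;      # reject oversized compaction bodies early
    #   limit_req_zone $binary_remote_addr zone=llm:10m rate=1r/s;
    #   limit_req zone=llm burst=4;   # smooth multi-agent retry bursts
    nginx -T | grep -nE 'client_max_body_size|limit_req'   # check what is set today
    sudo nginx -t && sudo systemctl reload nginx            # after editing the vhost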