llama.cpp crashes immediately during OpenCode (oc) coding

Problem progress

  • opencode runs TDD (test-driven development) against 设计.md; llama.cpp goes down immediately at ~65% VRAM, reported as exited with status 0, i.e. a normal exit rather than a crash
  • Comparison: chatting through open-webui works fine
  • Comparison: tuning parameters on both llama.cpp and opencode does not help, it still crashes; log analysis shows a single opencode TDD coding run bursting to ~350k tokens of context
  • Comparison: switching to opencode zen's free compute (gpt-5 nano, 400k context), the task runs to completion
  • New finding: --flash-attn off may resolve it (https://github.com/ggml-org/llama.cpp/issues/21336), but this is not fully sorted out yet (launch sketch below)
  • Workaround: following @ai-sdk/openai-compatible, add a request limit in nginx
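  • Launch sketch for the --flash-attn workaround

    A hedged sketch only: the model path, port, and context size below are placeholders taken from these notes, and the exact flag spelling should be checked against llama-server --help on the build in use.

    # Relaunch the backend with flash attention disabled, per the finding above.
    # Model path, port, and context size are placeholders, not verified values.
    llama-server \
      -m /models/Qwen3.5-27B.gguf \
      --ctx-size 51200 \
      --flash-attn off \
      --port 42507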

Log - open-webui request, before tuning

  • Log

    
    [42507] reasoning-budget: forced sequence complete, done
    [42507] slot print_timing: id  1 | task 4161 |
    [42507] prompt eval time =    7430.64 ms /  4378 tokens (    1.70 ms per token,   589.18 tokens per second)
    [42507]        eval time =     232.68 ms /    18 tokens (   12.93 ms per token,    77.36 tokens per second)
    [42507]       total time =    7663.32 ms /  4396 tokens
    [42507] slot      release: id  1 | task 4161 | stop processing: n_tokens = 4395, truncated = 0
    [42507] srv  update_slots: all slots are idle
    [42507] srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
    srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
    srv  proxy_reques: proxying request to model Qwen3.5-27B on port 42507
    [42507] srv  params_from_: Chat format: peg-native
    [42507] slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 702397047188
    [42507] srv  get_availabl: updating prompt cache
    [42507] srv   prompt_save:  - saving prompt with length 4433, total state size = 87.249 MiB
    [42507] srv          load:  - looking for better prompt, base f_keep = 0.002, sim = 0.002
    [42507] srv        update:  - cache state: 2 prompts, 298.140 MiB (limits: 8192.000 MiB, 204800 tokens, 233719 est)
    [42507] srv        update:    - prompt 0x5b05b9eccbf0:    4073 tokens, checkpoints:  0,    85.265 MiB
    [42507] srv        update:    - prompt 0x5b05b77c6c70:    4433 tokens, checkpoints:  2,   212.876 MiB
    [42507] srv  get_availabl: prompt cache update took 141.89 ms
    [42507] slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
    [42507] slot launch_slot_: id  0 | task 4182 | processing task, is_child = 0
    [42507] slot update_slots: id  0 | task 4182 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 4251
    [42507] slot update_slots: id  0 | task 4182 | n_past = 7, slot.prompt.tokens.size() = 4433, seq_id = 0, pos_min = 4432, n_swa = 0
    [42507] slot update_slots: id  0 | task 4182 | Checking checkpoint with [4300, 4300] against 7...
    [42507] slot update_slots: id  0 | task 4182 | Checking checkpoint with [3788, 3788] against 7...
    [42507] slot update_slots: id  0 | task 4182 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
    [42507] slot update_slots: id  0 | task 4182 | erased invalidated context checkpoint (pos_min = 3788, pos_max = 3788, n_tokens = 3789, n_swa = 0, pos_next = 0, size = 62.813 MiB)
    [42507] slot update_slots: id  0 | task 4182 | erased invalidated context checkpoint (pos_min = 4300, pos_max = 4300, n_tokens = 4301, n_swa = 0, pos_next = 0, size = 62.813 MiB)
    [42507] slot update_slots: id  0 | task 4182 | n_tokens = 0, memory_seq_rm [0, end)
    [42507] slot update_slots: id  0 | task 4182 | prompt processing progress, n_tokens = 3735, batch.n_tokens = 3735, progress = 0.878617
    [42507] slot update_slots: id  0 | task 4182 | n_tokens = 3735, memory_seq_rm [3735, end)
    [42507] slot update_slots: id  0 | task 4182 | prompt processing progress, n_tokens = 4247, batch.n_tokens = 512, progress = 0.999059
    [42507] slot create_check: id  0 | task 4182 | created context checkpoint 1 of 32 (pos_min = 3734, pos_max = 3734, n_tokens = 3735, size = 62.813 MiB)
    [42507] slot update_slots: id  0 | task 4182 | n_tokens = 4247, memory_seq_rm [4247, end)
    [42507] reasoning-budget: activated, budget=0 tokens
    [42507] reasoning-budget: budget=0, forcing immediately
    [42507] slot init_sampler: id  0 | task 4182 | init sampler, took 1.20 ms, tokens: text = 4251, total = 4251
    [42507] slot update_slots: id  0 | task 4182 | prompt processing done, n_tokens = 4251, batch.n_tokens = 4
    [42507] slot create_check: id  0 | task 4182 | created context checkpoint 2 of 32 (pos_min = 4246, pos_max = 4246, n_tokens = 4247, size = 62.813 MiB)
    [42507] reasoning-budget: forced sequence complete, done
    [42507] slot print_timing: id  0 | task 4182 |
    [42507] prompt eval time =    7171.00 ms /  4251 tokens (    1.69 ms per token,   592.80 tokens per second)
    [42507]        eval time =     230.19 ms /    18 tokens (   12.79 ms per token,    78.20 tokens per second)
    [42507]       total time =    7401.19 ms /  4269 tokens
    [42507] slot      release: id  0 | task 4182 | stop processing: n_tokens = 4268, truncated = 0
    [42507] srv  update_slots: all slots are idle
    [42507] srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
  • Gemini analysis - forced recomputation (full re-processing)

    forcing full prompt re-processing due to lack of cache data (likely due to SWA...)
    
    ctx-size is set to 200k, but the Qwen 3.5 model uses SWA (sliding-window attention); note that the log reports n_swa = 0, so the hybrid/recurrent-memory case named in the same message may be the actual culprit.
    The new request differs slightly from the previously cached checkpoints, so llama.cpp cannot reuse the existing cache.
    It erased the two old checkpoints (erased invalidated context checkpoint).
    Full recompute: it re-processed all 4251 tokens.
    Cost: prompt eval time = 7171.00 ms, meaning the model spent about 7 seconds of background computation before saying anything.
  • Gemini analysis - reasoning budget conflict

    reasoning-budget: activated, budget=0 tokens
    reasoning-budget: budget=0, forcing immediately
    
    Because the .ini sets reasoning-budget = 0 and enable_thinking: false, the model is cut off the moment it starts to "think".
    This saves VRAM and time.
    But if OpenCode expects the model to reason deeply about a complex bug while the budget is pinned at 0 tokens, the quality of the generated code may drop (it is forced to emit the answer directly, without working through the logic internally).
  • Gemini analysis - performance data

    Throughput: 592.80 tokens per second (prompt eval). That is very fast, so the CUDA build and dual-GPU tensor-split are working well (see the scale check below).
    Generation: 78.20 tokens per second (eval). For a 27B model, 78 tokens/s is excellent, effectively instant-reply territory.
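  • Scale check - why the OpenCode bursts are different

    A rough back-of-the-envelope estimate using the measured prompt-eval rate above; the ~350k figure comes from the earlier log analysis and is approximate.

    # At ~592.8 tok/s prefill, this 4.2k-token WebUI request costs ~7 s,
    # but a single ~350k-token OpenCode burst would need roughly ten minutes
    # of prefill alone, before any cache-miss recompute is added on top.
    awk 'BEGIN { printf "%.0f s\n", 350000 / 592.8 }'   # ≈ 590 s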

Log - open-webui repeated request (50k context + 1024 ubatch + reasoning off)

  • Log

    
    -------------------open webui second repeated request------
    srv  proxy_reques: proxying request to model Qwen3.5-27B on port 39875
    [39875] srv  params_from_: Chat format: peg-native
    [39875] slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 707218965739
    [39875] srv  get_availabl: updating prompt cache
    [39875] srv   prompt_save:  - saving prompt with length 2580, total state size = 77.035 MiB
    [39875] srv          load:  - looking for better prompt, base f_keep = 0.001, sim = 0.004
    [39875] srv          load:  - found better prompt with f_keep = 0.348, sim = 1.000
    [39875] srv        update:  - cache state: 1 prompts, 202.661 MiB (limits: 8192.000 MiB, 51200 tokens, 104289 est)
    [39875] srv        update:    - prompt 0x650c85dfbff0:    2580 tokens, checkpoints:  2,   202.661 MiB
    [39875] srv  get_availabl: prompt cache update took 238.98 ms
    [39875] slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
    [39875] slot launch_slot_: id  0 | task 1566 | processing task, is_child = 0
    [39875] slot update_slots: id  0 | task 1566 | new prompt, n_ctx_slot = 51200, n_keep = 0, task.n_tokens = 789
    [39875] slot update_slots: id  0 | task 1566 | n_past = 789, slot.prompt.tokens.size() = 2266, seq_id = 0, pos_min = 2265, n_swa = 0
    [39875] slot update_slots: id  0 | task 1566 | Checking checkpoint with [784, 784] against 789...
    [39875] slot update_slots: id  0 | task 1566 | restored context checkpoint (pos_min = 784, pos_max = 784, n_tokens = 785, n_past = 785, size = 62.813 MiB)
    [39875] slot update_slots: id  0 | task 1566 | n_tokens = 785, memory_seq_rm [785, end)
    [39875] slot init_sampler: id  0 | task 1566 | init sampler, took 0.29 ms, tokens: text = 789, total = 789
    [39875] slot update_slots: id  0 | task 1566 | prompt processing done, n_tokens = 789, batch.n_tokens = 4
    [39875] srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
    srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
    [39875] slot print_timing: id  0 | task 1566 |
    [39875] prompt eval time =      45.13 ms /     4 tokens (   11.28 ms per token,    88.64 tokens per second)
    [39875]        eval time =   25533.13 ms /  1752 tokens (   14.57 ms per token,    68.62 tokens per second)
    [39875]       total time =   25578.26 ms /  1756 tokens
    [39875] slot      release: id  0 | task 1566 | stop processing: n_tokens = 2540, truncated = 0
    [39875] srv  update_slots: all slots are idle
    srv  proxy_reques: proxying request to model Qwen3.5-27B on port 39875
    [39875] srv  params_from_: Chat format: peg-native
    [39875] slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 707268226515
    [39875] srv  get_availabl: updating prompt cache
    [39875] srv   prompt_save:  - saving prompt with length 2540, total state size = 76.815 MiB
    [39875] srv          load:  - looking for better prompt, base f_keep = 0.001, sim = 0.001
    [39875] srv          load:  - found better prompt with f_keep = 0.396, sim = 0.368
    [39875] srv        update:  - cache state: 1 prompts, 139.628 MiB (limits: 8192.000 MiB, 51200 tokens, 149022 est)
    [39875] srv        update:    - prompt 0x650c7fe5fda0:    2540 tokens, checkpoints:  1,   139.628 MiB
    [39875] srv  get_availabl: prompt cache update took 133.95 ms
    [39875] slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
    [39875] slot launch_slot_: id  0 | task 3319 | processing task, is_child = 0
    [39875] slot update_slots: id  0 | task 3319 | new prompt, n_ctx_slot = 51200, n_keep = 0, task.n_tokens = 2772
    [39875] slot update_slots: id  0 | task 3319 | n_past = 1021, slot.prompt.tokens.size() = 2580, seq_id = 0, pos_min = 2579, n_swa = 0
    [39875] slot update_slots: id  0 | task 3319 | Checking checkpoint with [2493, 2493] against 1021...
    [39875] slot update_slots: id  0 | task 3319 | Checking checkpoint with [1469, 1469] against 1021...
    [39875] slot update_slots: id  0 | task 3319 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
    [39875] slot update_slots: id  0 | task 3319 | erased invalidated context checkpoint (pos_min = 1469, pos_max = 1469, n_tokens = 1470, n_swa = 0, pos_next = 0, size = 62.813 MiB)
    [39875] slot update_slots: id  0 | task 3319 | erased invalidated context checkpoint (pos_min = 2493, pos_max = 2493, n_tokens = 2494, n_swa = 0, pos_next = 0, size = 62.813 MiB)
    [39875] slot update_slots: id  0 | task 3319 | n_tokens = 0, memory_seq_rm [0, end)
    [39875] slot update_slots: id  0 | task 3319 | prompt processing progress, n_tokens = 1744, batch.n_tokens = 1744, progress = 0.629149
    [39875] slot update_slots: id  0 | task 3319 | n_tokens = 1744, memory_seq_rm [1744, end)
    [39875] slot update_slots: id  0 | task 3319 | prompt processing progress, n_tokens = 2768, batch.n_tokens = 1024, progress = 0.998557
    [39875] slot create_check: id  0 | task 3319 | created context checkpoint 1 of 32 (pos_min = 1743, pos_max = 1743, n_tokens = 1744, size = 62.813 MiB)
    [39875] slot update_slots: id  0 | task 3319 | n_tokens = 2768, memory_seq_rm [2768, end)
    [39875] slot init_sampler: id  0 | task 3319 | init sampler, took 0.88 ms, tokens: text = 2772, total = 2772
    [39875] slot update_slots: id  0 | task 3319 | prompt processing done, n_tokens = 2772, batch.n_tokens = 4
    [39875] slot create_check: id  0 | task 3319 | created context checkpoint 2 of 32 (pos_min = 2767, pos_max = 2767, n_tokens = 2768, size = 62.813 MiB)
    [39875] slot print_timing: id  0 | task 3319 |
    [39875] prompt eval time =    3682.90 ms /  2772 tokens (    1.33 ms per token,   752.67 tokens per second)
    [39875]        eval time =    1357.64 ms /    98 tokens (   13.85 ms per token,    72.18 tokens per second)
    [39875]       total time =    5040.54 ms /  2870 tokens
    [39875] slot      release: id  0 | task 3319 | stop processing: n_tokens = 2869, truncated = 0
    [39875] srv  update_slots: all slots are idle
    [39875] srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
    srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
  • Gemini analysis

    1. Inference speed takes off: 752.67 tokens/s (peak throughput)
    Look at the prompt eval time for task 3319:
    Processing 2772 tokens took only 3.6 seconds.
    Per-token cost dropped to 1.33 ms.
    Conclusion: with reasoning disabled, the GPU's compute goes entirely into prompt pre-processing. Even on a full recompute (cache miss), the wait drops from "painfully long" to "a short pause".
    
    2. The cache-similarity puzzle: sim = 0.368
    This is currently the core question: why is the similarity so low for a repeated request?
    On the second request the model is the same and the content overlaps heavily, yet llama.cpp reports only 36.8% similarity.
    That directly limits n_past to matching just 1021 tokens.
    Root-cause check: with reasoning disabled, only one variable is left, Open WebUI's prompt-assembly logic.
    Check whether the WebUI injects shuffled context or a changing system prompt into the conversation history (see the prompt-capture sketch below).
    If the similarity never improves, the text stream sent to the backend differs substantially at the raw character level from request to request.
    
    3. SWA takes the blame again
    The log again reports: forcing full prompt re-processing due to lack of cache data (likely due to SWA...).
    Because the similarity is low, the matched n_past position (1021) is completely misaligned with the stored checkpoint positions (1469, 2493).
    llama.cpp's policy is very conservative: if it cannot restore perfectly from a checkpoint, it would rather wipe the whole cache (memory_seq_rm [0, end)) and brute-force recompute on the GPU.
    
    💡 Current status assessment
    The machine is now running at "full strength":
    Generation (eval): stable at roughly 72 t/s.
    Prompt processing (prompt eval): an impressive 752 t/s.
    VRAM usage: ctx-size has been relaxed to 51200 (50k) and runs stably.
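  • Prompt-capture check for the sim = 0.368 question

    A hedged way to see whether Open WebUI really resends the same bytes: put a dump proxy in front of the backend and compare two captured request bodies. Ports are placeholders; this assumes the llama.cpp endpoint listens on 127.0.0.1:8080.

    # Dump all traffic through a local relay, then point Open WebUI's
    # OpenAI-compatible base URL at 127.0.0.1:18080 (placeholder port).
    socat -v TCP-LISTEN:18080,fork,reuseaddr TCP:127.0.0.1:8080 2> captured.log
    # Send the same chat message twice, then compare the two
    # POST /v1/chat/completions bodies in captured.log: a changing system
    # prompt, injected timestamps, or re-ordered history would explain why
    # llama.cpp only sees 36.8% similarity.
    grep -c 'POST /v1/chat/completions' captured.log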

Log - opencode TDD coding task

  • opencode.json

    "limit": {
        "context": 49152,
        "input": 32768,
        "output": 4096
    }
  • llama-server log

    ---------------opencode tdd start----------------
    srv  proxy_reques: proxying request to model Qwen3.5-27B on port 39875
    srv    operator(): http client error: Failed to read connection
    srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
    ----------------end---------------
  • opencode service log
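  • Standalone repro sketch (without OpenCode)

    To see exactly how the backend dies on this request shape, the oversized request can be replayed with OpenCode out of the loop. A hedged sketch; the URL, port, and model name are placeholders from these notes.

    # Build a ~300k-character user message and post it straight at the server.
    BIG=$(head -c 300000 /dev/urandom | base64 | tr -d '\n' | head -c 300000)
    jq -n --arg c "$BIG" \
      '{model:"Qwen3.5-27B", max_tokens:64, messages:[{role:"user",content:$c}]}' \
      > /tmp/big-request.json
    curl -s http://127.0.0.1:8080/v1/chat/completions \
      -H 'Content-Type: application/json' \
      --data @/tmp/big-request.json
    # If llama-server segfaults on this too, the trigger is the oversized prompt
    # itself rather than OpenCode's tools schema.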
  • Trae analysis

    The OpenCode log also shows one key difference: when OpenCode runs agent=compaction, it sends an extremely large request body to https://v1008080.atibm.com/v1/chat/completions, with messages carrying a very long history (the log already shows [..., 329672 chars omitted ...]) and max_tokens set to 4096. Open WebUI essentially never sends requests shaped like this.
    
    So the most plausible root causes fall into two classes (the last few lines of the backend log will settle it):
    
    - A: the child process hits a CUDA OOM / illegal access / assert while handling the huge prompt / tools schema and exits on the spot
    - B: the child process is killed by the OS (OOM killer) or deliberately reclaimed by the parent (though then the parent log would usually show an exit command / unload_all causal chain)
    
    ----------------------------comparison---------------------------------------
    With OpenCode switched to GPT-5 nano (400k context), TDD proceeds normally, so the OpenCode/TDD workflow itself is fine; the real problem is that local llama.cpp segfaults when it takes on OpenCode's heavyweight requests (long messages + tools schema + compaction/retries). The process dies outright, which is why VRAM is cleared instantly and the upper layer reports Failed to read connection.
    
    Three suggested A/B experiments (a combined launch sketch follows experiment 3)
    Experiment 1: push the batch peak down to "absolutely stable" first. The current n_batch=2048 / n_ubatch=1024 is still aggressive for huge prefills with flash-attn. Change to:
    
    - --n-batch 512
    - --n-ubatch 128 (or 256)
    - leave everything else untouched
    If OpenCode TDD no longer crashes instantly with this, the problem is class A (peak load / VRAM / driver), not the network path.
    
    Experiment 2: disable the prompt cache (to rule out the checkpoint/cache path). For models like Qwen3.5 with recurrent/hybrid memory, the prompt-cache paths are more involved; OpenCode's prompts also change more between requests, so the hit rate is poor and more branches get exercised. Verify stability first:
    
    - --cache-ram 0
    If it is much more stable with the cache disabled, the trigger lies in the "cache/checkpoint path + large prompt" combination.
    
    Experiment 3: have OpenCode stop sending the huge tools payload (best way to isolate the trigger). The plugin list currently includes oh-my-openagent@latest, which significantly inflates the tools schema. Run an A/B:
    
    - temporarily empty the plugin list in the OpenCode config
    - rerun the same TDD start
    If it no longer crashes, the cause is essentially "tools/schema triggering the child crash or the peak load".
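  • Combined launch sketch for experiments 1 and 2

    A hedged baseline for the A/B runs above. Note that in upstream llama-server the batch sizes are spelled -b/--batch-size and -ub/--ubatch-size, so the --n-batch/--n-ubatch names above may need translating; the model path is a placeholder.

    # "Absolutely stable" baseline: small batches, prompt cache off, single slot.
    llama-server \
      -m /models/Qwen3.5-27B.gguf \
      --ctx-size 51200 \
      -b 512 \
      -ub 128 \
      --cache-ram 0 \
      --parallel 1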
  • Trae's suggested checks:

    sudo dmesg -T | egrep -i 'killed process|oom|out of memory|xid|nvrm|segfault' | tail -n 200
    [二 4月 21 16:52:25 2026] llama-server[975470]: segfault at 7e7471ffcff8 ip 0000650c5507055d sp 00007e7471ffd000 error 6 in llama-server[14655d,650c54f56000+169000] likely on CPU 25 (core 12, socket 0)
    A real crash recorded by the kernel: llama-server / libllama.so segfaulted on a memory access, which is why VRAM is freed instantly and the upper layer sees Failed to read connection.
    
    Several hard conclusions follow from this dmesg output:
    
    - Repeated segfault ... in libllama.so.0.0.8742 and segfault ... in llama-server entries mean the exit was not caused by nginx/frps/client cancellation; the process crashed (killed by SIGSEGV).
    - There is no typical OOM-killer line (Killed process ... / Out of memory against that PID), so the system did not kill it either.
    
    Binary / shared-library version mismatch (high probability)
     Crashes appear both "in libllama.so" and "in llama-server", with very repetitive addresses/offsets (the recurring ...b344, ...97e6), which strongly suggests an ABI/struct mismatch or a stale library being picked up by the dynamic linker. Common scenario: after updating llama.cpp, only llama-server or only libllama.so was replaced, or several copies of libllama.so exist on the system and the wrong one loads first.
     ---------------------this possibility was ruled out with the commands below-----------------
     ldd "$(which llama-server)" | grep -E 'libllama|ggml'
     ls -l /usr/local/bin/llama-server ~/llama.cpp/build/bin/llama-server
     sha256sum /usr/local/bin/llama-server ~/llama.cpp/build/bin/llama-server
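  • Backtrace capture sketch

    To turn the next segfault into an actionable stack trace rather than just a dmesg line, a hedged sketch that assumes a systemd host with systemd-coredump installed (otherwise set kernel.core_pattern and raise LimitCORE for the service); a non-stripped llama-server build gives far more readable frames.

    # Confirm the crash was captured, then open the newest core in gdb.
    coredumpctl list llama-server
    coredumpctl gdb llama-server
    # Inside gdb, `bt full` shows whether the crash sits in the prompt-cache /
    # checkpoint path or somewhere else entirely.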
  • Trae analysis

    Fundamentally, OC + OMO's "automated coding / multi-step TDD workflow" pushes the request intensity beyond the limits of both backends, but those limits surface differently:
    
    - On the commercial compute side: the limit is platform rate limiting / concurrency quotas, so you see too_many_requests (429); retries keep the run going, just more slowly.
    - On the local llama.cpp side: the limit is implementation stability on extreme-input paths; there is already kernel-level segfault evidence, i.e. a process-level crash, visible as VRAM emptying instantly and the connection becoming unreadable.
    So nothing was "written wrong"; this framework naturally produces:
    
    - extremely long contexts (tools schema + many files + history compaction)
    - high-frequency requests (multiple agents / repeated retries / concurrency)
    - long-lived streaming connections
    Actionable conclusions
    
    - For stability: force concurrency down to 1, trim the tools, cap the output, and reduce compaction frequency.
    - For speed: add rate limiting / backoff on the commercial side; on the local side, avoid the paths that already trigger the segfault (start with --cache-ram 0 --parallel 1, and roll back or switch the llama.cpp version if necessary). See the nginx limit sketch below.
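  • Front-side limit sketch (nginx)

    For the rate-limiting/backoff point above and the nginx workaround noted at the top, a hedged sketch of the standard directives involved; all values are placeholders, and whether to reject oversized bodies or only rate-limit depends on what the @ai-sdk/openai-compatible client tolerates.

    # Candidate directives for the vhost that fronts the llama.cpp endpoint:
    #   client_max_body_size 2m;      # reject oversized compaction bodies early
    #   limit_req_zone $binary_remote_addr zone=llm:10m rate=1r/s;
    #   limit_req zone=llm burst=4;   # smooth multi-agent retry bursts
    nginx -T | grep -nE 'client_max_body_size|limit_req'   # check what is set today
    sudo nginx -t && sudo systemctl reload nginx            # after editing the vhost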