Problem progress
- opencode was running TDD (test-driven development) against 设计.md; llama.cpp crashed instantly at ~65% VRAM usage, yet the parent reported `exited with status 0`, i.e. a normal exit rather than an obvious crash
- Comparison: chatting through open-webui works fine
- Comparison: tuning llama.cpp and opencode parameters did not help, it still crashes; log analysis shows a single opencode TDD run bursting to ~350k of context in one request
- Comparison: after switching to opencode zen's free compute (gpt-5 nano, 400k context), the task runs to completion
- New finding: `--flash-attn off` reportedly works around it (https://github.com/ggml-org/llama.cpp/issues/21336), but it did not fix things here
- Workaround: based on how @ai-sdk/openai-compatible behaves, add request limits in nginx
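A minimal sketch of what the nginx-side limit could look like, assuming the goal is to reject oversized request bodies (like the ~330k-char compaction requests seen later in the logs) with a clean 413 before they ever reach llama-server. The port, size cap, and location block are placeholders, not the actual deployed config:

```nginx
server {
    listen 8080;

    location /v1/chat/completions {
        # Cap the request body well below the ~350k-char bursts observed
        # in the opencode compaction requests; nginx returns 413 when exceeded.
        client_max_body_size 512k;

        proxy_pass http://127.0.0.1:39875;  # placeholder llama-server upstream
        proxy_read_timeout 600s;            # allow long streaming responses
        proxy_buffering off;                # stream tokens as they arrive
    }
}
```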
Logs - open-webui request (before tuning)
[42507] reasoning-budget: forced sequence complete, done
[42507] slot print_timing: id 1 | task 4161 |
[42507] prompt eval time = 7430.64 ms / 4378 tokens ( 1.70 ms per token, 589.18 tokens per second)
[42507] eval time = 232.68 ms / 18 tokens ( 12.93 ms per token, 77.36 tokens per second)
[42507] total time = 7663.32 ms / 4396 tokens
[42507] slot release: id 1 | task 4161 | stop processing: n_tokens = 4395, truncated = 0
[42507] srv update_slots: all slots are idle
[42507] srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv proxy_reques: proxying request to model Qwen3.5-27B on port 42507
[42507] srv params_from_: Chat format: peg-native
[42507] slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 702397047188
[42507] srv get_availabl: updating prompt cache
[42507] srv prompt_save: - saving prompt with length 4433, total state size = 87.249 MiB
[42507] srv load: - looking for better prompt, base f_keep = 0.002, sim = 0.002
[42507] srv update: - cache state: 2 prompts, 298.140 MiB (limits: 8192.000 MiB, 204800 tokens, 233719 est)
[42507] srv update: - prompt 0x5b05b9eccbf0: 4073 tokens, checkpoints: 0, 85.265 MiB
[42507] srv update: - prompt 0x5b05b77c6c70: 4433 tokens, checkpoints: 2, 212.876 MiB
[42507] srv get_availabl: prompt cache update took 141.89 ms
[42507] slot launch_slot_: id 0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
[42507] slot launch_slot_: id 0 | task 4182 | processing task, is_child = 0
[42507] slot update_slots: id 0 | task 4182 | new prompt, n_ctx_slot = 102400, n_keep = 0, task.n_tokens = 4251
[42507] slot update_slots: id 0 | task 4182 | n_past = 7, slot.prompt.tokens.size() = 4433, seq_id = 0, pos_min = 4432, n_swa = 0
[42507] slot update_slots: id 0 | task 4182 | Checking checkpoint with [4300, 4300] against 7...
[42507] slot update_slots: id 0 | task 4182 | Checking checkpoint with [3788, 3788] against 7...
[42507] slot update_slots: id 0 | task 4182 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
[42507] slot update_slots: id 0 | task 4182 | erased invalidated context checkpoint (pos_min = 3788, pos_max = 3788, n_tokens = 3789, n_swa = 0, pos_next = 0, size = 62.813 MiB)
[42507] slot update_slots: id 0 | task 4182 | erased invalidated context checkpoint (pos_min = 4300, pos_max = 4300, n_tokens = 4301, n_swa = 0, pos_next = 0, size = 62.813 MiB)
[42507] slot update_slots: id 0 | task 4182 | n_tokens = 0, memory_seq_rm [0, end)
[42507] slot update_slots: id 0 | task 4182 | prompt processing progress, n_tokens = 3735, batch.n_tokens = 3735, progress = 0.878617
[42507] slot update_slots: id 0 | task 4182 | n_tokens = 3735, memory_seq_rm [3735, end)
[42507] slot update_slots: id 0 | task 4182 | prompt processing progress, n_tokens = 4247, batch.n_tokens = 512, progress = 0.999059
[42507] slot create_check: id 0 | task 4182 | created context checkpoint 1 of 32 (pos_min = 3734, pos_max = 3734, n_tokens = 3735, size = 62.813 MiB)
[42507] slot update_slots: id 0 | task 4182 | n_tokens = 4247, memory_seq_rm [4247, end)
[42507] reasoning-budget: activated, budget=0 tokens
[42507] reasoning-budget: budget=0, forcing immediately
[42507] slot init_sampler: id 0 | task 4182 | init sampler, took 1.20 ms, tokens: text = 4251, total = 4251
[42507] slot update_slots: id 0 | task 4182 | prompt processing done, n_tokens = 4251, batch.n_tokens = 4
[42507] slot create_check: id 0 | task 4182 | created context checkpoint 2 of 32 (pos_min = 4246, pos_max = 4246, n_tokens = 4247, size = 62.813 MiB)
[42507] reasoning-budget: forced sequence complete, done
[42507] slot print_timing: id 0 | task 4182 |
[42507] prompt eval time = 7171.00 ms / 4251 tokens ( 1.69 ms per token, 592.80 tokens per second)
[42507] eval time = 230.19 ms / 18 tokens ( 12.79 ms per token, 78.20 tokens per second)
[42507] total time = 7401.19 ms / 4269 tokens
[42507] slot release: id 0 | task 4182 | stop processing: n_tokens = 4268, truncated = 0
[42507] srv update_slots: all slots are idle
[42507] srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200

Gemini analysis - Forced full re-processing
"forcing full prompt re-processing due to lack of cache data (likely due to SWA...)": ctx-size was set to 200k, but the Qwen 3.5 model uses SWA (sliding-window attention). The new request differed slightly from the previously cached checkpoints, so llama.cpp could not reuse the cache and erased the two old checkpoints ("erased invalidated context checkpoint"). Full recompute: it reprocessed all 4251 tokens. Cost: prompt eval time = 7171.00 ms, meaning the model spent about 7 seconds grinding away in the background before saying anything.

Gemini analysis - Reasoning budget conflict
"reasoning-budget: activated, budget=0 tokens" / "reasoning-budget: budget=0, forcing immediately": because the .ini sets reasoning-budget = 0 and enable_thinking: false, the model is cut off the moment it tries to "think". This saves VRAM and time. But if OpenCode expects the model to reason deeply through a tricky bug and it is strangled at 0 tokens, the quality of the generated code may drop (it is forced to emit results directly, with no internal reasoning pass).

Gemini analysis - Performance data
Throughput: 592.80 tokens per second (prompt eval). That is very fast and indicates the CUDA build and the two-GPU tensor-split are working well. Generation speed: 78.20 tokens per second (eval). For a 27B model, 78 tokens/s is excellent, effectively "instant reply" territory.
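As a quick cross-check, the tokens-per-second figures in the timing lines follow directly from the reported times and token counts (a small sketch; the numbers are copied from the log above):

```python
# Sanity-check the llama-server timing lines: tokens/s = n_tokens / seconds.
def tokens_per_second(total_ms: float, n_tokens: int) -> float:
    return n_tokens / (total_ms / 1000.0)

# Values from the "prompt eval time" and "eval time" lines of task 4182.
prompt_tps = tokens_per_second(7171.00, 4251)  # prefill phase
gen_tps = tokens_per_second(230.19, 18)        # generation phase

print(f"{prompt_tps:.2f}")  # 592.80, matching the log
print(f"{gen_tps:.2f}")     # 78.20, matching the log
```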
Logs - open-webui repeated request (50k context + ubatch 1024 + reasoning off)
------------------- open-webui second repeated request ------
srv proxy_reques: proxying request to model Qwen3.5-27B on port 39875
[39875] srv params_from_: Chat format: peg-native
[39875] slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 707218965739
[39875] srv get_availabl: updating prompt cache
[39875] srv prompt_save: - saving prompt with length 2580, total state size = 77.035 MiB
[39875] srv load: - looking for better prompt, base f_keep = 0.001, sim = 0.004
[39875] srv load: - found better prompt with f_keep = 0.348, sim = 1.000
[39875] srv update: - cache state: 1 prompts, 202.661 MiB (limits: 8192.000 MiB, 51200 tokens, 104289 est)
[39875] srv update: - prompt 0x650c85dfbff0: 2580 tokens, checkpoints: 2, 202.661 MiB
[39875] srv get_availabl: prompt cache update took 238.98 ms
[39875] slot launch_slot_: id 0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
[39875] slot launch_slot_: id 0 | task 1566 | processing task, is_child = 0
[39875] slot update_slots: id 0 | task 1566 | new prompt, n_ctx_slot = 51200, n_keep = 0, task.n_tokens = 789
[39875] slot update_slots: id 0 | task 1566 | n_past = 789, slot.prompt.tokens.size() = 2266, seq_id = 0, pos_min = 2265, n_swa = 0
[39875] slot update_slots: id 0 | task 1566 | Checking checkpoint with [784, 784] against 789...
[39875] slot update_slots: id 0 | task 1566 | restored context checkpoint (pos_min = 784, pos_max = 784, n_tokens = 785, n_past = 785, size = 62.813 MiB)
[39875] slot update_slots: id 0 | task 1566 | n_tokens = 785, memory_seq_rm [785, end)
[39875] slot init_sampler: id 0 | task 1566 | init sampler, took 0.29 ms, tokens: text = 789, total = 789
[39875] slot update_slots: id 0 | task 1566 | prompt processing done, n_tokens = 789, batch.n_tokens = 4
[39875] srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
[39875] slot print_timing: id 0 | task 1566 |
[39875] prompt eval time = 45.13 ms / 4 tokens ( 11.28 ms per token, 88.64 tokens per second)
[39875] eval time = 25533.13 ms / 1752 tokens ( 14.57 ms per token, 68.62 tokens per second)
[39875] total time = 25578.26 ms / 1756 tokens
[39875] slot release: id 0 | task 1566 | stop processing: n_tokens = 2540, truncated = 0
[39875] srv update_slots: all slots are idle
srv proxy_reques: proxying request to model Qwen3.5-27B on port 39875
[39875] srv params_from_: Chat format: peg-native
[39875] slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 707268226515
[39875] srv get_availabl: updating prompt cache
[39875] srv prompt_save: - saving prompt with length 2540, total state size = 76.815 MiB
[39875] srv load: - looking for better prompt, base f_keep = 0.001, sim = 0.001
[39875] srv load: - found better prompt with f_keep = 0.396, sim = 0.368
[39875] srv update: - cache state: 1 prompts, 139.628 MiB (limits: 8192.000 MiB, 51200 tokens, 149022 est)
[39875] srv update: - prompt 0x650c7fe5fda0: 2540 tokens, checkpoints: 1, 139.628 MiB
[39875] srv get_availabl: prompt cache update took 133.95 ms
[39875] slot launch_slot_: id 0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
[39875] slot launch_slot_: id 0 | task 3319 | processing task, is_child = 0
[39875] slot update_slots: id 0 | task 3319 | new prompt, n_ctx_slot = 51200, n_keep = 0, task.n_tokens = 2772
[39875] slot update_slots: id 0 | task 3319 | n_past = 1021, slot.prompt.tokens.size() = 2580, seq_id = 0, pos_min = 2579, n_swa = 0
[39875] slot update_slots: id 0 | task 3319 | Checking checkpoint with [2493, 2493] against 1021...
[39875] slot update_slots: id 0 | task 3319 | Checking checkpoint with [1469, 1469] against 1021...
[39875] slot update_slots: id 0 | task 3319 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
[39875] slot update_slots: id 0 | task 3319 | erased invalidated context checkpoint (pos_min = 1469, pos_max = 1469, n_tokens = 1470, n_swa = 0, pos_next = 0, size = 62.813 MiB)
[39875] slot update_slots: id 0 | task 3319 | erased invalidated context checkpoint (pos_min = 2493, pos_max = 2493, n_tokens = 2494, n_swa = 0, pos_next = 0, size = 62.813 MiB)
[39875] slot update_slots: id 0 | task 3319 | n_tokens = 0, memory_seq_rm [0, end)
[39875] slot update_slots: id 0 | task 3319 | prompt processing progress, n_tokens = 1744, batch.n_tokens = 1744, progress = 0.629149
[39875] slot update_slots: id 0 | task 3319 | n_tokens = 1744, memory_seq_rm [1744, end)
[39875] slot update_slots: id 0 | task 3319 | prompt processing progress, n_tokens = 2768, batch.n_tokens = 1024, progress = 0.998557
[39875] slot create_check: id 0 | task 3319 | created context checkpoint 1 of 32 (pos_min = 1743, pos_max = 1743, n_tokens = 1744, size = 62.813 MiB)
[39875] slot update_slots: id 0 | task 3319 | n_tokens = 2768, memory_seq_rm [2768, end)
[39875] slot init_sampler: id 0 | task 3319 | init sampler, took 0.88 ms, tokens: text = 2772, total = 2772
[39875] slot update_slots: id 0 | task 3319 | prompt processing done, n_tokens = 2772, batch.n_tokens = 4
[39875] slot create_check: id 0 | task 3319 | created context checkpoint 2 of 32 (pos_min = 2767, pos_max = 2767, n_tokens = 2768, size = 62.813 MiB)
[39875] slot print_timing: id 0 | task 3319 |
[39875] prompt eval time = 3682.90 ms / 2772 tokens ( 1.33 ms per token, 752.67 tokens per second)
[39875] eval time = 1357.64 ms / 98 tokens ( 13.85 ms per token, 72.18 tokens per second)
[39875] total time = 5040.54 ms / 2870 tokens
[39875] slot release: id 0 | task 3319 | stop processing: n_tokens = 2869, truncated = 0
[39875] srv update_slots: all slots are idle
[39875] srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200

Gemini analysis
1. Inference speed takes off: 752.67 tokens/s (peak throughput). Look at task 3319's prompt eval time: 2772 tokens processed in only 3.6 s, i.e. 1.33 ms per token. Conclusion: with reasoning off, the GPU's compute is fully available for prompt prefill. Even when a full recompute happens (cache miss), the wait drops from "painfully long" to "a short pause".
2. The cache-similarity puzzle: sim = 0.368. This is currently the core question: why is the similarity so low on a repeated request? In the second request the model is the same and the content overlaps heavily, yet llama.cpp reports only 36.8% similarity, which means n_past can only match 1021 tokens. Narrowing it down: with reasoning mode off, the only remaining variable is Open WebUI's prompt-assembly logic. Check whether WebUI injects shuffled context or a changing system prompt into the conversation history. If the similarity never improves, the text stream sent to the backend must differ substantially at the character level between requests.
3. SWA takes the blame again. The log once more says "forcing full prompt re-processing due to lack of cache data (likely due to SWA...)". Because the similarity is low, the n_past match position (1021) is completely offset from the stored checkpoint positions (1469, 2493). llama.cpp's policy is very conservative: if it cannot restore cleanly from a checkpoint, it would rather wipe the cache entirely (memory_seq_rm [0, end)) and brute-force recompute on the GPU.
💡 Status assessment: the machine is now in good shape. Generation (eval): stable around 72 t/s. Prompt processing (prompt eval): an impressive 752 t/s. VRAM: ctx-size has been relaxed to 51200 (50k) and runs stably.
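The "sim = 0.368" number is consistent with a prompt whose tokens diverge from the cache around position 1021 out of 2772 (1021/2772 ≈ 0.368). This is an illustrative sketch only, not llama.cpp's actual cache-matching code: it models reuse as bounded by the longest common token prefix, which is why one changed token early in the prompt (e.g. a timestamp in the system prompt) wastes almost everything cached after it:

```python
# Illustrative only (NOT llama.cpp's real similarity algorithm): approximate
# KV-cache reuse by the longest common token prefix between the cached prompt
# and the new one.
def common_prefix_ratio(old_tokens: list[int], new_tokens: list[int]) -> float:
    n = 0
    for a, b in zip(old_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n / max(len(new_tokens), 1)

base = list(range(2772))   # stand-in for the 2772-token prompt of task 3319
same = base.copy()
early_edit = base.copy()
early_edit[1021] = -1      # a single token changed around position 1021

print(common_prefix_ratio(base, same))        # 1.0 -> full reuse possible
print(common_prefix_ratio(base, early_edit))  # ~0.368 -> most of the cache is wasted
```

Under this toy model, the observed sim of 0.368 is exactly what a prefix break at token 1021 would produce, which supports the "something early in the assembled prompt changes between requests" hypothesis.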
Logs - opencode TDD programming task
opencode.json

"limit": { "context": 49152, "input": 32768, "output": 4096 }

llama-server logs
--------------- opencode tdd start ----------------
srv proxy_reques: proxying request to model Qwen3.5-27B on port 39875
srv operator(): http client error: Failed to read connection
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
---------------- end ----------------

opencode service logs
Trae analysis
The OpenCode logs also reveal a key difference: when OpenCode runs agent=compaction, it sends an extremely large request body to https://v1008080.atibm.com/v1/chat/completions, with very long message history in messages (the log already shows [..., 329672 chars omitted ...]) and max_tokens set to 4096. Open WebUI essentially never produces requests of this shape.

So the most likely root causes fall into two classes (the backend's last few log lines would settle it):
- A: the child process hits CUDA OOM / illegal access / an assert while handling the huge prompt / tools schema, and exits directly
- B: the child process is killed by the OS (OOM killer) or deliberately reaped by the parent (but then the parent log would normally show an exit command / unload_all causal chain)

---------------------------- Cross-check ---------------------------------------
After switching OpenCode to GPT-5 nano (400k context), TDD proceeds normally, which shows the OpenCode/TDD workflow itself is fine; the real problem is that local llama.cpp segfaults when it takes on these "heavyweight" OpenCode requests (long messages + tools schema + compaction/retries), and the process dies outright. That is why VRAM empties instantly and the upper layer reports Failed to read connection.

Three suggested A/B experiments:

Experiment 1: drop the batch peaks to "definitely safe" values. The current n_batch=2048 / n_ubatch=1024 is still aggressive for huge prefills with flash-attn. Change to:
- --n-batch 512
- --n-ubatch 128 (or 256)
- leave everything else alone
If OpenCode TDD no longer crashes instantly, it is class A (peaks / VRAM / driver), not the network path.

Experiment 2: disable the prompt cache (rule out the checkpoint/cache path). With recurrent/hybrid-memory models like Qwen3.5, the prompt-cache code paths are more complex; OpenCode's prompts also change a lot between requests, so the cache hit rate is poor and more branches get exercised. Verify stability with:
- --cache-ram 0
If it becomes much more stable, the trigger lies in the "cache/checkpoint path + large prompt" combination.

Experiment 3: run OpenCode without the huge tools payload (the cleanest isolation of the trigger). The oh-my-openagent@latest plugin significantly inflates the tools schema. A/B it:
- temporarily clear the plugin in the OpenCode config
- rerun the same TDD start
If it stops crashing, the trigger is essentially "tools/schema causing the child process to crash or spike".

Troubleshooting suggested by Trae:
sudo dmesg -T | egrep -i 'killed process|oom|out of memory|xid|nvrm|segfault' | tail -n 200

[二 4月 21 16:52:25 2026] llama-server[975470]: segfault at 7e7471ffcff8 ip 0000650c5507055d sp 00007e7471ffd000 error 6 in llama-server[14655d,650c54f56000+169000] likely on CPU 25 (core 12, socket 0)

The kernel has recorded the real crash: llama-server / libllama.so segfaults on a memory access, which is why VRAM is freed instantly and the upper layer sees Failed to read connection.

Hard conclusions from this dmesg output:
- Repeated segfault ... in libllama.so.0.0.8742 and segfault ... in llama-server entries mean the exits are not caused by nginx/frps/client cancellation; the process itself crashes (killed by SIGSEGV).
- There are no classic OOM-killer lines (Killed process ... / Out of memory against that PID), so the system did not kill it either.

Binary / shared-library version mismatch (quite likely): crashes land in both libllama.so and llama-server, with very repetitive addresses/offsets (...b344, ...97e6), which looks like an ABI/struct mismatch or a stale library being dynamically linked. Typical scenario: after updating llama.cpp, only llama-server or only libllama.so was replaced, or multiple copies of libllama.so exist on the system and the wrong one loads first.

--------------------- Ruled out with the following commands -----------------
ldd "$(which llama-server)" | grep -E 'libllama|ggml'
ls -l /usr/local/bin/llama-server ~/llama.cpp/build/bin/llama-server
sha256sum /usr/local/bin/llama-server ~/llama.cpp/build/bin/llama-server

Trae analysis
Fundamentally, the OC + OMO "autonomous coding / TDD multi-step workflow" pushes request intensity beyond the limits of both backends, but the limits show up differently:
- Hosted compute: the boundary is platform rate limiting / concurrency quotas, so you see too_many_requests (429); retries keep things moving, just more slowly.
- Local llama.cpp: the boundary is implementation stability on extreme inputs; there is now kernel-level segfault evidence, a process-level crash, seen as VRAM emptying instantly and unreadable connections.

So this is not a case of "doing it wrong"; the framework naturally produces:
- very long context (tools schema + many files + history compaction)
- high request frequency (multiple agents / multi-round retries / concurrency)
- long-lived streaming connections

Actionable conclusions:
- For stability: drop concurrency to 1, trim tools, cap output, compact less often.
- For speed: add rate limiting / backoff on the hosted side; on the local side, avoid the paths already known to segfault (start with --cache-ram 0 --parallel 1, and if necessary roll back or switch llama.cpp versions).
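The "rate-limit / back off" advice for the hosted side can be sketched as a simple exponential-backoff wrapper around the request call. This is a hypothetical sketch; `send_request` is a placeholder for whatever actually POSTs to /v1/chat/completions through @ai-sdk/openai-compatible or any other client:

```python
import random
import time

# Hypothetical sketch: retry on 429 (too_many_requests) with exponential
# backoff plus jitter; any other status is returned to the caller as-is.
def with_backoff(send_request, max_retries: int = 5, base_delay: float = 1.0):
    for attempt in range(max_retries):
        status, body = send_request()
        if status != 429:
            return status, body
        # Backoff: base, 2*base, 4*base, ... plus up to one extra base of jitter
        # so concurrent agents don't all retry at the same instant.
        time.sleep(base_delay * (2 ** attempt) + random.random() * base_delay)
    return status, body  # still rate-limited after all retries

# Usage with a fake backend that rate-limits twice, then succeeds:
calls = iter([(429, ""), (429, ""), (200, "ok")])
status, body = with_backoff(lambda: next(calls), base_delay=0.01)
print(status, body)  # 200 ok
```

The jitter matters for the multi-agent case described above: without it, several agents that were throttled together retry together and hit the quota again.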