llama.cpp

Daily maintenance commands

  • HTTP server (readiness-check sketch below)
    • GPU detection: llama-server (the startup log lists the devices it found)
    • Models: curl http://127.0.0.1:8080/v1/models
    • Health: curl http://127.0.0.1:8080/health
    • Properties: curl http://127.0.0.1:8080/props
    • Metrics: curl http://127.0.0.1:8080/metrics (server must be started with --metrics)
    • Slots/concurrency: curl http://127.0.0.1:8080/slots (server must be started with --slots)
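    A minimal readiness-check sketch built from these endpoints (port 8080 and the jq dependency are assumptions):
    # poll /health until the server is ready, then list the loaded models
    until curl -sf http://127.0.0.1:8080/health >/dev/null; do
      sleep 1   # /health returns 503 while the model is still loading
    done
    curl -s http://127.0.0.1:8080/v1/models | jq '.data[].id'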
  • Benchmarking (perplexity sketch after this list)
    • llama-bench -ngl 99 -fa 1 -d 0,4096,8192,16384 -m [model]
      • Q4 model + Q4 KV cache vs Q8 model + Q8 KV cache: quality is nearly identical (PPL difference within noise, 0.42%; resource cost about the same; Q8 KV cache recommended)
      • Q4_K_M vs Q4_K_L: Q4_K_M is the best quality/speed trade-off, because the larger quant makes --fit spill more expert layers to the CPU (+70%)
      • Omitting batch-size/ubatch-size vs -b 4096 -ub 4096: the defaults let --fit allocate optimal VRAM for the expert layers, faster than manual tuning (+10%)
      • Sparse MoE vs dense (PPL -2.8%, 3B activated params vs 27B, speed +1000%)
      • Recommended: --no-mmap --fit on -fa on -t 20 --jinja -ctk q8_0 -ctv q8_0 (remove -ngl 999 --n-cpu-moe 24 -b 4096 -ub 4096; --fit on replaces them)
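    To verify PPL claims like the ones above, llama.cpp also ships llama-perplexity; a sketch, assuming a local wiki.test.raw test file and these example model paths:
    # lower PPL = better; run once per quant and compare
    llama-perplexity -m ~/gguf/model-Q4_K_M.gguf -f wiki.test.raw -ngl 99
    llama-perplexity -m ~/gguf/model-Q8_0.gguf   -f wiki.test.raw -ngl 99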
  • Process management (graceful-stop sketch after this list)
    • List processes: ps -ef | grep llama-server
    • Graceful stop: pkill -INT llama-server
    • Force kill: pkill -9 llama-server
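    A sketch combining the two stop commands: try SIGINT first for a clean shutdown, escalate to SIGKILL only if the server hangs (the 10 s timeout is an arbitrary choice):
    pkill -INT llama-server
    for i in $(seq 1 10); do
      pgrep llama-server >/dev/null || break   # exited cleanly
      sleep 1
    done
    pgrep llama-server >/dev/null && pkill -9 llama-server   # still alive: force kill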
  • GPUs (logging one-liner after this list)
    • Device list: llama-server --list-devices
    • NVIDIA status: nvidia-smi -l 1
    • NVIDIA utilization: nvidia-smi dmon -s uc -i 0
    • AMD status: radeontop
    • Full-featured monitor: nvitop (pip install nvitop)
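    For logging over time instead of watching interactively, nvidia-smi has a CSV query mode (the field selection here is just an example):
    # sample GPU utilization and VRAM every second, append to a log file
    nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 1 >> gpu.log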
  • Model download (mirror handling for model downloads; mirror example below)

    Search:
    https://huggingface.co/

    Download (on a USB Wi-Fi adapter, rate-limit to avoid drops; --limit-rate=4m = 4 MB/s)
    wget -c --limit-rate=4m https://huggingface.co/mradermacher/Qwen3.5-24B-A3B-Claude-Opus-Gemini-3.1-Pro-Reasoning-Distilled-heretic-GGUF/resolve/main/Qwen3.5-24B-A3B-Claude-Opus-Gemini-3.1-Pro-Reasoning-Distilled-heretic.Q4_K_M.gguf
    
    wget -c --limit-rate=4m https://huggingface.co/tensorblock/starcoder2-3b-GGUF/resolve/main/starcoder2-3b-Q3_K_M.gguf
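
    If huggingface.co is unreachable, the same path usually works through a mirror; hf-mirror.com is one common choice (assumption: it carries the repo you need):
    # swap only the host, keep the repo path unchanged
    wget -c --limit-rate=4m https://hf-mirror.com/tensorblock/starcoder2-3b-GGUF/resolve/main/starcoder2-3b-Q3_K_M.gguf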
  • Model loading

    # ---------------- Models - list running ----------------
    ps -ef | grep llama-server
    
    # ---------------- Models - stop a specific one ----------------
    pkill -INT -f "Qwen3.5-27B"
    
    # ---------------- Simplify model commands (symlink the binaries) ----------------
    sudo mkdir -p /usr/local/bin
    # ln -sf is idempotent on re-runs; a separate chmod is unnecessary since the built binaries are already executable
    sudo ln -sf /home/x99/llama.cpp/build/bin/llama-server /usr/local/bin/llama-server
    sudo ln -sf /home/x99/llama.cpp/build/bin/llama-bench /usr/local/bin/llama-bench
    sudo ln -sf /home/x99/llama.cpp/build/bin/llama-batched-bench /usr/local/bin/llama-batched-bench
    
    # ---------------- Models - load ----------------
    openssl rand -base64 32  # generate an API key
    llama-server -m ~/gguf/"Qwen3.5-Coder-python-4B.Q3_K_S.gguf" --port 8083 --host 0.0.0.0 --tensor-split 1,1 --parallel 2
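
    The key generated above is not wired in yet; llama-server accepts it via --api-key, and clients must then send it as a bearer token. A sketch, capturing the key in a shell variable:
    API_KEY=$(openssl rand -base64 32)
    llama-server -m ~/gguf/"Qwen3.5-Coder-python-4B.Q3_K_S.gguf" --port 8083 --host 0.0.0.0 --api-key "$API_KEY"
    # unauthenticated requests now get 401; include the key:
    curl -H "Authorization: Bearer $API_KEY" http://127.0.0.1:8083/v1/models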

    llama-server parameter reference

    llama.cpp Windows deployment

    llama.cpp Ubuntu deployment