llama.cpp

Routine maintenance commands

  • HTTP server
    • Start (detected GPUs are printed at startup): llama-server
    • Models: curl http://127.0.0.1:8080/v1/models
    • Health: curl http://127.0.0.1:8080/health
    • Properties: curl http://127.0.0.1:8080/props
    • Metrics: curl http://127.0.0.1:8080/metrics
    • Slots (concurrency): curl http://127.0.0.1:8080/slots
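For scripting against these endpoints, the /health response can be checked without pulling in jq. This is a minimal sketch assuming the simple `{"status":"ok"}` shape llama-server returns; in real scripts prefer `jq -r .status`:

```shell
# Extract the "status" field from a /health JSON response without jq.
# Assumes the flat {"status":"..."} shape; a hand-rolled helper, not llama.cpp API.
parse_health() {
  printf '%s' "$1" | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Against a live server:
#   parse_health "$(curl -s http://127.0.0.1:8080/health)"
parse_health '{"status":"ok"}'
```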
  • Benchmarking
    • llama-bench -ngl 99 -fa 1 -d 0,4096,8192,16384 -m [model]
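When benchmarking several GGUF files, the command above can be wrapped in a small loop. `bench_all` is a local convenience helper (not part of llama.cpp); `DRY_RUN=1` only echoes the commands so the sweep can be reviewed first:

```shell
# Run the same llama-bench sweep (-d context depths) over multiple models.
# Echoes each command; set DRY_RUN=1 to print without executing.
bench_all() {
  local m
  for m in "$@"; do
    echo llama-bench -ngl 99 -fa 1 -d 0,4096,8192,16384 -m "$m"
    [ -n "$DRY_RUN" ] || llama-bench -ngl 99 -fa 1 -d 0,4096,8192,16384 -m "$m"
  done
}

DRY_RUN=1 bench_all a.gguf b.gguf
```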
  • Process management
    • List processes: ps -ef | grep llama-server
    • Graceful shutdown (SIGINT): pkill -INT llama-server
    • Force kill: pkill -9 llama-server
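The two pkill forms combine naturally into a graceful-then-forceful shutdown: send SIGINT first so llama-server can clean up, and escalate to SIGKILL only if the process outlives a grace period. A sketch (the helper name and grace period are local choices):

```shell
# Stop processes matching a pattern: SIGINT first, SIGKILL after a grace period.
stop_llama() {
  local pattern="$1" grace="${2:-5}" i
  pkill -INT -f "$pattern" 2>/dev/null || return 0   # nothing matched: already down
  for i in $(seq "$grace"); do
    pgrep -f "$pattern" >/dev/null || return 0       # exited cleanly
    sleep 1
  done
  pkill -9 -f "$pattern"                             # still alive: force kill
}
```

Usage: `stop_llama llama-server` or `stop_llama "Qwen3.5-27B"` to target one model's server by its command line.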
  • GPUs
    • List devices: llama-server --list-devices
    • NVIDIA status: nvidia-smi -l 1
    • AMD status: radeontop
    • Richer monitoring: nvitop (pip install nvitop)
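For logging rather than watching, nvidia-smi's CSV query mode is handy (`--query-gpu=utilization.gpu,memory.used --format=csv,noheader,nounits`). A sketch of parsing one such line; the formatting helper is our own, not an nvidia-smi feature:

```shell
# Format one CSV line from:
#   nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader,nounits
# into a compact "util% usedMiB" string.
parse_smi() {
  echo "$1" | awk -F', *' '{printf "%s%% %sMiB\n", $1, $2}'
}

parse_smi "37, 10240"   # sample line as nvidia-smi would emit it
```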
  • Model downloads (mirror handling for llama model downloads)

    Search:
    https://huggingface.co/

    Download (over a USB Wi-Fi adapter, rate-limit to avoid drops; --limit-rate=4m caps at 4 MB/s)
    wget -c --limit-rate=4m https://huggingface.co/mradermacher/Qwen3.5-24B-A3B-Claude-Opus-Gemini-3.1-Pro-Reasoning-Distilled-heretic-GGUF/resolve/main/Qwen3.5-24B-A3B-Claude-Opus-Gemini-3.1-Pro-Reasoning-Distilled-heretic.Q4_K_M.gguf
    
    wget -c --limit-rate=4m https://huggingface.co/tensorblock/starcoder2-3b-GGUF/resolve/main/starcoder2-3b-Q3_K_M.gguf
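The two wget invocations differ only in URL, so they can be wrapped once: `-c` resumes a partial file after a dropped connection and `--limit-rate` keeps the flaky link stable. `hf_get` is a local helper (not an official tool); `DRY_RUN=1` skips the actual transfer:

```shell
# Resumable, rate-limited download; default cap 4 MB/s, override per call.
hf_get() {
  local url="$1" rate="${2:-4m}"
  echo "fetching: $url (limit ${rate})"
  [ -n "$DRY_RUN" ] || wget -c --limit-rate="$rate" "$url"
}
```

Usage: `hf_get https://huggingface.co/.../model.Q4_K_M.gguf` or `hf_get <url> 2m` for a tighter cap.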
  • Model loading

    # ---------------- Models - list running ----------------
    ps -ef | grep llama-server
    
    # ---------------- Models - stop a specific model ----------------
    pkill -INT -f "Qwen3.5-27B"
    
    # ---------------- Command shortcuts (symlinks) ----------------
    sudo mkdir -p /usr/local/bin
    sudo ln -sf /home/x99/llama.cpp/build/bin/llama-server /usr/local/bin/llama-server
    sudo ln -sf /home/x99/llama.cpp/build/bin/llama-bench /usr/local/bin/llama-bench
    sudo ln -sf /home/x99/llama.cpp/build/bin/llama-batched-bench /usr/local/bin/llama-batched-bench
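The per-binary symlink commands above can be collapsed into one loop. A sketch; `link_llama_bins` is a local helper, and the paths are the ones from these notes (run with sudo when the destination is system-wide):

```shell
# Symlink each llama.cpp binary from the build directory into a bin directory.
# -f makes re-runs idempotent (replaces a stale link instead of failing).
link_llama_bins() {
  local src="$1" dest="$2" bin
  for bin in llama-server llama-bench llama-batched-bench; do
    ln -sf "$src/$bin" "$dest/$bin"
  done
}

# System-wide (as in the notes above):
#   sudo bash -c 'link_llama_bins /home/x99/llama.cpp/build/bin /usr/local/bin'
```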
    
    # ---------------- Models - load ----------------
    openssl rand -base64 32  # generate an API key (pass it to llama-server via --api-key)
    llama-server -m ~/gguf/"Qwen3.5-Coder-python-4B.Q3_K_S.gguf" --port 8083 --host 0.0.0.0 --tensor-split 1,1 --parallel 2
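After launching, the server rejects requests until the model finishes loading, so a startup script usually polls /health before proceeding. A sketch (helper name, port, and retry count are local choices):

```shell
# Poll http://127.0.0.1:<port>/health until the server answers or we give up.
# curl -sf exits non-zero while the model is still loading or the port is closed.
wait_ready() {
  local port="$1" tries="${2:-30}" i
  for i in $(seq "$tries"); do
    if curl -sf "http://127.0.0.1:${port}/health" >/dev/null 2>&1; then
      echo ready; return 0
    fi
    sleep 1
  done
  echo timeout; return 1
}
```

Usage: `llama-server -m model.gguf --port 8083 & wait_ready 8083 && echo "serving"`.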

    llama-server parameter reference

    llama.cpp Windows deployment

    llama.cpp Ubuntu deployment