Daily maintenance commands
- HTTP service (a quick smoke test is sketched after this list)
  - Identify GPUs: llama-server (detected devices are printed at startup)
  - Models: curl http://127.0.0.1:8080/v1/models
  - Health: curl http://127.0.0.1:8080/health
  - Properties: curl http://127.0.0.1:8080/props
  - Metrics: curl http://127.0.0.1:8080/metrics
  - Slots (concurrency): curl http://127.0.0.1:8080/slots
- Benchmarking (a KV-cache comparison is sketched after this list)
  - llama-bench -ngl 99 -fa 1 -d 0,4096,8192,16384 -m [model]
  - Model Q4 + KV Q4 vs model Q8 + KV Q8: quality is nearly identical (the 0.42% PPL difference is within noise) and so is resource usage, so KV Q8 is recommended
  - Q4_K_M vs Q4_K_L: Q4_K_M is the best quality/speed balance, because with the larger quant --fit spills more expert layers to the CPU (+70%)
  - Omitting batch-size/ubatch-size vs -b 4096 -ub 4096: letting --fit size the VRAM for the expert layers beats manual tuning (+10%)
  - Sparse MoE vs dense: PPL -2.8%, 3B active parameters vs 27B, speed +1000%
  - Recommended: --no-mmap --fit on -fa on -t 20 --jinja -ctk q8_0 -ctv q8_0 (replace -ngl 999 --n-cpu-moe 24 -b 4096 -ub 4096 with --fit on)
- Process management
  - List processes: ps -ef | grep llama-server
  - Graceful shutdown: pkill -INT llama-server
  - Force kill: pkill -9 llama-server
- GPUs
  - List devices: llama-server --list-devices
  - NVIDIA status: nvidia-smi -l 1
  - NVIDIA utilization: nvidia-smi dmon -s uc -i 0
  - AMD status: radeontop
  - Advanced monitoring: nvitop (pip install nvitop)
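A minimal smoke test of the endpoints above, assuming a server already listening on 127.0.0.1:8080 (adjust host/port to your deployment); note that /metrics and /slots may require starting the server with --metrics / --slots:
# ---------------- quick smoke test (assumes 127.0.0.1:8080) ----------------
for ep in health props v1/models slots metrics; do
  echo "== /$ep =="
  curl -s --max-time 5 "http://127.0.0.1:8080/$ep" || echo "FAILED: /$ep"
done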
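To reproduce the KV-cache comparison from the benchmarking notes, something like the following should work, assuming a llama-bench build that supports the -ctk/-ctv cache-type flags ([model] is a placeholder):
llama-bench -m [model] -ngl 99 -fa 1 -d 0,4096 -ctk f16  -ctv f16   # baseline
llama-bench -m [model] -ngl 99 -fa 1 -d 0,4096 -ctk q8_0 -ctv q8_0  # recommended
llama-bench -m [model] -ngl 99 -fa 1 -d 0,4096 -ctk q4_0 -ctv q4_0  # smallest
Note that llama-bench only reports throughput; the PPL figures need a separate run with llama-perplexity.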
Model download (mirror handling for llama model downloads)
Browse https://huggingface.co/ to find models. On a USB wireless NIC, cap the rate to avoid dropped connections (--limit-rate=4m is 4 MB/s):
wget -c --limit-rate=4m https://huggingface.co/mradermacher/Qwen3.5-24B-A3B-Claude-Opus-Gemini-3.1-Pro-Reasoning-Distilled-heretic-GGUF/resolve/main/Qwen3.5-24B-A3B-Claude-Opus-Gemini-3.1-Pro-Reasoning-Distilled-heretic.Q4_K_M.gguf
wget -c --limit-rate=4m https://huggingface.co/tensorblock/starcoder2-3b-GGUF/resolve/main/starcoder2-3b-Q3_K_M.gguf
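On a flaky link, wget -c alone still stops at the first disconnect; a small retry loop (a sketch, with the URL as a placeholder) resumes until the file completes:
# ---------------- resume-until-done download (URL is a placeholder) ----------------
URL="https://huggingface.co/<repo>/resolve/main/<file>.gguf"
until wget -c --limit-rate=4m "$URL"; do
  echo "download interrupted, retrying in 10s..."
  sleep 10
done
Because of -c, each retry resumes from the partial file instead of starting over.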
Model loading
# ---------------- Models - list running ----------------
ps -ef | grep llama-server
# ---------------- Models - stop a specific model ----------------
pkill -INT -f "Qwen3.5-27B"
# ---------------- Models - shorten commands with symlinks ----------------
sudo mkdir -p /usr/local/bin
sudo ln -s /home/x99/llama.cpp/build/bin/llama-server /usr/local/bin/llama-server; sudo chmod +x /usr/local/bin/llama-server
sudo ln -s /home/x99/llama.cpp/build/bin/llama-bench /usr/local/bin/llama-bench; sudo chmod +x /usr/local/bin/llama-bench
sudo ln -s /home/x99/llama.cpp/build/bin/llama-batched-bench /usr/local/bin/llama-batched-bench; sudo chmod +x /usr/local/bin/llama-batched-bench
# ---------------- Models - load ----------------
openssl rand -base64 32   # generate an API key
llama-server -m ~/gguf/"Qwen3.5-Coder-python-4B.Q3_K_S.gguf" --port 8083 --host 0.0.0.0 --tensor-split 1,1 --parallel 2
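The openssl line above only generates a key; here is a sketch of actually enforcing it, assuming llama-server's --api-key flag and the OpenAI-style Bearer header (model path and port as above):
# ---------------- protect the server with the generated key (sketch) ----------------
API_KEY=$(openssl rand -base64 32)
echo "$API_KEY"   # keep a copy; clients need it
llama-server -m ~/gguf/"Qwen3.5-Coder-python-4B.Q3_K_S.gguf" --port 8083 --host 0.0.0.0 \
  --tensor-split 1,1 --parallel 2 --api-key "$API_KEY"
# requests without the key are rejected; clients send it as a Bearer token:
curl -s http://127.0.0.1:8083/v1/models -H "Authorization: Bearer $API_KEY"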