Daily maintenance commands
- HTTP service
- Identify the GPU: llama-server (prints detected devices at startup)
- Model list: curl http://127.0.0.1:8080/v1/models
- Health: curl http://127.0.0.1:8080/health
- Server properties: curl http://127.0.0.1:8080/props
- Metrics: curl http://127.0.0.1:8080/metrics (the server must be started with --metrics)
- Slot/concurrency status: curl http://127.0.0.1:8080/slots
- Performance testing
llama-bench -ngl 99 -fa 1 -d 0,4096,8192,16384 -m [model]
- Process management
- Find the process: ps -ef | grep llama-server
- Graceful shutdown (SIGINT): pkill -INT llama-server
- Force kill: pkill -9 llama-server
- GPUs
- List devices: llama-server --list-devices
- NVIDIA status: nvidia-smi -l 1 (refreshes every second)
- AMD status: radeontop
- Full-featured monitor: nvitop (pip install nvitop)
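The /health endpoint above returns HTTP 200 once the model is loaded and 503 while it is still loading, so scripts can gate on server readiness. A minimal sketch, assuming the default host/port from these notes; the `classify_health` helper name is my own, not part of llama.cpp:

```shell
# classify_health: map the HTTP status code returned by llama-server's
# /health endpoint to a short label. 200 = ready, 503 = still loading.
classify_health() {
  case "$1" in
    200) echo ok ;;
    503) echo loading ;;
    *)   echo error ;;
  esac
}

# Query a running server (adjust host/port to match your instance):
# classify_health "$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8080/health)"
```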
Model downloads (handling mirrors for llama model downloads)
Browse https://huggingface.co/ for models, then download with wget. On a USB Wi-Fi adapter, cap the rate to avoid dropped links (--limit-rate=4m = 4 MB/s):
wget -c --limit-rate=4m https://huggingface.co/mradermacher/Qwen3.5-24B-A3B-Claude-Opus-Gemini-3.1-Pro-Reasoning-Distilled-heretic-GGUF/resolve/main/Qwen3.5-24B-A3B-Claude-Opus-Gemini-3.1-Pro-Reasoning-Distilled-heretic.Q4_K_M.gguf
wget -c --limit-rate=4m https://huggingface.co/tensorblock/starcoder2-3b-GGUF/resolve/main/starcoder2-3b-Q3_K_M.gguf
Model loading
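The wget invocations above share the same URL shape, so a small helper can build the Hugging Face "resolve" URL from a repo name and file name. A sketch under that assumption; `hf_url` and `hf_get` are hypothetical helper names, not Hugging Face tooling:

```shell
# hf_url: build a Hugging Face download URL from repo path and file name.
hf_url() {
  echo "https://huggingface.co/$1/resolve/main/$2"
}

# hf_get: resume-capable (-c), rate-limited download; RATE defaults to 4m
# (4 MB/s) to keep a USB Wi-Fi adapter from dropping the link.
hf_get() {
  wget -c --limit-rate="${RATE:-4m}" "$(hf_url "$1" "$2")"
}

# Usage:
# hf_get tensorblock/starcoder2-3b-GGUF starcoder2-3b-Q3_K_M.gguf
```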
# ---------------- Models - list running servers ----------------
ps -ef | grep llama-server

# ---------------- Models - stop one model ----------------
pkill -INT -f "Qwen3.5-27B"

# ---------------- Simplify model commands with symlinks ----------------
sudo mkdir -p /usr/local/bin
sudo ln -s /home/x99/llama.cpp/build/bin/llama-server /usr/local/bin/llama-server; sudo chmod +x /usr/local/bin/llama-server
sudo ln -s /home/x99/llama.cpp/build/bin/llama-bench /usr/local/bin/llama-bench; sudo chmod +x /usr/local/bin/llama-bench
sudo ln -s /home/x99/llama.cpp/build/bin/llama-batched-bench /usr/local/bin/llama-batched-bench; sudo chmod +x /usr/local/bin/llama-batched-bench

# ---------------- Models - load ----------------
openssl rand -base64 32    # generate an API key
llama-server -m ~/gguf/"Qwen3.5-Coder-python-4B.Q3_K_S.gguf" --port 8083 --host 0.0.0.0 --tensor-split 1,1 --parallel 2
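The key generation and load steps above can be combined into one launch function that passes the fresh key to the server via llama-server's --api-key flag. A sketch using the model path and flags from these notes; `gen_apikey` and `start_qwen` are hypothetical names, and nothing runs until you call `start_qwen`:

```shell
# gen_apikey: 32 random bytes, base64-encoded (44 characters).
gen_apikey() {
  openssl rand -base64 32
}

# start_qwen: launch the model from these notes with a fresh API key.
# Flags mirror the command above: split across 2 GPUs, 2 parallel slots.
start_qwen() {
  key=$(gen_apikey)
  echo "API key: $key"
  llama-server -m ~/gguf/"Qwen3.5-Coder-python-4B.Q3_K_S.gguf" \
    --port 8083 --host 0.0.0.0 \
    --tensor-split 1,1 --parallel 2 \
    --api-key "$key"
}
```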