Multiple deployment solutions for efficient MiniCPM-o model deployment across different environments.
📖 中文版本 | Back to Main
| Framework | Performance | Ease of Use | Scalability | Hardware | Best For |
|---|---|---|---|---|---|
| vLLM | High | Medium | High | GPU | Large-scale production services |
| SGLang | High | Medium | High | GPU | Structured generation tasks |
| Ollama | Medium | Excellent | Medium | CPU/GPU | Personal use, rapid prototyping |
| Llama.cpp | Medium | High | Medium | CPU | Edge devices, lightweight deployment |
vLLM (Very Large Language Model)
- High-throughput inference engine with PagedAttention memory management
- Dynamic batching support, OpenAI-compatible API
- Ideal for production API services and large-scale batch inference
- Recommended hardware: GPU with more than 18GB of VRAM
SGLang (Structured Generation Language)
- Structured generation optimization with efficient KV cache management
- Complex control flow and function calling optimization support
- Suitable for complex reasoning chains and structured text generation
- Recommended hardware: GPU with more than 18GB of VRAM
- One-click model management with simple command-line interface
- Automatic quantization support, REST API interface
- Perfect for personal development environments and research prototyping
- Hardware requirements: 8GB+ RAM, supports CPU/GPU
- Pure C++ implementation with CPU-optimized inference
- Multiple quantization support, lightweight deployment
- Ideal for mobile devices and edge computing
- Hardware requirements: 4GB+ RAM, various CPU architectures
- Production Environment (High Concurrency): vLLM - Best performance, optimal scalability
- Complex Reasoning Tasks: SGLang - Structured generation, function calling optimization
- Personal Development: Ollama - Simple to use, quick setup
- Edge Deployment: Llama.cpp - Lightweight, low power consumption
Numbers below are minimums for single-stream inference. Add headroom for higher batch sizes, longer contexts, multi-image inputs, or KV cache on the activation side. Vision tower + KV cache typically add ~2 GB on top of pure weight memory.
| Model | Params | Precision | Backend | GPU VRAM | CPU RAM |
|---|---|---|---|---|---|
| MiniCPM-V 4.6 | 9B | BF16 / FP16 | vLLM / SGLang | ≥ 20 GB | — |
| AWQ / GPTQ (int4) | vLLM / SGLang | ≥ 9 GB | — | ||
| BNB nf4 (int4) | transformers | ≥ 9 GB | — | ||
| GGUF Q4_K_M | llama.cpp / Ollama | ≥ 7 GB | ≥ 8 GB | ||
| GGUF Q8_0 | llama.cpp / Ollama | ≥ 11 GB | ≥ 12 GB | ||
| MiniCPM-V 4.5 | 8B | BF16 / FP16 | vLLM / SGLang | ≥ 18 GB | — |
| AWQ (int4) | vLLM | ≥ 8 GB | — | ||
| GGUF Q4_K_M | llama.cpp / Ollama | ≥ 6 GB | ≥ 7 GB | ||
| MiniCPM-o 4.5 | 9B | BF16 / FP16 | vLLM | ≥ 20 GB | — |
| GGUF Q4_K_M | llama.cpp | ≥ 7 GB | ≥ 8 GB | ||
| MiniCPM-V 4.0 | 4B | BF16 / FP16 | vLLM / SGLang | ≥ 10 GB | — |
| GGUF Q4_K_M | llama.cpp / Ollama | ≥ 3 GB | ≥ 4 GB |
For exact numbers measured on specific hardware (A100, RTX 4090, M-series Mac, …), see each model's HuggingFace card.
For the up-to-date upstream merge status across all MiniCPM-V & o versions
(including v4.6 with MiniCPMV4_6ForConditionalGeneration in transformers,
the in-flight vLLM / SGLang PRs, etc.), see the
Framework Support Matrix in the root README.