An inference service built in Rust for deploying Large Language Models such as Mistral 7B. This implementation combines efficient RAG capabilities with comprehensive performance telemetry, providing deep insight into token generation latency and inference characteristics. The service is designed for both practical deployment and performance analysis of LLMs.
This project implements a high-performance inference service in Rust for running Large Language Models locally. Key features include:
- Streaming Inference: Real-time token generation with configurable parameters
- RAG Integration: Enhance responses with relevant context from your knowledge base
- Comprehensive Metrics: Detailed performance monitoring and analysis capabilities
- GPU Acceleration: Optional GPU offloading for improved performance
- OpenAI-Compatible API: Drop-in replacement for applications using OpenAI's chat completions API
The service is built on three main components:
- Inference Engine
  - Built on llama.cpp for optimal performance
  - Configurable thread allocation and GPU offloading
  - Support for multiple model architectures
- Knowledge Base
  - Vector-based semantic search
  - Efficient embedding generation and storage
  - Flexible document ingestion pipeline
- Metrics Collection
  - Fine-grained performance monitoring
  - Prometheus integration
  - Resource utilization tracking
This is for testing only; use at your own risk! The main purpose is to learn hands-on how this stuff works and to instrument and characterize the behaviour of LLMs.
The following key metrics are exposed through Prometheus:
- token_creation_duration - Histogram of the time taken to generate each token.
- inference_response_duration - Histogram of the time taken to generate the full response (includes tokenization and embedding additional context).
- embedding_duration - Histogram of the time taken to create a vector representation of the query and look up contextual information in the knowledge base.
- TODO: add more, such as time to tokenize, time to read from the KV store, etc.; also check whether tracing can be added.
Here is an example dashboard that captures the metrics described above, as well as some host metrics such as power and CPU utilisation:
The service is optimized for:
- Memory Efficiency: Careful management of model loading and unloading
- CPU Utilization: Thread-pool based processing with configurable worker counts
- GPU Acceleration: Optional offloading of computation-heavy layers
- Response Latency: Streaming responses to minimize time-to-first-token
Typical performance metrics on consumer hardware (with a 6-core CPU):
- Time to first token: 100-300ms
- Token generation speed: 20-30 tokens/second
- Memory usage: 5-8GB for base model
You will need to download a model and an embedding model:
- The mistral-7b-instruct-v0.2.Q4_K_M.gguf model seems to give reasonably good results. Otherwise, give Phi-3.5-mini-instruct-Q4_K_S.gguf a try.
- The bge-base-en-v1.5.Q8_0.gguf embedding model seems to work well.

It is best to put both files into a model/ folder as model.gguf and embed.gguf.
This service can be configured through environment variables. The following variables are supported:
| Environment variable | Description | Example/Default |
|---|---|---|
| KNOWLEDGE_BASE_PATH | Path to directory containing knowledge base documents | kb_data |
| EMBEDDING_MODEL | Full path of the embedding model to use. | model/embed.gguf |
| HTTP_ADDRESS | Bind address to use. | 127.0.0.1:8080 |
| HTTP_WORKERS | Number of threads to run with the HTTP server. | 1 |
| MAIN_GPU | Identifies which GPU we should use. | 0 |
| MODEL_GPU_LAYERS | Number of layers to offload to GPU. | 0 |
| MODEL_MAX_TOKEN | Maximum number of tokens to generate. | 128 |
| MODEL_PATH | Full path to the gguf file of the model. | model/model.gguf |
| MODEL_PROMPT_TEMPLATE | A prompt template - should contain {context} and {query} elements. | Mistral prompt |
| MODEL_THREADS | Number of threads we'll use for inference. | 6 |
| PROMETHEUS_HTTP_ADDRESS | Bind address to use for prometheus. | 127.0.0.1:8081 |
Other environment variables such as RUST_LOG can also be used.
The service implements an OpenAI-compatible API endpoint. Example request:
```shell
curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "stream": true,
    "model": "rust-inference-service",
    "messages": [{
      "role": "user",
      "content": "Tell me about Dr. Ada Lovelace II'\''s quantum debugging"
    }]
  }'
```
You can test the RAG capabilities by asking about Dr. Ada Lovelace II and her revolutionary Schrödinger Debugger. This example demonstrates how the system retrieves and incorporates technical knowledge with a touch of quantum computing humor.
Minimum recommended specifications:
- CPU: 6+ cores
- RAM: 16GB
- Storage: 10GB for models
- Optional: NVIDIA GPU with 8GB+ VRAM
Reference deployment configuration available in k8s_deployment.yaml.
```shell
kubectl apply -f k8s_deployment.yaml
```
Note: Customize image references and storage paths according to your environment.
- The service does not implement authentication by default
- Consider running behind a reverse proxy for TLS termination
- Implement rate limiting for production deployments
- Monitor resource usage to prevent DoS attacks
Execute tests sequentially to prevent model initialization conflicts:
```shell
cargo test --lib api::tests -- --test-threads=1
```
Contributions are welcome! Please consider:
- Adding support for new model architectures
- Improving metrics collection
- Enhancing RAG capabilities
- Optimizing performance
- Adding new deployment examples
- Only one model can be loaded at a time
- No model switching without a restart
- Limited to models compatible with llama.cpp
- RAG implementation requires pre-processing of documents
The following links may be useful:
