An inference service built in Rust for deploying Large Language Models such as Mistral 7B. This implementation combines efficient RAG capabilities with comprehensive performance telemetry, providing deep insight into token generation latency and inference characteristics. The service is designed for both practical deployment and performance analysis of LLMs.
This project implements a high-performance inference service in Rust for running Large Language Models locally. Key features include:
- Streaming Inference: Real-time token generation with configurable parameters
- RAG Integration: Enhance responses with relevant context from your knowledge base
- Comprehensive Metrics: Detailed performance monitoring and analysis capabilities
- GPU Acceleration: Optional GPU offloading for improved performance
- OpenAI-Compatible API: Drop-in replacement for applications using OpenAI's chat completions API
The service is built on three main components:
- Inference Engine
  - Built on llama.cpp for optimal performance
  - Configurable thread allocation and GPU offloading
  - Support for multiple model architectures
- Knowledge Base
  - Vector-based semantic search
  - Efficient embedding generation and storage
  - Flexible document ingestion pipeline
- Metrics Collection
  - Fine-grained performance monitoring
  - Prometheus integration
  - Resource utilization tracking
This is for testing only; use at your own risk! The main purpose is to learn hands-on how this stuff works and to instrument and characterize the behaviour of LLMs.
The following key metrics are exposed through Prometheus:
- token_creation_duration - Histogram of the time taken to generate each token.
- inference_response_duration - Histogram of the time taken to generate the full response (includes tokenization and embedding additional context).
- embedding_duration - Histogram of the time taken to create a vector representation of the query and look up contextual information in the knowledge base.
- TODO: add more, such as time to tokenize, time to read from the KV store, etc.; also check whether tracing can be added.
Here is an example dashboard that captures the metrics described above, as well as some host metrics such as power and CPU utilisation:
The service is optimized for:
- Memory Efficiency: Careful management of model loading and unloading
- CPU Utilization: Thread-pool based processing with configurable worker counts
- GPU Acceleration: Optional offloading of computation-heavy layers
- Response Latency: Streaming responses to minimize time-to-first-token
Typical performance metrics on consumer hardware (with a 6-core CPU):
- Time to first token: 100-300ms
- Token generation speed: 20-30 tokens/second
- Memory usage: 5-8GB for base model
You will need to download a model and an embedding model:
- The mistral-7b-instruct-v0.2.Q4_K_M.gguf model seems to give reasonably good results. Otherwise, give Phi-3.5-mini-instruct-Q4_K_S.gguf a try.
- The bge-base-en-v1.5.Q8_0.gguf embedding model seems to work well.

It is best to put both files into a model/ folder as model.gguf and embed.gguf.
This service can be configured through environment variables. The following variables are supported:
| Environment variable | Description | Example/Default |
|---|---|---|
| KNOWLEDGE_BASE_PATH | Path to directory containing knowledge base documents | kb_data |
| EMBEDDING_MODEL | Full path of the embedding model to use. | model/embed.gguf |
| HTTP_ADDRESS | Bind address to use. | 127.0.0.1:8080 |
| HTTP_WORKERS | Number of threads to run with the HTTP server. | 1 |
| MAIN_GPU | Identifies which GPU we should use. | 0 |
| MODEL_GPU_LAYERS | Number of layers to offload to GPU. | 0 |
| MODEL_MAX_TOKEN | Maximum number of tokens to generate. | 128 |
| MODEL_PATH | Full path to the gguf file of the model. | model/model.gguf |
| MODEL_PROMPT_TEMPLATE | A prompt template - should contain {context} and {query} elements. | Mistral prompt |
| MODEL_THREADS | Number of threads we'll use for inference. | 6 |
| PROMETHEUS_HTTP_ADDRESS | Bind address to use for prometheus. | 127.0.0.1:8081 |
Other environment variables such as RUST_LOG can also be used.
The service implements an OpenAI-compatible API endpoint. Example request:
```shell
curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "stream": true,
    "model": "rust-inference-service",
    "messages": [{
      "role": "user",
      "content": "Tell me about Dr. Ada Lovelace II'\''s quantum debugging"
    }]
  }'
```
You can test the RAG capabilities by asking about Dr. Ada Lovelace II and her revolutionary Schrödinger Debugger. This example demonstrates how the system retrieves and incorporates technical knowledge with a touch of quantum computing humor.
Minimum recommended specifications:
- CPU: 6+ cores
- RAM: 16GB
- Storage: 10GB for models
- Optional: NVIDIA GPU with 8GB+ VRAM
Reference deployment configuration available in k8s_deployment.yaml.
```shell
kubectl apply -f k8s_deployment.yaml
```
Note: Customize image references and storage paths according to your environment.
- The service does not implement authentication by default
- Consider running behind a reverse proxy for TLS termination
- Implement rate limiting for production deployments
- Monitor resource usage to prevent DoS attacks
Execute tests sequentially to prevent model initialization conflicts:
```shell
cargo test --lib api::tests -- --test-threads=1
```
Contributions are welcome! Please consider:
- Adding support for new model architectures
- Improving metrics collection
- Enhancing RAG capabilities
- Optimizing performance
- Adding new deployment examples
- Only one model can be loaded at a time
- No model switching without a restart
- Limited to models compatible with llama.cpp
- RAG implementation requires pre-processing of documents
The following links may be useful:
