LLM Inference Service with Instrumentation

An inference service built in Rust for deploying Large Language Models such as Mistral 7B. This implementation combines Retrieval-Augmented Generation (RAG) capabilities with comprehensive performance telemetry, providing deep insight into token generation latency and inference characteristics. The service is designed for both practical deployment and performance analysis of LLMs.

Overview

This project implements a high-performance inference service in Rust for running Large Language Models locally. Key features include:

  • Streaming Inference: Real-time token generation with configurable parameters
  • RAG Integration: Enhance responses with relevant context from your knowledge base
  • Comprehensive Metrics: Detailed performance monitoring and analysis capabilities
  • GPU Acceleration: Optional GPU offloading for improved performance
  • OpenAI-Compatible API: Drop-in replacement for applications using OpenAI's chat completions API

Architecture

The service is built on three main components:

  1. Inference Engine

    • Built on llama.cpp for optimal performance
    • Configurable thread allocation and GPU offloading
    • Support for multiple model architectures
  2. Knowledge Base

    • Vector-based semantic search
    • Efficient embedding generation and storage
    • Flexible document ingestion pipeline
  3. Metrics Collection

    • Fine-grained performance monitoring
    • Prometheus integration
    • Resource utilization tracking
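
The knowledge-base lookup described above can be sketched as a brute-force cosine-similarity search over pre-computed document embeddings. This is a minimal illustration only; the function and type names are hypothetical, not the service's actual API:

```rust
// Hypothetical sketch of vector-based semantic search: score every stored
// document embedding against the query embedding and return the best matches.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Return the indices of the `k` documents closest to `query`.
fn top_k(query: &[f32], docs: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (i, cosine_similarity(query, d)))
        .collect();
    // Sort by descending similarity; embeddings contain no NaN here.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}

fn main() {
    let docs = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![0.7, 0.7]];
    let query = vec![1.0, 0.1];
    let hits = top_k(&query, &docs, 2);
    println!("{:?}", hits); // indices of the closest documents, best first
}
```

A real deployment would typically replace the linear scan with an approximate nearest-neighbour index once the knowledge base grows large.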

Warning

This is for testing only; use at your own risk! The main purpose is to learn hands-on how this stuff works and to instrument and characterize the behaviour of LLMs.

Observability

The following key metrics are exposed through Prometheus:

  • token_creation_duration - Histogram of the time taken to generate each token.
  • inference_response_duration - Histogram of the time taken to generate the full response (includes tokenization and embedding additional context).
  • embedding_duration - Histogram of the time taken to create a vector representation of the query and look up contextual information in the knowledge base.
  • TODO: add more, such as time to tokenize, read from the KV store, etc.; also check whether tracing can be added.
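
Conceptually, each of these metrics is a histogram: an observed duration in seconds is counted into the bucket whose upper bound it falls under, and a running sum is kept for computing averages. Below is a minimal stdlib sketch of that mechanic; the real service would register these through the Prometheus client, and the bucket bounds here are illustrative:

```rust
use std::time::Instant;

/// Toy Prometheus-style histogram: cumulative-free bucket counters plus a sum.
/// Illustrative only; a production service would use a Prometheus client crate.
struct Histogram {
    bounds: Vec<f64>, // bucket upper bounds, in seconds
    counts: Vec<u64>, // one counter per bucket, plus a final +Inf bucket
    sum: f64,
}

impl Histogram {
    fn new(bounds: Vec<f64>) -> Self {
        let n = bounds.len() + 1;
        Histogram { bounds, counts: vec![0; n], sum: 0.0 }
    }

    /// Record one observation into the first bucket whose bound covers it.
    fn observe(&mut self, v: f64) {
        let idx = self
            .bounds
            .iter()
            .position(|&b| v <= b)
            .unwrap_or(self.bounds.len()); // falls through to +Inf
        self.counts[idx] += 1;
        self.sum += v;
    }
}

fn main() {
    // Mirrors the token_creation_duration metric listed above.
    let mut token_creation_duration =
        Histogram::new(vec![0.005, 0.01, 0.05, 0.1, 0.5]);
    let start = Instant::now();
    // ... generate one token here ...
    token_creation_duration.observe(start.elapsed().as_secs_f64());
    println!(
        "counts: {:?}, sum: {:.6}s",
        token_creation_duration.counts, token_creation_duration.sum
    );
}
```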

Here is an example dashboard that captures the metrics described above, as well as some host metrics such as power draw and CPU utilisation:

dashboard

Performance Considerations

The service is optimized for:

  • Memory Efficiency: Careful management of model loading and unloading
  • CPU Utilization: Thread-pool based processing with configurable worker counts
  • GPU Acceleration: Optional offloading of computation-heavy layers
  • Response Latency: Streaming responses to minimize time-to-first-token

Typical performance on consumer hardware (6-core CPU):

  • Time to first token: 100-300ms
  • Token generation speed: 20-30 tokens/second
  • Memory usage: 5-8GB for base model

Prerequisites

You will need to download a model and an embedding model:

It is best to put both files into a model/ folder as model.gguf and embed.gguf.

Configuration

This service can be configured through environment variables. The following variables are supported:

| Environment variable | Description | Example/Default |
| --- | --- | --- |
| `KNOWLEDGE_BASE_PATH` | Path to the directory containing knowledge base documents. | `kb_data` |
| `EMBEDDING_MODEL` | Full path of the embedding model to use. | `model/embed.gguf` |
| `HTTP_ADDRESS` | Bind address for the HTTP server. | `127.0.0.1:8080` |
| `HTTP_WORKERS` | Number of worker threads for the HTTP server. | `1` |
| `MAIN_GPU` | Index of the GPU to use. | `0` |
| `MODEL_GPU_LAYERS` | Number of layers to offload to the GPU. | `0` |
| `MODEL_MAX_TOKEN` | Maximum number of tokens to generate. | `128` |
| `MODEL_PATH` | Full path to the gguf file of the model. | `model/model.gguf` |
| `MODEL_PROMPT_TEMPLATE` | Prompt template; should contain `{context}` and `{query}` placeholders. | Mistral prompt |
| `MODEL_THREADS` | Number of threads used for inference. | `6` |
| `PROMETHEUS_HTTP_ADDRESS` | Bind address for the Prometheus metrics endpoint. | `127.0.0.1:8081` |

Other environment variables such as RUST_LOG can also be used.
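
A loader for these variables might look like the following sketch, falling back to the documented defaults when a variable is unset. The struct and function names are illustrative, not the service's actual internals:

```rust
use std::env;

/// Illustrative subset of the configuration table above.
struct Config {
    model_path: String,
    model_threads: usize,
    model_max_token: usize,
    http_address: String,
}

/// Read an environment variable, falling back to a default.
fn env_or(key: &str, default: &str) -> String {
    env::var(key).unwrap_or_else(|_| default.to_string())
}

fn load_config() -> Config {
    Config {
        model_path: env_or("MODEL_PATH", "model/model.gguf"),
        model_threads: env_or("MODEL_THREADS", "6")
            .parse()
            .expect("MODEL_THREADS must be an integer"),
        model_max_token: env_or("MODEL_MAX_TOKEN", "128")
            .parse()
            .expect("MODEL_MAX_TOKEN must be an integer"),
        http_address: env_or("HTTP_ADDRESS", "127.0.0.1:8080"),
    }
}

fn main() {
    let cfg = load_config();
    println!("serving {} on {}", cfg.model_path, cfg.http_address);
}
```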

API Interface

The service implements an OpenAI-compatible API endpoint. Example request:

curl -N -X POST http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
        "stream": true,
        "model": "rust-inference-service",
        "messages": [{
            "role": "user",
            "content": "Tell me about Dr. Ada Lovelace II's quantum debugging"
        }]
    }'

You can test the RAG capabilities by asking about Dr. Ada Lovelace II and her revolutionary Schrödinger Debugger. This example demonstrates how the system retrieves and incorporates technical knowledge with a touch of quantum computing humor.
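
With `stream: true`, the response arrives as Server-Sent Events: each `data:` line carries one JSON chunk, and a final `data: [DONE]` line ends the stream. A client-side sketch of splitting such a body into chunks follows; the chunk payloads shown are assumed OpenAI-style deltas, not verified against this service:

```rust
/// Extract the payloads of `data:` lines from an SSE body, stopping at the
/// `[DONE]` sentinel. A real client would parse each payload as JSON.
fn data_lines(sse: &str) -> Vec<&str> {
    sse.lines()
        .filter_map(|l| l.strip_prefix("data: "))
        .take_while(|payload| *payload != "[DONE]")
        .collect()
}

fn main() {
    // Assumed OpenAI-style streaming chunks, for illustration only.
    let body = "data: {\"choices\":[{\"delta\":{\"content\":\"Hel\"}}]}\n\n\
                data: {\"choices\":[{\"delta\":{\"content\":\"lo\"}}]}\n\n\
                data: [DONE]\n\n";
    for chunk in data_lines(body) {
        println!("{}", chunk);
    }
}
```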

Production Deployment

System Requirements

Minimum recommended specifications:

  • CPU: 6+ cores
  • RAM: 16GB
  • Storage: 10GB for models
  • Optional: NVIDIA GPU with 8GB+ VRAM

Kubernetes

Reference deployment configuration available in k8s_deployment.yaml.

kubectl apply -f k8s_deployment.yaml

Note: Customize image references and storage paths according to your environment.

Security Considerations

  • The service does not implement authentication by default
  • Consider running behind a reverse proxy for TLS termination
  • Implement rate limiting for production deployments
  • Monitor resource usage to prevent DoS attacks

Development

Testing

Execute tests sequentially to prevent model initialization conflicts:

cargo test --lib api::tests -- --test-threads=1

Contributing

Contributions are welcome! Please consider:

  1. Adding support for new model architectures
  2. Improving metrics collection
  3. Enhancing RAG capabilities
  4. Optimizing performance
  5. Adding new deployment examples

Known Limitations

  • Single model loading at a time
  • No model switching without restart
  • Limited to models compatible with llama.cpp
  • RAG implementation requires pre-processing of documents

Further Reading

Some of the following links can be useful:
