milinddethe15/slm-deployment

SLM Deployment — Distributed Inference on AWS

Deploy a small language model behind a distributed worker mesh on AWS. A Python worker runs the model; a TypeScript worker exposes it as a JSON HTTP API. The two workers run on separate VMs in a private subnet, communicate over RPC via the iii engine, and are fronted by a public API gateway.


Architecture

                          ┌──────────────────────────────────────────────────────────┐
                          │                  VPC  10.20.0.0/16                       │
                          │                                                          │
   Internet               │   Public Subnet  10.20.0.0/24                            │
      │                   │   ┌─────────────────────────────────────────┐             │
      │                   │   │  Gateway VM  (c7i-flex.large)          │             │
      │    ┌──────────┐   │   │                                         │             │
      ├───►│   IGW    │───┼──►│  iii engine (HTTP :3111, RPC :49134)   │             │
      │    └──────────┘   │   │  iii-http · iii-queue · iii-state      │             │
      │                   │   │  iii-observability                      │             │
      │                   │   └──────────┬──────────────────────────────┘             │
      │                   │              │  WebSocket RPC (:49134)                    │
      │                   │              │                                            │
      │                   │   Private Subnet  10.20.1.0/24                            │
      │                   │   ┌──────────┴──────────────────────────────┐             │
      │                   │   │                                         │             │
      │                   │   │  ┌───────────────────────────────┐     │             │
      │                   │   │  │  Caller Worker VM             │     │             │
      │                   │   │  │  (TypeScript / Node.js)       │     │             │
      │                   │   │  │  Functions:                    │     │             │
      │                   │   │  │   • http::run_inference_over_http   │             │
      │                   │   │  │   • inference::get_response   │     │             │
      │                   │   │  └───────────────────────────────┘     │             │
      │                   │   │                                         │             │
      │                   │   │  ┌───────────────────────────────┐     │             │
      │                   │   │  │  Inference Worker VM           │     │             │
      │                   │   │  │  (Python / transformers)      │     │             │
      │                   │   │  │  Function:                     │     │             │
      │                   │   │  │   • inference::run_inference   │     │             │
      │                   │   │  │  Model: gemma-3-270m (Q8 GGUF)│     │             │
      │                   │   │  └───────────────────────────────┘     │             │
      │                   │   │                                         │             │
      │    ┌──────────┐   │   └──────────┬──────────────────────────────┘             │
      │    │ NAT GW   │◄──┼─────────────┘  (outbound internet for                   │
      │    └──────────┘   │                  package installs & model download)       │
      │                   └──────────────────────────────────────────────────────────┘

Request flow

curl POST :3111/v1/chat/completions
  → iii-http (gateway)
    → http::run_inference_over_http  (caller-worker, TypeScript)
      → inference::get_response      (caller-worker, TypeScript)
        → inference::run_inference   (inference-worker, Python)
          → gemma-3-270m model
        ← model output
      ← JSON response
    ← HTTP 200
  ← {"result": { ... }}

Network hygiene

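| Resource            | Subnet                 | Public IP | Inbound from internet     |
|---------------------|------------------------|-----------|---------------------------|
| Gateway VM          | Public (10.20.0.0/24)  | ✅ Yes    | Port 3111 (API), 22 (SSH) |
| Caller Worker VM    | Private (10.20.1.0/24) | ❌ No     | None                      |
| Inference Worker VM | Private (10.20.1.0/24) | ❌ No     | None                      |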

Workers communicate with the gateway over WebSocket RPC (port 49134) within the VPC. They reach the internet only through the NAT Gateway (for package installs and model downloads during bootstrap).
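The addressing plan above can be sanity-checked with Python's standard ipaddress module. This is only a sketch: the CIDR blocks are taken from the architecture diagram, while the names VPC, SUBNETS, check_layout, and subnet_for are illustrative and do not exist in the repository.

```python
import ipaddress

# CIDR blocks taken from the architecture diagram above.
VPC = ipaddress.ip_network("10.20.0.0/16")
SUBNETS = {
    "public": ipaddress.ip_network("10.20.0.0/24"),
    "private": ipaddress.ip_network("10.20.1.0/24"),
}


def check_layout(vpc, subnets):
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    nets = list(subnets.items())
    for name, net in nets:
        if not net.subnet_of(vpc):
            problems.append(f"{name} subnet {net} is outside VPC {vpc}")
    for i, (name_a, net_a) in enumerate(nets):
        for name_b, net_b in nets[i + 1:]:
            if net_a.overlaps(net_b):
                problems.append(f"{name_a} and {name_b} subnets overlap")
    return problems


def subnet_for(address, subnets):
    """Return the name of the subnet containing an IP address, or None."""
    ip = ipaddress.ip_address(address)
    for name, net in subnets.items():
        if ip in net:
            return name
    return None
```

For example, subnet_for("10.20.1.37", SUBNETS) places a worker's private address in the private subnet, which matches the table: such a host has no public IP and accepts no inbound traffic from the internet.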


API Documentation

Endpoint

POST http://<GATEWAY_PUBLIC_IP>:3111/v1/chat/completions

Request

curl --max-time 420 \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"what color is apple?"}]}' \
  http://<GATEWAY_PUBLIC_IP>:3111/v1/chat/completions

Request body:

{
  "messages": [
    { "role": "user", "content": "what color is apple?" }
  ]
}

The messages array follows the OpenAI chat format — each message has a role (user or assistant) and content (string).
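A minimal sketch of a validator for this request shape follows. It checks only what the description above states (a messages array whose items carry a role of user or assistant and a string content). Whether the deployed workers enforce the same rules is an assumption, and validate_request is a hypothetical helper, not a function from the repository.

```python
ALLOWED_ROLES = {"user", "assistant"}  # roles named in the description above


def validate_request(body):
    """Return a list of validation errors for a chat request body."""
    if not isinstance(body, dict):
        return ["body must be a JSON object"]
    messages = body.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty array"]
    errors = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            errors.append(f"messages[{i}] must be an object")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            errors.append(f"messages[{i}].role must be one of {sorted(ALLOWED_ROLES)}")
        if not isinstance(msg.get("content"), str):
            errors.append(f"messages[{i}].content must be a string")
    return errors
```

An empty list means the body is well-formed; anything else can be returned to the client as a 400 response before the request ever reaches the inference worker.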

Response

{
  "result": {
    "result": "Apple is typically red."
  }
}

Note: The model is a tiny 270M-parameter SLM. Output quality is limited — this deployment demonstrates the distributed infrastructure, not production-grade inference.
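The same call can be made from Python using only the standard library. The sketch below mirrors the curl example and the nested response shape shown above; the function names are illustrative, and the 420-second timeout is copied from the curl command rather than from any documented server limit.

```python
import json
import urllib.request


def build_request(gateway_ip, prompt, port=3111):
    """Build a POST request for the chat completions endpoint."""
    url = f"http://{gateway_ip}:{port}/v1/chat/completions"
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body):
    """Pull the generated text out of the nested {"result": {"result": ...}} shape."""
    return json.loads(response_body)["result"]["result"]


def ask(gateway_ip, prompt, timeout=420):
    """Send one prompt and return the model's reply (performs a network call)."""
    request = build_request(gateway_ip, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_text(resp.read())
```

Calling ask("<GATEWAY_PUBLIC_IP>", "what color is apple?") would return the inner result string once the deployment is up.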


Deploy from Scratch

Prerequisites

  • An AWS account with programmatic access configured (aws configure)
  • Terraform ≥ 1.5
  • An EC2 key pair (optional, for SSH access to debug)

Steps

# 1. Clone the repository
git clone https://github.com/milinddethe15/slm-deployment.git
cd slm-deployment/terraform/aws

# 2. Initialize Terraform
terraform init

# 3. Deploy (takes ~5 minutes for infra + ~5-7 minutes for cloud-init bootstrap)
terraform apply

# 4. Note the gateway public IP from the output
terraform output gateway_public_ip

# 5. Wait ~5-7 minutes for cloud-init to finish on all VMs, then test (a response takes around 30 seconds)
curl --max-time 420 \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What color is apple?"}]}' \
  http://$(terraform output -raw gateway_public_ip):3111/v1/chat/completions
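
Instead of waiting a fixed time in step 5, a script can poll until the endpoint answers. The helper below is a sketch with assumed defaults (30 attempts, 15 seconds apart, which covers the ~5-7 minute bootstrap window); wait_until_ready and its probe argument are hypothetical names, and the probe would typically wrap a real request to the gateway.

```python
import time


def wait_until_ready(probe, attempts=30, delay=15.0, sleep=time.sleep):
    """Call probe() until it returns True; return the attempt number or None.

    probe is any zero-argument callable, for example a function that sends a
    request to the gateway and returns True on success. OSError (which covers
    refused connections, timeouts, and urllib's URLError) is treated as
    "not ready yet".
    """
    for attempt in range(1, attempts + 1):
        try:
            if probe():
                return attempt
        except OSError:
            pass  # services are still starting
        if attempt < attempts:
            sleep(delay)
    return None
```

The sleep parameter is injectable so the loop can be exercised in tests without real delays.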

Tear down

terraform destroy

Terraform variables

| Variable                   | Default        | Description                                     |
|----------------------------|----------------|-------------------------------------------------|
| aws_region                 | us-east-1      | AWS region                                      |
| repo_url                   | this repo      | Git repo to clone on each VM                    |
| ssh_key_name               | ""             | EC2 key pair name (optional)                    |
| instance_type_gateway      | c7i-flex.large | Gateway instance type                           |
| instance_type_worker       | c7i-flex.large | Worker instance type                            |
| inference_root_volume_size | 50             | Root volume size (GiB) for the inference worker |
| allowed_ssh_cidr           | 0.0.0.0/0      | CIDR allowed to SSH                             |

What happens during bootstrap

Each VM's user-data script (cloud-init) automatically:

  1. Installs system packages (git, node, python3.11, etc.)
  2. Installs the iii CLI
  3. Clones this repository to /opt/slm-deployment
  4. Runs the role-specific bootstrap script:
    • Gateway: Installs the iii-engine.service systemd unit, starts the iii engine with the gateway config
    • Caller worker: Runs npm install && npm run build, installs caller-worker.service, connects to gateway via III_URL
    • Inference worker: Installs Python dependencies, installs inference-worker.service, downloads the model from HuggingFace, connects to gateway via III_URL

Debugging

# SSH to gateway (public IP)
ssh -i <key.pem> ec2-user@<GATEWAY_PUBLIC_IP>

# SSH to workers (through gateway as bastion; load the key into ssh-agent first with: ssh-add <key.pem>)
ssh -J ec2-user@<GATEWAY_PUBLIC_IP> ec2-user@<WORKER_PRIVATE_IP>

# Check service status
sudo systemctl status iii-engine.service        # gateway
sudo systemctl status caller-worker.service      # caller VM
sudo systemctl status inference-worker.service   # inference VM

# View logs
sudo journalctl -u iii-engine.service -f
sudo journalctl -u caller-worker.service -f
sudo journalctl -u inference-worker.service -f

# Check cloud-init (first-boot) log
sudo tail -n 100 /var/log/cloud-init-output.log

# Verify listeners on gateway
ss -tlnp | grep -E '3111|49134'

# Check registered workers (via engine logs)
sudo journalctl -u iii-engine.service | grep "Worker registered"
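
The listener check can also be scripted. The sketch below tests TCP reachability of the two gateway ports named in this README (3111 for HTTP, 49134 for RPC); port_open and check_gateway are illustrative names. The network table lists only ports 3111 and 22 as open to the internet, so the RPC port would normally be probed from inside the VPC.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_gateway(host, ports=(3111, 49134)):
    """Map each expected gateway port to whether it accepts connections."""
    return {port: port_open(host, port) for port in ports}
```

Run on a worker VM, check_gateway("<GATEWAY_PRIVATE_IP>") gives a quick view of whether the engine is reachable before digging into the service logs.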

Project Structure

slm-deployment/
├── README.md                          ← You are here
├── distributed-inference/
│   ├── config.yaml                    ← iii engine config (local dev)
│   └── workers/
│       ├── caller-worker/             ← TypeScript worker (HTTP trigger + RPC fan-out)
│       │   ├── src/worker.ts
│       │   ├── iii.worker.yaml
│       │   └── package.json
│       └── inference-worker/          ← Python worker (model inference)
│           ├── inference_worker.py
│           ├── iii.worker.yaml
│           └── requirements.txt
└── terraform/
    └── aws/
        ├── main.tf                    ← VPC, subnets, SGs, EC2 instances
        ├── variables.tf               ← Configurable parameters
        ├── outputs.tf                 ← Gateway IP, subnet IDs
        ├── config/
        │   └── gateway.yaml           ← iii engine config for the gateway VM
        ├── scripts/
        │   ├── bootstrap-gateway.sh
        │   ├── bootstrap-caller-worker.sh
        │   ├── bootstrap-inference-worker.sh
        │   └── launch-iii.sh
        ├── systemd/
        │   ├── iii-engine.service
        │   ├── caller-worker.service
        │   └── inference-worker.service
        └── templates/
            ├── bootstrap-gateway.sh.tftpl
            ├── bootstrap-caller-worker.sh.tftpl
            └── bootstrap-inference-worker.sh.tftpl

Production Hardening

If this were going to production, I would address the following:

Security:

  • Place the API behind an ALB with TLS termination (HTTPS) — currently traffic is unencrypted HTTP.
  • Add API authentication (API keys, JWT, or OAuth) — currently the endpoint is completely open.
  • Restrict allowed_ssh_cidr to a VPN or bastion CIDR instead of 0.0.0.0/0.
  • Use AWS Secrets Manager or SSM Parameter Store for sensitive configuration instead of env files on disk.
  • Enable VPC Flow Logs for network audit trails.
  • Run instances with a hardened AMI and apply security patches via SSM Patch Manager.
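
As one illustration of the API-key option above, the sketch below checks a Bearer token against a set of valid keys using a constant-time comparison. The header convention and the is_authorized helper are assumptions for illustration; the repository currently has no authentication layer.

```python
import hmac


def is_authorized(headers, valid_keys):
    """Check a Bearer token from the Authorization header against known keys."""
    auth = headers.get("Authorization", "")
    scheme, _, token = auth.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest avoids leaking key contents through timing differences.
    return any(
        hmac.compare_digest(token.encode("utf-8"), key.encode("utf-8"))
        for key in valid_keys
    )
```

In practice the valid keys would be loaded from Secrets Manager or SSM Parameter Store, in line with the recommendation above, rather than hard-coded.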

Reliability:

  • Put each worker behind an Auto Scaling Group with health checks so crashed workers are automatically replaced.
  • Add a health-check endpoint (GET /health) that verifies the full RPC chain is functional.
  • Use a proper process manager (or container orchestration) with liveness probes instead of bare systemd restart loops.
  • Set up CloudWatch alarms for CPU, memory, and disk on each instance.
  • Use EBS snapshots or a shared EFS volume for the model weights so every new inference worker doesn't have to re-download from HuggingFace.
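
The health-check idea above could be sketched as a small aggregator that runs one check per link in the RPC chain and maps the outcome to an HTTP status. The check names and the health_report function are hypothetical; each check would wrap a real probe, such as a trivial inference call routed through the gateway.

```python
def health_report(checks):
    """Run named zero-argument checks and summarise them for a GET /health handler.

    Returns (http_status, body): 200 when every check passes, 503 otherwise.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "fail"
        except Exception as exc:  # a crashing check counts as a failure
            results[name] = f"error: {exc}"
    healthy = all(status == "ok" for status in results.values())
    return (200 if healthy else 503), {"healthy": healthy, "checks": results}
```

Returning 503 on any failure lets a load balancer or Auto Scaling Group health check take the instance out of rotation.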

Observability:

  • Export iii-observability spans to an external backend (Jaeger, Datadog, or AWS X-Ray) instead of keeping them in memory.
  • Ship journalctl logs to CloudWatch Logs or an ELK stack for centralized debugging.
  • Add structured metrics for request latency, token throughput, and error rates.
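
To make the metrics bullet concrete, the sketch below summarises a batch of request records into latency percentiles, token throughput, and error rate. The record schema (latency_s, tokens, ok) is invented for illustration, and throughput is computed as total tokens over summed request time, which assumes requests are served one at a time.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers (0 < pct <= 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(requests):
    """Summarise records shaped like {"latency_s": float, "tokens": int, "ok": bool}."""
    latencies = [r["latency_s"] for r in requests]
    total_time = sum(latencies)
    return {
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "tokens_per_s": sum(r["tokens"] for r in requests) / total_time,
        "error_rate": sum(not r["ok"] for r in requests) / len(requests),
    }
```

A summary like this could be emitted as a structured log line per interval and picked up by whichever metrics backend is chosen.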

Scaling to a 100× Larger Model

If the model were 100× larger (~27B parameters), the deployment would change significantly:

Compute:

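  • GPU instances would be mandatory. At FP16, 27B parameters need roughly 54 GB for the weights alone, so a single 24 GB A10G (g5.xlarge) would only work with aggressive quantization; multi-GPU instances such as g5.12xlarge (4× A10G) or p4d.24xlarge (8× A100) are more realistic. CPU inference at 27B parameters is impractically slow.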
  • Model parallelism (tensor or pipeline parallelism) might be needed if the model doesn't fit in a single GPU's VRAM. This would require frameworks like vLLM, TGI (Text Generation Inference), or DeepSpeed.
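
The memory argument behind these bullets is simple arithmetic: weight memory is the parameter count times the bytes per parameter. The sketch below works that out; the 20% overhead factor for KV cache and activations is an assumed rule of thumb, not a measured figure, and both function names are illustrative.

```python
def weight_memory_gb(params, bits_per_param):
    """Memory needed for model weights alone, in GB (10**9 bytes)."""
    return params * bits_per_param / 8 / 1e9


def fits_on_gpu(params, bits_per_param, vram_gb, overhead=1.2):
    """Rough fit test; overhead is an assumed allowance for KV cache and activations."""
    return weight_memory_gb(params, bits_per_param) * overhead <= vram_gb
```

With these assumptions, a 27B-parameter model needs about 54 GB at 16 bits, 27 GB at 8 bits, and 13.5 GB at 4 bits, so only the 4-bit variant fits a single 24 GB GPU. The current 270M-parameter model at 8 bits needs about 0.27 GB.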

Architecture:

  • Replace the bare Python worker with a dedicated inference server (vLLM, TGI, or Triton) that handles continuous batching and KV-cache management for throughput.
  • Add a request queue (SQS or Redis) between the caller-worker and the inference tier to absorb traffic spikes and prevent OOM on the GPU.
  • Separate the model artifact from the instance — store weights in S3 and mount via FSx for Lustre or pre-bake an AMI with the model to avoid long cold-start times.
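
The request-queue bullet relies on backpressure: once the backlog reaches a limit, new requests are rejected instead of piling up on the GPU. The class below is an in-process stand-in that shows the behaviour. It is not an SQS or Redis client, and BoundedInferenceQueue is a hypothetical name.

```python
import queue


class BoundedInferenceQueue:
    """In-process stand-in for SQS/Redis: rejects work when the backlog is full."""

    def __init__(self, max_pending):
        self._q = queue.Queue(maxsize=max_pending)

    def submit(self, request):
        """Enqueue a request; return False (caller should reply 429/503) if full."""
        try:
            self._q.put_nowait(request)
            return True
        except queue.Full:
            return False

    def next_request(self, timeout=None):
        """Called by the inference worker; returns None if nothing arrives in time."""
        try:
            return self._q.get(timeout=timeout)
        except queue.Empty:
            return None

    def depth(self):
        """Approximate number of requests waiting to be served."""
        return self._q.qsize()
```

The caller-worker would call submit() and translate a False return into an HTTP 429 or 503, while the inference tier pulls work with next_request() at its own pace.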

Cost:

  • Use Spot Instances for the inference tier (with graceful draining) to cut GPU costs, often by 60-70%, although Spot discounts vary by instance type and region.
  • Implement autoscaling based on GPU utilization or queue depth.
  • Consider a managed inference service (such as SageMaker endpoints) or lower-cost purpose-built accelerators (AWS Inferentia) if traffic is bursty, to avoid paying for idle GPUs.
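
Autoscaling on queue depth reduces to a small sizing rule: run enough workers that each one handles roughly a target number of queued requests, within fixed bounds. The sketch below shows that rule only. The function name, the target value, and the bounds are all assumptions, and wiring the result into an Auto Scaling Group is left out.

```python
import math


def desired_workers(queue_depth, per_worker_target, min_workers=1, max_workers=8):
    """Size the inference tier so each worker handles about per_worker_target
    queued requests, clamped to [min_workers, max_workers]."""
    if per_worker_target <= 0:
        raise ValueError("per_worker_target must be positive")
    wanted = math.ceil(queue_depth / per_worker_target)
    return max(min_workers, min(max_workers, wanted))
```

For example, with a target of 4 queued requests per worker, a backlog of 10 yields 3 workers, an empty queue falls back to the minimum of 1, and a backlog of 100 is capped at the maximum of 8.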

Networking:

  • Enable Elastic Fabric Adapter (EFA) or placement groups if doing multi-GPU inference across nodes to minimize inter-node latency.
  • Use larger instance types with higher network bandwidth for model weight transfers.
