SLM Deployment — Distributed Inference on AWS

Deploy a small language model behind a distributed worker mesh on AWS. A Python worker runs the model; a TypeScript worker exposes it as a JSON HTTP API. The two workers run on separate VMs in a private subnet, communicate over RPC via the iii engine, and are fronted by a public API gateway.

Architecture

                          ┌──────────────────────────────────────────────────────────┐
                          │                  VPC  10.20.0.0/16                       │
                          │                                                          │
   Internet               │   Public Subnet  10.20.0.0/24                            │
      │                   │   ┌───────────────────────────────────────────┐          │
      │                   │   │  Gateway VM  (c7i-flex.large)             │          │
      │    ┌──────────┐   │   │                                           │          │
      ├───►│   IGW    │───┼──►│  iii engine (HTTP :3111, RPC :49134)      │          │
      │    └──────────┘   │   │  iii-http · iii-queue · iii-state         │          │
      │                   │   │  iii-observability                        │          │
      │                   │   └──────────┬────────────────────────────────┘          │
      │                   │              │  WebSocket RPC (:49134)                   │
      │                   │              │                                           │
      │                   │   Private Subnet  10.20.1.0/24                           │
      │                   │   ┌──────────┴────────────────────────────────┐          │
      │                   │   │                                           │          │
      │                   │   │  ┌─────────────────────────────────────┐  │          │
      │                   │   │  │  Caller Worker VM                   │  │          │
      │                   │   │  │  (TypeScript / Node.js)             │  │          │
      │                   │   │  │  Functions:                         │  │          │
      │                   │   │  │   • http::run_inference_over_http   │  │          │
      │                   │   │  │   • inference::get_response         │  │          │
      │                   │   │  └─────────────────────────────────────┘  │          │
      │                   │   │                                           │          │
      │                   │   │  ┌─────────────────────────────────────┐  │          │
      │                   │   │  │  Inference Worker VM                │  │          │
      │                   │   │  │  (Python / transformers)            │  │          │
      │                   │   │  │  Function:                          │  │          │
      │                   │   │  │   • inference::run_inference        │  │          │
      │                   │   │  │  Model: gemma-3-270m (Q8 GGUF)      │  │          │
      │                   │   │  └─────────────────────────────────────┘  │          │
      │                   │   │                                           │          │
      │    ┌──────────┐   │   └──────────┬────────────────────────────────┘          │
      │    │ NAT GW   │◄──┼──────────────┘  (outbound internet for                   │
      │    └──────────┘   │                 package installs & model download)       │
      │                   └──────────────────────────────────────────────────────────┘

Request flow

curl POST :3111/v1/chat/completions
  → iii-http (gateway)
    → http::run_inference_over_http  (caller-worker, TypeScript)
      → inference::get_response      (caller-worker, TypeScript)
        → inference::run_inference   (inference-worker, Python)
          → gemma-3-270m model
        ← model output
      ← JSON response
    ← HTTP 200
  ← {"result": { ... }}
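
The call chain above can be mimicked in plain Python to show how the nested response might come about. This is only an illustration: the three functions are ordinary Python stand-ins named after the registered function IDs, no iii SDK calls are used, and which layer adds which `result` wrapper is an assumption, not something read from the repository's code.

```python
def run_inference(messages):
    """Stand-in for inference::run_inference (inference worker). Returns stub text."""
    last_user_text = messages[-1]["content"]
    return f"(stub output for: {last_user_text})"


def get_response(messages):
    """Stand-in for inference::get_response (caller worker). Wraps the model output."""
    return {"result": run_inference(messages)}


def run_inference_over_http(body):
    """Stand-in for http::run_inference_over_http (caller worker, HTTP trigger)."""
    return {"result": get_response(body["messages"])}


response = run_inference_over_http(
    {"messages": [{"role": "user", "content": "what color is apple?"}]}
)
print(response)
```

In the real deployment each arrow in the diagram is an RPC hop through the gateway's iii engine rather than a local function call.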

Network hygiene

| Resource | Subnet | Public IP | Inbound from internet |
|---|---|---|---|
| Gateway VM | Public (10.20.0.0/24) | ✅ Yes | Port 3111 (API), 22 (SSH) |
| Caller Worker VM | Private (10.20.1.0/24) | ❌ No | None |
| Inference Worker VM | Private (10.20.1.0/24) | ❌ No | None |

Workers communicate with the gateway over WebSocket RPC (port 49134) within the VPC. They reach the internet only through the NAT Gateway (for package installs and model downloads during bootstrap).
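
As a quick sanity check on the address plan, the CIDR blocks from the diagram can be verified with Python's standard `ipaddress` module. This is a small sketch; the helper name `subnet_of` and the sample IP addresses are made up for illustration.

```python
import ipaddress
from typing import Optional

# CIDR blocks taken from the architecture diagram.
VPC = ipaddress.ip_network("10.20.0.0/16")
SUBNETS = {
    "public": ipaddress.ip_network("10.20.0.0/24"),
    "private": ipaddress.ip_network("10.20.1.0/24"),
}


def subnet_of(ip: str) -> Optional[str]:
    """Return the name of the subnet containing ip, or None if it is in neither."""
    addr = ipaddress.ip_address(ip)
    for name, net in SUBNETS.items():
        if addr in net:
            return name
    return None


# Both subnets must sit inside the VPC range and must not overlap each other.
assert all(net.subnet_of(VPC) for net in SUBNETS.values())
assert not SUBNETS["public"].overlaps(SUBNETS["private"])

# Hypothetical worker address: it should land in the private subnet.
print(subnet_of("10.20.1.37"))
```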

API Documentation

Endpoint

POST http://<GATEWAY_PUBLIC_IP>:3111/v1/chat/completions

Request

curl --max-time 420 \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"what color is apple?"}]}' \
  http://<GATEWAY_PUBLIC_IP>:3111/v1/chat/completions

Request body:

{
  "messages": [
    { "role": "user", "content": "what color is apple?" }
  ]
}

The messages array follows the OpenAI chat format — each message has a role (user or assistant) and content (string).
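
A client can check a request body against this description before sending it. The sketch below is a client-side convenience only: `validate_messages` is a hypothetical helper, the requirement that the list be non-empty is an assumption, and it does not claim to mirror whatever validation the workers perform.

```python
from typing import List

ALLOWED_ROLES = {"user", "assistant"}


def validate_messages(body: dict) -> List[str]:
    """Return a list of problems found in the request body; empty means it looks valid."""
    messages = body.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"messages[{i}] must be an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"messages[{i}].role must be 'user' or 'assistant'")
        if not isinstance(msg.get("content"), str):
            problems.append(f"messages[{i}].content must be a string")
    return problems


print(validate_messages({"messages": [{"role": "user", "content": "what color is apple?"}]}))
```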

Response

{
  "result": {
    "result": "Apple is typically red."
  }
}

Note: The model is a tiny 270M-parameter SLM. Output quality is limited — this deployment demonstrates the distributed infrastructure, not production-grade inference.
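
For scripted use, the same call can be made from Python with only the standard library. This is a minimal sketch: the helper names `build_request`, `extract_text`, and `ask` are illustrative and not part of the repository, and the parsing assumes the `result.result` shape shown above.

```python
import json
import urllib.request


def build_request(gateway_ip: str, prompt: str) -> urllib.request.Request:
    """Build the POST request for a single user message."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"http://{gateway_ip}:3111/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body: bytes) -> str:
    """Pull the generated text out of the nested {"result": {"result": ...}} body."""
    return json.loads(response_body)["result"]["result"]


def ask(gateway_ip: str, prompt: str, timeout: float = 420.0) -> str:
    """Send the prompt and return the model's text.

    Note: timeout here is a socket-level timeout, not an overall deadline
    like curl's --max-time.
    """
    request = build_request(gateway_ip, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_text(resp.read())
```

Usage would look like `ask("<GATEWAY_PUBLIC_IP>", "what color is apple?")`, with the placeholder replaced by the Terraform output.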

Deploy from Scratch

Prerequisites

An AWS account with programmatic access configured (aws configure)
Terraform ≥ 1.5
An EC2 key pair (optional, for SSH access when debugging)

Steps

# 1. Clone the repository
git clone https://github.com/milinddethe15/slm-deployment.git
cd slm-deployment/terraform/aws

# 2. Initialize Terraform
terraform init

# 3. Deploy (takes ~5 minutes for infra + ~5-7 minutes for cloud-init bootstrap)
terraform apply

# 4. Note the gateway public IP from the output
terraform output gateway_public_ip

# 5. Wait ~5-7 minutes for cloud-init to finish on all VMs, then test (Response takes around 30 seconds)
curl --max-time 420 \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What color is apple?"}]}' \
  http://$(terraform output -raw gateway_public_ip):3111/v1/chat/completions
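
Rather than waiting a fixed time before step 5, a script can poll until the gateway answers. The sketch below shows one way to do that; `wait_until_ready` and its defaults (20 attempts, 30 seconds apart, roughly matching the bootstrap times quoted above) are assumptions, and the `probe` callable is whatever check you choose, for example a request to the endpoint.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 20,
    delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns True or the attempts run out.

    Connection errors (refused, timed out) are treated as "not ready yet".
    """
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except OSError:
            pass  # urllib.error.URLError and socket errors are OSError subclasses
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


# Demonstration with a fake probe that succeeds on the third call.
outcomes = iter([False, False, True])
print(wait_until_ready(lambda: next(outcomes), attempts=5, delay_s=0))
```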

Tear down

terraform destroy

Terraform variables

| Variable | Default | Description |
|---|---|---|
| `aws_region` | `us-east-1` | AWS region |
| `repo_url` | this repo | Git repo to clone on each VM |
| `ssh_key_name` | `""` | EC2 key pair name (optional) |
| `instance_type_gateway` | `c7i-flex.large` | Gateway instance type |
| `instance_type_worker` | `c7i-flex.large` | Worker instance type |
| `inference_root_volume_size` | `50` | Root volume GiB for inference worker |
| `allowed_ssh_cidr` | `0.0.0.0/0` | CIDR allowed to SSH |

What happens during bootstrap

Each VM's user-data script (cloud-init) automatically:

1. Installs system packages (git, node, python3.11, etc.)
2. Installs the iii CLI
3. Clones this repository to /opt/slm-deployment
4. Runs the role-specific bootstrap script:
   - Gateway: Installs the iii-engine.service systemd unit, starts the iii engine with the gateway config
   - Caller worker: Runs npm install && npm run build, installs caller-worker.service, connects to gateway via III_URL
   - Inference worker: Installs Python dependencies, installs inference-worker.service, downloads the model from HuggingFace, connects to gateway via III_URL

Debugging

# SSH to gateway (public IP)
ssh -i <key.pem> ec2-user@<GATEWAY_PUBLIC_IP>

# SSH to workers (through gateway as bastion; load the key into ssh-agent first, e.g. ssh-add <key.pem>)
ssh -J ec2-user@<GATEWAY_PUBLIC_IP> ec2-user@<WORKER_PRIVATE_IP>

# Check service status
sudo systemctl status iii-engine.service        # gateway
sudo systemctl status caller-worker.service      # caller VM
sudo systemctl status inference-worker.service   # inference VM

# View logs
sudo journalctl -u iii-engine.service -f
sudo journalctl -u caller-worker.service -f
sudo journalctl -u inference-worker.service -f

# Check cloud-init (first-boot) log
sudo tail -n 100 /var/log/cloud-init-output.log

# Verify listeners on gateway
ss -tlnp | grep -E '3111|49134'

# Check registered workers (via engine logs)
sudo journalctl -u iii-engine.service | grep "Worker registered"
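
The listener check can also be done remotely with a plain TCP connect, which is handy when you cannot SSH to the gateway. This is a sketch using only the standard library; `is_port_open` is a hypothetical helper, and it must be run from a host that is allowed to reach the port (for example, a worker VM inside the VPC for 49134).

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

For example, `is_port_open("<GATEWAY_PUBLIC_IP>", 3111)` checks the public API port, while 49134 would be probed against the gateway's private address.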

Project Structure

slm-deployment/
├── README.md                          ← You are here
├── distributed-inference/
│   ├── config.yaml                    ← iii engine config (local dev)
│   └── workers/
│       ├── caller-worker/             ← TypeScript worker (HTTP trigger + RPC fan-out)
│       │   ├── src/worker.ts
│       │   ├── iii.worker.yaml
│       │   └── package.json
│       └── inference-worker/          ← Python worker (model inference)
│           ├── inference_worker.py
│           ├── iii.worker.yaml
│           └── requirements.txt
└── terraform/
    └── aws/
        ├── main.tf                    ← VPC, subnets, SGs, EC2 instances
        ├── variables.tf               ← Configurable parameters
        ├── outputs.tf                 ← Gateway IP, subnet IDs
        ├── config/
        │   └── gateway.yaml           ← iii engine config for the gateway VM
        ├── scripts/
        │   ├── bootstrap-gateway.sh
        │   ├── bootstrap-caller-worker.sh
        │   ├── bootstrap-inference-worker.sh
        │   └── launch-iii.sh
        ├── systemd/
        │   ├── iii-engine.service
        │   ├── caller-worker.service
        │   └── inference-worker.service
        └── templates/
            ├── bootstrap-gateway.sh.tftpl
            ├── bootstrap-caller-worker.sh.tftpl
            └── bootstrap-inference-worker.sh.tftpl

Production Hardening

If this were going to production, I would address the following:

Security:

Place the API behind an ALB with TLS termination (HTTPS) — currently traffic is unencrypted HTTP.
Add API authentication (API keys, JWT, or OAuth) — currently the endpoint is completely open.
Restrict allowed_ssh_cidr to a VPN or bastion CIDR instead of 0.0.0.0/0.
Use AWS Secrets Manager or SSM Parameter Store for sensitive configuration instead of env files on disk.
Enable VPC Flow Logs for network audit trails.
Run instances with a hardened AMI and apply security patches via SSM Patch Manager.

Reliability:

Put each worker behind an Auto Scaling Group with health checks so crashed workers are automatically replaced.
Add a health-check endpoint (GET /health) that verifies the full RPC chain is functional.
Use a proper process manager (or container orchestration) with liveness probes instead of bare systemd restart loops.
Set up CloudWatch alarms for CPU, memory, and disk on each instance.
Use EBS snapshots or a shared EFS volume for the model weights so every new inference worker doesn't have to re-download from HuggingFace.

Observability:

Export iii-observability spans to an external backend (Jaeger, Datadog, or AWS X-Ray) instead of in-memory.
Ship journalctl logs to CloudWatch Logs or an ELK stack for centralized debugging.
Add structured metrics for request latency, token throughput, and error rates.

Scaling to a 100× Larger Model

If the model were 100× larger (~27B parameters), the deployment would change significantly:

Compute:

GPU instances would be mandatory (e.g., g5.xlarge with A10G or p4d.24xlarge with A100s for very large models). CPU inference at 27B parameters is impractically slow.
Model parallelism (tensor or pipeline parallelism) might be needed if the model doesn't fit in a single GPU's VRAM. This would require frameworks like vLLM, TGI (Text Generation Inference), or DeepSpeed.
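
A back-of-the-envelope calculation shows why a single GPU may not be enough. The sketch below estimates weight memory from parameter count and precision; the 1.2× overhead factor for KV cache and activations is an assumed round number, not a measurement, and the function names are illustrative. The VRAM figures are the per-GPU capacities of the A10G (24 GB, g5) and the A100 in p4d.24xlarge (40 GB).

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def min_gpus(n_params: float, dtype: str, vram_gb: float, overhead: float = 1.2) -> int:
    """Smallest GPU count whose combined VRAM covers weights plus assumed overhead."""
    return math.ceil(weight_memory_gb(n_params, dtype) * overhead / vram_gb)


print(weight_memory_gb(270e6, "int8"))   # current model at 8-bit: 0.27 GB
print(weight_memory_gb(27e9, "fp16"))    # 100x model at fp16: 54.0 GB
print(min_gpus(27e9, "fp16", 24))        # 24 GB A10G GPUs needed: 3
print(min_gpus(27e9, "fp16", 40))        # 40 GB A100 GPUs needed: 2
print(min_gpus(27e9, "int4", 24))        # 4-bit quantized on A10G: 1
```

Under these assumptions, fp16 weights alone exceed a single A10G or 40 GB A100, which is where tensor or pipeline parallelism (or aggressive quantization) comes in.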

Architecture:

Replace the bare Python worker with a dedicated inference server (vLLM, TGI, or Triton) that handles continuous batching and KV-cache management for throughput.
Add a request queue (SQS or Redis) between the caller-worker and the inference tier to absorb traffic spikes and prevent OOM on the GPU.
Separate the model artifact from the instance — store weights in S3 and mount via FSx for Lustre or pre-bake an AMI with the model to avoid long cold-start times.
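
The queueing idea can be illustrated with an in-process bounded queue: when the queue is full, new requests are rejected instead of piling up on the GPU. This is a toy stand-in for SQS or Redis, with a hypothetical class name and limits chosen only for the demo.

```python
import queue
from typing import Any, List


class BoundedRequestQueue:
    """Holds at most max_pending requests; callers are told when it is full."""

    def __init__(self, max_pending: int):
        self._q: "queue.Queue[Any]" = queue.Queue(maxsize=max_pending)

    def submit(self, request: Any) -> bool:
        """Enqueue a request. Returns False (shed load) if the queue is full."""
        try:
            self._q.put_nowait(request)
            return True
        except queue.Full:
            return False

    def next_batch(self, max_batch: int) -> List[Any]:
        """Drain up to max_batch requests in FIFO order for the inference tier."""
        batch: List[Any] = []
        while len(batch) < max_batch:
            try:
                batch.append(self._q.get_nowait())
            except queue.Empty:
                break
        return batch


q = BoundedRequestQueue(max_pending=2)
print(q.submit("a"), q.submit("b"), q.submit("c"))  # third submit is rejected
print(q.next_batch(max_batch=8))
```

A rejected submit would typically map to an HTTP 429 or 503 at the API layer so clients can retry later.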

Cost:

Use Spot Instances for the inference tier (with graceful draining) to cut GPU costs, often by around 60-70% depending on instance type and region.
Implement autoscaling based on GPU utilization or queue depth.
Consider a managed inference service (for example, SageMaker endpoints) or purpose-built accelerators (AWS Inferentia) if traffic is bursty, to avoid paying for idle GPUs.
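
Queue-depth autoscaling boils down to a small target-tracking calculation. The sketch below is one possible policy, not an AWS API: `desired_workers`, the per-worker capacity, and the min/max bounds are all hypothetical values you would tune for the real workload.

```python
import math


def desired_workers(
    queue_depth: int,
    per_worker_capacity: int,
    min_workers: int = 1,
    max_workers: int = 8,
) -> int:
    """Number of inference workers needed to drain the queue, clamped to bounds."""
    if per_worker_capacity <= 0:
        raise ValueError("per_worker_capacity must be positive")
    needed = math.ceil(queue_depth / per_worker_capacity)
    return max(min_workers, min(max_workers, needed))


print(desired_workers(queue_depth=10, per_worker_capacity=4))
```

Keeping `min_workers` at 1 avoids cold starts at the cost of one always-on GPU; setting it to 0 trades latency for cost.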

Networking:

Enable Elastic Fabric Adapter (EFA) or placement groups if doing multi-GPU inference across nodes to minimize inter-node latency.
Use larger instance types with higher network bandwidth for model weight transfers.
