Deploy a small language model behind a distributed worker mesh on AWS. A Python worker runs the model; a TypeScript worker exposes it as a JSON HTTP API. The two workers run on separate VMs in a private subnet, communicate over RPC via the iii engine, and are fronted by a public API gateway.
                        ┌──────────────────────────────────────────────────────────┐
                        │                     VPC 10.20.0.0/16                     │
                        │                                                          │
    Internet            │   Public Subnet 10.20.0.0/24                             │
      │                 │   ┌─────────────────────────────────────────┐            │
      │                 │   │  Gateway VM (c7i-flex.large)            │            │
      │  ┌──────────┐   │   │                                         │            │
      ├─►│   IGW    │───┼──►│  iii engine (HTTP :3111, RPC :49134)    │            │
      │  └──────────┘   │   │  iii-http · iii-queue · iii-state       │            │
      │                 │   │  iii-observability                      │            │
      │                 │   └──────────┬──────────────────────────────┘            │
      │                 │              │ WebSocket RPC (:49134)                    │
      │                 │              │                                           │
      │                 │   Private Subnet 10.20.1.0/24                            │
      │                 │   ┌──────────┴──────────────────────────────┐            │
      │                 │   │                                         │            │
      │                 │   │  ┌───────────────────────────────────┐  │            │
      │                 │   │  │  Caller Worker VM                 │  │            │
      │                 │   │  │  (TypeScript / Node.js)           │  │            │
      │                 │   │  │  Functions:                       │  │            │
      │                 │   │  │   • http::run_inference_over_http │  │            │
      │                 │   │  │   • inference::get_response       │  │            │
      │                 │   │  └───────────────────────────────────┘  │            │
      │                 │   │                                         │            │
      │                 │   │  ┌───────────────────────────────────┐  │            │
      │                 │   │  │  Inference Worker VM              │  │            │
      │                 │   │  │  (Python / transformers)          │  │            │
      │                 │   │  │  Function:                        │  │            │
      │                 │   │  │   • inference::run_inference      │  │            │
      │                 │   │  │  Model: gemma-3-270m (Q8 GGUF)    │  │            │
      │                 │   │  └───────────────────────────────────┘  │            │
      │                 │   │                                         │            │
      │  ┌──────────┐   │   └──────────┬──────────────────────────────┘            │
      │  │  NAT GW  │◄──┼──────────────┘  (outbound internet for                   │
      │  └──────────┘   │                  package installs & model download)      │
      │                 └──────────────────────────────────────────────────────────┘
curl POST :3111/v1/chat/completions
→ iii-http (gateway)
→ http::run_inference_over_http (caller-worker, TypeScript)
→ inference::get_response (caller-worker, TypeScript)
→ inference::run_inference (inference-worker, Python)
→ gemma-3-270m model
← model output
← JSON response
← HTTP 200
← {"result": { ... }}
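The call chain above can be mimicked in a single process to make the nesting of the response easier to see. This is only a sketch: the `REGISTRY`, `register`, and `call` helpers are hypothetical stand-ins for the engine's routing, not the iii SDK, and the request envelope (a `body` key) and the placeholder model output are assumptions made for illustration.

```python
from typing import Any, Callable, Dict

# Hypothetical in-process registry standing in for the engine's function routing.
REGISTRY: Dict[str, Callable[[dict], Any]] = {}


def register(name: str):
    """Register a handler under a function ID such as 'inference::run_inference'."""
    def decorator(fn):
        REGISTRY[name] = fn
        return fn
    return decorator


def call(name: str, payload: dict) -> Any:
    """Dispatch to a registered handler (a local stand-in for an RPC call)."""
    return REGISTRY[name](payload)


@register("inference::run_inference")
def run_inference(payload: dict) -> dict:
    # Placeholder for the model call that runs on the inference worker.
    last_user = payload["messages"][-1]["content"]
    return {"result": f"(model output for: {last_user})"}


@register("inference::get_response")
def get_response(payload: dict) -> dict:
    # Caller-worker function that forwards the request to the inference worker.
    return call("inference::run_inference", payload)


@register("http::run_inference_over_http")
def run_inference_over_http(request: dict) -> dict:
    # HTTP-facing function: unwrap the body, call downstream, wrap the result.
    messages = request["body"]["messages"]
    output = call("inference::get_response", {"messages": messages})
    return {"status_code": 200, "body": {"result": output}}


if __name__ == "__main__":
    req = {"body": {"messages": [{"role": "user", "content": "what color is apple?"}]}}
    print(run_inference_over_http(req))
```

The doubly nested `result` key in the output comes from the HTTP-facing function wrapping the inner function's own `result` field.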
| Resource | Subnet | Public IP | Inbound from internet |
|---|---|---|---|
| Gateway VM | Public (10.20.0.0/24) | ✅ Yes | Port 3111 (API), 22 (SSH) |
| Caller Worker VM | Private (10.20.1.0/24) | ❌ No | None |
| Inference Worker VM | Private (10.20.1.0/24) | ❌ No | None |
Workers communicate with the gateway over WebSocket RPC (port 49134) within the VPC. They reach the internet only through the NAT Gateway (for package installs and model downloads during bootstrap).
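The exposure rules in the table can be expressed as a small check using Python's standard `ipaddress` module. This is a simplified model, not the Terraform security-group definition: it applies rules per subnet rather than per instance, and the host addresses in the usage lines are hypothetical.

```python
import ipaddress

# CIDR blocks taken from the architecture description.
VPC = ipaddress.ip_network("10.20.0.0/16")
PUBLIC_SUBNET = ipaddress.ip_network("10.20.0.0/24")
PRIVATE_SUBNET = ipaddress.ip_network("10.20.1.0/24")

# Ports reachable from the internet, per the table: API and SSH on the
# gateway's subnet, nothing on the workers' subnet.
INTERNET_INBOUND = {"public": {3111, 22}, "private": set()}


def classify(ip: str) -> str:
    """Return which part of the network an address belongs to."""
    addr = ipaddress.ip_address(ip)
    if addr in PUBLIC_SUBNET:
        return "public"
    if addr in PRIVATE_SUBNET:
        return "private"
    if addr in VPC:
        return "vpc-unallocated"
    return "external"


def internet_can_reach(dest_ip: str, port: int) -> bool:
    """True if the table allows inbound internet traffic to dest_ip:port."""
    return port in INTERNET_INBOUND.get(classify(dest_ip), set())


if __name__ == "__main__":
    # Hypothetical host addresses inside each subnet.
    print(internet_can_reach("10.20.0.10", 3111))   # gateway API port
    print(internet_can_reach("10.20.0.10", 49134))  # RPC port stays VPC-internal
    print(internet_can_reach("10.20.1.20", 3111))   # worker in the private subnet
```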
POST http://<GATEWAY_PUBLIC_IP>:3111/v1/chat/completions
curl --max-time 420 \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"what color is apple?"}]}' \
  http://<GATEWAY_PUBLIC_IP>:3111/v1/chat/completions

Request body:
{
  "messages": [
    { "role": "user", "content": "what color is apple?" }
  ]
}

The messages array follows the OpenAI chat format: each message has a role (user or assistant) and content (a string).
{
  "result": {
    "result": "Apple is typically red."
  }
}

Note: The model is a tiny 270M-parameter SLM. Output quality is limited; this deployment demonstrates the distributed infrastructure, not production-grade inference.
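For callers that prefer a script over `curl`, the same request can be made with Python's standard library. This is a minimal sketch based on the request and response shapes shown above; the helper names (`build_request`, `extract_text`, `ask`) are illustrative, and the 420-second timeout simply mirrors the `--max-time` value in the `curl` example.

```python
import json
import urllib.request


def build_request(gateway_ip: str, prompt: str) -> urllib.request.Request:
    """Build the POST request for a single user message."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"http://{gateway_ip}:3111/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body: bytes) -> str:
    """Pull the generated text out of the nested {"result": {"result": ...}} body."""
    return json.loads(response_body)["result"]["result"]


def ask(gateway_ip: str, prompt: str, timeout: float = 420.0) -> str:
    """Send the prompt to the gateway and return the model's text."""
    request = build_request(gateway_ip, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_text(resp.read())
```

Calling `ask("<GATEWAY_PUBLIC_IP>", "what color is apple?")` performs the network request; the other two helpers can be exercised offline.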
- An AWS account with programmatic access configured (`aws configure`)
- Terraform ≥ 1.5
- An EC2 key pair (optional, for SSH access to debug)
# 1. Clone the repository
git clone https://github.com/milinddethe15/slm-deployment.git
cd slm-deployment/terraform/aws
# 2. Initialize Terraform
terraform init
# 3. Deploy (takes ~5 minutes for infra + ~5 minutes for cloud-init bootstrap)
terraform apply
# 4. Note the gateway public IP from the output
terraform output gateway_public_ip
# 5. Wait ~5-7 minutes for cloud-init to finish on all VMs, then test (a response takes around 30 seconds)
curl --max-time 420 \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What color is apple?"}]}' \
  http://$(terraform output -raw gateway_public_ip):3111/v1/chat/completions

# Tear down all resources when finished
terraform destroy

| Variable | Default | Description |
|---|---|---|
| `aws_region` | `us-east-1` | AWS region |
| `repo_url` | this repo | Git repo to clone on each VM |
| `ssh_key_name` | `""` | EC2 key pair name (optional) |
| `instance_type_gateway` | `c7i-flex.large` | Gateway instance type |
| `instance_type_worker` | `c7i-flex.large` | Worker instance type |
| `inference_root_volume_size` | `50` | Root volume size (GiB) for the inference worker |
| `allowed_ssh_cidr` | `0.0.0.0/0` | CIDR allowed to SSH |
Each VM's user-data script (cloud-init) automatically:
- Installs system packages (`git`, `node`, `python3.11`, etc.)
- Installs the iii CLI
- Clones this repository to `/opt/slm-deployment`
- Runs the role-specific bootstrap script:
  - Gateway: Installs the `iii-engine.service` systemd unit, starts the iii engine with the gateway config
  - Caller worker: Runs `npm install && npm run build`, installs `caller-worker.service`, connects to gateway via `III_URL`
  - Inference worker: Installs Python dependencies, installs `inference-worker.service`, downloads the model from HuggingFace, connects to gateway via `III_URL`
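Because the bootstrap takes several minutes, a client may want to poll the gateway instead of sleeping for a fixed time. The sketch below retries a TCP connection to port 3111 using only the standard library. The helper names and the retry defaults are assumptions, and an open port only shows that the engine is listening, not that both workers have registered.

```python
import socket
import time
from typing import Callable


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until(
    check: Callable[[], bool],
    attempts: int = 60,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    # Replace the placeholder with the value of `terraform output -raw gateway_public_ip`.
    host = "<GATEWAY_PUBLIC_IP>"
    ready = wait_until(lambda: port_open(host, 3111))
    print("gateway listening" if ready else "gave up waiting")
```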
# SSH to gateway (public IP)
ssh -i <key.pem> ec2-user@<GATEWAY_PUBLIC_IP>
# SSH to workers (through gateway as bastion)
ssh -J ec2-user@<GATEWAY_PUBLIC_IP> ec2-user@<WORKER_PRIVATE_IP>
# Check service status
sudo systemctl status iii-engine.service # gateway
sudo systemctl status caller-worker.service # caller VM
sudo systemctl status inference-worker.service # inference VM
# View logs
sudo journalctl -u iii-engine.service -f
sudo journalctl -u caller-worker.service -f
sudo journalctl -u inference-worker.service -f
# Check cloud-init (first-boot) log
sudo tail -n 100 /var/log/cloud-init-output.log
# Verify listeners on gateway
ss -tlnp | grep -E '3111|49134'
# Check registered workers (via engine logs)
sudo journalctl -u iii-engine.service | grep "Worker registered"

slm-deployment/
├── README.md ← You are here
├── distributed-inference/
│ ├── config.yaml ← iii engine config (local dev)
│ └── workers/
│ ├── caller-worker/ ← TypeScript worker (HTTP trigger + RPC fan-out)
│ │ ├── src/worker.ts
│ │ ├── iii.worker.yaml
│ │ └── package.json
│ └── inference-worker/ ← Python worker (model inference)
│ ├── inference_worker.py
│ ├── iii.worker.yaml
│ └── requirements.txt
└── terraform/
└── aws/
├── main.tf ← VPC, subnets, SGs, EC2 instances
├── variables.tf ← Configurable parameters
├── outputs.tf ← Gateway IP, subnet IDs
├── config/
│ └── gateway.yaml ← iii engine config for the gateway VM
├── scripts/
│ ├── bootstrap-gateway.sh
│ ├── bootstrap-caller-worker.sh
│ ├── bootstrap-inference-worker.sh
│ └── launch-iii.sh
├── systemd/
│ ├── iii-engine.service
│ ├── caller-worker.service
│ └── inference-worker.service
└── templates/
├── bootstrap-gateway.sh.tftpl
├── bootstrap-caller-worker.sh.tftpl
└── bootstrap-inference-worker.sh.tftpl
If this were going to production, I would address the following:
Security:
- Place the API behind an ALB with TLS termination (HTTPS) — currently traffic is unencrypted HTTP.
- Add API authentication (API keys, JWT, or OAuth) — currently the endpoint is completely open.
- Restrict `allowed_ssh_cidr` to a VPN or bastion CIDR instead of `0.0.0.0/0`.
- Use AWS Secrets Manager or SSM Parameter Store for sensitive configuration instead of env files on disk.
- Enable VPC Flow Logs for network audit trails.
- Run instances with a hardened AMI and apply security patches via SSM Patch Manager.
Reliability:
- Put each worker behind an Auto Scaling Group with health checks so crashed workers are automatically replaced.
- Add a health-check endpoint (`GET /health`) that verifies the full RPC chain is functional.
- Use a proper process manager (or container orchestration) with liveness probes instead of bare systemd restart loops.
- Set up CloudWatch alarms for CPU, memory, and disk on each instance.
- Use EBS snapshots or a shared EFS volume for the model weights so every new inference worker doesn't have to re-download from HuggingFace.
Observability:
- Export iii-observability spans to an external backend (Jaeger, Datadog, or AWS X-Ray) instead of in-memory.
- Ship journalctl logs to CloudWatch Logs or an ELK stack for centralized debugging.
- Add structured metrics for request latency, token throughput, and error rates.
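As an illustration of the structured metrics mentioned in the last bullet, the sketch below aggregates latency, token throughput, and error rate in memory. The class and field names are hypothetical. The p95 uses the nearest-rank method, and throughput is computed against the summed request time, which assumes requests are processed one at a time.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class RequestMetrics:
    """In-memory aggregator for per-request latency, tokens, and errors."""
    latencies_s: List[float] = field(default_factory=list)
    tokens: int = 0
    errors: int = 0

    def record(self, latency_s: float, tokens_out: int = 0, ok: bool = True) -> None:
        """Record one finished request."""
        self.latencies_s.append(latency_s)
        self.tokens += tokens_out
        if not ok:
            self.errors += 1

    def summary(self) -> dict:
        """Return request count, error rate, p95 latency, and tokens per second."""
        n = len(self.latencies_s)
        if n == 0:
            return {"requests": 0}
        ordered = sorted(self.latencies_s)
        # Nearest-rank percentile: the value at position ceil(0.95 * n), 1-indexed.
        p95 = ordered[math.ceil(0.95 * n) - 1]
        total_time = sum(ordered)
        return {
            "requests": n,
            "error_rate": self.errors / n,
            "p95_latency_s": p95,
            "tokens_per_s": self.tokens / total_time if total_time else 0.0,
        }
```

A real deployment would more likely emit these values to CloudWatch or a metrics backend than hold them in process memory.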
If the model were 100× larger (~27B parameters), the deployment would change significantly:
Compute:
- GPU instances would be mandatory (e.g., `g5.xlarge` with a single 24 GB A10G, or `p4d.24xlarge` with eight A100s for very large models). CPU inference at 27B parameters is impractically slow.
- Model parallelism (tensor or pipeline parallelism) might be needed if the model doesn't fit in a single GPU's VRAM. This would require frameworks like vLLM, TGI (Text Generation Inference), or DeepSpeed.
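Whether the model fits on one GPU comes down to simple arithmetic: parameters times bytes per parameter, plus headroom for the KV cache and activations. The sketch below is a rough estimate only. The 20% overhead figure is an assumption, 8-bit weights are approximated as exactly one byte per parameter, and VRAM sizes are treated as GiB.

```python
def weight_memory_gib(params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in GiB."""
    return params * bytes_per_param / 2**30


def fits_on_gpu(
    params: float,
    bytes_per_param: float,
    vram_gib: float,
    overhead_fraction: float = 0.2,  # assumed headroom for KV cache and activations
) -> bool:
    """Rough check: do the weights plus assumed overhead fit in the given VRAM?"""
    needed = weight_memory_gib(params, bytes_per_param) * (1 + overhead_fraction)
    return needed <= vram_gib


if __name__ == "__main__":
    print(f"270M @ 8-bit : {weight_memory_gib(270e6, 1):.2f} GiB")
    print(f"27B  @ fp16  : {weight_memory_gib(27e9, 2):.1f} GiB")
    print(f"27B  @ 4-bit : {weight_memory_gib(27e9, 0.5):.1f} GiB")
    print("27B fp16 on a 24 GiB GPU :", fits_on_gpu(27e9, 2, 24))
    print("27B 4-bit on a 24 GiB GPU:", fits_on_gpu(27e9, 0.5, 24))
```

Under these assumptions, a 27B model in fp16 needs roughly 50 GiB for weights alone, so it would need either quantization or more than one 24 GB GPU.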
Architecture:
- Replace the bare Python worker with a dedicated inference server (vLLM, TGI, or Triton) that handles continuous batching and KV-cache management for throughput.
- Add a request queue (SQS or Redis) between the caller-worker and the inference tier to absorb traffic spikes and prevent OOM on the GPU.
- Separate the model artifact from the instance — store weights in S3 and mount via FSx for Lustre or pre-bake an AMI with the model to avoid long cold-start times.
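The request-queue idea can be sketched with a bounded in-process queue: once the backlog is full, new requests are rejected immediately rather than piling up on the GPU. This is a local stand-in for SQS or Redis, not an integration with either, and the class name and `max_pending` parameter are hypothetical.

```python
import queue
from typing import Optional


class BoundedRequestQueue:
    """Fixed-capacity FIFO buffer between the caller tier and the inference tier."""

    def __init__(self, max_pending: int):
        self._q: "queue.Queue[dict]" = queue.Queue(maxsize=max_pending)

    def submit(self, request: dict) -> bool:
        """Enqueue a request; return False if the backlog is full.

        A False return is the signal for the caller to shed load,
        for example by answering with HTTP 429 or 503.
        """
        try:
            self._q.put_nowait(request)
            return True
        except queue.Full:
            return False

    def next_request(self) -> Optional[dict]:
        """Return the oldest pending request, or None if the queue is empty."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None

    def depth(self) -> int:
        """Approximate number of pending requests."""
        return self._q.qsize()
```

Sizing `max_pending` is a trade-off: a larger backlog absorbs bigger spikes but lets queued requests wait longer before they are served.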
Cost:
- Use Spot Instances for the inference tier (with graceful draining) to cut GPU costs, potentially by 60-70% depending on instance type, region, and current Spot pricing.
- Implement autoscaling based on GPU utilization or queue depth.
- Consider a managed inference service (e.g., SageMaker endpoints with autoscaling) if traffic is bursty, to avoid paying for idle GPUs, or purpose-built accelerators such as AWS Inferentia to lower the cost per inference.
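Autoscaling on queue depth reduces to a target-tracking calculation: divide the backlog by the number of requests one worker should hold, round up, and clamp to the allowed range. The function below is a sketch of that arithmetic only. The parameter names and default bounds are assumptions, and it does not call any AWS API.

```python
import math


def desired_workers(
    queue_depth: int,
    target_per_worker: int,
    min_workers: int = 1,
    max_workers: int = 8,
) -> int:
    """Number of inference workers needed to keep the backlog per worker at target."""
    if target_per_worker <= 0:
        raise ValueError("target_per_worker must be positive")
    needed = math.ceil(queue_depth / target_per_worker)
    # Clamp so the tier never scales to zero or beyond the budgeted maximum.
    return max(min_workers, min(max_workers, needed))
```

For example, with a target of 4 queued requests per worker, a backlog of 10 yields 3 workers, while an empty queue stays at the configured minimum.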
Networking:
- Enable Elastic Fabric Adapter (EFA) or placement groups if doing multi-GPU inference across nodes to minimize inter-node latency.
- Use larger instance types with higher network bandwidth for model weight transfers.