This repository contains the complete, production-grade Infrastructure-as-Code (IaC), worker implementations, and service orchestration files for deploying a cross-language (Python + TypeScript) secure inference pipeline on AWS.
The architecture isolates the worker virtual machines inside an entirely private subnet with no public IP allocation. The workers communicate over WebSocket RPC with the iii orchestration engine, which runs on a public-facing Gateway VM and exposes model queries via a JSON HTTP API.
Below is the distributed routing design of the system, showing network flow, security group barriers, and private-to-public mapping.
flowchart TD
subgraph WAN [Public Internet]
User([Public Client])
end
subgraph VPC [AWS VPC - 10.0.0.0/16]
subgraph PublicSubnet [Public Subnet - 10.0.1.0/24]
GatewayVM["Gateway VM (Engine)<br/>Private: 10.0.1.x<br/>Public: 54.x.x.x"]
NAT["AWS NAT Gateway"]
end
subgraph PrivateSubnet [Private Subnet - 10.0.2.0/24]
TSWorker["TypeScript Caller Worker<br/>Private: 10.0.2.x<br/>(No Public IP)"]
PyWorker["Python Math Worker<br/>Private: 10.0.2.x<br/>(No Public IP)"]
end
end
subgraph InternetGW [AWS Internet Gateway]
IGW["IGW"]
end
%% Network Flow Mappings
User <-->|1. HTTP POST Port 3111| IGW
IGW <--> GatewayVM
%% RPC Websockets
TSWorker <-->|2. ws://10.0.1.x:49134| GatewayVM
PyWorker <-->|3. ws://10.0.1.x:49134| GatewayVM
%% NAT Gateways for Private Subnet Outbound
TSWorker -.->|Outbound package download| NAT
PyWorker -.->|Outbound package download| NAT
NAT --> IGW
- Gateway VM (Public Subnet, 10.0.1.x):
  - Runs the central iii orchestration engine.
  - Opens port 3111 to the internet to listen for JSON HTTP API requests.
  - Binds port 49134 privately within the VPC to receive WebSocket connection requests from workers in the private subnet.
  - Supervises the built-in iii-state, iii-http, and iii-queue tasks.
- TypeScript Caller Worker VM (Private Subnet, 10.0.2.x):
  - Has no public IP and cannot be reached from the internet.
  - Connects outbound to ws://<Gateway_Private_IP>:49134.
  - Registers the trigger /math/add-two-numbers (HTTP Trigger) and the function math::add_two_numbers (RPC).
- Python Math Worker VM (Private Subnet, 10.0.2.x):
  - Has no public IP and cannot be reached from the internet.
  - Connects outbound to ws://<Gateway_Private_IP>:49134.
  - Registers the function math::add (RPC) which handles math/inference.
  - Performs key-value lookups and updates via the Gateway's iii-state worker to maintain running_total persistent state.
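The addressing scheme described above can be sanity-checked with Python's standard ipaddress module. This is only an illustrative sketch: the CIDR blocks come from the diagram, while the classify helper and its labels are hypothetical names introduced here and are not part of the repository.

```python
import ipaddress

# CIDR blocks taken from the architecture diagram.
VPC = ipaddress.ip_network("10.0.0.0/16")
PUBLIC_SUBNET = ipaddress.ip_network("10.0.1.0/24")
PRIVATE_SUBNET = ipaddress.ip_network("10.0.2.0/24")


def classify(ip: str) -> str:
    """Return which part of the VPC an address falls into (hypothetical helper)."""
    addr = ipaddress.ip_address(ip)
    if addr in PUBLIC_SUBNET:
        return "public-subnet"
    if addr in PRIVATE_SUBNET:
        return "private-subnet"
    return "inside-vpc" if addr in VPC else "outside-vpc"


# Both subnets must sit inside the VPC range and must not overlap each other.
assert PUBLIC_SUBNET.subnet_of(VPC) and PRIVATE_SUBNET.subnet_of(VPC)
assert not PUBLIC_SUBNET.overlaps(PRIVATE_SUBNET)

print(classify("10.0.1.200"))  # an address in the Gateway's range
print(classify("10.0.2.14"))   # an address in the workers' range
```

A check like this is mainly useful when changing the CIDR variables, since an overlap or a subnet outside the VPC range would otherwise only surface at apply time.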
.
├── README.md                      # Architecture diagram, instructions, and scaling essay
├── workers/
│   ├── math-worker/               # Python worker wrapping state persistence
│   │   ├── math_worker.py         # Main execution file
│   │   ├── requirements.txt       # Dependencies (iii-sdk, watchfiles)
│   │   └── iii.worker.yaml        # Worker configuration
│   └── caller-worker/             # TypeScript worker exposing the http trigger
│       ├── src/
│       │   └── worker.ts          # Main script implementing cross-worker triggering
│       ├── package.json           # TS packages & dependencies
│       ├── tsconfig.json          # TS compiler rules
│       └── iii.worker.yaml        # Worker configuration
├── infra/                         # Infrastructure-as-Code (Terraform)
│   ├── main.tf                    # Core VPC, subnetting, EIP, routing, and NAT GW
│   ├── security_groups.tf         # Strictly isolated firewall groups
│   ├── instances.tf               # Compute instances & automatic user_data bootstrapping
│   ├── variables.tf               # Region, CIDRs, and size variables
│   └── outputs.tf                 # Public IPs and auto-generated curl command outputs
└── deployment/                    # Local deployment scripts and systemd templates
    ├── config.yaml                # Engine YAML config
    ├── systemd/
    │   ├── iii-engine.service     # Systemd daemon for Gateway Engine
    │   ├── caller-worker.service  # Systemd daemon for TS Worker
    │   └── math-worker.service    # Systemd daemon for Python Worker
    └── bootstrap.sh               # Independent multi-cloud bootstrap automation script
Deployment is fully automated using Terraform. Bootstrapping commands, dependency installs, systemd service configuration, and code mounting are automatically injected during instance boot via user_data shell scripts.
- AWS CLI configured on your host machine with valid administrative permissions.
- Terraform (>= 1.0.0) installed locally.
Navigate to the infra/ folder and initialize the required provider plugins:
cd infra
terraform init

Preview the AWS resources to be created:

terraform plan

Run the apply command. The script will dynamically generate a unique SSH RSA Key Pair, configure the network VPC, route tables, security groups, and launch all three instances:

terraform apply -auto-approve

Upon successful deployment, Terraform will output key values:
gateway_public_ip = "54.210.45.89"
gateway_private_ip = "10.0.1.200"
ts_worker_private_ip = "10.0.2.14"
python_worker_private_ip = "10.0.2.220"
curl_test_command = "curl -X POST http://54.210.45.89:3111/math/add-two-numbers -H 'Content-Type: application/json' -d '{\"a\": 100, \"b\": 200}'"
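The same test command can be derived programmatically from the machine-readable form of these outputs. The sketch below assumes the data comes from `terraform output -json`, which wraps each output in an object with a "value" field; build_curl_command is a hypothetical helper, not something shipped in this repository.

```python
import json
import shlex


def build_curl_command(outputs: dict, a: int, b: int) -> str:
    """Build the test command from parsed `terraform output -json` data."""
    # Each output is wrapped as {"value": ..., "type": ..., "sensitive": ...}.
    host = outputs["gateway_public_ip"]["value"]
    url = f"http://{host}:3111/math/add-two-numbers"
    body = json.dumps({"a": a, "b": b})
    return (
        f"curl -X POST {url} "
        f"-H 'Content-Type: application/json' "
        f"-d {shlex.quote(body)}"
    )


# Sample data shaped like `terraform output -json`, using the example IP above.
sample = json.loads(
    '{"gateway_public_ip": {"sensitive": false, "type": "string", "value": "54.210.45.89"}}'
)
print(build_curl_command(sample, 100, 200))
```

In practice the dictionary would be obtained by running `terraform output -json` inside infra/ and passing its stdout to json.loads.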
To extract the generated SSH private key and connect to the Gateway VM:

terraform output -raw ssh_private_key_pem > id_rsa
chmod 400 id_rsa
ssh -i id_rsa ubuntu@<gateway_public_ip>

Once the system completes its startup bootstrap (allow 1–2 minutes after VM launch for npm install and virtual environment generation), verify connectivity and functionality.
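Rather than waiting a fixed amount of time, a small poller can detect when the Gateway starts accepting connections. This is a minimal sketch using only the standard library; wait_for_port and its default timings are assumptions chosen for illustration, and a successful TCP connect only shows that the port is open, not that both workers have registered.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 180.0, interval_s: float = 5.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=min(interval_s, 5.0)):
                return True
        except OSError:
            # Refused, unreachable, or timed out: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For this deployment the call would look like `wait_for_port("<GATEWAY_PUBLIC_IP>", 3111)`, followed by the curl test once it returns True.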
Hit the JSON HTTP API exposed by the Gateway VM using the curl test command output by Terraform:
curl -X POST http://<GATEWAY_PUBLIC_IP>:3111/math/add-two-numbers \
-H 'Content-Type: application/json' \
-d '{"a": 100, "b": 200}'

You should receive a successful JSON response matching the schema below:
{
"c": 300,
"running_total": 300,
"success": "You've connected two workers and they're interoperating seamlessly, now let's add a few more workers to expand this project's functionality."
}

If you repeat the command with different values:
curl -X POST http://<GATEWAY_PUBLIC_IP>:3111/math/add-two-numbers \
-H 'Content-Type: application/json' \
-d '{"a": 50, "b": 50}'

You should see:
{
"c": 100,
"running_total": 400,
"success": "You've connected two workers and they're interoperating seamlessly, now let's add a few more workers to expand this project's functionality."
}

The stateful running_total resides inside the central Gateway's SQLite-backed engine (iii-state) and increments correctly across cross-language RPC boundaries.
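The accumulation behaviour seen in the two responses can be modelled in a few lines. The sketch below is not the repository's math_worker.py and does not use the iii SDK: InMemoryState is a hypothetical stand-in for the key-value store that iii-state provides, and add mirrors only the arithmetic of the response fields.

```python
class InMemoryState:
    """Hypothetical stand-in for the key-value store that iii-state provides."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key: str, default: int = 0) -> int:
        return self._data.get(key, default)

    def set(self, key: str, value: int) -> None:
        self._data[key] = value


def add(state: InMemoryState, a: int, b: int) -> dict:
    """Add two numbers and fold the sum into the persisted running_total."""
    c = a + b
    total = state.get("running_total") + c  # read the previous total
    state.set("running_total", total)       # write the updated total back
    return {"c": c, "running_total": total}


state = InMemoryState()
print(add(state, 100, 200))  # {'c': 300, 'running_total': 300}
print(add(state, 50, 50))    # {'c': 100, 'running_total': 400}
```

Note that a read-then-write update like this is not atomic, so with several workers updating the same key concurrently the real store would need an atomic increment or a similar safeguard.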
Log in to the Gateway VM via SSH and check the active engine traces:
journalctl -u iii-engine -f -n 50

To troubleshoot the TypeScript or Python workers in the private subnet, jump into them from the Gateway VM (this assumes the id_rsa key file is present on the Gateway VM):

# From Gateway VM, SSH into TypeScript worker
ssh -i id_rsa ubuntu@10.0.2.14
journalctl -u caller-worker -f

Before deploying this system in a critical environment, implement the following architectural enhancements:
- Security & Transport Encryption:
  - HTTPS/WSS: Expose the Gateway API through a load balancer (such as an AWS ALB) with an ACM SSL/TLS certificate. Enable wss:// (secure WebSockets) so that all RPC communication within the VPC is fully encrypted.
  - Authentication and Authorization: Add a custom iii middleware worker to inspect incoming HTTP headers and validate JWT (JSON Web Tokens) or API tokens before letting triggers reach the TS worker.
  - VPC Endpoints: Establish AWS PrivateLink VPC endpoints for standard AWS services (such as SSM, CloudWatch, and S3) so that worker VMs never need to cross the NAT gateway for cloud operations.
- High Availability & State Offloading:
  - Externalized Distributed State: Currently, iii-state uses a local SQLite database (state_store.db) on the Gateway VM. For high availability, configure the state worker adapter to back its key-value store using AWS ElastiCache for Redis or a cluster-replicated key-value store. This ensures the Gateway VM can crash or scale horizontally without losing state.
  - Gateway Load Balancing: Deploy the central iii engine inside an Autoscaling Group behind an Application Load Balancer (ALB).
- Observability & Logging:
  - Centralized Telemetry: Leverage the engine's built-in iii-observability worker. Configure its exporter to pipe OpenTelemetry logs, metrics, and traces into AWS CloudWatch, Datadog, or a private Prometheus + Jaeger stack.
  - Structured Tracing: Map individual HTTP requests to unique trace IDs that follow the RPC context through the TS caller and the Python worker, creating easy-to-read latency flame graphs.
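To make the token-validation idea in the authentication item concrete, here is a sketch of HS256 JWT verification using only the standard library. It is an illustration of the check itself, under the assumption of a shared-secret HS256 scheme; it does not show how a middleware worker is registered with iii, the function names are hypothetical, and a production deployment would more likely rely on a vetted library such as PyJWT.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_decode(segment: str) -> bytes:
    # JWT segments are base64url without padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Create an HS256 token (used here only to produce example input)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(sig)}"


def verify_hs256(token: str, secret: bytes, now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if the signature and expiry are valid, otherwise None."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        if header.get("alg") != "HS256":  # reject "none" and unexpected algorithms
            return None
        expected = hmac.new(
            secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
        ).digest()
        # Constant-time comparison avoids leaking signature bytes via timing.
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            return None
        claims = json.loads(_b64url_decode(payload_b64))
        exp = claims.get("exp")
        current = time.time() if now is None else now
        if exp is not None and current >= exp:
            return None
        return claims
    except (ValueError, TypeError, AttributeError):
        # Malformed structure, bad base64, bad JSON, or non-object segments.
        return None
```

A middleware would extract the token from the Authorization header, call verify_hs256, and reject the request when it returns None.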
If the lightweight quickstart SLM were replaced with a model 100x larger (e.g., Llama-3-70B, Qwen-2.5-72B, or DeepSeek-V3/R1), the current architecture would immediately crash due to CPU and system RAM exhaustion. Scaling up to a model of this magnitude requires re-engineering the compute, model serving, and request scheduling architectures.
- GPU Instances: Standard CPU-only instances cannot run large LLM inference at acceptable latencies (tokens per second). The Python worker must run on dedicated GPU-accelerated instances (e.g., AWS g5.12xlarge with 4x NVIDIA A10G, or p4d.24xlarge with 8x NVIDIA A100s).
- Quantization: Run models using quantization formats (such as GPTQ, AWQ, or FP8) to reduce VRAM requirements by 50% to 75% relative to FP16 weights, usually with only a small loss in model accuracy.
- Distributed Execution: Use Tensor Parallelism (splitting the model weights horizontally across multiple GPUs in the same server via NVLink) and Pipeline Parallelism (splitting model layers sequentially across different nodes) to fit and run models that exceed the capacity of a single GPU.
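The sizing argument above comes down to simple arithmetic on the weights alone. The sketch below assumes a 70-billion-parameter model and counts only weight storage; KV cache, activations, and framework overhead come on top, so real requirements are higher. The helper name is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS_70B = 70e9  # assumed parameter count for a 70B-class model

for label, bits in [("FP16", 16), ("FP8", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(PARAMS_70B, bits):.0f} GB")

# Aggregate VRAM of the instance types named above.
G5_12XLARGE_GB = 4 * 24   # 4x A10G, 24 GB each
P4D_24XLARGE_GB = 8 * 40  # 8x A100, 40 GB each

print("FP16 fits on g5.12xlarge:", weight_memory_gb(PARAMS_70B, 16) < G5_12XLARGE_GB)
print("4-bit fits on g5.12xlarge:", weight_memory_gb(PARAMS_70B, 4) < G5_12XLARGE_GB)
```

Under these assumptions the FP16 weights (140 GB) exceed the 96 GB available on a g5.12xlarge, while 8-bit (70 GB) and 4-bit (35 GB) variants fit, which is where the 50% to 75% reduction matters.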
Instead of running the model inside a simple synchronous Python loop in the worker (which serializes requests in a single process and is further constrained by Python's Global Interpreter Lock), decouple inference using a production-grade inference server like vLLM or Hugging Face Text Generation Inference (TGI):
- PagedAttention: Prevents severe VRAM fragmentation by storing the KV (Key-Value) cache in non-contiguous physical memory blocks, which allows larger batches and substantially higher throughput (the vLLM authors report 2-4x over earlier serving systems at comparable latency).
- Continuous Batching: Dynamically adds incoming requests to the running batch between generation steps rather than waiting for entire sequences to complete, significantly improving GPU utilization.
- Decoupled API Routing: The Python Worker VM should act solely as a lightweight coordinator client. It receives the RPC request from the iii engine and makes a low-latency gRPC or HTTP call to a vLLM server cluster running inside the private subnet.
flowchart TD
Engine[iii Engine Gateway] <-->|RPC Websocket| PyWorker[Python Coordinator Worker]
PyWorker <-->|gRPC Load Balancer| vLLMCluster{vLLM Server Cluster}
vLLMCluster <--> GPU1[(Node 1: GPU A100 80GB)]
vLLMCluster <--> GPU2[(Node 2: GPU A100 80GB)]
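As a rough sketch of the coordinator role in this diagram, the worker could forward prompts over HTTP to vLLM's OpenAI-compatible completions route. The base URL below is a hypothetical internal address, the function names are illustrative, and the code deliberately omits the iii registration side, which is not shown in this document.

```python
import json
import urllib.request

# Hypothetical internal address of the vLLM load balancer in the private subnet.
VLLM_BASE_URL = "http://10.0.2.50:8000"


def build_completion_request(prompt: str, model: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{VLLM_BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, model: str) -> str:
    """Send the request and return the text of the first completion."""
    request = build_completion_request(prompt, model)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Keeping request construction separate from the network call makes the coordinator easy to test without a GPU cluster; only complete needs a live server.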
- Worker Pools: In iii, the central engine supports dynamic RPC routing to multiple worker instances registering the same function signature. We can provision an Autoscaling Group of Python Worker VMs inside the private subnet. When multiple workers register math::add, the iii engine automatically load-balances incoming RPC executions across the available worker WebSockets.
- Cold-Start Mitigation: Maintain a hot pool of GPU instances and utilize a shared Model Registry (such as AWS S3 or a local network EBS volume) with pre-cached model weights so that scaling up a new worker node does not require downloading hundreds of gigabytes of weights over the network.
- Queue and Rate Limiting: Configure the built-in iii-queue worker to queue incoming API requests during high-traffic spikes, ensuring that the GPU cluster never experiences Out-Of-Memory (OOM) failures under extreme load.
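The backpressure idea behind the queueing item can be illustrated with a concurrency cap. This sketch uses asyncio.Semaphore to bound in-flight work; it is not how iii-queue is configured or implemented, and run_with_limit, demo, and fake_inference are hypothetical names used only to show the principle.

```python
import asyncio


async def run_with_limit(jobs, max_in_flight: int):
    """Run coroutine factories with at most max_in_flight executing at once."""
    gate = asyncio.Semaphore(max_in_flight)

    async def guarded(job):
        async with gate:  # excess callers wait here instead of hitting the backend
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_inference():
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)  # track the highest observed concurrency
        await asyncio.sleep(0.01)    # stands in for a GPU call
        in_flight -= 1
        return "ok"

    results = await run_with_limit([fake_inference] * 10, max_in_flight=3)
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(demo())
    print(len(results), "requests served, peak concurrency", peak)
```

Capping in-flight requests at a level the GPUs can hold in memory turns a traffic spike into added latency rather than an OOM failure.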