Harsh-1165/iii-rpc-multi-vm-deploy

Alchemyst AI / iii.dev DevOps Inference Architecture

This repository contains the complete, production-grade Infrastructure-as-Code (IaC), worker implementations, and service orchestration files for deploying a cross-language (Python + TypeScript) secure inference pipeline on AWS.

The architecture places the worker virtual machines in a private subnet with no public IP addresses. The workers connect outbound over WebSocket RPC to the iii orchestration engine, which runs on a public-facing Gateway VM and exposes model queries through a JSON HTTP API.


1. System Architecture

Below is the distributed routing design of the system, showing network flow, security group barriers, and private-to-public mapping.

flowchart TD
    subgraph WAN [Public Internet]
        User([Public Client])
    end

    subgraph VPC [AWS VPC - 10.0.0.0/16]
        subgraph PublicSubnet [Public Subnet - 10.0.1.0/24]
            GatewayVM["Gateway VM (Engine)<br/>Private: 10.0.1.x<br/>Public: 54.x.x.x"]
            NAT["AWS NAT Gateway"]
        end

        subgraph PrivateSubnet [Private Subnet - 10.0.2.0/24]
            TSWorker["TypeScript Caller Worker<br/>Private: 10.0.2.x<br/>(No Public IP)"]
            PyWorker["Python Math Worker<br/>Private: 10.0.2.x<br/>(No Public IP)"]
        end
    end

    subgraph InternetGW [AWS Internet Gateway]
        IGW["IGW"]
    end

    %% Network Flow Mappings
    User <-->|1. HTTP POST Port 3111| IGW
    IGW <--> GatewayVM
    
    %% RPC Websockets
    TSWorker <-->|2. ws://10.0.1.x:49134| GatewayVM
    PyWorker <-->|3. ws://10.0.1.x:49134| GatewayVM

    %% NAT Gateways for Private Subnet Outbound
    TSWorker -.->|Outbound package download| NAT
    PyWorker -.->|Outbound package download| NAT
    NAT --> IGW

Component Roles & Communication Paths

  1. Gateway VM (Public Subnet, 10.0.1.x):
    • Runs the central iii orchestration engine.
    • Opens port 3111 to the internet to listen for JSON HTTP API requests.
    • Binds port 49134 privately within the VPC to receive WebSocket connection requests from workers in the private subnet.
    • Supervises the built-in iii-state, iii-http, and iii-queue tasks.
  2. TypeScript Caller Worker VM (Private Subnet, 10.0.2.x):
    • Has no public IP and cannot be reached from the internet.
    • Connects outbound to ws://<Gateway_Private_IP>:49134.
    • Registers the trigger /math/add-two-numbers (HTTP Trigger) and the function math::add_two_numbers (RPC).
  3. Python Math Worker VM (Private Subnet, 10.0.2.x):
    • Has no public IP and cannot be reached from the internet.
    • Connects outbound to ws://<Gateway_Private_IP>:49134.
    • Registers the function math::add (RPC) which handles math/inference.
    • Performs key-value lookups and updates via the Gateway's iii-state worker to maintain running_total persistent state.
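
The subnet layout described above can be sanity-checked with a few lines of Python. This is a minimal sketch using only the standard library; the CIDR blocks come from the diagram, while the helper name subnet_for is hypothetical and not part of this repository.

```python
import ipaddress

# CIDR blocks taken from the architecture diagram.
VPC = ipaddress.ip_network("10.0.0.0/16")
PUBLIC_SUBNET = ipaddress.ip_network("10.0.1.0/24")
PRIVATE_SUBNET = ipaddress.ip_network("10.0.2.0/24")


def subnet_for(ip: str) -> str:
    """Return 'public', 'private', or 'outside' for a given address."""
    addr = ipaddress.ip_address(ip)
    if addr in PUBLIC_SUBNET:
        return "public"
    if addr in PRIVATE_SUBNET:
        return "private"
    return "outside"


# Both subnets must sit inside the VPC and must not overlap each other.
layout_is_valid = (
    PUBLIC_SUBNET.subnet_of(VPC)
    and PRIVATE_SUBNET.subnet_of(VPC)
    and not PUBLIC_SUBNET.overlaps(PRIVATE_SUBNET)
)
```

A check like this is handy when changing the CIDR variables in infra/variables.tf, since an overlapping or out-of-range subnet would otherwise only surface at apply time.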

2. Codebase Directory Map

.
├── README.md                      # Architecture diagram, instructions, and scaling essay
├── workers/
│   ├── math-worker/               # Python worker wrapping state persistence
│   │   ├── math_worker.py         # Main execution file
│   │   ├── requirements.txt       # Dependencies (iii-sdk, watchfiles)
│   │   └── iii.worker.yaml        # Worker configuration
│   └── caller-worker/             # TypeScript worker exposing the http trigger
│       ├── src/
│       │   └── worker.ts          # Main script implementing cross-worker triggering
│       ├── package.json           # TS packages & dependencies
│       ├── tsconfig.json          # TS compiler rules
│       └── iii.worker.yaml        # Worker configuration
├── infra/                         # Infrastructure-as-Code (Terraform)
│   ├── main.tf                    # Core VPC, subnetting, EIP, routing, and NAT GW
│   ├── security_groups.tf         # Strictly isolated firewall groups
│   ├── instances.tf               # Compute instances & automatic user_data bootstrapping
│   ├── variables.tf               # Region, CIDRs, and size variables
│   └── outputs.tf                 # Public IPs and auto-generated curl command outputs
└── deployment/                    # Local deployment scripts and systemd templates
    ├── config.yaml                # Engine YAML config
    ├── systemd/
    │   ├── iii-engine.service     # Systemd daemon for Gateway Engine
    │   ├── caller-worker.service  # Systemd daemon for TS Worker
    │   └── math-worker.service    # Systemd daemon for Python Worker
    └── bootstrap.sh               # Independent multi-cloud bootstrap automation script

3. Quick Deployment Guide (IaC)

Deployment is fully automated using Terraform. Bootstrapping commands, dependency installs, systemd service configuration, and code mounting are automatically injected during instance boot via user_data shell scripts.

Prerequisites

  • AWS CLI configured on your host machine with valid administrative permissions.
  • Terraform (>= 1.0.0) installed locally.

Step 1: Initialize Terraform

Navigate to the infra/ folder and initialize the required provider plugins:

cd infra
terraform init

Step 2: Plan and Verify the Deployment

Preview the AWS resources to be created:

terraform plan

Step 3: Spin Up the Infrastructure

Run the apply command. Terraform generates a unique RSA SSH key pair, configures the VPC, route tables, and security groups, and launches all three instances:

terraform apply -auto-approve

Step 4: Extract the Outputs

Upon successful deployment, Terraform will output key values:

gateway_public_ip       = "54.210.45.89"
gateway_private_ip      = "10.0.1.200"
ts_worker_private_ip    = "10.0.2.14"
python_worker_private_ip = "10.0.2.220"
curl_test_command       = "curl -X POST http://54.210.45.89:3111/math/add-two-numbers -H 'Content-Type: application/json' -d '{\"a\": 100, \"b\": 200}'"

To SSH into the Gateway VM, first write the generated SSH private key to a local file:

terraform output -raw ssh_private_key_pem > id_rsa
chmod 400 id_rsa
ssh -i id_rsa ubuntu@<gateway_public_ip>
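
For scripted validation, the outputs can also be read as JSON (terraform output -json wraps each output in an object with a "value" field). The sketch below assumes that format and the gateway_public_ip output name shown above; build_test_url is a hypothetical helper, not part of this repository.

```python
import json


def build_test_url(outputs_json: str,
                   path: str = "/math/add-two-numbers",
                   port: int = 3111) -> str:
    """Build the API URL from the JSON printed by `terraform output -json`."""
    outputs = json.loads(outputs_json)
    # Each output is an object; the actual value lives under "value".
    gateway_ip = outputs["gateway_public_ip"]["value"]
    return f"http://{gateway_ip}:{port}{path}"


# Sample input mirroring the structure of `terraform output -json`.
sample = json.dumps({
    "gateway_public_ip": {
        "sensitive": False,
        "type": "string",
        "value": "54.210.45.89",
    }
})
test_url = build_test_url(sample)
```

In practice the input would come from running terraform output -json inside infra/ and passing its stdout to the function.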

4. Operational Validation

Once the system completes its startup bootstrap (allow 1–2 minutes after VM launch for npm install and virtual environment generation), verify connectivity and functionality.
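
Rather than waiting a fixed interval, a script can poll until the API answers. The following is a minimal sketch of such a readiness loop; probe is a hypothetical callable (for example, one that sends the test request and returns True on an HTTP 200), and the timeout and interval defaults are assumptions, not values taken from this repository.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool],
                     timeout_s: float = 180.0,
                     interval_s: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        # Pause between attempts so the bootstrap is not hammered.
        sleep(interval_s)
```

The sleep parameter is injectable only so the loop can be exercised in tests without real delays.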

Command Execution

Hit the JSON HTTP API exposed by the Gateway VM using the curl test command output by Terraform:

curl -X POST http://<GATEWAY_PUBLIC_IP>:3111/math/add-two-numbers \
  -H 'Content-Type: application/json' \
  -d '{"a": 100, "b": 200}'

Expected Response

You should receive a successful JSON response matching the schema below:

{
  "c": 300,
  "running_total": 300,
  "success": "You've connected two workers and they're interoperating seamlessly, now let's add a few more workers to expand this project's functionality."
}
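
The same request can be issued from Python with the standard library alone. This is a sketch equivalent to the curl command above; the function names are hypothetical, and the port and path are the ones documented in this README.

```python
import json
import urllib.request


def build_add_request(gateway_ip: str, a: float, b: float) -> urllib.request.Request:
    """Build the POST request that the curl command sends."""
    body = json.dumps({"a": a, "b": b}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{gateway_ip}:3111/math/add-two-numbers",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def add_two_numbers(gateway_ip: str, a: float, b: float,
                    timeout_s: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_add_request(gateway_ip, a, b)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling add_two_numbers("<GATEWAY_PUBLIC_IP>", 100, 200) against a running deployment should return a dictionary with the c, running_total, and success keys shown above.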

If you repeat the command with different values:

curl -X POST http://<GATEWAY_PUBLIC_IP>:3111/math/add-two-numbers \
  -H 'Content-Type: application/json' \
  -d '{"a": 50, "b": 50}'

You should see:

{
  "c": 100,
  "running_total": 400,
  "success": "You've connected two workers and they're interoperating seamlessly, now let's add a few more workers to expand this project's functionality."
}

The stateful running_total resides inside the central Gateway's SQLite-backed engine (iii-state) and increments correctly across cross-language RPC boundaries.
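
The arithmetic behind the two responses can be modelled in a few lines. This is an illustrative sketch only: InMemoryState is a hypothetical stand-in for a key-value store and does not reflect the iii-state API or the actual code in math_worker.py.

```python
class InMemoryState:
    """Hypothetical key-value store standing in for iii-state."""

    def __init__(self) -> None:
        self._data: dict = {}

    def get(self, key: str, default=0):
        return self._data.get(key, default)

    def set(self, key: str, value) -> None:
        self._data[key] = value


def add(state: InMemoryState, a: float, b: float) -> dict:
    """Add two numbers and fold the result into the persisted running total."""
    c = a + b
    running_total = state.get("running_total", 0) + c
    state.set("running_total", running_total)
    return {"c": c, "running_total": running_total}


state = InMemoryState()
first = add(state, 100, 200)   # {"c": 300, "running_total": 300}
second = add(state, 50, 50)    # {"c": 100, "running_total": 400}
```

Note that a read-then-write update like this is not atomic. If several workers update the same key concurrently, the store would need an atomic increment or a lock to keep the total correct.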

Troubleshooting Logs via Systemd

Log in to the Gateway VM via SSH and check the active engine traces:

journalctl -u iii-engine -f -n 50

To troubleshoot the TypeScript or Python workers in the private subnet, jump into them from the Gateway VM:

# From Gateway VM, SSH into TypeScript worker
ssh -i id_rsa ubuntu@10.0.2.14
journalctl -u caller-worker -f

5. Production Hardening Strategy

Before deploying this system in a critical environment, implement the following architectural enhancements:

  1. Security & Transport Encryption:

    • HTTPS/WSS: Expose the Gateway API through a load balancer (such as an AWS ALB) with an ACM SSL/TLS certificate. Enable wss:// (secure WebSockets) so that all RPC communication within the VPC is fully encrypted.
    • Authentication and Authorization: Add a custom iii middleware worker to inspect incoming HTTP headers and validate JWT (JSON Web Tokens) or API tokens before letting triggers reach the TS worker.
    • VPC Endpoints: Establish AWS PrivateLink VPC endpoints for standard AWS services (such as SSM, CloudWatch, and S3) so that worker VMs never need to cross the NAT gateway for cloud operations.
  2. High Availability & State Offloading:

    • Externalized Distributed State: Currently, iii-state uses a local SQLite database (state_store.db) on the Gateway VM. For high availability, configure the state worker adapter to back its key-value store using AWS ElastiCache for Redis or a cluster-replicated key-value store. This ensures the Gateway VM can crash or scale horizontally without losing state.
    • Gateway Load Balancing: Deploy the central iii engine inside an Autoscaling Group behind an Application Load Balancer (ALB).
  3. Observability & Logging:

    • Centralized Telemetry: Leverage the engine's built-in iii-observability worker. Configure its exporter to pipe OpenTelemetry logs, metrics, and traces into AWS CloudWatch, Datadog, or a private Prometheus + Jaeger stack.
    • Structured Tracing: Map individual HTTP requests to unique trace IDs that follow the RPC context through the TS caller and the Python worker, creating easy-to-read latency flame graphs.
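
As one illustration of the authentication item above, the header check itself is small. The sketch below validates a static API token from an Authorization: Bearer header using a constant-time comparison; it is not an iii middleware implementation, and the function names are hypothetical. JWT validation would additionally require signature and expiry checks.

```python
import hmac
from typing import Mapping, Optional


def extract_bearer_token(headers: Mapping[str, str]) -> Optional[str]:
    """Return the token from an 'Authorization: Bearer <token>' header, if any."""
    for name, value in headers.items():
        if name.lower() == "authorization":
            scheme, _, token = value.partition(" ")
            if scheme.lower() == "bearer" and token.strip():
                return token.strip()
    return None


def is_authorized(headers: Mapping[str, str], expected_token: str) -> bool:
    """Compare the presented token to the expected one in constant time."""
    token = extract_bearer_token(headers)
    if token is None:
        return False
    return hmac.compare_digest(token.encode("utf-8"),
                               expected_token.encode("utf-8"))
```

Using hmac.compare_digest instead of == avoids leaking how many leading characters of the token matched through response timing.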

6. Scaling to 100x Larger Models (Engineering Essay)

If the lightweight quickstart SLM were replaced with a model 100x larger (e.g., Llama-3-70B, Qwen-2.5-72B, or DeepSeek-V3/R1), the current CPU-only workers would run out of system RAM or be far too slow to serve requests. Scaling to a model of this size requires re-engineering the compute, model serving, and request scheduling layers.

A. Compute & VRAM Scaling

  • GPU Instances: Standard CPU-only instances cannot run large LLM inference at acceptable throughput (tokens per second). The Python worker must run on dedicated GPU-accelerated instances (e.g., AWS g5.12xlarge with 4x NVIDIA A10G, or p4d.24xlarge with 8x NVIDIA A100s).
  • Quantization: Run models in quantized formats (such as GPTQ, AWQ, or FP8) to cut weight VRAM by roughly 50% (8-bit) to 75% (4-bit) relative to FP16, usually with only a small loss in model accuracy.
  • Distributed Execution: Use Tensor Parallelism (splitting the model weights horizontally across multiple GPUs in the same server via NVLink) and Pipeline Parallelism (splitting model layers sequentially across different nodes) to fit and run models that exceed the capacity of a single GPU.
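
The sizing logic above reduces to simple arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below estimates weights only and ignores the KV cache, activations, and quantization metadata; the 0.8 headroom factor is an assumption, and the per-GPU memory figures (24 GB for the A10G, 40 GB for the A100 in p4d.24xlarge) should be checked against current AWS documentation.

```python
# Approximate bytes per parameter for each weight format.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, dtype: str) -> float:
    """Approximate weight size in GB (1e9 params * bytes / 1e9 bytes per GB)."""
    return params_billions * BYTES_PER_PARAM[dtype]


def fits(params_billions: float, dtype: str,
         gpu_count: int, gb_per_gpu: float, headroom: float = 0.8) -> bool:
    """Check whether the weights fit in a fraction of the total VRAM."""
    usable_gb = gpu_count * gb_per_gpu * headroom
    return weight_memory_gb(params_billions, dtype) <= usable_gb


fp16_70b = weight_memory_gb(70, "fp16")   # 140.0 GB
on_g5_fp16 = fits(70, "fp16", 4, 24)      # 140 GB vs 76.8 GB usable
on_g5_int4 = fits(70, "int4", 4, 24)      # 35 GB vs 76.8 GB usable
on_p4d_fp16 = fits(70, "fp16", 8, 40)     # 140 GB vs 256 GB usable
```

Under these assumptions a 70B model in FP16 does not fit on a g5.12xlarge but a 4-bit variant does, which is why quantization and multi-GPU parallelism are usually considered together.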

B. Dedicated Model Servers (vLLM / TGI)

Instead of running the model inside the Python worker process itself (where a single long-running inference call blocks other requests, and Python's Global Interpreter Lock limits CPU-bound threading), decouple inference into a production-grade inference server such as vLLM or Hugging Face Text Generation Inference (TGI):

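  • PagedAttention: Reduces VRAM fragmentation by storing the KV (Key-Value) cache in fixed-size blocks that need not be contiguous in memory. The reclaimed memory allows larger batches; the vLLM authors report several-fold throughput gains over serving stacks without it, with the exact factor depending on the baseline and workload.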
  • Continuous Batching: Dynamically groups incoming token requests mid-execution rather than waiting for entire sequences to complete, significantly improving GPU utilization.
  • Decoupled API Routing: The Python Worker VM should act solely as a lightweight coordinator client. It receives the RPC request from the iii engine and makes a low-latency gRPC or HTTP call to a vLLM server cluster running inside the private subnet.
flowchart TD
    Engine[iii Engine Gateway] <-->|RPC Websocket| PyWorker[Python Coordinator Worker]
    PyWorker <-->|gRPC Load Balancer| vLLMCluster{vLLM Server Cluster}
    vLLMCluster <--> GPU1[(Node 1: GPU A100 80GB)]
    vLLMCluster <--> GPU2[(Node 2: GPU A100 80GB)]
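
A coordinator worker of this kind mostly builds and forwards requests. The sketch below constructs a call to vLLM's OpenAI-compatible /v1/completions endpoint over HTTP; the base URL http://vllm.internal:8000 and the model name are hypothetical placeholders, and the iii RPC registration around it is intentionally omitted.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical internal endpoint and model name, for illustration only.
request = build_completion_request(
    "http://vllm.internal:8000/",
    "example-org/example-70b",
    "What is 100 + 200?",
)
```

Passing the request to urllib.request.urlopen would send it. Keeping construction separate from sending makes the coordinator easy to test without a GPU cluster.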

C. Load Balancing & RPC Replication

  • Worker Pools: In iii, the central engine supports dynamic RPC routing to multiple worker instances registering the same function signature. We can provision an Autoscaling Group of Python Worker VMs inside the private subnet. When multiple workers register math::add, the iii engine automatically load-balances incoming RPC executions across the available worker WebSockets.
  • Cold-Start Mitigation: Maintain a hot pool of GPU instances and keep model weights in a shared store close to the compute (such as AWS S3 in the same region, or a pre-populated EBS snapshot attached at launch) so that scaling up a new worker node does not require pulling hundreds of gigabytes of weights from a remote source.
  • Queue and Rate Limiting: Configure the built-in iii-queue worker to queue incoming API requests during high-traffic spikes, which bounds the number of concurrent inference calls and reduces the risk of Out-Of-Memory (OOM) failures on the GPU cluster under extreme load.
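
The queueing idea can be illustrated with a bounded queue that rejects work once it is full. This is a generic sketch of admission control using Python's standard library; it does not represent how iii-queue is implemented or configured, and the class name is hypothetical.

```python
import queue
from typing import Any, Optional


class BoundedAdmission:
    """Accept requests up to a fixed backlog and reject the rest."""

    def __init__(self, max_pending: int) -> None:
        self._pending: queue.Queue = queue.Queue(maxsize=max_pending)

    def submit(self, request: Any) -> bool:
        """Return True if queued, False if the backlog is full."""
        try:
            self._pending.put_nowait(request)
            return True
        except queue.Full:
            return False

    def next_request(self) -> Optional[Any]:
        """Return the oldest queued request, or None if the queue is empty."""
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None
```

A rejected submit would typically map to an HTTP 429 or 503 response, so callers can retry later instead of piling more work onto the GPUs.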

About

Secure, multi-VM cross-language inference architecture orchestrated over WebSocket RPC using the iii engine. Fully automated with Terraform (VPC, private subnet isolation, NAT Gateway) and systemd services.
