Trajecta stack — operating guide

Two deploy modes for the TSLM

Both modes share the FastAPI orchestrator + agent loop + RAG + frontend; they differ only in where /predict actually runs the trained model.

| Mode | `INFERENCE_BACKEND` | Model runs on | Setup complexity | Cost when idle | UX |
| --- | --- | --- | --- | --- | --- |
| A: local (default) | `local` | The inference container (needs a GPU host + mounted `./checkpoints/`) | Low: just `make up` | $0 (your own GPU) | ~2-3 s predict |
| B: SageMaker (recommended for demos) | `sagemaker` | A SageMaker endpoint deployed via `sagemaker-deploy/` | Medium: 3 commands in Code Editor | ~$0/hr async, ~$1.21/hr realtime g5.xlarge | ~2-3 s realtime; async adds polling |

The local FastAPI container is identical in both modes: same image, same routes, same agent loop. The only thing that changes is what /predict does internally: it either loads the checkpoint and runs the forward pass locally, or calls invoke_endpoint on a boto3 sagemaker-runtime client against the SageMaker endpoint.
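As a rough sketch of that dispatch, the mode selection might look like the following. The names `BackendConfig` and `resolve_backend` are hypothetical and not taken from the inference service; only `INFERENCE_BACKEND`, `SAGEMAKER_ENDPOINT_NAME`, and `SAGEMAKER_REGION` come from this guide.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_BACKENDS = ("local", "sagemaker")


@dataclass(frozen=True)
class BackendConfig:
    """Hypothetical container for the settings /predict would need."""
    backend: str
    endpoint_name: Optional[str] = None
    region: Optional[str] = None


def resolve_backend(env: Mapping[str, str]) -> BackendConfig:
    """Pick the inference backend from environment-style settings.

    Falls back to "local" when INFERENCE_BACKEND is unset, matching the
    default mode listed in the table above.
    """
    backend = env.get("INFERENCE_BACKEND", "local").strip().lower()
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unknown INFERENCE_BACKEND: {backend!r}")
    if backend == "local":
        return BackendConfig(backend="local")
    # SageMaker mode is only usable when an endpoint name is configured.
    endpoint = env.get("SAGEMAKER_ENDPOINT_NAME", "").strip()
    if not endpoint:
        raise ValueError("SAGEMAKER_ENDPOINT_NAME is required in sagemaker mode")
    return BackendConfig("sagemaker", endpoint, env.get("SAGEMAKER_REGION"))
```

Passing `os.environ` to `resolve_backend` at startup would fail fast on a misconfigured mode rather than at the first /predict call.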

Switching modes is one env-var change + restart:

# Mode A → Mode B
# GNU sed syntax; on macOS/BSD sed, pass an empty suffix: sed -i '' 's/.../.../' .env
sed -i 's/INFERENCE_BACKEND=local/INFERENCE_BACKEND=sagemaker/' .env
echo SAGEMAKER_ENDPOINT_NAME=trajecta-tslm >> .env
make restart
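
The shell snippet above appends SAGEMAKER_ENDPOINT_NAME each time it runs, so switching back and forth can leave duplicate keys in .env. One possible alternative is an idempotent update, sketched here; `set_env_var` and `switch_to_sagemaker` are hypothetical helpers, not part of the repo.

```python
from pathlib import Path


def set_env_var(text: str, key: str, value: str) -> str:
    """Return .env text with KEY=value present exactly once."""
    out, replaced = [], False
    for line in text.splitlines():
        if line.strip().startswith(f"{key}="):
            if not replaced:
                out.append(f"{key}={value}")
                replaced = True
            # Later duplicates of the same key are dropped.
            continue
        out.append(line)
    if not replaced:
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


def switch_to_sagemaker(env_path: str = ".env",
                        endpoint: str = "trajecta-tslm") -> None:
    """Rewrite the .env file in place for Mode B."""
    path = Path(env_path)
    text = path.read_text() if path.exists() else ""
    text = set_env_var(text, "INFERENCE_BACKEND", "sagemaker")
    text = set_env_var(text, "SAGEMAKER_ENDPOINT_NAME", endpoint)
    path.write_text(text)
```

A `make restart` is still needed afterwards so the container picks up the new values.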

/health reports inference_backend plus endpoint info, so the frontend can warn if the SageMaker endpoint is misconfigured.

Architecture

The demo is two Docker services:

┌──────────────────────────────────────────────────────────────────┐
│                       USER BROWSER                              │
│   nginx (trajecta/frontend) → static React bundle on :3000      │
└──────────────────────────┬──────────────────────────────────────┘
                           │  /api/* (same-origin proxy)
                           ▼
┌──────────────────────────────────────────────────────────────────┐
│              FastAPI (trajecta/inference) :8000                 │
│                                                                  │
│   /predict, /predict/batch, /pdb_string, /pdb_ids, /health      │
│   /evaluate, /evaluate/agent, /failure_modes                    │
│                                                                  │
│   Components:                                                    │
│     - TSLM v1a + v1b loader (OpenTSLMSP)                        │
│     - Regex verifier (vendored verify_rationale.py)             │
│     - HDF5 → multi-MODEL PDB reconstruction                     │
│     - Agent orchestrator (Claude via OpenRouter, 8-step loop)   │
│     - 9 tools: splits, coords, chemistry, physics (Vina), rag   │
│     - Embedded ChromaDB + OpenAI text-embedding-3-small         │
│     - Persistent eval cache + daily USD cap                     │
└──────────────────────────────────────────────────────────────────┘

First-time setup (Mode A — local checkpoints)

Copy .env.example to .env and fill in real keys:

cp .env.example .env
# edit OPENROUTER_API_KEY, OPENAI_API_KEY, HUGGING_FACE_HUB_TOKEN

Make sure the assets the inference container mounts actually exist locally:

./MD.hdf5                              124 GB MISATO trajectory
./misato-affinity/data/Maps/           atoms_*_map.pickle files
./misato-affinity/data/affinity_data.csv
./preprocessed/features_test.npz       per-channel training features
./preprocessed/samples_test.jsonl      per-PDB facts for the verifier
./checkpoints/v1a/ckpt_ep1.pt          (or ckpt_final.pt) — the trained TSLM
./checkpoints/v1b/ckpt_final.pt

Anything missing will cause the inference container to boot in degraded mode (visible in /health) — endpoints that need the missing data return 503 with a clear message; the rest of the stack keeps running.
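
To catch a degraded boot before starting the containers, a small preflight check along these lines may help. It is only a sketch: the paths are copied from the asset list above, while `missing_assets` is a hypothetical helper and not part of the repo.

```python
from pathlib import Path
from typing import List

# Paths copied from the asset list above, relative to the repo root.
REQUIRED = [
    "MD.hdf5",
    "misato-affinity/data/Maps",
    "misato-affinity/data/affinity_data.csv",
    "preprocessed/features_test.npz",
    "preprocessed/samples_test.jsonl",
    "checkpoints/v1b/ckpt_final.pt",
]
# The v1a checkpoint may be either of these files.
V1A_CANDIDATES = [
    "checkpoints/v1a/ckpt_ep1.pt",
    "checkpoints/v1a/ckpt_final.pt",
]


def missing_assets(root: str = ".") -> List[str]:
    """List every expected asset that does not exist under `root`."""
    base = Path(root)
    missing = [p for p in REQUIRED if not (base / p).exists()]
    if not any((base / p).exists() for p in V1A_CANDIDATES):
        missing.append(" or ".join(V1A_CANDIDATES))
    return missing
```

Calling `missing_assets()` from the repo root and printing the result before `make up` gives a quick list of what /health would later report as degraded.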

Boot the stack:

make up                # builds both images, starts in background
make logs              # tail logs
make ps                # confirm both services healthy

One-time RAG corpus build (cost: ~$0.50 in OpenAI embeddings, ~3 min):
```
make ingest
```
Optional: precompute worked examples + failure modes so the live UI mostly serves cached responses (cost: ~$15, ~30 min):
```
make precompute
```
Open the frontend at http://localhost:3000.

First-time setup (Mode B — SageMaker endpoint)

Full step-by-step in sagemaker-deploy/README.md. TL;DR from a Code Editor terminal in SageMaker Studio:

cd sagemaker-deploy
python build_model_tarball.py \
    --v1a-ckpt /opt/ml/checkpoints/v1a/ckpt_ep1.pt \
    --v1b-ckpt /opt/ml/checkpoints/v1b/ckpt_final.pt \
    --preprocessed /home/sagemaker-user/preprocessed \
    --s3-uri s3://<your-bucket>/trajecta/model.tar.gz

python deploy.py \
    --model-data s3://<your-bucket>/trajecta/model.tar.gz \
    --endpoint-name trajecta-tslm \
    --mode realtime

Then on the machine running this stack:

# .env additions
INFERENCE_BACKEND=sagemaker
SAGEMAKER_ENDPOINT_NAME=trajecta-tslm
SAGEMAKER_REGION=us-west-2
# Either provide explicit AWS keys OR run the host with an instance role that
# has 'sagemaker:InvokeEndpoint' permission on the endpoint ARN.
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...

# Then:
make restart        # restarts both services, as listed under Common operations
make smoketest      # verifies /predict round-trips to SageMaker

The local container does NOT need OpenTSLM, a GPU, or the checkpoint mount in SageMaker mode — those concerns move into the SM container. The MD.hdf5 mount is still required because /pdb_string (the 3D viewer trajectory) runs locally.
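
In SageMaker mode the /predict call ultimately goes through the SageMaker runtime API. A minimal sketch of such a call follows, assuming a JSON request body. The `pdb_id` and `variant` field names and both helper names are hypothetical; the real payload schema is whatever the deployed handler in `sagemaker-deploy/` expects.

```python
import json
from typing import Any, Dict


def build_predict_body(pdb_id: str, variant: str = "v1a") -> bytes:
    """Serialize a request body (hypothetical field names)."""
    return json.dumps({"pdb_id": pdb_id, "variant": variant}).encode("utf-8")


def invoke_realtime(endpoint_name: str, region: str, body: bytes) -> Dict[str, Any]:
    """Call a realtime SageMaker endpoint and decode a JSON response."""
    import boto3  # imported lazily so the module loads without AWS dependencies

    client = boto3.client("sagemaker-runtime", region_name=region)
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=body,
    )
    # response["Body"] is a streaming object; read() returns the raw bytes.
    return json.loads(response["Body"].read())
```

For example, `invoke_realtime("trajecta-tslm", "us-west-2", build_predict_body("<pdb-id>"))` would exercise the endpoint directly, bypassing the local container. Async endpoints use `invoke_endpoint_async` with an S3 `InputLocation` instead, which is why that mode adds polling.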

When you're done, delete the endpoint so it stops billing:

cd sagemaker-deploy
python deploy.py --endpoint-name trajecta-tslm --delete
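
For a sense of what a forgotten realtime endpoint costs, the arithmetic can be sketched as below, using the ~$1.21/hr g5.xlarge figure from the mode table. The rate is approximate and `idle_cost` is a hypothetical helper.

```python
REALTIME_USD_PER_HOUR = 1.21  # approximate g5.xlarge realtime rate quoted in this guide


def idle_cost(hours: float, rate: float = REALTIME_USD_PER_HOUR) -> float:
    """Cost in USD of leaving a realtime endpoint running for `hours`."""
    return round(hours * rate, 2)
```

At that rate, `idle_cost(24)` comes to about $29.04 per day and `idle_cost(24 * 30)` to about $871.20 over a 30-day month.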

Health check + smoke test

make smoketest           # 7-check end-to-end test, see scripts/smoketest.py
curl http://localhost:8000/health

A green smoke test means:

- both variants loaded
- predict is deterministic
- /pdb_string parses
- /evaluate + /evaluate/agent return valid verdicts
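
The determinism check in particular is easy to reproduce by hand. The contents of scripts/smoketest.py are not shown here, so the following is only one way such a check could be written; `is_deterministic` is a hypothetical helper that works with any callable returning JSON-serializable data.

```python
import json
from typing import Any, Callable, Dict


def is_deterministic(predict: Callable[[Dict[str, Any]], Any],
                     payload: Dict[str, Any], runs: int = 3) -> bool:
    """Return True when repeated calls with the same payload agree.

    Responses are compared as canonical JSON so dict key order is ignored.
    """
    if runs < 2:
        raise ValueError("need at least two runs to compare")
    seen = {json.dumps(predict(payload), sort_keys=True) for _ in range(runs)}
    return len(seen) == 1
```

In practice `predict` would wrap an HTTP POST to /predict on http://localhost:8000 and return the decoded JSON response.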

Common operations

| Command | What it does |
| --- | --- |
| `make up` | Build + start the stack |
| `make down` | Stop containers (keeps the `trajecta_inference_data` volume) |
| `make restart` | Restart both services |
| `make logs` | Tail logs from both services |
| `make shell-inference` | Open a bash shell inside the inference container |
| `make ingest` | Run the RAG ingest pipeline |
| `make precompute` | Run the worked-example + failure-mode precompute script |
| `make test` | Run the inference-service pytest suite (label-filter regression, etc.) |
| `make clean` | `make down` + delete the persistent data volume |

Dev mode without Docker

Run the backend and frontend in separate terminals:

# terminal 1 — backend
cd inference-service
pip install -r requirements.txt
pip install -e ../OpenTSLM
# load .env into the current shell (this idiom assumes values contain no spaces)
export $(grep -v '^#' ../.env | xargs)
uvicorn app:app --reload --port 8000

# terminal 2 — frontend
cd trajecta
npm install
npm run dev

Vite proxies /api/* to http://localhost:8000, so the same api.ts client works in both dev and prod. Override the proxy target with DEV_API_URL=....

Money + safety

- The agent loop calls OpenRouter (Claude Opus 4.7). One run ≈ $0.20–0.50.
- OPENROUTER_DAILY_USD_CAP (default $20) is enforced in-process; once exhausted, /evaluate/agent returns HTTP 429.
- Predict responses are deterministic (temperature=0, fixed seeds) and free, so call them as often as you like.
- Cached agent verdicts are returned for free; the frontend shows a "(cached)" pill when this happens.
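
An in-process daily cap of the kind described above could be sketched as follows. This is an illustration, not the service's actual implementation: `DailyBudget` and `try_spend` are hypothetical names, and the sketch charges an estimated cost up front, whereas a real tracker would likely record the actual cost after each run.

```python
from datetime import datetime, timezone
from typing import Optional


class DailyBudget:
    """In-process spend tracker that resets when the UTC date changes."""

    def __init__(self, cap_usd: float = 20.0) -> None:
        self.cap_usd = cap_usd
        self._day = None
        self._spent = 0.0

    def _roll(self, now: datetime) -> None:
        # Start a fresh budget on the first call of each UTC day.
        today = now.astimezone(timezone.utc).date()
        if today != self._day:
            self._day, self._spent = today, 0.0

    def try_spend(self, usd: float, now: Optional[datetime] = None) -> bool:
        """Record the spend and return True, or return False if over the cap."""
        now = now or datetime.now(timezone.utc)
        self._roll(now)
        if self._spent + usd > self.cap_usd:
            return False  # the caller would translate this into HTTP 429
        self._spent += usd
        return True
```

With the default $20 cap and $0.20 to $0.50 per run, this allows roughly 40 to 100 uncached agent runs per UTC day. Because the counter lives in process memory, it would reset on a container restart unless it is also persisted.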

Troubleshooting

| Symptom | Probable cause | Fix |
| --- | --- | --- |
| `/health` returns `status: degraded` | no checkpoint in `./checkpoints/v{1a,1b}/` | drop a `ckpt_*.pt` in there and `make restart` |
| `/predict` returns 404 "not in test split" | the PDB isn't in `preprocessed/features_test.npz` | use one of the PDBs from `/pdb_ids` |
| `/evaluate/agent` returns 429 | daily cap reached | wait for midnight UTC reset, or raise `OPENROUTER_DAILY_USD_CAP` in `.env` |
| `/evaluate/agent` returns "OPENROUTER_API_KEY not set" | missing env | edit `.env`, `make restart` |
| 3D viewer empty | `/pdb_string` 503: HDF5 not mounted | check `./MD.hdf5` exists and `docker compose config` shows it mounted |
| Failure-modes tab shows "no precomputed" | `make precompute` not run | run it; the JSON lands in the `trajecta_inference_data` volume |