End-to-end runbook to build both container images and deploy the Aerele Proctor platform to Google Cloud Platform (Cloud Run + Cloud Storage + Firestore + Cloud Build + Artifact Registry) from scratch.
This runbook is self-contained: every command and behavior below tracks the
actual repo — backend/deploy-gcp.sh, frontend/deploy-gcp.sh,
video-worker/deploy-gcp.sh, backend/src/config.mjs, backend/src/handler.mjs,
backend/src/lib/clients.mjs, backend/src/lib/auth.mjs,
backend/src/routes/healthCheck.mjs, frontend/src/api.ts, .env.deploy.example,
backend/gcs-lifecycle.json, and backend/gcs-cors.json. For current deployed
revisions, run gcloud run services list.
The committed deploy scripts now reproduce the COMPLETE live config. This is the most important recent change to internalize:
- backend/deploy-gcp.sh has two modes (DEPLOY_MODE=full / image-only; see §Deploy modes). In full mode (the default) it sets the entire env map (ADMIN_PASSWORD, INVIGILATOR_PASSWORD, ALERTS_INGEST_API_KEY, RETENTION_SWEEP_API_KEY, the JUDGE0_* keys, optional EXEC_* tuning) and mounts the recording-signing key from Secret Manager (proctor-signer-key → /secrets/signer-key.json, SIGNER_KEY_FILE), all in one atomic deploy. A pre-flight gate aborts loudly if any required secret is missing. image-only mode ships a new image while preserving the live env + secret mounts.
- frontend/deploy-gcp.sh bakes both VITE_ADMIN_PASSWORD_HASH and VITE_INVIGILATOR_PASSWORD_HASH and verifies them post-build (verify_dist_has_hashes). It is the only sanctioned frontend deploy path; do not build/submit the frontend by hand.

The standard operator workflow is the STAGED ZERO-DOWNTIME DEPLOY in §Staged zero-downtime deploy: build → deploy as a no-traffic tagged revision → verify on the tag URL (admin pre-flight health-check + smoke) → only then cut traffic → keep the previous revision at 0% for instant rollback. See also the 2026-06-19 incident learnings.
The from-scratch GCP bootstrap (project create → billing → enable APIs → deployer SA → key → handoff env file) is described below. Run it first if the project does not yet exist. The hard rules:
- gcloud installed and authenticated as a user who can create projects and link billing.
- Brand-new ISOLATED project. Do NOT reuse any existing or production project.
- The deployer service account is a member of only that one project: roles/owner on the isolated, deletable project (or the tighter role list in the doc: run.admin, cloudbuild.builds.editor, artifactregistry.admin, storage.admin, datastore.owner, serviceusage.serviceUsageAdmin, iam.serviceAccountAdmin, iam.serviceAccountUser, resourcemanager.projectIamAdmin).
- No org-level or folder-level roles. Budget-capped and deletable.
The APIs the platform needs (also enabled idempotently by the deploy scripts):
run, cloudbuild, artifactregistry, firestore, storage, iamcredentials
(the setup doc additionally enables cloudresourcemanager).
| Fact | Value |
|---|---|
| Project | your-gcp-project-id |
| Region | asia-south1 (example region) |
| Deployer SA | proctor-deployer@your-gcp-project-id.iam.gserviceaccount.com |
| SA key + GCP env | monitoring/.data/gcp-dev.env (gitignored: GCP_PROJECT_ID / GCP_REGION / GOOGLE_APPLICATION_CREDENTIALS) |
| gcloud binaries | ~/google-cloud-sdk/bin |
To deploy as the scoped deployer (instead of an interactive login):
source monitoring/.data/gcp-dev.env
gcloud auth activate-service-account --key-file="$GOOGLE_APPLICATION_CREDENTIALS"
gcloud config set project "$GCP_PROJECT_ID"

cp .env.deploy.example .env.deploy.local   # gitignored; keep it private
# edit .env.deploy.local, then source it for the deploy scripts:
set -a; source .env.deploy.local; set +a

Fields in .env.deploy.example (verified):
| Field | Notes |
|---|---|
| PROJECT_ID | GCP project ID (not display name). |
| REGION | One region for everything (e.g. asia-south1). |
| REPOSITORY | Artifact Registry repo. Template default proctor. Note: the deploy scripts default to aerele-proctor when unset, so set this explicitly. |
| ADMIN_PASSWORD | /admin password. openssl rand -base64 24. Backend secret AND its sha256 is embedded in the frontend bundle (plain value never shipped). |
| ALERTS_INGEST_API_KEY | Shared secret for POST /api/alerts. openssl rand -base64 32. Closed-by-default: unset ⇒ ingest rejects everything. |
| RETENTION_SWEEP_API_KEY | Daily retention sweep key. openssl rand -base64 32. Closed-by-default: unset ⇒ /api/admin/retention-sweep rejects the x-api-key path (the admin password still triggers a manual sweep). |
| ALERTS_COLLECTION | Firestore alerts collection. Default proctor_alerts. |
| PUBLIC_APP_ORIGIN | CORS origin. Start *; tighten to the frontend URL later (§5). |
| EVIDENCE_BUCKET | Globally-unique GCS bucket for evidence. |
| SOURCE_BUCKET | Video-worker source; usually equal to EVIDENCE_BUCKET. |
| DEST_BUCKET | Merged-review-video bucket (video-worker only). |
| BACKEND_SERVICE_NAME | proctor-api. |
| FRONTEND_SERVICE_NAME | proctor-web. |
| VIDEO_WORKER_SERVICE_NAME | proctor-video-worker. |
| API_URL | Backend Cloud Run URL; fill AFTER the backend deploy (§2). |
| WORKER_TOKEN | Protects the video-worker /merge endpoint. openssl rand -base64 32. |
| MAX_USERNAMES_PER_REQUEST | Local merge-helper batch cap. Default 25. |
These are read by backend/src/config.mjs / the deploy script but are absent from
the committed .env.deploy.example template. A full backend deploy DOES set
them — but only if they are present in your environment / .env.deploy.local, so
add them there. (INVIGILATOR_PASSWORD, ALERTS_INGEST_API_KEY,
RETENTION_SWEEP_API_KEY, ADMIN_PASSWORD, EVIDENCE_BUCKET, and JUDGE0_API_KEY
are the required secrets the full-mode pre-flight gate enforces.)
| Var | Why |
|---|---|
| INVIGILATOR_PASSWORD | Backend invigilator auth (requireInvigilator → 401 when wrong/unset). Also baked as a frontend hash by frontend/deploy-gcp.sh. Required (full-mode gate). |
| JUDGE0_API_KEY | RapidAPI key for live Run/Submit. The script defaults JUDGE0_MODE=rapidapi, JUDGE0_BASE_URL=https://judge0-ce.p.rapidapi.com, JUDGE0_RAPIDAPI_HOST=judge0-ce.p.rapidapi.com. Keep the key in a gitignored env file (e.g. monitoring/.data/judge0.env). Required (full-mode gate). |
| EXEC_SUBMIT_COOLDOWN_SECONDS | ≈ 20 for a real exam (default 20). Passed through only when set. |
| EXEC_MAX_SUBMISSIONS_PER_SESSION | ≈ 200 for a real exam (default 50). Passed through only when set. |
| EXEC_RUN_CONCURRENCY / EXEC_SUBMIT_CONCURRENCY / EXEC_POLL_CONCURRENCY / EXEC_MAX_QUEUE | Generous lane concurrency for capacity (defaults 2/4/16/200). Passed through only when set. |
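
The full-mode pre-flight gate lives in backend/deploy-gcp.sh; the sketch below only approximates it so you can check your environment before running the script. The helper name require_vars is hypothetical and is not taken from the repo.

```shell
# Sketch of a local "required secrets" check; not the script's actual gate.
# require_vars (hypothetical name) prints each unset/empty variable to stderr
# and succeeds only when every named variable has a non-empty value.
require_vars() (
  missing=0
  for name in "$@"; do
    # Indirect lookup: read the variable whose NAME is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "MISSING: $name" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
)

# Usage, after sourcing .env.deploy.local:
# require_vars ADMIN_PASSWORD INVIGILATOR_PASSWORD ALERTS_INGEST_API_KEY \
#   RETENTION_SWEEP_API_KEY JUDGE0_API_KEY EVIDENCE_BUCKET || exit 1
```

The function body runs in a subshell, so the loop variables do not leak into the calling shell.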
set -a; source .env.deploy.local; set +a
# DEPLOY_MODE defaults to `full`; the script also sources .env.deploy.local itself.
SERVICE_NAME="$BACKEND_SERVICE_NAME" ./backend/deploy-gcp.sh

backend/deploy-gcp.sh does, idempotently (verified):
- Sources .env.deploy.local if present (env you already exported wins), then selects the deploy mode from DEPLOY_MODE (default full).
- Pre-flight gate (full mode only): asserts every exam-critical secret is set (ADMIN_PASSWORD, INVIGILATOR_PASSWORD, ALERTS_INGEST_API_KEY, RETENTION_SWEEP_API_KEY, JUDGE0_API_KEY, EVIDENCE_BUCKET) and aborts before any build if one is missing (mirrors the frontend hash gate; never ship a half-configured exam backend silently).
- gcloud services enable run cloudbuild artifactregistry firestore storage iamcredentials secretmanager.
- Creates Firestore (default) in $REGION if missing.
- Creates the composite index on proctor_sessions(username_norm ASC, contest_slug ASC) with --async (non-blocking; also declared in backend/firestore.indexes.json). The index builds in the background and never blocks the deploy.
- Creates EVIDENCE_BUCKET (uniform bucket-level access) if missing.
- Applies backend/gcs-cors.json (browser PUT/GET/HEAD, origin *) and backend/gcs-lifecycle.json (the two-rule retention split; see §2b).
- Creates the Artifact Registry Docker repo if missing.
- Grants the runtime SA (<projectNumber>-compute@developer.gserviceaccount.com): project roles/datastore.user, bucket roles/storage.objectAdmin, and roles/iam.serviceAccountTokenCreator on itself (needed to sign GCS URLs).
- Signer key (full mode only): verifies the Secret Manager secret proctor-signer-key exists and grants the runtime SA roles/secretmanager.secretAccessor on it; see §2c.
- gcloud builds submit backend --tag $IMAGE.
- gcloud run deploy: port 8080, 256Mi, cpu 1, --min-instances 0, --max-instances 20, --concurrency 100, --timeout 120s (/api/exec/* blocks while the Judge0 adapter polls; a 30s timeout killed requests mid-poll). In full mode this carries --set-env-vars (the complete env map) and --set-secrets=/secrets/signer-key.json=proctor-signer-key:latest; in image-only mode it carries neither (preserving the live config).
Then capture the backend URL into API_URL for the frontend build:
export API_URL="$(gcloud run services describe "$BACKEND_SERVICE_NAME" \
--region "$REGION" --format='value(status.url)')"
full mode sets the ENTIRE env map and mounts the signer key, atomically. The script builds the full --set-env-vars list itself (using gcloud's ^@^ custom-delimiter form so a secret value containing a comma can't corrupt the parse): EVIDENCE_BUCKET, ADMIN_PASSWORD, INVIGILATOR_PASSWORD, ALERTS_INGEST_API_KEY, ALERTS_COLLECTION, RETENTION_SWEEP_API_KEY, PUBLIC_APP_ORIGIN, SESSION_COLLECTION, SETTINGS_COLLECTION, URL_EXPIRY_SECONDS, JUDGE0_MODE, JUDGE0_BASE_URL, JUDGE0_RAPIDAPI_HOST, JUDGE0_API_KEY, SIGNER_KEY_FILE. The optional tunables EXEC_RUN_COOLDOWN_SECONDS, EXEC_SUBMIT_COOLDOWN_SECONDS, EXEC_MAX_SUBMISSIONS_PER_SESSION, EXEC_RUN_CONCURRENCY, EXEC_SUBMIT_CONCURRENCY, EXEC_POLL_CONCURRENCY, EXEC_MAX_QUEUE, EVALUATE_BATCH_LIMIT, EVALUATE_TIME_BUDGET_MS, EVAL_LEASE_MS, JUDGE0_AUTH_TOKEN are added only when set, so a full deploy never silently resets a tuned limit to a code default. You no longer pass the Judge0/invigilator/sweep keys by hand with --update-env-vars; set them in .env.deploy.local and run the script.
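
To illustrate the ^@^ custom-delimiter form, here is a minimal sketch of how such a flag value could be assembled. build_env_flag is a hypothetical helper, not the code in backend/deploy-gcp.sh; it assumes "@" is the chosen delimiter and refuses any pair that itself contains "@", since that would split the value.

```shell
# Sketch: join KEY=VALUE pairs into one gcloud flag using the ^DELIM^ prefix,
# so commas inside a value are no longer treated as pair separators.
# build_env_flag is a hypothetical name used only for this illustration.
build_env_flag() {
  delim="@"
  out=""
  for pair in "$@"; do
    case "$pair" in
      *"$delim"*)
        # Print only the key so the secret value never reaches the log.
        echo "value contains delimiter '$delim': ${pair%%=*}" >&2
        return 1 ;;
    esac
    if [ -z "$out" ]; then out="$pair"; else out="${out}${delim}${pair}"; fi
  done
  printf '%s\n' "--set-env-vars=^${delim}^${out}"
}

# Usage:
# gcloud run deploy "$SERVICE_NAME" ... "$(build_env_flag "ADMIN_PASSWORD=$ADMIN_PASSWORD" "JUDGE0_MODE=rapidapi")"
```

If a secret can legitimately contain "@", pick a different delimiter character and change delim accordingly.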
Recording-signing must sign GCS v4 URLs locally off a mounted service-account
key. The mechanism (verified backend/src/lib/clients.mjs):
- The backend keeps a main Storage client on metadata ADC for all token-bearing work (getFiles, save, Firestore, every API call).
- It builds a separate signing client only when SIGNER_KEY_FILE points at a mounted key (new Storage({ keyFilename: SIGNER_KEY_FILE })). v4 signing off that key is a local crypto operation: no token, no network.
- If SIGNER_KEY_FILE is unset, signingBucket() falls back to the main client, which then has to sign via the remote IAM signBlob token endpoint, the flaky path that degrades/fails under real-exam token-endpoint load. That fallback is exactly the 2026-06-19 recording-signing outage (see §incident learnings).
So a deploy must keep the signer key mounted. The wiring:
| Piece | Value |
|---|---|
| Secret Manager secret | proctor-signer-key (the signer SA JSON key; created out-of-band, never committed) |
| Mount path | /secrets/signer-key.json (--set-secrets=/secrets/signer-key.json=proctor-signer-key:latest) |
| Backend env | SIGNER_KEY_FILE=/secrets/signer-key.json (read by config.mjs → configureClients) |
DEPLOY_MODE=full sets all three (and grants the runtime SA
secretmanager.secretAccessor); DEPLOY_MODE=image-only preserves the mount
(it passes neither --set-secrets nor --set-env-vars). The signer secret itself
is created once, out of band — the deploy script never creates or prints it:
# One-time, out-of-band (NOT in the repo; the SA key value is a secret):
gcloud secrets create proctor-signer-key --replication-policy=automatic
gcloud secrets versions add proctor-signer-key --data-file=<path-to-signer-sa-key.json>

The full-mode script aborts with that exact hint if proctor-signer-key is absent.
config.mjs is the single env source besides handler.mjs. Unset collections fall
back to proctor_* defaults; the four credentials are closed-by-default when unset.
Collections (Firestore collection-name overrides; all default to the value shown):
SESSION_COLLECTION (proctor_sessions), SETTINGS_COLLECTION (proctor_settings),
ALERTS_COLLECTION (proctor_alerts),
SUBMISSION_EVENTS_COLLECTION (proctor_submission_events),
LIVE_LOCK_COLLECTION (proctor_live_locks),
REVIEW_STATE_COLLECTION (proctor_review_state),
REVIEW_COLLECTION (proctor_reviews),
REVIEW_CLAIMS_COLLECTION (proctor_review_claims),
SUBMISSIONS_COLLECTION (proctor_submissions),
PROBLEMS_COLLECTION (proctor_problems),
EDITOR_EVENTS_COLLECTION (editor-events, a GCS sub-prefix label),
ROSTER_COLLECTION (proctor_roster),
ROOM_GATES_COLLECTION (proctor_room_gates),
CONTESTS_COLLECTION (proctor_contests),
COLLEGES_COLLECTION (proctor_colleges),
PERSONS_COLLECTION (proctor_persons),
ENROLLMENTS_COLLECTION (proctor_enrollments),
ADMIN_AUDIT_COLLECTION (proctor_admin_audit),
TEMPLATES_COLLECTION (proctor_templates).
Storage / Judge0:
| Var | Default | Notes |
|---|---|---|
| EVIDENCE_BUCKET | (none) | Required for evidence uploads + signed URLs. |
| JUDGE0_BASE_URL | https://judge0-ce.p.rapidapi.com | |
| JUDGE0_MODE | rapidapi | |
| JUDGE0_API_KEY | (none) | RapidAPI key; required for live Run/Submit. |
| JUDGE0_AUTH_TOKEN | (none) | Alternate auth (self-host token mode). |
| URL_EXPIRY_SECONDS | 900 | Signed-URL TTL. |
Credentials (closed-by-default when unset):
| Var | Effect when unset |
|---|---|
| ADMIN_PASSWORD | requireAdmin → 401 (admin routes inaccessible). |
| INVIGILATOR_PASSWORD | requireInvigilator → 401. |
| ALERTS_INGEST_API_KEY | POST /api/alerts rejects all. |
| RETENTION_SWEEP_API_KEY | /api/admin/retention-sweep rejects the x-api-key path (admin password still works). |
Tunables: EDITOR_EVENTS_INGEST_LIMIT (5000),
EXEC_RUN_COOLDOWN_SECONDS (5), EXEC_SUBMIT_COOLDOWN_SECONDS (20),
EXEC_MAX_SUBMISSIONS_PER_SESSION (50),
EXEC_RUN_CONCURRENCY (2), EXEC_SUBMIT_CONCURRENCY (4),
EXEC_POLL_CONCURRENCY (16), EXEC_MAX_QUEUE (200),
DISCONNECTED_STALENESS_MS (45000), PUBLIC_APP_ORIGIN (*),
GATE_ATTEMPT_LIMIT (20).
backend/gcs-lifecycle.json is two prefix-scoped rules (verified):
- Delete objects under contests/ and sessions/ at age 3 days (per-session evidence).
- Delete objects under exports/ at age 11 days (export recovery zips).
The split is load-bearing: a single blanket age:3 rule would delete export
recovery archives 7 days early. The /api/admin/retention-sweep endpoint owns the
canonical 10-day deletion of export zips; the GCS age:11 rule is only a backstop
just past that window.
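
For orientation, a two-rule lifecycle config of the shape described above could look like the sketch below. This is an illustration written from the description, not a copy of the committed backend/gcs-lifecycle.json, and the output path is a scratch location chosen for the example.

```shell
# Sketch: write an example two-rule GCS lifecycle config to a scratch file.
# The committed backend/gcs-lifecycle.json is the source of truth; this only
# mirrors the rules as described (3 days for evidence, 11 days for exports).
LIFECYCLE_FILE="${TMPDIR:-/tmp}/gcs-lifecycle.example.json"
cat > "$LIFECYCLE_FILE" <<'JSON'
{
  "rule": [
    { "action": {"type": "Delete"}, "condition": {"age": 3, "matchesPrefix": ["contests/", "sessions/"]} },
    { "action": {"type": "Delete"}, "condition": {"age": 11, "matchesPrefix": ["exports/"]} }
  ]
}
JSON

# To apply by hand (assumes an authenticated gcloud and an existing bucket):
# gcloud storage buckets update "gs://${EVIDENCE_BUCKET}" --lifecycle-file="$LIFECYCLE_FILE"
```

Keeping the prefixes in separate rules is what lets the export archives outlive the per-session evidence.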
To run the sweep daily, create a Cloud Scheduler job that POSTs to the endpoint with
the sweep key in the x-api-key header (the handler's requireSweepAuth accepts
the x-api-key or the admin password):
gcloud scheduler jobs create http proctor-retention-sweep \
--location "$REGION" \
--schedule "0 3 * * *" \
--uri "${API_URL}/api/admin/retention-sweep" \
--http-method POST \
--headers "x-api-key=${RETENTION_SWEEP_API_KEY}"

Watch for a Firestore composite-index prompt the first time a big export/purge runs. (Cloud Scheduler API enablement / job creation is not exercised by the repo scripts.)
# API_URL must already be exported from §2
SERVICE_NAME="$FRONTEND_SERVICE_NAME" ./frontend/deploy-gcp.sh

frontend/deploy-gcp.sh is the ONLY sanctioned frontend deploy path. Ad-hoc npm run build + gcloud builds submit are forbidden: they skip the password-hash bake and the post-build verification gate, which is exactly how a deploy once shipped an empty VITE_ADMIN_PASSWORD_HASH and broke admin/invigilator login before a ~700-student exam. Always run the script.

frontend/deploy-gcp.sh:
- Enables run, cloudbuild, artifactregistry; creates the Artifact Registry repo if missing.
- Asserts PROJECT_ID, API_URL, ADMIN_PASSWORD, and INVIGILATOR_PASSWORD are set (fails fast otherwise).
- Computes sha256hex of both passwords. The plain passwords are never put in the bundle; the unlock gates hash the typed password and compare to the embedded hash (frontend/src/api.ts).
- Builds: VITE_API_BASE_URL=$API_URL VITE_ADMIN_PASSWORD_HASH=… VITE_INVIGILATOR_PASSWORD_HASH=… npm --workspace frontend run build.
- Post-build verification gate: greps frontend/dist for both expected hash strings; if either is missing it prints a loud error and runs exit 1 to abort the deploy (so a hash-less bundle can never ship).
- gcloud builds submit frontend --tag $IMAGE.
- gcloud run deploy: port 8080, 128Mi, cpu 1, --min-instances 0, --max-instances 3, --concurrency 1000.
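
The hash bake and the post-build gate can be pictured with the sketch below. It is an approximation for manual spot-checks, not the script's own code: dist_has_hash is a hypothetical helper (the script's gate is verify_dist_has_hashes), and the fallback to shasum is an assumption for machines without sha256sum.

```shell
# Sketch: hash a password the way the text describes (lowercase sha256 hex)
# and check that a build output directory contains that hash somewhere.
sha256hex() {
  # printf '%s' avoids the trailing newline that echo would add to the input.
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s' "$1" | sha256sum | cut -d' ' -f1
  else
    printf '%s' "$1" | shasum -a 256 | cut -d' ' -f1
  fi
}

# dist_has_hash (hypothetical): $1 = build dir, $2 = expected hash.
# -F treats the hash as a fixed string, -r searches the directory recursively.
dist_has_hash() {
  grep -rqF -- "$2" "$1"
}

# Usage:
# dist_has_hash frontend/dist "$(sha256hex "$ADMIN_PASSWORD")" || { echo "admin hash missing" >&2; exit 1; }
```

This is only a read-only check on an existing build; it does not replace deploying through frontend/deploy-gcp.sh.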
The admin console is the same frontend URL at /admin; the invigilator portal
is at /invigilator (routed in frontend/src/App.tsx). The invigilator portal can
also be entered via a tokenized ?contest=…&key=… link, and the admin password
also unlocks it — InvigilatorApp.tsx accepts the admin hash as a fallback.
| Var | Purpose |
|---|---|
| VITE_API_BASE_URL | Backend base URL the app calls (= API_URL). |
| VITE_ADMIN_PASSWORD_HASH | sha256 hex of ADMIN_PASSWORD; admin unlock gate compares against it. |
| VITE_INVIGILATOR_PASSWORD_HASH | sha256 hex (lowercase) of INVIGILATOR_PASSWORD; invigilator unlock gate. Baked + verified by the script. |
| VITE_ADMIN_PASSWORD / VITE_INVIGILATOR_PASSWORD | Plain passwords, used only by demo-mode local builds; do NOT pass for production. |
| VITE_DEMO_MODE | true runs the whole UI on a localStorage fake (no backend); local demo only. |
SERVICE_NAME="$VIDEO_WORKER_SERVICE_NAME" ./video-worker/deploy-gcp.sh

video-worker/deploy-gcp.sh (verified): creates DEST_BUCKET + applies
backend/gcs-lifecycle.json; grants the runtime SA storage.objectViewer on
SOURCE_BUCKET, storage.objectAdmin on DEST_BUCKET, and project
datastore.user (the worker writes merged_video_key back to the session doc);
deploys with 1Gi, --concurrency 1, --timeout 3600s (ffmpeg/ffprobe come
from its Dockerfile). Env set by the script: SOURCE_BUCKET, DEST_BUCKET,
SESSION_COLLECTION, MAX_USERNAMES_PER_REQUEST, WORKER_TOKEN.

CAVEAT (video-worker/README.md, untested vs real GCP): if DEST_BUCKET ≠ EVIDENCE_BUCKET, the backend signs the alert video_key against the evidence bucket and the deep-link can 404. The video-worker is NOT deployed on the dev stack, so the alert→recording deep-link currently has no merged video; admin recording review plays raw chunks directly (the player builds a playlist from screen/chunk-*.webm). (unverified against a real GCP run.)

proctor-eval is the same backend/ source as proctor-api, a different
entrypoint — it is built from backend/Dockerfile.eval (functions-framework
--target=evalApi) and runs as its OWN Cloud Run service so the evaluation
engine can be redeployed without touching the live exam path (see
backend/src/eval-server.mjs). It shares the SAME env + signer-key secret as
proctor-api (separation is at the deploy boundary, not the data boundary).
Build + deploy it like the backend, but with the eval Dockerfile, the eval service name, and the eval image tag:
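IMAGE="${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPOSITORY}/eval:latest"
# One possible Cloud Build config (an assumption, not a committed file): build backend/ with Dockerfile.eval and push to ${IMAGE}.
cat > /tmp/cloudbuild.eval.yaml <<'YAML'
steps:
  - name: gcr.io/cloud-builders/docker
    args: ["build", "-f", "Dockerfile.eval", "-t", "${_IMAGE}", "."]
images: ["${_IMAGE}"]
YAML
gcloud builds submit backend --config=/tmp/cloudbuild.eval.yaml --substitutions=_IMAGE="$IMAGE"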
# Set the COMPLETE env + signer mount, exactly as proctor-api does — the eval
# service reads the same vars (JUDGE0_*, EVIDENCE_BUCKET, collections, SIGNER_KEY_FILE…).
gcloud run deploy proctor-eval \
--image "$IMAGE" --region "$REGION" --allow-unauthenticated --port 8080 \
--set-env-vars="<the SAME complete env map proctor-api uses>" \
--set-secrets="/secrets/signer-key.json=proctor-signer-key:latest"

Same env-replacement hazard as the backend (see §Deploy modes): gcloud run deploy --set-env-vars REPLACES the whole env map. A proctor-eval deploy MUST carry the complete env + signer mount (it shares all of proctor-api's vars), or it ships a half-configured eval service. For routine code-only redeploys, omit --set-env-vars / --set-secrets so Cloud Run preserves the existing config; the image-only discipline applies here too.

After the frontend is up, tighten PUBLIC_APP_ORIGIN from * to the exact
frontend URL and redeploy the backend:
export PUBLIC_APP_ORIGIN="$(gcloud run services describe "$FRONTEND_SERVICE_NAME" \
--region "$REGION" --format='value(status.url)')"
SERVICE_NAME="$BACKEND_SERVICE_NAME" ./backend/deploy-gcp.sh

Set PUBLIC_APP_ORIGIN in .env.deploy.local to the frontend URL and re-run the backend in full mode: it rebuilds the complete env (so the locked CORS origin ships alongside every other live var, no merge gymnastics). If you only want to flip CORS without a rebuild, a one-key merge still works: gcloud run services update "$BACKEND_SERVICE_NAME" --region "$REGION" --update-env-vars="PUBLIC_APP_ORIGIN=${PUBLIC_APP_ORIGIN}" (merge preserves the other env, but does NOT touch the signer secret mount).

backend/deploy-gcp.sh is governed by DEPLOY_MODE (default full).
| Mode | When to use | What it does |
|---|---|---|
| full (default) | From-scratch deploys and any config-authoritative deploy where the env/secrets are the thing you're changing (new secret, rotated password, locked CORS, tuned EXEC_*). | Builds the image and sets the complete env map and mounts the signer key (--set-env-vars + --set-secrets), atomically. Runs the pre-flight gate first. The resulting revision is the full, correct production config. |
| image-only | Routine code redeploys: you changed app code, the live service already holds the full env + signer mount, and you only want to ship the new build. | Builds + deploys the image only (no --set-env-vars, no --set-secrets), so Cloud Run preserves the existing env + secret mounts. Skips the secret-existence pre-flight. |
# From-scratch / config change (default):
SERVICE_NAME="$BACKEND_SERVICE_NAME" ./backend/deploy-gcp.sh
# Routine new-code redeploy that preserves the live env + signer mount:
DEPLOY_MODE=image-only SERVICE_NAME="$BACKEND_SERVICE_NAME" ./backend/deploy-gcp.sh

Why this matters (the morning incident): gcloud run deploy --set-env-vars REPLACES the whole env map and --set-secrets REPLACES all secret mounts. An older script that set only ~8 env vars and no secret mount silently dropped the Judge0 keys, INVIGILATOR_PASSWORD, RETENTION_SWEEP_API_KEY, and the signer key on every re-run. The mode split is the fix: full reproduces everything, image-only touches nothing but the image. Do not hand-craft a partial --set-env-vars deploy.

This is how every production backend (and ideally frontend) change should go
out. Build the image, deploy it as a no-traffic tagged revision, verify on the
tag URL, and only then cut traffic — keeping the previous revision live at 0% so a
rollback is instant. The current deploy-gcp.sh deploys with live traffic by
default; for a staged cut, build the image with the script's pre-flight + full env
intent, then drive the traffic split explicitly with the commands below.
Run the normal full deploy but with --no-traffic --tag. The tag must be ≥3 chars,
lowercase alphanumerics/dashes (e.g. a short date or change id):
set -a; source .env.deploy.local; set +a
TAG="rel0619" # ≥3 chars; pick a date/change id
IMAGE="${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPOSITORY}/api:latest"
gcloud builds submit backend --tag "$IMAGE" # see the build-gotcha note below
# Deploy as a tagged revision that takes NO traffic yet. In full mode also pass
# the complete env + signer mount so the staged revision is the real config:
gcloud run deploy "$BACKEND_SERVICE_NAME" \
--image "$IMAGE" --region "$REGION" \
--no-traffic --tag "$TAG" \
--allow-unauthenticated --port 8080 --memory 256Mi --cpu 1 \
--min-instances 0 --max-instances 20 --concurrency 100 --timeout 120s \
--set-secrets="/secrets/signer-key.json=proctor-signer-key:latest" \
--set-env-vars="<the full env map; easiest: run ./backend/deploy-gcp.sh once to a tagged rev, or reuse the env from .env.deploy.local>"

Cloud Run gives the tagged revision its own URL: https://<TAG>---<service>-<hash>.a.run.app.
Capture it:
# value() joins list items with ";", so split them onto lines before filtering for the tag:
TAG_URL="$(gcloud run services describe "$BACKEND_SERVICE_NAME" --region "$REGION" \
--format='value(status.traffic[].url)' | tr ';' '\n' | grep -i "$TAG" | head -n 1 || true)"
# (also visible in: gcloud run services describe ... --format='yaml(status.traffic)')

Run the admin pre-flight health-check (the standard stack probe; see §Admin pre-flight health check) against the tag URL, plus a quick smoke:
# 1. Pre-flight health-check (light mode is safe; admin password required):
curl -s -X POST "$TAG_URL/api/admin/health-check" \
-H "x-admin-password: $ADMIN_PASSWORD" \
-H 'Content-Type: application/json' -d '{"mode":"light"}' | jq '.overall, .checks[].status'
# -> overall must be "green" (every non-skip check green)
# 2. Public exam-config responds 200 with JSON:
curl -s -o /dev/null -w '%{http_code}\n' "$TAG_URL/api/exam-config" # -> 200
# 3. Admin login works on the staged revision (an admin route returns 200 with the
# right password, 401 without):
curl -s -o /dev/null -w '%{http_code}\n' "$TAG_URL/api/admin/roster" \
-H "x-admin-password: $ADMIN_PASSWORD" # -> 200
curl -s -o /dev/null -w '%{http_code}\n' "$TAG_URL/api/admin/roster" # -> 401
# 4. (frontend staged rev) the served bundle carries the password-hash gate:
curl -s "$FRONTEND_TAG_URL/" | grep -o 'src="[^"]*\.js"' # then grep the JS for
# VITE_ADMIN_PASSWORD_HASH / VITE_INVIGILATOR_PASSWORD_HASH (the health-check's
# bundle_hashes probe does this automatically when PUBLIC_APP_ORIGIN is concrete)

Only proceed if the health-check overall is green and the smoke passes.
Find the verified revision name, then send it 100% of traffic:
REV="$(gcloud run services describe "$BACKEND_SERVICE_NAME" --region "$REGION" \
--format='value(status.latestCreatedRevisionName)')" # or pick by the tag
gcloud run services update-traffic "$BACKEND_SERVICE_NAME" --region "$REGION" \
--to-revisions="${REV}=100"

The previously-serving revision stays deployed at 0%; that is your instant rollback.
# List revisions + their current traffic split:
gcloud run revisions list --service "$BACKEND_SERVICE_NAME" --region "$REGION"
# Send 100% back to the previous (known-good) revision — instant, no rebuild:
gcloud run services update-traffic "$BACKEND_SERVICE_NAME" --region "$REGION" \
--to-revisions="<PREVIOUS_GOOD_REVISION>=100"

Because the old revision was never deleted, rollback is a single traffic flip (seconds), not a rebuild.
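
If you want a starting point for filling in the previous revision name, the sketch below prints the second-newest revision by creation time. previous_revision is a hypothetical helper, and "second newest" is only an assumption about which revision was serving before: confirm it against the traffic column of gcloud run revisions list before flipping traffic.

```shell
# Sketch: print the second-newest revision of a Cloud Run service.
# previous_revision is a hypothetical helper; $1 = service name, $2 = region.
previous_revision() {
  # "~" sorts descending, so line 1 is the newest revision and line 2 the one before it.
  gcloud run revisions list --service "$1" --region "$2" \
    --sort-by='~metadata.creationTimestamp' --limit 2 \
    --format='value(metadata.name)' | sed -n '2p'
}

# Usage (review the name before using it):
# PREV="$(previous_revision "$BACKEND_SERVICE_NAME" "$REGION")"
# gcloud run services update-traffic "$BACKEND_SERVICE_NAME" --region "$REGION" --to-revisions="${PREV}=100"
```

When several no-traffic revisions were staged in a row, the second-newest one may never have served traffic, which is why the manual check matters.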
gcloud builds submit can exit 1 while the build itself SUCCEEDED. The
common cause is a benign VPC-SC / log-streaming error — gcloud fails to tail
the build log (e.g. logs sink behind a service-perimeter) and returns nonzero even
though Cloud Build finished the image. Do not assume the build failed on a
nonzero exit. Confirm via gcloud builds describe and deploy the resolved
digest (sha256:…), never a moving :latest tag:
# 1. Find the build and confirm it actually succeeded:
BUILD_ID="$(gcloud builds list --limit=1 --format='value(id)')"
gcloud builds describe "$BUILD_ID" --format='value(status)' # -> SUCCESS
# 2. Resolve the immutable digest the build produced and deploy THAT (not :latest):
DIGEST="$(gcloud builds describe "$BUILD_ID" \
--format='value(results.images[0].digest)')" # sha256:...
IMAGE_BY_DIGEST="${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPOSITORY}/api@${DIGEST}"
gcloud run deploy "$BACKEND_SERVICE_NAME" --image "$IMAGE_BY_DIGEST" --region "$REGION" ...

Deploying the digest guarantees you ship the exact image the build produced and
verified, immune to a :latest tag being moved by a concurrent build.
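
Because an empty or malformed digest would produce a broken image reference, it can help to validate the value before deploying. The sketch below is one way to do that; image_by_digest is a hypothetical helper and the 64-hex-character check reflects the sha256 digest format.

```shell
# Sketch: build an image@digest reference, refusing anything that is not
# "sha256:" followed by exactly 64 lowercase hex characters.
# image_by_digest is a hypothetical name; $1 = image path without tag, $2 = digest.
image_by_digest() {
  case "$2" in
    sha256:*) hex="${2#sha256:}" ;;
    *) echo "not a sha256 digest: '$2'" >&2; return 1 ;;
  esac
  # tr -d strips every valid hex character; anything left over is invalid.
  if [ "${#hex}" -ne 64 ] || [ -n "$(printf '%s' "$hex" | tr -d '0-9a-f')" ]; then
    echo "malformed digest: '$2'" >&2
    return 1
  fi
  printf '%s@%s\n' "$1" "$2"
}

# Usage:
# IMAGE_BY_DIGEST="$(image_by_digest "${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPOSITORY}/api" "$DIGEST")" || exit 1
```

An empty DIGEST (for example when the build list returned a different build) fails the first check instead of silently yielding a reference that ends in "@".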

Frontend: never use a bare gcloud builds submit frontend / hand-deploy. It skips the password-hash bake + verify_dist_has_hashes gate and is what broke admin login before a ~700-student exam. Always deploy the frontend via frontend/deploy-gcp.sh (§3); apply the staged-traffic flip (above) on the resulting proctor-web revisions if you want a zero-downtime frontend cut.

POST /api/admin/health-check (admin-only, x-admin-password header) is the
standard pre-deploy / pre-exam stack verification — one button that proves every
load-bearing dependency works from this deployment's runtime. Verified against
backend/src/routes/healthCheck.mjs.
It stands up its own ephemeral, fully-namespaced canary contest + session, runs the probes against that canary, and always tears the canary down — it never touches real contest data.
| Mode | Cost | What it probes |
|---|---|---|
| light (default) | No Judge0 billing; safe mid-exam. | Firestore write/read/delete; GCS signed write/read (signer + bucket); served-bundle password-hash gate; admin auth + candidate session-start; exam-config for the canary; signed chunk-upload PUT; recordings list + signed read; telemetry .jsonl write; Judge0 reachability (/languages, no submission). |
| full | 2 metered Judge0 submissions (one sum-two 2-case batch). | Everything in light plus a real Judge0 execution of the seed problem. |
# Light pre-flight (safe to run any time, including during an exam):
curl -s -X POST "$API_URL/api/admin/health-check" \
-H "x-admin-password: $ADMIN_PASSWORD" \
-H 'Content-Type: application/json' -d '{"mode":"light"}' \
| jq '{overall, checks: [.checks[] | {id, status, detail}], cleanup}'
# overall == "green" means the whole stack (signing, upload, read, telemetry,
# bundle gate, Judge0, Firestore) is healthy from this deployment.

Response: { overall, mode, ran_at, duration_ms, checks[], cleanup }. overall is
"red" if any non-skip check is red. The bundle_hashes probe is skip unless
PUBLIC_APP_ORIGIN is a concrete origin (not *). The admin console exposes the
same probe as a one-button "pre-flight" action. Run light pre-flight on the tag
URL before every traffic cut, and again before every exam.
On the morning of a real exam (2026-06-19), recording-signing went down. Root cause: a backend deploy
re-ran an older deploy-gcp.sh that set only a subset of env vars and no signer
secret mount. Without SIGNER_KEY_FILE + the mounted proctor-signer-key, the
backend's signing client fell back to the main metadata-ADC client, which signs v4
URLs via the remote IAM signBlob token endpoint — a path that degrades/fails
under real-exam token-endpoint load. The same partial re-run also risked dropping
the Judge0 keys, INVIGILATOR_PASSWORD, and RETENTION_SWEEP_API_KEY.
What changed, and is now the standing practice:
- Signing is LOCAL. The signer key lives in Secret Manager (proctor-signer-key), is mounted at /secrets/signer-key.json, and SIGNER_KEY_FILE points the backend at it so v4 signing is a local crypto op (no per-request remote signBlob). A deploy must keep it mounted: full sets it; image-only preserves it. See §2c.
- The deploy script reproduces the COMPLETE live config. full mode sets the entire env map + the signer mount atomically, with a pre-flight gate that aborts before build if any required secret is missing. No more partial --set-env-vars.
- Staged deploy + instant rollback is the standard. Build → no-traffic tagged revision → verify on the tag URL → cut traffic → keep the prior revision at 0% for an instant traffic-flip rollback. See §Staged zero-downtime deploy.
- Pre-flight before every exam (and every cut). Run POST /api/admin/health-check (light): it exercises exactly the paths that broke (local signing, chunk upload, recordings read, telemetry, bundle hash-gate, Judge0 reachability) from the live runtime. Green before you cut traffic; green again before the exam opens.

Revision names and service URLs are point-in-time: they roll forward on every deploy. Verify the current revisions and traffic split with gcloud run revisions list --service proctor-api --region "$REGION" (and gcloud run services list for URLs) rather than hard-coding values.

| Fact | Value |
|---|---|
| Project / region | your-gcp-project-id / asia-south1 (example region) |
| Backend service | proctor-api (revision names are point-in-time) |
| Frontend service | proctor-web (revision names are point-in-time) |
| API root / | Returns 404 by design; all routes are /api/*. |
| min-instances | 0 for testing; set 1 for a real exam (cold-start avoidance). |

First, run the admin pre-flight health-check — it is the canonical full-stack probe (see §Admin pre-flight health check). The curl smoke below is the lightweight no-auth complement. In the staged workflow, run both against the tag URL before cutting traffic.
Run after both services are up. All three checks are verified against
handler.mjs / auth.mjs.
WEB_URL="$(gcloud run services describe "$FRONTEND_SERVICE_NAME" --region "$REGION" --format='value(status.url)')"
API_URL="$(gcloud run services describe "$BACKEND_SERVICE_NAME" --region "$REGION" --format='value(status.url)')"
# 1. Frontend serves (expect 200):
curl -s -o /dev/null -w '%{http_code}\n' "$WEB_URL"
# 2. Public exam-config responds with JSON (no auth — student form renders pre-session):
curl -s "$API_URL/api/exam-config"
# -> JSON with roster_required, unique_id_label, rooms, enforcement, camera_recording
# 3. An admin route rejects with no/invalid password (expect 401 "Unauthorized"):
curl -s -o /dev/null -w '%{http_code}\n' "$API_URL/api/admin/roster"
# -> 401 (requireAdmin checks the x-admin-password header; missing => 401)
# (sanity) API root returns 404 by design:
curl -s -o /dev/null -w '%{http_code}\n' "$API_URL/"
# -> 404

Expected: 200, a JSON exam-config body, 401, 404. The Wave-6/7 admin routes
/api/admin/{people,contest-results,contest-export,retention-sweep} should all
return 401 unauthenticated.
For a real exam, also drive the deployed stack in a browser as Admin / Candidate / Invigilator and confirm the happy path.
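
The status-code checks in the smoke above can also be wrapped in a small harness that counts failures instead of relying on reading each code by eye. This is a sketch: expect_status and FAILURES are hypothetical names, and the URLs in the usage lines are the ones captured earlier in this section.

```shell
# Sketch: compare the HTTP status of a request against an expected value
# and keep a running count of mismatches.
FAILURES=0
expect_status() {
  # $1 = label, $2 = expected HTTP status; remaining args are passed to curl.
  label="$1"; want="$2"; shift 2
  got="$(curl -s -o /dev/null -w '%{http_code}' "$@")"
  if [ "$got" = "$want" ]; then
    echo "ok   $label ($got)"
  else
    echo "FAIL $label: expected $want, got $got" >&2
    FAILURES=$((FAILURES + 1))
  fi
}

# Usage:
# expect_status "frontend"        200 "$WEB_URL"
# expect_status "admin (no auth)" 401 "$API_URL/api/admin/roster"
# expect_status "api root"        404 "$API_URL/"
# [ "$FAILURES" -eq 0 ] || exit 1
```

Call expect_status directly rather than inside $(...), otherwise the FAILURES counter is updated in a subshell and lost.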