A minimal todo-list PWA designed as an agent evaluation environment (RL gym). Agents interact with the UI to complete tasks and are scored by a deterministic eval harness.
Use this as a lightweight reference for:
- How to build a gym-compatible PWA installable on Android as a Trusted Web Activity (TWA)
- How to send state in/out via the gym API for automated evaluation
- How to seed per-task state deterministically
```bash
pnpm install
pnpm dev   # http://localhost:3000
```

Or with Docker:
```bash
docker build -t todo-gym .
docker run -p 3000:3000 todo-gym
```

The app ships with a `/.well-known/assetlinks.json` route and is ready for Bubblewrap or `sb.pwa_install`:
```bash
npm i -g @bubblewrap/cli
bubblewrap init --manifest https://<your-host>/manifest.json
bubblewrap build
adb install app-release-signed.apk
```

Set these env vars before deploying so the assetlinks fingerprint matches your keystore:
```bash
TWA_SHA256_FINGERPRINT=AA:BB:CC:...   # from `keytool -list -v -keystore android.keystore`
TWA_PACKAGE_NAME=com.cuaai.gymtodo
```

The Cua Android image has built-in PWA installation support. It handles keystore generation, Bubblewrap build, assetlinks trust, and APK install in one call:
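```python
async with Sandbox.ephemeral(Image.android().pwa_install(
    # 10.0.2.2 is the Android emulator's alias for the host machine's localhost
    url="http://10.0.2.2:3000",
    package_name="com.cuaai.gymtodo",
)) as sb:
    ...
```

The gym API has the same shape as slack-env and works with any cua-bench eval harness: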
| Method | Endpoint | Description |
|---|---|---|
| GET | `/gym/tasks` | List all tasks |
| POST | `/gym/start/:taskId` | Seed DB and start task → `{ success, prompt, task }` |
| GET | `/gym/evaluate` | Evaluate current state → `{ success, reward, message, subResults }` |
| POST | `/gym/evaluate` | Retrieval eval, with body `{ agentAnswer }` |
| POST | `/gym/reset` | Reset DB to shared seed |
| POST | `/gym/session` | Create isolated session → `{ sessionId }` |
| DELETE | `/gym/session` | Destroy session |

Pass X-Session-Id: <id> (or cookie gym_session) for isolated parallel sessions.
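
The endpoints above can be driven from any HTTP client. The following is a minimal sketch using only the Python standard library. The base URL, the helper names (`gym_request`, `call`, `run_episode`), the assumption that responses are JSON (or empty), and sending the session header on `DELETE /gym/session` are illustrative choices, not taken from the app's source.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000"  # assumption: the dev server from `pnpm dev`


def gym_request(method, path, session_id=None, body=None):
    """Build (but do not send) a request for a gym endpoint."""
    headers = {}
    data = None
    if session_id is not None:
        # Isolated parallel sessions are selected via this header.
        headers["X-Session-Id"] = session_id
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


def call(req):
    """Send a prepared request; decode a JSON body, or return None if empty."""
    with urllib.request.urlopen(req) as resp:
        raw = resp.read()
    return json.loads(raw) if raw else None


def run_episode(task_id, agent_answer=None):
    """Create a session, start a task, evaluate, then destroy the session."""
    session_id = call(gym_request("POST", "/gym/session"))["sessionId"]
    try:
        started = call(gym_request("POST", f"/gym/start/{task_id}", session_id))
        print("prompt:", started.get("prompt"))
        # ... the agent acts on the UI here ...
        if agent_answer is None:
            req = gym_request("GET", "/gym/evaluate", session_id)
        else:
            # Retrieval tasks submit the agent's answer in the body.
            req = gym_request("POST", "/gym/evaluate", session_id,
                              {"agentAnswer": agent_answer})
        return call(req)
    finally:
        # Assumption: the same header identifies the session to destroy.
        call(gym_request("DELETE", "/gym/session", session_id))
```

With the app running, `run_episode("add_item")` would return the evaluation result for a state-based task, and `run_episode("count_items", agent_answer=3)` would exercise the retrieval path.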
```bash
docker run -p 3000:3000 -e EVAL_SERVER=http://your-server:8080 todo-gym
```

With `EVAL_SERVER` set, the app posts start, evaluate, and reset events to your server, using the same payload schema as slack-env.
```
tasks/
  primitives/             # atomic single-action or retrieval tasks
    add_item/
    complete_item/
    delete_item/
    edit_item/
    clear_completed/
    set_filter_active/
    set_filter_completed/
    add_three_items/
    count_items/          # retrieval; answer: number of items
  advanced/               # multi-step composed tasks
    add_and_complete/
    full_workflow/
  _shared/
    seed.sql              # shared baseline (empty list, filter=all)
```

Each task.json:
```json
{
  "id": "add_item",
  "description": "Add a new todo item with the text: 'Buy groceries'.",
  "evalFunc": "check_item_exists",
  "weight": 1,
  "defaultParams": { "keywords": ["buy groceries"] }
}
```

Retrieval tasks also carry `json_schema` and `expected_value`.
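
To make that shape concrete, here is a small sketch of a validator for primitive task definitions. The required fields follow the example above; treating `weight` as optional and switching on a `retrieval` flag are assumptions, and `validate_task` is a hypothetical helper that is not part of this repo.

```python
import json

REQUIRED = ("id", "description", "evalFunc", "defaultParams")
RETRIEVAL_EXTRA = ("json_schema", "expected_value")


def validate_task(task, retrieval=False):
    """Return a list of problems found in a primitive task definition."""
    problems = [f"missing field: {key}" for key in REQUIRED if key not in task]
    if retrieval:
        # Retrieval tasks carry two extra fields on top of the common ones.
        problems += [
            f"missing field: {key}" for key in RETRIEVAL_EXTRA if key not in task
        ]
    if "weight" in task and not isinstance(task["weight"], (int, float)):
        problems.append("weight must be a number")
    return problems


example = json.loads("""
{
  "id": "add_item",
  "description": "Add a new todo item with the text: 'Buy groceries'.",
  "evalFunc": "check_item_exists",
  "weight": 1,
  "defaultParams": { "keywords": ["buy groceries"] }
}
""")

print(validate_task(example))                  # []
print(validate_task(example, retrieval=True))  # reports the two retrieval fields
```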
Advanced tasks reference primitives by ID:
```json
{
  "id": "add_and_complete",
  "description": "Add 'Schedule dentist appointment', then mark it completed.",
  "steps": [
    ["add_item", { "keywords": ["schedule dentist appointment"] }],
    ["complete_item", { "keywords": ["schedule dentist appointment"] }]
  ]
}
```

- Desktop: press ↑ to open the floating gym panel
- Mobile/Android: tap the 🏋️ FAB (bottom-left) to open the panel

The panel lets you select a task, start it (seeds the DB), optionally type an agent answer for retrieval tasks, evaluate, and reset.
- Create `tasks/primitives/<id>/task.json` with `id`, `description`, `evalFunc`, `defaultParams`
- Create `tasks/primitives/<id>/seed.sql` with the starting DB state
- If a new eval function is needed, add it to `eval.mjs`
- For advanced tasks, create `tasks/advanced/<id>/task.json` with a `steps` array
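
The first two steps can be scripted. The sketch below writes a `task.json` and a `seed.sql` for a new primitive under a given tasks root. `scaffold_primitive` is a hypothetical helper, the `add_milk` task is made up for the demo, and the seed SQL is passed in verbatim because the database schema is not described here.

```python
import json
import tempfile
from pathlib import Path


def scaffold_primitive(tasks_root, task_id, description, eval_func,
                       default_params, seed_sql, weight=1):
    """Create <tasks_root>/primitives/<task_id>/ with task.json and seed.sql."""
    task_dir = Path(tasks_root) / "primitives" / task_id
    task_dir.mkdir(parents=True, exist_ok=False)  # refuse to overwrite a task
    task = {
        "id": task_id,
        "description": description,
        "evalFunc": eval_func,
        "weight": weight,
        "defaultParams": default_params,
    }
    (task_dir / "task.json").write_text(
        json.dumps(task, indent=2) + "\n", encoding="utf-8"
    )
    (task_dir / "seed.sql").write_text(seed_sql, encoding="utf-8")
    return task_dir


# Demo against a throwaway directory so no real checkout is touched.
with tempfile.TemporaryDirectory() as root:
    created = scaffold_primitive(
        root,
        "add_milk",
        "Add a new todo item with the text: 'Buy milk'.",
        "check_item_exists",
        {"keywords": ["buy milk"]},
        "-- starting DB state goes here\n",
    )
    print(sorted(p.name for p in created.iterdir()))  # ['seed.sql', 'task.json']
```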
| Function | Checks |
|---|---|
| `check_item_exists` | Item with matching text exists |
| `check_item_done` | Item with matching text is completed |
| `check_item_deleted` | Item with matching text no longer exists |
| `check_no_completed` | No completed items remain |
| `check_filter` | Active filter matches expected value |
| `check_item_count_gte` | At least N items in the list |
| `check_retrieval` | Agent's answer matches `expected_value` |
| `compose_steps` | Orchestrates multi-step advanced tasks |
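
For orientation, here is a rough Python sketch of how keyword-based checks and step composition could work. The real functions live in `eval.mjs` and may differ: the item shape (`text`, `done`), case-insensitive substring matching, the `PRIMITIVES` registry, and the scoring rule (success only if every step passes, reward as the fraction of passing steps) are all assumptions made for illustration.

```python
def _matches(item, keywords):
    """True if every keyword occurs in the item's text (case-insensitive)."""
    text = item["text"].lower()
    return all(k.lower() in text for k in keywords)


def check_item_exists(items, keywords):
    """Some item contains all keywords."""
    return any(_matches(i, keywords) for i in items)


def check_item_done(items, keywords):
    """Some item contains all keywords and is marked completed."""
    return any(_matches(i, keywords) and i["done"] for i in items)


# Hypothetical registry: primitive task id -> its check function.
PRIMITIVES = {
    "add_item": check_item_exists,
    "complete_item": check_item_done,
}


def compose_steps(items, steps):
    """Evaluate each [primitive_id, params] step against the final state."""
    sub_results = [
        {"step": pid, "success": PRIMITIVES[pid](items, **params)}
        for pid, params in steps
    ]
    passed = sum(r["success"] for r in sub_results)
    return {
        "success": passed == len(sub_results),
        "reward": passed / len(sub_results),
        "subResults": sub_results,
    }


steps = [
    ["add_item", {"keywords": ["schedule dentist appointment"]}],
    ["complete_item", {"keywords": ["schedule dentist appointment"]}],
]
state = [{"text": "Schedule dentist appointment", "done": False}]
result = compose_steps(state, steps)
print(result["success"], result["reward"])  # False 0.5
```

In this example the item was added but never completed, so the first step passes and the second fails.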