End-to-end exercises for everything shipped in PRs #305 → #311. Tests both surfaces (Claude Cowork MCP and the storyboard.daydream.monster webapp) and exercises every closed-loop primitive: quality gate, Plan→Act, critic, replan, auto-routing, scorecard, and browser approval card.
Time: ~10 minutes per surface · Cost: under $0.10 total · Prereq: Daydream API key
- Open Claude Desktop or Cowork
- Settings → MCP → add server:
  - URL: `https://storyboard.daydream.monster/api/mcp`
  - Auth: Bearer `sk_…` (your Daydream key from `~/.daydream/credentials`)
- Confirm the `mcp__storyboard__*` tools show in the tool list. You should see ~65, including `submit_plan`, `critique_batch`, `replan`, and `get_scorecard`.
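The tool-list check can also be done outside the UI by sending the standard MCP `tools/list` request and diffing the returned names against the four primitives named above. The following is a minimal sketch, not the project's own tooling: `missingTools` is a hypothetical helper, and it assumes the server reports un-prefixed names (with the `mcp__storyboard__` prefix added on the client side).

```typescript
// Tools this guide relies on (names taken from the checklist above).
const REQUIRED_TOOLS = ["submit_plan", "critique_batch", "replan", "get_scorecard"];

// JSON-RPC body for the standard MCP tools/list request.
const listToolsRequest = { jsonrpc: "2.0", id: 1, method: "tools/list" };

// Hypothetical helper: given the tool names a server reported,
// return the required ones that are absent.
function missingTools(reported: string[], required: string[] = REQUIRED_TOOLS): string[] {
  const seen = new Set(reported);
  return required.filter((name) => !seen.has(name));
}

// Example with a made-up partial listing: critique_batch and get_scorecard are reported missing.
console.log(JSON.stringify(listToolsRequest));
console.log(missingTools(["submit_plan", "replan", "create_media"]));
```

An empty result means all four closed-loop tools are visible to the client.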
- Go to https://storyboard.daydream.monster/
- Sign in / paste Daydream API key in settings
- Open a fresh chat
What this tests: Per-output Gemini-graded retry. Should fire on a borderline prompt and surface the score in the response.
Call `create_media` with these exact args:
- action: "generate"
- prompt: "a photoreal close-up of a chameleon on a leaf with translucent skin showing veins, studio lighting"
- model_override: "flux-schnell"
- quality_threshold: 0.85
- max_quality_retries: 1
After it runs, show me the `quality` field from structuredContent. I want to see score, pass, and attempts_used.
Why bulleted "exact args" phrasing: conversational phrasing ("set quality_threshold to 0.85") leaves room for Claude to decide the field isn't important and use defaults. Explicit bulleted args make it reliably pass through.
Expected:
- Claude calls `create_media({action:"generate", model_override:"flux-schnell", quality_threshold:0.85, max_quality_retries:1, …})`
- Response contains `quality: { attempted: true, pass: bool, score: 0..1, sub_scores: {...}, attempts_used: 1|2|3 }`
- If score < 0.85, you'll see `attempts_used: 2` (retry fired); otherwise `attempts_used: 1`
- If you don't see `quality_threshold:0.85` in Claude's tool call, the prompt got lost: re-emphasize "use exactly the args I listed" and try again. The default threshold (0.7) is easier to pass, so the test won't exercise the retry path.
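The expectations above can be turned into a small check on the `quality` field. This is a sketch under stated assumptions: the interface mirrors the shape quoted in this guide (the keys inside `sub_scores` are not specified here, so they are left open), and `describeQuality` is a hypothetical helper rather than part of the server or webapp.

```typescript
// Shape of the quality field as described in this guide.
interface QualityResult {
  attempted: boolean;
  pass: boolean;
  score: number; // 0..1
  sub_scores?: Record<string, number>; // keys are not documented here
  attempts_used: number;
}

// Hypothetical helper: report whether the gate ran and whether the retry path fired.
function describeQuality(q: QualityResult | undefined): string {
  if (!q) return "FAIL: no quality field in response";
  if (!q.attempted) return "FAIL: gate skipped (attempted:false)";
  const retried = q.attempts_used > 1;
  const verdict = q.pass ? "PASS" : "FAIL";
  return `${verdict} ${q.score.toFixed(2)} (${retried ? "retry fired" : "no retry"})`;
}

console.log(describeQuality({ attempted: true, pass: true, score: 0.91, attempts_used: 1 }));
// → PASS 0.91 (no retry)
console.log(describeQuality({ attempted: true, pass: false, score: 0.8, attempts_used: 2 }));
// → FAIL 0.80 (retry fired)
```

Treating `attempted:false` as a failure only makes sense for this image-generate test; for non-image actions the gate is expected to skip.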
generate a photoreal close-up of a chameleon on a leaf, quality_threshold 0.85
Expected: Card appears with image. Inspect the card / inline message for a "Quality: PASS/FAIL X.XX" line.
What this tests: Auto-routing through submit_plan for multi-tool intents, approval gate, browser approval card.
Call `submit_plan` with these exact args:
- goal: "duck product shots"
- steps: [ {label:"wood", tool:"create_media", args:{action:"generate", prompt:"a yellow rubber duck on a wood table, photoreal", model_override:"flux-schnell", async:false}}, {label:"marble", tool:"create_media", args:{action:"generate", prompt:"a yellow rubber duck on marble, photoreal", model_override:"flux-schnell", async:false}} ]
Don't approve yet — just propose. Show me the plan_id and step costs.
Expected:
- Claude calls `submit_plan({steps:[create_media x2], goal:"duck product shots"})`, NOT direct `create_media` (this verifies that skill auto-routing worked)
- Response shows `plan_id`, both steps with cost, and `status=proposed`
- Claude does NOT auto-fire the approve step
Approve plan plan_xxx
Expected:
- Claude calls `submit_plan({plan_id:"plan_xxx", confirm:true})`
- Background execution starts (~30s)
- Polling with `get_plan` shows both steps complete with URLs
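The two-phase flow (propose first, approve second) comes down to two `submit_plan` calls with different arguments. A minimal sketch of the payloads follows, assuming the same JSON-RPC `tools/call` envelope that the curl smoke test at the end of this guide uses; `toolCall`, `proposePlan`, `approvePlan`, and `duckStep` are hypothetical helper names.

```typescript
interface PlanStep {
  label?: string;
  tool: string;
  args: Record<string, unknown>;
}

// Wrap a tool call in a JSON-RPC tools/call envelope.
function toolCall(id: number, name: string, args: Record<string, unknown>) {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Phase 1: propose. No confirm flag, so the plan should come back as proposed.
function proposePlan(goal: string, steps: PlanStep[]) {
  return toolCall(1, "submit_plan", { goal, steps });
}

// Phase 2: approve an existing plan by id.
function approvePlan(planId: string) {
  return toolCall(2, "submit_plan", { plan_id: planId, confirm: true });
}

// Build one create_media step with the args used in this test.
const duckStep = (label: string, prompt: string): PlanStep => ({
  label,
  tool: "create_media",
  args: { action: "generate", prompt, model_override: "flux-schnell", async: false },
});

const proposal = proposePlan("duck product shots", [
  duckStep("wood", "a yellow rubber duck on a wood table, photoreal"),
  duckStep("marble", "a yellow rubber duck on marble, photoreal"),
]);
console.log(JSON.stringify(proposal));
console.log(JSON.stringify(approvePlan("plan_xxx"))); // plan_xxx is a placeholder id
```

Keeping the approval as a separate call is what lets the cost be surfaced before anything runs.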
/plan duck product shots — generate a yellow duck on wood, then a yellow duck on marble
(if the chat agent maps to submit_plan; otherwise:)
make a 2-shot plan: yellow duck on wood, yellow duck on marble. Surface cost before running.
Expected: A purple PlanCard renders with both steps + cost + "Approve and execute" button. Click Approve. Card refreshes (click "refresh" if needed) showing per-step status pills going pending → running → done.
What this tests: critique_batch firing automatically after a multi-scene project, attaching verdict to the response.
Use generate_project to create a 3-scene mini-story about a friendly robot watering plants on Mars. Brief tone: hopeful, Pixar-style. After it finishes, show me the critique verdict from the response — don't call critique_batch separately, it should fire automatically.
Expected:
- Claude calls `generate_project` (or `submit_creative_job`) with the brief
- After ~60–90s, the response includes `critique: { attempted: true, verdict: "ship"|"iterate", issues: [...], missing: [...], strengths: [...] }`
- This proves #309's auto-fire hook landed: the critic ran without explicit invocation
Now run critique_batch directly with the same scenes to compare verdicts. Pass the project's scene URLs as `urls`.
Expected: Same verdict shape; should largely agree with the auto-fired result.
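Comparing the auto-fired verdict with the direct `critique_batch` verdict can be sketched as below. The `Critique` interface follows the shape quoted in this guide (`skipped_reason` comes from the troubleshooting notes); `compareCritiques` is a hypothetical helper, and it only compares the top-level verdict, not the individual issues.

```typescript
type Verdict = "ship" | "iterate";

interface Critique {
  attempted: boolean;
  verdict?: Verdict;
  issues?: string[];
  missing?: string[];
  strengths?: string[];
  skipped_reason?: string;
}

// Hypothetical helper: compare the auto-fired critique with a direct critique_batch run.
function compareCritiques(auto: Critique, manual: Critique): string {
  if (!auto.attempted || !manual.attempted) {
    return "incomparable: a critic run was skipped";
  }
  if (auto.verdict === manual.verdict) return `agree: ${auto.verdict}`;
  return `disagree: auto=${auto.verdict}, manual=${manual.verdict}`;
}

console.log(compareCritiques(
  { attempted: true, verdict: "ship", issues: [] },
  { attempted: true, verdict: "ship", issues: [] },
));
// → agree: ship
```

Since the critic is model-graded, occasional disagreement between the two runs is plausible; a disagreement is a prompt to eyeball the scenes rather than an automatic failure.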
make a 3-scene story about a robot watering plants on Mars, hopeful Pixar style
Expected: Story card or project cards appear with scene images. The auto-critique verdict surfaces inline (depends on chat UI integration; at minimum visible via the MCP envelope inspector).
What this tests: A plan that ends partial or failed surfaces a recovery hint, replan classifies the failure, the new plan goes through the approval gate.
Call `submit_plan` with these exact args:
- goal: "test moderation recovery"
- steps: [ {label:"safe", tool:"create_media", args:{action:"generate", prompt:"a beautiful sunset over mountains", model_override:"gpt-image", async:false}}, {label:"fails moderation", tool:"create_media", args:{action:"generate", prompt:"a logo in the style of Studio Ghibli featuring Mickey Mouse", model_override:"gpt-image", async:false}} ]
After it returns a plan_id, approve it with `submit_plan({plan_id, confirm:true})`. Wait ~60s, then call `get_plan` and quote me the entire `recovery` field from structuredContent.
Note: Using gpt-image (OpenAI gpt-image-2) for this one because its moderation is reliably strict on Studio Ghibli + named-character combos. flux-schnell is more permissive and may not fail.
Expected:
- Plan proposed → approved → executed
- Step 1: `done` with image URL
- Step 2: `failed` with moderation error
- Plan status: `partial`
- `get_plan` response includes `recovery: { available: true, suggested_strategy: "retry_with_delta", failure_classification: "moderation", next_action: { tool: "replan", args: { plan_id, strategy: "auto" } } }`
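The `recovery` field is designed to be acted on: `next_action` names the tool and args for the follow-up call. A minimal sketch of reading it follows, assuming the shape quoted above; `nextRecoveryCall` is a hypothetical helper and `plan_xxx` is a placeholder id.

```typescript
interface Recovery {
  available: boolean;
  suggested_strategy?: string;
  failure_classification?: string;
  next_action?: { tool: string; args: Record<string, unknown> };
}

// Hypothetical helper: return the follow-up tool call, or null when no recovery is offered.
function nextRecoveryCall(
  recovery: Recovery | undefined,
): { tool: string; args: Record<string, unknown> } | null {
  if (!recovery || !recovery.available || !recovery.next_action) return null;
  return recovery.next_action;
}

const recovery: Recovery = {
  available: true,
  suggested_strategy: "retry_with_delta",
  failure_classification: "moderation",
  next_action: { tool: "replan", args: { plan_id: "plan_xxx", strategy: "auto" } },
};

console.log(nextRecoveryCall(recovery)); // the replan call with strategy "auto"
console.log(nextRecoveryCall({ available: false })); // null
```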
Recover this with replan. Use the auto strategy.
Expected:
- Claude calls `replan({plan_id, strategy:"auto"})`
- Response shows new plan_id, classification "moderation", strategy "retry_with_delta"
- The new plan's step 1 has a delta-prompt prefix like "Describe aesthetics without naming any studio / artist / brand."
Approve the recovery plan.
Expected: New plan executes, step renders without the prohibited brand name.
Same prompt sequence in chat. The PlanCard for the partial plan should show a yellow/amber "Recover with retry_with_delta" button.
What this tests: End-of-project aggregate that visualizes quality, critique, cost, time in one card.
Get me the scorecard for plan_xxx (one of the plans from Test 2 or 4).
Expected:
- Claude calls `get_scorecard({plan_id:"plan_xxx"})`
- Response renders as a one-page card with:
- headline: "2/2 steps complete — quality 0.92"
- Steps: succeeded count + success rate
- Quality: mean score + pass count + retries
- Critique: verdict + issues (if applicable for creative jobs)
- Cost: estimated + actual
- Total wall time
- Viewer URL if applicable
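A quick way to confirm the scorecard is fully populated is to check for its five sections. This is a sketch only: the key names `steps`, `quality`, `critique`, `cost`, and `time` are assumptions based on the section labels in this guide, not a confirmed response schema, and `missingScorecardSections` is a hypothetical helper.

```typescript
// Assumed section keys, taken from the labels used in this guide.
const SCORECARD_SECTIONS = ["steps", "quality", "critique", "cost", "time"] as const;

// Hypothetical helper: list the sections that are absent or null.
// `skip` lets a caller ignore sections that do not apply (e.g. critique on a plain plan).
function missingScorecardSections(
  card: Record<string, unknown>,
  skip: string[] = [],
): string[] {
  return SCORECARD_SECTIONS.filter(
    (key) => !skip.includes(key) && (card[key] === undefined || card[key] === null),
  );
}

// Made-up card for a plan without a critique section.
const planCard = { steps: { succeeded: 2, total: 2 }, quality: { mean: 0.92 }, cost: {}, time: 41 };
console.log(missingScorecardSections(planCard)); // critique is reported missing
console.log(missingScorecardSections(planCard, ["critique"])); // nothing missing
```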
Run generate_project for a 3-scene story about a fox in autumn. When it's done, get the scorecard for the cjob_ id.
Expected: Scorecard with critique verdict prominently rendered.
Type the prompt; scorecard will appear in chat (or inline in the project's viewer page).
For each test, you should see:
| Test | Pass signal | Fail signal |
|---|---|---|
| 1 quality gate | `quality` field present in response | no `quality` field; or `attempted:false` for an image-generate call |
| 2 Plan→Act | Claude reaches for `submit_plan` (not direct `create_media`); PlanCard renders in webapp | Claude fires `create_media` directly without proposing; no PlanCard appears |
| 3 critic | `critique` field in project response; verdict matches eyeball | no `critique` field on a multi-scene project |
| 4 replan | `recovery.available:true` on partial plan; new plan_id from `replan`; delta-prompt visible in step args | no `recovery` field; or `replan` returns `recovered:false` for a moderation case |
| 5 scorecard | All 5 fields populated (steps / quality / critique / cost / time) | tool not found; or missing fields |
- "plan not found": should not happen anymore (bearer-hash bug fixed in #310). If it does, file an issue with the plan_id.
- No `quality` field on create_media: check that the response is for an image-producing action (not video / audio / tool). The gate auto-skips non-image kinds.
- `critique: { attempted: false, skipped_reason: "..." }`: normal. The critic skips if <2 scenes succeeded or GEMINI_API_KEY is unset on the server. Both should be fine on production.
- PlanCard doesn't render: confirm you're on the latest webapp (hard-refresh). Confirm the message text contains the `<<<STORYBOARD_PLAN_CARD>>>` markers; if the agent isn't emitting the envelope, the card detection won't fire.
- Approve button does nothing: open the DevTools network tab and check the `/api/mcp` POST to `submit_plan`; confirm 200 + cookie auth.
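For the PlanCard case, the marker check can be done on the raw message text. A minimal sketch follows: it only counts occurrences of the marker string quoted above and makes no assumption about what the envelope between markers looks like; `countPlanCardMarkers` is a hypothetical helper.

```typescript
const PLAN_CARD_MARKER = "<<<STORYBOARD_PLAN_CARD>>>";

// Hypothetical helper: count how many times the marker appears in a message.
// Zero means the agent did not emit the envelope, so card detection cannot fire.
function countPlanCardMarkers(messageText: string): number {
  return messageText.split(PLAN_CARD_MARKER).length - 1;
}

console.log(countPlanCardMarkers("Here is your plan, no envelope attached.")); // 0
console.log(countPlanCardMarkers(`${PLAN_CARD_MARKER}{"plan_id":"plan_xxx"}${PLAN_CARD_MARKER}`)); // 2
```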
export DK=$(grep -oE 'sk_[A-Za-z0-9]+' ~/.daydream/credentials | head -1)
# Propose
curl -s -X POST https://storyboard.daydream.monster/api/mcp \
-H "Authorization: Bearer $DK" -H "Accept: application/json, text/event-stream" -H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"submit_plan","arguments":{"goal":"smoke","steps":[{"tool":"create_media","args":{"action":"generate","prompt":"a tiny green plant","model_override":"flux-schnell","async":false}},{"tool":"create_media","args":{"action":"generate","prompt":"a tiny green plant in a clay pot","model_override":"flux-schnell","async":false}}]}}}'
# (capture plan_id from response, then approve + poll scorecard)

This is the same flow the live E2E in scripts/e2e-quality-gate.ts exercises programmatically.
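Capturing the plan_id from the curl response takes a little parsing, because the request accepts both `application/json` and `text/event-stream`. The sketch below handles either body form. It assumes the id is exposed as a string-valued `plan_id` key somewhere in the structured result; `firstJsonRpcMessage` and `findPlanId` are hypothetical helpers, and the sample body is made up.

```typescript
// Return the first JSON-RPC message from a body that is plain JSON or an SSE stream.
function firstJsonRpcMessage(body: string): unknown {
  const trimmed = body.trim();
  if (trimmed.startsWith("{")) return JSON.parse(trimmed);
  for (const line of trimmed.split(/\r?\n/)) {
    if (line.startsWith("data:")) return JSON.parse(line.slice(5).trim());
  }
  throw new Error("no JSON-RPC message found in response body");
}

// Walk the parsed message and return the first string-valued plan_id key, if any.
function findPlanId(value: unknown): string | null {
  if (Array.isArray(value)) {
    for (const item of value) {
      const found = findPlanId(item);
      if (found) return found;
    }
    return null;
  }
  if (value !== null && typeof value === "object") {
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      if (key === "plan_id" && typeof child === "string") return child;
      const found = findPlanId(child);
      if (found) return found;
    }
  }
  return null;
}

// Made-up SSE body for illustration.
const sseBody =
  'event: message\ndata: {"jsonrpc":"2.0","id":1,"result":{"structuredContent":{"plan_id":"plan_abc"}}}\n\n';
console.log(findPlanId(firstJsonRpcMessage(sseBody))); // plan_abc
```

If the server instead returns the plan as JSON inside a text content item, the text would need a second `JSON.parse` before the walk.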