Demo Script — Closed Agent Loop

End-to-end exercises for everything shipped in PRs #305 → #311. The script covers both surfaces (Claude Cowork MCP and the storyboard.daydream.monster webapp) and exercises every closed-loop primitive: quality gate, Plan→Act, critic, replan, auto-routing, scorecard, and browser approval card.

Time: ~10 minutes per surface · Cost: under $0.10 total · Prereq: Daydream API key


Setup (once)

Claude Cowork (MCP)

  1. Open Claude Desktop or Cowork
  2. Settings → MCP → add server:
    • URL: https://storyboard.daydream.monster/api/mcp
    • Auth: Bearer sk_… (your Daydream key from ~/.daydream/credentials)
  3. Confirm the mcp__storyboard__* tools appear in the tool list. You should see about 65, including submit_plan, critique_batch, replan, and get_scorecard

Webapp

  1. Go to https://storyboard.daydream.monster/
  2. Sign in / paste Daydream API key in settings
  3. Open a fresh chat

Test 1 — Single-output quality gate (S1)

What this tests: Per-output Gemini-graded retry. Should fire on a borderline prompt and surface the score in the response.

MCP — paste this prompt to Claude

Call create_media with these exact args:

  • action: "generate"
  • prompt: "a photoreal close-up of a chameleon on a leaf with translucent skin showing veins, studio lighting"
  • model_override: "flux-schnell"
  • quality_threshold: 0.85
  • max_quality_retries: 1

After it runs, show me the quality field from structuredContent — I want to see score, pass, and attempts_used.

Why the bulleted "exact args" phrasing: conversational phrasing ("set quality_threshold to 0.85") leaves room for Claude to decide the field isn't important and fall back to defaults. Explicit bulleted args are passed through reliably.

Expected:

  • Claude calls create_media({action:"generate", model_override:"flux-schnell", quality_threshold:0.85, max_quality_retries:1, …})
  • Response contains quality: { attempted: true, pass: bool, score: 0..1, sub_scores: {...}, attempts_used: 1|2|3 }
  • If score < 0.85, you'll see attempts_used: 2 (retry fired); otherwise attempts_used: 1
  • If you don't see quality_threshold:0.85 in Claude's tool call, the prompt got lost — re-emphasize "use exactly the args I listed" and try again. The default threshold (0.7) is easier to pass and the test won't exercise the retry path.
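
The checks above can be sketched as a small validator for the quality field. This is a minimal sketch, assuming the field arrives as a plain dict shaped like the Expected list; the helper names check_quality and retry_fired are hypothetical and not part of the storyboard API.

```python
from typing import Any


def check_quality(quality: dict[str, Any], max_retries: int) -> list[str]:
    """Return problems found in a `quality` field; an empty list means it looks right."""
    if quality.get("attempted") is not True:
        return ["attempted is not true, so the gate did not run"]
    problems: list[str] = []
    if not isinstance(quality.get("pass"), bool):
        problems.append("pass should be a boolean")
    score = quality.get("score")
    if isinstance(score, bool) or not isinstance(score, (int, float)) or not 0 <= score <= 1:
        problems.append("score should be a number between 0 and 1")
    attempts = quality.get("attempts_used")
    # One initial attempt plus at most `max_retries` retries.
    if isinstance(attempts, bool) or not isinstance(attempts, int) or not 1 <= attempts <= 1 + max_retries:
        problems.append(f"attempts_used should be between 1 and {1 + max_retries}")
    return problems


def retry_fired(quality: dict[str, Any]) -> bool:
    """True when more than one attempt was used, i.e. the retry path ran."""
    return quality.get("attempts_used", 1) > 1
```

With max_quality_retries: 1 as in the prompt above, attempts_used should never exceed 2, which is what the bound in check_quality encodes.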

Webapp — type in chat

generate a photoreal close-up of a chameleon on a leaf, quality_threshold 0.85

Expected: Card appears with image. Inspect the card / inline message for a "Quality: PASS/FAIL X.XX" line.


Test 2 — Multi-tool Plan→Act (S3 + #1 auto-routing + #5 PlanCard)

What this tests: Auto-routing through submit_plan for multi-tool intents, approval gate, browser approval card.

MCP — paste

Call submit_plan with these exact args:

  • goal: "duck product shots"
  • steps: [ {label:"wood", tool:"create_media", args:{action:"generate", prompt:"a yellow rubber duck on a wood table, photoreal", model_override:"flux-schnell", async:false}}, {label:"marble", tool:"create_media", args:{action:"generate", prompt:"a yellow rubber duck on marble, photoreal", model_override:"flux-schnell", async:false}} ]

Don't approve yet — just propose. Show me the plan_id and step costs.

Expected:

  • Claude calls submit_plan({steps:[create_media x2], goal:"duck product shots"}) (NOT direct create_media — verifies skill auto-routing worked)
  • Response shows plan_id, both steps with cost, status=proposed
  • Claude does NOT auto-fire the approve step

Then paste

Approve plan plan_xxx

Expected:

  • Claude calls submit_plan({plan_id:"plan_xxx", confirm:true})
  • Background execution starts (~30s)
  • Poll with get_plan shows both steps complete with URLs
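
If you drive this test from a script rather than from Claude, the polling step might look like the sketch below. It assumes the get_plan result exposes a steps list whose items carry a status of pending, running, done, or failed (the states named in this script); the exact response shape is an assumption, and fetch_plan is a hypothetical callable you supply to make the actual get_plan call.

```python
import time
from typing import Any, Callable

# Step states named in this script: pending, running, done, failed.
TERMINAL_STEP_STATES = {"done", "failed"}


def wait_for_plan(
    fetch_plan: Callable[[], dict[str, Any]],
    timeout_s: float = 90.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Call fetch_plan until every step is done or failed, then return the plan."""
    deadline = time.monotonic() + timeout_s
    while True:
        plan = fetch_plan()
        steps = plan.get("steps") or []
        if steps and all(s.get("status") in TERMINAL_STEP_STATES for s in steps):
            return plan
        if time.monotonic() >= deadline:
            raise TimeoutError(f"plan still running after {timeout_s}s")
        sleep(interval_s)
```

Passing sleep as a parameter keeps the loop testable without real delays.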

Webapp — type in chat

/plan duck product shots — generate a yellow duck on wood, then a yellow duck on marble

(This works if the chat agent maps /plan to submit_plan. Otherwise, type:)

make a 2-shot plan: yellow duck on wood, yellow duck on marble. Surface cost before running.

Expected: A purple PlanCard renders with both steps + cost + "Approve and execute" button. Click Approve. Card refreshes (click "refresh" if needed) showing per-step status pills going pending → running → done.


Test 3 — Cross-asset critic (S2 + #2 auto-fire)

What this tests: critique_batch firing automatically after a multi-scene project, attaching verdict to the response.

MCP — paste

Use generate_project to create a 3-scene mini-story about a friendly robot watering plants on Mars. Brief tone: hopeful, Pixar-style. After it finishes, show me the critique verdict from the response — don't call critique_batch separately, it should fire automatically.

Expected:

  • Claude calls generate_project (or submit_creative_job) with brief
  • After ~60–90s, response includes critique: { attempted: true, verdict: "ship"|"iterate", issues: [...], missing: [...], strengths: [...] }
  • This proves #309's auto-fire hook landed — the critic ran without explicit invocation

Then paste

Now run critique_batch directly with the same scenes to compare verdicts. Pass the project's scene URLs as urls.

Expected: Same verdict shape; should largely agree with the auto-fired result.
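
To compare the auto-fired and direct verdicts mechanically, a sketch along these lines may help. It assumes both critique objects are plain dicts with the fields listed in the Expected shape above, plus the skipped_reason field mentioned under Troubleshooting; the function names are hypothetical.

```python
from typing import Any

VERDICTS = {"ship", "iterate"}
LIST_FIELDS = ("issues", "missing", "strengths")


def critique_problems(critique: dict[str, Any]) -> list[str]:
    """Return shape problems in a critique object; an empty list means it looks right."""
    if critique.get("attempted") is not True:
        reason = critique.get("skipped_reason", "no reason given")
        return [f"critic did not run: {reason}"]
    problems: list[str] = []
    if critique.get("verdict") not in VERDICTS:
        problems.append(f"unexpected verdict: {critique.get('verdict')!r}")
    for field in LIST_FIELDS:
        if not isinstance(critique.get(field), list):
            problems.append(f"{field} should be a list")
    return problems


def verdicts_agree(auto: dict[str, Any], direct: dict[str, Any]) -> bool:
    """True when both critiques are well-formed and reach the same verdict."""
    if critique_problems(auto) or critique_problems(direct):
        return False
    return auto["verdict"] == direct["verdict"]
```

Only the verdict is compared here, because the individual issues and strengths can reasonably differ between two grader runs.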

Webapp

make a 3-scene story about a robot watering plants on Mars, hopeful Pixar style

Expected: Story card or project cards appear with scene images. The auto-critique verdict surfaces inline (depends on chat UI integration; at minimum visible via the MCP envelope inspector).


Test 4 — Replan recovery (L1 + #3 recovery hint)

What this tests: A plan that ends partial or failed surfaces a recovery hint, replan classifies the failure, the new plan goes through the approval gate.

MCP — paste

Call submit_plan with these exact args:

  • goal: "test moderation recovery"
  • steps: [ {label:"safe", tool:"create_media", args:{action:"generate", prompt:"a beautiful sunset over mountains", model_override:"gpt-image", async:false}}, {label:"fails moderation", tool:"create_media", args:{action:"generate", prompt:"a logo in the style of Studio Ghibli featuring Mickey Mouse", model_override:"gpt-image", async:false}} ]

After it returns a plan_id, approve it with submit_plan({plan_id, confirm:true}). Wait ~60s, then call get_plan and quote me the entire recovery field from structuredContent.

Note: Using gpt-image (OpenAI gpt-image-2) for this one because its moderation is reliably strict on Studio Ghibli + named-character combos. flux-schnell is more permissive and may not fail.

Expected:

  • Plan proposed → approved → executed
  • Step 1: done with image URL
  • Step 2: failed with moderation error
  • Plan status: partial
  • get_plan response includes recovery: { available: true, suggested_strategy: "retry_with_delta", failure_classification: "moderation", next_action: { tool: "replan", args: { plan_id, strategy: "auto" } } }
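
The recovery hint is designed to be followed mechanically: next_action names the tool and its args. A minimal sketch of reading it, assuming the get_plan payload is a dict with the recovery shape shown above (next_call_from_recovery is a hypothetical helper):

```python
from typing import Any, Optional


def next_call_from_recovery(plan: dict[str, Any]) -> Optional[tuple[str, dict[str, Any]]]:
    """Return (tool, args) from the recovery hint, or None when no recovery is offered."""
    recovery = plan.get("recovery") or {}
    if not recovery.get("available"):
        return None
    action = recovery.get("next_action") or {}
    tool = action.get("tool")
    args = action.get("args")
    if not tool or not isinstance(args, dict):
        return None
    # Copy the args so the caller can modify them without touching the plan.
    return tool, dict(args)
```

For the partial plan in this test, the helper would return ("replan", {"plan_id": ..., "strategy": "auto"}), which matches the call Claude is expected to make in the next step.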

Then paste

Recover this with replan. Use the auto strategy.

Expected:

  • Claude calls replan({plan_id, strategy:"auto"})
  • Response shows new plan_id, classification "moderation", strategy "retry_with_delta"
  • The new plan's step 1 has a delta-prompt prefix like "Describe aesthetics without naming any studio / artist / brand."

Then paste

Approve the recovery plan.

Expected: The new plan executes, and the step renders without the prohibited brand name.

Webapp

Same prompt sequence in chat. The PlanCard for the partial plan should show a yellow/amber "Recover with retry_with_delta" button.


Test 5 — Scorecard (#4)

What this tests: End-of-project aggregate that visualizes quality, critique, cost, and time in one card.

MCP — paste

Get me the scorecard for plan_xxx (one of the plans from Test 2 or 4).

Expected:

  • Claude calls get_scorecard({plan_id:"plan_xxx"})
  • Response renders as a one-page card with:
    • headline: "2/2 steps complete — quality 0.92"
    • Steps: succeeded count + success rate
    • Quality: mean score + pass count + retries
    • Critique: verdict + issues (if applicable for creative jobs)
    • Cost: estimated + actual
    • Total wall time
    • Viewer URL if applicable
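
A quick way to check that the card is fully populated is sketched below. The key names (steps, quality, critique, cost, time) are assumptions taken from the section labels in this script; the real scorecard payload may name them differently.

```python
from typing import Any

# Section names follow the labels used in this script; the real keys may differ.
SECTIONS = ("steps", "quality", "critique", "cost", "time")


def missing_sections(scorecard: dict[str, Any], creative_job: bool = False) -> list[str]:
    """List the sections that are absent or empty in a scorecard payload."""
    # Critique only applies to creative jobs, so skip it for plain plans.
    required = [s for s in SECTIONS if s != "critique" or creative_job]
    return [s for s in required if scorecard.get(s) in (None, {}, [], "")]
```

An empty result means every required section is present, which is the pass signal for this test.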

Or for a creative job

Run generate_project for a 3-scene story about a fox in autumn. When it's done, get the scorecard for the cjob_ id.

Expected: Scorecard with critique verdict prominently rendered.

Webapp

Type the same prompt in chat; the scorecard appears in chat (or inline on the project's viewer page).


Acceptance criteria — green = ship

For each test, you should see:

| Test | Pass signal | Fail signal |
| --- | --- | --- |
| 1 quality gate | quality field present in response | no quality field; or attempted:false for an image-generate call |
| 2 Plan→Act | Claude reaches for submit_plan (not direct create_media); PlanCard renders in webapp | Claude fires create_media directly without proposing; no PlanCard appears |
| 3 critic | critique field in project response; verdict matches eyeball | no critique field on a multi-scene project |
| 4 replan | recovery.available:true on partial plan; new plan_id from replan; delta-prompt visible in step args | no recovery field; or replan returns recovered:false for a moderation case |
| 5 scorecard | All 5 fields populated (steps / quality / critique / cost / time) | tool not found; or missing fields |

Troubleshooting

  • "plan not found": should not happen anymore (bearer-hash bug fixed in #310). If it does, file an issue with the plan_id.
  • No quality field on create_media: check the response is for an image-producing action (not video / audio / tool). The gate auto-skips non-image kinds.
  • critique: { attempted: false, skipped_reason: "..." }: normal — the critic skips if <2 scenes succeeded or GEMINI_API_KEY is unset on the server. Both should be fine on production.
  • PlanCard doesn't render: confirm you're on the latest webapp (hard-refresh). Confirm the message text contains the <<<STORYBOARD_PLAN_CARD>>> markers — if the agent isn't emitting the envelope, the card detection won't fire.
  • Approve button does nothing: open DevTools network tab and check /api/mcp POST to submit_plan — confirm 200 + cookie auth.
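
For the PlanCard case, you can check a copied message for the envelope markers with a few lines of Python. This is a sketch only: the marker string is taken from the bullet above, but whether it appears once or wraps the payload on both sides is an assumption, so the helper just counts occurrences.

```python
PLAN_CARD_MARKER = "<<<STORYBOARD_PLAN_CARD>>>"


def plan_card_marker_count(message_text: str) -> int:
    """Count how many times the PlanCard marker appears in a chat message."""
    return message_text.count(PLAN_CARD_MARKER)


def has_plan_card(message_text: str) -> bool:
    """True when the message contains at least one PlanCard marker."""
    return plan_card_marker_count(message_text) >= 1
```

A count of zero points at the agent not emitting the envelope, rather than at the webapp failing to render it.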

One-liner curl smoke (works anywhere with bash + python)

export DK=$(grep -oE 'sk_[A-Za-z0-9]+' ~/.daydream/credentials | head -1)

# Propose
curl -s -X POST https://storyboard.daydream.monster/api/mcp \
  -H "Authorization: Bearer $DK" -H "Accept: application/json, text/event-stream" -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"submit_plan","arguments":{"goal":"smoke","steps":[{"tool":"create_media","args":{"action":"generate","prompt":"a tiny green plant","model_override":"flux-schnell","async":false}},{"tool":"create_media","args":{"action":"generate","prompt":"a tiny green plant in a clay pot","model_override":"flux-schnell","async":false}}]}}}'

# (capture plan_id from response, then approve + poll scorecard)

This is the same flow the live E2E in scripts/e2e-quality-gate.ts exercises programmatically.
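
The capture-and-approve step left as a comment above could be sketched in Python as follows. The request envelope mirrors the curl payload, and the approve arguments mirror Test 2. Two things are assumptions: that the response body is either plain JSON or a server-sent-events frame with a single data: line (the request accepts both), and that plan_id sits under result.structuredContent. The helper names are hypothetical.

```python
import json
from typing import Any, Optional


def tool_call(name: str, arguments: dict[str, Any], request_id: int = 1) -> str:
    """Build the JSON-RPC body for a tools/call request, as in the curl example."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


def parse_rpc_body(body: str) -> dict[str, Any]:
    """Parse a response body that is either plain JSON or SSE-framed JSON."""
    text = body.strip()
    if text.startswith("{"):
        return json.loads(text)
    # SSE framing: take the first non-empty `data:` line (assumes one-line JSON).
    for line in text.splitlines():
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload:
                return json.loads(payload)
    raise ValueError("no JSON-RPC message found in response body")


def plan_id_from(message: dict[str, Any]) -> Optional[str]:
    """Pull plan_id out of result.structuredContent (assumed location)."""
    result = message.get("result") or {}
    structured = result.get("structuredContent") or {}
    return structured.get("plan_id")
```

Usage: feed the curl output to parse_rpc_body, pass the result to plan_id_from, then POST tool_call("submit_plan", {"plan_id": plan_id, "confirm": True}, request_id=2) with the same headers to approve.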