Skip to content

Latest commit

 

History

History
253 lines (172 loc) · 18 KB

File metadata and controls

253 lines (172 loc) · 18 KB

Verification by Explanation: Human Auditing Through Structured Output

Summary

A recurring pattern in these transcripts is the user asking Claude Code to produce structured explanations, tables, comparisons, and verification reports that make it easy for a human to spot-check correctness. Rather than blindly trusting agent output, the user forces Claude to re-present its work in formats optimized for human review: verification tables, pass/fail matrices, scorecard comparisons, claim-by-claim audits, and plain-English summaries. This pattern appears across three main categories:

  1. Claim verification tables -- The user pastes a description and asks Claude to verify each claim against the codebase, producing a table with verdicts.
  2. Structured comparison outputs -- The user asks for scorecards, side-by-side tables, and migration matrices that compress complex information into scannable formats.
  3. Agent-produced audit reports -- The user asks agents to produce structured findings (pass/fail, severity, numbered bugs) that serve as human-reviewable artifacts.

This is distinct from agent-based auditing (where one agent checks another). Here, the human is the auditor, and the structured output is the artifact that makes human auditing feasible.


Table of All Examples Found

# User Request (summarized) Output Format Caught Error / Confirmed Correctness? Location
1 "verify this is true" -- paste of system setup description Claim-by-verdict table (Correct / Wrong / Not confirmed) Yes -- caught 2 factual errors in the description cc-transcript.md ~line 1034
2 "give me a report in table format" -- deployment research Summary table (Approach 1 vs 2 vs 3) with 7 comparison dimensions Confirmed correctness of recommendation; enabled decision-making cc-transcript.md ~line 1081
3 TTS voice bake-off scorecard Side-by-side table (Ryan vs VoiceDesign) with audio length, gen time, RTF, size Enabled user to compare and choose; user said "wow both are really good" cc-transcript.md ~line 672
4 Agent script organization debate Agree/Disagree table + Final Migration Table Caught redundancy between run/up and scripts/start.sh; confirmed both agents agreed on core issues cc-transcript.md ~line 2359
5 VoCA requirements list 72 requirements (62 MUST, 10 SHOULD) in numbered sections Served as the gate for what must work before demo ships; referenced throughout QA cc-transcript.md ~line 1931
6 QA Worker bug reports Numbered bugs with severity (Critical/Major/Minor), steps, screenshots Caught Riley "Needs Review" instead of "Denied" (critical scoring bug) cc-transcript.md ~line 2148
7 QA Admin bug reports Per-test pass/fail table (3 pass, 9 fail) + numbered bugs Distinguished real bugs from Chrome session crash noise; enabled targeted fixes cc-transcript.md ~line 2098
8 Backend scoring analysis Root cause report with 4 numbered findings Identified 3 data contract mismatches causing wrong angel scores cc-transcript.md ~line 1961
9 Deployment gap comparison "Local run/up vs Remote deploy-2-concept" step-by-step comparison table Made the missing seed step visually obvious cc-transcript-filtered.md ~line 597
10 Researcher-auditor-summarizer pipeline for "what is a hard gate?" 1-page report -> audit (23/27 claims correct, 4 errors found) -> 2-paragraph summary Caught 3 off-by-one line numbers and 1 date discrepancy before answer was shared externally evidence/evidence_qa3.md, evidence/evidence_agent1.md
11 Slack research summary Chronological table with Date/Who/What columns Confirmed no one had discussed the missing seed data on Slack -- validating the deployment fix was needed cc-transcript.md ~line 1930
12 Service health & port mapping Table of Service/Port/Endpoint/Type Made health check gaps visible (Orchestrator: NO HEALTH ENDPOINT) cc-transcript-filtered.md ~line 650
13 "ok fine do your best and create a professional short bright speech that's short but let me audit the quality" Audio files for A/B comparison User audited TTS quality by listening; rejected Sesame CSM as "downsampled", chose Qwen3 cc-transcript.md ~line 651
14 Agent status dashboard during QA Status table (Agent / Role / Status: Done/Running) Let user see which agents were complete vs still working; enabled prioritization cc-transcript.md ~line 1499
15 Hard gate Q&A audit chain Auditor verified 27 claims: "23/27 fully accurate, 3 off-by-one line numbers, 1 minor date field discrepancy" Caught citation errors before answer was shared with external listener evidence/evidence_agent1.md ~line 11

Detailed Examples

Example 1: Claim Verification Table ("verify this is true")

What the user asked: The user pasted a multi-paragraph system setup description (ports, services, startup procedures, TTS details) and asked Claude to "verify this is true."

"verify this is true: --- System Setup for Angelic Intelligence v0.5. The platform runs 4 services: API (port 8000), Gateway (port 8001), Orchestrator (port 8002), and the UI (Vite dev server on port 3000)..."

What Claude produced: A claim-by-verdict table:

Claim Verdict
4 services: API (8000), Gateway (8001), Orchestrator (8002), UI (3000) Correct
UI proxies /api/* to port 8000 Correct
Docker Compose runs MinIO + PostgreSQL Correct (also Redis + LiteLLM)
Python services started individually via python src/*/main.py Wrong
"Fast" VPE mode skips debate transcripts Not confirmed

Did it catch an error? Yes -- two factual errors in the user's own description:

  1. Python services are NOT started individually on the host; they run via uvicorn inside Docker containers.
  2. The "fast VPE mode" claim could not be verified in the codebase.

Location: /Users/husainal-mohssen/src/effective_claude_code/cc-transcript.md, lines 1034-1078


Example 2: Deployment Approach Comparison Table

What the user asked:

"startt 2 agents to look at this issue of how the system is setup and how it inittilzieds it'self and hwo to deploy it I want them indidependently resarch this and then audit each others work and then meet and debate waht the best 2 coures and action and give me a report in table format"

What Claude produced: Two independent agents each produced comprehensive reports. These were then synthesized into:

  • An Agreement/Disagreement Audit table (8 findings with Agree/N/A verdicts)
  • An Approach Comparison table (7 dimensions across 3 approaches)
  • A Summary Table with complexity, speed, freshness, repo size, debuggability, tested status, and risk ratings

Did it catch an error? The comparison table made a critical disagreement visible: Agent A assumed python was available on the remote host, while Agent B correctly identified that the remote host has no Python environment. The table format made this easy to spot: "Agent B is more careful here."

Location: /Users/husainal-mohssen/src/effective_claude_code/cc-transcript.md, lines 1081-1866


Example 3: TTS Voice Bake-Off Scorecard

What the user asked: The user wanted to compare two TTS voice approaches (CustomVoice with Ryan vs. VoiceDesign 1.7B) and asked Claude to play them back-to-back.

What Claude produced: A side-by-side scorecard:

Ryan (CustomVoice 0.6B) VoiceDesign (1.7B)
Audio 19.8s 18.2s
Gen time 19.5s 21.8s
RTF ~1x real-time ~0.83x
Size 930KB 851KB
Voice Preset "Ryan" + instruct Described voice

Did it help? Yes -- the user was able to make a quality decision: "wow both are really good." The structured comparison combined with audio playback enabled informed A/B testing. This ultimately led through several iterations (male vs female, professional vs sultry) to a final voice selection.

Location: /Users/husainal-mohssen/src/effective_claude_code/cc-transcript.md, lines 672-682


Example 4: Agent Script Organization Debate with Migration Table

What the user asked:

"start 2 agents to look through the whole code base and examine the different folders that have deployment scripts i need them to debate with each other to come up with a simplified and clarified single directory"

What Claude produced:

  • Each agent produced a full inventory table (current files, purposes, dependencies)
  • A "Where They Disagree" comparison table
  • A Final Migration Table (13 rows: Current Location / Action / New Location)
  • A resulting directory structure diagram

Did it catch an error? The structured tables made two problems immediately visible that would have been hard to see in prose:

  1. run/up and scripts/start.sh doing the same thing
  2. BOOTSTRAP.md duplicating what the scripts already encode

The user then dug deeper based on the tables, asking "what's the difference between seed_demo_data.py and seed_employee_profiles.py?" -- which revealed duplicated employee data across two files.

Location: /Users/husainal-mohssen/src/effective_claude_code/cc-transcript.md, lines 2359-2457


Example 5: VoCA Non-Negotiable Requirements List

What the user asked: The user's QA setup instructions specified: "VoCA must produce a non-negotiable requirements list before QA begins -- things that ARE the demo."

What Claude produced: 72 requirements across 6 sections (62 MUST, 10 SHOULD), covering:

  • End-to-end user flows (27 requirements)
  • Demo scenario correctness (18 requirements)
  • Backend pipeline correctness (13 requirements)
  • UI rendering correctness (12 requirements)
  • Demo infrastructure (7 requirements)
  • Demo narrative beats (12 requirements)

Did it help? This list became the reference standard for the entire QA process. Every bug was triaged against it (major = on the list, minor = park it). It made the scope of "what must work" explicit and scannable, preventing scope creep. The Wisdom agent used it to kill low-priority work.

Location: /Users/husainal-mohssen/src/effective_claude_code/cc-transcript.md, lines 1931-1949


Example 6: QA Bug Reports with Structured Severity

What the user asked: QA agents were instructed to "submit bug reports with steps, screenshots, severity."

What Claude produced: Structured per-test pass/fail tables:

# Test Result Notes
01 Dashboard Initial Load FAIL 7 console 404 errors for avatar images
02 Run List Content PASS Badge variants correct
03 Dashboard Filters FAIL Search for "alex.santos" returns "Unavailable"
... ... ... ...

Plus numbered bugs with severity: BUG-001 (Major), BUG-002 (Major), etc.

Did it catch errors? Yes -- the structured format made it immediately clear that 7 of 9 admin test failures were from a single Chrome session crash (not real bugs), while only 2 were actual application bugs. This prevented wasting time investigating phantom failures. The Worker QA also caught Riley's "Needs Review" instead of "Denied" as a critical scoring bug.

Location: /Users/husainal-mohssen/src/effective_claude_code/cc-transcript.md, lines 2098-2197


Example 7: Researcher-Auditor-Summarizer Pipeline for External Sharing

What the user asked:

"DON'T GUESS START AN AGENT TO LOOK THROUGH THE TEXT OF THE SCRIPT AND THEN DO A DEEP DIVE IN THE CODE TO PRODUCE A WRITTEN 1 PAGE REPORT... AFTER THAT HAVE A DIFFERENT AUDIT AGENT READ THE REPORT AND VERIFY THE FACTS... FINALLY HAVE A SUMMARIZER GIVE A 1-2 PARAGRAPH CLEAR ANSWER"

What Claude produced:

  1. Researcher: full-page report with code references
  2. Auditor: "23/27 claims fully accurate, 3 off-by-one line numbers, 1 minor date field discrepancy. All substantive facts verified."
  3. Summarizer: 2-paragraph plain-English answer suitable for sharing with an external listener

Did it catch errors? Yes -- the auditor caught 4 factual errors (15% of claims) that would have been presented as truth without the verification step. The errors were citation-level (wrong line numbers, date discrepancy) rather than substantive, but for an answer being shared externally, citation accuracy matters.

The user trusted this enough to immediately request the same workflow again: "do the same work flow as above researcher->auditor-> summarizer ALL are agents not yu."

Location: /Users/husainal-mohssen/src/effective_claude_code/evidence/evidence_qa3.md, /Users/husainal-mohssen/src/effective_claude_code/evidence/evidence_agent1.md


Example 8: Audio Quality Audit

What the user asked:

"ok fine do your best and create a professional short bright speech that's short but let me audit the quality"

What Claude produced: Multiple audio files played back-to-back with metadata (duration, generation time, voice parameters) so the user could listen and judge.

Did it catch errors? Yes -- the user's own ears caught quality problems that no automated test would find:

  • Sesame CSM: "I'M confused the audio is not one consistent voice wtf???" (caught inconsistent voice across chunks)
  • Sesame CSM v2: "this is stupid it's downsampled" (caught fundamental model quality limitation)
  • Pronunciation test: "i'm surprised how bad the audio in the fact that it pronounces things wrong" (caught pronunciation failures)

Each of these human audit rounds led to pivoting to better approaches. The structured A/B comparison format made it possible to iterate rapidly through options.

Location: /Users/husainal-mohssen/src/effective_claude_code/cc-transcript.md, lines 497-600, 651-800


Variations of the Pattern

The "verification by explanation" pattern appears in several distinct forms:

1. Claim-by-Claim Verification

The user pastes a description and asks Claude to check each statement against reality. Output is a table with verdicts (Correct / Wrong / Not confirmed). This is the most direct form of human auditing -- the user already has beliefs and wants them validated or corrected.

2. Comparison Matrices

The user asks for side-by-side comparisons of approaches, agents' recommendations, or before/after states. The tabular format compresses information that would be hard to absorb in prose. This pattern is especially effective when choosing between alternatives (deployment approaches, TTS models, directory structures).

3. Pass/Fail Test Reporting

QA agents produce structured test results with per-test status, numbered bugs, and severity levels. This format enables the human to quickly distinguish signal from noise (e.g., "7 of 9 failures are Chrome crashes, not real bugs"). The structure also enables delegation -- specific fixers can be assigned to specific numbered bugs.

4. Audit Pipeline with Quantified Accuracy

The researcher-auditor-summarizer pattern produces a report with an explicit accuracy metric (23/27 claims verified). This gives the human a confidence calibration -- not just "the answer is X" but "we checked 27 facts and 4 had errors, all minor."

5. Audio/Visual A/B Testing

For outputs that cannot be verified by reading (audio quality, UI appearance), the pattern involves generating multiple variants with metadata and presenting them for human sensory evaluation. The human's ears/eyes are the auditor.

6. Status Dashboards

During long-running multi-agent workflows, the orchestrator produces status tables showing which agents are done, which are running, and what they found. This makes the workflow's state legible to the human, preventing the "are you stuck?" problem.

7. Verification by Graphing Understanding

Ask the agent to draw a diagram of how it thinks something works -- a flowchart, architecture diagram, data flow, or dependency graph. This is not decorative. If the agent's diagram doesn't match your mental model, the agent has the wrong understanding and will produce wrong code. This is one of the fastest ways to catch a misunderstanding BEFORE it becomes a bug. Examples:

  • "Draw me the data flow from API request to database write"
  • "Show me a diagram of how these three services interact"
  • "Graph the dependency chain between these modules"

The diagram IS the verification. If it's wrong, the agent's model is wrong. Fix the understanding before you fix the code.

Conclusion

The "verification by explanation" pattern is one of the most consistently used human auditing techniques in these transcripts. It appears in at least 15 distinct instances across all sessions. The pattern works because it shifts the cognitive burden: instead of the human reading raw code or logs, Claude presents its work in a format optimized for human spot-checking -- tables with clear verdicts, numbered items with severity, side-by-side comparisons, and quantified accuracy metrics.

The pattern was most effective when it caught real errors: the wrong system setup description, the missing seed step in deployment, the Chrome crash masking real bugs, and the citation errors in the hard gate explanation. In each case, the structured format made the error visible in a way that prose would have obscured.

The pattern was least effective when it produced clean results -- several cross-audits returned "PASS WITH WARNINGS" with only minor issues, raising questions about whether auditors were truly critical. The evidence reports note this as a limitation: "the auditor found only minor issues in the canonical case."

The most sophisticated variant was the researcher-auditor-summarizer pipeline, where the user explicitly designed a multi-agent workflow whose sole purpose was to produce a human-verifiable answer. The user's emphatic "DON'T GUESS" instruction and insistence that "ALL are agents not yu" suggests accumulated frustration with unverified answers -- and the structured pipeline was the remedy.

Ultimately, this pattern represents a pragmatic recognition that AI agents will make errors, and the human's role is to efficiently verify output rather than produce it. Structured formats (tables, scorecards, pass/fail matrices, numbered bugs) are the technology that makes human verification scale to the volume of agent output.