docs(skills): make /status and /end cover the whole session #1857
Long sessions get compacted; reporting/auditing from in-context only under-represents earlier work. Add a "whole-session coverage" step to both skills: when context is partial (compaction boundary, multi-day, or user flags missed work), reconstruct the full arc from the transcript JSONL via a cheap user-message skeleton before reporting. For /end this is a new Phase 0 that feeds the remaining-work audit (Phase 1) and per-turn lifecycle compliance (Phase 2.1), so trackable work and storage gaps from before a compaction boundary are not missed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
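A minimal sketch of the user-message skeleton pass, for illustration only. It assumes each transcript line is a JSON object with a top-level `type` of `"user"` and a `message.content` that is either a string or a list of typed blocks. That record shape, the `INJECTED` pattern, and the name `user_skeleton` are assumptions made for this example and are not taken from the skill files.

```python
import json
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical pattern for harness-injected text that the user did not type.
INJECTED = re.compile(r"^\s*<(system-reminder|command-name|local-command-stdout)\b")


def user_skeleton(lines: Iterable[str], max_chars: int = 200) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, truncated_text) for each genuine user message."""
    for number, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip a corrupt line instead of aborting the pass
        if not isinstance(record, dict) or record.get("type") != "user":
            continue
        message = record.get("message")
        content = message.get("content") if isinstance(message, dict) else None
        if isinstance(content, str):
            texts = [content]
        elif isinstance(content, list):
            # Keep only text blocks; tool_result blocks are tool output, not user input.
            texts = [
                block.get("text", "")
                for block in content
                if isinstance(block, dict) and block.get("type") == "text"
            ]
        else:
            continue
        text = " ".join(t.strip() for t in texts if t.strip())
        if not text or INJECTED.match(text):
            continue
        yield number, text[:max_chars]
```

Reading line by line and truncating each message keeps the pass cheap even when the transcript is several MB, since no full record is retained after it is inspected.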
🤖 Lanius (Ateles swarm, PR gate inheritance)

No parent issue reference found in PR body ( Legacy-issue protocol engaged: This PR appears to be a standalone documentation/skills update with no linked Neotoma issue entity. Per workflow, gates are initialized retroactively:

Pre-impl gates clear. Assigning Vanellus for Phase 4 review.

GATE_INHERITANCE: clear
review:pm

[BLOCKING] acceptance_criteria: This is a docs-only PR that specifies whole-session coverage behavior, but the test plan checkboxes assume implementation exists. The PR updates SKILL.md files but not the actual skill implementations (the Python/TypeScript code that runs when users invoke

Recommendation: Either (a) split into docs + implementation: merge this as a design spec with test plan marked as pending until the skill code ships, or (b) include the transcript extraction logic and Phase 0 implementation in this PR so the test plan is executable. As-is, the test checkboxes will not pass when a reviewer runs them.

[NON-BLOCKING] scope/consistency: The tail-only fallback caveat differs between skills.

[NON-BLOCKING] artifact provenance: When

Verdict: BLOCKED. The test plan cannot execute against documentation only. Clarify split: is this a spec for future implementation (mark tests as pending), or does the actual skill code ship in this PR?
review:ux

UX review of whole-session coverage for

Task & flow ✓

User goal is clear: audit/report full session even when context is compacted. The revised flow (detect partial context → reconstruct from transcript → report/audit full arc) matches real needs.

Blocking findings

[BLOCKING] Error recovery: transcript not found, message lacks actionability

Recommendation: Add a specified message format in Phase 0, e.g.: See

[BLOCKING] Transcript location heuristic is fragile

For a multi-day session, this heuristic will reliably pick the wrong file.

Recommendation: Make the session-id discoverable from the harness context (env var, compaction summary metadata, or explicit --session flag). If heuristics are unavoidable, document the fallback clearly in Phase 0, e.g.: See

[BLOCKING] Message filtering heuristic is too vague to implement consistently

This ambiguity will cause implementation variance across skill invocations.

Recommendation: Specify the filtering precisely, with a concrete example: See

Non-blocking observations

[NON-BLOCKING] Performance SLA missing

[NON-BLOCKING] Empty-state handling

Verdict

The core flow is sound and addresses a real gap in session coverage. The three blocking findings are implementation details (error messages, path discovery, filtering heuristics) that prevent silent failure and enable reliable invocation. Addressing these before merge will make the skills robust for multi-day sessions.

📎 Neotoma: neotoma#1857
review:qa SIGNED_OFF

Summary

PR #1857 documents a capability enhancement to two user-level skills (

Functional Surface

The PR modifies the skill definition documents (

Both skills define agent-facing surface: the instructions are interpreted and executed by Claude Code, not by a Neotoma API or automated test harness. This means:

No Functional Surface → No Eval Required

Per Classification: Quality Gate:

Constraints and Immutability

The change respects Neotoma architectural constraints:

Documentation Quality

Risk Assessment

Low risk. This is opt-in agent guidance that activates only when sessions are long enough to trigger compaction. It does not affect data schemas, authentication, or compliance surfaces. Agents following these instructions more rigorously (reconstructing full-session context before reporting/auditing) is a quality improvement, not a regression vector.

Test plan:
🤖 Vanellus (Ateles swarm, PR steward)

Panel Aggregation: PR #1857

Per-Lens Verdicts

Blocking Findings (4 total)

[BLOCKING] PM: Test Plan Execution Gap

Pavo identifies: Test checkboxes assume implementation exists, but only documentation is updated. The skills' actual Python/TypeScript code has not changed, so when reviewers run the test plan, it will fail because the behavior described in the docs is not implemented.

Action required: Either (a) mark test plan as pending until skill code ships in a follow-up PR, or (b) include the transcript extraction logic and Phase 0 implementation in this PR so tests are executable.

[BLOCKING] UX: Error Recovery Message Actionability

Accipiter identifies: When transcript JSONL is missing or unreadable, the current docs say only "state in the final report that coverage may be tail-only." Users need:

Action required: Add a specified message format in Phase 0, e.g.: See

[BLOCKING] UX: Transcript Location Heuristic, Fragility on Concurrent Sessions

Accipiter identifies: The instruction "pick the most recently modified .jsonl" breaks when:

For multi-day sessions, this heuristic will reliably pick the wrong transcript file.

Action required: Make session-id discoverable from harness context (env var, compaction summary metadata, or explicit See

[BLOCKING] UX: Message Filtering Heuristic, Too Vague for Consistent Implementation

Accipiter identifies: The spec "pull genuine user messages (filter out

This ambiguity will cause implementation variance.

Action required: Specify precisely with a concrete example: See

Non-Blocking Observations

Gate Inheritance Status

Pre-impl gates (pm, ux, arch) are all marked

Merge Decision

Cannot merge. Four [BLOCKING] findings across pm and ux lenses require author resolution:

Route back to Gryllus (impl owner) with this aggregation. Once the author addresses these four blockers, re-run review (

📎 GitHub: markmhendrickson/neotoma#1857
…aveats

Responds to PR #1857 ux/pm review (Accipiter, Pavo):

- Locate transcript by identity, not recency: prefer the path named in the compaction summary, then the harness session id, then a single glob match; refuse to guess when multiple .jsonl files exist (a concurrent/prior-day session would make "most recent" pick the wrong file). Multi-day sessions hit this reliably.
- Specify the skeleton-extraction filter concretely (line-by-line, explicit role + regex rules; tool_result is the content type, not user-pasted logs) so invocations are repeatable.
- Replace the vague "say so in one line" fallback with a prefixed, reason-bearing caveat line, including the ambiguous/zero-message cases.
- /end: note transcript is a read aid, not an audited source (Phase 2.3 already covers source-linking for ingested files).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
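The resolution order in the first bullet could be sketched roughly as follows. This is an illustration under stated assumptions, not the skill's actual logic: `resolve_transcript`, `TranscriptUnresolved`, and the parameter names are hypothetical, and how the compaction summary path or session id reaches the skill is left to the caller.

```python
from pathlib import Path
from typing import Optional


class TranscriptUnresolved(Exception):
    """Raised when no single transcript can be identified with confidence."""


def resolve_transcript(
    project_dir: Path,
    summary_path: Optional[str] = None,  # path named in the compaction summary, if any
    session_id: Optional[str] = None,    # harness session id, if known
) -> Path:
    # 1. Prefer the path named in the compaction summary.
    if summary_path:
        candidate = Path(summary_path).expanduser()
        if candidate.is_file():
            return candidate
    # 2. Otherwise use the session id to build the file name.
    if session_id:
        candidate = project_dir / f"{session_id}.jsonl"
        if candidate.is_file():
            return candidate
    # 3. Otherwise accept a glob match only when it is unambiguous.
    candidates = sorted(project_dir.glob("*.jsonl"))
    if len(candidates) == 1:
        return candidates[0]
    # Refuse to guess: recency could select a concurrent or prior-day session.
    if not candidates:
        reason = "no transcript found"
    else:
        reason = f"{len(candidates)} candidate transcripts, refusing to pick by recency"
    raise TranscriptUnresolved(reason)
```

A caller would catch `TranscriptUnresolved` and put its reason into the tail-only caveat line rather than silently continuing.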
Thanks, addressed in

Accipiter (ux): all three BLOCKING, fixed:

Non-blocking (perf SLA, empty-state): empty-state is now folded into the tail-only fallback; an explicit ms SLA felt like over-spec for a prompt-driven skill, so I left it as "cheap". Open to adding a soft bound if you'd prefer.

Pavo (pm): BLOCKING, respectfully a premise correction:

Phoenicurus (qa) signed off. Re-review welcome against the new head.
review:pm
Follow-up refinement (d6fb97a) strengthens robustness:
PM gate signed off. Ready to merge.
review:ux

User Flow Analysis

/status whole-session coverage

What users see: New "Whole-session coverage (read the transcript when context is partial)" section describes a complex, multi-step process for reading transcripts. The workflow is technically correct but introduces significant cognitive overhead and ambiguity at the discovery/invocation point.

Issue: Clarity of when this triggers

The skill describes treating context as "partial" when:

But no visible signal tells a user whether

A user who doesn't read the docs carefully may run

Recommendation: Add a status line to the report output itself when transcript reconstruction occurs. Examples:

This makes the coverage scope visible to the user without reading the skill docs, and signals success/failure loudly.

/end Phase 0 complexity

The Phase 0 transcript location resolution is well-specified for implementation but introduces three separate resolution paths that could realistically fail or be ambiguous. The caveat line format is good, but the three-step fallback creates silent selection risk.

Issue: Path resolution disambiguation

Step 1.3 warns: "Do not silently pick 'the most recently modified' when several exist: a concurrent or prior-day session in the same project would make recency select the wrong file." But the skill doesn't offer a user-facing way to disambiguate if multiple candidates exist. The skill's fallback behavior is:

This is safe but silent. A user in a multi-session project who runs

Recommendation: When Phase 0 detects multiple candidates and can't disambiguate:

Error states and caveat language

Both skills use a caveat prefix:

Better wording (more user-actionable):

Discoverability & Naming

Both

But users won't know this feature exists unless they:

Recommendation: Add a visible indication in the report when whole-session reconstruction occurs, so users learn the feature exists and understand what

Acceptance Checklist

Verdict

COMMENT

The implementation is technically sound and carefully specified. The transcript-reading logic is deterministic and respects safety constraints. However, UX gaps prevent users from understanding what happened:

These are non-blocking, but I recommend addressing them before release so users understand the feature and can recover if things go wrong.
🤖 Vanellus (Ateles swarm, PR steward)

Panel Aggregation: PR #1857

Per-Lens Verdicts

Resolution Summary

Round 1 (d6fb97a): Author addressed all blocking findings:

Round 2 (Accipiter): Non-blocking refinements identified for future work:

Gate Inheritance ✓

Pre-impl gates (pm, ux, arch) all or . Lanius cleared in initial pass. ✓

CI Status ✓

Merge Decision

All blocking findings resolved. Four lenses ([pm, ux, qa] + CI) are clear:

Verdict: APPROVE

This PR is merge-ready. The blocking findings from round 1 are fully addressed. Round 2 observations are enhancement suggestions suitable for follow-up work, not blockers. Non-blocking items can be tracked as future UX polish (visibility signals, interactive disambiguation, caveat wording refinement).

📎 GitHub: markmhendrickson/neotoma#1857
Summary
Long sessions get compacted: the active context window may hold only a recent slice (a pre-compaction summary plus the last few turns). Both /status and /end previously scanned "the current conversation," so on a compacted session they silently under-reported, and /end could fail to file trackable work and miss storage gaps from before the boundary.

This adds a whole-session coverage step to both skills: when context is partial (compaction boundary present, multi-day / many-turn session, or the user flags missed work), reconstruct the full arc from the transcript JSONL (~/.claude/projects/<slug>/<session-id>.jsonl) via a cheap user-message skeleton (not a full multi-MB read) before reporting/auditing.

/status: new "Whole-session coverage" section + matching constraint. Reading the transcript is explicitly allowed (it's a read; the skill stays read-only).

/end: new Phase 0 that feeds the skeleton into Phase 1 (remaining-work audit) and Phase 2.1 (per-turn lifecycle compliance), plus a matching constraint and a tail-only caveat for the final report when the transcript is unreadable.

Motivation: in a real multi-day session, a /status run reported only the last day's work; reconstructing from the transcript recovered the full Jun 2–10 arc.

Test plan

/status reconstructs and reports the full arc, not just the tail

/end Phase 0 enumerates pre-compaction turns for the storage audit

🤖 Generated with Claude Code