
Feat/v0.6 self verifying #2

Open
hwfengcs wants to merge 39 commits into main from feat/v0.6-self-verifying


hwfengcs commented May 3, 2026

Owner

Summary

Testing

  • pytest
  • ruff check SDYJ_Agents tests

Notes

  • No real API keys or generated outputs/ files are committed.

fhw and others added 30 commits May 2, 2026 18:14
Wire token usage and USD cost estimation end-to-end so every LLM call
is attributable. PRICING_TABLE in SDYJ_Agents/utils/cost.py is a
hardcoded {(provider, model): (input_per_million, output_per_million)}
dict — explicit, auditable, no third-party services. Unknown
(provider, model) pairs return None rather than 0.00, so the CLI can
render an honest "—" instead of a misleading zero.
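
As a rough illustration of the lookup described above, here is a minimal sketch. The table entry, the prices, and the function names `estimate_cost_usd` and `render_cost` are hypothetical placeholders, not the contents of SDYJ_Agents/utils/cost.py.

```python
from typing import Optional

# Hypothetical entry: (provider, model) -> (input USD per million, output USD per million).
# The real PRICING_TABLE values are not shown in this PR page.
PRICING_TABLE: dict[tuple[str, str], tuple[float, float]] = {
    ("example-provider", "example-model"): (1.00, 2.00),
}


def estimate_cost_usd(
    provider: str, model: str, prompt_tokens: int, completion_tokens: int
) -> Optional[float]:
    """Return the estimated cost in USD, or None when the pair is unpriced."""
    rates = PRICING_TABLE.get((provider, model))
    if rates is None:
        return None  # unknown pair: do not pretend the call was free
    input_rate, output_rate = rates
    return (prompt_tokens * input_rate + completion_tokens * output_rate) / 1_000_000


def render_cost(cost: Optional[float]) -> str:
    """Render a cost cell; unpriced calls show a dash instead of 0.00."""
    return "\u2014" if cost is None else f"${cost:.4f}"
```

Returning None keeps "no price known" distinct from "price is zero", which is what lets the CLI show a dash for unpriced calls.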

Provider wrappers (OpenAI, Claude, DeepSeek, Gemini) now expose
last_usage with the raw provider dict; normalize_usage reconciles the
three different naming conventions. InstrumentedLLM writes
prompt_tokens_actual, completion_tokens_actual, and cost_usd into each
llm_calls entry. finalize_trace rolls them up into trace.metrics, and
summarize_trace surfaces them so diff-runs reports cost deltas.
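
A sketch of how usage normalization could work, under the assumption that the three naming conventions are the OpenAI-style `prompt_tokens`/`completion_tokens`, the Anthropic-style `input_tokens`/`output_tokens`, and the Gemini-style `prompt_token_count`/`candidates_token_count`. The commit message does not list the keys, so the key names and the signature below are illustrative rather than the project's actual `normalize_usage`.

```python
from typing import Any, Mapping, Optional

# Assumed key spellings for the three conventions; not confirmed by the commit message.
_PROMPT_KEYS = ("prompt_tokens", "input_tokens", "prompt_token_count")
_COMPLETION_KEYS = ("completion_tokens", "output_tokens", "candidates_token_count")


def _first_int(raw: Mapping[str, Any], keys: tuple[str, ...]) -> Optional[int]:
    """Return the first integer value found under any of the candidate keys."""
    for key in keys:
        value = raw.get(key)
        if isinstance(value, int) and not isinstance(value, bool):
            return value
    return None


def normalize_usage(raw: Optional[Mapping[str, Any]]) -> dict[str, Optional[int]]:
    """Map a raw provider usage dict onto the two field names stored per llm_calls entry."""
    raw = raw or {}
    return {
        "prompt_tokens_actual": _first_int(raw, _PROMPT_KEYS),
        "completion_tokens_actual": _first_int(raw, _COMPLETION_KEYS),
    }
```

Missing counts stay None rather than 0, mirroring the way unknown prices are handled.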

The CLI inspect-run gains a per-call LLM Calls table with a tip that
links unpriced cells back to PRICING_TABLE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add a two-track publish workflow that uses OIDC-based Trusted
Publishers instead of long-lived API tokens:
- workflow_dispatch with target=testpypi for alpha/beta cuts.
- GitHub Release event for stable PyPI releases.

Both jobs gate on a build + twine check step. Bump pyproject.toml to
0.6.0a1 and split optional dependencies into [web], [mcp],
[benchmarks], and [all] so the core install stays slim while opt-in
features pull their deps on demand. Add docs/release-process.md so
the one-time PyPI Trusted Publisher setup is documented for future
maintainers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Position SDYJ as a self-verifying, replayable, benchmarkable
multi-agent research framework. Both README files now lead with the
operations-not-prompting thesis, expose new badges (PyPI, Stars), and
include a comparison table against GPT Researcher, AutoGen, and the
LangGraph examples that highlights trace, replay, evidence IDs, cost,
and benchmark gates. Each "🚧 v0.6" item is explicitly marked as in
progress so visitors are not misled.

Add docs/release-notes/v0.6.md as the running ledger of what has
shipped (cost tracking, provider usage capture, packaging extras,
PyPI pipeline, README) and what is still in flight (verifier loop,
reflexive researcher, parallel tool calls, public benchmarks, real
MCP, web UI). Update ROADMAP.md to match and reserve a v0.7 slot for
extensibility work that previously lived under v0.6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Ship a minimum-viable Web UI that reuses the existing Coordinator,
Planner, Researcher, Rapporteur, ResearchWorkflow, and InstrumentedLLM
pipeline — the UI is intentionally thin so anything we add benefits
the CLI path for free. Five tabs surface the run output: Report,
Plan, Evidence cards, Trace timeline (with tool calls), and an LLM
cost table fed by trace.metrics.

Components in SDYJ_Agents/web/components.py take plain dicts so they
are easy to test or swap. The MVP runs with auto_approve=True; a
proper human-in-the-loop approval gate is a v0.6 follow-up because it
needs LangGraph state to survive a Streamlit rerender.

streamlit_app.py at the repo root is the entry point Hugging Face
Spaces expects, and docs/huggingface-spaces-deploy.md walks through
the one-time Space setup including the YAML frontmatter, secrets,
and a local dry-run check. tests/test_web_smoke.py uses Streamlit's
built-in AppTest to catch import-time crashes and verify the example
chip wires through to session_state.

Add streamlit to the [web] extras so the core install stays free of
UI dependencies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Publish the launch blog post for the v0.6 alpha in both English and
Chinese. The argument is that agent operations are an evaluation
problem rather than a prompting problem, organized around four
pillars — trace everything, deterministic replay, per-call cost in
the trace, and a verifier loop. The post grounds the claims in the
shipped 0.6.0a1 work (cost tracking, provider usage capture, PyPI
pipeline, Web UI) and is honest about what is still in flight.

Both files target SEO around self-verifying agents, agent
observability, and replayable agents, and they link readers back to
the repo, the v0.6 release notes, and ROADMAP.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add the Verifier — the v0.6 differentiator. It re-reads the generated
report against the collected E1/E2/... evidence and grades it on four
dimensions (claim_evidence_alignment, citation_completeness,
factual_consistency, coverage). When overall quality dips below 0.75
or alignment falls below 0.65, the workflow loops back to the
rapporteur with concrete revision hints, capped by max_revisions
(default 2).
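
The revision rule above can be sketched as two small predicates. The thresholds and the default cap come from this commit message; the function names and argument shapes are hypothetical, and strict less-than comparisons are assumed because the text says "below".

```python
QUALITY_FLOOR = 0.75      # overall quality threshold from the commit message
ALIGNMENT_FLOOR = 0.65    # claim_evidence_alignment threshold from the commit message
DEFAULT_MAX_REVISIONS = 2


def needs_revision(overall_quality: float, alignment: float) -> bool:
    """True when either score falls below its floor."""
    return overall_quality < QUALITY_FLOOR or alignment < ALIGNMENT_FLOOR


def should_loop_back(
    overall_quality: float,
    alignment: float,
    revision_count: int,
    max_revisions: int = DEFAULT_MAX_REVISIONS,
) -> bool:
    """True when a revision is warranted and the revision budget is not exhausted."""
    return needs_revision(overall_quality, alignment) and revision_count < max_revisions
```

Note that a report with a high overall score can still be sent back when alignment alone is weak, and that an exhausted budget ends the loop regardless of the scores.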

Design choices worth flagging:

- LangGraph persists state mutations made inside node functions but
  *discards* mutations made inside conditional edge functions. To stop
  the critique-revise loop from running away, revision_count is bumped
  inside verifier_node and the conditional edge only reads. The edge
  also reads a transient _pending_route_to_revise flag set by the node,
  so the cap is enforced exactly once per pass.
- Verifier.verify always returns a well-shaped dict — even when the
  LLM call fails, the response is malformed, or the model is too
  lenient. Defaulting to should_revise=True on failure means a broken
  verifier never silently rubber-stamps a bad report.
- The Rapporteur's revise mode reuses the previous report and the
  hints; it does not re-run summarize / organize / synthesize. That
  keeps revisions cheap, predictable, and trace-diffable.
- Verifier metrics (overall_quality, should_revise, weakest_dimension,
  revision_count) are mirrored into trace.metrics and lifted into
  evaluation summaries, so diff-runs and CI gates can reason about
  them without reaching into raw events.
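
To make the first two design choices concrete, here is a small simulation that does not import LangGraph. It only imitates the contract described above: the node returns state updates that persist, while the routing function is read-only. The state keys `revision_count` and `_pending_route_to_revise` are taken from the text; the `verification` key, the route labels, and `run_loop` are assumptions made for this sketch.

```python
from typing import Any


def verifier_node(state: dict[str, Any]) -> dict[str, Any]:
    """Decide the route and bump the counter here, where updates persist."""
    verdict = state.get("verification") or {}
    count = state.get("revision_count", 0)
    max_revisions = state.get("max_revisions", 2)
    # A missing or broken verdict defaults to should_revise=True (fail closed).
    wants_revision = bool(verdict.get("should_revise", True))
    route = wants_revision and count < max_revisions
    update: dict[str, Any] = {"_pending_route_to_revise": route}
    if route:
        update["revision_count"] = count + 1
    return update


def route_after_verifier(state: dict[str, Any]) -> str:
    """Conditional edge: reads the flag set by the node and mutates nothing."""
    return "rapporteur" if state.get("_pending_route_to_revise") else "end"


def run_loop(state: dict[str, Any]) -> tuple[dict[str, Any], int]:
    """Drive node and edge until the edge routes to 'end'; returns state and pass count."""
    passes = 0
    while True:
        state = {**state, **verifier_node(state)}
        passes += 1
        if route_after_verifier(state) != "rapporteur":
            return state, passes
```

Even with a verdict that always asks for a revision, `run_loop` stops after the cap is reached, because the counter is incremented in the node rather than in the edge.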

Wire-up:

- ResearchWorkflow.run / .stream / .stream_interactive accept
  skip_verification and max_revisions; CLI exposes --no-verify and
  --max-revisions for `research`, plus --enable-verify and
  --max-revisions for `eval`/`benchmark`.
- The Web UI gains a "Self-verifying loop" toggle and a max-revisions
  slider in the sidebar.
- run_evaluation defaults to enable_verification=False so existing
  v0.5 benchmark gates keep working unchanged; flip it on for the
  v0.5-vs-v0.6 ablation in P2.6.
- Replay reads skip_verification from the source trace.config so a
  recorded run with verifier calls still replays in call order.

13 new tests in tests/test_verifier.py cover score coercion,
JSON parsing fallbacks, alignment-floor override, max-revisions cap,
history helpers, and an end-to-end revise loop. Existing 33 tests
still pass; ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add a one-shot reflection step inside Researcher.execute_task. After a
task's queries run, the result quality is scored: an empty batch, an
average relevance below 0.5, or an error rate at or above 0.5 triggers a
single LLM call that diagnoses the failure and proposes 1–2
materially-different replacement queries. The replacements are then
executed against the same sources, results are merged into state, and
the task is marked ``_reflected`` so a second adversarial failure
cannot induce another reflection on the same task.
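
The trigger conditions can be sketched as follows. The thresholds are the ones stated above; the `QueryResult` shape and the `should_reflect` name are hypothetical, and the sketch assumes that errored or unscored results are left out of the relevance average.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

RELEVANCE_FLOOR = 0.5   # reflect when average relevance is below this
ERROR_RATE_CEIL = 0.5   # reflect when the error rate is at or above this


@dataclass
class QueryResult:
    """Hypothetical per-query outcome used only for this sketch."""
    relevance: Optional[float] = None
    error: bool = False


def should_reflect(results: Sequence[QueryResult], already_reflected: bool = False) -> bool:
    """Return True when a task's results are poor enough to warrant one reflection."""
    if already_reflected:
        return False  # single-shot guard: never reflect twice on the same task
    if not results:
        return True   # empty batch
    error_rate = sum(1 for r in results if r.error) / len(results)
    if error_rate >= ERROR_RATE_CEIL:
        return True
    scores = [r.relevance for r in results if not r.error and r.relevance is not None]
    if not scores:
        return False  # assumption: no relevance signal means no relevance trigger
    return sum(scores) / len(scores) < RELEVANCE_FLOOR
```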

Reflection is controlled at the agent level via ``enable_reflection``
(on by default in v0.6). Disable it with the new --no-reflect flag on
the research command, or enable it with --enable-reflect on the eval
command (the eval default stays off so v0.5 benchmark gates keep
working unchanged).

Trace integration:
- Each reflection emits an ``event_type="reflection"`` event in
  trace.events with the original queries, failure stats, and the
  rewritten queries; failures are tagged ``status="error"``.
- ``trace.metrics.reflection_count`` increments only when a reflection
  produced usable rewrites, so the v0.5-vs-v0.6 ablation can show how
  often the feature actually fired.

Replay integration:
- ResearchWorkflow.run / .stream / .stream_interactive callers now
  persist ``enable_reflection`` into trace.config alongside the verifier
  toggles. replay.py reads it and instantiates the Researcher with the
  same setting, so recorded LLM call order lines up with the workflow
  path during a deterministic replay.

The Streamlit Web UI gains a "Reflexive Researcher (v0.6)" sidebar
toggle next to the verifier toggle.

15 new tests in tests/test_researcher_reflection.py cover the trigger
conditions, mixed-score relevance averaging, fenced-JSON parsing,
case-insensitive dedup of rewrites, the single-shot _reflected guard,
LLM-failure resilience, and the disable flag. 61 tests pass overall;
ruff clean.
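
For readers unfamiliar with the fenced-JSON and dedup cases the tests mention, the following is one possible shape for that parsing step. The `queries` key, the `parse_rewrites` name, and the choice to also drop rewrites that repeat an original query are assumptions for illustration, not the Researcher's actual code.

```python
import json
import re

# Matches an optional Markdown code fence (three backticks, optional "json" tag).
_FENCE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL | re.IGNORECASE)


def parse_rewrites(text: str, original_queries: list[str], limit: int = 2) -> list[str]:
    """Extract up to `limit` new queries from an LLM reply; return [] on any failure."""
    match = _FENCE.search(text)
    payload = match.group(1) if match else text
    try:
        data = json.loads(payload)
    except (json.JSONDecodeError, TypeError):
        return []  # malformed reply: no usable rewrites
    candidates = data.get("queries", []) if isinstance(data, dict) else data
    if not isinstance(candidates, list):
        return []
    seen = {q.strip().casefold() for q in original_queries}
    rewrites: list[str] = []
    for candidate in candidates:
        if not isinstance(candidate, str):
            continue
        key = candidate.strip().casefold()
        if not key or key in seen:
            continue  # skip blanks and case-insensitive duplicates
        seen.add(key)
        rewrites.append(candidate.strip())
        if len(rewrites) == limit:
            break
    return rewrites
```

Returning an empty list on any parse failure matches the resilience goal described above: a bad reflection reply simply produces no replacement queries.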

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>