Feat/v0.6 self verifying by hwfengcs · Pull Request #2 · hwfengcs/SDYJ_Multi_Agents

hwfengcs · 2026-05-03T03:24:33Z

Summary

Testing

pytest
ruff check SDYJ_Agents tests

Notes

No real API keys or generated outputs/ files are committed.

Wire token usage and USD cost estimation end-to-end so every LLM call is attributable.

PRICING_TABLE in SDYJ_Agents/utils/cost.py is a hardcoded {(provider, model): (input_per_million, output_per_million)} dict: explicit, auditable, and free of third-party services. Unknown (provider, model) pairs return None rather than 0.00, so the CLI can render an honest "—" instead of a misleading zero.

Provider wrappers (OpenAI, Claude, DeepSeek, Gemini) now expose last_usage with the raw provider dict; normalize_usage reconciles the three different naming conventions. InstrumentedLLM writes prompt_tokens_actual, completion_tokens_actual, and cost_usd into each llm_calls entry. finalize_trace rolls them up into trace.metrics, and summarize_trace surfaces them so diff-runs reports cost deltas.

The CLI inspect-run gains a per-call LLM Calls table with a tip that links unpriced cells back to PRICING_TABLE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
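The lookup described in this commit message can be sketched as follows. This is a minimal illustration, not the code in SDYJ_Agents/utils/cost.py: the prices are placeholders, the raw key aliases are assumptions about how the providers name their token counts, and `estimate_cost_usd` and `format_cost` are hypothetical helper names. Only `PRICING_TABLE` and `normalize_usage` are names taken from the commit message.

```python
from typing import Optional

# Placeholder entry in USD per million tokens; not real provider rates.
PRICING_TABLE: dict[tuple[str, str], tuple[float, float]] = {
    ("example-provider", "example-model"): (1.00, 2.00),
}

# Assumed aliases for the raw usage keys; the real wrappers may differ.
_INPUT_KEYS = ("prompt_tokens", "input_tokens", "prompt_token_count")
_OUTPUT_KEYS = ("completion_tokens", "output_tokens", "candidates_token_count")


def normalize_usage(raw: Optional[dict]) -> dict:
    """Map a raw provider usage dict onto one naming convention."""
    raw = raw or {}

    def first(keys: tuple[str, ...]) -> int:
        for key in keys:
            if raw.get(key) is not None:
                return int(raw[key])
        return 0

    return {
        "prompt_tokens": first(_INPUT_KEYS),
        "completion_tokens": first(_OUTPUT_KEYS),
    }


def estimate_cost_usd(provider: str, model: str, usage: dict) -> Optional[float]:
    """Return the estimated cost, or None when the pair has no price."""
    prices = PRICING_TABLE.get((provider, model))
    if prices is None:
        return None  # unknown pair: report "no price", never 0.00
    input_per_million, output_per_million = prices
    total = (
        usage["prompt_tokens"] * input_per_million
        + usage["completion_tokens"] * output_per_million
    )
    return total / 1_000_000


def format_cost(cost: Optional[float]) -> str:
    """Render an unpriced call as a dash instead of a misleading zero."""
    return "\u2014" if cost is None else f"${cost:.6f}"
```

Returning None for unpriced pairs keeps "free" and "unknown" distinguishable all the way to the rendering layer, which is the property the commit message calls out.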

Add a two-track publish workflow that uses OIDC-based Trusted Publishers instead of long-lived API tokens:

- workflow_dispatch with target=testpypi for alpha/beta cuts.
- GitHub Release event for stable PyPI releases.

Both jobs gate on a build + twine check step.

Bump pyproject.toml to 0.6.0a1 and split optional dependencies into [web], [mcp], [benchmarks], and [all] so the core install stays slim while opt-in features pull their deps on demand.

Add docs/release-process.md so the one-time PyPI Trusted Publisher setup is documented for future maintainers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Position SDYJ as a self-verifying, replayable, benchmarkable multi-agent research framework.

Both README files now lead with the operations-not-prompting thesis, expose new badges (PyPI, Stars), and include a comparison table against GPT Researcher, AutoGen, and the LangGraph examples that highlights trace, replay, evidence IDs, cost, and benchmark gates. Each "🚧 v0.6" item is honestly marked in progress so visitors are not misled.

Add docs/release-notes/v0.6.md as the running ledger of what has shipped (cost tracking, provider usage capture, packaging extras, PyPI pipeline, README) and what is still in flight (verifier loop, reflexive researcher, parallel tool calls, public benchmarks, real MCP, web UI). Update ROADMAP.md to match and reserve a v0.7 slot for extensibility work that previously lived under v0.6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Ship a minimum-viable Web UI that reuses the existing Coordinator, Planner, Researcher, Rapporteur, ResearchWorkflow, and InstrumentedLLM pipeline. The UI is intentionally thin, so anything we add benefits the CLI path for free.

Five tabs surface the run output: Report, Plan, Evidence cards, Trace timeline (with tool calls), and an LLM cost table fed by trace.metrics. Components in SDYJ_Agents/web/components.py take plain dicts so they are easy to test or swap.

The MVP runs with auto_approve=True; a proper human-in-the-loop approval gate is a v0.6 follow-up because it needs LangGraph state to survive a Streamlit rerender.

streamlit_app.py at the repo root is the entry point Hugging Face Spaces expects, and docs/huggingface-spaces-deploy.md walks through the one-time Space setup including the YAML frontmatter, secrets, and a local dry-run check.

tests/test_web_smoke.py uses Streamlit's built-in AppTest to catch import-time crashes and verify the example chip wires through to session_state. Add streamlit to the [web] extras so the core install stays free of UI dependencies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Publish the launch blog post for the v0.6 alpha in both English and Chinese.

The argument is that agent operations are an evaluation problem rather than a prompting problem, organized around four pillars: trace everything, deterministic replay, per-call cost in the trace, and a verifier loop. The post grounds the claims in the shipped 0.6.0a1 work (cost tracking, provider usage capture, PyPI pipeline, Web UI) and is honest about what is still in flight.

Both files target SEO around self-verifying agents, agent observability, and replayable agents, and they link readers back to the repo, the v0.6 release notes, and ROADMAP.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add the Verifier, the v0.6 differentiator. It re-reads the generated report against the collected E1/E2/... evidence and grades it on four dimensions (claim_evidence_alignment, citation_completeness, factual_consistency, coverage). When the overall quality dips below 0.75 or alignment below 0.65, the workflow loops back to the rapporteur with concrete revision hints, capped by max_revisions (default 2).

Design choices worth flagging:

- LangGraph persists state mutations made inside node functions but *discards* mutations made inside conditional edge functions. To stop the critique-revise loop from running away, revision_count is bumped inside verifier_node and the conditional edge only reads. The edge also reads a transient _pending_route_to_revise flag set by the node, so the cap is enforced exactly once per pass.
- Verifier.verify always returns a well-shaped dict, even when the LLM call fails, the response is malformed, or the model is too lenient. Defaulting to should_revise=True on failure means a broken verifier never silently rubber-stamps a bad report.
- The Rapporteur's revise mode reuses the previous report and the hints; it does not re-run summarize / organize / synthesize. That keeps revisions cheap, predictable, and trace-diffable.
- Verifier metrics (overall_quality, should_revise, weakest_dimension, revision_count) are mirrored into trace.metrics and lifted into evaluation summaries, so diff-runs and CI gates can reason about them without reaching into raw events.

Wire-up:

- ResearchWorkflow.run / .stream / .stream_interactive accept skip_verification and max_revisions; CLI exposes --no-verify and --max-revisions for `research`, plus --enable-verify and --max-revisions for `eval`/`benchmark`.
- The Web UI gains a "Self-verifying loop" toggle and a max-revisions slider in the sidebar.
- run_evaluation defaults to enable_verification=False so existing v0.5 benchmark gates keep working unchanged; flip it on for the v0.5-vs-v0.6 ablation in P2.6.
- Replay reads skip_verification from the source trace.config so a recorded run with verifier calls still replays in call order.

13 new tests in tests/test_verifier.py cover score coercion, JSON parsing fallbacks, alignment-floor override, max-revisions cap, history helpers, and an end-to-end revise loop. Existing 33 tests still pass; ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
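The revise decision and the node-versus-edge split described in this commit message can be sketched in plain Python, without LangGraph. This is an illustration under assumptions, not the repository's Verifier: the overall quality is assumed to be the unweighted mean of the four dimensions, unparseable scores are assumed to coerce to 0.0, and `coerce_score`, `grade`, and `route_after_verifier` are hypothetical names. The thresholds, the dimension names, `verifier_node`, `revision_count`, and `_pending_route_to_revise` come from the commit message.

```python
from typing import Any

DIMENSIONS = (
    "claim_evidence_alignment",
    "citation_completeness",
    "factual_consistency",
    "coverage",
)
QUALITY_FLOOR = 0.75    # overall quality below this triggers a revision
ALIGNMENT_FLOOR = 0.65  # alignment below this triggers a revision on its own


def coerce_score(value: Any) -> float:
    """Clamp a model-supplied score into [0, 1]; junk becomes 0.0 (assumed)."""
    try:
        score = float(value)
    except (TypeError, ValueError):
        return 0.0
    if score != score:  # NaN compares unequal to itself
        return 0.0
    return max(0.0, min(1.0, score))


def grade(raw_scores: Any) -> dict:
    """Always return a well-shaped verdict; fail closed on bad input."""
    if not isinstance(raw_scores, dict):
        return {"scores": {}, "overall_quality": 0.0,
                "weakest_dimension": None, "should_revise": True}
    scores = {dim: coerce_score(raw_scores.get(dim)) for dim in DIMENSIONS}
    overall = sum(scores.values()) / len(scores)  # assumed: unweighted mean
    should_revise = (
        overall < QUALITY_FLOOR
        or scores["claim_evidence_alignment"] < ALIGNMENT_FLOOR
    )
    return {"scores": scores, "overall_quality": overall,
            "weakest_dimension": min(scores, key=scores.get),
            "should_revise": should_revise}


def verifier_node(state: dict, raw_scores: Any) -> dict:
    """Node: the only place that mutates revision_count and the route flag."""
    verdict = grade(raw_scores)
    revisions = state.get("revision_count", 0)
    revise = verdict["should_revise"] and revisions < state.get("max_revisions", 2)
    return {
        **state,
        "verification": verdict,
        "revision_count": (revisions + 1) if revise else revisions,
        "_pending_route_to_revise": revise,
    }


def route_after_verifier(state: dict) -> str:
    """Edge: read-only, so nothing is lost if its mutations are discarded."""
    return "rapporteur" if state.get("_pending_route_to_revise") else "end"
```

Because the edge only reads a flag that the node already decided, the cap holds even when every verification fails: with max_revisions=2, a permanently failing verifier routes to the rapporteur twice and then ends.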

Add a one-shot reflection step inside Researcher.execute_task. After a task's queries run, the result quality is scored: empty batches, an average relevance below 0.5, or an error rate at/above 0.5 trigger a single LLM call that diagnoses the failure and proposes 1–2 materially-different replacement queries. The replacements are then executed against the same sources, results are merged into state, and the task is marked ``_reflected`` so a second adversarial failure cannot induce another reflection on the same task.

Reflection is controlled at the agent level via ``enable_reflection`` (on by default in v0.6). Disable it with the new --no-reflect flag on the research command, or enable it with --enable-reflect on the eval command (the eval default stays off so v0.5 benchmark gates keep working unchanged).

Trace integration:

- Each reflection emits an ``event_type="reflection"`` event in trace.events with the original queries, failure stats, and the rewritten queries; failures are tagged ``status="error"``.
- ``trace.metrics.reflection_count`` increments only when a reflection produced usable rewrites, so the v0.5-vs-v0.6 ablation can show how often the feature actually fired.

Replay integration:

- ResearchWorkflow.run / .stream / .stream_interactive callers now persist ``enable_reflection`` into trace.config alongside the verifier toggles. replay.py reads it and instantiates the Researcher with the same setting, so recorded LLM call order lines up with the workflow path during a deterministic replay.

The Streamlit Web UI gains a "Reflexive Researcher (v0.6)" sidebar toggle next to the verifier toggle.

15 new tests in tests/test_researcher_reflection.py cover the trigger conditions, mixed-score relevance averaging, fenced-JSON parsing, case-insensitive dedup of rewrites, the single-shot _reflected guard, LLM-failure resilience, and the disable flag. 61 tests pass overall; ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
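The trigger conditions and the single-shot guard from this commit message can be sketched as below. This is a hedged illustration rather than the logic in Researcher.execute_task: the result shape (dicts with optional `error` and `relevance` keys), the choice to skip unscored results when averaging, the choice to set `_reflected` before the LLM call, and the names `needs_reflection`, `dedupe_rewrites`, `maybe_reflect`, and `propose` are all assumptions made for the example.

```python
from typing import Callable

RELEVANCE_FLOOR = 0.5     # average relevance below this triggers reflection
ERROR_RATE_CEILING = 0.5  # error rate at or above this triggers reflection


def needs_reflection(results: list[dict]) -> bool:
    """Decide whether a task's search results are weak enough to reflect on."""
    if not results:
        return True  # empty batch
    errors = sum(1 for r in results if r.get("error"))
    if errors / len(results) >= ERROR_RATE_CEILING:
        return True
    # Assumed: only successful results that carry a score enter the average.
    scored = [float(r["relevance"]) for r in results
              if not r.get("error") and r.get("relevance") is not None]
    return bool(scored) and sum(scored) / len(scored) < RELEVANCE_FLOOR


def dedupe_rewrites(original: list[str], proposed: list[str],
                    limit: int = 2) -> list[str]:
    """Drop rewrites that repeat a known query, ignoring case and padding."""
    seen = {q.strip().lower() for q in original}
    kept = []
    for query in proposed:
        key = query.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(query.strip())
    return kept[:limit]


def maybe_reflect(task: dict, results: list[dict],
                  propose: Callable[[list[str], list[dict]], list[str]]) -> list[str]:
    """Return replacement queries at most once per task."""
    if task.get("_reflected") or not needs_reflection(results):
        return []
    task["_reflected"] = True  # assumed: set up front so failures also count
    try:
        proposed = propose(task["queries"], results)  # stands in for the LLM call
    except Exception:
        return []  # an LLM failure must not break the research task
    return dedupe_rewrites(task["queries"], proposed)
```

Passing the LLM call in as `propose` keeps the trigger and guard logic testable without a provider, which mirrors how the listed tests exercise trigger conditions and LLM-failure resilience separately.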

fhw and others added 30 commits May 2, 2026 18:14

feat: add mid-flight plan refinement

32a9c28

feat: harden plan refinement replay and timeline

94fb8f9

feat(researcher): run task searches in parallel

350862e

feat(phase2): add structured prompts and benchmark results

4f9ad25

chore: switch development workflow to conda

a6ed381

docs: align v0.6 shipped status

459c23d

feat(web): add static trace viewer

9c14c9d

feat(mcp): add sdk-backed transports

3ed1603

feat(cli): add deployment doctor

9edff4f

docs(mcp): add server demo scripts

3423e61

ci: publish trace viewer pages

80c9081

fix(cli): harden report output on Windows

7db589c

fix(replay): preserve recorded tool failures

4affe40

feat(benchmark): add external public harness

75444f3

feat(mcp): harden demo tool configuration

e38f377

chore(release): prepare hosted docs and package manifest

b46e95a

docs(live): record Tavily smoke blocker

e43f6f2

test(deploy): cover hosted entrypoints

9031ae8

ci(release): add package smoke gates

8d4dc1a

feat(benchmark): label public smoke artifacts

e1ec89f

docs(live): refresh Tavily blocker

1e53668

feat(docker): add container deployment assets

13acc6f

ci(release): smoke test built wheel

be66f18

fhw added 9 commits May 8, 2026 11:43

chore(release): modernize license metadata

3f5edb6

ci(release): harden hosted publish readiness

571bdc4

feat(benchmark): audit public smoke artifacts

49a8712

docs(live): record final release readiness pass

7e01545

docs(readme): sync v0.6 readiness status

c701257

feat(release): add local readiness preflight

6cb69a1

feat(benchmark): add failure analysis artifacts

4facae3

feat(benchmark): add regression analysis comparison

60ddaca

feat(verifier): add deterministic citation audit

3470a31

hwfengcs wants to merge 39 commits into main from feat/v0.6-self-verifying
