Feat/v0.6 self verifying #2
Open
hwfengcs wants to merge 39 commits into
Wire token usage and USD cost estimation end-to-end so every LLM call
is attributable. PRICING_TABLE in SDYJ_Agents/utils/cost.py is a
hardcoded {(provider, model): (input_per_million, output_per_million)}
dict — explicit, auditable, no third-party services. Unknown
(provider, model) pairs return None rather than 0.00, so the CLI can
render an honest "—" instead of a misleading zero.
Provider wrappers (OpenAI, Claude, DeepSeek, Gemini) now expose
last_usage with the raw provider dict; normalize_usage reconciles the
three different naming conventions. InstrumentedLLM writes
prompt_tokens_actual, completion_tokens_actual, and cost_usd into each
llm_calls entry. finalize_trace rolls them up into trace.metrics, and
summarize_trace surfaces them so diff-runs reports cost deltas.
The CLI inspect-run gains a per-call LLM Calls table with a tip that
links unpriced cells back to PRICING_TABLE.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
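The cost lookup described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code in SDYJ_Agents/utils/cost.py: the table entry and its prices are placeholders, `estimate_cost_usd` and `_first_int` are hypothetical names, and the assumption is that the three naming conventions are the OpenAI-style, Anthropic-style, and Gemini-style usage keys.

```python
from typing import Any, Dict, Optional, Tuple

# (provider, model) -> (input USD per million tokens, output USD per million tokens).
# The entry below is a placeholder for illustration, NOT a real provider rate.
PRICING_TABLE: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("example-provider", "example-model"): (1.00, 2.00),
}

# Assumed key spellings for the three usage naming conventions.
_INPUT_KEYS = ("prompt_tokens", "input_tokens", "prompt_token_count")
_OUTPUT_KEYS = ("completion_tokens", "output_tokens", "candidates_token_count")


def _first_int(raw: Dict[str, Any], keys: Tuple[str, ...]) -> Optional[int]:
    """Return the first integer found under any of the candidate keys."""
    for key in keys:
        value = raw.get(key)
        if isinstance(value, int) and not isinstance(value, bool):
            return value
    return None


def normalize_usage(raw: Optional[Dict[str, Any]]) -> Dict[str, Optional[int]]:
    """Map a raw provider usage dict onto one naming convention."""
    raw = raw or {}
    return {
        "prompt_tokens": _first_int(raw, _INPUT_KEYS),
        "completion_tokens": _first_int(raw, _OUTPUT_KEYS),
    }


def estimate_cost_usd(
    provider: str, model: str, usage: Dict[str, Optional[int]]
) -> Optional[float]:
    """Return the estimated cost, or None when the pair is unpriced or usage is missing."""
    prices = PRICING_TABLE.get((provider, model))
    prompt = usage["prompt_tokens"]
    completion = usage["completion_tokens"]
    if prices is None or prompt is None or completion is None:
        return None  # None, not 0.0, so callers can show "unknown" instead of "free"
    input_rate, output_rate = prices
    return (prompt * input_rate + completion * output_rate) / 1_000_000
```

Returning None for an unpriced pair keeps "unknown" distinct from "zero cost" all the way to the CLI, which is the property the commit message calls out.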
Add a two-track publish workflow that uses OIDC-based Trusted
Publishers instead of long-lived API tokens:
- workflow_dispatch with target=testpypi for alpha/beta cuts.
- GitHub Release event for stable PyPI releases.
Both jobs gate on a build + twine check step.
Bump pyproject.toml to 0.6.0a1 and split optional dependencies into
[web], [mcp], [benchmarks], and [all] so the core install stays slim
while opt-in features pull their deps on demand.
Add docs/release-process.md so the one-time PyPI Trusted Publisher
setup is documented for future maintainers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Position SDYJ as a self-verifying, replayable, benchmarkable
multi-agent research framework. Both README files now lead with the
operations-not-prompting thesis, expose new badges (PyPI, Stars), and
include a comparison table against GPT Researcher, AutoGen, and the
LangGraph examples that highlights trace, replay, evidence IDs, cost,
and benchmark gates. Each "🚧 v0.6" item is honestly marked in
progress so visitors are not misled.
Add docs/release-notes/v0.6.md as the running ledger of what has
shipped (cost tracking, provider usage capture, packaging extras, PyPI
pipeline, README) and what is still in flight (verifier loop,
reflexive researcher, parallel tool calls, public benchmarks, real
MCP, web UI). Update ROADMAP.md to match and reserve a v0.7 slot for
extensibility work that previously lived under v0.6.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ship a minimum-viable Web UI that reuses the existing Coordinator,
Planner, Researcher, Rapporteur, ResearchWorkflow, and InstrumentedLLM
pipeline. The UI is intentionally thin so anything we add benefits the
CLI path for free.
Five tabs surface the run output: Report, Plan, Evidence cards, Trace
timeline (with tool calls), and an LLM cost table fed by
trace.metrics. Components in SDYJ_Agents/web/components.py take plain
dicts so they are easy to test or swap.
The MVP runs with auto_approve=True; a proper human-in-the-loop
approval gate is a v0.6 follow-up because it needs LangGraph state to
survive a Streamlit rerender.
streamlit_app.py at the repo root is the entry point Hugging Face
Spaces expects, and docs/huggingface-spaces-deploy.md walks through
the one-time Space setup including the YAML frontmatter, secrets, and
a local dry-run check.
tests/test_web_smoke.py uses Streamlit's built-in AppTest to catch
import-time crashes and verify the example chip wires through to
session_state.
Add streamlit to the [web] extras so the core install stays free of UI
dependencies.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Publish the launch blog post for the v0.6 alpha in both English and
Chinese. The argument is that agent operations are an evaluation
problem rather than a prompting problem, organized around four
pillars: trace everything, deterministic replay, per-call cost in the
trace, and a verifier loop. The post grounds the claims in the shipped
0.6.0a1 work (cost tracking, provider usage capture, PyPI pipeline,
Web UI) and is honest about what is still in flight.
Both files target SEO around self-verifying agents, agent
observability, and replayable agents, and they link readers back to
the repo, the v0.6 release notes, and ROADMAP.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add the Verifier, the v0.6 differentiator. It re-reads the generated
report against the collected E1/E2/... evidence and grades it on four
dimensions (claim_evidence_alignment, citation_completeness,
factual_consistency, coverage). When the overall quality dips below
0.75 or alignment below 0.65, the workflow loops back to the
rapporteur with concrete revision hints, capped by max_revisions
(default 2).
Design choices worth flagging:
- LangGraph persists state mutations made inside node functions but
  *discards* mutations made inside conditional edge functions. To stop
  the critique-revise loop from running away, revision_count is bumped
  inside verifier_node and the conditional edge only reads. The edge
  also reads a transient _pending_route_to_revise flag set by the
  node, so the cap is enforced exactly once per pass.
- Verifier.verify always returns a well-shaped dict, even when the LLM
  call fails, the response is malformed, or the model is too lenient.
  Defaulting to should_revise=True on failure means a broken verifier
  never silently rubber-stamps a bad report.
- The Rapporteur's revise mode reuses the previous report and the
  hints; it does not re-run summarize / organize / synthesize. That
  keeps revisions cheap, predictable, and trace-diffable.
- Verifier metrics (overall_quality, should_revise, weakest_dimension,
  revision_count) are mirrored into trace.metrics and lifted into
  evaluation summaries, so diff-runs and CI gates can reason about
  them without reaching into raw events.
Wire-up:
- ResearchWorkflow.run / .stream / .stream_interactive accept
  skip_verification and max_revisions; CLI exposes --no-verify and
  --max-revisions for `research`, plus --enable-verify and
  --max-revisions for `eval`/`benchmark`.
- The Web UI gains a "Self-verifying loop" toggle and a max-revisions
  slider in the sidebar.
- run_evaluation defaults to enable_verification=False so existing
  v0.5 benchmark gates keep working unchanged; flip it on for the
  v0.5-vs-v0.6 ablation in P2.6.
- Replay reads skip_verification from the source trace.config so a
  recorded run with verifier calls still replays in call order.
13 new tests in tests/test_verifier.py cover score coercion, JSON
parsing fallbacks, alignment-floor override, max-revisions cap,
history helpers, and an end-to-end revise loop. Existing 33 tests
still pass; ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
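The node/edge split and the fail-closed default described in the commit above can be sketched in plain Python, without importing LangGraph. The thresholds and field names come from the commit message; `shape_verdict`, `route_after_verifier`, the state layout, and the use of an unweighted mean for overall quality are assumptions made for illustration, not the project's actual implementation.

```python
from typing import Any, Dict, Optional

QUALITY_FLOOR = 0.75    # overall-quality threshold from the commit message
ALIGNMENT_FLOOR = 0.65  # alignment threshold from the commit message
DIMENSIONS = (
    "claim_evidence_alignment",
    "citation_completeness",
    "factual_consistency",
    "coverage",
)
# Verdict returned whenever the verifier output cannot be trusted.
FAIL_CLOSED = {"overall_quality": 0.0, "should_revise": True, "weakest_dimension": None}


def _coerce_score(value: Any) -> Optional[float]:
    """Clamp a score to [0, 1]; return None when it is not a usable number."""
    try:
        score = float(value)
    except (TypeError, ValueError):
        return None
    if score != score:  # NaN never equals itself
        return None
    return max(0.0, min(1.0, score))


def shape_verdict(raw: Any) -> Dict[str, Any]:
    """Always return a well-shaped verdict; malformed input asks for a revision."""
    if not isinstance(raw, dict):
        return dict(FAIL_CLOSED)
    scores = {dim: _coerce_score(raw.get(dim)) for dim in DIMENSIONS}
    if any(score is None for score in scores.values()):
        return dict(FAIL_CLOSED)
    overall = sum(scores.values()) / len(scores)  # assumption: unweighted mean
    should_revise = (
        overall < QUALITY_FLOOR
        or scores["claim_evidence_alignment"] < ALIGNMENT_FLOOR
    )
    return {
        "overall_quality": overall,
        "should_revise": should_revise,
        "weakest_dimension": min(scores, key=scores.get),
    }


def verifier_node(state: Dict[str, Any], raw_verdict: Any) -> Dict[str, Any]:
    """Node: the only place that mutates revision_count and sets the route flag."""
    verdict = shape_verdict(raw_verdict)
    route = verdict["should_revise"] and state["revision_count"] < state["max_revisions"]
    return {
        **state,
        "verdict": verdict,
        "revision_count": state["revision_count"] + (1 if route else 0),
        "_pending_route_to_revise": route,
    }


def route_after_verifier(state: Dict[str, Any]) -> str:
    """Conditional edge: read-only, so nothing is lost if its writes are discarded."""
    return "rapporteur" if state["_pending_route_to_revise"] else "end"
```

Because the edge function only reads the flag, the cap holds even when the verifier fails on every pass: with max_revisions=2, a verifier that always returns garbage produces two revisions and then routes to the end.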
Add a one-shot reflection step inside Researcher.execute_task. After a
task's queries run, the result quality is scored: empty batches, an
average relevance below 0.5, or an error rate at or above 0.5 trigger
a single LLM call that diagnoses the failure and proposes 1–2
materially different replacement queries. The replacements are then
executed against the same sources, results are merged into state, and
the task is marked ``_reflected`` so a second adversarial failure
cannot induce another reflection on the same task.
Reflection is controlled at the agent level via ``enable_reflection``
(on by default in v0.6). Disable it with the new --no-reflect flag on
the research command, or enable it with --enable-reflect on the eval
command (the eval default stays off so v0.5 benchmark gates keep
working unchanged).
Trace integration:
- Each reflection emits an ``event_type="reflection"`` event in
  trace.events with the original queries, failure stats, and the
  rewritten queries; failures are tagged ``status="error"``.
- ``trace.metrics.reflection_count`` increments only when a reflection
  produced usable rewrites, so the v0.5-vs-v0.6 ablation can show how
  often the feature actually fired.
Replay integration:
- ResearchWorkflow.run / .stream / .stream_interactive callers now
  persist ``enable_reflection`` into trace.config alongside the
  verifier toggles. replay.py reads it and instantiates the Researcher
  with the same setting, so recorded LLM call order lines up with the
  workflow path during a deterministic replay.
The Streamlit Web UI gains a "Reflexive Researcher (v0.6)" sidebar
toggle next to the verifier toggle.
15 new tests in tests/test_researcher_reflection.py cover the trigger
conditions, mixed-score relevance averaging, fenced-JSON parsing,
case-insensitive dedup of rewrites, the single-shot _reflected guard,
LLM-failure resilience, and the disable flag. 61 tests pass overall;
ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
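The trigger conditions and the single-shot guard from the commit above can be sketched as two small pure functions. The thresholds and the ``_reflected`` marker come from the commit message; the function names, the result shape (dicts with optional ``relevance`` and ``error`` keys), averaging over scored non-error results only, and dropping rewrites that repeat an original query are all assumptions for illustration.

```python
from typing import Any, Dict, List

RELEVANCE_FLOOR = 0.5     # average relevance below this triggers reflection
ERROR_RATE_CEILING = 0.5  # error rate at or above this triggers reflection


def needs_reflection(task: Dict[str, Any], results: List[Dict[str, Any]]) -> bool:
    """Decide whether a task's query batch warrants one reflection call."""
    if task.get("_reflected"):
        return False  # single-shot guard: never reflect twice on the same task
    if not results:
        return True   # empty batch
    errors = sum(1 for result in results if result.get("error"))
    if errors / len(results) >= ERROR_RATE_CEILING:
        return True
    scored = [
        result["relevance"]
        for result in results
        if not result.get("error")
        and isinstance(result.get("relevance"), (int, float))
    ]
    # Assumption: unscored results are ignored rather than counted as zero.
    return bool(scored) and sum(scored) / len(scored) < RELEVANCE_FLOOR


def dedupe_rewrites(
    original_queries: List[str], rewrites: List[str], limit: int = 2
) -> List[str]:
    """Keep at most `limit` rewrites, comparing case-insensitively."""
    seen = {query.strip().casefold() for query in original_queries}
    kept: List[str] = []
    for query in rewrites:
        key = query.strip().casefold()
        if key and key not in seen:
            seen.add(key)
            kept.append(query.strip())
        if len(kept) == limit:
            break
    return kept
```

A caller would set ``task["_reflected"] = True`` after the first reflection, whether or not the rewrites helped, so a second bad batch on the same task falls through without another LLM call.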
added 9 commits
May 8, 2026 11:43
Summary
Testing
pytest
ruff check SDYJ_Agents tests
Notes
outputs/files are committed.