Skip to content

feat(harness): single-agent cached tool-calling loop (replaces agentic expert panel, flag-gated)#877

Merged
korutx merged 13 commits into
trunkfrom
worktree-feat-harness-single-agent-loop
Jul 3, 2026
Merged

feat(harness): single-agent cached tool-calling loop (replaces agentic expert panel, flag-gated)#877
korutx merged 13 commits into
trunkfrom
worktree-feat-harness-single-agent-loop

Conversation

@odtorres

@odtorres odtorres commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Summary

Replaces the agentic expert-panel pipeline (planner → verifier → synthesizer → critic seats, 5-10+ sequential full-context LLM calls per question) with one cached native tool-calling agent loop, behind a feature flag (MIOT_HARNESS_AGENTS_AGENT_LOOP_ENABLED, default off — flag-off behavior is verified identical to trunk).

Root causes addressed (Console showed zero prompt-cache usage):

  1. No cache_control anywhere — every LLM call paid full input price and full prefill latency.
  2. Panel shape defeats caching — each seat built a different system prompt, so the static context (primer + tool catalogs) could never be shared across seats.

Architecture

  • runtime/agent_loop.pyAgentLoopRunner: one model, tools bound once at boot, append-only conversation looping tool_use → invoke_step → tool_result until the model answers. Replaces the planner/verifier/synthesizer/critic seats; reuses invoke_step, the rule-based freshness_judge, and provenance logging unchanged.
  • Cache layout (Anthropic renders tools → system → messages): frozen byte-stable prefix (sorted native tool defs + static system prompt with an ephemeral breakpoint) + one request-time tail marker applied on a copy — exactly 2 breakpoints per request, wire-verified via _get_request_payload tests.
  • Deterministic safety gates kept as code: tenancy pre-gate before any LLM call; freshness classification per step (same annotate-don't-refuse semantics as the legacy agentic graph). Anti-fabrication/citation rules moved into the cached prompt (all 10 rule phrases pinned by tests).
  • Dynamic content never touches the prefix: skill bodies / JSON-blocks contract / conversation history are demoted to <system-reminder> blocks in the user turn.
  • Telemetry now records ephemeral cache-write tokens (langchain maps them to input_token_details["ephemeral_5m_input_tokens"] and zeroes cache_creation), so the Console/Langfuse cache panels report real numbers.

Verification

  • Live cache test passed against the real Anthropic API: ~5011-token prefix written on call 1, 5012 tokens read from cache on call 2 (tests/integration/test_agent_loop_cache_live.py, skip-gated behind MIOT_HARNESS_RUN_LIVE_TESTS=1).
  • 901 tests pass, 3 skipped; ruff + mypy clean on new modules.
  • Each task passed an independent spec+quality review; final whole-branch review found one blocker (telemetry cache accounting) — fixed in the last commit.

Rollout plan (not in this PR)

Golden-eval parity run (legacy agentic graph vs loop) → review latency/cost/cache metrics → flip the flag default and delete the legacy agentic seats in a follow-up.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added an optional single-agent tool-calling loop for data requests, configurable via new settings for enabling the mode, capping tool-result context size, and setting an LLM timeout.
    • Improved native tool-calling and prompt-cache stability for agent interactions.
  • Bug Fixes

    • Fixed Anthropic model timeout handling to respect a caller-provided timeout when set.
    • Updated Anthropic usage tracking so ephemeral cache creation is reflected correctly.
  • Tests

    • Added unit and live integration coverage for the agent loop, prompt caching, native tool generation, configuration defaults, and timeout overrides.

Closes #878

odtorres and others added 12 commits July 2, 2026 15:27
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… dict per message

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…oints

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…-content tails

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…xtraction

langchain_anthropic maps ephemeral cache writes to ephemeral_5m_input_tokens /
ephemeral_1h_input_tokens and sets cache_creation=0. _extract_usage now sums
all three keys so dashboards show accurate cache-write counts and pricing.py
applies the correct 1.25× rate instead of base input rate.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4f14aade-129b-485e-963e-04c13f5d79fa

📥 Commits

Reviewing files that changed from the base of the PR and between 257915a and 72704b3.

📒 Files selected for processing (1)
  • miot-harness/tests/integration/test_agent_loop_cache_live.py

📝 Walkthrough

Walkthrough

This PR adds a single-agent native tool-calling execution path with cached prompt construction, deterministic tool definitions, supervisor/server wiring, telemetry updates for Anthropic cache tokens, and unit plus live integration tests.

Changes

Agent Loop Feature

Layer / File(s) Summary
Configuration and timeout support
miot-harness/src/miot_harness/config.py, miot-harness/src/miot_harness/agents/chat_models.py, miot-harness/tests/agents/test_chat_models_loop.py, miot-harness/tests/test_config.py
Adds agent-loop settings and a Claude timeout override, with tests for defaults and environment-based configuration.
Native tool schema and cached system prompt
miot-harness/src/miot_harness/agents/native_tools.py, miot-harness/src/miot_harness/runtime/agent_prompt.py, miot-harness/tests/test_native_tools.py, miot-harness/tests/runtime/test_agent_prompt.py
Adds deterministic tool-definition rendering and a frozen cached system prompt, with tests for schema shape, ordering, and prompt byte stability.
AgentLoopRunner core loop and message caching
miot-harness/src/miot_harness/runtime/agent_loop.py, miot-harness/tests/runtime/test_agent_loop.py, miot-harness/tests/runtime/test_agent_loop_payload.py
Implements the single-agent loop, cache-marker handling, tool-call execution, turn limiting, and event emission, with runtime and payload tests for the main execution paths.
Supervisor and server wiring
miot-harness/src/miot_harness/runtime/supervisor.py, miot-harness/src/miot_harness/api/server.py, miot-harness/tests/runtime/test_supervisor_agent_loop.py
Adds optional loop injection, routes DATA_AGENTIC through the loop when enabled, instantiates it during startup, and clears it on startup failure, with routing tests.
Cache token telemetry and live integration tests
miot-harness/src/miot_harness/observability/callbacks.py, miot-harness/tests/observability/test_callbacks.py, miot-harness/tests/integration/conftest.py, miot-harness/tests/integration/test_agent_loop_cache_live.py
Adjusts cache token normalization for Anthropic ephemeral counters and adds unit plus live integration coverage for cache accounting and prompt-cache behavior.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
    participant Server
    participant HarnessSupervisor
    participant AgentLoopRunner
    participant Model
    participant ToolRegistry

    Server->>HarnessSupervisor: inject agent_loop at startup
    HarnessSupervisor->>AgentLoopRunner: run(user_message, ctx, prior_messages)
    AgentLoopRunner->>Model: ainvoke(cached system + history)
    Model-->>AgentLoopRunner: AIMessage / tool_calls
    alt tool calls returned
        loop for each tool call
            AgentLoopRunner->>ToolRegistry: invoke_step(tool_call)
            ToolRegistry-->>AgentLoopRunner: evidence or failure
            AgentLoopRunner->>Model: ainvoke(next turn)
            Model-->>AgentLoopRunner: final AIMessage
        end
    else direct answer
        Model-->>AgentLoopRunner: final AIMessage
    end
    AgentLoopRunner-->>HarnessSupervisor: answer, evidence, usage_log
Loading

Possibly related PRs

Suggested reviewers: korutx

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the new single-agent cached tool-calling loop and its flag-gated replacement of the expert panel.
Linked Issues check ✅ Passed The changes implement the requested single cached native tool-calling loop, cache-aware prompt/telemetry updates, and supporting tests behind the feature flag.
Out of Scope Changes check ✅ Passed The added modules, settings, tests, and integration fixtures all support the loop and cache work, with no obvious unrelated changes.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch worktree-feat-harness-single-agent-loop

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@odtorres odtorres self-assigned this Jul 3, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
miot-harness/tests/integration/test_agent_loop_cache_live.py (1)

109-121: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Use ToolRegistry() here instead of __new__. The registry already has a public empty constructor, and __init__ only initializes _tools; keeping the helper on the normal construction path avoids reaching into private state.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@miot-harness/tests/integration/test_agent_loop_cache_live.py` around lines
109 - 121, Replace the manual ToolRegistry.__new__ construction in
_big_registry() with the public ToolRegistry() constructor. The helper only
needs an empty registry, and using the normal initialization path preserves
__init__ behavior instead of setting the private _tools field directly.
miot-harness/src/miot_harness/observability/callbacks.py (1)

85-93: 🗄️ Data Integrity & Integration | 🔵 Trivial | 💤 Low value

Keep 1h ephemeral cache writes distinct
TokenUsage.cache_creation_input_tokens collapses ephemeral_5m_input_tokens and ephemeral_1h_input_tokens, so compute_cost will price both at the same cache-creation rate. If 1h ephemeral caching is ever enabled, split this bucket or carry the TTL through to pricing so those writes don’t get underbilled.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@miot-harness/src/miot_harness/observability/callbacks.py` around lines 85 -
93, The TokenUsage cache_creation_input_tokens aggregation in callbacks.py is
combining 5m and 1h ephemeral writes into one bucket, which hides the TTL
distinction. Update the telemetry path around the cache_creation calculation so
ephemeral_5m_input_tokens and ephemeral_1h_input_tokens remain separate (or
preserve TTL metadata) and adjust compute_cost to price them with the correct
cache-creation rate based on the specific TTL.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@miot-harness/src/miot_harness/runtime/agent_loop.py`:
- Around line 102-135: The live cache-marker coverage in AgentLoopRunner
currently misses the tool-result path, so add a live test case that exercises a
real tool call through AgentLoopRunner.run() and reaches the
ToolMessage/content=str branch in _mark_message and _with_tail_marker. Extend
the existing integration test to include a tool-using turn and assert the
request still succeeds with cache markers applied to the final tool-result
message, so a server-side invalid_cache regression on tool-result turns is
caught.

---

Nitpick comments:
In `@miot-harness/src/miot_harness/observability/callbacks.py`:
- Around line 85-93: The TokenUsage cache_creation_input_tokens aggregation in
callbacks.py is combining 5m and 1h ephemeral writes into one bucket, which
hides the TTL distinction. Update the telemetry path around the cache_creation
calculation so ephemeral_5m_input_tokens and ephemeral_1h_input_tokens remain
separate (or preserve TTL metadata) and adjust compute_cost to price them with
the correct cache-creation rate based on the specific TTL.

In `@miot-harness/tests/integration/test_agent_loop_cache_live.py`:
- Around line 109-121: Replace the manual ToolRegistry.__new__ construction in
_big_registry() with the public ToolRegistry() constructor. The helper only
needs an empty registry, and using the normal initialization path preserves
__init__ behavior instead of setting the private _tools field directly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b1f5a5a6-9eec-4d8d-a60c-b7e3f0859fff

📥 Commits

Reviewing files that changed from the base of the PR and between 0d42082 and 257915a.

📒 Files selected for processing (19)
  • miot-harness/src/miot_harness/agents/chat_models.py
  • miot-harness/src/miot_harness/agents/native_tools.py
  • miot-harness/src/miot_harness/api/server.py
  • miot-harness/src/miot_harness/config.py
  • miot-harness/src/miot_harness/observability/callbacks.py
  • miot-harness/src/miot_harness/runtime/agent_loop.py
  • miot-harness/src/miot_harness/runtime/agent_prompt.py
  • miot-harness/src/miot_harness/runtime/supervisor.py
  • miot-harness/tests/agents/test_chat_models_loop.py
  • miot-harness/tests/integration/__init__.py
  • miot-harness/tests/integration/conftest.py
  • miot-harness/tests/integration/test_agent_loop_cache_live.py
  • miot-harness/tests/observability/test_callbacks.py
  • miot-harness/tests/runtime/test_agent_loop.py
  • miot-harness/tests/runtime/test_agent_loop_payload.py
  • miot-harness/tests/runtime/test_agent_prompt.py
  • miot-harness/tests/runtime/test_supervisor_agent_loop.py
  • miot-harness/tests/test_config.py
  • miot-harness/tests/test_native_tools.py

Comment thread miot-harness/src/miot_harness/runtime/agent_loop.py
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@korutx korutx merged commit 7b89d97 into trunk Jul 3, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: collapse harness agentic expert panel into a single prompt-cached tool-calling agent

2 participants