feat(harness): single-agent cached tool-calling loop (replaces agentic expert panel, flag-gated) by odtorres · Pull Request #877 · microboxlabs/modulariot

odtorres · 2026-07-03T01:17:41Z

Summary

Replaces the agentic expert-panel pipeline (planner → verifier → synthesizer → critic seats, 5-10+ sequential full-context LLM calls per question) with one cached native tool-calling agent loop, behind a feature flag (MIOT_HARNESS_AGENTS_AGENT_LOOP_ENABLED, default off — flag-off behavior is verified identical to trunk).

Root causes addressed (Console showed zero prompt-cache usage):

No cache_control anywhere — every LLM call paid full input price and full prefill latency.
Panel shape defeats caching — each seat built a different system prompt, so the static context (primer + tool catalogs) could never be shared across seats.

Architecture

runtime/agent_loop.py — AgentLoopRunner: one model, tools bound once at boot, append-only conversation looping tool_use → invoke_step → tool_result until the model answers. Replaces the planner/verifier/synthesizer/critic seats; reuses invoke_step, the rule-based freshness_judge, and provenance logging unchanged.
Cache layout (Anthropic renders tools → system → messages): frozen byte-stable prefix (sorted native tool defs + static system prompt with an ephemeral breakpoint) + one request-time tail marker applied on a copy — exactly 2 breakpoints per request, wire-verified via _get_request_payload tests.
Deterministic safety gates kept as code: tenancy pre-gate before any LLM call; freshness classification per step (same annotate-don't-refuse semantics as the legacy agentic graph). Anti-fabrication/citation rules moved into the cached prompt (all 10 rule phrases pinned by tests).
Dynamic content never touches the prefix: skill bodies / JSON-blocks contract / conversation history are demoted to <system-reminder> blocks in the user turn.
Telemetry now records ephemeral cache-write tokens (langchain maps them to input_token_details["ephemeral_5m_input_tokens"] and zeroes cache_creation), so the Console/Langfuse cache panels report real numbers.

Verification

Live cache test passed against the real Anthropic API: ~5011-token prefix written on call 1, 5012 tokens read from cache on call 2 (tests/integration/test_agent_loop_cache_live.py, skip-gated behind MIOT_HARNESS_RUN_LIVE_TESTS=1).
901 tests pass, 3 skipped; ruff + mypy clean on new modules.
Each task passed an independent spec+quality review; final whole-branch review found one blocker (telemetry cache accounting) — fixed in the last commit.

Rollout plan (not in this PR)

Golden-eval parity run (legacy agentic graph vs loop) → review latency/cost/cache metrics → flip the flag default and delete the legacy agentic seats in a follow-up.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added an optional single-agent tool-calling loop for data requests, configurable via new settings for enabling the mode, capping tool-result context size, and setting an LLM timeout.
- Improved native tool-calling and prompt-cache stability for agent interactions.
Bug Fixes
- Fixed Anthropic model timeout handling to respect a caller-provided timeout when set.
- Updated Anthropic usage tracking so ephemeral cache creation is reflected correctly.
Tests
- Added unit and live integration coverage for the agent loop, prompt caching, native tool generation, configuration defaults, and timeout overrides.

Closes #878

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… dict per message Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…oints Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…-content tails Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…flag

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…xtraction langchain_anthropic maps ephemeral cache writes to ephemeral_5m_input_tokens / ephemeral_1h_input_tokens and sets cache_creation=0. _extract_usage now sums all three keys so dashboards show accurate cache-write counts and pricing.py applies the correct 1.25× rate instead of base input rate. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

coderabbitai · 2026-07-03T01:18:22Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4f14aade-129b-485e-963e-04c13f5d79fa

📥 Commits

Reviewing files that changed from the base of the PR and between 257915a and 72704b3.

📒 Files selected for processing (1)

miot-harness/tests/integration/test_agent_loop_cache_live.py

📝 Walkthrough

Walkthrough

This PR adds a single-agent native tool-calling execution path with cached prompt construction, deterministic tool definitions, supervisor/server wiring, telemetry updates for Anthropic cache tokens, and unit plus live integration tests.

Changes

Agent Loop Feature

Layer / File(s)	Summary
Configuration and timeout support `miot-harness/src/miot_harness/config.py`, `miot-harness/src/miot_harness/agents/chat_models.py`, `miot-harness/tests/agents/test_chat_models_loop.py`, `miot-harness/tests/test_config.py`	Adds agent-loop settings and a Claude timeout override, with tests for defaults and environment-based configuration.
Native tool schema and cached system prompt `miot-harness/src/miot_harness/agents/native_tools.py`, `miot-harness/src/miot_harness/runtime/agent_prompt.py`, `miot-harness/tests/test_native_tools.py`, `miot-harness/tests/runtime/test_agent_prompt.py`	Adds deterministic tool-definition rendering and a frozen cached system prompt, with tests for schema shape, ordering, and prompt byte stability.
AgentLoopRunner core loop and message caching `miot-harness/src/miot_harness/runtime/agent_loop.py`, `miot-harness/tests/runtime/test_agent_loop.py`, `miot-harness/tests/runtime/test_agent_loop_payload.py`	Implements the single-agent loop, cache-marker handling, tool-call execution, turn limiting, and event emission, with runtime and payload tests for the main execution paths.
Supervisor and server wiring `miot-harness/src/miot_harness/runtime/supervisor.py`, `miot-harness/src/miot_harness/api/server.py`, `miot-harness/tests/runtime/test_supervisor_agent_loop.py`	Adds optional loop injection, routes DATA_AGENTIC through the loop when enabled, instantiates it during startup, and clears it on startup failure, with routing tests.
Cache token telemetry and live integration tests `miot-harness/src/miot_harness/observability/callbacks.py`, `miot-harness/tests/observability/test_callbacks.py`, `miot-harness/tests/integration/conftest.py`, `miot-harness/tests/integration/test_agent_loop_cache_live.py`	Adjusts cache token normalization for Anthropic ephemeral counters and adds unit plus live integration coverage for cache accounting and prompt-cache behavior.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
    participant Server
    participant HarnessSupervisor
    participant AgentLoopRunner
    participant Model
    participant ToolRegistry

    Server->>HarnessSupervisor: inject agent_loop at startup
    HarnessSupervisor->>AgentLoopRunner: run(user_message, ctx, prior_messages)
    AgentLoopRunner->>Model: ainvoke(cached system + history)
    Model-->>AgentLoopRunner: AIMessage / tool_calls
    alt tool calls returned
        loop for each tool call
            AgentLoopRunner->>ToolRegistry: invoke_step(tool_call)
            ToolRegistry-->>AgentLoopRunner: evidence or failure
            AgentLoopRunner->>Model: ainvoke(next turn)
            Model-->>AgentLoopRunner: final AIMessage
        end
    else direct answer
        Model-->>AgentLoopRunner: final AIMessage
    end
    AgentLoopRunner-->>HarnessSupervisor: answer, evidence, usage_log

Possibly related PRs

microboxlabs/modulariot#445: Related supervisor/runtime orchestration changes in the same execution path.
microboxlabs/modulariot#446: Also updates get_chat_model, though for a different parameter set.
microboxlabs/modulariot#462: Shares agent-route execution and cache/token accounting surfaces.

Suggested reviewers: korutx

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly describes the new single-agent cached tool-calling loop and its flag-gated replacement of the expert panel.
Linked Issues check	✅ Passed	The changes implement the requested single cached native tool-calling loop, cache-aware prompt/telemetry updates, and supporting tests behind the feature flag.
Out of Scope Changes check	✅ Passed	The added modules, settings, tests, and integration fixtures all support the loop and cache work, with no obvious unrelated changes.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch worktree-feat-harness-single-agent-loop

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

miot-harness/tests/integration/test_agent_loop_cache_live.py (1)
109-121: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Use ToolRegistry() here instead of __new__. The registry already has a public empty constructor, and __init__ only initializes _tools; keeping the helper on the normal construction path avoids reaching into private state.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@miot-harness/tests/integration/test_agent_loop_cache_live.py` around lines
109 - 121, Replace the manual ToolRegistry.__new__ construction in
_big_registry() with the public ToolRegistry() constructor. The helper only
needs an empty registry, and using the normal initialization path preserves
__init__ behavior instead of setting the private _tools field directly.
miot-harness/src/miot_harness/observability/callbacks.py (1)
85-93: 🗄️ Data Integrity & Integration | 🔵 Trivial | 💤 Low value

Keep 1h ephemeral cache writes distinct
TokenUsage.cache_creation_input_tokens collapses ephemeral_5m_input_tokens and ephemeral_1h_input_tokens, so compute_cost will price both at the same cache-creation rate. If 1h ephemeral caching is ever enabled, split this bucket or carry the TTL through to pricing so those writes don’t get underbilled.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@miot-harness/src/miot_harness/observability/callbacks.py` around lines 85 -
93, The TokenUsage cache_creation_input_tokens aggregation in callbacks.py is
combining 5m and 1h ephemeral writes into one bucket, which hides the TTL
distinction. Update the telemetry path around the cache_creation calculation so
ephemeral_5m_input_tokens and ephemeral_1h_input_tokens remain separate (or
preserve TTL metadata) and adjust compute_cost to price them with the correct
cache-creation rate based on the specific TTL.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@miot-harness/src/miot_harness/runtime/agent_loop.py`:
- Around line 102-135: The live cache-marker coverage in AgentLoopRunner
currently misses the tool-result path, so add a live test case that exercises a
real tool call through AgentLoopRunner.run() and reaches the
ToolMessage/content=str branch in _mark_message and _with_tail_marker. Extend
the existing integration test to include a tool-using turn and assert the
request still succeeds with cache markers applied to the final tool-result
message, so a server-side invalid_cache regression on tool-result turns is
caught.

---

Nitpick comments:
In `@miot-harness/src/miot_harness/observability/callbacks.py`:
- Around line 85-93: The TokenUsage cache_creation_input_tokens aggregation in
callbacks.py is combining 5m and 1h ephemeral writes into one bucket, which
hides the TTL distinction. Update the telemetry path around the cache_creation
calculation so ephemeral_5m_input_tokens and ephemeral_1h_input_tokens remain
separate (or preserve TTL metadata) and adjust compute_cost to price them with
the correct cache-creation rate based on the specific TTL.

In `@miot-harness/tests/integration/test_agent_loop_cache_live.py`:
- Around line 109-121: Replace the manual ToolRegistry.__new__ construction in
_big_registry() with the public ToolRegistry() constructor. The helper only
needs an empty registry, and using the normal initialization path preserves
__init__ behavior instead of setting the private _tools field directly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b1f5a5a6-9eec-4d8d-a60c-b7e3f0859fff

📥 Commits

Reviewing files that changed from the base of the PR and between 0d42082 and 257915a.

📒 Files selected for processing (19)

miot-harness/src/miot_harness/agents/chat_models.py
miot-harness/src/miot_harness/agents/native_tools.py
miot-harness/src/miot_harness/api/server.py
miot-harness/src/miot_harness/config.py
miot-harness/src/miot_harness/observability/callbacks.py
miot-harness/src/miot_harness/runtime/agent_loop.py
miot-harness/src/miot_harness/runtime/agent_prompt.py
miot-harness/src/miot_harness/runtime/supervisor.py
miot-harness/tests/agents/test_chat_models_loop.py
miot-harness/tests/integration/__init__.py
miot-harness/tests/integration/conftest.py
miot-harness/tests/integration/test_agent_loop_cache_live.py
miot-harness/tests/observability/test_callbacks.py
miot-harness/tests/runtime/test_agent_loop.py
miot-harness/tests/runtime/test_agent_loop_payload.py
miot-harness/tests/runtime/test_agent_prompt.py
miot-harness/tests/runtime/test_supervisor_agent_loop.py
miot-harness/tests/test_config.py
miot-harness/tests/test_native_tools.py

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

odtorres and others added 12 commits July 2, 2026 15:27

feat(harness): add agent-loop feature flag and tool-result cap settings

1c4ce88

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat(harness): build native Anthropic tool definitions from the registry

6a3fec8

feat(harness): frozen cached system prompt for the agent loop

0c7dd1f

fix(harness): drop stray __init__.py — src uses namespace packages

a6100ca

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

test(harness): pin all agent-prompt rule phrases; fresh cache-control…

4a5f529

… dict per message Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat(harness): agent-loop request assembly with verified cache breakp…

c4398f7

…oints Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

fix(harness): drop unused imports; pin no-mutation invariant for list…

e399077

…-content tails Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat(harness): single-agent tool-calling loop runner (AgentLoopRunner)

e317282

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat(harness): wire AgentLoopRunner behind agents_agent_loop_enabled …

4eb4fa1

…flag

feat(harness): configurable LLM timeout for the agent loop

0a0b5b1

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

test(harness): live prompt-cache verification for the agent loop

db3e89e

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

odtorres self-assigned this Jul 3, 2026

coderabbitai Bot reviewed Jul 3, 2026

View reviewed changes

Comment thread miot-harness/src/miot_harness/runtime/agent_loop.py

test(harness): live-verify cache marker on tool-result turns

72704b3

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

korutx merged commit 7b89d97 into trunk Jul 3, 2026
12 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(harness): single-agent cached tool-calling loop (replaces agentic expert panel, flag-gated)#877

feat(harness): single-agent cached tool-calling loop (replaces agentic expert panel, flag-gated)#877
korutx merged 13 commits into
trunkfrom
worktree-feat-harness-single-agent-loop

odtorres commented Jul 3, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jul 3, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

odtorres commented Jul 3, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Architecture

Verification

Rollout plan (not in this PR)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

odtorres commented Jul 3, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jul 3, 2026 •

edited

Loading