Demo review: Thursday June 4 agent sessions — full inventory + what went off the rails #26

@boettiger-lab-llm-agent

Purpose

Review of all agent sessions run against tpl-ca.nrp-nautilus.io on Thursday June 4, 2026 — the day around the TPL demo. Goal: discuss what went well, what went off the rails, and why, then route fixes to the right layer (STAC / MCP / geo-agent / this app's system-prompt).

Source: open-llm-proxy logs (consolidated/daily/2026-06-04.parquet), scoped to this app's origin. Watershed findings reproduced against live data via the duckdb-geo MCP tools.

Status (resolved): system-prompt clarify/label-inference fix is shipped & live; USGS WBD import → data-workflows#200; geocoding tool → geo-agent#238. Session #5 re-assessed as a success (see below). Details in the comments.

Traffic context (last 7 days)

| Day | LLM calls | Sessions |
|---|---|---|
| Jun 1 | 8 | 3 |
| Jun 2 | 46 | 5 |
| Jun 3 | 16 | 1 |
| Jun 4 (Thu) | 114 | 5 |
| Jun 5 | 3 | 1 |
| Jun 6 | 17 | 2 |

Thursday carried ~56% of the LLM calls in this window (114 of 204). user_question below is the session's opening question (the proxy stamps it on every turn), so each row is one session, not one SQL query, and one "session" can contain several distinct user turns. Three models appear in session #3 because the user switched models mid-session.
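As a quick check on Thursday's share of traffic, the daily counts from the table above can be summed directly. This is a minimal sketch; the variable names are illustrative and are not part of the proxy's log schema.

```python
# Daily LLM-call counts copied from the traffic table above.
daily_calls = {
    "Jun 1": 8,
    "Jun 2": 46,
    "Jun 3": 16,
    "Jun 4": 114,
    "Jun 5": 3,
    "Jun 6": 17,
}

total_calls = sum(daily_calls.values())
thursday_share = daily_calls["Jun 4"] / total_calls

print(f"{daily_calls['Jun 4']} of {total_calls} calls = {thursday_share:.1%}")
```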

Thursday June 4 — full inventory

| # | Session (opening question) | Model(s) | LLM calls | Tool calls | Max msg-count | Started (PT) | Wall-clock | LLM compute | Tokens | Outcome |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Which CA congressional districts had failed conservation ballot measures + funding at stake | minimax-m2 | 8 | 7 | 16 | 11:02 | 2 min | 86 s | 342 K | ✅ Answered; correctly noted the 6 statewide measures inflate all districts equally |
| 2 | Which CDs had greatest increase in conservation land in last 10 yrs | minimax-m2 | 4 | 4 | 9 | 11:04 | 1 min | 60 s | 166 K | ✅ Clean ranked table (2014–2018 Almanac) |
| 3 | "I work for TPL… preparing a grant application. What question would you be able to answer that would assist me?" | minimax-m2 → glm-5 → claude-sonnet-4-6 | 66 | 63 | 43 | 13:20 | 154 min | 16 min | 2.85 M | Multi-task exploratory session: capability tour, a Yosemite map demo, an address→watershed lookup, WCB/TPL funding tables. Mostly succeeded but contained the watershed runaway; also some WCB column churn (verified not a data bug; see C) |
| 4 | "What watershed is this address located in? 3109 6th Ave, LA 90018" | claude-sonnet-4-6 | 20 | 20 | 40 | 15:36 | 2 min | 75 s | 956 K | ⚠️ The headline failure: ~15 query retries, never found it in data, ended by fabricating the HUC code from numbering structure rather than from any dataset |
| 5 | Miles of the Pacific Crest Trail through the Tahoe basin | minimax-m2 | 16 | 17 | 20 | 15:57 | 6 min | 103 s | 656 K | Success (3 separate user turns). PCT answered (~50.4 mi); on a rivers follow-up the model clarified before proceeding, a positive clarify-first example. (Originally mis-flagged as "drift"; that was a grouping artifact.) |

Totals: 114 LLM calls, ~111 tool calls, ~4.97 M tokens, ~20 min actual LLM compute. Session #3's 154-min wall-clock is mostly user think-time (16 min compute).
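The totals can be re-derived from the per-session figures in the inventory above. A minimal sketch, assuming session #3's compute is exactly the reported 16 min (the inventory gives it only to the minute, so the compute total is approximate):

```python
# Per-session figures copied from the inventory above:
# (llm_calls, tool_calls, compute_seconds, tokens_in_thousands)
sessions = {
    1: (8, 7, 86, 342),
    2: (4, 4, 60, 166),
    3: (66, 63, 16 * 60, 2850),  # compute is reported only as "16 min"
    4: (20, 20, 75, 956),
    5: (16, 17, 103, 656),
}

# Transpose the rows into columns and sum each column.
llm_calls, tool_calls, compute_s, tokens_k = (
    sum(column) for column in zip(*sessions.values())
)

print(llm_calls, tool_calls, round(compute_s / 60, 1), tokens_k / 1000)
```

The compute column sums to about 21 minutes, in line with the rounded "~20 min" figure.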

What went well

What went off the rails (and why)

A. Watershed-by-address (session #4, and a duplicate inside #3) — the headline failure

The model hardcoded approximate coordinates from memory (lat=34.0369, lon=-118.2974), then ran ~15 variations of "point → H3 h8 cell → join against american-rivers-ira-watersheds" before giving up and inferring the HUC code from the structure of HUC numbering and presenting it as fact.

Reproduced against live data — the loop was doomed from the first query:

  1. american-rivers-ira-watersheds is a thematic subset, not basemap coverage. Title: "IRA-Influenced Drinking Water Watersheds (HUC12)" — only 9,991 HUC12 polygons selected by EPA funding-priority scores (~1/10 of national coverage). The LA address's h8 (613221950241636351) is absent from the hex tiles; ST_Contains on the flat geoparquet returns 0 polygons. Urban LA isn't in this dataset.
  2. No complete-coverage HUC12 dataset exists in this catalog. NRI-2024 is sparse river lines; epa-sab-v3-cws is water-utility service areas; HydroBasins is complete but uses Pfafstetter IDs, not HUC12. → data-workflows#200 imports USGS WBD (national, HUC2–HUC12).
  3. No geocoding tool — coordinates were guessed. → geo-agent#238.
  4. The MCP H3 guide has no "point/address → containing region" recipe, so the join was improvised and fumbled repeatedly.
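The missing "point → containing region" recipe can be illustrated with a plain point-in-polygon lookup that treats "no polygon contains the point" as a first-class result rather than an error to retry. This is a sketch only: the real stack does this with H3 hex tiles and ST_Contains through the duckdb-geo MCP tools, while the code below swaps in pure-Python ray casting over a toy region. The `Region` type, the `toy-region-1` id, and its square geometry are hypothetical; only the coordinates of the LA point come from the session.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

Point = tuple[float, float]  # (lon, lat)


@dataclass(frozen=True)
class Region:
    region_id: str
    ring: Sequence[Point]  # exterior ring only, no holes


def point_in_ring(pt: Point, ring: Sequence[Point]) -> bool:
    """Even-odd ray casting: count edge crossings of a ray going east from pt."""
    x, y = pt
    inside = False
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % len(ring)]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's latitude
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def containing_region(pt: Point, regions: Sequence[Region]) -> Optional[Region]:
    """Return the first region containing pt, or None if the dataset has no coverage there."""
    for region in regions:
        if point_in_ring(pt, region.ring):
            return region
    return None


# Toy dataset with a single hypothetical square region.
REGIONS = [Region("toy-region-1", [(-120, 36), (-119, 36), (-119, 37), (-120, 37)])]

covered = containing_region((-119.5, 36.5), REGIONS)
la_address = containing_region((-118.2974, 34.0369), REGIONS)  # coordinates the model guessed
```

A `None` result here corresponds to the zero-row ST_Contains result above: it signals a coverage gap in the dataset, not a malformed query.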

B. PCT-through-Tahoe (session #5) drifted — RETRACTED

Original claim was wrong. Session #5 was three separate user turns (PCT mileage → a rivers follow-up where the model clarified → Wild & Scenic Rivers). No drift; it's a success. See the correction comment. The only minor carry-over: the PCT-miles computation used a guessed bounding box for the Tahoe basin (no Tahoe-basin polygon in the catalog) + ST_ line clipping — the same "no AOI polygon → guess the extent" gap as the watershed case, but handled gracefully. Low priority.

C. WCB funding column churn (inside session #3) — verified, NOT a data bug

The model called get_schema on wcb-approved-projects (good discipline). The STAC schema is correct, complete, and self-documenting (descriptions explain truncated names like PrimGrante). The churn was the model reverting to invented un-truncated names (Grantee, dtcost, WCBExpend) under context pressure deep in a 2.85M-token session, then self-correcting. Not a STAC bug; no dedicated issue filed. Only structural angle: shapefile-truncated column names are LLM-hostile (possible future data-workflows aliasing discussion).
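One way to catch this kind of column-name reversion early is to check every column a generated query references against the schema before running it. The sketch below is an assumption, not something geo-agent or the MCP server does today. Only PrimGrante is a real column name taken from this review; the other schema entries are made-up stand-ins for shapefile-truncated names, and `check_columns` is a hypothetical helper.

```python
import difflib

# Hypothetical schema: only PrimGrante comes from the review text;
# the remaining names are invented placeholders for truncated columns.
SCHEMA_COLUMNS = ["PrimGrante", "ProjName", "TotCost", "WCBAmt"]


def check_columns(requested, schema=SCHEMA_COLUMNS):
    """Return {unknown_column: [close schema matches]} for names not in the schema."""
    problems = {}
    for name in requested:
        if name not in schema:
            problems[name] = difflib.get_close_matches(name, schema, n=3, cutoff=0.6)
    return problems


# "Grantee" and "dtcost" are two of the invented names the model reverted to.
problems = check_columns(["PrimGrante", "Grantee", "dtcost"])
for unknown, suggestions in problems.items():
    print(f"unknown column {unknown!r}; closest schema names: {suggestions}")
```

Returning suggestions instead of only rejecting the query gives the model a direct path back to the truncated name without another schema round-trip.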

The core lesson: explain what you have; don't guess, and label inference

"I don't have that data" is a dead end. The professional response for an expert tool is to surface the relevant data the app does hold and its coverage — and when the model supplements from general knowledge (e.g. knowing the LA address is HUC 18070105 / LA River), to explicitly mark what is data-backed vs. its own inference, never present a guess as a lookup. The watershed session did the opposite: three models all fabricated coordinates and a HUC code and ground through a dozen-plus queries against a dataset that could never contain the answer.

Routing & status

| Fix | Layer | Status |
|---|---|---|
| System prompt: explain available data + coverage, and mark data-backed vs. inference | Layer 4 (app) | Shipped & live (one appended sentence in system-prompt.md) |
| Import complete watershed coverage (USGS WBD, national HUC2–HUC12) | Layer 1 (data) | 📋 data-workflows#200 |
| Geocoding tool (address/place → coordinates) | Layer 3 (framework) | 📋 geo-agent#238 |
| H3 "point/address → containing region" recipe; "one clean no-match = coverage gap, stop retrying" guidance; optional runaway-loop cap | Layer 2 / 3 (MCP / geo-agent) | ⬜ Not yet filed (lower priority) |
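The unfiled "one clean no-match = coverage gap, stop retrying" guidance and the optional loop cap could look roughly like the guard below. This is a sketch under stated assumptions, not an existing geo-agent or MCP feature: `guarded_lookup`, `QueryError`, the status strings, and the default cap of 3 are all hypothetical.

```python
class QueryError(Exception):
    """Raised by the (hypothetical) query runner when SQL fails to execute."""


def guarded_lookup(candidate_queries, run_query, max_attempts=3):
    """Try candidate queries in order, but stop on the first clean empty result.

    A failed query (QueryError) may justify a rewrite, so the next candidate is
    tried. A query that runs cleanly and returns no rows is treated as a
    coverage gap, and no further variations are attempted.
    """
    attempts = 0
    for sql in candidate_queries:
        if attempts >= max_attempts:
            return {"status": "attempt_cap_reached", "attempts": attempts, "rows": []}
        attempts += 1
        try:
            rows = run_query(sql)
        except QueryError:
            continue  # malformed query: the next candidate may fix it
        if rows:
            return {"status": "match", "attempts": attempts, "rows": rows}
        return {"status": "coverage_gap", "attempts": attempts, "rows": []}
    return {"status": "no_query_succeeded", "attempts": attempts, "rows": []}


# Fake runner for illustration: queries containing "typo" fail, others return no rows.
executed = []


def fake_run_query(sql):
    executed.append(sql)
    if "typo" in sql:
        raise QueryError("unknown column")
    return []


result = guarded_lookup(["SELECT typo", "SELECT 1", "SELECT 2"], fake_run_query)
capped = guarded_lookup(["typo a", "typo b", "typo c", "typo d"], fake_run_query)
```

In the first call the guard stops after the second query: the first failed, the second ran cleanly with zero rows, and the third is never sent. Applied to session #4, the same rule would have ended the loop after the first clean no-match instead of ~15 retries.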

Remaining / optional

  • Sweep Jun 2 (46 calls) + Jun 6 (17 calls) to confirm whether June 4's patterns repeat or surface new failure modes.
  • Decide whether the H3 point-in-region recipe + loop-cap are worth filing.
