Purpose
Review of all agent sessions run against tpl-ca.nrp-nautilus.io on Thursday June 4, 2026 — the day around the TPL demo. Goal: discuss what went well, what went off the rails, and why, then route fixes to the right layer (STAC / MCP / geo-agent / this app's system-prompt).
Source: open-llm-proxy logs (consolidated/daily/2026-06-04.parquet), scoped to this app's origin. Watershed findings reproduced against live data via the duckdb-geo MCP tools.
Status (resolved): system-prompt clarify/label-inference fix is shipped & live; USGS WBD import → data-workflows#200; geocoding tool → geo-agent#238. Session #5 re-assessed as a success (see below). Details in the comments.
Traffic context (last 7 days)
| Day | LLM calls | Sessions |
| --- | --- | --- |
| Jun 1 | 8 | 3 |
| Jun 2 | 46 | 5 |
| Jun 3 | 16 | 1 |
| Jun 4 (Thu) | 114 | 5 |
| Jun 5 | 3 | 1 |
| Jun 6 | 17 | 2 |
Thursday carried ~56% of the week's LLM calls (114 of 204). user_question below is the session's opening question (the proxy stamps it on every turn), so each row is one session, not one SQL query, and one "session" can contain several distinct user turns. Session #3 lists three models because the user switched models mid-session.
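The grouping described above can be sketched roughly as follows. This is a minimal illustration, not the actual log pipeline: only user_question is named in this review, while the model field and the sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical proxy log rows; only `user_question` is a field named in the review.
rows = [
    {"user_question": "Q1", "model": "minimax-m2"},
    {"user_question": "Q1", "model": "glm-5"},
    {"user_question": "Q2", "model": "minimax-m2"},
    {"user_question": "Q1", "model": "glm-5"},
]

def summarize_sessions(rows):
    """Group LLM calls by opening question; count calls and list models in first-seen order."""
    sessions = defaultdict(lambda: {"llm_calls": 0, "models": []})
    for row in rows:
        session = sessions[row["user_question"]]
        session["llm_calls"] += 1
        if row["model"] not in session["models"]:
            session["models"].append(row["model"])
    return dict(sessions)

summary = summarize_sessions(rows)
```

Because the key is the opening question, several later user turns collapse into the same row, which is how a single "session" can hide distinct questions.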
Thursday June 4 — full inventory
| # | Session (opening question) | Model(s) | LLM calls | Tool calls | Max msg-count | Started (PT) | Wall-clock | LLM compute | Tokens | Outcome |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Which CA congressional districts had failed conservation ballot measures + funding at stake | minimax-m2 | 8 | 7 | 16 | 11:02 | 2 min | 86 s | 342 K | ✅ Answered; correctly noted the 6 statewide measures inflate all districts equally |
| 2 | Which CDs had greatest increase in conservation land in last 10 yrs | minimax-m2 | 4 | 4 | 9 | 11:04 | 1 min | 60 s | 166 K | ✅ Clean ranked table (2014–2018 Almanac) |
| 3 | "I work for TPL… preparing a grant application. What question would you be able to answer that would assist me?" | minimax-m2 → glm-5 → claude-sonnet-4-6 | 66 | 63 | 43 | 13:20 | 154 min | 16 min | 2.85 M | Multi-task exploratory session: capability tour, a Yosemite map demo, an address→watershed lookup, WCB/TPL funding tables. Mostly succeeded but contained the watershed runaway; also some WCB column churn (verified not a data bug; see C) |
| 4 | "What watershed is this address located in? 3109 6th Ave, LA 90018" | claude-sonnet-4-6 | 20 | 20 | 40 | 15:36 | 2 min | 75 s | 956 K | ⚠️ The headline failure: ~15 query retries, never found it in data, ended by fabricating the HUC code from numbering structure rather than from any dataset |
| 5 | Miles of the Pacific Crest Trail through the Tahoe basin | minimax-m2 | 16 | 17 | 20 | 15:57 | 6 min | 103 s | 656 K | ✅ Success (3 separate user turns). PCT answered (~50.4 mi); on a rivers follow-up the model clarified before proceeding, a positive clarify-first example. (Originally mis-flagged as "drift"; that was a grouping artifact.) |
Totals: 114 LLM calls, ~111 tool calls, ~4.97 M tokens, ~20 min actual LLM compute. Session #3's 154-min wall-clock is mostly user think-time (16 min compute).
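As a quick cross-check, the totals above can be re-derived from the per-session rows of the inventory table. The figures below are copied from that table; session #3's compute is listed only as "16 min", so its value in seconds is an approximation.

```python
# (LLM calls, tool calls, LLM compute in seconds, tokens in thousands) per session,
# copied from the inventory table. Session #3 compute is reported as "16 min".
sessions = {
    1: (8, 7, 86, 342),
    2: (4, 4, 60, 166),
    3: (66, 63, 16 * 60, 2850),
    4: (20, 20, 75, 956),
    5: (16, 17, 103, 656),
}

llm_calls = sum(s[0] for s in sessions.values())
tool_calls = sum(s[1] for s in sessions.values())
compute_min = sum(s[2] for s in sessions.values()) / 60
tokens_m = sum(s[3] for s in sessions.values()) / 1000

print(f"{llm_calls} LLM calls, {tool_calls} tool calls, "
      f"{tokens_m:.2f} M tokens, {compute_min:.1f} min compute")
```

The compute sum lands slightly above 20 minutes, consistent with the rounded "~20 min" figure.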
What went well
What went off the rails (and why)
A. Watershed-by-address (session #4, and a duplicate inside #3) — the headline failure
The model hardcoded approximate coordinates from memory (lat=34.0369, lon=-118.2974), then ran ~15 variations of "point → H3 h8 cell → join against american-rivers-ira-watersheds" before giving up and inferring the HUC code from the structure of HUC numbering and presenting it as fact.
Reproduced against live data — the loop was doomed from the first query:
- american-rivers-ira-watersheds is a thematic subset, not basemap coverage. Title: "IRA-Influenced Drinking Water Watersheds (HUC12)"; only 9,991 HUC12 polygons, selected by EPA funding-priority scores (~1/10 of national coverage). The LA address's h8 (613221950241636351) is absent from the hex tiles; ST_Contains on the flat geoparquet returns 0 polygons. Urban LA isn't in this dataset.
- No complete-coverage HUC12 dataset exists in this catalog. NRI-2024 is sparse river lines; epa-sab-v3-cws is water-utility service areas; HydroBasins is complete but uses Pfafstetter IDs, not HUC12. → data-workflows#200 imports USGS WBD (national, HUC2–HUC12).
- No geocoding tool — coordinates were guessed. → geo-agent#238.
- The MCP H3 guide has no "point/address → containing region" recipe, so the join was improvised and fumbled repeatedly.
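A "point → containing region" lookup of the kind the missing recipe would describe might look like the sketch below. It is a pure-Python stand-in for a spatial containment test such as ST_Contains, using toy rectangles; the region ids and polygons are hypothetical, and this is not the MCP guide's recipe. The point is that a lookup returning no region is a definitive answer about coverage, not a reason to rewrite the join.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) lies inside the ring of (lon, lat) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Count edges that a horizontal ray heading east from the point crosses.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def containing_region(lon, lat, regions):
    """Return the id of the first region containing the point, or None for a coverage gap."""
    for region_id, ring in regions.items():
        if point_in_ring(lon, lat, ring):
            return region_id
    return None

# Hypothetical toy polygons standing in for a partial-coverage dataset.
regions = {
    "region-a": [(-120.0, 36.0), (-119.0, 36.0), (-119.0, 37.0), (-120.0, 37.0)],
    "region-b": [(-122.0, 38.0), (-121.0, 38.0), (-121.0, 39.0), (-122.0, 39.0)],
}

hit = containing_region(-119.5, 36.5, regions)
miss = containing_region(-118.2974, 34.0369, regions)  # the guessed LA coordinates
```

Here miss is None: the toy dataset simply does not cover the point, which mirrors the zero-polygon result reproduced against the live data.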
B. "PCT-through-Tahoe (session #5) drifted" (RETRACTED)
Original claim was wrong. Session #5 was three separate user turns (PCT mileage → a rivers follow-up where the model clarified → Wild & Scenic Rivers). No drift; it's a success. See the correction comment. The only minor carry-over: the PCT-miles computation used a guessed bounding box for the Tahoe basin (no Tahoe-basin polygon in the catalog) + ST_ line clipping — the same "no AOI polygon → guess the extent" gap as the watershed case, but handled gracefully. Low priority.
C. WCB funding column churn (inside session #3) — verified, NOT a data bug
The model called get_schema on wcb-approved-projects 4× (good discipline). The STAC schema is correct, complete, and self-documenting (descriptions explain truncated names like PrimGrante). The churn was the model reverting to invented un-truncated names (Grantee, dtcost, WCBExpend) under context pressure deep in a 2.85M-token session, then self-correcting. Not a STAC bug; no dedicated issue filed. Only structural angle: shapefile-truncated column names are LLM-hostile (possible future data-workflows aliasing discussion).
The core lesson: explain what you have; don't guess, and label inference
"I don't have that data" is a dead end. The professional response for an expert tool is to surface the relevant data the app does hold and its coverage. When the model supplements from general knowledge (e.g. knowing the LA address is HUC 18070105 / LA River), it should explicitly mark what is data-backed vs. its own inference and never present a guess as a lookup. The watershed session did the opposite: three models all fabricated coordinates and a HUC code and ground through a dozen-plus queries against a dataset that could never contain the answer.
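One way to make the data-backed vs. inference distinction concrete is to carry provenance on every claim and refuse to render a "data-backed" claim that names no dataset. The sketch below is purely illustrative: the Claim and Provenance types are hypothetical and are not part of the app, the MCP tools, or the shipped system prompt.

```python
from dataclasses import dataclass
from enum import Enum

class Provenance(Enum):
    DATA_BACKED = "data-backed"
    INFERENCE = "model inference"

@dataclass
class Claim:
    text: str
    provenance: Provenance
    source: str = ""  # dataset id; required when the claim is data-backed

def render(claims):
    """Prefix each claim with its provenance so a guess never reads as a lookup."""
    lines = []
    for claim in claims:
        tag = claim.provenance.value
        if claim.provenance is Provenance.DATA_BACKED:
            if not claim.source:
                raise ValueError("a data-backed claim must name its dataset")
            tag = f"{tag}: {claim.source}"
        lines.append(f"[{tag}] {claim.text}")
    return "\n".join(lines)

claims = [
    Claim("The address falls outside this dataset's coverage.",
          Provenance.DATA_BACKED, "american-rivers-ira-watersheds"),
    Claim("General knowledge suggests HUC 18070105 (LA River); not verified here.",
          Provenance.INFERENCE),
]
report = render(claims)
```

Under this scheme the watershed answer would have shipped with the HUC code visibly tagged as inference rather than presented as a query result.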
Routing & status
| Fix | Layer | Status |
| --- | --- | --- |
| System prompt: explain available data + coverage, and mark data-backed vs. inference | Layer 4 (app) | ✅ Shipped & live (one appended sentence in system-prompt.md) |
| Import complete watershed coverage (USGS WBD, national HUC2–HUC12) | Layer 1 (data) | 📋 data-workflows#200 |
| Geocoding tool (address/place → coordinates) | Layer 3 (framework) | 📋 geo-agent#238 |
| H3 "point/address → containing region" recipe; "one clean no-match = coverage gap, stop retrying" guidance; optional runaway-loop cap | Layer 2 / 3 (MCP / geo-agent) | ⬜ Not yet filed (lower priority) |
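The "one clean no-match = coverage gap, stop retrying" guidance and the optional loop cap in the last row could be sketched as below. This is an assumption about what such a guard might look like, not geo-agent's implementation; QueryError, CoverageGap, and run_with_guard are hypothetical names.

```python
class QueryError(Exception):
    """Hypothetical: the query itself failed (bad column name, syntax error)."""

class CoverageGap(Exception):
    """Hypothetical: the query ran cleanly but matched nothing."""

def run_with_guard(attempts, max_attempts=3):
    """Run query attempts in order, stopping at the first clean result.

    attempts: iterable of zero-argument callables, each returning a list of rows.
    """
    for count, attempt in enumerate(attempts, start=1):
        if count > max_attempts:
            break  # runaway-loop cap
        try:
            rows = attempt()
        except QueryError:
            continue  # a broken query is worth a corrected retry
        if not rows:
            # A clean zero-row result signals missing coverage, not a bad query.
            raise CoverageGap("no match: report dataset coverage instead of retrying")
        return rows
    raise RuntimeError("all attempts failed or the attempt cap was reached")

def broken():
    raise QueryError("unknown column")

try:
    run_with_guard([broken, lambda: [], lambda: [("never reached",)]])
    outcome = "rows"
except CoverageGap:
    outcome = "coverage gap"
```

The distinction between a failed query and an empty result is the design choice that matters: only the former justifies another attempt.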
Remaining / optional
- Sweep Jun 2 (46 calls) + Jun 6 (17 calls) to confirm whether June 4's patterns repeat or surface new failure modes.
- Decide whether the H3 point-in-region recipe + loop-cap are worth filing.