Demo review: Thursday June 4 agent sessions — full inventory + what went off the rails

## Purpose

Review of all agent sessions run against **tpl-ca.nrp-nautilus.io** on **Thursday, June 4, 2026**, the day around the TPL demo. Goal: discuss what went well, what went off the rails, and why, then route fixes to the right layer (STAC / MCP / geo-agent / this app's system prompt).

Source: `open-llm-proxy` logs (`consolidated/daily/2026-06-04.parquet`), scoped to this app's origin. Watershed findings reproduced against live data via the duckdb-geo MCP tools.

> **Status (resolved):** system-prompt clarify/label-inference fix is **shipped & live**; USGS WBD import → data-workflows#200; geocoding tool → geo-agent#238. Session #5 re-assessed as a **success** (see below). Details in the comments.

## Traffic context (last 7 days)

| Day | LLM calls | Sessions |
|---|---|---|
| Jun 1 | 8 | 3 |
| Jun 2 | 46 | 5 |
| Jun 3 | 16 | 1 |
| **Jun 4 (Thu)** | **114** | **5** |
| Jun 5 | 3 | 1 |
| Jun 6 | 17 | 2 |

Thursday carried ~56% of the week's LLM calls (114 of 204). `user_question` below is the *session's opening question* (the proxy stamps it on every turn), so each row is one session, not one SQL query, and one "session" can contain several distinct user turns. Three models appear because the user switched models mid-session.

## Thursday June 4 — full inventory

| # | Session (opening question) | Model(s) | LLM calls | Tool calls | Max msg-count | Started (PT) | Wall-clock | LLM compute | Tokens | Outcome |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Which CA congressional districts had failed conservation ballot measures + funding at stake | minimax-m2 | 8 | 7 | 16 | 11:02 | 2 min | 86 s | 342 K | ✅ Answered; correctly noted the 6 statewide measures inflate all districts equally |
| 2 | Which CDs had greatest increase in conservation land in last 10 yrs | minimax-m2 | 4 | 4 | 9 | 11:04 | 1 min | 60 s | 166 K | ✅ Clean ranked table (2014–2018 Almanac) |
| 3 | "I work for TPL… preparing a grant application. What question would you be able to answer that would assist me?" | minimax-m2 → glm-5 → claude-sonnet-4-6 | 66 | 63 | 43 | 13:20 | 154 min | 16 min | 2.85 M | Multi-task exploratory session: capability tour, a Yosemite map demo, an address→watershed lookup, WCB/TPL funding tables. Mostly succeeded but **contained the watershed runaway**; also some WCB column churn (verified *not* a data bug — see C) |
| 4 | "What watershed is this address located in? 3109 6th Ave, LA 90018" | claude-sonnet-4-6 | 20 | 20 | 40 | 15:36 | 2 min | 75 s | 956 K | ⚠️ **The headline failure** — ~15 query retries, never found it in data, ended by fabricating the HUC code from numbering structure rather than from any dataset |
| 5 | Miles of the Pacific Crest Trail through the Tahoe basin | minimax-m2 | 16 | 17 | 20 | 15:57 | 6 min | 103 s | 656 K | ✅ **Success** (3 separate user turns). PCT answered (~50.4 mi); on a rivers follow-up the model **clarified before proceeding** — a positive clarify-first example. (Originally mis-flagged as "drift" — that was a grouping artifact.) |

**Totals:** 114 LLM calls, 111 tool calls, ~4.97 M tokens, ~21 min actual LLM compute. Session #3's 154-min wall-clock is mostly user think-time (16 min compute).
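
The totals can be cross-checked from the inventory rows. This is a minimal sketch that copies the per-session figures from the table by hand; the unit conversions (16 min to 960 s, 2.85 M to 2850 K) are the only assumptions, and the table values are themselves rounded.

```python
# (LLM calls, tool calls, LLM compute in seconds, tokens in thousands) per session,
# copied from the inventory table; 16 min and 2.85 M are converted to 960 s and 2850 K.
sessions = {
    1: (8, 7, 86, 342),
    2: (4, 4, 60, 166),
    3: (66, 63, 960, 2850),
    4: (20, 20, 75, 956),
    5: (16, 17, 103, 656),
}

# Transpose the rows into columns and sum each column.
llm_calls, tool_calls, compute_s, tokens_k = (sum(col) for col in zip(*sessions.values()))

compute_min = round(compute_s / 60, 1)
tokens_m = tokens_k / 1000
print(llm_calls, tool_calls, compute_min, tokens_m)
```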

## What went well

- **Sessions #1 and #2** (congressional-district conservation funding) are exactly the target use case and they worked cleanly in 1–2 min. Notably #1 caught a real analytical trap on its own — that 6 statewide ballot measures appear in every district and inflate the totals equally — and surfaced the meaningful county/municipal variation instead. Expert-grade reasoning.
- **Session #5** answered the PCT-mileage question and, on a follow-up about rivers, **asked a clarifying question instead of guessing** — the exact behavior we want.
- The **Almanac attribution caveats** in the system prompt held up — funding tables described WCB/Almanac sources correctly rather than calling it "TPL-protected land."

## What went off the rails (and why)

### A. Watershed-by-address (session #4, and a duplicate inside #3) — the headline failure

The model hardcoded approximate coordinates from memory (`lat=34.0369, lon=-118.2974`), then ran ~15 variations of "point → H3 h8 cell → join against `american-rivers-ira-watersheds`" before giving up and **inferring the HUC code from the structure of HUC numbering and presenting it as fact.**

Reproduced against live data — the loop was **doomed from the first query**:

1. **`american-rivers-ira-watersheds` is a thematic subset, not basemap coverage.** Title: "IRA-Influenced Drinking Water Watersheds (HUC12)" — only **9,991** HUC12 polygons selected by EPA funding-priority scores (~1/10 of national coverage). The LA address's `h8` (`613221950241636351`) is **absent** from the hex tiles; `ST_Contains` on the flat geoparquet returns **0** polygons. Urban LA isn't in this dataset.
2. **No complete-coverage HUC12 dataset exists in this catalog.** NRI-2024 is sparse river *lines*; `epa-sab-v3-cws` is water-*utility* service areas; HydroBasins is complete but uses Pfafstetter IDs, not HUC12. → **data-workflows#200** imports USGS WBD (national, HUC2–HUC12).
3. **No geocoding tool** — coordinates were guessed. → **geo-agent#238**.
4. The MCP H3 guide has no "point/address → containing region" recipe, so the join was improvised and fumbled repeatedly.
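
The missing "point → containing region" recipe, and the value of reporting a coverage gap explicitly, can be illustrated with a small pure-Python sketch. Everything here is hypothetical: the `containing_region` helper, the `Lookup` type, and the toy polygon are illustrative stand-ins, not the MCP tools or the real dataset. Only the guessed coordinates come from the session log. A real recipe would run the equivalent containment test in the catalog's query engine.

```python
from dataclasses import dataclass
from typing import Optional

Point = tuple[float, float]  # (lon, lat)


def point_in_ring(pt: Point, ring: list[Point]) -> bool:
    """Even-odd ray casting against a single closed ring (no holes)."""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        # Only edges that straddle the horizontal line through the point can cross the ray.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


@dataclass
class Lookup:
    region_id: Optional[str]
    note: str


def containing_region(pt: Point, regions: dict[str, list[Point]]) -> Lookup:
    """Return the first region containing pt, or an explicit coverage-gap result."""
    for region_id, ring in regions.items():
        if point_in_ring(pt, ring):
            return Lookup(region_id, "data-backed match")
    return Lookup(None, "no polygon contains this point: likely a coverage gap, not a query bug")


# Toy polygon (hypothetical), deliberately far from Los Angeles.
regions = {"toy-region-A": [(-120.0, 38.0), (-119.0, 38.0), (-119.0, 39.0), (-120.0, 39.0)]}

# Coordinates the model guessed in session #4 (lon, lat order).
LA_POINT = (-118.2974, 34.0369)
print(containing_region(LA_POINT, regions))
```

The design point is the second return value: a clean "nothing contains this point" is an answer about the dataset's coverage, and should be surfaced as such rather than retried.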

### B. ~~PCT-through-Tahoe (session #5) drifted~~ — RETRACTED

Original claim was wrong. Session #5 was three separate user turns (PCT mileage → a rivers follow-up where the model clarified → Wild & Scenic Rivers). No drift; it's a success. See the correction comment. The only minor carry-over: the PCT-miles computation used a **guessed bounding box** for the Tahoe basin (no Tahoe-basin polygon in the catalog) + `ST_` line clipping — the same "no AOI polygon → guess the extent" gap as the watershed case, but handled gracefully. Low priority.

### C. WCB funding column churn (inside session #3) — verified, NOT a data bug

The model called `get_schema` on `wcb-approved-projects` **4×** (good discipline). The STAC schema is **correct, complete, and self-documenting** (descriptions explain truncated names like `PrimGrante`). The churn was the model reverting to invented un-truncated names (`Grantee`, `dtcost`, `WCBExpend`) under context pressure deep in a 2.85M-token session, then self-correcting. Not a STAC bug; no dedicated issue filed. Only structural angle: shapefile-truncated column names are LLM-hostile (possible future data-workflows aliasing discussion).
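
One possible guard against this kind of churn is to validate column names against the schema before a query is sent. The sketch below is an assumption, not something the framework does today: `check_columns` is a hypothetical helper, and apart from `PrimGrante` the schema names are placeholders, since the full `wcb-approved-projects` schema is not reproduced in this review.

```python
import difflib


def check_columns(requested, schema_columns):
    """Map each requested name to itself if valid, else to the closest schema name (or None)."""
    result = {}
    for name in requested:
        if name in schema_columns:
            result[name] = name
        else:
            close = difflib.get_close_matches(name, schema_columns, n=1, cutoff=0.6)
            result[name] = close[0] if close else None
    return result


# `PrimGrante` is from the review; the other schema names are hypothetical placeholders.
schema = ["PrimGrante", "ProjName", "County"]
suggestions = check_columns(["Grantee", "County"], schema)
print(suggestions)
```

Here the invented name `Grantee` resolves to the truncated `PrimGrante`, which is the self-correction the model eventually made on its own.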

## The core lesson: explain what you have; don't guess, and label inference

"I don't have that data" is a dead end. The professional response for an expert tool is to **surface the relevant data the app does hold and its coverage**. When the model supplements from general knowledge (e.g. knowing the LA address is HUC `18070105` / LA River), it should **explicitly mark what is data-backed vs. its own inference** and never present a guess as a lookup. The watershed session did the opposite: three models all fabricated coordinates and a HUC code and ground through a dozen-plus queries against a dataset that could never contain the answer.

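The "label inference" rule can be pictured as a small data structure in which every statement in an answer carries its provenance. This is only an illustrative sketch: `Provenance`, `Claim`, and `render` are hypothetical names, and the shipped fix is a sentence in the system prompt, not code. The example claims restate facts from section A.

```python
from dataclasses import dataclass
from enum import Enum


class Provenance(Enum):
    DATA = "data-backed"
    INFERENCE = "model inference"


@dataclass
class Claim:
    text: str
    provenance: Provenance
    source: str = ""  # dataset id, expected for DATA claims


def render(claims):
    """Prefix each claim with its provenance tag so a guess never reads like a lookup."""
    lines = []
    for c in claims:
        tag = f"[{c.provenance.value}: {c.source}]" if c.source else f"[{c.provenance.value}]"
        lines.append(f"{tag} {c.text}")
    return "\n".join(lines)


claims = [
    Claim(
        "No polygon contains this address; the dataset holds 9,991 selected HUC12 units.",
        Provenance.DATA,
        "american-rivers-ira-watersheds",
    ),
    Claim("The address is likely in HUC 18070105 (LA River).", Provenance.INFERENCE),
]
answer = render(claims)
print(answer)
```
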
## Routing & status

| Fix | Layer | Status |
|---|---|---|
| System prompt: explain available data + coverage, and mark data-backed vs. inference | Layer 4 — app | ✅ **Shipped & live** (one appended sentence in `system-prompt.md`) |
| Import complete watershed coverage (USGS WBD, national HUC2–HUC12) | Layer 1 — data | 📋 **data-workflows#200** |
| Geocoding tool (address/place → coordinates) | Layer 3 — framework | 📋 **geo-agent#238** |
| H3 "point/address → containing region" recipe; "one clean no-match = coverage gap, stop retrying" guidance; optional runaway-loop cap | Layer 2 / 3 — MCP / geo-agent | ⬜ Not yet filed (lower priority) |
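
The unfiled "one clean no-match = coverage gap, stop retrying" guidance and the optional loop cap could look roughly like the sketch below. All names are hypothetical (`QueryError`, `run_with_cap`, `max_errors`), and it assumes the framework can tell a query that failed to execute apart from one that ran cleanly and returned zero rows.

```python
class QueryError(Exception):
    """Hypothetical error type for a query that failed to execute."""


def run_with_cap(attempts, max_errors=3):
    """Run query attempts in order; stop on the first clean result or after max_errors failures.

    attempts: iterable of zero-argument callables that return a list of rows
    or raise QueryError.
    """
    errors = 0
    for attempt in attempts:
        try:
            rows = attempt()
        except QueryError:
            errors += 1
            if errors >= max_errors:
                return "gave_up", []
            continue
        # The query ran cleanly: zero rows means the data is not there, so do not retry.
        return ("match", rows) if rows else ("coverage_gap", [])
    return "gave_up", []


def broken_query():
    raise QueryError("bad column name")


def clean_empty_query():
    return []


print(run_with_cap([broken_query, clean_empty_query, clean_empty_query]))
```

Under this policy, session #4 would have stopped at its first successful zero-row join instead of trying ~15 variations.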

## Remaining / optional

- Sweep Jun 2 (46 calls) + Jun 6 (17 calls) to confirm whether June 4's patterns repeat or surface new failure modes.
- Decide whether the H3 point-in-region recipe + loop-cap are worth filing.

