Detect and surface malformed/empty LLM responses instead of silently returning bad output

The namer expects each LLM call to return JSON of the form `{"topic_name": <topic_name>, "topic_score": <topic_score>}` (or something like this in the new version). When a reasoning model exhausts its token budget, it returns an empty string (`''`) with `finish_reason="length"`, but we currently do nothing to catch, warn, or log this. The failure passes through silently, making it hard to tell that breakage is happening at the LLM prompt/response layer. This affects any reasoning-capable model where the token budget can be consumed entirely by reasoning, leaving no tokens for the final answer.

## Problem
Two distinct failure modes go undetected:

1. **Empty / off-format content** — responses like `''` or `{}` that clearly don't match the expected schema are passed downstream without any signal that something went wrong.
2. **Non-`stop` finish reasons** — when `finish_reason` is `"length"` (or anything other than `"stop"`), it's a strong indicator the response is truncated or unusable, but we don't inspect it. The combination of an empty string *and* a non-`stop` finish reason is the specific signature of the reasoning-token-exhaustion bug.

Debugging this case also surfaced that the current debug tooling doesn't expose enough to diagnose problems like this efficiently.

## Proposed changes

**Response validation**
- Check `finish_reason` on each response. If it isn't `"stop"`, at minimum log it.
- Escalate the response: if `finish_reason` is not `"stop"` **and** the content is empty/off-format, warn (or raise) more aggressively, since this is almost certainly a real failure rather than a benign edge case.
- Validate that the content parses into the expected `{"topic_name", "topic_score"}` shape; if it's very off-format (empty string, `{}`, unparseable), let the user know breakage is happening at the LLM layer rather than failing silently downstream.

**Debug tooling**
- Include `finish_reason` in the debug callback.
- Include other useful response metadata that would help diagnose what's going wrong (e.g. token usage, raw content, model, refusal/stop details).
- Include a test with a user facing hook (like .connectivity_status()) that uses a real prompt to make it easier to surface what's going on 

**Documenting debug tooling/process**
- Include debugging tooling and how to use it in the docs (currently there exists some tooling but it is not referenced in teh docs)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Detect and surface malformed/empty LLM responses instead of silently returning bad output #155

Problem

Proposed changes

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Detect and surface malformed/empty LLM responses instead of silently returning bad output #155

Description

Problem

Proposed changes

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions