Skip to content

Detect and surface malformed/empty LLM responses instead of silently returning bad output #155

Description

@acwooding

The namer expects each LLM call to return JSON of the form {"topic_name": <topic_name>, "topic_score": <topic_score>} (or something like this in the new version). When a reasoning model exhausts its token budget, it returns an empty string ('') with finish_reason="length", but we currently do nothing to catch, warn, or log this. The failure passes through silently, making it hard to tell that breakage is happening at the LLM prompt/response layer. This affects any reasoning-capable model where the token budget can be consumed entirely by reasoning, leaving no tokens for the final answer.

Problem

Two distinct failure modes go undetected:

  1. Empty / off-format content — responses like '' or {} that clearly don't match the expected schema are passed downstream without any signal that something went wrong.
  2. Non-stop finish reasons — when finish_reason is "length" (or anything other than "stop"), it's a strong indicator the response is truncated or unusable, but we don't inspect it. The combination of an empty string and a non-stop finish reason is the specific signature of the reasoning-token-exhaustion bug.

Debugging this case also surfaced that the current debug tooling doesn't expose enough to diagnose problems like this efficiently.

Proposed changes

Response validation

  • Check finish_reason on each response. If it isn't "stop", at minimum log it.
  • Escalate the response: if finish_reason is not "stop" and the content is empty/off-format, warn (or raise) more aggressively, since this is almost certainly a real failure rather than a benign edge case.
  • Validate that the content parses into the expected {"topic_name", "topic_score"} shape; if it's very off-format (empty string, {}, unparseable), let the user know breakage is happening at the LLM layer rather than failing silently downstream.

Debug tooling

  • Include finish_reason in the debug callback.
  • Include other useful response metadata that would help diagnose what's going wrong (e.g. token usage, raw content, model, refusal/stop details).
  • Include a test with a user facing hook (like .connectivity_status()) that uses a real prompt to make it easier to surface what's going on

Documenting debug tooling/process

  • Include debugging tooling and how to use it in the docs (currently there exists some tooling but it is not referenced in teh docs)

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions