Bound all unbounded retry and polling loops; validate LLM-built camera tree#57

Open
fviolette26 wants to merge 1 commit into HKUDS:main from fviolette26:fix/batch2-hang-fixes


Summary

Several failure paths previously hung forever (or spent tokens without bound) instead of surfacing an error:

  • utils/image.py / utils/video.py: download_image/download_video used bare @retry (tenacity: retry forever, zero wait) around a requests.get with no timeout, so an expired signed URL became an infinite hot loop. Now: 3 attempts, exponential backoff, fail-fast on 4xx, connect/read timeouts (shared policy in utils/retry.py).
  • tools/video_generator_doubao_seedance_yunwu_api.py — task creation retried every second forever on any exception (a bad API key never surfaced; the exception wasn't even included in the log message), and polling had no deadline. Now checks HTTP status, fails fast on 4xx, bounds create retries, and polls with a deadline plus a consecutive-error cap.
  • tools/video_generator_omni_yunwu_api.py — same unbounded create loop, fixed the same way; the poll deadline now defaults to 300 polls instead of an unbounded None.
  • agents/event_extractor.py / pipelines/novel2movie_pipeline.py — event/scene extraction looped on the LLM-asserted is_last flag with no cap; a model that never sets it spent tokens forever. Hard caps now abort with a clear error.
  • agents/camera_image_generator.py: construct_camera_tree accepted whatever parent graph the LLM emitted; a cycle deadlocked frame generation forever (cameras awaiting events only their descendants would set), with no error. The tree is now validated for length, unknown parents, self-parents, and cycles. Also removes a duplicated parent_shot_idx assignment.
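
For reviewers, the four camera-tree checks listed above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function name `validate_camera_tree` and the representation (a list where `parents[i]` is the parent index of camera `i`, or `None` for a root) are assumptions made for the example.

```python
from typing import Optional, Sequence, Set


def validate_camera_tree(parents: Sequence[Optional[int]], expected_len: int) -> None:
    """Raise ValueError unless `parents` describes a forest over range(expected_len).

    Hypothetical helper: parents[i] is the parent index of camera i, or None for a root.
    """
    # Length check: the LLM must return exactly one entry per camera.
    if len(parents) != expected_len:
        raise ValueError(f"expected {expected_len} cameras, got {len(parents)}")

    for idx, parent in enumerate(parents):
        if parent is None:
            continue
        # Unknown parent: the index must refer to an existing camera.
        if not 0 <= parent < expected_len:
            raise ValueError(f"camera {idx} has unknown parent {parent}")
        # Self-parent: a camera would wait on itself forever.
        if parent == idx:
            raise ValueError(f"camera {idx} is its own parent")

    # Cycle check: walk up from each camera; revisiting a node on the
    # current path means the parent chain never reaches a root.
    done: Set[int] = set()
    for start in range(expected_len):
        path: Set[int] = set()
        node: Optional[int] = start
        while node is not None and node not in done:
            if node in path:
                raise ValueError(f"cycle through camera {node}")
            path.add(node)
            node = parents[node]
        done |= path
```

Under these assumptions, `validate_camera_tree([None, 0, 1], 3)` returns without error, while `[1, 2, 1]` raises on the cycle between cameras 1 and 2 before any frame generation would start waiting.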

Test plan

uv run --with pytest python -m pytest tests/ — 116 passed (102 existing + 14 new in tests/test_hang_guards.py). The new tests' fakes succeed after N calls, so the fixed code must give up before the fake would have succeeded — they fail fast against the old behavior instead of hanging.
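
The "fake succeeds after N calls" pattern described above might look like the sketch below. It is not taken from tests/test_hang_guards.py: `FlakyFetch`, `call_with_bounded_retry`, and the parameter names are hypothetical, and the retry helper is a plain loop standing in for the tenacity-based policy so that the example stays self-contained.

```python
import time


class FlakyFetch:
    """Hypothetical fake: fails until call number `succeed_on`, then returns a payload."""

    def __init__(self, succeed_on: int) -> None:
        self.calls = 0
        self.succeed_on = succeed_on

    def __call__(self) -> bytes:
        self.calls += 1
        if self.calls < self.succeed_on:
            raise ConnectionError("simulated transient failure")
        return b"ok"


def call_with_bounded_retry(fn, max_attempts=3, base_delay=0.0, sleep=time.sleep):
    """Call fn at most max_attempts times with exponential backoff, then re-raise."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # bound reached: surface the error instead of looping
            sleep(base_delay * 2 ** (attempt - 1))


def test_gives_up_before_fake_succeeds():
    # The fake would succeed on call 5, so a 3-attempt bound must fail first.
    fake = FlakyFetch(succeed_on=5)
    try:
        call_with_bounded_retry(fake, max_attempts=3)
    except ConnectionError:
        pass
    else:
        raise AssertionError("expected the retry loop to give up")
    assert fake.calls == 3
```

Against an unbounded retry loop the same fake would eventually return `b"ok"`, so the test fails quickly rather than hanging, which is the property the test plan relies on.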

🤖 Generated with Claude Code

…a tree

Several failure paths previously hung forever instead of erroring:

- utils download_image/download_video used bare @retry (retry forever,
  no wait) around requests.get with no timeout: an expired signed URL
  became an infinite hot loop. Now: 3 attempts, exponential backoff,
  fail-fast on 4xx, connect/read timeouts.
- The doubao-seedance client retried task creation every second forever
  on any exception (including auth errors, which it never surfaced) and
  polled with no deadline. The omni client had the same create loop and
  an unbounded default poll. Both now check HTTP status, fail fast on
  4xx, bound create retries, and default the poll deadline (300 polls).
- Event/scene extraction looped on the LLM-asserted is_last flag with
  no cap, so a model that never set it spent tokens without bound. The
  extractor and both pipeline loops now abort at a hard cap.
- construct_camera_tree accepted whatever parent graph the LLM emitted;
  a cycle deadlocked frame generation forever (cameras awaiting events
  only their descendants would set). The tree is now validated for
  length, unknown parents, self-parents, and cycles. Also removes a
  duplicated parent_shot_idx assignment.

Adds regression tests for every bound.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>