
[codex] Support raw image refs for multimodal rendering #89

Open
eligotts wants to merge 19 commits into main from codex/raw-image-assets-renderers

@eligotts eligotts commented Jun 18, 2026

Contributor

Design update — inline/offload image storage

This PR now supports both raw image transport modes used by prime-rl:

  • offload: existing behavior, raw image bytes are written to run-scoped image assets and refs carry a file-backed image id.
  • inline: data-image URIs remain inline and raw refs carry the inline source instead of requiring raw_image_id.

This repo adds inline-capable mmraw:v3 refs while preserving mmraw:v2 parsing, keeps Qwen image hashes aligned to the raw decoded bytes, and emits raw descriptor items with either raw_image_id or raw_uri.
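
A minimal sketch of the either/or descriptor shape described above, assuming a descriptor item is a plain dict. Only the key names `raw_image_id` and `raw_uri` come from this PR; the function `make_raw_descriptor`, its `mode` argument, the `hash` key, and the hard-coded `image/png` MIME type are hypothetical and are not the actual `mm_store` API.

```python
import base64
import hashlib
from typing import Optional


def make_raw_descriptor(mode: str, *, image_bytes: bytes, image_id: Optional[str] = None) -> dict:
    """Build an illustrative raw descriptor item (hypothetical shape).

    offload: the ref carries a file-backed image id (raw_image_id).
    inline:  the ref carries the data-image URI itself (raw_uri).
    """
    # Hash the raw decoded bytes so both transport modes agree on the hash.
    digest = hashlib.sha256(image_bytes).hexdigest()
    if mode == "offload":
        if image_id is None:
            raise ValueError("offload mode needs a file-backed image id")
        return {"hash": digest, "raw_image_id": image_id}
    if mode == "inline":
        payload = base64.b64encode(image_bytes).decode("ascii")
        # MIME type is assumed here; a real implementation would preserve the source type.
        return {"hash": digest, "raw_uri": f"data:image/png;base64,{payload}"}
    raise ValueError(f"unknown mode: {mode!r}")
```

Because the hash is taken over the decoded bytes rather than the URI or file path, the same image yields the same hash in either mode.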

Validation after latest push: uv run pytest tests/test_client.py -q passed (14 passed).

Design update — dropped the None/cache-only image path

This PR and its companions (prime-rl #2836 / verifiers #1746 / renderers #89) no longer use the "send None for already-cached images" mechanism. Every image carries its raw descriptor ref at every slot (current and prior turns); /inference/v1/generate rematerializes each ref from disk every request.

Why: the None path coupled correctness to deployment details (an LRU cache being present, a single replica or DP affinity, no eviction). A cache miss surfaced as a hard vLLM EngineDeadError (qwen3-vl mrope dereferences a None image_grid_thw) that the retry net couldn't catch across the engine→API IPC. Dropping the path makes the behavior deployment-agnostic, since a miss is no longer possible, and removes the special case. vLLM's mm_hash encoder cache still skips the expensive GPU re-encode at no extra cost; we only forgo the cheap IPC/CPU-reprocess dedup.

Validated: color-codeword (Qwen3-VL-4B) under DP=2, no affinity / no cache reliance: 0 crashes, 0 data=None, multi-turn accumulation correct, reward ~0.84. Also confirmed under TP.

This repo: every image emits a raw descriptor ref at every slot. _descriptor_only_mm_data no longer strips the pointer (pixel_values were never present in v1, so the strip was both stale and the root cause of the descriptor-only/rebuild churn). Removed the materialize_all_image_refs flag and the now-orphaned materialize_image_refs / materialize_kimi_image_refs.


Original description

Summary

  • adds generic mmraw:v2 raw multimodal refs in renderers.mm_store, parsed as RawMMRef objects with family, fingerprint, modality, hash, asset id, and adapter-owned payload
  • emits strict prime_raw_mm_item envelopes instead of processed image payloads for Qwen-VL and Kimi K2.5 image rendering
  • keeps adapter-specific layout details in renderer-owned payloads (image_grid_thw for Qwen, grid_thws/media token metadata for Kimi)
  • supports materializing all raw image refs for retry paths after vLLM multimodal cache misses
  • keeps run-scoped image asset refs file-backed so downstream Prime-RL trainer materializes images with its own processor
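
To make the field list above concrete, here is a rough sketch of a ref object with a round-trip through a string form. The field names follow the summary (family, fingerprint, modality, hash, asset id, adapter-owned payload), but the wire encoding is an assumption: the `mmraw:v2:` prefix followed by base64url JSON, and the `encode_ref` / `parse_ref` helpers, are hypothetical and may not match what `renderers.mm_store` actually does.

```python
import base64
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict

PREFIX = "mmraw:v2:"  # assumed prefix; the real wire format may differ


@dataclass(frozen=True)
class RawMMRef:
    family: str        # e.g. a renderer family name
    fingerprint: str   # layout fingerprint
    modality: str      # e.g. "image"
    hash: str          # content hash of the raw asset
    asset_id: str      # run-scoped asset id
    payload: Dict[str, Any] = field(default_factory=dict)  # adapter-owned layout details


def encode_ref(ref: RawMMRef) -> str:
    # Deterministic JSON so equal refs always serialize to the same string.
    body = json.dumps(asdict(ref), sort_keys=True, separators=(",", ":"))
    return PREFIX + base64.urlsafe_b64encode(body.encode("utf-8")).decode("ascii")


def parse_ref(text: str) -> RawMMRef:
    if not text.startswith(PREFIX):
        raise ValueError("not an mmraw:v2 ref")
    body = base64.urlsafe_b64decode(text[len(PREFIX):])
    return RawMMRef(**json.loads(body))
```

Keeping the payload as an opaque dict mirrors the stated design choice that layout details such as `image_grid_thw` stay owned by the adapter rather than by the generic ref type.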

Companion PRs

Notes

  • Draft/WIP: stacked with the Verifiers and Prime-RL raw image offload PRs.
  • Verifiers is expected to offload image content to file://.../assets/images/... refs before rendering.
  • This intentionally treats raw image refs as the supported path, not processed multimodal feature sidecars.

Validation

  • uvx ruff@0.15.18 check . passed.
  • uvx ruff@0.15.18 format --check . passed.
  • uvx 'ty<0.0.22' check . exited 0; remaining diagnostics are warning-level advisories under the repo config.
  • PYTHONPATH=/home/ubuntu/renderers uv run --no-project --active pytest -q tests/test_client.py passed: 14 passed.
  • End-to-end hosted-style smoke through Prime-RL with /home/ubuntu/renderers, /home/ubuntu/verifiers, and /home/ubuntu/prime-rl-v1-raw-mm-offload completed inference, env rollouts, train batch creation, trainer step 0, and decoded strict trainer-bound raw image refs.

[!NOTE]

Support raw image refs for multimodal rendering in Qwen3-VL, Qwen3.5, and Kimi-K2.5 renderers

  • Adds a multimodal_output config field ('raw' or 'processed') to BaseRendererConfig; renderers default to 'raw', emitting file-URI image references and baked layout metadata instead of processed pixel tensors.
  • Introduces renderers/mm_store.py with utilities to construct, serialize, offload, and validate raw multimodal image references and layout fingerprints, with no torch/vLLM dependency.
  • Refactors image handling in renderers/qwen3_vl.py, renderers/qwen35.py, and renderers/kimi_k25.py to delegate image processing through shared helpers (qwen_image_item_for_render, kimi_image_item_for_render) and select between raw refs or processed payloads based on config.
  • Updates renderers/client.py to serialize multimodal features as raw references via _build_vllm_mm_features, removing the renderer-class-specific torch/vLLM encoding path.
  • Adds a vision optional dependency group in pyproject.toml for pillow, torch, and torchvision, required only for multimodal_output='processed'.
  • Risk: renderer constructors no longer accept an injected processor argument; callers relying on direct processor injection will break.
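
As an illustration of the `multimodal_output` switch summarized above, the following sketch validates the field and picks an output kind. Only the field name and its two values come from the summary; `RendererConfigSketch` and `image_item_kind` are hypothetical stand-ins, not the real `BaseRendererConfig` or renderer helpers.

```python
from dataclasses import dataclass
from typing import Literal

MultimodalOutput = Literal["raw", "processed"]


@dataclass(frozen=True)
class RendererConfigSketch:
    """Hypothetical stand-in for BaseRendererConfig; only the field name is from the PR."""

    multimodal_output: MultimodalOutput = "raw"

    def __post_init__(self) -> None:
        # Literal is not enforced at runtime, so check the value explicitly.
        if self.multimodal_output not in ("raw", "processed"):
            raise ValueError(
                f"multimodal_output must be 'raw' or 'processed', got {self.multimodal_output!r}"
            )


def image_item_kind(config: RendererConfigSketch) -> str:
    # raw: emit a file-URI ref plus layout metadata; no torch is needed.
    # processed: run the image processor, which needs the optional vision extra.
    return "raw_ref" if config.multimodal_output == "raw" else "processed_payload"
```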

Macroscope summarized e3c12e9.

Update: review hardening (e3c12e9)

  • Fixed render_completion_update mutating the caller's previous_multi_modal_data in place (shallow dict copy + setdefault().extend()); the merge now lives in a shared merge_multi_modal_data helper in base.py that copies inner lists, and the bridge test asserts the previous sidecar is unmutated.
  • Qwen resize math is imported from transformers (torch-free PIL-backend module, with a fallback for older layouts) instead of maintaining a port.
  • Each raw layout describe call resolves and reads every image asset once; mm_store uses full sha256 content-addressed filenames and raises on undecodable base64 instead of silently passing the data URL through.
  • Added test_raw_layout_math_matches_image_processor: parity against the real Qwen3-VL-4B and Kimi-K2.5 (pinned revision) processors at rounding-boundary dimensions. Full suite: 2182 passed.
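
The in-place mutation fixed in the first bullet can be reproduced in a few lines. This is a sketch of the bug pattern and of a copying merge; the function names and the flat `dict` of lists are simplifications, and the real `merge_multi_modal_data` helper in base.py may have a different signature.

```python
def merge_shallow_buggy(previous: dict, new: dict) -> dict:
    merged = dict(previous)  # shallow copy: inner lists are still shared with `previous`
    for key, items in new.items():
        # For an existing key, setdefault returns the caller's list, so extend mutates it.
        merged.setdefault(key, []).extend(items)
    return merged


def merge_copying(previous: dict, new: dict) -> dict:
    # Copy each inner list first, so extending never touches the caller's data.
    merged = {key: list(items) for key, items in previous.items()}
    for key, items in new.items():
        merged.setdefault(key, []).extend(items)
    return merged


previous = {"image": ["turn-1"]}
merged = merge_copying(previous, {"image": ["turn-2"]})
assert merged == {"image": ["turn-1", "turn-2"]}
assert previous == {"image": ["turn-1"]}  # previous sidecar is unmutated
```

The final assertion is the same kind of check the bridge test is described as making.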

Note

High Risk
Default multimodal sidecar shape changes break callers expecting pixel_values unless they set multimodal_output='processed', and inference correctness now depends on run-scoped image offload and ref materialization across companion services.

Overview
Multimodal inference now defaults to lightweight raw image descriptors instead of embedding processed pixel_values in MultiModalData. Qwen-VL, Qwen3.5, and Kimi K2.5 compute placeholder counts from family layout specs (offloaded file:// assets), emit prime_raw_mm_item envelopes via new renderers.mm_store, and bridge turns through shared merge_multi_modal_data.
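
To illustrate how a placeholder count can be computed from image dimensions alone, here is a simplified sketch of a Qwen-style layout calculation. It is not the code in this PR, which imports the resize math from transformers. The function names, the patch size of 16, the merge size of 2, and the pixel bounds are all illustrative assumptions.

```python
import math


def resize_to_grid(height: int, width: int, factor: int, min_pixels: int, max_pixels: int):
    """Round both sides to multiples of `factor`, then rescale into the pixel budget."""
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: shrink, rounding down so the result stays under the budget.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: grow, rounding up so the result reaches the minimum.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


def describe_layout(height: int, width: int, patch_size: int = 16, merge_size: int = 2,
                    min_pixels: int = 32 * 32, max_pixels: int = 1024 * 1024) -> dict:
    """Return an illustrative grid and placeholder count without touching pixel data."""
    factor = patch_size * merge_size
    h_bar, w_bar = resize_to_grid(height, width, factor, min_pixels, max_pixels)
    grid_h, grid_w = h_bar // patch_size, w_bar // patch_size
    # Each merge_size x merge_size block of patches collapses into one placeholder token.
    placeholders = (grid_h * grid_w) // (merge_size ** 2)
    return {"image_grid_thw": [1, grid_h, grid_w], "num_placeholders": placeholders}
```

Under these assumed defaults, a 64x96 image maps to a 4x6 patch grid and 6 placeholder tokens. Nothing here needs torch or the decoded pixels, only the image dimensions.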

multimodal_output on renderer configs selects raw (inference / vLLM) vs processed (lazy HF processor + cache for SFT). Auto/default renderer resolution propagates this flag.

generate() no longer stacks tensors with vLLM-specific encoders; it builds features from hashes, placeholders, and mmraw: refs the endpoint materializes. Prebuilt prompts can pull multi_modal_data from prompt_attribution when omitted.

Packaging: optional vision extra for Pillow/torch when using processed mode. Per-renderer image_cache_max config fields are removed. Tests expect offloaded images and assert raw-ref wire shape plus layout parity with real processors.

Reviewed by Cursor Bugbot for commit e3c12e9.

eligotts and others added 10 commits June 20, 2026 07:41
Drop the cache-only None path. Every image (current and prior turns) carries its raw descriptor ref; _descriptor_only_mm_data no longer strips the pointer, so refs carry forward without a rebuild. Removes the now-orphaned materialize_image_refs / materialize_kimi_image_refs and the materialize_all_image_refs flag.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tale comments

- Drop the render-time processor constructor arg from Qwen3VL/Qwen35/Kimi renderers: geometry is computed deterministically from config; no renderer runs the HF image processor at render. Remove Kimi dead _get_processor/_process_image/self._processor/_image_cache.

- mm_store: remove all backcompat aliases (MMRAW_PREFIX, MM_RAW_PAYLOAD_KEY/VALUE, mmraw_ref, split_mmraw_ref, image_asset_dir) -- no consumers.

- client.py: fix stale generate() docstring + comment that referenced the removed None/cache path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
It only sized Kimi per-renderer image cache, which was deleted with the render-time processor path. No consumers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s-renderers

# Conflicts:
#	renderers/configs.py
#	renderers/qwen3_vl.py
@eligotts eligotts marked this pull request as ready for review June 29, 2026 16:36
@macroscopeapp

macroscopeapp Bot commented Jun 29, 2026


Approvability

Verdict: Needs human review

This PR introduces a new multimodal rendering mode that changes default behavior from sending processed image tensors to sending image refs/descriptors. The change affects wire format to inference endpoints and introduces new abstractions in mm_store.py. The scope and behavioral impact warrant human review.



@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 4e3502f.

eligotts and others added 6 commits June 29, 2026 17:37
- Fix bridge merge mutating the caller's previous sidecar in place:
  shared merge_multi_modal_data helper copies inner lists, replacing the
  three per-renderer merge blocks; bridge test asserts no mutation.
- Import Qwen's smart_resize from transformers (torch-free PIL-backend
  module) instead of maintaining a port.
- Resolve and read each raw image asset once per layout describe.
- mm_store: full sha256 content-addressed filenames; raise on
  undecodable base64 instead of silently passing the data URL through.
- Drop the dead features/mm_data tuple plumbing in client.generate.
- Add layout-math parity test against the real Qwen3-VL and Kimi-K2.5
  image processors at rounding-boundary dimensions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
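
The mm_store hardening in this commit (full sha256 content-addressed filenames, raising on undecodable base64) could look roughly like the sketch below. The function name `offload_data_uri`, the flat directory layout, and the extension-less filenames are assumptions for illustration, not the actual mm_store implementation.

```python
import base64
import binascii
import hashlib
from pathlib import Path


def offload_data_uri(data_uri: str, asset_dir: Path) -> Path:
    """Write a base64 data-image URI to a content-addressed file and return its path."""
    header, sep, encoded = data_uri.partition(",")
    if not sep or not header.startswith("data:image/") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data:image/... URI")
    try:
        # validate=True rejects non-alphabet characters instead of silently skipping them.
        raw = base64.b64decode(encoded, validate=True)
    except binascii.Error as exc:
        raise ValueError("undecodable base64 image payload") from exc
    # Full sha256 digest as the filename: identical bytes always map to the same file.
    path = asset_dir / hashlib.sha256(raw).hexdigest()
    asset_dir.mkdir(parents=True, exist_ok=True)
    if not path.exists():
        path.write_bytes(raw)
    return path
```

Using the full digest rather than a truncated prefix avoids filename collisions, and raising on bad input means a malformed data URL fails at offload time instead of being passed through unchanged.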