[codex] Support raw image refs for multimodal rendering by eligotts · Pull Request #89 · PrimeIntellect-ai/renderers

eligotts · 2026-06-18T07:04:47Z

Design update — inline/offload image storage

This PR now supports both raw image transport modes used by prime-rl:

- offload: the existing behavior. Raw image bytes are written to run-scoped image assets, and refs carry a file-backed image id.
- inline: `data:image` URIs stay inline, and raw refs carry the inline source instead of requiring raw_image_id.

This repo adds inline-capable mmraw:v3 refs while preserving mmraw:v2 parsing, keeps Qwen image hashes aligned to the raw decoded bytes, and emits raw descriptor items with either raw_image_id or raw_uri.
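
To make the two transport modes concrete, here is a minimal sketch of a descriptor that carries either field. Only the names `raw_image_id` and `raw_uri` come from this PR; the helper names, the dict layout, and the choice of SHA-256 are illustrative assumptions, not the actual `mm_store` code.

```python
import base64
import hashlib


def decode_data_uri(uri: str) -> bytes:
    """Return the decoded payload of a base64 image data URI (hypothetical helper)."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:image/") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data:image/... URI")
    return base64.b64decode(payload, validate=True)


def raw_descriptor(data_uri: str, mode: str) -> dict:
    """Build an illustrative descriptor; the dict shape is an assumption."""
    raw = decode_data_uri(data_uri)
    # Hash the decoded bytes so both modes agree on the image hash.
    digest = hashlib.sha256(raw).hexdigest()
    if mode == "offload":
        # Offload: the ref points at a file-backed asset by id.
        return {"hash": digest, "raw_image_id": digest}
    if mode == "inline":
        # Inline: the ref carries the data URI itself, so no asset id is needed.
        return {"hash": digest, "raw_uri": data_uri}
    raise ValueError(f"unknown mode: {mode!r}")


uri = "data:image/png;base64," + base64.b64encode(b"fake-png-bytes").decode()
offloaded = raw_descriptor(uri, "offload")
inlined = raw_descriptor(uri, "inline")
```

In this sketch the hash is identical in both modes because it is computed over the decoded bytes rather than over the URI string or the file path.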

Validation after latest push: uv run pytest tests/test_client.py -q passed (14 passed).

Design update — dropped the `None`/cache-only image path

This PR and its companions (prime-rl #2836 / verifiers #1746 / renderers #89) no longer use the "send None for already-cached images" mechanism. Every image carries its raw descriptor ref at every slot (current and prior turns), and /inference/v1/generate rematerializes each ref from disk on every request.

Why: the None path coupled correctness to deployment (LRU cache present, single replica / DP-affinity, no eviction) and surfaced a miss as a hard vLLM EngineDeadError (qwen3-vl mrope dereferences a None image_grid_thw) that the retry net couldn't catch across the engine→API IPC. Dropping it is deployment-agnostic (a miss is impossible) and non-hacky. vLLM's mm_hash encoder cache still skips the expensive GPU re-encode for free — we only forgo the cheap IPC/CPU-reprocess dedup.
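
The coupling described above can be illustrated with a toy model. This is a sketch under stated assumptions: `TinyLRU`, `resolve_cache_only`, and `resolve_always_ref` are hypothetical names, and the cache is a stand-in, not vLLM's cache or this repo's code.

```python
from collections import OrderedDict
from typing import Optional


class TinyLRU:
    """Toy stand-in for a per-replica processed-image cache (not vLLM's)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items = OrderedDict()

    def put(self, key: str, value: str) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry

    def get(self, key: str) -> Optional[str]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]


def resolve_cache_only(slots, cache):
    """Old style: a None slot must be served from the cache or the request fails."""
    out = []
    for image_hash, ref in slots:
        value = ref if ref is not None else cache.get(image_hash)
        if value is None:
            raise LookupError(f"cache miss for {image_hash}")
        out.append(value)
    return out


def resolve_always_ref(slots):
    """New style: every slot carries its ref, so no cache lookup is involved."""
    return [ref for _image_hash, ref in slots]


cache = TinyLRU(capacity=1)
cache.put("img-a", "features-a")
cache.put("img-b", "features-b")  # evicts img-a
refs = resolve_always_ref([("img-a", "ref-a"), ("img-b", "ref-b")])
```

Calling `resolve_cache_only([("img-a", None)], cache)` at this point raises `LookupError`, because the entry was evicted; the always-ref variant has no such failure mode.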

Validated: color-codeword (Qwen3-VL-4B) under DP=2, no affinity / no cache reliance: 0 crashes, 0 data=None, multi-turn accumulation correct, reward ~0.84. Also confirmed under TP.

This repo: every image emits a raw descriptor ref at every slot. _descriptor_only_mm_data no longer strips the pointer (pixel_values were never present in v1, so the strip was both stale and the root cause of the descriptor-only/rebuild churn). Removed the materialize_all_image_refs flag and the now-orphaned materialize_image_refs / materialize_kimi_image_refs.

Original description

Summary

- Adds generic mmraw:v2 raw multimodal refs in renderers.mm_store, parsed as RawMMRef objects with family, fingerprint, modality, hash, asset id, and adapter-owned payload.
- Emits strict prime_raw_mm_item envelopes instead of processed image payloads for Qwen-VL and Kimi K2.5 image rendering.
- Keeps adapter-specific layout details in renderer-owned payloads (image_grid_thw for Qwen, grid_thws/media token metadata for Kimi).
- Supports materializing all raw image refs for retry paths after vLLM multimodal cache misses.
- Keeps run-scoped image asset refs file-backed so the downstream Prime-RL trainer materializes images with its own processor.
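
As a rough picture of what a parsed ref holds, the sketch below models the listed fields as a dataclass with a round-trip encoding. The field list follows the summary above; everything else is an assumption, including the `mmraw:v2:` + JSON wire format, the `content_hash` and `asset_id` attribute names, and the `dump_ref` / `parse_ref` helpers. The real `RawMMRef` in `renderers.mm_store` may look quite different.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any

PREFIX = "mmraw:v2:"  # assumption: the real wire encoding may differ


@dataclass
class RawMMRef:
    family: str          # e.g. which renderer family produced the ref
    fingerprint: str     # layout fingerprint
    modality: str        # e.g. "image"
    content_hash: str    # corresponds to "hash" in the summary
    asset_id: str        # file-backed asset id
    payload: dict[str, Any] = field(default_factory=dict)  # adapter-owned


def dump_ref(ref: RawMMRef) -> str:
    """Serialize a ref to a prefixed JSON string (hypothetical format)."""
    return PREFIX + json.dumps(asdict(ref), sort_keys=True)


def parse_ref(text: str) -> RawMMRef:
    """Parse the hypothetical format back into a RawMMRef."""
    if not text.startswith(PREFIX):
        raise ValueError("not an mmraw:v2 ref")
    return RawMMRef(**json.loads(text[len(PREFIX):]))


ref = RawMMRef(
    family="qwen-vl",
    fingerprint="layout-fp",
    modality="image",
    content_hash="abc123",
    asset_id="asset-1",
    payload={"image_grid_thw": [1, 16, 16]},
)
round_tripped = parse_ref(dump_ref(ref))
```

Keeping the layout details inside `payload` mirrors the point that adapter-specific metadata stays renderer-owned rather than becoming part of the generic ref schema.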

Companion PRs

Prime-RL: Support v1 raw multimodal image offload prime-rl#2836
Verifiers: [codex] Support raw image offload in v1 train client verifiers#1746

Notes

Draft/WIP: stacked with the Verifiers and Prime-RL raw image offload PRs.
Verifiers is expected to offload image content to file://.../assets/images/... refs before rendering.
This intentionally treats raw image refs as the supported path, not processed multimodal feature sidecars.

Validation

uvx ruff@0.15.18 check . passed.
uvx ruff@0.15.18 format --check . passed.
uvx 'ty<0.0.22' check . exited 0; remaining diagnostics are warning-level advisories under the repo config.
PYTHONPATH=/home/ubuntu/renderers uv run --no-project --active pytest -q tests/test_client.py passed: 14 passed.
End-to-end hosted-style smoke through Prime-RL with /home/ubuntu/renderers, /home/ubuntu/verifiers, and /home/ubuntu/prime-rl-v1-raw-mm-offload completed inference, env rollouts, train batch creation, trainer step 0, and decoded strict trainer-bound raw image refs.

[!NOTE]

Support raw image refs for multimodal rendering in Qwen3-VL, Qwen3.5, and Kimi-K2.5 renderers

Adds a multimodal_output config field ('raw' or 'processed') to BaseRendererConfig; renderers default to 'raw', emitting file-URI image references and baked layout metadata instead of processed pixel tensors.

Introduces renderers/mm_store.py with utilities to construct, serialize, offload, and validate raw multimodal image references and layout fingerprints, with no torch/vLLM dependency.

Refactors image handling in renderers/qwen3_vl.py, renderers/qwen35.py, and renderers/kimi_k25.py to delegate image processing through shared helpers (qwen_image_item_for_render, kimi_image_item_for_render) and select between raw refs or processed payloads based on config.

Updates renderers/client.py to serialize multimodal features as raw references via _build_vllm_mm_features, removing the renderer-class-specific torch/vLLM encoding path.

Adds a vision optional dependency group in pyproject.toml for pillow, torch, and torchvision, required only for multimodal_output='processed'.

Risk: renderer constructors no longer accept an injected processor argument; callers relying on direct processor injection will break.
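
A minimal sketch of the config switch described in this summary, assuming a dataclass-style config. Only the field name `multimodal_output`, its two values, and the package names come from the text; `RendererConfigSketch` and `missing_vision_deps` are hypothetical and do not reflect the real `BaseRendererConfig`.

```python
from dataclasses import dataclass
from importlib.util import find_spec
from typing import Literal

MultimodalOutput = Literal["raw", "processed"]


@dataclass
class RendererConfigSketch:
    """Hypothetical stand-in for BaseRendererConfig."""

    multimodal_output: MultimodalOutput = "raw"

    def __post_init__(self) -> None:
        # Literal is not enforced at runtime, so validate explicitly.
        if self.multimodal_output not in ("raw", "processed"):
            raise ValueError(f"unsupported multimodal_output: {self.multimodal_output!r}")


def missing_vision_deps(config: RendererConfigSketch) -> list[str]:
    """List vision packages this mode needs that are not importable."""
    if config.multimodal_output == "raw":
        return []  # raw mode needs no pillow/torch/torchvision
    wanted = {"pillow": "PIL", "torch": "torch", "torchvision": "torchvision"}
    return [name for name, module in wanted.items() if find_spec(module) is None]


default_config = RendererConfigSketch()
```

A caller that still expects `pixel_values` would construct the config with `multimodal_output="processed"` and install the `vision` extra.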

Macroscope summarized e3c12e9.

Update: review hardening (`e3c12e9`)

Fixed render_completion_update mutating the caller's previous_multi_modal_data in place (shallow dict copy + setdefault().extend()); the merge now lives in a shared merge_multi_modal_data helper in base.py that copies inner lists, and the bridge test asserts the previous sidecar is unmutated.
Qwen resize math is imported from transformers (torch-free PIL-backend module, with a fallback for older layouts) instead of maintaining a port.
Each raw layout describe now resolves and reads its image asset once; mm_store uses full sha256 content-addressed filenames and raises on undecodable base64.
Added test_raw_layout_math_matches_image_processor: parity against the real Qwen3-VL-4B and Kimi-K2.5 (pinned revision) processors at rounding-boundary dimensions. Full suite: 2182 passed.
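
The in-place mutation fixed above is easy to reproduce. The sketch below contrasts the shallow-copy pattern with a merge that copies inner lists; `merge_shallow` and `merge_copying` are illustrative names, and the real `merge_multi_modal_data` in base.py may differ in signature and details.

```python
def merge_shallow(previous: dict, new: dict) -> dict:
    """Buggy pattern: dict(previous) copies keys but shares the inner lists."""
    merged = dict(previous)
    for key, items in new.items():
        merged.setdefault(key, []).extend(items)  # also mutates previous[key]
    return merged


def merge_copying(previous: dict, new: dict) -> dict:
    """Fixed pattern: copy every inner list before extending it."""
    merged = {key: list(items) for key, items in previous.items()}
    for key, items in new.items():
        merged.setdefault(key, []).extend(items)
    return merged


previous = {"image": ["ref-turn-1"]}
merge_shallow(previous, {"image": ["ref-turn-2"]})
leaked = list(previous["image"])  # the caller's sidecar now has two entries

previous = {"image": ["ref-turn-1"]}
merged = merge_copying(previous, {"image": ["ref-turn-2"]})
```

After the second call, `previous` still holds only the turn-1 ref, which is the property the bridge test asserts.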

Note

High Risk
Default multimodal sidecar shape changes break callers expecting pixel_values unless they set multimodal_output='processed', and inference correctness now depends on run-scoped image offload and ref materialization across companion services.

Overview
Multimodal inference now defaults to lightweight raw image descriptors instead of embedding processed pixel_values in MultiModalData. Qwen-VL, Qwen3.5, and Kimi K2.5 compute placeholder counts from family layout specs (offloaded file:// assets), emit prime_raw_mm_item envelopes via new renderers.mm_store, and bridge turns through shared merge_multi_modal_data.
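
For intuition on how a placeholder count can follow from layout math alone, here is a simplified Qwen-VL-style calculation. It is a sketch, not the code this PR uses (which imports the resize function from transformers): the function names are hypothetical, and the patch size, merge size, and pixel bounds are illustrative defaults that the real processors read from their own configs.

```python
import math


def smart_resize_sketch(height, width, factor=28, min_pixels=56 * 56, max_pixels=1280 * 28 * 28):
    """Round both sides to multiples of `factor`, keeping the area within bounds."""
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: scale down, rounding toward smaller multiples.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = max(factor, math.floor(height / beta / factor) * factor)
        w_bar = max(factor, math.floor(width / beta / factor) * factor)
    elif h_bar * w_bar < min_pixels:
        # Too small: scale up, rounding toward larger multiples.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


def qwen_layout_sketch(height, width, patch_size=14, merge_size=2):
    """Return (grid_thw, placeholder_count) for a single still image."""
    factor = patch_size * merge_size
    h, w = smart_resize_sketch(height, width, factor=factor)
    grid_thw = (1, h // patch_size, w // patch_size)
    # Each merge_size x merge_size block of patches becomes one placeholder token.
    placeholders = (grid_thw[1] * grid_thw[2]) // (merge_size ** 2)
    return grid_thw, placeholders


grid, count = qwen_layout_sketch(224, 224)
```

With these illustrative defaults, a 224x224 image maps to a 16x16 patch grid and 64 placeholder tokens, without decoding any pixels into tensors.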

multimodal_output on renderer configs selects raw (inference / vLLM) vs processed (lazy HF processor + cache for SFT). Auto/default renderer resolution propagates this flag.

generate() no longer stacks tensors with vLLM-specific encoders; it builds features from hashes, placeholders, and mmraw: refs the endpoint materializes. Prebuilt prompts can pull multi_modal_data from prompt_attribution when omitted.

Packaging: optional vision extra for Pillow/torch when using processed mode. Per-renderer image_cache_max config fields are removed. Tests expect offloaded images and assert raw-ref wire shape plus layout parity with real processors.

Reviewed by Cursor Bugbot for commit e3c12e9.


Drop the cache-only None path. Every image (current and prior turns) carries its raw descriptor ref; _descriptor_only_mm_data no longer strips the pointer, so refs carry forward without a rebuild. Removes the now-orphaned materialize_image_refs / materialize_kimi_image_refs and the materialize_all_image_refs flag. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tale comments
- Drop the render-time processor constructor arg from Qwen3VL/Qwen35/Kimi renderers: geometry is computed deterministically from config; no renderer runs the HF image processor at render. Remove Kimi dead _get_processor/_process_image/self._processor/_image_cache.
- mm_store: remove all backcompat aliases (MMRAW_PREFIX, MM_RAW_PAYLOAD_KEY/VALUE, mmraw_ref, split_mmraw_ref, image_asset_dir) -- no consumers.
- client.py: fix stale generate() docstring + comment that referenced the removed None/cache path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


macroscopeapp · 2026-06-29T16:37:02Z

Approvability

Verdict: Needs human review

This PR introduces a new multimodal rendering mode that changes default behavior from sending processed image tensors to sending image refs/descriptors. The change affects wire format to inference endpoints and introduces new abstractions in mm_store.py. The scope and behavioral impact warrant human review.


cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit 4e3502f.

- Fix bridge merge mutating the caller's previous sidecar in place: shared merge_multi_modal_data helper copies inner lists, replacing the three per-renderer merge blocks; bridge test asserts no mutation.
- Import Qwen's smart_resize from transformers (torch-free PIL-backend module) instead of maintaining a port.
- Resolve and read each raw image asset once per layout describe.
- mm_store: full sha256 content-addressed filenames; raise on undecodable base64 instead of silently passing the data URL through.
- Drop the dead features/mm_data tuple plumbing in client.generate.
- Add layout-math parity test against the real Qwen3-VL and Kimi-K2.5 image processors at rounding-boundary dimensions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
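
The content-addressed offload mentioned in this commit can be sketched as follows. This is an illustration under assumptions: `offload_data_url` is a hypothetical name, the file is stored without an extension, and the real `mm_store` layout and error types may differ.

```python
import base64
import binascii
import hashlib
import tempfile
from pathlib import Path


def offload_data_url(data_url: str, asset_dir: Path) -> str:
    """Write the decoded image under its full SHA-256 and return a file:// URI."""
    header, sep, payload = data_url.partition(",")
    if not sep or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URL")
    try:
        raw = base64.b64decode(payload, validate=True)
    except binascii.Error as exc:
        # Fail loudly instead of passing the original data URL through.
        raise ValueError("undecodable base64 image payload") from exc
    digest = hashlib.sha256(raw).hexdigest()  # full 64-hex-character digest
    asset_dir.mkdir(parents=True, exist_ok=True)
    path = asset_dir / digest
    if not path.exists():
        path.write_bytes(raw)  # identical bytes always map to the same file
    return path.resolve().as_uri()


with tempfile.TemporaryDirectory() as tmp:
    url = "data:image/png;base64," + base64.b64encode(b"fake-png-bytes").decode()
    first = offload_data_url(url, Path(tmp) / "assets" / "images")
    second = offload_data_url(url, Path(tmp) / "assets" / "images")
```

Because the filename is derived from the bytes, offloading the same image twice yields the same URI, and a truncated digest cannot cause two different images to collide on one path.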

Support raw image refs for multimodal rendering

32d5a9d

This was referenced Jun 18, 2026

[codex] Support raw image offload in v1 train client PrimeIntellect-ai/verifiers#1746

Open

Support v1 raw multimodal image offload PrimeIntellect-ai/prime-rl#2836

Open

eligotts and others added 10 commits June 20, 2026 07:41

Emit generic raw multimodal refs

4bc1766

Merge remote-tracking branch 'origin/main' into codex/raw-image-asset…

eaa07bb

…s-renderers

Fix raw image renderer style checks

a8f4386

Remove orphaned image_cache_max config field

f8ca354

It only sized Kimi per-renderer image cache, which was deleted with the render-time processor path. No consumers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Align raw multimodal renderer descriptors

c33805d

Merge remote-tracking branch 'origin/main' into codex/raw-image-asset…

8fcd0c7

…s-renderers
# Conflicts:
# renderers/configs.py
# renderers/qwen3_vl.py

feat: support inline raw image refs

e97c812

Clean up raw multimodal offload renderers

673b790

eligotts marked this pull request as ready for review June 29, 2026 16:36

cursor Bot reviewed Jun 29, 2026


Comment thread renderers/client.py

Comment thread renderers/client.py Outdated

eligotts added 2 commits June 29, 2026 16:57

Clarify raw image asset contract

af84c19

Apply ruff formatting

4e3502f

cursor Bot reviewed Jun 29, 2026


Comment thread renderers/client.py Outdated

eligotts and others added 6 commits June 29, 2026 17:37

Preserve multimodal sidecar for prebuilt prompts

998e1db

Use URI-based raw image refs

2a19d75

Drop raw multimodal version markers

aa2d44d

Support processed multimodal renderer output

ed5b404

Trim uv lock churn

a7953b9

eligotts wants to merge 19 commits into main from codex/raw-image-assets-renderers


2 participants
