[codex] Support raw image refs for multimodal rendering#89
Conversation
Drop the cache-only None path. Every image (current and prior turns) carries its raw descriptor ref; _descriptor_only_mm_data no longer strips the pointer, so refs carry forward without a rebuild. Removes the now-orphaned materialize_image_refs / materialize_kimi_image_refs and the materialize_all_image_refs flag. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tale comments - Drop the render-time processor constructor arg from Qwen3VL/Qwen35/Kimi renderers: geometry is computed deterministically from config; no renderer runs the HF image processor at render. Remove Kimi dead _get_processor/_process_image/self._processor/_image_cache. - mm_store: remove all backcompat aliases (MMRAW_PREFIX, MM_RAW_PAYLOAD_KEY/VALUE, mmraw_ref, split_mmraw_ref, image_asset_dir) -- no consumers. - client.py: fix stale generate() docstring + comment that referenced the removed None/cache path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
It only sized Kimi per-renderer image cache, which was deleted with the render-time processor path. No consumers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s-renderers # Conflicts: # renderers/configs.py # renderers/qwen3_vl.py
ApprovabilityVerdict: Needs human review This PR introduces a new multimodal rendering mode that changes default behavior from sending processed image tensors to sending image refs/descriptors. The change affects wire format to inference endpoints and introduces new abstractions in mm_store.py. The scope and behavioral impact warrant human review. You can customize Macroscope's approvability policy. Learn more. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 4e3502f. Configure here.
- Fix bridge merge mutating the caller's previous sidecar in place: shared merge_multi_modal_data helper copies inner lists, replacing the three per-renderer merge blocks; bridge test asserts no mutation. - Import Qwen's smart_resize from transformers (torch-free PIL-backend module) instead of maintaining a port. - Resolve and read each raw image asset once per layout describe. - mm_store: full sha256 content-addressed filenames; raise on undecodable base64 instead of silently passing the data URL through. - Drop the dead features/mm_data tuple plumbing in client.generate. - Add layout-math parity test against the real Qwen3-VL and Kimi-K2.5 image processors at rounding-boundary dimensions. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Design update — inline/offload image storage
This PR now supports both raw image transport modes used by prime-rl:
offload: existing behavior, raw image bytes are written to run-scoped image assets and refs carry a file-backed image id.inline: data-image URIs remain inline and raw refs carry the inline source instead of requiringraw_image_id.This repo adds inline-capable
mmraw:v3refs while preservingmmraw:v2parsing, keeps Qwen image hashes aligned to the raw decoded bytes, and emits raw descriptor items with eitherraw_image_idorraw_uri.Validation after latest push:
uv run pytest tests/test_client.py -qpassed (14 passed).Design update — dropped the
None/cache-only image pathThis PR and its companions (prime-rl #2836 / verifiers #1746 / renderers #89) no longer use the "send
Nonefor already-cached images" mechanism. Every image carries its raw descriptor ref at every slot (current and prior turns);/inference/v1/generaterematerializes each ref from disk every request.Why: the
Nonepath coupled correctness to deployment (LRU cache present, single replica / DP-affinity, no eviction) and surfaced a miss as a hard vLLMEngineDeadError(qwen3-vl mrope dereferences aNoneimage_grid_thw) that the retry net couldn't catch across the engine→API IPC. Dropping it is deployment-agnostic (a miss is impossible) and non-hacky. vLLM'smm_hashencoder cache still skips the expensive GPU re-encode for free — we only forgo the cheap IPC/CPU-reprocess dedup.Validated: color-codeword (Qwen3-VL-4B) under DP=2, no affinity / no cache reliance: 0 crashes, 0
data=None, multi-turn accumulation correct, reward ~0.84. Also confirmed under TP.This repo: every image emits a raw descriptor ref at every slot.
_descriptor_only_mm_datano longer strips the pointer (pixel_valueswere never present in v1, so the strip was both stale and the root cause of the descriptor-only/rebuild churn). Removed thematerialize_all_image_refsflag and the now-orphanedmaterialize_image_refs/materialize_kimi_image_refs.Original description
Summary
mmraw:v2raw multimodal refs inrenderers.mm_store, parsed asRawMMRefobjects withfamily,fingerprint,modality, hash, asset id, and adapter-owned payloadprime_raw_mm_itemenvelopes instead of processed image payloads for Qwen-VL and Kimi K2.5 image renderingimage_grid_thwfor Qwen,grid_thws/media token metadata for Kimi)Companion PRs
Notes
file://.../assets/images/...refs before rendering.Validation
uvx ruff@0.15.18 check .passed.uvx ruff@0.15.18 format --check .passed.uvx 'ty<0.0.22' check .exited 0; remaining diagnostics are warning-level advisories under the repo config.PYTHONPATH=/home/ubuntu/renderers uv run --no-project --active pytest -q tests/test_client.pypassed:14 passed./home/ubuntu/renderers,/home/ubuntu/verifiers, and/home/ubuntu/prime-rl-v1-raw-mm-offloadcompleted inference, env rollouts, train batch creation, trainer step 0, and decoded strict trainer-bound raw image refs.Update: review hardening (
e3c12e9)render_completion_updatemutating the caller'sprevious_multi_modal_datain place (shallow dict copy +setdefault().extend()); the merge now lives in a sharedmerge_multi_modal_datahelper inbase.pythat copies inner lists, and the bridge test asserts the previous sidecar is unmutated.mm_storeuses full sha256 content-addressed filenames and raises on undecodable base64.test_raw_layout_math_matches_image_processor: parity against the real Qwen3-VL-4B and Kimi-K2.5 (pinned revision) processors at rounding-boundary dimensions. Full suite:2182 passed.Note
High Risk
Default multimodal sidecar shape changes break callers expecting
pixel_valuesunless they setmultimodal_output='processed', and inference correctness now depends on run-scoped image offload and ref materialization across companion services.Overview
Multimodal inference now defaults to lightweight raw image descriptors instead of embedding processed
pixel_valuesinMultiModalData. Qwen-VL, Qwen3.5, and Kimi K2.5 compute placeholder counts from family layout specs (offloadedfile://assets), emitprime_raw_mm_itemenvelopes via newrenderers.mm_store, and bridge turns through sharedmerge_multi_modal_data.multimodal_outputon renderer configs selectsraw(inference / vLLM) vsprocessed(lazy HF processor + cache for SFT). Auto/default renderer resolution propagates this flag.generate()no longer stacks tensors with vLLM-specific encoders; it buildsfeaturesfrom hashes, placeholders, andmmraw:refs the endpoint materializes. Prebuilt prompts can pullmulti_modal_datafromprompt_attributionwhen omitted.Packaging: optional
visionextra for Pillow/torch when using processed mode. Per-rendererimage_cache_maxconfig fields are removed. Tests expect offloaded images and assert raw-ref wire shape plus layout parity with real processors.Reviewed by Cursor Bugbot for commit e3c12e9. Bugbot is set up for automated code reviews on this repo. Configure here.