Current loader supports text-to-text only. Follow-up to add the model's other input modalities.
- Image (vision) — supportable in the loader now; the transformers impl has the vision path (vision_tower, embed_vision, pixel_values).
- Video — listed on the model card, but not wired in the current transformers (5.12.0) DiffusionGemmaEncoderModel (modeling_diffusion_gemma.py:975 — "doesn't support audio or video inputs"), so it needs transformers support first.
- Audio — not supported by the model; out of scope.
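
A minimal sketch of how the loader could gate inputs by modality until the follow-up lands. Everything here (`Modality`, `MODALITY_STATUS`, `check_modalities`) is hypothetical and only mirrors the list above; none of it is an existing loader or transformers API.

```python
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    VIDEO = "video"
    AUDIO = "audio"


# Hypothetical status table that restates the list above.
MODALITY_STATUS = {
    Modality.TEXT: "supported",      # current loader: text-to-text only
    Modality.IMAGE: "planned",       # vision path exists in transformers
    Modality.VIDEO: "blocked",       # needs transformers support first
    Modality.AUDIO: "out_of_scope",  # not supported by the model
}


def check_modalities(requested):
    """Return the requested modalities the loader cannot handle yet, with their status."""
    return {
        m: MODALITY_STATUS[m]
        for m in requested
        if MODALITY_STATUS[m] != "supported"
    }
```

Calling `check_modalities([Modality.TEXT, Modality.VIDEO])` would report only the video entry, which lets the loader fail early with a clear reason instead of passing unsupported inputs through.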