feat: add audio content blocks for Gemma multimodal prompts in model_mm#2240

Open
cliu1003 wants to merge 1 commit into microsoft:main from cliu1003:feat/model_mm_support_audio_in_gemma

@cliu1003

Description

GetUserContent() in examples/c/src/common.cpp only emitted image and
text content blocks for the Gemma-style structured-content path. Audio
inputs (num_audios) were silently ignored, so no {"type":"audio"} block
was added and the chat template never rendered an <|audio|> marker.

As a result, for Gemma-4 audio inference the audio soft tokens were placed
at the very front of the templated prompt (via the fallback in
ProcessGemma4Prompt), outside the user turn. The model therefore did not
associate the audio with the request and replied with things like
"Please provide the audio you would like me to transcribe."
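
To make the placement problem concrete, here is a minimal sketch of marker expansion with a front-of-prompt fallback. It is not the actual ProcessGemma4Prompt code: the function name ExpandAudioMarker, its parameters, and the "[A]" soft-token placeholder used in the usage note are assumptions for illustration only.

```cpp
#include <string>

// Hypothetical sketch, not the real ProcessGemma4Prompt implementation.
// Replaces the first audio marker in the templated prompt with the audio
// soft tokens. If the template rendered no marker, the soft tokens fall
// back to the very front of the prompt, outside the user turn.
std::string ExpandAudioMarker(const std::string& prompt,
                              const std::string& soft_tokens,
                              const std::string& marker = "<|audio|>") {
  const std::string::size_type pos = prompt.find(marker);
  if (pos == std::string::npos) {
    // Fallback path: nothing to expand in place, so put the tokens first.
    return soft_tokens + prompt;
  }
  std::string out = prompt;
  out.replace(pos, marker.size(), soft_tokens);  // expand in place
  return out;
}
```

With a placeholder turn such as "<user>Transcribe.</user>" and soft tokens "[A][A]", the fallback yields "[A][A]<user>Transcribe.</user>", whereas "<user><|audio|>Transcribe.</user>" yields "<user>[A][A]Transcribe.</user>". The turn delimiters here are placeholders, not Gemma's real template tokens.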

This PR emits one {"type":"audio"} block per audio clip so the chat
template inserts the <|audio|> marker at the correct position within the
user turn, allowing the audio soft tokens to be expanded in place.

Changes

  • examples/c/src/common.cpp: add a loop that appends N {"type":"audio"}
    blocks (one per audio clip) in the Gemma-style structured-content branch.
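
The change can be sketched roughly as follows. This is a simplified stand-in rather than the code in common.cpp: the names BuildStructuredUserContent and EscapeJson, the parameter list, the image/audio/text block order, and the shape of the text block are all assumptions.

```cpp
#include <string>

// Hypothetical helper: escapes only quotes and backslashes. A real
// implementation would also need to handle control characters.
std::string EscapeJson(const std::string& s) {
  std::string out;
  for (char c : s) {
    if (c == '"' || c == '\\') out += '\\';
    out += c;
  }
  return out;
}

// Hypothetical stand-in for the structured-content branch of
// GetUserContent(). Block order and field names are assumptions.
std::string BuildStructuredUserContent(const std::string& text,
                                       int num_images,
                                       int num_audios) {
  std::string content = "[";
  for (int i = 0; i < num_images; ++i) {
    content += "{\"type\":\"image\"},";
  }
  // The addition this PR describes: one audio block per audio clip, so
  // the chat template can render one audio marker per clip.
  for (int i = 0; i < num_audios; ++i) {
    content += "{\"type\":\"audio\"},";
  }
  content += "{\"type\":\"text\",\"text\":\"" + EscapeJson(text) + "\"}]";
  return content;
}
```

When num_audios is 0 the audio loop does nothing, so the output for text-only and vision-only prompts is unchanged in this sketch.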

GetUserContent only emitted image and text blocks for the Gemma-style
structured content path, so audio inputs were never inserted into the
chat template. As a result the rendered prompt had no <|audio|> marker
and Gemma-4 audio soft tokens were appended at the very front of the
templated string, causing the model to ignore the audio (e.g. replying
"Please provide the audio...").

Emit one {"type":"audio"} block per audio clip so the chat template
inserts the <|audio|> marker in the correct position within the user
turn. No effect on Gemma-3 since num_audios is 0 for text/vision-only usage.
Copilot AI review requested due to automatic review settings June 24, 2026 08:42
@cliu1003 cliu1003 requested a review from a team as a code owner June 24, 2026 08:42

Copilot AI left a comment

Contributor

Pull request overview

This PR updates the C multimodal example prompt construction so Gemma-style structured-content messages include audio content blocks, allowing chat templates to place the audio marker(s) within the user turn (instead of relying on fallback prompt processing).

Changes:

  • Extend GetUserContent() structured-content branch to append one {"type":"audio"} block per audio clip.
  • Clarify the structured-content comment to reflect Gemma-3/Gemma-4 intent.

Comment thread examples/c/src/common.cpp
Comment thread examples/c/src/common.cpp
@kunal-vaishnavi

Contributor

Thank you for your contribution!

Comment thread examples/c/src/common.cpp Outdated
@cliu1003 cliu1003 force-pushed the feat/model_mm_support_audio_in_gemma branch from 7a041ce to 0f470fc Compare June 24, 2026 09:12
@kunal-vaishnavi kunal-vaishnavi enabled auto-merge (squash) June 24, 2026 09:22
@cliu1003

Author

Hi @kunal-vaishnavi, some failed checks are blocking the merge. Could you please re-run them? Thanks.

@cliu1003

Author

Hi @kunal-vaishnavi, thanks for your help. There is still one failed check. Could you please re-run it? Thanks.

3 participants