
Add a flush/cancel primitive to AudioProducer (Python) for real-time interruption use cases (ai voice agents) #1614

@vipyne


tl;dr:

Claude wrote this (and, spoiler, it wrote the accompanying PR too).
I'm open to other solutions, but it does seem like AudioProducer.flush() would be clutch for interrupting voice agents.

Summary

moq.AudioProducer.write() (Python bindings, moq-rs 0.2.17) is fire-and-forget — there's no way to drop already-written audio from the encoder/wire/jitter buffer without closing the whole track. For real-time interactive use cases (voice agents, where the user interrupts the bot mid-utterance), this means we can stop generating audio but the already-buffered audio finishes playing on the consumer side.

Repro / context

Voice-agent pipeline:

broadcast = moq.BroadcastProducer()
audio = broadcast.publish_audio(
    "bot-audio",
    moq.AudioEncoderInput(format=moq.AudioFormat.S16, sample_rate=24000, channels=1),
    moq.AudioEncoderOutput(codec=moq.AudioCodec.OPUS, frame_duration_ms=20, ...),
)

# Bot starts speaking; many `write()`s queue up in the encoder/wire/browser.
for chunk in tts_stream():
    audio.write(moq.AudioFrame(timestamp_us=0, data=chunk))

# User interrupts. We stop generating new audio. But there's no API to
# drop the in-flight buffer, so the user keeps hearing the bot for ~hundreds
# of ms more.

Attempted workaround: audio.finish() followed by broadcast.publish_audio("bot-audio", ...) raises Error processing frame: duplicate — the broadcast won't accept republishing under the same track name.

Today we work around this by pacing write() calls against a wall-clock virtual timer in Python so the in-flight buffer never exceeds ~20 ms. It works but it's a backpressure mechanism reinvented in userland.
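For reference, here is a minimal sketch of what that userland pacing looks like. It is not our exact code: `WritePacer`, `chunk_duration_s`, and the injected `clock`/`sleep` parameters are names made up for this illustration, and it assumes raw S16 mono PCM at 24 kHz as in the repro above. The pacer keeps a virtual "playout end" timestamp and blocks each write until the estimated unplayed audio, including the new chunk, fits the budget.

```python
import time

# Assumed input format, matching the repro above: S16, 24 kHz, mono.
BYTES_PER_SAMPLE = 2
SAMPLE_RATE = 24_000
CHANNELS = 1


def chunk_duration_s(chunk: bytes) -> float:
    """Duration of a raw PCM chunk in seconds (960 bytes is 20 ms here)."""
    return len(chunk) / (BYTES_PER_SAMPLE * CHANNELS * SAMPLE_RATE)


class WritePacer:
    """Hypothetical helper: paces writes against a wall-clock virtual timer."""

    def __init__(self, budget_s=0.020, clock=time.monotonic, sleep=time.sleep):
        self.budget_s = budget_s
        self._clock = clock
        self._sleep = sleep
        self._playout_end = None  # virtual time when all written audio has played

    def wait_for_room(self, duration_s: float) -> None:
        """Block until writing `duration_s` more audio fits in the budget."""
        now = self._clock()
        if self._playout_end is None or self._playout_end < now:
            # First write, or the virtual buffer has already drained.
            self._playout_end = now
        lead = self._playout_end - now
        # Chunks longer than the budget are let through once the buffer is empty.
        target = max(0.0, self.budget_s - duration_s)
        if lead > target:
            self._sleep(lead - target)
        self._playout_end += duration_s

    def reset(self) -> None:
        """Call on interruption so the next utterance starts from a clean timer."""
        self._playout_end = None


def paced_write(write, chunks, pacer: WritePacer) -> None:
    """`write` is any callable taking one PCM chunk, e.g. a wrapper around audio.write()."""
    for chunk in chunks:
        pacer.wait_for_room(chunk_duration_s(chunk))
        write(chunk)
```

On interruption we stop iterating and call `reset()`. At most one budget's worth of audio has been handed to the encoder at that point, which is the only part of the pipeline this approach can bound.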

Requested API

Add AudioProducer.flush() / AudioProducer.cancel(): drop any unencoded and unsent frames without closing the track. On the consumer side, this would be observable as a brief skip in playback.
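To make the requested semantics concrete, here is a toy model in plain Python. None of this is moq API: `FlushableProducerModel`, `pump()`, and the return value of `flush()` are all hypothetical, and the real implementation would live in the Rust encoder/send path. It only shows the behavior being asked for: pending frames are dropped, already-sent frames are untouched, and the track stays open for further writes.

```python
from collections import deque


class FlushableProducerModel:
    """Toy stand-in for an audio producer; NOT the moq-rs API."""

    def __init__(self):
        self._pending = deque()  # written but not yet encoded/sent
        self.sent = []           # frames that already left the producer
        self.closed = False

    def write(self, frame: bytes) -> None:
        if self.closed:
            raise RuntimeError("track is finished")
        self._pending.append(frame)

    def pump(self, max_frames: int = 1) -> None:
        """Simulate the encoder/sender draining up to `max_frames` frames."""
        for _ in range(min(max_frames, len(self._pending))):
            self.sent.append(self._pending.popleft())

    def flush(self) -> int:
        """Drop everything still pending; the track stays open.

        Returning the dropped count is an assumption for this sketch.
        """
        dropped = len(self._pending)
        self._pending.clear()
        return dropped

    def finish(self) -> None:
        self.closed = True


# Usage: the bot queues five frames, two go out, then the user interrupts.
producer = FlushableProducerModel()
for i in range(5):
    producer.write(f"frame-{i}".encode())
producer.pump(max_frames=2)
dropped = producer.flush()           # the three unsent frames are discarded
producer.write(b"next-utterance")    # same track, no republish needed
producer.pump()
```

The contrast with the attempted workaround above is that nothing is closed or republished, so the track name never has to be reused.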

Why this matters

This is the main blocker for moq-rs-based real-time conversational AI (voice agents). Without it, interruption latency can be as large as the sum of the pacing buffer, the WebTransport send window, and the browser jitter buffer, and the bot can only control the first of these. We're keeping our pacing budget at 20 ms, which is a single frame at the frame_duration_ms=20 setting above, so the pipeline is very jitter-sensitive.
