fix(#34542): use model audio_type instead of hardcoded mp3 in Azure TTS #2831
agenthaulk wants to merge 0 commits into langgenius:main
Code Review
This pull request replaces the hardcoded 'mp3' response format with a dynamic lookup in the Azure OpenAI TTS implementation. A review comment identifies a potential issue where header-based audio formats (such as wav or flac) may result in corrupted files when concatenated in the streaming path, as the current logic assumes a format that supports simple byte concatenation like mp3.
  client.audio.speech.with_streaming_response.create,
  model=model,
- response_format="mp3",
+ response_format=self._get_model_audio_type(model, credentials),
When response_format is set to a format that includes a header (such as wav, flac, or opus), simply concatenating the raw bytes from multiple requests in the long-text path will result in a corrupted audio file containing multiple headers. This implementation worked previously because mp3 was hardcoded and supports simple concatenation. Consider adding logic to handle header-based formats or restricting the allowed formats for the multi-sentence streaming path.
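The header problem described in this comment can be reproduced with the standard library alone. The following is a minimal sketch, not code from this PR: `make_wav`, `count_frames`, and `merge_wavs` are hypothetical helper names, and WAV stands in for any header-based format. It shows that byte-concatenating two WAV files leaves two RIFF headers in the output, and that Python's `wave` reader only sees the audio from the first one, whereas decoding each file and writing a single header preserves all frames.

```python
import io
import wave


def make_wav(num_frames: int, sample_rate: int = 8000) -> bytes:
    """Build a small mono 16-bit WAV file of silence, entirely in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * num_frames)
    return buf.getvalue()


def count_frames(data: bytes) -> int:
    """Return the number of audio frames a WAV reader finds in `data`."""
    with wave.open(io.BytesIO(data), "rb") as r:
        return r.getnframes()


def merge_wavs(chunks: list[bytes]) -> bytes:
    """Merge WAV files by decoding each one and writing one new header."""
    out = io.BytesIO()
    with wave.open(out, "wb") as w:
        for i, chunk in enumerate(chunks):
            with wave.open(io.BytesIO(chunk), "rb") as r:
                if i == 0:
                    # Copy channels, sample width, and rate from the first file;
                    # the frame count in the header is corrected on close.
                    w.setparams(r.getparams())
                w.writeframes(r.readframes(r.getnframes()))
    return out.getvalue()


first = make_wav(100)
second = make_wav(50)

naive = first + second               # raw byte concatenation
merged = merge_wavs([first, second])  # decode, then re-encode once

print("RIFF headers in naive output:", naive.count(b"RIFF"))
print("frames visible in naive output:", count_frames(naive))
print("frames visible in merged output:", count_frames(merged))
```

This sketch assumes all chunks share the same channel count, sample width, and sample rate. Compressed formats such as flac or opus would need a real decoder rather than the `wave` module.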
41f18b3 to 980783a
619ff60 to 21d3a7e
Fix
Replace the hardcoded "mp3" response format with the dynamic audio_type from the model configuration, and add a proper non-streaming TTS path that uses pydub to combine audio correctly.

Root Cause
_tts_invoke_streaming and _process_sentence hardcoded response_format="mp3", ignoring the audio type configured per model in credentials.

Changes
- tts.py: full rewrite for consistency with the OpenAI plugin pattern:
  - _tts_invoke_streaming: uses audio_type (from _get_model_audio_type) instead of the hardcoded "mp3" as response_format
  - _tts_invoke (new): non-streaming path that collects sentence audio in parallel, then combines it with pydub.AudioSegment for correct output regardless of format (handles the wav/flac header issue)
  - _process_sentence: now passes response_format=audio_type to the API
  - No _STREAMABLE_FORMATS whitelist needed: streaming uses whatever format the user configured; non-streaming uses pydub for proper merging
- pyproject.toml: added the pydub~=0.25.1 dependency (same version as the OpenAI plugin)
- manifest.yaml: version bump
- tests/test_tts.py: new test suite covering:
  - audio_type
  - _process_sentence forwards audio_type as response_format
  - _tts_invoke combines segments via pydub

Alignment with OpenAI plugin
This implementation mirrors models/openai/models/tts/tts.py, including the pydub dependency (pydub~=0.25.1).

Scope
Minimal, focused on the reported issue. No behavioral changes beyond fixing the hardcoded format.
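The non-streaming path listed under Changes (collect sentence audio in parallel, then combine with pydub) might look roughly like the sketch below. This is an illustration under stated assumptions, not the code in tts.py: `synthesize_in_order`, `combine_segments`, and the `synthesize` callable are hypothetical names, and the real plugin's worker count and error handling are not shown in this PR description. pydub is imported lazily, and for compressed formats it relies on an ffmpeg installation being available.

```python
import io
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def synthesize_in_order(
    sentences: Sequence[str],
    synthesize: Callable[[str], bytes],
    max_workers: int = 3,
) -> list[bytes]:
    """Run one TTS request per sentence in parallel, keeping sentence order.

    `synthesize` is a hypothetical callable standing in for the API call
    that returns encoded audio bytes for a single sentence.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(synthesize, sentences))


def combine_segments(chunks: Sequence[bytes], audio_type: str) -> bytes:
    """Decode each chunk, join the audio, and re-encode once as `audio_type`.

    Assumes `audio_type` is a format name that pydub/ffmpeg accepts as-is
    (for example "wav" or "mp3"); other names may need mapping.
    """
    # Imported lazily so the ordering helper works without pydub installed.
    from pydub import AudioSegment

    combined = AudioSegment.empty()
    for chunk in chunks:
        combined += AudioSegment.from_file(io.BytesIO(chunk), format=audio_type)

    out = io.BytesIO()
    combined.export(out, format=audio_type)  # writes a single header
    return out.getvalue()
```

Decoding and re-encoding once avoids the multiple-header problem raised in the review, at the cost of buffering the full audio in memory before anything is returned to the caller.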