feat(cap_im_tg): inbound voice/audio messages → STT pipeline #75
Open
gearhead10 wants to merge 1 commit into
cap_im_tg currently drops Telegram voice notes and audio attachments
silently — the agent never sees them. This adds a transcription path so
voice/audio messages are converted to text and routed through the same
inbound text flow as a regular message.
A new optional component `components/common/audio_stt/` implements two
backends behind a single API:
- "openai": multipart upload to /v1/audio/transcriptions (Whisper,
Groq, self-hosted OpenAI-compatible endpoints).
- "deepgram": raw audio body to /v1/listen with detect_language=true
when no language hint is configured (the default nova-2-general
model defaults to English otherwise).
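
To make the backend split concrete, here is a minimal sketch of how the
request URL could be chosen per backend. The type and function names
(stt_backend_t, stt_config_t, stt_build_url) are hypothetical and are not
taken from the PR; only the endpoint paths and the detect_language=true
fallback come from the description above. Passing the hint as a `language`
query parameter is an assumption about how the Deepgram request is built,
and the model parameter is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names for illustration; the PR's real API may differ. */
typedef enum {
    STT_BACKEND_OPENAI,
    STT_BACKEND_DEEPGRAM,
} stt_backend_t;

typedef struct {
    stt_backend_t backend;
    const char *base_url;  /* e.g. "https://api.deepgram.com", no trailing slash */
    const char *language;  /* NULL or "" means no language hint is configured */
} stt_config_t;

static bool is_set(const char *s)
{
    return s != NULL && s[0] != '\0';
}

/* Writes the request URL into out. Returns false if it does not fit. */
bool stt_build_url(const stt_config_t *cfg, char *out, size_t out_len)
{
    assert(cfg != NULL && cfg->base_url != NULL && out != NULL);
    int n;

    if (cfg->backend == STT_BACKEND_OPENAI) {
        /* Model and language travel as multipart form fields, not in the URL. */
        n = snprintf(out, out_len, "%s/v1/audio/transcriptions", cfg->base_url);
    } else if (is_set(cfg->language)) {
        /* Assumption: an explicit hint is sent as the language query parameter. */
        n = snprintf(out, out_len, "%s/v1/listen?language=%s",
                     cfg->base_url, cfg->language);
    } else {
        /* No hint configured: ask the service to detect the language. */
        n = snprintf(out, out_len, "%s/v1/listen?detect_language=true",
                     cfg->base_url);
    }
    /* snprintf reports the length it needed; anything >= out_len was truncated. */
    return n >= 0 && (size_t)n < out_len;
}
```

A caller would fill stt_config_t from the "stt" config group, build the
URL once per request, and treat a false return as a configuration error
rather than sending a truncated URL.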
cap_im_tg detects `voice` and `audio` payloads in update objects, queues
them through the existing attachment pipeline, and — when STT is enabled
— transcribes them before publishing as inbound text.
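
The detection and routing step can be sketched as below. This is an
illustration under stated assumptions, not the PR's code: tg_msg_media_t
stands in for whatever the component extracts from the parsed update, and
all type and function names are hypothetical. Preferring `voice` over
`audio` when both are present is also an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types; the real component works on parsed update JSON. */
typedef enum { TG_MEDIA_NONE, TG_MEDIA_VOICE, TG_MEDIA_AUDIO } tg_media_kind_t;
typedef enum { TG_ROUTE_SKIP, TG_ROUTE_ATTACHMENT, TG_ROUTE_TRANSCRIBE } tg_route_t;

typedef struct {
    const char *voice_file_id;  /* message.voice.file_id, or NULL if absent */
    const char *audio_file_id;  /* message.audio.file_id, or NULL if absent */
} tg_msg_media_t;

/* Classifies the message by which media payload it carries. */
tg_media_kind_t tg_media_kind(const tg_msg_media_t *m)
{
    assert(m != NULL);
    if (m->voice_file_id != NULL) {
        return TG_MEDIA_VOICE;
    }
    if (m->audio_file_id != NULL) {
        return TG_MEDIA_AUDIO;
    }
    return TG_MEDIA_NONE;
}

/* Voice and audio always enter the attachment pipeline; transcription is
 * added on top only when STT is enabled. */
tg_route_t tg_route_media(const tg_msg_media_t *m, bool stt_enabled)
{
    if (tg_media_kind(m) == TG_MEDIA_NONE) {
        return TG_ROUTE_SKIP;
    }
    return stt_enabled ? TG_ROUTE_TRANSCRIBE : TG_ROUTE_ATTACHMENT;
}
```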
A `keep_audio_in_storage` toggle controls whether the raw .oga is also
written to FATFS. Default off: the audio is streamed straight from
Telegram into the STT API via a new
cap_im_attachment_download_url_to_buffer() helper, skipping a flash
write and shaving ~3s end-to-end.
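
The buffer path relies on collecting the HTTP response body in RAM
instead of on flash. The signature of
cap_im_attachment_download_url_to_buffer() is not shown here, so the
following is only a sketch of the underlying idea: a growable heap buffer
with a hard size ceiling, fed chunk by chunk from a download callback.
The audio_buf_* names and the 4096-byte initial capacity are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical accumulation buffer; not the PR's actual helper. */
typedef struct {
    unsigned char *data;
    size_t len;  /* bytes stored so far */
    size_t cap;  /* bytes currently allocated */
    size_t max;  /* hard ceiling so a large file cannot exhaust the heap */
} audio_buf_t;

void audio_buf_init(audio_buf_t *b, size_t max_bytes)
{
    assert(b != NULL);
    memset(b, 0, sizeof *b);
    b->max = max_bytes;
}

/* Appends one downloaded chunk. Returns false, leaving the buffer
 * unchanged, if the ceiling would be exceeded or allocation fails. */
bool audio_buf_append(audio_buf_t *b, const void *chunk, size_t n)
{
    assert(b != NULL && (chunk != NULL || n == 0));
    if (n > b->max - b->len) {
        return false;
    }
    size_t need = b->len + n;
    if (need > b->cap) {
        size_t new_cap = b->cap ? b->cap : 4096;
        while (new_cap < need) {
            /* Double, but jump straight to max instead of overshooting. */
            new_cap = (new_cap > b->max / 2) ? b->max : new_cap * 2;
        }
        if (new_cap > b->max) {
            new_cap = b->max;
        }
        unsigned char *p = realloc(b->data, new_cap);
        if (p == NULL) {
            return false;
        }
        b->data = p;
        b->cap = new_cap;
    }
    if (n > 0) {
        memcpy(b->data + b->len, chunk, n);
    }
    b->len = need;
    return true;
}

void audio_buf_free(audio_buf_t *b)
{
    assert(b != NULL);
    free(b->data);
    memset(b, 0, sizeof *b);
}
```

The ceiling matters on a memory-constrained target: a false return lets
the caller abort the download and fall back to the FATFS path or drop
the message, rather than running out of heap mid-transfer.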
The feature is fully gated behind CONFIG_APP_CLAW_AUDIO_STT (default y)
and CONFIG_APP_STT_ENABLED (default n) so the build remains unchanged
for users who do not opt in. Configuration is exposed via a new "stt"
config group with its own page in the device web UI, mirroring the LLM
tab (backend / API key / base URL / model / language / keep-audio).
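
A rough sketch of the two-level gate, for readers unfamiliar with how
Kconfig booleans surface in C. It assumes that CONFIG_APP_CLAW_AUDIO_STT
controls whether the component is compiled at all and that
CONFIG_APP_STT_ENABLED only seeds the default of a runtime "enabled"
setting; the second point is an assumption, as are the stt_runtime_cfg_t
type and both function names. In a real ESP-IDF build the macros come
from sdkconfig.h, which defines a bool option as 1 when set to y and
leaves it undefined when set to n.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for sdkconfig.h so the sketch compiles on its own. */
#define CONFIG_APP_CLAW_AUDIO_STT 1
/* CONFIG_APP_STT_ENABLED is left undefined, matching its default of n. */

#ifdef CONFIG_APP_STT_ENABLED
#define STT_ENABLED_DEFAULT true
#else
#define STT_ENABLED_DEFAULT false
#endif

/* Hypothetical runtime view of the "stt" config group. */
typedef struct {
    bool enabled;
} stt_runtime_cfg_t;

/* Returns the settings a freshly flashed device would start with. */
stt_runtime_cfg_t stt_runtime_cfg_default(void)
{
    stt_runtime_cfg_t cfg = { .enabled = STT_ENABLED_DEFAULT };
    return cfg;
}

/* Transcription runs only if the component is compiled in AND the user
 * has switched it on at runtime. */
bool stt_should_transcribe(const stt_runtime_cfg_t *cfg)
{
#ifdef CONFIG_APP_CLAW_AUDIO_STT
    return cfg != NULL && cfg->enabled;
#else
    (void)cfg;
    return false;
#endif
}
```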
Tested on ESP32-S3 with Telegram voice notes against both backends.
Signed-off-by: Axios Dev [AWR] <83687647+gearhead10@users.noreply.github.com>
327ae58 to 3918345
Related
Related to #74
Testing
- `keep_audio_in_storage=false` (default): buffer path, no FATFS write, ~12.5 s end-to-end.
- `keep_audio_in_storage=true`: FATFS path, original `.oga` saved under `/fatfs/inbox/`.
- `CONFIG_APP_CLAW_AUDIO_STT=n`: `audio_stt` not compiled.