feat(cap_im_tg): inbound voice/audio messages → STT pipeline by gearhead10 · Pull Request #75 · espressif/esp-claw

gearhead10 · 2026-05-13T00:22:05Z

cap_im_tg currently drops Telegram voice notes and audio attachments silently — the agent never sees them. This adds a transcription path so voice/audio messages are converted to text and routed through the same inbound text flow as a regular message.

A new optional component components/common/audio_stt/ implements two backends behind a single API:

- "openai": multipart upload to /v1/audio/transcriptions (Whisper, Groq, self-hosted OpenAI-compatible endpoints).
- "deepgram": raw audio body to /v1/listen with detect_language=true when no language hint is configured (the default 2-general-nova model is English-only otherwise).
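To make the two request shapes concrete, here is a minimal sketch in plain C of how each backend's request could be assembled. The function names, the example base URL, and the assumption that the model and language values are already URL-safe are illustrative only and are not taken from the PR's code.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the Deepgram request URL.
 * With a language hint the hint is passed through; without one the
 * API is asked to detect the language. Assumes model and language
 * need no URL-encoding. Returns 0 on success, -1 if `out` is too small. */
int stt_build_deepgram_url(const char *base_url, const char *model,
                           const char *language,
                           char *out, size_t out_len)
{
    int n;
    if (language != NULL && language[0] != '\0') {
        n = snprintf(out, out_len, "%s/v1/listen?model=%s&language=%s",
                     base_url, model, language);
    } else {
        n = snprintf(out, out_len,
                     "%s/v1/listen?model=%s&detect_language=true",
                     base_url, model);
    }
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}

/* Hypothetical helper: write the multipart/form-data text that precedes
 * the raw audio bytes in an OpenAI-compatible transcription upload.
 * The caller would then send the audio, followed by "\r\n--<boundary>--\r\n".
 * Returns 0 on success, -1 if `out` is too small. */
int stt_build_multipart_head(const char *boundary, const char *model,
                             const char *filename,
                             char *out, size_t out_len)
{
    int n = snprintf(out, out_len,
        "--%s\r\n"
        "Content-Disposition: form-data; name=\"model\"\r\n\r\n"
        "%s\r\n"
        "--%s\r\n"
        "Content-Disposition: form-data; name=\"file\"; filename=\"%s\"\r\n"
        "Content-Type: audio/ogg\r\n\r\n",
        boundary, model, boundary, filename);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

The Deepgram path sends the audio as the request body unchanged, so only the URL differs between the hinted and auto-detect cases. The OpenAI-compatible path needs the multipart framing around the same bytes.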

cap_im_tg detects voice and audio payloads in update objects, queues them through the existing attachment pipeline, and — when STT is enabled — transcribes them before publishing as inbound text.

A keep_audio_in_storage toggle controls whether the raw .oga is also written to FATFS. It is off by default: the audio is streamed straight from Telegram into the STT API via a new cap_im_attachment_download_url_to_buffer() helper, which skips a flash write and saves ~3 s end-to-end.
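The handling of an inbound voice or audio message can be pictured as a three-way decision. The sketch below is an assumption-laden illustration: the enum, its values, and the function name are hypothetical, and it assumes keep_audio_in_storage only matters once STT is enabled.

```c
#include <stdbool.h>

/* Hypothetical routes for an inbound voice/audio payload. */
typedef enum {
    VOICE_ROUTE_LEGACY_ATTACHMENT, /* STT off: publish the attachment event as before */
    VOICE_ROUTE_BUFFER_THEN_STT,   /* download into RAM, transcribe, no flash write */
    VOICE_ROUTE_FATFS_THEN_STT     /* save the .oga to FATFS, then transcribe */
} voice_route_t;

/* Pick a route from the two runtime settings described in the PR text. */
voice_route_t voice_pick_route(bool stt_enabled, bool keep_audio_in_storage)
{
    if (!stt_enabled) {
        return VOICE_ROUTE_LEGACY_ATTACHMENT;
    }
    return keep_audio_in_storage ? VOICE_ROUTE_FATFS_THEN_STT
                                 : VOICE_ROUTE_BUFFER_THEN_STT;
}
```

With the defaults once STT is enabled (keep_audio_in_storage off), this yields the buffer route, which is the one that avoids the flash write.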

The feature is fully gated behind CONFIG_APP_CLAW_AUDIO_STT (default y) and CONFIG_APP_STT_ENABLED (default n) so the build remains unchanged for users who do not opt in. Configuration is exposed via a new "stt" config group with its own page in the device web UI, mirroring the LLM tab (backend / API key / base URL / model / language / keep-audio).
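As a rough illustration of the two-level gating (compile time plus runtime), the sketch below pairs a Kconfig-style guard with a settings struct whose fields follow the web UI list above. The struct layout, the function name, and the local definition of the Kconfig symbol are assumptions made so the example builds outside ESP-IDF. The PR's actual types may differ.

```c
#include <stdbool.h>
#include <string.h>

/* In an ESP-IDF build, sdkconfig.h defines this symbol as 1 when the
 * option is y and leaves it undefined when it is n. This sketch does
 * not include sdkconfig.h, so it defines the symbol itself. */
#define CONFIG_APP_CLAW_AUDIO_STT 1

/* Hypothetical mirror of the "stt" config group. */
typedef struct {
    bool enabled;               /* runtime opt-in, off unless the user enables it */
    const char *backend;        /* "openai" or "deepgram" */
    const char *api_key;
    const char *base_url;
    const char *model;
    const char *language;       /* empty string means no language hint */
    bool keep_audio_in_storage;
} stt_settings_t;

/* Transcribe only when the component is compiled in, the user has
 * enabled STT, and the configured backend is one of the two known ones. */
bool stt_should_transcribe(const stt_settings_t *s)
{
#if CONFIG_APP_CLAW_AUDIO_STT
    if (s == NULL || !s->enabled || s->backend == NULL) {
        return false;
    }
    return strcmp(s->backend, "openai") == 0 ||
           strcmp(s->backend, "deepgram") == 0;
#else
    (void)s;
    return false;
#endif
}
```

Keeping the runtime check in one place means callers such as the Telegram handler need only a single boolean test before choosing between transcription and the existing attachment event.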

Tested on ESP32-S3 with Telegram voice notes against both backends.

Testing

- OpenAI / Whisper-compatible backend: Telegram voice note transcribed end-to-end.
- Deepgram backend (nova-2 and nova-3): multi-language auto-detect verified (Spanish/English) by leaving the language field blank.
- keep_audio_in_storage=false (default): buffer path, no FATFS write, ~12.5 s end-to-end.
- keep_audio_in_storage=true: FATFS path, original .oga saved under /fatfs/inbox/.
- STT disabled: voice messages still produce the legacy attachment event (no regression).
- Build with CONFIG_APP_CLAW_AUDIO_STT=n: audio_stt not compiled.
Hardware: ESP32-S3 devkit, ESP-IDF v5.5.4. Configure backend / API key / model in the web UI's new "Speech-to-Text" tab.

Signed-off-by: Axios Dev [AWR] <83687647+gearhead10@users.noreply.github.com>

gearhead10 force-pushed the feat/stt-pipeline branch from 327ae58 to 3918345 · May 14, 2026 01:11

gearhead10 wants to merge 1 commit into espressif:master from gearhead10:feat/stt-pipeline
