feat(cap_im_tg): inbound voice/audio messages → STT pipeline #75

Open
gearhead10 wants to merge 1 commit into espressif:master from gearhead10:feat/stt-pipeline

gearhead10 commented May 13, 2026

cap_im_tg currently drops Telegram voice notes and audio attachments silently — the agent never sees them. This adds a transcription path so voice/audio messages are converted to text and routed through the same inbound text flow as a regular message.

A new optional component components/common/audio_stt/ implements two backends behind a single API:

  • "openai": multipart upload to /v1/audio/transcriptions (Whisper, Groq, self-hosted OpenAI-compatible endpoints).
  • "deepgram": raw audio body to /v1/listen, with detect_language=true when no language hint is configured (the default nova-2-general model is otherwise English-only).
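
The Deepgram query rule above (send a `language` hint when one is configured, otherwise ask for auto-detection) could be sketched roughly as below. This is an illustration, not the code in this PR: `stt_deepgram_build_path` is a hypothetical helper, and it assumes the model and language strings are already URL-safe, so no percent-encoding is done. `model`, `language` and `detect_language` are Deepgram query parameters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the PR): writes the Deepgram request path
 * into `out`. `model` and `language` may be NULL or "" when unset.
 * Returns the path length, or -1 if `out` is too small. */
int stt_deepgram_build_path(char *out, size_t out_len,
                            const char *model, const char *language)
{
    assert(out != NULL && out_len > 0);

    bool has_model = (model != NULL && model[0] != '\0');
    bool has_lang = (language != NULL && language[0] != '\0');

    /* With a language hint, pass it through; without one, request
     * language auto-detection instead. */
    int n = snprintf(out, out_len, "/v1/listen?%s%s%s%s%s",
                     has_model ? "model=" : "",
                     has_model ? model : "",
                     has_model ? "&" : "",
                     has_lang ? "language=" : "detect_language=true",
                     has_lang ? language : "");

    if (n < 0 || (size_t)n >= out_len) {
        return -1; /* formatting error or truncated output */
    }
    return n;
}
```

Calling it with a model of `"nova-2"` and an empty language yields a path ending in `detect_language=true`; passing `"es"` as the language yields `language=es` instead.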

cap_im_tg detects voice and audio payloads in update objects, queues them through the existing attachment pipeline, and — when STT is enabled — transcribes them before publishing as inbound text.
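
As a rough sketch of that routing decision: the types and function names below are hypothetical and only illustrate the flow described here, including the fallback to the legacy attachment event when STT is disabled. The one fact relied on is that Telegram `voice` and `audio` objects each carry a `file_id`; JSON parsing is assumed to have happened already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, already-parsed view of one Telegram message.
 * A NULL pointer means the field was absent from the update. */
typedef struct {
    const char *text;
    const char *voice_file_id; /* message.voice.file_id */
    const char *audio_file_id; /* message.audio.file_id */
} tg_msg_view_t;

typedef enum {
    TG_ROUTE_IGNORE,
    TG_ROUTE_TEXT,             /* publish as inbound text */
    TG_ROUTE_TRANSCRIBE,       /* download, run STT, then publish as text */
    TG_ROUTE_ATTACHMENT_EVENT  /* legacy attachment event, no transcription */
} tg_route_t;

/* Returns the file_id of the voice note or audio attachment, or NULL. */
const char *tg_msg_audio_file_id(const tg_msg_view_t *m)
{
    assert(m != NULL);
    if (m->voice_file_id != NULL && m->voice_file_id[0] != '\0') {
        return m->voice_file_id;
    }
    if (m->audio_file_id != NULL && m->audio_file_id[0] != '\0') {
        return m->audio_file_id;
    }
    return NULL;
}

/* Decides how a message is handled, given the runtime STT setting. */
tg_route_t tg_route_message(const tg_msg_view_t *m, bool stt_enabled)
{
    assert(m != NULL);
    if (tg_msg_audio_file_id(m) != NULL) {
        return stt_enabled ? TG_ROUTE_TRANSCRIBE : TG_ROUTE_ATTACHMENT_EVENT;
    }
    if (m->text != NULL && m->text[0] != '\0') {
        return TG_ROUTE_TEXT;
    }
    return TG_ROUTE_IGNORE;
}
```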

A keep_audio_in_storage toggle controls whether the raw .oga is also written to FATFS. Default off: the audio is streamed straight from Telegram into the STT API via a new cap_im_attachment_download_url_to_buffer() helper, skipping a flash write and shaving ~3s end-to-end.
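
For readers unfamiliar with the buffer path, here is one way a download-to-buffer helper might accumulate HTTP response chunks in RAM. It does not show the signature or implementation of cap_im_attachment_download_url_to_buffer(); `audio_buf_t` and its functions are hypothetical, and the hard size cap is an assumption motivated by the limited memory on an ESP32-class device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer with a hard upper bound. */
typedef struct {
    unsigned char *data;
    size_t len;     /* bytes stored so far */
    size_t cap;     /* bytes currently allocated */
    size_t max_len; /* refuse to grow beyond this */
} audio_buf_t;

void audio_buf_init(audio_buf_t *b, size_t max_len)
{
    assert(b != NULL);
    b->data = NULL;
    b->len = 0;
    b->cap = 0;
    b->max_len = max_len;
}

/* Appends one received chunk. Returns false (leaving the buffer unchanged)
 * if the chunk would exceed max_len or if allocation fails. */
bool audio_buf_append(audio_buf_t *b, const void *chunk, size_t n)
{
    assert(b != NULL && (chunk != NULL || n == 0));
    if (n > b->max_len - b->len) {
        return false; /* would exceed the configured cap */
    }
    size_t need = b->len + n;
    if (need > b->cap) {
        size_t new_cap = b->cap ? b->cap : 1024;
        while (new_cap < need) {
            new_cap *= 2; /* geometric growth keeps reallocations rare */
        }
        if (new_cap > b->max_len) {
            new_cap = b->max_len;
        }
        unsigned char *p = realloc(b->data, new_cap);
        if (p == NULL) {
            return false;
        }
        b->data = p;
        b->cap = new_cap;
    }
    if (n > 0) {
        memcpy(b->data + b->len, chunk, n);
    }
    b->len = need;
    return true;
}

void audio_buf_free(audio_buf_t *b)
{
    assert(b != NULL);
    free(b->data);
    audio_buf_init(b, b->max_len);
}
```

In a design like this, the HTTP client's data callback would call `audio_buf_append` per chunk, and the completed buffer would be handed to the STT request without touching FATFS.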

The feature is fully gated behind CONFIG_APP_CLAW_AUDIO_STT (default y) and CONFIG_APP_STT_ENABLED (default n) so the build remains unchanged for users who do not opt in. Configuration is exposed via a new "stt" config group with its own page in the device web UI, mirroring the LLM tab (backend / API key / base URL / model / language / keep-audio).
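
The two-level gating can be pictured with the sketch below. The helper names are hypothetical. It also assumes that CONFIG_APP_CLAW_AUDIO_STT controls whether the component is compiled in, while CONFIG_APP_STT_ENABLED only seeds the default of a runtime "enabled" setting; the PR text does not spell out that split. In ESP-IDF, a bool Kconfig option set to n is simply left undefined in sdkconfig.h, which is why `#ifdef` is used.

```c
#include <stdbool.h>

/* True when the audio_stt component is part of the build. */
static inline bool app_stt_compiled_in(void)
{
#ifdef CONFIG_APP_CLAW_AUDIO_STT
    return true;
#else
    return false;
#endif
}

/* Assumed meaning: initial value of the runtime "enabled" setting
 * before the user changes it in the web UI. */
static inline bool app_stt_default_enabled(void)
{
#ifdef CONFIG_APP_STT_ENABLED
    return true;
#else
    return false;
#endif
}

/* Transcription runs only if the code is present AND the user opted in. */
bool app_stt_should_transcribe(bool runtime_enabled)
{
    return app_stt_compiled_in() && runtime_enabled;
}
```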

Tested on ESP32-S3 with Telegram voice notes against both backends.

Related

Related to #74

Testing

  • OpenAI / Whisper-compatible backend — Telegram voice note transcribed end-to-end.
  • Deepgram backend (nova-2 and nova-3) — multi-language auto-detect verified (Spanish/English) by leaving the language field blank.
  • keep_audio_in_storage=false (default) — buffer path, no FATFS write, ~12.5 s end-to-end.
  • keep_audio_in_storage=true — FATFS path, original .oga saved under /fatfs/inbox/.
  • STT disabled — voice messages still produce the legacy attachment event (no regression).
  • Build with CONFIG_APP_CLAW_AUDIO_STT=n: audio_stt not compiled.
  • Hardware: ESP32-S3 devkit, ESP-IDF v5.5.4. Configure backend / API key / model in the web UI's new "Speech-to-Text" tab.

Signed-off-by: Axios Dev [AWR] <83687647+gearhead10@users.noreply.github.com>
gearhead10 force-pushed the feat/stt-pipeline branch from 327ae58 to 3918345 on May 14, 2026 01:11