Voice Integration

Voice messages flow through an inbound transcription → outbound voice reply pipeline. This document describes the bridge's role in that pipeline; provider-specific mechanics (TTS/STT backends, voice IDs, languages) are owned by voice provider extensions. This is a first-class extension surface: one companion extension can provide STT fallbacks for inbound voice/audio files and TTS fallbacks for outbound Telegram voice replies without owning a second bot polling loop.

Overview

  1. Inbound: A voice message arrives via Telegram. Inbound handlers transcribe it to text.
  2. Processing: The transcription becomes the agent prompt. The bridge tags the turn if it originated from voice.
  3. Outbound: If voice replies are enabled, the agent's text response is converted to voice and sent back. No text draft appears in Telegram during generation.

The bridge owns Telegram transport, queue integration, reply-mode policy, preview suppression, fallback text delivery, and Settings UI. Provider extensions own STT/TTS calls, speech rewriting, provider-specific menus, transcript preference, and OGG/Opus conversion.

Voice Detection

Voice messages arrive as message.voice in Telegram updates. The bridge's media processing detects these and sets kind: "voice" on the downloaded file. Regular audio files (message.audio) get kind: "audio"; mirror mode treats both voice notes and audio uploads as voice input for reply-policy tagging.

Inbound handlers match kind: "voice" or mime: "audio/*" to run a transcription command:

{
  "inboundHandlers": [
    {
      "mime": "audio/*",
      "template": ["/path/to/stt", "--file={file}", "--mime={mime}"]
    }
  ]
}

The transcription output becomes the raw text of the prompt.

Voice provider extensions can also register STT backends with registerTelegramVoiceTranscriptionProvider() from @llblab/pi-telegram/voice. Inbound command-template handlers and programmatic inbound handlers remain the stronger generic paths and run first; if no matching handler produces output for a voice/audio file, registered transcription providers are tried as fallback in registration order. The first provider that returns non-empty text wins; providers that return undefined pass to the next provider, and provider failures are recorded before trying the next provider. This lets a full voice extension provide both TTS and STT without requiring telegram.json handler templates, while still preserving operator-configured inbound handlers as the stronger choice.
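
The fallback order described above can be sketched as a small resolver. This is a minimal illustration, not the bridge's implementation: `VoiceFile`, `Transcriber`, and `transcribeWithFallback` are hypothetical names, and how handler errors are treated is an assumption (here they simply propagate).

```typescript
// Hypothetical shapes for illustration; not the real pi-telegram types.
type VoiceFile = { path: string; mime: string; kind: "voice" | "audio" };
type Transcriber = (file: VoiceFile) => Promise<string | undefined>;

interface TranscriptionResult {
  text: string | undefined;
  source: "handler" | "provider" | "none";
  failures: string[];
}

async function transcribeWithFallback(
  file: VoiceFile,
  handlers: Transcriber[], // configured or programmatic inbound handlers
  providers: Transcriber[], // registered providers, in registration order
): Promise<TranscriptionResult> {
  const failures: string[] = [];

  // Handlers are the stronger path and run first.
  // Assumption: handler errors propagate; the text above does not specify this.
  for (const handler of handlers) {
    const text = await handler(file);
    if (text && text.trim() !== "") {
      return { text, source: "handler", failures };
    }
  }

  // Providers are consulted only when no handler produced output.
  for (const provider of providers) {
    try {
      const text = await provider(file);
      if (text && text.trim() !== "") {
        return { text, source: "provider", failures };
      }
      // undefined or empty text: pass to the next provider.
    } catch (error) {
      // Record the failure, then keep trying the remaining providers.
      failures.push(error instanceof Error ? error.message : String(error));
    }
  }

  return { text: undefined, source: "none", failures };
}
```

In this sketch a provider that throws never blocks a later provider from answering, which matches the "recorded before trying the next provider" rule.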

Voice Reply Policy

The bridge decides when to reply with voice from voice.replyMode in TelegramConfig (stored in telegram.json). If config does not set a valid mode, the bridge uses the hidden default: manual behavior without adding voice policy text to the prompt context.

Modes

  • hidden (default): no voice.replyMode is stored; same behavior as manual, no prompt voice context.
  • manual: only reply with voice when the agent authors <!-- telegram_voice --> markup; explicit prompt context is added.
  • mirror: reply with voice when the inbound message was a voice note or audio file. Text input stays on the manual path where explicit telegram_voice markup still works.
  • always: always reply with voice.

Warning: In always mode, the bridge transparently intercepts all text replies and converts them to voice. Users receive only voice messages as long as voice generation succeeds; if it fails, the bridge falls back to sending the planned text reply.

When a message is received, the bridge resolves the active voice reply mode and tags the turn:

  • voiceReplyPreferred: true when mode is mirror and the turn has a voice file
  • voiceReplyRequired: true when mode is always
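
The tagging rules above reduce to two boolean checks. The following is a minimal sketch under assumed shapes: `TurnInput`, `VoiceTags`, `tagTurn`, and `isVoiceTagged` are hypothetical names, and the mode is taken as already resolved.

```typescript
type VoiceReplyMode = "hidden" | "manual" | "mirror" | "always";

// Hypothetical turn shape: only the file kinds matter for tagging.
interface TurnInput {
  files: { kind: "voice" | "audio" | "photo" | "document" }[];
}

interface VoiceTags {
  voiceReplyPreferred: boolean;
  voiceReplyRequired: boolean;
}

function tagTurn(mode: VoiceReplyMode, turn: TurnInput): VoiceTags {
  // Mirror mode treats both voice notes and audio uploads as voice input.
  const hasVoiceFile = turn.files.some(
    (file) => file.kind === "voice" || file.kind === "audio",
  );
  return {
    voiceReplyPreferred: mode === "mirror" && hasVoiceFile,
    voiceReplyRequired: mode === "always",
  };
}

// A turn counts as voice-tagged when either flag is set.
function isVoiceTagged(tags: VoiceTags): boolean {
  return tags.voiceReplyPreferred || tags.voiceReplyRequired;
}
```

Note that hidden and manual never tag a turn in this sketch; in those modes a voice reply depends on explicit telegram_voice markup.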

At agent_end, if the turn is voice-tagged and the agent response has no explicit telegram_voice markup, the bridge transparently intercepts the text reply and converts it to voice. If the agent uses multiple telegram_voice blocks, each becomes a separate voice message. The same reply-mode decision applies to both registered voice synthesis providers and configured outbound voice handlers.
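
A rough sketch of that agent_end decision follows. It assumes the telegram_voice blocks have already been extracted from the response (the markup parsing itself is not shown), and `planVoiceSegments` and `VoicePlan` are hypothetical names. What happens to text outside explicit blocks is not modeled here.

```typescript
interface VoiceTags {
  voiceReplyPreferred: boolean;
  voiceReplyRequired: boolean;
}

interface VoicePlan {
  segments: string[]; // one voice message per segment
  intercepted: boolean; // true when the plain text reply was converted
}

function planVoiceSegments(
  tags: VoiceTags,
  replyText: string,
  explicitVoiceBlocks: string[], // assumed to be pre-parsed from the response
): VoicePlan {
  // Explicit markup wins: each block becomes its own voice message.
  if (explicitVoiceBlocks.length > 0) {
    return { segments: [...explicitVoiceBlocks], intercepted: false };
  }
  // Voice-tagged turn without markup: intercept the whole text reply.
  if (tags.voiceReplyPreferred || tags.voiceReplyRequired) {
    return { segments: [replyText], intercepted: true };
  }
  // Untagged turn without markup: no voice reply at all.
  return { segments: [], intercepted: false };
}
```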

Preview Suppression

When a turn is voice-tagged, the bridge suppresses text preview streaming during LLM generation. This prevents draft text from appearing in Telegram before the voice message is delivered.

Voice Provider Extension Surface

A voice extension may combine three public seams:

  • registerTelegramVoiceTranscriptionProvider() for inbound STT fallback on voice/audio files
  • registerTelegramVoiceSynthesisProvider() for outbound TTS/synthesis fallback to Telegram voice messages
  • registerTelegramSection() for provider-specific Telegram UI such as voice, language, style, transcript, or provider on/off controls

The reply policy itself remains a built-in pi-telegram setting (voice.replyMode) rather than a provider-owned menu.

Outbound Voice Synthesis Provider Registration

Voice synthesis provider extensions register themselves through registerTelegramVoiceSynthesisProvider(). The bridge only provides the registration seam and the actual delivery to Telegram. The provider is fully responsible for:

  • Text optimisation / speech-style rewriting
  • Adding speech tags (when desired)
  • Running TTS + ffmpeg conversion to OGG/Opus
  • Deciding whether to return transcriptText at all, based on the bridge-owned voice.sendTranscript preference, when the provider has access to the current Telegram config

The bridge shows a record_voice action while delivering and sends the final audio with Telegram sendVoice. When a provider returns transcriptText, the bridge attaches it as the voice message caption only; separate transcript messages are no longer sent.

Providers can implement getVoicePromptContribution(view) to inject voice-specific instructions into voice-tagged prompts (for example: "Reply only with the spoken text"). The bridge appends the first non-empty provider contribution when mirror or always mode tags the turn.

Import provider APIs from @llblab/pi-telegram/voice; see the TSDoc on registerTelegramVoiceSynthesisProvider and TelegramVoiceSynthesisProviderResult there for the exact interface.

The provider receives the raw agent text plus optional { lang?, rate? }.

It must return one of:

  • string — path to a ready .ogg or .opus file
  • { audioPath: string, transcriptText?: string }, where audioPath must be OGG/Opus. When transcriptText is present it is attached as the voice message caption. Providers should treat pi-telegram's voice.sendTranscript as the bridge-owned transcript preference instead of inventing a second reply-policy UI.
  • undefined — skip this text block

Important: Providers are fully responsible for producing a clean, TTS-optimised native voice file. The bridge may also run configured outbound voice command templates for users who prefer process-boundary handlers instead of provider extensions.

File format: Telegram sendVoice requires OGG/Opus to display the message as a native voice note (waveform, inline playback). MP3 and other formats are accepted by the API but render as regular audio attachments (music note icon, filename visible). Providers and outbound voice handlers must return .ogg or .opus files. Returning non-OGG files causes the bridge to throw and fall back to text delivery.
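
To make the three return shapes and the format rule concrete, here is a minimal sketch of how a result could be normalized. `ProviderResult`, `normalizeProviderResult`, and `hasNativeVoiceExtension` are hypothetical names, not the bridge's API, and the case-insensitive extension check is an assumption.

```typescript
type ProviderResult =
  | string
  | { audioPath: string; transcriptText?: string }
  | undefined;

interface NormalizedVoice {
  audioPath: string;
  caption?: string;
}

// Assumption: the extension check is case-insensitive.
function hasNativeVoiceExtension(path: string): boolean {
  return /\.(ogg|opus)$/i.test(path);
}

function normalizeProviderResult(
  result: ProviderResult,
): NormalizedVoice | undefined {
  // undefined means the provider skipped this text block.
  if (result === undefined) return undefined;

  const audioPath = typeof result === "string" ? result : result.audioPath;
  if (!hasNativeVoiceExtension(audioPath)) {
    // Throwing here is what lets the caller fall back to text delivery.
    throw new Error(`Expected an .ogg or .opus file, got: ${audioPath}`);
  }

  // Only the object form can carry a transcript caption.
  const caption = typeof result === "string" ? undefined : result.transcriptText;
  return caption ? { audioPath, caption } : { audioPath };
}
```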

Registration returns a disposer function for cleanup. Stable provider registrations pass a durable id in options; omitted ids remain a compatibility path for older providers and receive generated session-local ids. Extensions should call disposers on shutdown or re-register safely on session start when their runtime is recreated.
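
The disposer and id behavior can be pictured with a toy registry. This is a sketch under assumptions: `ProviderRegistry` is hypothetical, the generated id format is invented for illustration, and replacing an existing registration when the same durable id is reused is one plausible reading of "re-register safely".

```typescript
type Synthesize = (text: string) => Promise<string | undefined>;

class ProviderRegistry {
  private providers = new Map<string, Synthesize>();
  private generated = 0;

  register(provider: Synthesize, options?: { id?: string }): () => void {
    // Omitted ids get a generated, session-local id (format is illustrative).
    const id = options?.id ?? `session-local-${++this.generated}`;
    // Assumption: reusing a durable id replaces the earlier registration.
    this.providers.set(id, provider);

    // The disposer removes only the registration it created, so a stale
    // disposer cannot remove a newer provider registered under the same id.
    return () => {
      if (this.providers.get(id) === provider) {
        this.providers.delete(id);
      }
    };
  }

  ids(): string[] {
    return Array.from(this.providers.keys());
  }
}
```

With a durable id, re-registering on session start leaves a single entry rather than accumulating duplicates.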

Outbound Voice Handlers

Users can also configure outboundHandlers with type: "voice" in telegram.json. This is the command-template path for TTS without a provider extension. Reply modes (manual, mirror, always) affect these handlers the same way they affect providers: explicit telegram_voice blocks and automatic mirror/always interception both produce a voice reply plan, then delivery tries configured outbound voice handlers first and registered synthesis providers as progressive fallbacks.

Voice handlers receive the text on stdin in composed pipelines and can use {text}, {lang}, {rate}, {mp3}, and {ogg} placeholders. Set output to "ogg" or another placeholder name when the template writes to a known path:

{
  "voice": { "replyMode": "mirror" },
  "outboundHandlers": [
    {
      "type": "voice",
      "template": [
        "/path/to/tts --write-media {mp3}",
        "ffmpeg -y -i {mp3} -c:a libopus -b:a 32k -ar 16000 -ac 1 {ogg}"
      ],
      "output": "ogg"
    }
  ]
}
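
As an illustration of how the placeholders in such a template might be filled, here is a small substitution sketch. `expandTemplate` and `resolveOutputPath` are hypothetical helpers, the temp paths are invented, and no shell quoting is applied; the bridge's real expansion rules may differ.

```typescript
type Placeholders = Record<string, string>;

// Replace {name} tokens with known values; unknown tokens are left untouched.
function expandTemplate(step: string, values: Placeholders): string {
  return step.replace(/\{(\w+)\}/g, (match: string, name: string): string => {
    const value = values[name];
    return value !== undefined ? value : match;
  });
}

// "output" names the placeholder whose path holds the final artifact.
function resolveOutputPath(
  output: string,
  values: Placeholders,
): string | undefined {
  return values[output];
}

// The two pipeline steps from the config above, with illustrative temp paths.
const steps = [
  "/path/to/tts --write-media {mp3}",
  "ffmpeg -y -i {mp3} -c:a libopus -b:a 32k -ar 16000 -ac 1 {ogg}",
];
const values: Placeholders = { mp3: "/tmp/reply.mp3", ogg: "/tmp/reply.ogg" };

const commands = steps.map((step) => expandTemplate(step, values));
const artifact = resolveOutputPath("ogg", values);
```

Here `commands` holds the two expanded command lines and `artifact` is the path the second step writes, which is the file handed to voice delivery.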

Priority for outbound voice delivery is: configured outboundHandlers with type: "voice" in their telegram.json order, then programmatic voice outbound handlers, then registered voice synthesis providers. Provider extensions are the zero-config tail of the same pipeline: they handle voice when no explicit configured handler succeeds, but they do not override operator-configured handlers. If multiple providers are registered, only one handles a given voice reply: the first provider that returns a valid .ogg/.opus artifact wins. Providers that return undefined explicitly pass to the next provider; providers that throw or return invalid output are recorded and the next fallback is tried.
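
The priority order above can be sketched as a tiered loop. This is a simplified model rather than the bridge's code: `Tier`, `VoiceSource`, and `deliverVoice` are hypothetical names, and each source is reduced to a function that returns an audio path, returns undefined, or throws.

```typescript
type VoiceSource = (text: string) => Promise<string | undefined>;

interface Tier {
  name: "configured" | "programmatic" | "provider";
  sources: VoiceSource[];
}

async function deliverVoice(
  text: string,
  tiers: Tier[], // expected order: configured, programmatic, provider
  record: (message: string) => void,
): Promise<{ audioPath: string; tier: Tier["name"] }> {
  for (const tier of tiers) {
    for (const source of tier.sources) {
      try {
        const audioPath = await source(text);
        // undefined is an explicit pass to the next source.
        if (audioPath === undefined) continue;
        if (!/\.(ogg|opus)$/i.test(audioPath)) {
          throw new Error(`invalid output: ${audioPath}`);
        }
        // The first valid artifact wins; later sources are never called.
        return { audioPath, tier: tier.name };
      } catch (error) {
        // Throws and invalid output are recorded, then the next fallback runs.
        const message = error instanceof Error ? error.message : String(error);
        record(`${tier.name}: ${message}`);
      }
    }
  }
  // Nothing succeeded: the caller is expected to fall back to text delivery.
  throw new Error("no handler or provider could deliver the voice reply");
}
```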

Provider with transcript caption (controlled by user toggle)

When the user's "Send Transcript" toggle is ON, return the clean spoken text as transcriptText. The bridge attaches it as the caption on the voice message. When the toggle is OFF, return only the audio path (no transcriptText).

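import {
  getTelegramVoiceSendTranscript,
  registerTelegramVoiceSynthesisProvider,
} from "@llblab/pi-telegram/voice";

// rewriteWithSpeechTags, myTTS, and getCurrentTelegramConfigView are
// placeholders for helpers owned by your extension.
registerTelegramVoiceSynthesisProvider(
  async (text, options) => {
    // Rewrite for speech, then synthesize to an OGG/Opus file.
    const rewritten = rewriteWithSpeechTags(text);
    const audioPath = await myTTS(rewritten, { language: options?.lang });
    // Honor the bridge-owned "Send Transcript" toggle.
    const sendTranscript = getTelegramVoiceSendTranscript(
      getCurrentTelegramConfigView(),
    );
    // The caption carries the original text, not the tagged rewrite.
    return sendTranscript ? { audioPath, transcriptText: text } : { audioPath };
  },
  { id: "my-voice-provider/tts" }, // durable id for stable registration
);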

getCurrentTelegramConfigView() represents whatever current TelegramConfig view your extension already owns or receives; pi-telegram does not require providers to read config directly. The bridge never sends a separate transcript message. Caption-only is the "ON" behavior.

Surfacing provider diagnostics

Voice provider extensions can record runtime events that appear in /telegram-status alongside pi-telegram's own events:

import { recordTelegramRuntimeEvent } from "@llblab/pi-telegram/outbound";

recordTelegramRuntimeEvent("voice-provider", new Error("TTS failed"), {
  phase: "tts",
  text: text.slice(0, 50),
});

recordTelegramRuntimeEvent writes to the same event ring that pi-telegram uses. Events are visible via /telegram-status in Telegram. Calls are silently dropped if pi-telegram is not loaded.
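
One way a provider might use this is to wrap its TTS call so failures are recorded and then rethrown, letting the bridge move on to the next fallback. The sketch below injects the recorder as a parameter so it stays self-contained; `withTtsDiagnostics` and `RecordEvent` are hypothetical, and the recorder signature is only inferred from the call shown above.

```typescript
// Assumed to mirror the call shape shown above: (source, error, details).
type RecordEvent = (
  source: string,
  error: Error,
  details: Record<string, unknown>,
) => void;

async function withTtsDiagnostics<T>(
  record: RecordEvent,
  text: string,
  run: () => Promise<T>,
): Promise<T> {
  try {
    return await run();
  } catch (error) {
    const failure = error instanceof Error ? error : new Error(String(error));
    // Keep only a short prefix of the text in diagnostics.
    record("voice-provider", failure, {
      phase: "tts",
      text: text.slice(0, 50),
    });
    // Rethrow so the caller still sees the failure.
    throw failure;
  }
}
```

In a real extension the `record` argument would be `recordTelegramRuntimeEvent`, provided its signature matches the assumed one.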

Voice Extension Section

Voice provider extensions can register a Voice Extension Section (settings UI) via registerTelegramSection. The section can expose provider-specific controls such as TTS voice, language, speech style, transcript behavior, or STT/TTS enablement. Reply mode is a core pi-telegram setting and belongs in the built-in Settings menu.

Note on resume: Because the previous automatic persistent re-registration system has been removed, extensions are responsible for re-registering their Voice Extension Section on session_start if they want the menu to survive a pi resume. See registerTelegramSection from @llblab/pi-telegram/sections.

Prompt Guidance

The bridge keeps voice prompt context compact and policy-owned. It adds [voice] reply mode: ... only when telegram.json explicitly contains a valid voice.replyMode. With hidden (no configured mode), behavior matches manual but the prompt stays silent. When the mode is explicit:

  • voice-originated manual turns add [voice] reply mode: manual
  • voice-originated mirror turns add [voice] reply mode: mirror
  • always mode adds [voice] reply mode: always for every turn

If voice context later contains multiple fields, the bridge renders it as a [voice] list. The marker is appended after [outputs] when handler output exists, otherwise after [attachments]. Voice inputs also appear in [attachments] with their downloaded file names, MIME data, and handler output, so agents can infer concrete voice-file context from attachment metadata.
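
The marker rules above fit in a few lines. This is a minimal sketch with a hypothetical function name (`voicePromptLine`); it covers only the single-field form, not the multi-field [voice] list or marker placement.

```typescript
type StoredMode = "manual" | "mirror" | "always";

function voicePromptLine(
  storedMode: StoredMode | undefined, // undefined models the hidden default
  voiceOriginated: boolean,
): string | undefined {
  // Hidden: behaves like manual, but adds nothing to the prompt.
  if (storedMode === undefined) return undefined;
  // Always: the marker is added on every turn.
  if (storedMode === "always") return "[voice] reply mode: always";
  // Manual and mirror: only voice-originated turns get the marker.
  return voiceOriginated ? `[voice] reply mode: ${storedMode}` : undefined;
}
```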

Voice synthesis providers can supply prompt guidance through getVoicePromptContribution(view), but provider text should stay optional and provider-specific. Reply-mode context belongs to pi-telegram.

Fallback Behavior

If voice generation fails

  1. The bridge records the failure via recordRuntimeEvent
  2. The voice sender throws an error, which the runtime catches
  3. The runtime falls back to sending the planned text reply (outbound markup stripped, replyMarkup preserved)

If no voice handler or synthesis provider is available

  • The voice sender throws because no configured handler or synthesis provider can deliver the voice reply
  • The runtime catches the error and falls back to text delivery

If the provider returns a non-OGG file

  • ensureTelegramVoiceFileFormat rejects the file (only .ogg and .opus are accepted)
  • The voice sender throws and the runtime falls back to text delivery
  • The provider should handle format conversion internally before returning the path
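
All three failure cases above end the same way: the voice sender throws and the runtime delivers text instead. A compact sketch of that shape follows, using a hypothetical `ReplyChannel` interface. It assumes the planned text has already had outbound markup stripped, and it records the failure in the catch block for brevity.

```typescript
// Hypothetical transport surface for illustration.
interface ReplyChannel {
  sendVoice(text: string): Promise<void>;
  sendText(text: string): Promise<void>;
  recordFailure(error: Error): void;
}

async function deliverReply(
  channel: ReplyChannel,
  voiceText: string,
  plannedText: string, // assumed to be markup-stripped already
): Promise<"voice" | "text"> {
  try {
    await channel.sendVoice(voiceText);
    return "voice";
  } catch (error) {
    // Any voice failure (no provider, bad format, TTS error) lands here.
    channel.recordFailure(
      error instanceof Error ? error : new Error(String(error)),
    );
    await channel.sendText(plannedText);
    return "text";
  }
}
```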

Telegram Voice Limits

  • Duration: Up to ~60 minutes per voice message
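  • File size: The Telegram Bot API currently documents up to 50 MB for uploads via sendVoice; the 20 MB figure is the Bot API download limit (getFile), which applies to inbound files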
  • Format: OGG Opus is native; MP3 and other formats render as regular audio attachments
  • Splitting: The bridge does not split long responses into multiple voice messages. Chunking is the provider's responsibility
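
Since chunking is left to providers, one simple approach is to split on sentence boundaries under a character budget. The sketch below is only one possible strategy: `splitForSpeech` is a hypothetical helper, the character budget is a stand-in for duration, and a single sentence longer than the budget is kept whole.

```typescript
function splitForSpeech(text: string, maxChars: number): string[] {
  if (maxChars <= 0) throw new RangeError("maxChars must be positive");

  // A sentence is a run of text plus its trailing punctuation and whitespace.
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [];
  const chunks: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    // Start a new chunk when adding this sentence would exceed the budget.
    if (current !== "" && current.length + sentence.length > maxChars) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }

  // Flush the last chunk, skipping whitespace-only remainders.
  if (current.trim() !== "") chunks.push(current.trim());
  return chunks;
}
```

A provider could synthesize each chunk separately; how multiple chunks are then delivered depends on the provider and is not covered here.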

Configuration

Bridge config (telegram.json)

{
  "voice": {
    "replyMode": "manual"
  }
}

Valid stored values: "manual", "mirror", "always". Missing or invalid values are shown in Settings as hidden, behave like manual, and stay silent in prompt context.

The bridge reads voice.replyMode from the config when building a turn.
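
The validation rule above (only three stored values are valid, everything else is hidden) can be sketched as follows. `resolveReplyMode` is a hypothetical name and the config is treated as untyped parsed JSON.

```typescript
type StoredMode = "manual" | "mirror" | "always";
type EffectiveMode = StoredMode | "hidden";

const STORED_MODES: readonly StoredMode[] = ["manual", "mirror", "always"];

function resolveReplyMode(config: unknown): EffectiveMode {
  // Anything that is not an object with a voice object falls back to hidden.
  if (typeof config !== "object" || config === null) return "hidden";
  const voice = (config as { voice?: unknown }).voice;
  if (typeof voice !== "object" || voice === null) return "hidden";

  // Missing or invalid values are treated as hidden.
  const raw = (voice as { replyMode?: unknown }).replyMode;
  return STORED_MODES.find((mode) => mode === raw) ?? "hidden";
}
```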

Provider config

Provider-specific settings (voice ID, language, speech style, transcript behavior, STT/TTS enablement) are owned by the voice provider extension. Reply mode is owned by pi-telegram's voice.replyMode and configured from the built-in pi-telegram Settings menu, not duplicated in provider UIs.