feat(misc): voice-chat-widget — raw-WS + Vercel SDK versions #53
End-to-end Next.js example combining all three Smallest products in parallel: Pulse STT (live transcription with ITN), Electron LLM (OpenAI-compatible chat-completions streaming), and Lightning v3.1 TTS (streaming audio over WebSocket). One SMALLEST_API_KEY powers everything. The README is a deep-dive on the agentic ITN config (finalize_on_words, eou_timeout_ms, close_stream) that customers most commonly get wrong, with worked examples (currency, phone numbers, dates, emails, decimals).
EntelligenceAI PR Summary
Adds a complete
Confidence Score: 2/5 - Changes Needed
Not safe to merge: this PR introduces a real-time voice chat widget with solid architectural ambition, but ships with multiple Medium-severity correctness and race-condition bugs that would produce observable failures in normal usage. The unguarded
Key Findings:
Files requiring special attention
|
| } | ||
| export async function POST(req: Request) { | ||
| const { message } = await req.json(); |
Unguarded `req.json()` throws on malformed or empty body
req.json() throws a SyntaxError when the body is empty or not valid JSON, and there is no try/catch, so the Edge runtime surfaces an unhandled rejection instead of a clean error response.
Suggested change:
| - const { message } = await req.json(); |
| + let message: string | undefined; |
| + try { |
| +   ({ message } = await req.json()); |
| + } catch { |
| +   return new Response("Invalid JSON body", { status: 400 }); |
| + } |
| + if (!message) { |
| +   return new Response("Missing 'message' field", { status: 400 }); |
| + } |
Prompt to fix with AI
Copy this prompt into your AI coding assistant to fix this issue.
In misc/voice-chat-widget/app/api/chat/route.ts, line 62, `const { message } = await req.json();` is not wrapped in a try/catch. Replace it with a try/catch block that returns a 400 response on parse failure, and add a check that `message` is a non-empty string before proceeding. Example:
let message: string | undefined;
try {
({ message } = await req.json());
} catch {
return new Response("Invalid JSON body", { status: 400 });
}
if (!message) {
return new Response("Missing 'message' field", { status: 400 });
}
Insert this before the existing `const key = process.env.SMALLEST_API_KEY;` check.
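One way to keep this parse-and-validate step easy to test is to lift it out of the handler. The sketch below is hypothetical: `parseChatBody` and `ParseResult` are names introduced here for illustration, not part of the PR, and it assumes the route would read the raw body with `await req.text()` and turn a failed result into `new Response(error, { status })`.

```typescript
// Result of validating a chat request body: either the message or an HTTP error.
type ParseResult =
  | { ok: true; message: string }
  | { ok: false; status: number; error: string };

// Hypothetical helper: parse the raw body text and validate the `message` field.
function parseChatBody(raw: string): ParseResult {
  let body: unknown;
  try {
    body = JSON.parse(raw); // throws SyntaxError on empty or malformed input
  } catch {
    return { ok: false, status: 400, error: "Invalid JSON body" };
  }
  // Optional chaining also covers bodies that parse to null or to a primitive.
  const message = (body as { message?: unknown } | null)?.message;
  if (typeof message !== "string" || message.trim() === "") {
    return { ok: false, status: 400, error: "Missing 'message' field" };
  }
  return { ok: true, message };
}

const emptyBody = parseChatBody("");
const missingField = parseChatBody("{}");
const validBody = parseChatBody('{"message":"hi"}');
console.log(emptyBody, missingField, validBody);
```

Compared with the `if (!message)` check in the suggestion, this variant also rejects non-string values such as `{"message": 42}`, which matches the prompt's "non-empty string" wording.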
| onEnd: () => { | ||
| setStatus("idle"); | ||
| setMessages((prev) => | ||
| prev.map((m) => | ||
| m.id !== messageId | ||
| ? m | ||
| : { ...m, words: m.words.map((w) => ({ ...w, spoken: true, current: false })) } |
onEnd fires on WS close, not audio complete — status goes idle while audio plays
useLightningTTS.ts fires onEnd inside ws.onclose, which triggers as soon as the WebSocket closes (all data received), not when the AudioContext finishes playing the buffered PCM. page.tsx calls setStatus('idle') in onEnd, so the status pill returns to 'idle' and words are marked 'spoken' while the AudioContext is still playing the last several seconds of scheduled audio.
Prompt to fix with AI
In `useLightningTTS.ts`, instead of calling `onEnd?.()` directly in `ws.onclose`, schedule it on the AudioContext timeline: after the final chunk is received and the WS is about to close, compute `delay = Math.max(0, nextStartRef.current - ctx.currentTime)` and call `setTimeout(onEnd, delay * 1000)`. This defers `onEnd` until the AudioContext has actually finished playing all buffered audio, so `page.tsx`'s `setStatus('idle')` and 'spoken' word marking align with real playback completion.
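The delay arithmetic described in this prompt can be sketched in isolation. The helper names below (`remainingPlaybackMs`, `scheduleOnEnd`) are assumptions made for illustration; in the hook, the two inputs would come from `nextStartRef.current` and `ctx.currentTime`, both measured in seconds.

```typescript
// Milliseconds of scheduled audio still to play, clamped at zero when
// playback has already caught up with the last scheduled chunk.
function remainingPlaybackMs(nextStartTime: number, currentTime: number): number {
  return Math.max(0, nextStartTime - currentTime) * 1000;
}

// Defer `onEnd` until the buffered audio has finished playing.
// The timer handle is returned so an interrupting caller can cancel it.
function scheduleOnEnd(
  nextStartTime: number,
  currentTime: number,
  onEnd: () => void,
): ReturnType<typeof setTimeout> {
  return setTimeout(onEnd, remainingPlaybackMs(nextStartTime, currentTime));
}

// Last chunk is scheduled to end at 5.5 s while the context clock reads 3 s.
const pendingMs = remainingPlaybackMs(5.5, 3);
// The clock is already past the end of the scheduled audio.
const finishedMs = remainingPlaybackMs(1, 3);
console.log(pendingMs, finishedMs);
```

Returning the timer handle is a design choice rather than something the review asks for: it would let `stop()` or a new `speak()` call `clearTimeout` so a deferred `onEnd` from an interrupted utterance never fires.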
| ws.onclose = () => { | ||
| onEnd?.(); | ||
| setSpeaking(false); | ||
| wsRef.current = null; | ||
| }; |
Old WS `onclose` corrupts new utterance's state when `speak()` interrupts in-flight TTS
When speak() is called while a WS is active, the old socket is closed and wsRef.current is immediately overwritten with the new WS (line 96). When the old socket's onclose fires asynchronously, it sets setSpeaking(false) and wsRef.current = null, clobbering the new utterance's state. The replay button in page.tsx:368 is a direct trigger: clicking it while audio plays leaves speaking=false and a null wsRef, so a subsequent stop() call silently does nothing.
Suggested change:
| - ws.onclose = () => { |
| -   onEnd?.(); |
| -   setSpeaking(false); |
| -   wsRef.current = null; |
| - }; |
| + ws.onclose = () => { |
| +   if (wsRef.current === ws) { |
| +     onEnd?.(); |
| +     setSpeaking(false); |
| +     wsRef.current = null; |
| +   } |
| + }; |
Prompt to fix with AI
In `misc/voice-chat-widget/lib/useLightningTTS.ts`, lines 145-149, the `ws.onclose` handler unconditionally calls `setSpeaking(false)` and sets `wsRef.current = null`. This is wrong when `speak()` is called while a previous WebSocket is still open: the old socket's `onclose` fires after `wsRef.current` has already been replaced with the new WebSocket, so it clobbers the new utterance's state.
Fix: guard the handler so it only updates state if `wsRef.current` still refers to THIS socket:
```ts
ws.onclose = () => {
  if (wsRef.current === ws) {
    onEnd?.();
    setSpeaking(false);
    wsRef.current = null;
  }
};
```
This ensures that when `speak()` interrupts an in-flight TTS (e.g., replay button clicked during active streaming), the stale `onclose` from the old socket does not flip `speaking` back to false or null out the new active WebSocket reference.
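To see the identity guard at work without a browser, the race can be simulated with stand-in objects. Everything below is hypothetical scaffolding: `FakeSocket` and `Speaker` are not names from the PR, and the close event is triggered manually to model the old socket's `onclose` arriving after the new utterance has started.

```typescript
// Stand-in for a WebSocket; only the close callback matters for this race.
class FakeSocket {
  onclose: (() => void) | null = null;
  close(): void {
    this.onclose?.();
  }
}

// Minimal model of the hook's state: the active socket and the speaking flag.
class Speaker {
  current: FakeSocket | null = null;
  speaking = false;
  endCount = 0;

  speak(): FakeSocket {
    const ws = new FakeSocket();
    ws.onclose = () => {
      // Stale socket: a newer utterance owns the state, so leave it alone.
      if (this.current !== ws) return;
      this.endCount += 1;
      this.speaking = false;
      this.current = null;
    };
    this.current = ws;
    this.speaking = true;
    return ws;
  }
}

const speaker = new Speaker();
const first = speaker.speak();
const second = speaker.speak(); // replay clicked while the first is in flight
first.close();                  // late close event from the old socket
const afterStaleClose = {
  speaking: speaker.speaking,
  stillSecond: speaker.current === second,
};
second.close();                 // the active utterance finishes normally
const afterRealClose = { speaking: speaker.speaking, endCount: speaker.endCount };
console.log(afterStaleClose, afterRealClose);
```

One side effect of the guard is that `onEnd` is skipped for the interrupted utterance. Whether that is desirable depends on what `page.tsx` does in `onEnd`; the review does not say.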
| }, []); | ||
| const start = useCallback(async () => { | ||
| if (recording) return; |
Stale-state double-start guard allows concurrent WS and AudioContext creation
start() is async and setRecording(true) is only reached at line 158 after mic acquisition and worklet setup. Two rapid calls both see recording=false, pass the guard, open two WebSockets and two AudioContexts; the second overwrites wsRef.current at line 87 without closing the first, permanently orphaning it.
Suggested change:
| - if (recording) return; |
| + const startingRef = useRef(false); |
| + const start = useCallback(async () => { |
| +   if (recording || startingRef.current) return; |
| +   startingRef.current = true; |
| +   try { |
Prompt to fix with AI
In `misc/voice-chat-widget/lib/usePulseSTT.ts`, the `start()` function at line 57 guards double-invocation with `if (recording) return`, but `recording` is React state that is only set to `true` at line 158 after several async operations. Two rapid calls both pass the guard and create duplicate WebSockets and AudioContexts; only the second's WebSocket survives in `wsRef.current`, leaking the first. Fix: add a `useRef<boolean>(false)` ref (e.g., `startingRef`) that is set to `true` immediately on entry and back to `false` after setup completes (or on error). Change the guard to `if (recording || startingRef.current) return;` and wrap the async body in try/finally to reset `startingRef.current = false`.
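The in-flight guard with try/finally can be sketched outside React. This is an illustration under assumptions: `makeGuardedStart` is a hypothetical factory, the plain `starting` variable plays the role of `startingRef.current`, and `setup` stands in for the mic, worklet, and WebSocket setup inside `start()`.

```typescript
// Wrap an async setup routine so overlapping calls cannot run it twice.
function makeGuardedStart(setup: () => Promise<void>): () => Promise<boolean> {
  let starting = false;  // set synchronously on entry, like startingRef.current
  let recording = false; // set only once setup has completed; a real hook would reset it in stop()

  return async function start(): Promise<boolean> {
    if (recording || starting) return false; // a start is active or in flight
    starting = true;
    try {
      await setup();
      recording = true;
      return true;
    } finally {
      starting = false; // reset on success and on error
    }
  };
}

const counter = { setups: 0 };
const start = makeGuardedStart(async () => {
  counter.setups += 1;     // stands in for opening the WebSocket and AudioContext
  await Promise.resolve(); // simulated async mic and worklet setup
});

void start();
void start(); // second rapid call is turned away by the in-flight flag
console.log("setup runs:", counter.setups);
```

The flag has to be a ref (or a closure variable, as here) rather than React state, because state updates are not visible to a second call that runs before the next render.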
| const ws = new WebSocket(url); | ||
| ws.binaryType = "arraybuffer"; | ||
| wsRef.current = ws; |
WebSocket opened before getUserMedia — orphaned on permission denial
The WebSocket is created and stored in wsRef.current before getUserMedia() is awaited. If getUserMedia() throws (permission denied), wsRef.current holds an open socket. The next start() call overwrites wsRef.current at the same line without closing the previous socket, permanently leaking the upstream connection.
Suggested change:
| - const ws = new WebSocket(url); |
| - ws.binaryType = "arraybuffer"; |
| - wsRef.current = ws; |
| + // Get mic first — if permission is denied, no WS is opened. |
| + const stream = await navigator.mediaDevices.getUserMedia({ |
| +   audio: { echoCancellation: true, noiseSuppression: true, channelCount: 1 }, |
| + }); |
| + streamRef.current = stream; |
| + const base = proxyUrl || `ws://${location.hostname}:3031/stt`; |
| + const qs = new URLSearchParams({ |
| +   language, |
| +   encoding: "linear16", |
| +   sample_rate: "16000", |
| +   itn_normalize: "true", |
| +   finalize_on_words: "false", |
| +   eou_timeout_ms: "1000", |
| + }); |
| + const url = `${base}?${qs}`; |
| + const ws = new WebSocket(url); |
| + ws.binaryType = "arraybuffer"; |
| + wsRef.current = ws; |
Prompt to fix with AI
In `misc/voice-chat-widget/lib/usePulseSTT.ts`, the `start()` function opens a WebSocket at line 85 (`const ws = new WebSocket(url)`) before awaiting `navigator.mediaDevices.getUserMedia()` at line 113. If `getUserMedia()` throws, the open WebSocket is stored in `wsRef.current` and leaked when `start()` is retried (the retry overwrites `wsRef.current` without closing the old socket). Fix: move the `getUserMedia()` call and the stream/AudioContext setup (lines 113–119) to come BEFORE the WebSocket construction (lines 84–87), so that if mic permission is denied, no WebSocket is ever opened.
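The reordering can be expressed with injected dependencies so the "no socket on denial" property is visible without a browser. The names `buildSttUrl` and `startSession` are hypothetical, as is the `ws://localhost:3031/stt` base; the query parameters are copied from the suggested change in this comment.

```typescript
// Build the STT URL with the query parameters shown in the suggested change.
function buildSttUrl(base: string, language: string): string {
  const qs = new URLSearchParams({
    language,
    encoding: "linear16",
    sample_rate: "16000",
    itn_normalize: "true",
    finalize_on_words: "false",
    eou_timeout_ms: "1000",
  });
  return `${base}?${qs}`;
}

// Acquire the microphone first; only open the socket once that has succeeded.
async function startSession<S, W>(
  getMic: () => Promise<S>,
  openSocket: (url: string) => W,
  url: string,
): Promise<{ stream: S; socket: W }> {
  const stream = await getMic();  // rejects on permission denial
  const socket = openSocket(url); // never reached if getMic rejected
  return { stream, socket };
}

const sttUrl = buildSttUrl("ws://localhost:3031/stt", "en");
const openedUrls: string[] = [];

// Simulate a denied permission prompt: the socket factory must not be called.
startSession(
  () => Promise.reject(new Error("Permission denied")),
  (url) => {
    openedUrls.push(url);
    return url;
  },
  sttUrl,
).catch(() => console.log("sockets opened after denial:", openedUrls.length));
```

In the hook, `getMic` would wrap `navigator.mediaDevices.getUserMedia(...)` and `openSocket` would wrap `new WebSocket(url)`. A stricter variant could also close any socket still held in `wsRef.current` before opening a new one.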
- New folder misc/voice-chat-widget-with-vercel-sdk/: same UX as the raw-WS sibling shipped in #53, rebuilt on smallestai-vercel-provider + ai + @ai-sdk/openai-compatible. STT goes browser-direct via auth: 'query', LLM via streamText, streaming TTS still raw-WS (SDK doesn't wrap it yet).
- Raw-WS README: appended ITN gotcha #8: spoken 'and' inside dollar amounts ('five hundred and twenty five dollars') breaks the cardinal entity and produces '500 and 25 dollars', not '$525'. Workaround: drop the 'and'.
- Both .gitignore files: exclude *.tsbuildinfo.
Summary
Two sibling cookbook examples under `misc/` that show the same live voice-chat UX two ways. Same UI, same UX, same `SMALLEST_API_KEY` for all three services (Pulse STT, Electron LLM, Lightning v3.1 TTS); different data-plumbing layer.
- `misc/voice-chat-widget/`
- `misc/voice-chat-widget-with-vercel-sdk/`: `smallestai-vercel-provider` + `ai` + `@ai-sdk/openai-compatible`. STT goes browser-direct, no STT proxy. Cleanest path for teams already on the Vercel AI SDK.

Each folder has its own README. The raw-WS folder has the ITN deep-dive (`itn_normalize`, `finalize_on_words=false`, `eou_timeout_ms`, `close_stream`: the agentic pattern from Smallest's docs). The Vercel-SDK folder has a side-by-side param mapping (snake_case strings ↔ camelCase booleans) so anyone porting between the two only has to do mechanical renames.

What's in each folder
- `app/page.tsx`: chat UI with push-to-talk mic + live partials in the input + sentence-boundary flush from LLM stream → TTS
- `lib/usePulseSTT.ts`: STT hook (raw WS / SDK)
- `lib/useLightningTTS.ts`: Lightning TTS streaming over WS (same in both; SDK doesn't yet wrap streaming TTS)
- `app/api/chat/route.ts`: LLM proxy (hand-rolled SSE / Vercel `streamText`)
- `proxy.mjs`: WebSocket bridge (`/stt` + `/tts` in raw-WS, `/tts`-only in SDK version)
- `README.md`

Test plan
- `npm install && npm run dev` boots Next + proxy cleanly in both folders
- `auth: 'query'` in the proxy logs (that's by design)

Note for reviewers
- `tsconfig.tsbuildinfo` (TS incremental cache) is now in `.gitignore` for both folders; was accidentally committed once and removed.
- `.env.local` never committed; `.env.example` is the only env file in either folder.
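
The sentence-boundary flush listed for `app/page.tsx` (LLM stream → TTS) might look roughly like the following. This is a sketch under assumptions, not the PR's code: `SentenceBuffer` is a hypothetical name, and the boundary rule (terminal punctuation followed by whitespace) is one plausible choice.

```typescript
// Buffers streamed LLM text and releases only complete sentences,
// so TTS receives whole sentences rather than arbitrary token fragments.
class SentenceBuffer {
  private pending = "";

  // Append a streamed chunk; return any sentences it completed.
  push(chunk: string): string[] {
    this.pending += chunk;
    const sentences: string[] = [];
    // A sentence ends at ., ! or ? followed by whitespace, so "3.14" is not split.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.pending.slice(start, end).trim());
      start = end;
    }
    this.pending = this.pending.slice(start);
    return sentences;
  }

  // Call when the LLM stream ends to release any trailing text.
  flush(): string[] {
    const rest = this.pending.trim();
    this.pending = "";
    return rest ? [rest] : [];
  }
}

const buffer = new SentenceBuffer();
const spoken: string[] = [];
for (const chunk of ["Hello the", "re. How are", " you? I am", " fine"]) {
  spoken.push(...buffer.push(chunk)); // each completed sentence would go to TTS
}
spoken.push(...buffer.flush());
console.log(spoken);
```

Because a boundary requires trailing whitespace, the final sentence of a reply is only released by `flush()`, which is why the end-of-stream call matters.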