feat(speech-to-text): add Pulse STT benchmarking scripts #64
Two self-contained reference scripts for measuring Pulse STT accuracy
(WER) + latency against a user-supplied evaluation dataset:
ping_pulse_offline.py — POST /waves/v1/stt/ (pre-recorded)
reports WER, per-clip latency, RTF (p50/p90/p95)
ping_pulse_streaming.py — WSS /waves/v1/stt/live (real-time WS)
reports WER, tail latency (p50/p90/p95)
Both scripts:
- Single-file, no repo-internal imports — copy anywhere and run
- Accept a folder layout: DATA_DIR/audio/ + DATA_DIR/metadata.csv
- --language en | hi (the two with mature WER normalisation pipelines)
- Print JSON aggregate at end (pipe-able to jq)
- Use stdlib urllib for HTTP, websockets package for streaming
Linked from the Pulse STT 'Measuring Latency' docs page as the canonical
reference implementation of the methodology described there.
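
As a rough illustration of the aggregate these scripts are described as reporting, the sketch below computes p50/p90/p95 over per-clip latencies and real-time factor (RTF) and prints one JSON object. It is not taken from the scripts: the nearest-rank percentile method, the `(latency_seconds, audio_seconds)` input shape, and the output key names are all assumptions made for this example.

```python
import json
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def rtf(processing_seconds, audio_seconds):
    """Real-time factor: processing time divided by audio duration (lower is faster)."""
    return processing_seconds / audio_seconds


def rtfx(processing_seconds, audio_seconds):
    """Inverse real-time factor: seconds of audio handled per second of processing."""
    return audio_seconds / processing_seconds


def summarize(clips):
    """clips: list of (latency_seconds, audio_seconds) tuples (hypothetical input shape)."""
    latencies = [lat for lat, _ in clips]
    rtfs = [rtf(lat, dur) for lat, dur in clips]
    return {
        "clips": len(clips),
        "latency_s": {f"p{p}": percentile(latencies, p) for p in (50, 90, 95)},
        "rtf": {f"p{p}": percentile(rtfs, p) for p in (50, 90, 95)},
    }


if __name__ == "__main__":
    # Made-up measurements, only to show the output shape.
    demo = [(0.70, 8.0), (0.65, 6.5), (0.80, 10.0), (0.75, 9.0), (0.90, 12.0)]
    print(json.dumps(summarize(demo), indent=2))
```

Because RTFx is the reciprocal of RTF, an RTF of 0.09 corresponds to an RTFx of roughly 11.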
Replace the 'scripts are not yet public' Note with direct links to the
two reference scripts now in smallest-inc/cookbook#64:
- speech-to-text/benchmarks/ping_pulse_offline.py
- speech-to-text/benchmarks/ping_pulse_streaming.py
Scripts implement the methodology in this page end-to-end (WER +
p50/p90/p95 latency). Docs page stays concept-first; cookbook holds the
runnable code. Mirrors existing pattern: transcribe-python.py,
websocket-python.py, mic-input-python.py.
Live end-to-end test on a fresh venv with a 5-clip eval dataset caught two
README errors:
1. metadata.csv columns: header was documented as 'filename,reference'
but both scripts validate for 'audio_filename,transcript' (and exit
with a hard error on mismatch). Fixed README to match the script's
actual contract.
2. JSON output location: README didn't mention the scripts write results
to DATA_DIR/results_<mode>_<lang>.json. Documented the filenames,
shape, and a corrected jq example.
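
The contract described in the two fixes above can be sketched as follows. This is an illustration, not the scripts' code: the function names are hypothetical, and the `mode` values passed to `results_path` (for example `offline` or `streaming`) are an assumption, since the commit only gives the `results_<mode>_<lang>.json` pattern.

```python
import csv
import sys
from pathlib import Path

# Column names the commit message says both scripts validate for.
REQUIRED_COLUMNS = ("audio_filename", "transcript")


def load_metadata(data_dir):
    """Return [(audio_path, reference_transcript), ...] or exit on a header mismatch."""
    data_dir = Path(data_dir)
    csv_path = data_dir / "metadata.csv"
    with csv_path.open(newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        header = reader.fieldnames or []
        missing = [col for col in REQUIRED_COLUMNS if col not in header]
        if missing:
            # Hard error, mirroring the behaviour described in the commit message.
            sys.exit(f"{csv_path}: missing column(s) {missing}; found {header}")
        return [
            (data_dir / "audio" / row["audio_filename"], row["transcript"])
            for row in reader
        ]


def results_path(data_dir, mode, lang):
    """Build DATA_DIR/results_<mode>_<lang>.json (mode values are assumed)."""
    return Path(data_dir) / f"results_{mode}_{lang}.json"
```

Failing fast on the header means a CSV that still uses the old `filename,reference` columns is rejected before any audio is sent.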
Verified both scripts now run end-to-end against the corrected README:
ping_pulse_offline.py --language en --model pulse-pro
-> 4/5 clips at 0% WER, 1 at 66% (TTS clip too short — model
truncated 'this is a test')
-> corpus WER 10.26%, latency p50/p95 = 0.69/0.78s, RTFx ~32-49
ping_pulse_streaming.py --language en
-> 5/5 clips at 0% WER
-> tail latency p50/p95 = 250/310ms
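
The offline result above (four clips at 0% WER, one at 66%, corpus WER 10.26%) makes sense if corpus WER pools errors over all reference words instead of averaging per-clip percentages. The sketch below shows that difference with a pure-Python word-level edit distance. It is a stand-in for illustration only: the transcripts are invented, the lowercase-and-split normalisation is a simplification, and the scripts themselves are described as depending on `jiwer` and `whisper-normalizer`.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def tokenize(text):
    """Simplified normalisation: lowercase and split on whitespace."""
    return text.lower().split()


def clip_wer(reference, hypothesis):
    ref = tokenize(reference)
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, tokenize(hypothesis)) / len(ref)


def corpus_wer(pairs):
    """Total word errors divided by total reference words across all clips."""
    errors = sum(edit_distance(tokenize(r), tokenize(h)) for r, h in pairs)
    words = sum(len(tokenize(r)) for r, _ in pairs)
    if words == 0:
        raise ValueError("no reference words")
    return errors / words


if __name__ == "__main__":
    # Invented transcripts: one long clean clip, one short truncated clip.
    pairs = [
        ("the quick brown fox jumps over the lazy dog today",
         "the quick brown fox jumps over the lazy dog today"),
        ("hello there this is a test", "hello there"),
    ]
    per_clip = [clip_wer(r, h) for r, h in pairs]
    print("per-clip WER:", per_clip)
    print("mean of per-clip WER:", sum(per_clip) / len(per_clip))
    print("corpus WER:", corpus_wer(pairs))
```

A short clip with a truncated transcript dominates the per-clip mean but moves the pooled corpus figure much less, because it contributes few reference words.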
End-to-end test on fresh venv ✅Created Bug caught + fixed (commit 4cb21c3)The README had Test 1 —
20 English read-speech clips + 10 Hindi clips, each with a metadata.csv
in the script's expected schema. Lets users run the benchmark scripts
against real audio without first building their own eval dataset.
Verified end-to-end on a fresh venv:
ping_pulse_offline.py samples/english_samples --language en --model pulse-pro
-> 20 clips, corpus WER 4.1%, latency p50/p95 = 0.76/0.92s, RTF p50 0.09
ping_pulse_streaming.py samples/hindi_samples --language hi
-> 10 clips, corpus WER 7.5%, tail latency p50/p95 = 236/304 ms
README updated with a 'Quick start — using the bundled samples' section
showing exact commands. .gitignore in samples/ keeps runtime
results_*.json files out of the repo.
… methodology (#238)

* docs(pulse-stt): add "Measuring Latency" guide (offline vs streaming methodology)

Gaurav (Slack thread C0A2U76VBL0/1781523834.377769) asked for a docs page on Pulse STT latency benchmarking - specifically clarifying the asymmetry between offline and streaming, and explaining HOW each metric is calculated (since "TTFB and latency as numbers are vague for streaming").

Scope explicitly per Gaurav: NO benchmark numbers - purely the conceptual / methodology page. Models the structure of Deepgram's measuring-streaming-latency page but uses Pulse's actual wire signals.

## What the page covers

1. **Pre-recorded latency** - RTFx as the cross-vendor-fair metric + wall-clock end-to-end + how to attribute network vs model time via the `metadata.processing_time_ms` field Pulse already returns.
2. **Streaming latency - three distinct metrics**, each tied to a product question:
   - **Transcript latency** - `audio_cursor - transcript_cursor`, sampled only on `is_final=false` interim frames (since Pulse buffers for `is_final=true` accuracy). Uses the last word's `end` timestamp from interim transcripts as `transcript_cursor`; client accumulates `audio_cursor` from sent chunks. Working Python sketch included.
   - **End-of-utterance latency** - three patterns: client-side VAD + `{"type":"finalize"}` (recommended for voice agents), server-side `eou_timeout_ms`, or pure-detection approach. Working Python sketch.
   - **Time to first partial** - startup health check; includes the warning that first-session-of-process is inflated by WS+TLS handshake.
3. **Component breakdown** - 5 buckets (network connection, network per-message, transcription, client, buffer) with per-bucket measurement methodology. Includes regional transit expectations table (Mumbai ap-south-2 / Oregon us-west-2 → typical one-way latency from common origins).
4. **Common measurement pitfalls** - 7 traps: finals-only, missing word_timestamps, first-session warm-up, chunks too small / too large, server idle warm-up, cross-network comparisons.
5. **Use-case → metric lookup table** - which metric to optimise for live captioning, voice agents, post-call analytics, real-time dashboards.
6. Reference scripts placeholder pointing at Discord/support - once Anirudh's streaming.py + offline.py are shared, they'll either inline as worked examples or move to the cookbook repo with a link.

## Surface details

- Path: `fern/products/waves/pages/v4.0.0/speech-to-text/measuring-latency.mdx`
- Nav: wired under Speech to Text (Pulse) → Benchmarks section
- Mirrored to `versions/v4.0.0/speech-to-text/measuring-latency.mdx`
- llms.txt regenerated
- `fern check` passes (0 errors, only unrelated pre-existing warnings)

## Verification

Local `fern docs dev` + Playwright DOM probe against `/waves/documentation/speech-to-text-pulse/benchmarks/measuring-latency`: all 8 structural checks pass - RTFx formula present, three streaming metrics defined, transcript-cursor formula visible, Python examples render, pitfalls + component-breakdown + use-case sections all rendered, cross-links to Pulse model card + Response Format docs work.

## Review intent

Per Abhishek's commitment in the Slack thread (reply 10): pinging Gaurav for review after this lands. Hand-off prompt: page is intentionally conceptual; if he wants worked benchmark numbers + a separate "Pulse vs Deepgram WER" page, that's a follow-up.

* docs(measuring-latency): rewrite in concept-first style; cut AI-tells + numbers

Per Abhishek's feedback (referenced Deepgram's measuring-streaming-latency guide as style inspiration):

- Open with the CONCEPT of streaming latency, not a Pulse-specific punchline. Old opening was 'A single latency number for Pulse is misleading' - replaced with a definition-first paragraph that talks about the metric, the use cases (voice agents, live captioning), and why a single average hides problems.
- Remove specific latency / RTFx numbers. The page now describes methodology only (what each metric is, how to calculate it) - no 'Pulse Pro typically hits RTFx ≈ 50' or transit-time tables, which were unverified anyway.
- Drop the regional transit-time table entirely (it had unverified ap-south-2 / us-west-2 numbers).
- Tighten the prose - fewer em-dashes, fewer colons-for-definitions, no 'If you're building X' framing repeated for every use case. Plain factual sentences instead.
- Cut the 'two non-obvious things matter' phrasing and the 'Pulse is misleading' framing. Each section now opens with what the metric IS, then shows how to compute it.
- Update llms.txt summary to match new description.

Page now mirrors the Deepgram style: concept → metric definition → formula → minimal code → pitfalls. Methodology, not benchmarks.

* docs(measuring-latency): link to cookbook benchmark scripts

Replace the 'scripts are not yet public' Note with direct links to the two reference scripts now in smallest-inc/cookbook#64:

- speech-to-text/benchmarks/ping_pulse_offline.py
- speech-to-text/benchmarks/ping_pulse_streaming.py

Scripts implement the methodology in this page end-to-end (WER + p50/p90/p95 latency). Docs page stays concept-first; cookbook holds the runnable code. Mirrors existing pattern: transcribe-python.py, websocket-python.py, mic-input-python.py.

* docs(pulse-stt): rewrite Measuring Latency page - Deepgram-class structure

Reviewer feedback: the previous draft was "mid" - no proper section headings, no clear narrative arc, vague TTFB/latency framing. Researched Deepgram (the only peer with a real streaming-latency page; AssemblyAI/Speechmatics/Gladia have none) and rewrote to beat it on completeness while mirroring its strengths.

Structural changes:

1. **New opener** introduces the cursor model up front. Two cursors (audio, transcript), one formula, then everything else is naming. Matches Deepgram's "audio cursor X / transcript cursor Y / latency = X−Y" framing.
2. **"Latency components" promoted** from buried table → its own H2 with four H3 subsections (connection setup / network transit / server transcription / client overhead). Placed BEFORE the measurement code so readers have a debugging mental model first. Includes ASCII pipeline diagram + typical value ranges per component + curl `time_appconnect` one-liner.
3. **End-of-Utterance → End-of-turn (EOT)** rename with "(also called end-of-utterance)" inline. EOT is what Deepgram, AssemblyAI, and voice-agent builders use; EOU is academic.
4. **TTFB removed entirely.** HTTP/web term, not a streaming-STT term, no peer doc uses it. Replaced with "time to first partial" which is unambiguous.
5. **Three EOT patterns** named explicitly (finalize-on-VAD recommended; server-side silence timer; VAD-without-finalize). Was one buried paragraph.
6. **Pitfalls tightened** to 5 cross-cutting bullets. Inline warnings on the most critical ones (interim-only sampling, word_timestamps=true) moved next to the relevant code where Deepgram puts them.
7. **Cross-references promoted** to a sub-section with model-card link added for headline TTFT numbers.

Beats Deepgram by adding: RTFx + pre-recorded methodology (Deepgram has none), finalize-message pattern (Deepgram doesn't document it), reference scripts in the cookbook, ASCII pipeline diagram (Deepgram is prose-only).

Versions mirror synced. fern check 0 errors, llms.txt regenerated.

* fix(measuring-latency): escape `<100 ms` to "sub-100 ms" to fix MDX parse error

Fern preview build failed:

Failed to parse markdown file products/waves/pages/v4.0.0/speech-to-text/measuring-latency.mdx: Unexpected character `1` (U+0031) before name, expected a character that can start a name, such as a letter, `$`, or `_`

MDX 3 parses `<digit` as the start of a JSX tag (e.g., `<100`), but a digit isn't a valid first character for a tag name - hence the error. Line 91 had **<100 ms** in prose. Replaced with "sub-100 ms" (matches the "Sub-100ms" phrasing the Pulse model card now uses for the same TTFT claim).

Versions mirror synced.

* docs(measuring-latency): add 4 sections to beat Deepgram comprehensively

After head-to-head review against Deepgram's measuring-streaming-latency page, identified four gaps we needed to close. Added all four as new H2 sections:

1. **"Pulse vs Pulse Pro for latency-sensitive workloads"** - Deepgram has "Model Considerations" (Nova-3 vs Flux); we had no model recommendation on the page. Added a decision table: voice agents → Pulse; live captioning → Pulse; multilingual → Pulse; offline English accuracy → Pulse Pro. With the hard constraint surfaced up front: Pulse Pro has no streaming worker (returns 400 on /stt/live).
2. **"Latency expectations"** - Deepgram has a typical-ranges table; we had numbers scattered inline per component. Consolidated table with 7 rows (connection setup, network transit, server transcription, total transcript latency, EOT, time-to-first-partial, RTFx) + p50/p95 reminder. Numbers sourced from existing model-card claims, not invented.
3. **"When latencies are higher than expected"** - Deepgram has remediation bullets; we had pitfalls but no fix-it list. Added 6 actionable bullets keyed back to the components section (sockets / region / capacity / client overhead / EOT pattern / buffer size). Helps readers go from "my number is bad" to "here's what to change".
4. **"Summary"** - Deepgram closes with bullet takeaways; we ended on the cross-references mid-air. Added 7-bullet recap: cursor model, metric distinction, four-component attribution, model selection, finalize-on-VAD recommendation, word_timestamps=true reminder, p50+p95 tracking.

Where we now beat Deepgram on every axis they cover:

- Cursor model (parity)
- Components (parity, with ASCII pipeline diagram they lack)
- Model recommendation (parity)
- Measurement code (better - three EOT patterns including finalize-on-VAD)
- Latency expectations (parity)
- Troubleshooting (parity)
- Summary (parity)
- Tools (better - reference cookbook scripts, not just one-off CLI utilities)

Plus we cover RTFx + pre-recorded methodology that Deepgram skips entirely.

Versions mirror synced. MDX validated (no `<digit` hazards). fern check 0 errors, llms.txt regenerated.

* docs(measuring-latency): drop Pulse-Pro-no-streaming-worker paragraph

Per reviewer: the standalone "Pulse Pro has no streaming worker. Calls to WS /waves/v1/stt/live?model=pulse-pro return 400..." paragraph between the workload-recommendation table and the model-card cross-references felt redundant. The same constraint is already conveyed in the table row ("HTTP-only, no streaming worker") and once more in the Summary section ("Pulse for streaming, Pulse Pro for offline English accuracy. Pulse Pro has no streaming worker.") - three mentions was overkill.

Versions mirror synced. No section structure changes.

* docs(measuring-latency): drop the cursor-difference formula entirely (Gaurav)

Gaurav noted that the documented `transcript_latency = audio_cursor − transcript_cursor` formula breaks during silence in the audio - the audio cursor keeps advancing while the transcript cursor correctly stays put on the last word, so the "latency" reading inflates by the silence duration. His exact illustration: in `---hey how are you---my name is gaurav---`, the audio cursor races ahead during the dashes while the transcript cursor sits on "gaurav"; the formula reports the silence as latency. That's wrong.

Researched what the voice-agent industry actually does:

- Deepgram (where we lifted the formula from) doesn't caveat silence. Same bug. https://developers.deepgram.com/docs/measuring-streaming-latency
- Daily.co's Pipecat STT benchmark (the open-source benchmark voice-agent shops actually run) measures Time-to-Final-Segment (TTFS, = our EOT) per utterance only. No cursor-difference metric.
- AssemblyAI documents TTFT + TTCT (= our first-partial + EOT) and explicitly recommends live side-by-side testing over precise per-chunk formulas. https://www.assemblyai.com/docs/streaming/evaluations/voice-agents
- Speechmatics has a `max_delay` config but no measurement formula.

Conclusion: there is no industry-standard "transcript latency" cursor formula. Documenting our own (broken or fixed) would invent a metric nobody else uses. Per reviewer preference, stay neutral.

Changes:

- Removed the `### Transcript latency` H3 section + its code block entirely.
- Removed the "Transcript latency" row from the metric-selection table.
- Replaced the "How streaming latency is measured" cursor-model intro with a "What 'streaming latency' actually means" framing that names the three industry-standard metrics (TTFT, EOT, RTFx) and explicitly notes (with citations to Pipecat + AssemblyAI) why no cursor formula is documented here.
- Removed "Total transcript latency" row from the Latency-expectations table.
- Updated pitfalls + summary bullets + cookbook-scripts row to drop "transcript latency" references; replaced with TTFT/EOT framing.
- Updated frontmatter description to lead with TTFT/EOT/RTFx instead of cursor model.

Versions mirror synced. fern check 0 errors, llms.txt regenerated.
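
The silence problem Gaurav describes can be reproduced with a small simulation. This is a sketch under stated assumptions, not a measurement of Pulse: audio is streamed in real time in fixed chunks, the server is given a constant hypothetical 200 ms delay, and the word end times are invented.

```python
def cursor_gap_trace(word_end_ms, total_audio_ms, chunk_ms=100, server_delay_ms=200):
    """Return (audio_cursor, transcript_cursor, gap) in ms after each sent chunk.

    word_end_ms: end timestamps of spoken words in the audio (invented data).
    server_delay_ms: constant true delay before a word shows up in a transcript
    (a hypothetical value chosen for this simulation).
    """
    trace = []
    for audio_cursor in range(chunk_ms, total_audio_ms + 1, chunk_ms):
        # Words whose transcript has already reached the client at this point.
        arrived = [end for end in word_end_ms if end + server_delay_ms <= audio_cursor]
        # The transcript cursor sits on the end of the last transcribed word.
        transcript_cursor = max(arrived, default=0)
        trace.append((audio_cursor, transcript_cursor, audio_cursor - transcript_cursor))
    return trace


if __name__ == "__main__":
    # "hey how are you" ends at 2.0 s, followed by 3 s of silence.
    words = [1200, 1500, 1800, 2000]
    for audio_cursor, transcript_cursor, gap in cursor_gap_trace(words, 5000):
        if audio_cursor in (2200, 3000, 4000, 5000):
            print(f"audio={audio_cursor} ms  transcript={transcript_cursor} ms  gap={gap} ms")
```

In this simulation the true delay never changes, yet the cursor difference grows from 200 ms just after the last word to 3000 ms at the end of the silence, which is the inflation the commit describes.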

* docs(measuring-latency): drop the explainer note + cursor-pitfall (Streisand)

Reviewer caught that the "we don't document the cursor formula because…" explainer note + the "don't invent a cursor formula" pitfall bullet both ironically re-introduce the bad pattern by naming it.

Removed:

- The `<Note>` at the top under "What 'streaming latency' actually means" that cited Deepgram's broken formula + Pipecat + AssemblyAI to justify our absence of it.
- The pitfall bullet that warned against "inventing a live transcript lag formula from cursors."

Page now stays silent on the cursor approach entirely. Customers reading it see only what we DO recommend (TTFT, EOT, RTFx) - no mention of the formula they shouldn't use. If they go look at Deepgram's page separately and try to copy that formula, they'll find their own silence-inflation issue without us drawing attention to it.

Versions mirror synced. fern check 0 errors, llms.txt regenerated.

* docs(measuring-latency): drop competitor name-drops; keep metric vocab only

Reviewer feedback: name-dropping AssemblyAI / Pipecat / Daily.co in reference links pulls customer attention toward competitor sites. Same class of problem as the Streisand fix in the previous commit - naming the alternative draws people to it.

Removed two competitor citations:

- The opener line that cited "[Pipecat STT benchmark](URL)" and "[AssemblyAI](URL)" as the source of the TTFS / TTCT vocabulary. Replaced with: "Also called Time-to-Final-Segment (TTFS) or Time-to-Complete-Transcript (TTCT) in the broader voice-agent industry." Metric names stand on their own as industry vocabulary; no attribution needed.
- The pitfall bullet that said "that's what AssemblyAI, Pipecat, and Daily.co recommend" - rephrased to direct guidance: "Empirical end-to-end measurement beats a synthetic per-chunk formula."

Page now contains zero competitor brand references. Industry-standard metric names (TTFT, TTFS, TTCT, EOT, RTFx) retained as vocabulary.

Versions mirror synced. fern check 0 errors, llms.txt regenerated.

* docs(measuring-latency): drop ASCII pipeline diagram + concretize numbers (Gaurav)

Reviewer feedback round 3:

1. ASCII pipeline diagram in "Latency components" section doesn't make sense (visual without payoff - readers already know audio goes client → server → client). Removed.
2. "sub-100 ms at 1 concurrency" framing is vague. Replaced with a concrete "150 ms" number that matches the actual production measurement. Updated in two places: the Server-transcription component prose (line 92) and the Latency-expectations table (line ~213).
3. Buffer-size practical range was "20–100 ms per chunk" - corrected to "100–300 ms per chunk" per Gaurav's input.

Versions mirror synced. fern check 0 errors, llms.txt regenerated.

* docs(measuring-latency): drop measurement recipes, keep components + pitfalls

Removes the Python measurement snippets, Pattern A/B/C EOT recipes, inline Latency expectations table, and "When latencies are higher than expected" troubleshooting list. The four-component attribution model, the Pulse vs Pulse Pro picker, the Common measurement pitfalls list, and the cookbook reference scripts all stay - that's the load-bearing content for a docs reader.

The cookbook scripts (ping_pulse_offline.py, ping_pulse_streaming.py) are the canonical, maintained measurement recipes; this page now points at them instead of duplicating the Python end-to-end. Updates the intro accordingly. Drops the Summary bullet that referenced the "three EOT patterns" since Pattern A/B/C are no longer defined on the page.
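
The `<100 ms` parse failure recorded in the fix commit above, and the later "MDX validated (no `<digit` hazards)" note, suggest a pre-commit scan along the following lines. This is a heuristic sketch, not the check the author actually ran: the function name is hypothetical, and it assumes that text inside fenced code blocks and inline code spans is treated as literal and therefore safe.

```python
import re

# "<" immediately followed by a digit, e.g. "<100 ms".
LT_DIGIT = re.compile(r"<(?=\d)")
# Opening or closing line of a fenced code block.
FENCE = re.compile(r"^\s*(```|~~~)")
# Inline code span on a single line.
INLINE_CODE = re.compile(r"`[^`]*`")


def find_lt_digit_hazards(mdx_text):
    """Return [(line_number, line), ...] for prose lines containing '<digit'."""
    hazards = []
    in_fence = False
    for lineno, line in enumerate(mdx_text.splitlines(), start=1):
        if FENCE.match(line):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        prose = INLINE_CODE.sub("", line)  # ignore inline code spans
        if LT_DIGIT.search(prose):
            hazards.append((lineno, line.strip()))
    return hazards


if __name__ == "__main__":
    sample = "Server transcription is **<100 ms** at low load.\nNow sub-100 ms."
    for lineno, line in find_lt_digit_hazards(sample):
        print(f"line {lineno}: {line}")
```

Run over the page source before pushing, a scan like this would flag the offending prose line while leaving the "sub-100 ms" rewrite alone.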
Summary
Adds two self-contained reference scripts for measuring Pulse STT accuracy + latency. Companion to the Measuring Latency docs page — docs explain the methodology, this repo holds the runnable scripts.
Plus a `README.md` covering install, dataset layout, and how to interpret output.
Why split docs ↔ cookbook
Scripts are 350+ lines each with heavy deps (`jiwer`, `librosa`, `whisper-normalizer`, `websockets`, `tqdm`). Inlining them into the docs page would dominate the methodology section and undermine the concept-first design. Cookbook lets:
Matches existing cookbook pattern: `transcribe-python.py`, `websocket-python.py`, `mic-input-python.py` already live alongside their respective docs pages.
Test plan
Coordination
The docs page that references these scripts is in smallest-inc/smallest-ai-documentation#238. I'll update that PR's 'Reference scripts' section to link to these once this PR merges.