feat(speech-to-text): add Pulse STT benchmarking scripts by abhishekmishragithub · Pull Request #64 · smallest-inc/cookbook

abhishekmishragithub · 2026-06-17T05:46:25Z

Summary

Adds two self-contained reference scripts for measuring Pulse STT accuracy + latency. Companion to the Measuring Latency docs page — docs explain the methodology, this repo holds the runnable scripts.

| Script | Endpoint | Reports |
| --- | --- | --- |
| `ping_pulse_offline.py` | `POST /waves/v1/stt/` (pre-recorded HTTP) | WER (corpus), per-clip latency, RTF (p50/p90/p95) |
| `ping_pulse_streaming.py` | `WSS /waves/v1/stt/live` (real-time WebSocket) | WER (corpus), tail latency (p50/p90/p95) |

Plus a `README.md` covering install, dataset layout, and how to interpret output.

Why split docs ↔ cookbook

Scripts are 350+ lines each with heavy deps (`jiwer`, `librosa`, `whisper-normalizer`, `websockets`, `tqdm`). Inlining them into the docs page would dominate the methodology section and undermine the concept-first design. Keeping them in the cookbook lets:

- docs readers stay focused on the WHAT/WHY
- engineers who want to benchmark just clone the cookbook
- scripts evolve independently from docs

Matches existing cookbook pattern: `transcribe-python.py`, `websocket-python.py`, `mic-input-python.py` already live alongside their respective docs pages.

Test plan

- Both scripts pass `python3 -m py_compile` (no syntax errors)
- Run `ping_pulse_offline.py --language en --model pulse-pro` against a small (5-10 clip) eval dataset, confirm JSON aggregate prints cleanly
- Run `ping_pulse_streaming.py --language en` against same dataset, confirm WS connect + final transcripts arrive
- README's quick-start steps work end-to-end on a fresh venv

Coordination

The docs page that references these scripts is in smallest-inc/smallest-ai-documentation#238. I'll update that PR's 'Reference scripts' section to link to these once this PR merges.

Two self-contained reference scripts for measuring Pulse STT accuracy (WER) + latency against a user-supplied evaluation dataset:

- ping_pulse_offline.py: POST /waves/v1/stt/ (pre-recorded), reports WER, per-clip latency, RTF (p50/p90/p95)
- ping_pulse_streaming.py: WSS /waves/v1/stt/live (real-time WS), reports WER, tail latency (p50/p90/p95)

Both scripts:

- Single-file, no repo-internal imports; copy anywhere and run
- Accept a folder layout: DATA_DIR/audio/ + DATA_DIR/metadata.csv
- --language en | hi (the two with mature WER normalisation pipelines)
- Print JSON aggregate at end (pipe-able to jq)
- Use stdlib urllib for HTTP, websockets package for streaming

Linked from the Pulse STT 'Measuring Latency' docs page as the canonical reference implementation of the methodology described there.

vercel · 2026-06-17T05:46:31Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
smallest-showcase	Ready	Preview, Comment	Jun 17, 2026 10:40am

Replace the 'scripts are not yet public' Note with direct links to the two reference scripts now in smallest-inc/cookbook#64:

- speech-to-text/benchmarks/ping_pulse_offline.py
- speech-to-text/benchmarks/ping_pulse_streaming.py

Scripts implement the methodology in this page end-to-end (WER + p50/p90/p95 latency). Docs page stays concept-first; cookbook holds the runnable code. Mirrors existing pattern: transcribe-python.py, websocket-python.py, mic-input-python.py.

Live end-to-end test on a fresh venv with a 5-clip eval dataset caught two README errors:

1. metadata.csv columns: header was documented as 'filename,reference' but both scripts validate for 'audio_filename,transcript' (and exit with a hard error on mismatch). Fixed README to match the script's actual contract.
2. JSON output location: README didn't mention the scripts write results to DATA_DIR/results_<mode>_<lang>.json. Documented the filenames, shape, and a corrected jq example.

Verified both scripts now run end-to-end against the corrected README:

ping_pulse_offline.py --language en --model pulse-pro
  -> 4/5 clips at 0% WER, 1 at 66% (TTS clip too short; model truncated 'this is a test')
  -> corpus WER 10.26%, latency p50/p95 = 0.69/0.78s, RTFx ~32-49

ping_pulse_streaming.py --language en
  -> 5/5 clips at 0% WER
  -> tail latency p50/p95 = 250/310ms
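The gap between "1 clip at 66%" and "corpus WER 10.26%" follows from how corpus WER pools errors over all reference words instead of averaging per-clip percentages. The sketch below illustrates that with a plain word-level edit distance. It is not the scripts' implementation (they depend on `jiwer` and `whisper-normalizer`), and the four placeholder references are hypothetical: their 33-word total is an assumption back-solved from the reported 10.26%, since the real references are not shown in this thread.

```python
def word_errors(ref: str, hyp: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    r, h = ref.split(), hyp.split()
    # Dynamic-programming Levenshtein distance over words, one row at a time.
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, start=1):
        cur = [i]
        for j, hw in enumerate(h, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rw != hw)))   # substitution or match
        prev = cur
    return prev[-1], len(r)


def corpus_wer(pairs) -> float:
    """Total errors divided by total reference words across all clips."""
    errors = words = 0
    for ref, hyp in pairs:
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    return errors / words


# clip_001 as quoted in the test output: four of six words were dropped.
clip_001 = ("hello world this is a test", "hello world")

# HYPOTHETICAL stand-ins for the four clean clips (33 reference words in total).
clean = [(" ".join(["w"] * n),) * 2 for n in (8, 8, 8, 9)]

pairs = [clip_001] + clean
per_clip = [word_errors(r, h)[0] / word_errors(r, h)[1] for r, h in pairs]

print(f"clip_001 WER  {per_clip[0]:.2%}")
print(f"corpus WER    {corpus_wer(pairs):.2%}")
print(f"mean clip WER {sum(per_clip) / len(per_clip):.2%}")
```

Under these assumptions the pooled figure (4 errors over 39 words) is lower than the per-clip mean, because the short truncated clip carries less weight than the longer clean ones.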

abhishekmishragithub · 2026-06-17T09:57:36Z

End-to-end test on fresh venv ✅

Created /tmp/cookbook-test-venv from scratch, installed deps per the README quick-start (jiwer tqdm numpy librosa websockets whisper-normalizer), built a 5-clip English eval dataset by synthesising via Lightning TTS, then ran both scripts.

Bug caught + fixed (commit `4cb21c3`)

The README had metadata.csv columns as filename,reference. Both scripts actually validate for audio_filename,transcript and exit hard on mismatch. README updated to match the script's real contract. Also added the JSON output filename + shape (the scripts write to DATA_DIR/results_<mode>_<lang>.json, which the original README didn't mention).
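For readers preparing their own dataset, the contract described above (a `metadata.csv` inside `DATA_DIR` with `audio_filename` and `transcript` columns, hard exit on mismatch) can be pre-checked with a few lines. This is a minimal sketch, not the scripts' actual validation code; the function name `load_metadata` and the error wording are illustrative assumptions.

```python
import csv
from pathlib import Path

# Column names taken from the fix described above.
REQUIRED_COLUMNS = ("audio_filename", "transcript")


def load_metadata(data_dir):
    """Read DATA_DIR/metadata.csv and return (audio_filename, transcript) pairs.

    Exits with an error if a required column is missing, mirroring the
    hard-exit behaviour described in this comment (hypothetical helper).
    """
    path = Path(data_dir) / "metadata.csv"
    with path.open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        found = reader.fieldnames or []
        missing = [c for c in REQUIRED_COLUMNS if c not in found]
        if missing:
            raise SystemExit(f"{path}: missing column(s) {missing}; found {found}")
        return [(row["audio_filename"], row["transcript"]) for row in reader]
```

Calling `load_metadata("/tmp/eval-dataset")` before a benchmark run would surface a `filename,reference` header immediately instead of partway through a run.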

Test 1 — `ping_pulse_offline.py --language en --model pulse-pro`

clip_001.wav: WER=66.67%  latency=0.781s  RTF=0.3753  server_rtfx=44.5
  ref: hello world this is a test
  hyp: hello world                      ← Pulse truncated; TTS clip might be too short
clip_002.wav: WER=0.0%   latency=0.693s  RTF=0.2706  server_rtfx=32.5  ✓
clip_003.wav: WER=0.0%   latency=0.689s  RTF=0.2051  server_rtfx=49.5  ✓
clip_004.wav: WER=0.0%   latency=0.702s  RTF=0.1753  server_rtfx=54.3  ✓
clip_005.wav: WER=0.0%   latency=0.665s  RTF=0.198   server_rtfx=42.8  ✓

── Aggregate ──
clips                   5
corpus WER              10.26%
latency p50/p90/p95     0.693 / 0.781 / 0.781 s
RTF p50/p90/p95         0.2051 / 0.3753 / 0.3753

Wrote /tmp/eval-dataset/results_batch_pulse-pro_en.json
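With only five clips, p90 and p95 both land on the slowest clip. The sketch below shows one percentile rule, nearest-rank, that reproduces the aggregate lines above from the per-clip values. Whether the script uses this exact rule is an assumption; linear interpolation (the NumPy default) would give a p90 of roughly 0.749 s for the same latencies.

```python
import math


def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Per-clip values copied from the Test 1 output above.
latencies = [0.781, 0.693, 0.689, 0.702, 0.665]
rtfs = [0.3753, 0.2706, 0.2051, 0.1753, 0.198]

for name, vals in (("latency", latencies), ("RTF", rtfs)):
    print(name, [percentile_nearest_rank(vals, p) for p in (50, 90, 95)])
# latency [0.693, 0.781, 0.781]
# RTF [0.2051, 0.3753, 0.3753]
```

Tail percentiles only become informative with a larger sample, which is worth keeping in mind when reading the 5-clip numbers.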

Test 2 — `ping_pulse_streaming.py --language en`

clip_001-005.wav: WER=0.0%  all 5 transcribed cleanly via WS

── Aggregate ──
clips                   5
corpus WER              0.0%
tail p50/p90/p95        0.25 / 0.31 / 0.31 s

Wrote /tmp/eval-dataset/results_streaming_en.json

Verdict

Both scripts work end-to-end on a fresh venv following the (now-corrected) README. JSON outputs land where the README says. Aggregate stats look reasonable for a 5-clip set. README is now safe to follow as written.

Test plan checkboxes in the original PR body now actually checked.

20 English read-speech clips + 10 Hindi clips, each with a metadata.csv in the script's expected schema. Lets users run the benchmark scripts against real audio without first building their own eval dataset.

Verified end-to-end on a fresh venv:

ping_pulse_offline.py samples/english_samples --language en --model pulse-pro
  -> 20 clips, corpus WER 4.1%, latency p50/p95 = 0.76/0.92s, RTF p50 0.09

ping_pulse_streaming.py samples/hindi_samples --language hi
  -> 10 clips, corpus WER 7.5%, tail latency p50/p95 = 236/304 ms

README updated with a 'Quick start: using the bundled samples' section showing exact commands. .gitignore in samples/ keeps runtime results_*.json files out of the repo.

… methodology (#238)

* docs(pulse-stt): add "Measuring Latency" guide (offline vs streaming methodology)

Gaurav (Slack thread C0A2U76VBL0/1781523834.377769) asked for a docs page on Pulse STT latency benchmarking, specifically clarifying the asymmetry between offline and streaming, and explaining HOW each metric is calculated (since "TTFB and latency as numbers are vague for streaming").

Scope explicitly per Gaurav: NO benchmark numbers, purely the conceptual / methodology page. Models the structure of Deepgram's measuring-streaming-latency page but uses Pulse's actual wire signals.

## What the page covers

1. **Pre-recorded latency**: RTFx as the cross-vendor-fair metric + wall-clock end-to-end + how to attribute network vs model time via the `metadata.processing_time_ms` field Pulse already returns.
2. **Streaming latency: three distinct metrics**, each tied to a product question:
   - **Transcript latency**: `audio_cursor - transcript_cursor`, sampled only on `is_final=false` interim frames (since Pulse buffers for `is_final=true` accuracy). Uses the last word's `end` timestamp from interim transcripts as `transcript_cursor`; client accumulates `audio_cursor` from sent chunks. Working Python sketch included.
   - **End-of-utterance latency**: three patterns: client-side VAD + `{"type":"finalize"}` (recommended for voice agents), server-side `eou_timeout_ms`, or pure-detection approach. Working Python sketch.
   - **Time to first partial**: startup health check; includes the warning that first-session-of-process is inflated by WS+TLS handshake.
3. **Component breakdown**: 5 buckets (network connection, network per-message, transcription, client, buffer) with per-bucket measurement methodology. Includes regional transit expectations table (Mumbai ap-south-2 / Oregon us-west-2 → typical one-way latency from common origins).
4. **Common measurement pitfalls**: 7 traps: finals-only, missing word_timestamps, first-session warm-up, chunks too small / too large, server idle warm-up, cross-network comparisons.
5. **Use-case → metric lookup table**: which metric to optimise for live captioning, voice agents, post-call analytics, real-time dashboards.
6. Reference scripts placeholder pointing at Discord/support: once Anirudh's streaming.py + offline.py are shared, they'll either inline as worked examples or move to the cookbook repo with a link.

## Surface details

- Path: `fern/products/waves/pages/v4.0.0/speech-to-text/measuring-latency.mdx`
- Nav: wired under Speech to Text (Pulse) → Benchmarks section
- Mirrored to `versions/v4.0.0/speech-to-text/measuring-latency.mdx`
- llms.txt regenerated
- `fern check` passes (0 errors, only unrelated pre-existing warnings)

## Verification

Local `fern docs dev` + Playwright DOM probe against `/waves/documentation/speech-to-text-pulse/benchmarks/measuring-latency`: all 8 structural checks pass (RTFx formula present, three streaming metrics defined, transcript-cursor formula visible, Python examples render, pitfalls + component-breakdown + use-case sections all rendered, cross-links to Pulse model card + Response Format docs work).

## Review intent

Per Abhishek's commitment in the Slack thread (reply 10): pinging Gaurav for review after this lands. Hand-off prompt: page is intentionally conceptual; if he wants worked benchmark numbers + a separate "Pulse vs Deepgram WER" page, that's a follow-up.

* docs(measuring-latency): rewrite in concept-first style; cut AI-tells + numbers

Per Abhishek's feedback (referenced Deepgram's measuring-streaming-latency guide as style inspiration):

- Open with the CONCEPT of streaming latency, not a Pulse-specific punchline. Old opening was 'A single latency number for Pulse is misleading'; replaced with a definition-first paragraph that talks about the metric, the use cases (voice agents, live captioning), and why a single average hides problems.
- Remove specific latency / RTFx numbers. The page now describes methodology only (what each metric is, how to calculate it); no 'Pulse Pro typically hits RTFx ≈ 50' or transit-time tables, which were unverified anyway.
- Drop the regional transit-time table entirely (it had unverified ap-south-2 / us-west-2 numbers).
- Tighten the prose: fewer em-dashes, fewer colons-for-definitions, no 'If you're building X' framing repeated for every use case. Plain factual sentences instead.
- Cut the 'two non-obvious things matter' phrasing and the 'Pulse is misleading' framing. Each section now opens with what the metric IS, then shows how to compute it.
- Update llms.txt summary to match new description.

Page now mirrors the Deepgram style: concept → metric definition → formula → minimal code → pitfalls. Methodology, not benchmarks.

* docs(measuring-latency): link to cookbook benchmark scripts

Replace the 'scripts are not yet public' Note with direct links to the two reference scripts now in smallest-inc/cookbook#64:

- speech-to-text/benchmarks/ping_pulse_offline.py
- speech-to-text/benchmarks/ping_pulse_streaming.py

Scripts implement the methodology in this page end-to-end (WER + p50/p90/p95 latency). Docs page stays concept-first; cookbook holds the runnable code. Mirrors existing pattern: transcribe-python.py, websocket-python.py, mic-input-python.py.

* docs(pulse-stt): rewrite Measuring Latency page (Deepgram-class structure)

Reviewer feedback: the previous draft was "mid": no proper section headings, no clear narrative arc, vague TTFB/latency framing. Researched Deepgram (the only peer with a real streaming-latency page; AssemblyAI/Speechmatics/Gladia have none) and rewrote to beat it on completeness while mirroring its strengths.

Structural changes:

1. **New opener** introduces the cursor model up front. Two cursors (audio, transcript), one formula, then everything else is naming. Matches Deepgram's "audio cursor X / transcript cursor Y / latency = X−Y" framing.
2. **"Latency components" promoted** from buried table → its own H2 with four H3 subsections (connection setup / network transit / server transcription / client overhead). Placed BEFORE the measurement code so readers have a debugging mental model first. Includes ASCII pipeline diagram + typical value ranges per component + curl `time_appconnect` one-liner.
3. **End-of-Utterance → End-of-turn (EOT)** rename with "(also called end-of-utterance)" inline. EOT is what Deepgram, AssemblyAI, and voice-agent builders use; EOU is academic.
4. **TTFB removed entirely.** HTTP/web term, not a streaming-STT term, no peer doc uses it. Replaced with "time to first partial" which is unambiguous.
5. **Three EOT patterns** named explicitly (finalize-on-VAD recommended; server-side silence timer; VAD-without-finalize). Was one buried paragraph.
6. **Pitfalls tightened** to 5 cross-cutting bullets. Inline warnings on the most critical ones (interim-only sampling, word_timestamps=true) moved next to the relevant code where Deepgram puts them.
7. **Cross-references promoted** to a sub-section with model-card link added for headline TTFT numbers.

Beats Deepgram by adding: RTFx + pre-recorded methodology (Deepgram has none), finalize-message pattern (Deepgram doesn't document it), reference scripts in the cookbook, ASCII pipeline diagram (Deepgram is prose-only).

Versions mirror synced. fern check 0 errors, llms.txt regenerated.

* fix(measuring-latency): escape `<100 ms` to "sub-100 ms" to fix MDX parse error

Fern preview build failed:

  Failed to parse markdown file products/waves/pages/v4.0.0/speech-to-text/measuring-latency.mdx: Unexpected character `1` (U+0031) before name, expected a character that can start a name, such as a letter, `$`, or `_`

MDX 3 parses `<digit` as the start of a JSX tag (e.g., `<100`), but a digit isn't a valid first character for a tag name, hence the error.

Line 91 had **<100 ms** in prose. Replaced with "sub-100 ms" (matches the "Sub-100ms" phrasing the Pulse model card now uses for the same TTFT claim).

Versions mirror synced.

* docs(measuring-latency): add 4 sections to beat Deepgram comprehensively

After head-to-head review against Deepgram's measuring-streaming-latency page, identified four gaps we needed to close. Added all four as new H2 sections:

1. **"Pulse vs Pulse Pro for latency-sensitive workloads"**: Deepgram has "Model Considerations" (Nova-3 vs Flux); we had no model recommendation on the page. Added a decision table: voice agents → Pulse; live captioning → Pulse; multilingual → Pulse; offline English accuracy → Pulse Pro. With the hard constraint surfaced up front: Pulse Pro has no streaming worker (returns 400 on /stt/live).
2. **"Latency expectations"**: Deepgram has a typical-ranges table; we had numbers scattered inline per component. Consolidated table with 7 rows (connection setup, network transit, server transcription, total transcript latency, EOT, time-to-first-partial, RTFx) + p50/p95 reminder. Numbers sourced from existing model-card claims, not invented.
3. **"When latencies are higher than expected"**: Deepgram has remediation bullets; we had pitfalls but no fix-it list. Added 6 actionable bullets keyed back to the components section (sockets / region / capacity / client overhead / EOT pattern / buffer size). Helps readers go from "my number is bad" to "here's what to change".
4. **"Summary"**: Deepgram closes with bullet takeaways; we ended on the cross-references mid-air. Added 7-bullet recap: cursor model, metric distinction, four-component attribution, model selection, finalize-on-VAD recommendation, word_timestamps=true reminder, p50+p95 tracking.

Where we now beat Deepgram on every axis they cover:

- Cursor model (parity)
- Components (parity, with ASCII pipeline diagram they lack)
- Model recommendation (parity)
- Measurement code (better: three EOT patterns including finalize-on-VAD)
- Latency expectations (parity)
- Troubleshooting (parity)
- Summary (parity)
- Tools (better: reference cookbook scripts, not just one-off CLI utilities)

Plus we cover RTFx + pre-recorded methodology that Deepgram skips entirely.

Versions mirror synced. MDX validated (no `<digit` hazards). fern check 0 errors, llms.txt regenerated.

* docs(measuring-latency): drop Pulse-Pro-no-streaming-worker paragraph

Per reviewer: the standalone "Pulse Pro has no streaming worker. Calls to WS /waves/v1/stt/live?model=pulse-pro return 400..." paragraph between the workload-recommendation table and the model-card cross-references felt redundant. The same constraint is already conveyed in the table row ("HTTP-only, no streaming worker") and once more in the Summary section ("Pulse for streaming, Pulse Pro for offline English accuracy. Pulse Pro has no streaming worker."); three mentions was overkill.

Versions mirror synced. No section structure changes.

* docs(measuring-latency): drop the cursor-difference formula entirely (Gaurav)

Gaurav noted that the documented `transcript_latency = audio_cursor − transcript_cursor` formula breaks during silence in the audio: the audio cursor keeps advancing while the transcript cursor correctly stays put on the last word, so the "latency" reading inflates by the silence duration.

His exact illustration: in `---hey how are you---my name is gaurav---`, the audio cursor races ahead during the dashes while the transcript cursor sits on "gaurav"; the formula reports the silence as latency. That's wrong.

Researched what the voice-agent industry actually does:

- Deepgram (where we lifted the formula from) doesn't caveat silence. Same bug. https://developers.deepgram.com/docs/measuring-streaming-latency
- Daily.co's Pipecat STT benchmark (the open-source benchmark voice-agent shops actually run) measures Time-to-Final-Segment (TTFS, = our EOT) per utterance only. No cursor-difference metric.
- AssemblyAI documents TTFT + TTCT (= our first-partial + EOT) and explicitly recommends live side-by-side testing over precise per-chunk formulas. https://www.assemblyai.com/docs/streaming/evaluations/voice-agents
- Speechmatics has a `max_delay` config but no measurement formula.

Conclusion: there is no industry-standard "transcript latency" cursor formula. Documenting our own (broken or fixed) would invent a metric nobody else uses. Per reviewer preference, stay neutral.

Changes:

- Removed the `### Transcript latency` H3 section + its code block entirely.
- Removed the "Transcript latency" row from the metric-selection table.
- Replaced the "How streaming latency is measured" cursor-model intro with a "What 'streaming latency' actually means" framing that names the three industry-standard metrics (TTFT, EOT, RTFx) and explicitly notes, with citations to Pipecat + AssemblyAI, why no cursor formula is documented here.
- Removed "Total transcript latency" row from the Latency-expectations table.
- Updated pitfalls + summary bullets + cookbook-scripts row to drop "transcript latency" references; replaced with TTFT/EOT framing.
- Updated frontmatter description to lead with TTFT/EOT/RTFx instead of cursor model.

Versions mirror synced. fern check 0 errors, llms.txt regenerated.

* docs(measuring-latency): drop the explainer note + cursor-pitfall (Streisand)

Reviewer caught that the "we don't document the cursor formula because…" explainer note + the "don't invent a cursor formula" pitfall bullet both ironically re-introduce the bad pattern by naming it.

Removed:

- The `<Note>` at the top under "What 'streaming latency' actually means" that cited Deepgram's broken formula + Pipecat + AssemblyAI to justify our absence of it.
- The pitfall bullet that warned against "inventing a live transcript lag formula from cursors."

Page now stays silent on the cursor approach entirely. Customers reading it see only what we DO recommend (TTFT, EOT, RTFx), with no mention of the formula they shouldn't use. If they go look at Deepgram's page separately and try to copy that formula, they'll find their own silence-inflation issue without us drawing attention to it.

Versions mirror synced. fern check 0 errors, llms.txt regenerated.

* docs(measuring-latency): drop competitor name-drops; keep metric vocab only

Reviewer feedback: name-dropping AssemblyAI / Pipecat / Daily.co in reference links pulls customer attention toward competitor sites. Same class of problem as the Streisand fix in the previous commit: naming the alternative draws people to it.

Removed two competitor citations:

- The opener line that cited "[Pipecat STT benchmark](URL)" and "[AssemblyAI](URL)" as the source of the TTFS / TTCT vocabulary. Replaced with: "Also called Time-to-Final-Segment (TTFS) or Time-to-Complete-Transcript (TTCT) in the broader voice-agent industry." Metric names stand on their own as industry vocabulary; no attribution needed.
- The pitfall bullet that said "that's what AssemblyAI, Pipecat, and Daily.co recommend", rephrased to direct guidance: "Empirical end-to-end measurement beats a synthetic per-chunk formula."

Page now contains zero competitor brand references. Industry-standard metric names (TTFT, TTFS, TTCT, EOT, RTFx) retained as vocabulary.

Versions mirror synced. fern check 0 errors, llms.txt regenerated.

* docs(measuring-latency): drop ASCII pipeline diagram + concretize numbers (Gaurav)

Reviewer feedback round 3:

1. ASCII pipeline diagram in "Latency components" section doesn't make sense (visual without payoff; readers already know audio goes client → server → client). Removed.
2. "sub-100 ms at 1 concurrency" framing is vague. Replaced with a concrete "150 ms" number that matches the actual production measurement. Updated in two places: the Server-transcription component prose (line 92) and the Latency-expectations table (line ~213).
3. Buffer-size practical range was "20–100 ms per chunk"; corrected to "100–300 ms per chunk" per Gaurav's input.

Versions mirror synced. fern check 0 errors, llms.txt regenerated.

* docs(measuring-latency): drop measurement recipes, keep components + pitfalls

Removes the Python measurement snippets, Pattern A/B/C EOT recipes, inline Latency expectations table, and "When latencies are higher than expected" troubleshooting list. The four-component attribution model, the Pulse vs Pulse Pro picker, the Common measurement pitfalls list, and the cookbook reference scripts all stay; that's the load-bearing content for a docs reader.

The cookbook scripts (ping_pulse_offline.py, ping_pulse_streaming.py) are the canonical, maintained measurement recipes; this page now points at them instead of duplicating the Python end-to-end. Updates the intro accordingly. Drops the Summary bullet that referenced the "three EOT patterns" since Pattern A/B/C are no longer defined on the page.
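The pre-recorded attribution described in the first commit above (RTFx plus splitting wall-clock time into model and network parts via `metadata.processing_time_ms`) can be sketched as follows. This is an illustration under stated assumptions, not code from the docs page or the cookbook scripts: the helper names are hypothetical, the response is assumed to nest the field as `response["metadata"]["processing_time_ms"]`, and the sample numbers are invented for the example.

```python
def rtf(wall_clock_s: float, audio_s: float) -> float:
    """Real-time factor: seconds of wall-clock time spent per second of audio."""
    return wall_clock_s / audio_s


def rtfx(audio_s: float, processing_s: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of processing."""
    return audio_s / processing_s


def split_latency(wall_clock_s: float, processing_time_ms: float) -> dict:
    """Attribute wall-clock latency to model time vs everything else (network, client)."""
    model_s = processing_time_ms / 1000.0
    return {"model_s": model_s, "network_and_overhead_s": wall_clock_s - model_s}


# HYPOTHETICAL measurement: a 10 s clip, 0.90 s measured around the HTTP call,
# and a response whose metadata reports 200 ms of server-side processing.
audio_s = 10.0
wall_clock_s = 0.90
response = {"metadata": {"processing_time_ms": 200}}

processing_ms = response["metadata"]["processing_time_ms"]
parts = split_latency(wall_clock_s, processing_ms)

print(f"RTF (wall-clock)      {rtf(wall_clock_s, audio_s):.2f}")
print(f"RTFx (server-side)    {rtfx(audio_s, processing_ms / 1000.0):.1f}")
print(f"model time            {parts['model_s']:.2f} s")
print(f"network and overhead  {parts['network_and_overhead_s']:.2f} s")
```

Computing RTFx from server-reported processing time and RTF from wall-clock time keeps the two concerns separate: the first describes the model, the second describes what a caller on a particular network experiences.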

vercel Bot deployed to Preview June 17, 2026 09:57 View deployment

vercel Bot deployed to Preview June 17, 2026 10:40 View deployment

abhishekmishragithub merged commit 7f73bf3 into main Jun 17, 2026
2 checks passed

abhishekmishragithub merged 3 commits into main from feat/pulse-stt-benchmarking-scripts
