
feat(speech-to-text): add Pulse STT benchmarking scripts#64

Merged
abhishekmishragithub merged 3 commits into main from feat/pulse-stt-benchmarking-scripts on Jun 17, 2026


@abhishekmishragithub

Collaborator

Summary

Adds two self-contained reference scripts for measuring Pulse STT accuracy + latency. Companion to the Measuring Latency docs page — docs explain the methodology, this repo holds the runnable scripts.

| Script | Endpoint | Reports |
| --- | --- | --- |
| `ping_pulse_offline.py` | `POST /waves/v1/stt/` (pre-recorded HTTP) | WER (corpus), per-clip latency, RTF (p50/p90/p95) |
| `ping_pulse_streaming.py` | `WSS /waves/v1/stt/live` (real-time WebSocket) | WER (corpus), tail latency (p50/p90/p95) |

Plus a `README.md` covering install, dataset layout, and how to interpret output.

Why split docs ↔ cookbook

Scripts are 350+ lines each with heavy deps (`jiwer`, `librosa`, `whisper-normalizer`, `websockets`, `tqdm`). Inlining them into the docs page would dominate the methodology section and undermine the concept-first design. Keeping them in the cookbook lets:

  • docs reader stay focused on the WHAT/WHY
  • engineer who wants to benchmark just clones the cookbook
  • scripts evolve independently from docs

Matches existing cookbook pattern: `transcribe-python.py`, `websocket-python.py`, `mic-input-python.py` already live alongside their respective docs pages.

Test plan

  • Both scripts pass `python3 -m py_compile` (no syntax errors)
  • Run `ping_pulse_offline.py --language en --model pulse-pro` against a small (5-10 clip) eval dataset, confirm JSON aggregate prints cleanly
  • Run `ping_pulse_streaming.py --language en` against same dataset, confirm WS connect + final transcripts arrive
  • README's quick-start steps work end-to-end on a fresh venv

Coordination

The docs page that references these scripts is in smallest-inc/smallest-ai-documentation#238. I'll update that PR's 'Reference scripts' section to link to these once this PR merges.

Two self-contained reference scripts for measuring Pulse STT accuracy
(WER) + latency against a user-supplied evaluation dataset:

  ping_pulse_offline.py   — POST /waves/v1/stt/ (pre-recorded)
                            reports WER, per-clip latency, RTF (p50/p90/p95)

  ping_pulse_streaming.py — WSS /waves/v1/stt/live (real-time WS)
                            reports WER, tail latency (p50/p90/p95)

Both scripts:
- Single-file, no repo-internal imports — copy anywhere and run
- Accept a folder layout: DATA_DIR/audio/ + DATA_DIR/metadata.csv
- --language en | hi (the two with mature WER normalisation pipelines)
- Print JSON aggregate at end (pipe-able to jq)
- Use stdlib urllib for HTTP, websockets package for streaming

Linked from the Pulse STT 'Measuring Latency' docs page as the canonical
reference implementation of the methodology described there.
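
As a rough illustration of the per-clip numbers the offline script reports, the sketch below times one transcription call and derives RTF as wall-clock latency divided by audio duration. The `transcribe` callable and the `fake_transcribe` stub are hypothetical stand-ins for the real HTTP request, and the 2-second duration is made up; the actual scripts may structure this differently.

```python
import time
from typing import Callable, Tuple


def timed_call(transcribe: Callable[[bytes], str], audio: bytes) -> Tuple[str, float]:
    """Run one transcription call and return (transcript, wall-clock seconds)."""
    start = time.perf_counter()
    text = transcribe(audio)
    return text, time.perf_counter() - start


def real_time_factor(latency_s: float, audio_duration_s: float) -> float:
    """RTF = latency / audio duration; values below 1.0 are faster than real time."""
    if audio_duration_s <= 0:
        raise ValueError("audio duration must be positive")
    return latency_s / audio_duration_s


def fake_transcribe(audio: bytes) -> str:
    # Hypothetical stub standing in for the POST request to the STT endpoint.
    time.sleep(0.01)
    return "hello world"


text, latency = timed_call(fake_transcribe, b"\x00" * 3200)
print(text, round(real_time_factor(latency, audio_duration_s=2.0), 4))
```

Using `time.perf_counter()` rather than `time.time()` keeps the measurement monotonic, so clock adjustments during a run do not distort a latency sample.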
@vercel

vercel Bot commented Jun 17, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| smallest-showcase | Ready | Preview, Comment | Jun 17, 2026 10:40am |


abhishekmishragithub added a commit to smallest-inc/smallest-ai-documentation that referenced this pull request Jun 17, 2026
Replace the 'scripts are not yet public' Note with direct links to
the two reference scripts now in smallest-inc/cookbook#64:

  - speech-to-text/benchmarks/ping_pulse_offline.py
  - speech-to-text/benchmarks/ping_pulse_streaming.py

Scripts implement the methodology in this page end-to-end (WER + p50/
p90/p95 latency). Docs page stays concept-first; cookbook holds the
runnable code. Mirrors existing pattern: transcribe-python.py,
websocket-python.py, mic-input-python.py.
Live end-to-end test on a fresh venv with a 5-clip eval dataset caught two
README errors:

1. metadata.csv columns: header was documented as 'filename,reference'
   but both scripts validate for 'audio_filename,transcript' (and exit
   with a hard error on mismatch). Fixed README to match the script's
   actual contract.

2. JSON output location: README didn't mention the scripts write results
   to DATA_DIR/results_<mode>_<lang>.json. Documented the filenames,
   shape, and a corrected jq example.

Verified both scripts now run end-to-end against the corrected README:
  ping_pulse_offline.py  --language en --model pulse-pro
    -> 4/5 clips at 0% WER, 1 at 66% (TTS clip too short — model
    truncated 'this is a test')
    -> corpus WER 10.26%, latency p50/p95 = 0.69/0.78s, RTFx ~32-54

  ping_pulse_streaming.py --language en
    -> 5/5 clips at 0% WER
    -> tail latency p50/p95 = 250/310ms
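
The gap between the single 66% clip and the 10.26% corpus figure comes from pooling: corpus WER divides total word errors by total reference words across all clips. A minimal sketch of that calculation follows, using a plain word-level edit distance as a stand-in for `jiwer` and skipping text normalisation. The first pair is the truncated clip reported above; the second pair is a hypothetical perfect clip added only to show the pooling.

```python
from typing import List, Sequence, Tuple


def word_errors(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance: substitutions + deletions + insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # match or substitution
            ))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: List[Tuple[str, str]]) -> float:
    """Total errors over total reference words, pooled across all clips."""
    errors = sum(word_errors(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    if words == 0:
        raise ValueError("no reference words")
    return errors / words


pairs = [
    ("hello world this is a test", "hello world"),   # truncated clip from the run above
    ("the quick brown fox", "the quick brown fox"),  # hypothetical perfect clip
]
clip_errors = word_errors(pairs[0][0].split(), pairs[0][1].split())
print(clip_errors, round(clip_errors / 6, 4), round(corpus_wer(pairs), 4))
```

The truncated clip alone scores 4 errors over 6 words (66.67%). Four errors over 39 reference words would give 10.26%, which is consistent with the reported aggregate, although the total word count is inferred here rather than stated in the run output.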
@abhishekmishragithub

Collaborator Author

End-to-end test on fresh venv ✅

Created /tmp/cookbook-test-venv from scratch, installed deps per the README quick-start (jiwer tqdm numpy librosa websockets whisper-normalizer), built a 5-clip English eval dataset by synthesising via Lightning TTS, then ran both scripts.

Bug caught + fixed (commit 4cb21c3)

The README had metadata.csv columns as filename,reference. Both scripts actually validate for audio_filename,transcript and exit hard on mismatch. README updated to match the script's real contract. Also added the JSON output filename + shape (the scripts write to DATA_DIR/results_<mode>_<lang>.json, which the original README didn't mention).
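
For readers building their own dataset, the header contract described above can be checked before a run. This is a minimal sketch, not the scripts' actual validation code: the `load_metadata` helper is hypothetical, and it raises `ValueError` where the real scripts are described as exiting with a hard error.

```python
import csv
import io
from typing import Dict, List

REQUIRED_COLUMNS = ("audio_filename", "transcript")


def load_metadata(handle) -> List[Dict[str, str]]:
    """Read metadata.csv rows, failing fast if a required column is absent."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(
            "metadata.csv is missing required column(s): " + ", ".join(missing)
        )
    return list(reader)


good = io.StringIO(
    "audio_filename,transcript\n"
    "clip_001.wav,hello world this is a test\n"
)
rows = load_metadata(good)
print(rows[0]["audio_filename"], "|", rows[0]["transcript"])
```

Passing a file with the old `filename,reference` header to this helper raises immediately, which mirrors the mismatch the fresh-venv test caught.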

Test 1 — ping_pulse_offline.py --language en --model pulse-pro

clip_001.wav: WER=66.67%  latency=0.781s  RTF=0.3753  server_rtfx=44.5
  ref: hello world this is a test
  hyp: hello world                      ← Pulse truncated; TTS clip might be too short
clip_002.wav: WER=0.0%   latency=0.693s  RTF=0.2706  server_rtfx=32.5  ✓
clip_003.wav: WER=0.0%   latency=0.689s  RTF=0.2051  server_rtfx=49.5  ✓
clip_004.wav: WER=0.0%   latency=0.702s  RTF=0.1753  server_rtfx=54.3  ✓
clip_005.wav: WER=0.0%   latency=0.665s  RTF=0.198   server_rtfx=42.8  ✓

── Aggregate ──
clips                   5
corpus WER              10.26%
latency p50/p90/p95     0.693 / 0.781 / 0.781 s
RTF p50/p90/p95         0.2051 / 0.3753 / 0.3753

Wrote /tmp/eval-dataset/results_batch_pulse-pro_en.json
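
With only five clips, p90 and p95 both land on the slowest sample. The aggregate above is consistent with a nearest-rank percentile, sketched below against the per-clip values from this run. Whether the script uses exactly this method is an assumption; a linear-interpolation percentile would report a lower p90.

```python
import math
from typing import Sequence


def nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


latencies = [0.781, 0.693, 0.689, 0.702, 0.665]   # per-clip latency (s) from the log above
rtfs = [0.3753, 0.2706, 0.2051, 0.1753, 0.198]    # per-clip RTF from the log above

print([nearest_rank(latencies, p) for p in (50, 90, 95)])  # [0.693, 0.781, 0.781]
print([nearest_rank(rtfs, p) for p in (50, 90, 95)])       # [0.2051, 0.3753, 0.3753]
```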

Test 2 — ping_pulse_streaming.py --language en

clip_001-005.wav: WER=0.0%  all 5 transcribed cleanly via WS

── Aggregate ──
clips                   5
corpus WER              0.0%
tail p50/p90/p95        0.25 / 0.31 / 0.31 s

Wrote /tmp/eval-dataset/results_streaming_en.json

Verdict

Both scripts work end-to-end on a fresh venv following the (now-corrected) README. JSON outputs land where the README says. Aggregate stats look reasonable for a 5-clip set. README is now safe to follow as written.

Test plan checkboxes in the original PR body are now checked.

20 English read-speech clips + 10 Hindi clips, each with a metadata.csv
in the script's expected schema. Lets users run the benchmark scripts
against real audio without first building their own eval dataset.

Verified end-to-end on a fresh venv:

  ping_pulse_offline.py samples/english_samples --language en --model pulse-pro
    -> 20 clips, corpus WER 4.1%, latency p50/p95 = 0.76/0.92s, RTF p50 0.09

  ping_pulse_streaming.py samples/hindi_samples --language hi
    -> 10 clips, corpus WER 7.5%, tail latency p50/p95 = 236/304 ms

README updated with a 'Quick start — using the bundled samples' section
showing exact commands. .gitignore in samples/ keeps runtime
results_*.json files out of the repo.
@abhishekmishragithub abhishekmishragithub merged commit 7f73bf3 into main Jun 17, 2026
2 checks passed
abhishekmishragithub added a commit to smallest-inc/smallest-ai-documentation that referenced this pull request Jun 18, 2026
… methodology (#238)

* docs(pulse-stt): add "Measuring Latency" guide (offline vs streaming methodology)

Gaurav (Slack thread C0A2U76VBL0/1781523834.377769) asked for a docs
page on Pulse STT latency benchmarking — specifically clarifying the
asymmetry between offline and streaming, and explaining HOW each metric
is calculated (since "TTFB and latency as numbers are vague for
streaming").

Scope explicitly per Gaurav: NO benchmark numbers — purely the
conceptual / methodology page. Models the structure of Deepgram's
measuring-streaming-latency page but uses Pulse's actual wire signals.

## What the page covers

1. **Pre-recorded latency** — RTFx as the cross-vendor-fair metric +
   wall-clock end-to-end + how to attribute network vs model time via
   the `metadata.processing_time_ms` field Pulse already returns.

2. **Streaming latency — three distinct metrics**, each tied to a
   product question:
   - **Transcript latency** — `audio_cursor - transcript_cursor`,
     sampled only on `is_final=false` interim frames (since Pulse buffers
     for `is_final=true` accuracy). Uses the last word's `end`
     timestamp from interim transcripts as `transcript_cursor`; client
     accumulates `audio_cursor` from sent chunks. Working Python sketch
     included.
   - **End-of-utterance latency** — three patterns: client-side VAD +
     `{"type":"finalize"}` (recommended for voice agents), server-side
     `eou_timeout_ms`, or pure-detection approach. Working Python
     sketch.
   - **Time to first partial** — startup health check; includes the
     warning that first-session-of-process is inflated by WS+TLS
     handshake.

3. **Component breakdown** — 5 buckets (network connection, network
   per-message, transcription, client, buffer) with per-bucket
   measurement methodology. Includes regional transit expectations
   table (Mumbai ap-south-2 / Oregon us-west-2 → typical one-way
   latency from common origins).

4. **Common measurement pitfalls** — 7 traps: finals-only, missing
   word_timestamps, first-session warm-up, chunks too small / too
   large, server idle warm-up, cross-network comparisons.

5. **Use-case → metric lookup table** — which metric to optimise for
   live captioning, voice agents, post-call analytics, real-time
   dashboards.

6. Reference scripts placeholder pointing at Discord/support — once
   Anirudh's streaming.py + offline.py are shared, they'll either
   inline as worked examples or move to the cookbook repo with a link.
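
The network-versus-model attribution in item 1 can be sketched as simple arithmetic. This assumes `metadata.processing_time_ms` holds server-side processing time in milliseconds; the `split_latency` helper and the example values are hypothetical.

```python
from typing import Dict


def split_latency(wall_clock_s: float, processing_time_ms: float,
                  audio_duration_s: float) -> Dict[str, float]:
    """Attribute one pre-recorded request's latency to server time versus everything else."""
    if processing_time_ms <= 0 or audio_duration_s <= 0:
        raise ValueError("processing time and audio duration must be positive")
    server_s = processing_time_ms / 1000.0
    return {
        "server_s": server_s,
        # Network transit, upload, queueing, and client overhead all land here.
        "network_and_overhead_s": max(0.0, wall_clock_s - server_s),
        # RTFx: seconds of audio processed per second of server time.
        "rtfx": audio_duration_s / server_s,
        # End-to-end RTF as the client experiences it.
        "end_to_end_rtf": wall_clock_s / audio_duration_s,
    }


# Hypothetical request: 2 s clip, 0.8 s wall clock, 50 ms reported by the server.
result = split_latency(wall_clock_s=0.8, processing_time_ms=50, audio_duration_s=2.0)
print({name: round(value, 3) for name, value in result.items()})
```

A large gap between `server_s` and the wall-clock figure points at the network or client rather than the model, which is why RTFx computed from server time compares more fairly across vendors than wall-clock latency.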

## Surface details

- Path: `fern/products/waves/pages/v4.0.0/speech-to-text/measuring-latency.mdx`
- Nav: wired under Speech to Text (Pulse) → Benchmarks section
- Mirrored to `versions/v4.0.0/speech-to-text/measuring-latency.mdx`
- llms.txt regenerated
- `fern check` passes (0 errors, only unrelated pre-existing warnings)

## Verification

Local `fern docs dev` + Playwright DOM probe against
`/waves/documentation/speech-to-text-pulse/benchmarks/measuring-latency`:
all 8 structural checks pass — RTFx formula present, three streaming
metrics defined, transcript-cursor formula visible, Python examples
render, pitfalls + component-breakdown + use-case sections all
rendered, cross-links to Pulse model card + Response Format docs work.

## Review intent

Per Abhishek's commitment in the Slack thread (reply 10): pinging
Gaurav for review after this lands. Hand-off prompt: page is
intentionally conceptual; if he wants worked benchmark numbers + a
separate "Pulse vs Deepgram WER" page, that's a follow-up.

* docs(measuring-latency): rewrite in concept-first style; cut AI-tells + numbers

Per Abhishek's feedback (referenced Deepgram's measuring-streaming-latency
guide as style inspiration):

- Open with the CONCEPT of streaming latency, not a Pulse-specific
  punchline. Old opening was 'A single latency number for Pulse is
  misleading' — replaced with a definition-first paragraph that talks
  about the metric, the use cases (voice agents, live captioning), and
  why a single average hides problems.
- Remove specific latency / RTFx numbers. The page now describes
  methodology only (what each metric is, how to calculate it) — no
  'Pulse Pro typically hits RTFx ≈ 50' or transit-time tables, which
  were unverified anyway.
- Drop the regional transit-time table entirely (it had unverified
  ap-south-2 / us-west-2 numbers).
- Tighten the prose — fewer em-dashes, fewer colons-for-definitions,
  no 'If you're building X' framing repeated for every use case.
  Plain factual sentences instead.
- Cut the 'two non-obvious things matter' phrasing and the 'Pulse is
  misleading' framing. Each section now opens with what the metric
  IS, then shows how to compute it.
- Update llms.txt summary to match new description.

Page now mirrors the Deepgram style: concept → metric definition →
formula → minimal code → pitfalls. Methodology, not benchmarks.

* docs(measuring-latency): link to cookbook benchmark scripts

Replace the 'scripts are not yet public' Note with direct links to
the two reference scripts now in smallest-inc/cookbook#64:

  - speech-to-text/benchmarks/ping_pulse_offline.py
  - speech-to-text/benchmarks/ping_pulse_streaming.py

Scripts implement the methodology in this page end-to-end (WER + p50/
p90/p95 latency). Docs page stays concept-first; cookbook holds the
runnable code. Mirrors existing pattern: transcribe-python.py,
websocket-python.py, mic-input-python.py.

* docs(pulse-stt): rewrite Measuring Latency page — Deepgram-class structure

Reviewer feedback: the previous draft was "mid" — no proper section headings, no
clear narrative arc, vague TTFB/latency framing. Researched Deepgram (the only
peer with a real streaming-latency page; AssemblyAI/Speechmatics/Gladia have
none) and rewrote to beat it on completeness while mirroring its strengths.

Structural changes:

1. **New opener** introduces the cursor model up front. Two cursors (audio,
   transcript), one formula, then everything else is naming. Matches Deepgram's
   "audio cursor X / transcript cursor Y / latency = X−Y" framing.

2. **"Latency components" promoted** from buried table → its own H2 with four
   H3 subsections (connection setup / network transit / server transcription /
   client overhead). Placed BEFORE the measurement code so readers have a
   debugging mental model first. Includes ASCII pipeline diagram + typical
   value ranges per component + curl `time_appconnect` one-liner.

3. **End-of-Utterance → End-of-turn (EOT)** rename with "(also called
   end-of-utterance)" inline. EOT is what Deepgram, AssemblyAI, and voice-
   agent builders use; EOU is academic.

4. **TTFB removed entirely.** HTTP/web term, not a streaming-STT term, no peer
   doc uses it. Replaced with "time to first partial" which is unambiguous.

5. **Three EOT patterns** named explicitly (finalize-on-VAD recommended;
   server-side silence timer; VAD-without-finalize). Was one buried paragraph.

6. **Pitfalls tightened** to 5 cross-cutting bullets. Inline warnings on the
   most critical ones (interim-only sampling, word_timestamps=true) moved next
   to the relevant code where Deepgram puts them.

7. **Cross-references promoted** to a sub-section with model-card link added
   for headline TTFT numbers.
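
To make the distinction between time to first partial and EOT concrete, here is a minimal sketch that derives both from a client-side event log. The event names (`audio_sent`, `interim`, `finalize_sent`, `final`) and the timestamps are hypothetical; in practice they would be recorded with `time.monotonic()` as messages are sent and received.

```python
from typing import List, Optional, Tuple

Event = Tuple[float, str]  # (client timestamp in seconds, event kind)


def gap(events: List[Event], start_kind: str, end_kind: str) -> Optional[float]:
    """Seconds from the first start_kind event to the first end_kind event at or after it."""
    ordered = sorted(events)
    start = next((t for t, kind in ordered if kind == start_kind), None)
    if start is None:
        return None
    end = next((t for t, kind in ordered if kind == end_kind and t >= start), None)
    return None if end is None else end - start


# Hypothetical session using the finalize-on-VAD pattern.
log = [
    (0.00, "audio_sent"),      # first audio chunk leaves the client
    (0.25, "interim"),         # first is_final=false transcript arrives
    (2.00, "finalize_sent"),   # client VAD detects end of speech and sends finalize
    (2.25, "final"),           # is_final=true transcript arrives
]

print("time to first partial:", gap(log, "audio_sent", "interim"))
print("end-of-turn latency:", gap(log, "finalize_sent", "final"))
```

Both values are measured on one client clock, so no clock synchronisation with the server is needed.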

Beats Deepgram by adding: RTFx + pre-recorded methodology (Deepgram has none),
finalize-message pattern (Deepgram doesn't document it), reference scripts in
the cookbook, ASCII pipeline diagram (Deepgram is prose-only).

Versions mirror synced. fern check 0 errors, llms.txt regenerated.

* fix(measuring-latency): escape `<100 ms` to "sub-100 ms" to fix MDX parse error

Fern preview build failed:
  Failed to parse markdown file products/waves/pages/v4.0.0/speech-to-text/
  measuring-latency.mdx: Unexpected character `1` (U+0031) before name,
  expected a character that can start a name, such as a letter, `$`, or `_`

MDX 3 parses `<digit` as the start of a JSX tag (e.g., `<100`), but a digit
isn't a valid first character for a tag name — hence the error.

Line 91 had **<100 ms** in prose. Replaced with "sub-100 ms" (matches the
"Sub-100ms" phrasing the Pulse model card now uses for the same TTFT claim).

Versions mirror synced.
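
A quick pre-commit check for this class of error could look like the sketch below. It is a line-based heuristic rather than an MDX parser: it skips fenced code blocks and inline code spans, then flags any remaining `<` followed by a digit. The `find_lt_digit` helper is hypothetical and not part of Fern.

```python
import re
from typing import List, Tuple

FENCE = re.compile(r"^\s*(`{3}|~{3})")   # opening or closing code fence
INLINE_CODE = re.compile(r"`[^`]*`")     # inline code span
LT_DIGIT = re.compile(r"<\d")            # '<' immediately followed by a digit


def find_lt_digit(text: str) -> List[Tuple[int, str]]:
    """Return (line number, line) for prose lines that contain '<' followed by a digit."""
    hits = []
    in_fence = False
    for number, line in enumerate(text.splitlines(), start=1):
        if FENCE.match(line):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        if LT_DIGIT.search(INLINE_CODE.sub("", line)):
            hits.append((number, line.strip()))
    return hits


fence = "`" * 3
sample = "\n".join([
    "Latency is **<100 ms** here.",   # hazard: parsed as the start of a JSX tag
    "Inline `<100` is fine.",         # inside an inline code span
    fence + "text",
    "x <5",                           # inside a fenced block
    fence,
    "sub-100 ms is safe.",
])
print(find_lt_digit(sample))
```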

* docs(measuring-latency): add 4 sections to beat Deepgram comprehensively

After head-to-head review against Deepgram's measuring-streaming-latency page,
identified four gaps we needed to close. Added all four as new H2 sections:

1. **"Pulse vs Pulse Pro for latency-sensitive workloads"** — Deepgram has
   "Model Considerations" (Nova-3 vs Flux); we had no model recommendation
   on the page. Added a decision table: voice agents → Pulse; live captioning
   → Pulse; multilingual → Pulse; offline English accuracy → Pulse Pro. With
   the hard constraint surfaced up front: Pulse Pro has no streaming worker
   (returns 400 on /stt/live).

2. **"Latency expectations"** — Deepgram has a typical-ranges table; we had
   numbers scattered inline per component. Consolidated table with 7 rows
   (connection setup, network transit, server transcription, total transcript
   latency, EOT, time-to-first-partial, RTFx) + p50/p95 reminder. Numbers
   sourced from existing model-card claims, not invented.

3. **"When latencies are higher than expected"** — Deepgram has remediation
   bullets; we had pitfalls but no fix-it list. Added 6 actionable bullets
   keyed back to the components section (sockets / region / capacity / client
   overhead / EOT pattern / buffer size). Helps readers go from "my number
   is bad" to "here's what to change".

4. **"Summary"** — Deepgram closes with bullet takeaways; we ended on the
   cross-references mid-air. Added 7-bullet recap: cursor model, metric
   distinction, four-component attribution, model selection, finalize-on-VAD
   recommendation, word_timestamps=true reminder, p50+p95 tracking.

Where we now match or beat Deepgram on every axis they cover:
- Cursor model (parity)
- Components (parity, with ASCII pipeline diagram they lack)
- Model recommendation (parity)
- Measurement code (better — three EOT patterns including finalize-on-VAD)
- Latency expectations (parity)
- Troubleshooting (parity)
- Summary (parity)
- Tools (better — reference cookbook scripts, not just one-off CLI utilities)

Plus we cover RTFx + pre-recorded methodology that Deepgram skips entirely.

Versions mirror synced. MDX validated (no `<digit` hazards). fern check 0
errors, llms.txt regenerated.

* docs(measuring-latency): drop Pulse-Pro-no-streaming-worker paragraph

Per reviewer: the standalone "Pulse Pro has no streaming worker. Calls to
WS /waves/v1/stt/live?model=pulse-pro return 400..." paragraph between the
workload-recommendation table and the model-card cross-references felt
redundant. The same constraint is already conveyed in the table row
("HTTP-only, no streaming worker") and once more in the Summary section
("Pulse for streaming, Pulse Pro for offline English accuracy. Pulse Pro
has no streaming worker.") — three mentions was overkill.

Versions mirror synced. No section structure changes.

* docs(measuring-latency): drop the cursor-difference formula entirely (Gaurav)

Gaurav noted that the documented `transcript_latency = audio_cursor −
transcript_cursor` formula breaks during silence in the audio — the audio
cursor keeps advancing while the transcript cursor correctly stays put on
the last word, so the "latency" reading inflates by the silence duration.

His exact illustration: in `---hey how are you---my name is gaurav---`,
the audio cursor races ahead during the dashes while the transcript cursor
sits on "gaurav"; the formula reports the silence as latency. That's
wrong.
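
The inflation is easy to reproduce with arithmetic. In this sketch the timings are hypothetical: the last word ends at 2.0 s in the audio, the real processing lag is 0.25 s, and the client keeps streaming silence afterwards.

```python
def cursor_latency(audio_cursor_s: float, transcript_cursor_s: float) -> float:
    """The cursor-difference formula: audio sent so far minus the last word's end time."""
    return audio_cursor_s - transcript_cursor_s


LAST_WORD_END_S = 2.0   # hypothetical end timestamp of the last transcribed word
TRUE_LAG_S = 0.25       # hypothetical real processing lag

readings = {}
for silence_s in (0.0, 1.0, 3.0):
    # The audio cursor keeps advancing through silence; the transcript cursor does not.
    audio_cursor = LAST_WORD_END_S + TRUE_LAG_S + silence_s
    readings[silence_s] = cursor_latency(audio_cursor, LAST_WORD_END_S)

print(readings)  # {0.0: 0.25, 1.0: 1.25, 3.0: 3.25}
```

Every second of trailing silence adds a second to the reading even though the model's lag never changed.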

Researched what the voice-agent industry actually does:
- Deepgram (where we lifted the formula from) doesn't caveat silence.
  Same bug. https://developers.deepgram.com/docs/measuring-streaming-latency
- Daily.co's Pipecat STT benchmark — the open-source benchmark voice-
  agent shops actually run — measures Time-to-Final-Segment (TTFS, = our
  EOT) per utterance only. No cursor-difference metric.
- AssemblyAI documents TTFT + TTCT (= our first-partial + EOT) and
  explicitly recommends live side-by-side testing over precise per-chunk
  formulas. https://www.assemblyai.com/docs/streaming/evaluations/voice-agents
- Speechmatics has a `max_delay` config but no measurement formula.

Conclusion: there is no industry-standard "transcript latency" cursor
formula. Documenting our own (broken or fixed) would invent a metric
nobody else uses. Per reviewer preference, stay neutral.

Changes:
- Removed the `### Transcript latency` H3 section + its code block entirely.
- Removed the "Transcript latency" row from the metric-selection table.
- Replaced the "How streaming latency is measured" cursor-model intro
  with a "What 'streaming latency' actually means" framing that names
  the three industry-standard metrics (TTFT, EOT, RTFx) and explicitly
  notes — with citations to Pipecat + AssemblyAI — why no cursor formula
  is documented here.
- Removed "Total transcript latency" row from the Latency-expectations
  table.
- Updated pitfalls + summary bullets + cookbook-scripts row to drop
  "transcript latency" references; replaced with TTFT/EOT framing.
- Updated frontmatter description to lead with TTFT/EOT/RTFx instead of
  cursor model.

Versions mirror synced. fern check 0 errors, llms.txt regenerated.

* docs(measuring-latency): drop the explainer note + cursor-pitfall (Streisand)

Reviewer caught that the "we don't document the cursor formula because…"
explainer note + the "don't invent a cursor formula" pitfall bullet
both ironically re-introduce the bad pattern by naming it.

Removed:
- The `<Note>` at the top under "What 'streaming latency' actually means"
  that cited Deepgram's broken formula + Pipecat + AssemblyAI to justify
  our absence of it.
- The pitfall bullet that warned against "inventing a live transcript
  lag formula from cursors."

Page now stays silent on the cursor approach entirely. Customers
reading it see only what we DO recommend (TTFT, EOT, RTFx) — no
mention of the formula they shouldn't use. If they go look at
Deepgram's page separately and try to copy that formula, they'll find
their own silence-inflation issue without us drawing attention to it.

Versions mirror synced. fern check 0 errors, llms.txt regenerated.

* docs(measuring-latency): drop competitor name-drops; keep metric vocab only

Reviewer feedback: name-dropping AssemblyAI / Pipecat / Daily.co in
reference links pulls customer attention toward competitor sites. Same
class of problem as the Streisand fix in the previous commit — naming
the alternative draws people to it.

Removed two competitor citations:
- The opener line that cited "[Pipecat STT benchmark](URL)" and
  "[AssemblyAI](URL)" as the source of the TTFS / TTCT vocabulary.
  Replaced with: "Also called Time-to-Final-Segment (TTFS) or
  Time-to-Complete-Transcript (TTCT) in the broader voice-agent
  industry." Metric names stand on their own as industry vocabulary;
  no attribution needed.
- The pitfall bullet that said "that's what AssemblyAI, Pipecat, and
  Daily.co recommend" — rephrased to direct guidance: "Empirical
  end-to-end measurement beats a synthetic per-chunk formula."

Page now contains zero competitor brand references. Industry-standard
metric names (TTFT, TTFS, TTCT, EOT, RTFx) retained as vocabulary.

Versions mirror synced. fern check 0 errors, llms.txt regenerated.

* docs(measuring-latency): drop ASCII pipeline diagram + concretize numbers (Gaurav)

Reviewer feedback round 3:
1. ASCII pipeline diagram in "Latency components" section doesn't make
   sense (visual without payoff — readers already know audio goes
   client → server → client). Removed.
2. "sub-100 ms at 1 concurrency" framing is vague. Replaced with a
   concrete "150 ms" number that matches the actual production
   measurement. Updated in two places: the Server-transcription
   component prose (line 92) and the Latency-expectations table (line
   ~213).
3. Buffer-size practical range was "20–100 ms per chunk" — corrected
   to "100–300 ms per chunk" per Gaurav's input.
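
For reference, a chunk duration translates to a byte count as follows. The defaults assume 16 kHz, 16-bit, mono PCM, which is an assumption for illustration and not something this PR states about the streaming endpoint.

```python
def chunk_bytes(chunk_ms: int, sample_rate_hz: int = 16000,
                bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Bytes of raw PCM audio in one chunk of the given duration."""
    if chunk_ms <= 0:
        raise ValueError("chunk duration must be positive")
    return sample_rate_hz * bytes_per_sample * channels * chunk_ms // 1000


for ms in (100, 300):
    print(ms, "ms ->", chunk_bytes(ms), "bytes")
# 100 ms -> 3200 bytes
# 300 ms -> 9600 bytes
```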

Versions mirror synced. fern check 0 errors, llms.txt regenerated.

* docs(measuring-latency): drop measurement recipes, keep components + pitfalls

Removes the Python measurement snippets, Pattern A/B/C EOT recipes,
inline Latency expectations table, and "When latencies are higher than
expected" troubleshooting list. The four-component attribution model,
the Pulse vs Pulse Pro picker, the Common measurement pitfalls list,
and the cookbook reference scripts all stay — that's the load-bearing
content for a docs reader.

The cookbook scripts (ping_pulse_offline.py, ping_pulse_streaming.py)
are the canonical, maintained measurement recipes; this page now
points at them instead of duplicating the Python end-to-end. Updates
the intro accordingly. Drops the Summary bullet that referenced the
"three EOT patterns" since Pattern A/B/C are no longer defined on the
page.