jina-ai/dataroom-tpu-models

Dataroom: GCP TPU for jina-embeddings-v5 Inference

Feasibility and cost-efficiency analysis of serving the jina-embeddings-v5 family (v5-text and v5-omni) on Google Cloud TPU instead of the current NVIDIA L4 / Cloud Run deployment. Grounded in real, dated GCP and Jina pricing.

Bottom line: Stay on L4 / Cloud Run. As of 2026-05-05, v5-text-small is no longer feasibility-blocked: decoder embedding serving shipped upstream (tpu-inference#2420). The blocker there is now economics, not capability, and the economics still favor L4. v5-text-nano and v5-omni remain feasibility-blocked. See Executive Summary.

Captured 2026-05-28. All prices on-demand USD unless stated; re-verify before any commitment.

Table of Contents

| # | Section | What it answers |
|---|---|---|
| - | Executive Summary | The whole story in one page |
| 01 | Models | What v5-text/v5-omni are; current L4 serving |
| 02 | Hardware | L4 vs TPU specs; throughput ceiling if software-unblocked |
| 03 | Pricing | Real L4 and TPU prices, cost ratios |
| 04 | Feasibility | Can it run on TPU at all? (the blocker) |
| 05 | Cost Analysis | TPS, token value, margin, break-even |
| 06 | Recommendations | The decision and revisit triggers |
| - | Sources | Raw text captures of every cited fact |
| - | Figures | Fact-based plots (reproducible via matplotlib) |
| - | data/ | Structured JSON: pricing, models |

Key findings at a glance

  1. Decoder embedding serving on TPU shipped 2026-05-05 (tpu-inference#2420, merged, addresses #899) - StepPool + a vLLM-Pooler-on-JAX-Qwen3 hybrid, verified for Qwen3-Embedding-8B on v6e/v7x. This unblocks v5-text-small at the capability level (remaining: integrate v5 weights + benchmark). v5-text-nano (EuroBERT encoder-only, vllm#20869) and v5-omni (multimodal) stay blocked - #2420 added decoder pooling only.
  2. TPU is not on Cloud Run, so a migration loses min=0 scale-to-zero. For the low-utilization per-task topology this dominates the economics. (Unchanged.)
  3. With real prices, v5e must run >=1.7x an L4's tokens/s (v6e >=3.8x) just to reach tokens-per-dollar parity - now measurable for v5-text-small, still to be benchmarked.
  4. Even if the software gate were removed, on compute-per-dollar (the metric that governs embedding economics) v5e is at parity with L4 (0.96x). Only v6e exceeds it (1.99x ceiling, ~1.3-1.5x realistic), and only on a continuously saturated lane. See throughput ceiling analysis.
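
The parity thresholds in finding 3 can be reproduced with a short calculation. The sketch below assumes the 1.7x and 3.8x figures are pure hourly price ratios against one L4, so the prices are normalized to L4 = 1.0 rather than taken from the pricing section. The function names are illustrative and not part of this repo.

```python
def tokens_per_dollar(tokens_per_second: float, price_per_hour: float) -> float:
    """Tokens served per dollar on a fully busy accelerator."""
    if price_per_hour <= 0:
        raise ValueError("price_per_hour must be positive")
    return tokens_per_second * 3600.0 / price_per_hour


def break_even_speedup(accel_price_per_hour: float, baseline_price_per_hour: float) -> float:
    """Throughput multiple needed over the baseline for tokens-per-dollar parity.

    Parity means tps_a / price_a == tps_b / price_b, so the required
    throughput ratio equals the price ratio.
    """
    return accel_price_per_hour / baseline_price_per_hour


# ASSUMPTION: relative hourly prices normalized to L4 = 1.0, implied by the
# ratios quoted in finding 3. Replace with the real prices from 03 Pricing.
RELATIVE_PRICE = {"L4": 1.0, "v5e": 1.7, "v6e": 3.8}

for name in ("v5e", "v6e"):
    needed = break_even_speedup(RELATIVE_PRICE[name], RELATIVE_PRICE["L4"])
    print(f"{name}: needs >= {needed:.1f}x L4 tokens/s for parity")
```

Any measured tokens/s below these multiples means the TPU lane delivers fewer tokens per dollar than L4, even at full utilization.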

Figure: Per-dollar performance vs L4

Figure: Effective cost vs utilization: L4 scale-to-zero vs always-on TPU

More plots (break-even throughput, margin-vs-throughput curves) in figures/ and inline in the cost analysis.
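
The utilization effect behind finding 2 can be sketched with an idealized model. A scale-to-zero lane is billed only while busy, while an always-on lane is billed every hour, so its cost per busy hour is the list price divided by utilization. This is a simplification: it ignores cold starts, minimum billing increments, and committed-use discounts. The 2.0x speedup in the usage example is a placeholder, not a measurement.

```python
def effective_price_per_busy_hour(price_per_hour: float, utilization: float,
                                  scale_to_zero: bool) -> float:
    """Dollars paid per hour of useful work at a given utilization (0 < u <= 1)."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    # Scale-to-zero: idle time is free. Always-on: idle time is still billed.
    return price_per_hour if scale_to_zero else price_per_hour / utilization


def break_even_utilization(price_ratio: float, speedup: float) -> float:
    """Utilization at which an always-on accelerator matches a scale-to-zero
    baseline on cost per token.

    price_ratio = accelerator price / baseline price (per hour)
    speedup     = accelerator tokens/s / baseline tokens/s
    A result above 1.0 means parity is unreachable at any utilization.
    """
    return price_ratio / speedup


# PLACEHOLDER speedup of 2.0x; price ratio 1.7 is the v5e figure from finding 3.
u_star = break_even_utilization(price_ratio=1.7, speedup=2.0)
print(f"always-on lane must stay >= {u_star:.0%} busy to match scale-to-zero L4")
```

At 10% utilization the same model makes an always-on lane ten times more expensive per busy hour than its list price, which is why the low-utilization per-task topology dominates the comparison.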

Scope and honesty notes

  • The cost side uses real, sourced GCP/Jina prices.
  • The throughput (tokens/s) side is not invented. Where a number is needed to show the shape of the margin curve, it is an explicit placeholder in cost-model.csv and must be replaced by a measurement (see the benchmark plan).
  • Internal architecture facts come from the sefo_gcp embeddings_v5 executor and the website model pages; external facts are captured verbatim in sources/.
