Feasibility and cost-efficiency analysis of serving the jina-embeddings-v5 family (v5-text and v5-omni) on Google Cloud TPU instead of the current NVIDIA L4 / Cloud Run deployment. Grounded in real, dated GCP and Jina pricing.
Bottom line: Stay on L4 / Cloud Run. As of 2026-05-05 the feasibility wall fell for v5-text-small (decoder embedding serving shipped upstream, tpu-inference#2420), so the blocker there is now economics, not capability - and the economics still favor L4. v5-text-nano and v5-omni remain feasibility-blocked. See Executive Summary.
Captured 2026-05-28. All prices on-demand USD unless stated; re-verify before any commitment.
| # | Section | What it answers |
|---|---|---|
| - | Executive Summary | The whole story in one page |
| 01 | Models | What v5-text/v5-omni are; current L4 serving |
| 02 | Hardware | L4 vs TPU specs; throughput ceiling if software-unblocked |
| 03 | Pricing | Real L4 and TPU prices, cost ratios |
| 04 | Feasibility | Can it run on TPU at all? (the blocker) |
| 05 | Cost Analysis | TPS, token value, margin, break-even |
| 06 | Recommendations | The decision and revisit triggers |
| - | Sources | Raw text captures of every cited fact |
| - | Figures | Fact-based plots (reproducible via matplotlib) |
| - | data/ | Structured JSON: pricing, models |
- Decision-maker: Executive Summary -> Decision.
- Engineer: Feasibility -> vLLM TPU blocker -> Throughput ceiling: if the gate were removed -> Alternative paths -> Benchmark plan.
- Finance / FinOps: Pricing -> Why isn't TPU cost-competitive? -> Break-even model -> Token economics -> cost-model.csv.
- Decoder embedding serving on TPU shipped 2026-05-05 (tpu-inference#2420, merged, addresses #899) - StepPool + a vLLM-Pooler-on-JAX-Qwen3 hybrid, verified for Qwen3-Embedding-8B on v6e/v7x. This unblocks v5-text-small at the capability level (remaining: integrate v5 weights + benchmark). v5-text-nano (EuroBERT encoder-only, vllm#20869) and v5-omni (multimodal) stay blocked - #2420 added decoder pooling only.
- TPU is not on Cloud Run, so a migration loses `min=0` scale-to-zero. For the low-utilization per-task topology this dominates the economics. (Unchanged.)
- With real prices, v5e must run >=1.7x an L4's tokens/s (v6e >=3.8x) just to reach tokens-per-dollar parity - now measurable for v5-text-small, still to be benchmarked.
- Even if the software gate were removed: on compute-per-dollar (the metric that governs embedding economics) v5e is at parity with L4 (0.96x) and only v6e exceeds it (1.99x ceiling, ~1.3-1.5x realistic), and only on a continuously-saturated lane. See throughput ceiling analysis.
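The parity and scale-to-zero points above reduce to two small ratios, sketched below in Python. This is a simplified illustration, not the model used in the cost analysis. It assumes the quoted break-even multiples (1.7x, 3.8x) equal the TPU-to-L4 hourly price ratios, which is what tokens-per-dollar parity implies. It also assumes a scale-to-zero L4 is billed only for its busy fraction while a TPU lane is billed around the clock. The function names and the 10% utilization figure are hypothetical.

```python
# Break-even multiples quoted above, read as TPU/L4 hourly price ratios (assumption).
PRICE_RATIO = {"v5e": 1.7, "v6e": 3.8}


def tokens_per_dollar_ratio(throughput_ratio: float, price_ratio: float) -> float:
    """TPU tokens-per-dollar relative to L4 on a continuously saturated lane.

    throughput_ratio: TPU tokens/s divided by L4 tokens/s (must be measured).
    price_ratio:      TPU $/hour divided by L4 $/hour.
    A result of 1.0 is parity; above 1.0 the TPU is cheaper per token.
    """
    if throughput_ratio <= 0 or price_ratio <= 0:
        raise ValueError("ratios must be positive")
    return throughput_ratio / price_ratio


def always_on_cost_ratio(price_ratio: float, l4_utilization: float) -> float:
    """Cost of an always-on TPU relative to a scale-to-zero L4 for the same demand.

    Simplification: the L4 is billed only for the fraction of time it is busy,
    the TPU is billed 24/7 and has enough capacity for the demand.
    """
    if not 0 < l4_utilization <= 1:
        raise ValueError("l4_utilization must be in (0, 1]")
    return price_ratio / l4_utilization


if __name__ == "__main__":
    for chip, ratio in PRICE_RATIO.items():
        # Hypothetical lane that keeps an L4 busy 10% of the time.
        cost = always_on_cost_ratio(ratio, l4_utilization=0.10)
        print(f"{chip}: parity needs {ratio}x L4 tokens/s; "
              f"always-on cost at 10% utilization is {cost:.1f}x the L4 bill")
```

Under these assumptions, higher TPU throughput does not reduce the always-on bill when the lane is mostly idle, which is why the loss of scale-to-zero outweighs the per-chip price ratio for low-utilization tasks.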
More plots (break-even throughput, margin-vs-throughput curves) in figures/ and inline in the cost analysis.
- The cost side uses real, sourced GCP/Jina prices.
- The throughput (tokens/s) side is not invented. Where a number is needed to show the shape of the margin curve it is an explicit placeholder in cost-model.csv and must be replaced by measurement (benchmark plan).
- Internal architecture facts come from the `sefo_gcpembeddings_v5` executor and the website model pages; external facts are captured verbatim in sources/.

