⚠️ Experimental — This feature is functional but not recommended for production deployments.
Prefill/Decode (P/D) disaggregation runs the prefill (P) and decode (D) phases on separate vLLM instances, improving resource utilization.
┌────────┐         ┌────────┐         ┌────────┐
│ Router │ ──1──→  │   P    │         │   D    │
│        │ ←─2───  │        │         │        │
│        │         │ async  │         │        │
│        │         │ save   │         │        │
│        │ ←─3───  │ done!  │         │        │
│        │ ──4──────────────────────→ │        │
└────────┘         └────────┘         └────────┘
                       ↓                  ↓
                 PegaEngine (shared CPU storage)
- Router sends request to P node (max_tokens=1)
- P returns first token immediately (non-blocking)
- P's save worker completes the async KV write, then calls back the Router
- Router receives callback, forwards request to D node
- D node's get_num_new_matched_tokens() queries PegaEngine, finds KV exists
- D loads KV via start_load_kv() and continues decode
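The Router's side of these steps can be sketched in a few lines. This is a minimal illustration in Python, not the Rust router itself: `route_request` and the `send` callable are hypothetical names, and the sketch follows the synchronous variant in which the Router waits for P's response before contacting D.

```python
import copy
from typing import Callable

# Hypothetical transport: takes (base_url, request_body), returns the JSON response.
Send = Callable[[str, dict], dict]


def route_request(body: dict, prefill_url: str, decode_url: str, send: Send) -> dict:
    """Synchronous P -> D handoff for a single completion request."""
    # Ask P for exactly one token so it runs prefill only and saves the KV cache.
    prefill_body = copy.deepcopy(body)
    prefill_body["max_tokens"] = 1
    send(prefill_url, prefill_body)  # blocks until P has answered

    # Forward the original, unmodified request to D. D is expected to find
    # the saved KV in PegaEngine and continue decoding from there.
    return send(decode_url, body)
```

Passing the transport in as a callable keeps the sketch independent of any particular HTTP client; in practice `send` would POST the body to the instance's completion endpoint.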
Blocking in wait_for_save() would hurt throughput. A callback lets P continue processing other requests while the KV cache is being saved.
D node receives the same prompt, computes the same block_hashes (via vLLM's internal logic), and queries PegaEngine directly. No need to pass block_hashes through Router.
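To see why no hashes need to travel through the Router, consider a simplified chained block hash. The sketch below is an illustration only and does not reproduce vLLM's actual hashing: the `block_hashes` function and the SHA-256 chaining scheme are assumptions chosen to show that the result depends solely on the token IDs and the block size.

```python
import hashlib
from typing import List, Sequence


def block_hashes(token_ids: Sequence[int], block_size: int) -> List[str]:
    """Chain-hash each full block of tokens; a trailing partial block is skipped."""
    hashes: List[str] = []
    parent = ""  # hash of the previous block, empty for the first block
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + "|" + ",".join(str(t) for t in block)
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes
```

Two instances that tokenize the same prompt and use the same block size derive identical hash lists independently. A different block size yields a different list, so the lookup on D would miss.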
The Router only needs to do load balancing, as long as all P/D instances:
- Connect to the same PegaEngine
- Use the same TP size
- Use the same block_size
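These three conditions are easy to check before starting a deployment. The following is a hedged sketch of such a pre-flight check; `InstanceConfig`, its field names, and `find_mismatches` are hypothetical and not part of PegaFlow or vLLM.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InstanceConfig:
    # All field names are hypothetical, chosen for this sketch.
    name: str
    engine_endpoint: str
    tp_size: int
    block_size: int


def find_mismatches(instances: List[InstanceConfig]) -> List[str]:
    """Return one message per setting that differs from the first instance."""
    if not instances:
        return []
    ref = instances[0]
    problems: List[str] = []
    for inst in instances[1:]:
        for field in ("engine_endpoint", "tp_size", "block_size"):
            got, want = getattr(inst, field), getattr(ref, field)
            if got != want:
                problems.append(
                    f"{inst.name}: {field}={got!r} differs from {ref.name} ({want!r})"
                )
    return problems
```

An empty result means every instance agrees on all three settings; any message points at an instance whose KV lookups would not match the others.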
# P node
PEGAFLOW_ROUTER_ENDPOINT=http://router:8080
# D node (no special config needed)
The async callback path (_notify_router → /kv_ready) is not yet implemented in the connector.
The current Router uses a synchronous flow: it waits for P's HTTP response before forwarding to D.
The Rust router lives at pegaflow-server/src/bin/pegaflow-router.rs. It is a standalone binary (not part of the default build) that can be run with:
cargo run --release --bin pegaflow-router -- \
--prefill http://p-node:8000 \
--decode http://d-node:8001
See examples/run_vllm_pd_with_pega.py for a complete multi-GPU launch script.
| Configuration | TTFT mean (ms) | TPOT mean (ms) | TPOT p99 (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| P/D (1P+1D) | 573.78 | 15.68 | 15.89 | 21.71 |
| Baseline (DP2) | 438.24 | 22.67 | 24.32 | 142.70 |
The P/D setup trades higher mean TTFT (573.78 ms vs. 438.24 ms) for significantly more stable decode latency: TPOT p99 drops from 24.32 ms to 15.89 ms, and ITL p99 improves from 142.70 ms to 21.71 ms.
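The size of this trade-off is easier to judge as relative change. The short script below recomputes it from the table values; the `relative_change` helper is a name introduced here for illustration.

```python
def relative_change(baseline: float, value: float) -> float:
    """Percent change of value relative to baseline (negative means lower)."""
    return (value - baseline) / baseline * 100.0


# (P/D, baseline DP2) pairs in milliseconds, copied from the table above.
results = {
    "TTFT mean": (573.78, 438.24),
    "TPOT mean": (15.68, 22.67),
    "TPOT p99": (15.89, 24.32),
    "ITL p99": (21.71, 142.70),
}

for metric, (disagg, baseline) in results.items():
    print(f"{metric}: {relative_change(baseline, disagg):+.1f}%")
```

On these numbers, mean TTFT is about 31% higher with P/D, while TPOT p99 is about 35% lower and ITL p99 about 85% lower than the DP2 baseline.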
- Router uses synchronous P→D handoff (no async KV-ready callback yet)
- No built-in timeout/retry for P or D node failures
- No Prometheus metrics for P/D latency breakdown