PegaFlow Roadmap
PegaFlow is a high-performance KV cache storage and transfer engine for LLM
inference. The ecosystem spans the Rust core engine, the Python vLLM connector,
the P/D disaggregation path, pegaflow-transfer (RDMA), and cross-node P2P KV
sharing.
We recommend reading through the high-level plan first to get an overview, then
using this roadmap to look up specific components of interest.
Performance
Each I/O path should have an isolated, reproducible microbenchmark that measures raw
hardware throughput and latency independent of the inference engine. These
benchmarks are the foundation for capacity planning, regression detection, and
optimization work.
- D2H (GPU → CPU save)
- Single-GPU benchmark measuring pinned-memory write throughput and latency.
- Variables: block size, layer count, NUMA placement, hugepages on/off.
- Output: throughput (GB/s), latency distribution (p50/p99).
- H2D (CPU → GPU load)
- Single-GPU benchmark measuring pinned-memory read throughput and latency.
- Same variable matrix as D2H for direct comparison.
- Output: throughput (GB/s), latency distribution (p50/p99).
- SSD (NVMe io_uring)
- Isolated io_uring write and read (prefetch) benchmarks against the raw block
device, without the cache tier.
- Variables: queue depth, inflight count, block size, O_DIRECT vs buffered.
- Output: read/write IOPS, throughput (GB/s), latency distribution (p50/p99).
- RDMA (node-to-node)
- Bare pegaflow-transfer v2 throughput and latency benchmark: one-sided
RDMA READ/WRITE between two nodes, no inference engine.
- Variables: NIC count, QPs per peer, transfer size, NUMA affinity
(same-HCA vs cross-HCA), hugepages on/off.
- Output: throughput (GB/s), per-NIC throughput, one-sided READ/WRITE latency
distribution (p50/p99), effective bandwidth utilization vs link speed.
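The output fields listed above (throughput in GB/s, p50/p99 latency, bandwidth utilization vs link speed) can be derived from raw per-transfer timings. The following is a minimal sketch, not PegaFlow's benchmark harness: the function names `percentile` and `summarize`, the nearest-rank percentile method, and the assumption of equal-size back-to-back transfers are all illustrative choices.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    # Nearest-rank: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_s, bytes_per_transfer, link_gbps=None):
    """Summarize a run of back-to-back, equal-size transfers.

    latencies_s: per-transfer wall-clock latency in seconds.
    link_gbps:   optional nominal link speed in gigabits per second.
    """
    total_bytes = bytes_per_transfer * len(latencies_s)
    total_time_s = sum(latencies_s)
    throughput_gbs = total_bytes / total_time_s / 1e9
    summary = {
        "throughput_gbs": throughput_gbs,
        "p50_us": percentile(latencies_s, 50) * 1e6,
        "p99_us": percentile(latencies_s, 99) * 1e6,
    }
    if link_gbps is not None:
        # Link speed is quoted in gigabits/s; divide by 8 to compare in GB/s.
        summary["link_utilization"] = throughput_gbs / (link_gbps / 8)
    return summary
```

Summing latencies only approximates wall-clock time when transfers are issued one at a time; a benchmark with multiple transfers in flight would need to time the whole run instead.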
P/D Disaggregation
Layer-wise RDMA-push Prefill/Decode serving, moving from experimental to stable.
KV movement is pipelined with prefill compute layer by layer — P pushes
layer_i KV to D via RDMA WRITE while computing layer_{i+1}. D resumes decode
after P sends the final IMM.
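To illustrate why layer-wise push helps, the sketch below models the timeline with per-layer compute and push durations. It is a simplified model under stated assumptions, not the pd_connector implementation: pushes are assumed to be serialized on one link, the final IMM is assumed to arrive with the last push, and the function names are hypothetical.

```python
def pipelined_finish_time(compute_s, push_s):
    """Time at which the last layer's KV lands on D when push overlaps compute.

    compute_s[i]: prefill compute time for layer i.
    push_s[i]:    RDMA push time for layer i's KV.
    """
    compute_done = 0.0
    push_done = 0.0
    for c, p in zip(compute_s, push_s):
        compute_done += c  # layer i finishes computing
        # The push for layer i starts once its KV exists and the link is free.
        push_start = max(compute_done, push_done)
        push_done = push_start + p
    return push_done


def serial_finish_time(compute_s, push_s):
    """Baseline: finish all prefill compute, then push every layer."""
    return sum(compute_s) + sum(push_s)
```

When each push is shorter than the next layer's compute, only the last push remains exposed; when pushes are slower than compute, the link becomes the bottleneck and the overlap hides compute instead.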
Current status: pd_connector is functional end-to-end (small-model validated)
with scheduler/worker state, chunk tracking, HND layout handling, and native
PdRdmaEngine on top of pegaflow-transfer v2. Remaining: production router
path, formal benchmarking, and reliability hardening.
- Production serving path
- Replace pd_connector.proxy with pegaflow-router as the normal entry point (keep proxy for local debugging).
- D-first routing: the router selects P/D endpoints and sends the request only to D, with
kv_transfer_params attached; D allocates KV slots and triggers P.
- Support /v1/completions and /v1/chat/completions with streaming; preserve request IDs across router, D, P.
- Attention and KV layout coverage
- Keep FlashAttention HND as the first stable target.
- Add support for more attention/KV layouts beyond HND, including MLA-style
layouts used by DeepSeek-family models.
- Make layout detection and validation explicit so unsupported attention
backends fail before serving traffic.
- Cross-layer KV layout support
- Support models where KV layout or address mapping is not a simple
independent per-layer table.
- Define connector metadata for cross-layer addressing without hard-coding one
layout into the RDMA path.
- Pipeline parallel support
- Explore P/D handoff when the serving engine uses pipeline parallelism.
- Define how handshakes, layer ownership, and RDMA push scheduling map to PP
stages.
- Validate that layer-wise push still overlaps with prefill compute when
layers are split across pipeline stages.
- Conditional P/D
- Policy layer to choose local vs remote prefill per request: prompt length, P/D load, cache state, RDMA health, SLO target.
- P-side prefix caching re-enabled: D tells P which prefix blocks are available so P only pushes missing KV blocks.
- Role-flexible workers: any worker can serve prefill or decode on demand.
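The conditional P/D items above could take a shape like the following. This is a sketch under assumptions, not a committed design: `RequestContext`, `choose_prefill`, `blocks_to_push`, and both threshold defaults are hypothetical, and the SLO input named in the policy bullet is left out for brevity. The second helper assumes "available" means blocks the decode side reports it already holds.

```python
from dataclasses import dataclass


@dataclass
class RequestContext:
    prompt_tokens: int
    cached_prefix_tokens: int  # tokens already resident on the decode worker
    prefill_queue_depth: int   # pending requests on the candidate P worker
    rdma_healthy: bool


def choose_prefill(ctx, min_remote_tokens=2048, max_prefill_queue=8):
    """Return "remote" to prefill on P, or "local" to prefill on D."""
    if not ctx.rdma_healthy:
        return "local"  # never depend on a degraded transfer path
    uncached = ctx.prompt_tokens - ctx.cached_prefix_tokens
    if uncached < min_remote_tokens:
        return "local"  # too little work to justify the handoff
    if ctx.prefill_queue_depth > max_prefill_queue:
        return "local"  # P is overloaded
    return "remote"


def blocks_to_push(prompt_block_ids, available_block_ids):
    """Blocks P still has to push, preserving prompt order."""
    available = set(available_block_ids)
    return [b for b in prompt_block_ids if b not in available]
```

Ordering the checks from hard constraints (RDMA health) to soft ones (load) keeps the fallback to local prefill cheap to reason about.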
Core KV Cache
P2P KV Cache Sharing
- MetaServer high availability
- Improve MetaServer availability and recovery speed for P2P KV cache sharing.
- Keep concrete HA implementation choices out of this roadmap draft.
- MetaServer as a cache orchestration point
- Explore using MetaServer's global view of block locations to guide cache
orchestration decisions.
- Track cache redundancy: which blocks are over-replicated, under-replicated,
or concentrated on fragile nodes.
- Surface global cache health signals that can help future routing,
replication, cleanup, and recovery policies.
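As an illustration of the redundancy tracking described above, the sketch below classifies blocks from a global block-to-nodes map. The function name, the replica bounds, and the notion of a caller-supplied set of fragile nodes are assumptions for the example; the MetaServer's actual data model is not specified in this roadmap.

```python
def classify_replication(block_locations, min_replicas=2, max_replicas=3,
                         fragile_nodes=frozenset()):
    """Classify blocks by replica count and placement.

    block_locations: mapping of block id -> iterable of node ids holding it.
    Returns sorted lists of under-replicated, over-replicated, and
    fragile-only blocks (every replica sits on a fragile node).
    """
    fragile = set(fragile_nodes)
    report = {"under": [], "over": [], "fragile_only": []}
    for block_id in sorted(block_locations):
        nodes = set(block_locations[block_id])
        if len(nodes) < min_replicas:
            report["under"].append(block_id)
        elif len(nodes) > max_replicas:
            report["over"].append(block_id)
        # A block may be adequately replicated yet still at risk.
        if nodes and nodes <= fragile:
            report["fragile_only"].append(block_id)
    return report
```

A block can appear in both a replica-count list and `fragile_only`, since the two signals are independent.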
RDMA Transfer
Unify v1/v2 Transfer Layer
- v1/v2 convergence
- pegaflow-transfer currently has a stable v1 path and a more experimental
v2 path for PD push. The roadmap should converge them instead of letting two
RDMA APIs grow independently.
- Keep the useful v2 capabilities, especially IMM completion, scatter/paged
transfers, and explicit routing, but expose them through a stable transfer
abstraction.
Ecosystem & Integrations
vLLM Connector
- Hybrid KV cache manager (HMA) support
- vLLM's HybridKVCacheCoordinator splits KV cache into type-specific groups
for models with mixed attention (sliding window + full, Mamba + full, etc.).
PegaFlow connector must implement the SupportsHMA interface
(request_finished_all_groups) to be compatible when the hybrid manager is
enabled.
- This is non-trivial: the connector must track per-group block IDs, handle
type-specific eviction semantics, and coordinate save/load across groups that
have different lifetimes. The prefix-cache hit detection for hybrid models
involves a fixed-point intersection between full-attention and
sliding-window/local-attention groups.
- P/D connector merge
- Merge pegaflow.connector (save/load path) and pegaflow.pd_connector
(RDMA-push path) into a single unified connector with mode selection.
User Experience
- Documentation site (GitHub Pages)
- Publish a GitHub Pages site with clean navigation: Concepts
→ Quick Start → Deployment Guides → Configuration Reference → Troubleshooting.
- Keep docs in the repo (docs/) so they version alongside the code. Render
with a static site generator (e.g. mdBook, VitePress, or Just the Docs).
- Deployment guides
- Write end-to-end deployment recipes in English for the major use cases:
single-node KV cache offload, multi-node P2P KV sharing, and P/D
disaggregation.
- Each guide should cover: prerequisites (NIC, hugepages, NUMA), CLI flags,
vLLM connector config, verification steps, and common failure modes.
- User docs vs implementation notes
- Separate user-facing docs (server CLI, connector config, metrics reference,
topology setup) from internal design notes. docs/ is for users;
implementation rationale stays in CLAUDE.md or ARCHITECTURE.md.
Observability
- RDMA performance monitoring
- How do we detect RDMA performance degradation or blockage before it surfaces
as a P/D timeout? Explore what RDMA-level signals are available — completion
queue depth, QP stall indicators, retransmit counters, NIC port counters —
and surface the actionable ones as Prometheus metrics.
- Define what "healthy" RDMA looks like numerically so operators can set
alerting thresholds rather than guessing.
- Log hygiene
- Audit the log surface: every log line should either drive an action or
provide context that metrics cannot. Errors must include enough structured
fields (request_id, tp_rank, rdma_req_id, etc.) to trace root cause
without grep-guessing.
- Remove or demote info logs whose signal is already covered by metrics.
- Tracing
- The load path already has enough instrumentation to reconstruct a request's
full trajectory (cache query → prefetch → transfer → GPU ready).
- Tail-sample traces: emit as structured logs at ~1% probability, and always
emit when end-to-end latency exceeds a configurable threshold. This gives
operators a concrete per-request timeline without a full tracing
infrastructure dependency.
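The tail-sampling rule above can be sketched as follows. The function names, the 500 ms default threshold, and the record layout are assumptions for illustration; only the ~1% sampling rate and the always-emit-above-threshold behavior come from the roadmap text.

```python
import json
import random


def should_emit_trace(latency_ms, threshold_ms=500.0, sample_rate=0.01,
                      rng=random.random):
    """Always emit slow requests; sample the rest at sample_rate."""
    if latency_ms > threshold_ms:
        return True
    return rng() < sample_rate


def trace_record(request_id, stages_ms):
    """Render a per-request timeline as one structured log line.

    stages_ms: mapping of stage name -> duration in milliseconds,
    e.g. cache_query, prefetch, transfer, gpu_ready.
    """
    return json.dumps(
        {
            "request_id": request_id,
            "total_ms": sum(stages_ms.values()),
            "stages_ms": stages_ms,
        },
        sort_keys=True,
    )
```

Passing `rng` as a parameter keeps the sampling decision deterministic in tests while defaulting to `random.random` in normal use.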