# PegaFlow Roadmap

> PegaFlow is a high-performance KV cache storage and transfer engine for LLM
> inference. The ecosystem spans the Rust core engine, the Python vLLM connector,
> the P/D disaggregation path, pegaflow-transfer (RDMA), and cross-node P2P KV
> sharing.
>
> We recommend reading through the high-level plan first to get an overview, then
> using this roadmap to look up specific components of interest.

## Table of Contents

- [Performance](#performance)
- [P/D Disaggregation](#pd-disaggregation)
- [Core KV Cache](#core-kv-cache)
  - [P2P KV Cache Sharing](#p2p-kv-cache-sharing)
- [RDMA Transfer](#rdma-transfer)
  - [Unify v1/v2 Transfer Layer](#unify-v1v2-transfer-layer)
- [Ecosystem & Integrations](#ecosystem--integrations)
  - [vLLM Connector](#vllm-connector)
- [User Experience](#user-experience)
- [Observability](#observability)

## Performance

Each I/O path should have an isolated, reproducible microbenchmark that measures raw
hardware throughput and latency independent of the inference engine. These
benchmarks are the foundation for capacity planning, regression detection, and
optimization work.

- **D2H (GPU → CPU save)**
  - Single-GPU benchmark measuring pinned-memory write throughput and latency.
  - Variables: block size, layer count, NUMA placement, hugepages on/off.
  - Output: throughput (GB/s), latency distribution (p50/p99).
- **H2D (CPU → GPU load)**
  - Single-GPU benchmark measuring pinned-memory read throughput and latency.
  - Same variable matrix as D2H for direct comparison.
  - Output: throughput (GB/s), latency distribution (p50/p99).
- **SSD (NVMe io_uring)**
  - Isolated io_uring write and read (prefetch) benchmarks against the raw block
    device, without the cache tier.
  - Variables: queue depth, inflight count, block size, O_DIRECT vs buffered.
  - Output: read/write IOPS, throughput (GB/s), latency distribution (p50/p99).
- **RDMA (node-to-node)**
  - Bare `pegaflow-transfer` v2 throughput and latency benchmark: one-sided
    RDMA READ/WRITE between two nodes, no inference engine.
  - Variables: NIC count, QPs per peer, transfer size, NUMA affinity
    (same-HCA vs cross-HCA), hugepages on/off.
  - Output: throughput (GB/s), per-NIC throughput, one-sided READ/WRITE latency
    distribution (p50/p99), effective bandwidth utilization vs link speed.
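
All four benchmarks report the same output shape, so a shared reduction step could keep the numbers comparable across paths. The following is a minimal sketch, assuming each benchmark records one latency sample per operation in microseconds; `BenchSummary`, `percentile`, and `summarize` are hypothetical names, not existing PegaFlow APIs.

```python
import math
from dataclasses import dataclass


@dataclass
class BenchSummary:
    throughput_gbps: float  # GB/s, using 1 GB = 1e9 bytes
    p50_us: float
    p99_us: float


def percentile(sorted_vals, q):
    """Nearest-rank percentile of an ascending list, q an integer in 1..100."""
    rank = max(1, math.ceil(q * len(sorted_vals) / 100))
    return sorted_vals[rank - 1]


def summarize(latencies_us, bytes_per_op, wall_time_s):
    """Reduce per-operation samples to throughput plus p50/p99 latency."""
    if not latencies_us or wall_time_s <= 0:
        raise ValueError("need at least one sample and a positive wall time")
    ordered = sorted(latencies_us)
    total_bytes = bytes_per_op * len(ordered)
    return BenchSummary(
        throughput_gbps=total_bytes / wall_time_s / 1e9,
        p50_us=percentile(ordered, 50),
        p99_us=percentile(ordered, 99),
    )
```

Throughput is derived from wall time rather than from summed latencies, so it stays meaningful when many operations are in flight at once.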

## P/D Disaggregation

Layer-wise RDMA-push Prefill/Decode (P/D) serving, moving from experimental to stable.
KV movement is pipelined with prefill compute layer by layer: the prefill side (P)
pushes layer_i KV to the decode side (D) via RDMA WRITE while computing
layer_{i+1}. D resumes decode after P sends the final IMM.
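
The overlap described above can be sketched with a producer/consumer pair. This is an illustrative simulation only: `compute_layer`, `push_layer`, and `send_final_imm` are hypothetical callbacks standing in for prefill compute, a per-layer RDMA WRITE, and the final IMM, and a Python thread stands in for whatever mechanism the real engine uses.

```python
import queue
import threading


def prefill_with_layerwise_push(num_layers, compute_layer, push_layer, send_final_imm):
    """Run prefill while a background thread pushes finished layers in order."""
    pending = queue.Queue()

    def pusher():
        while True:
            item = pending.get()
            if item is None:          # every layer has been handed over
                send_final_imm()      # tells D that all layers have landed
                return
            layer, kv = item
            push_layer(layer, kv)     # stand-in for an RDMA WRITE of one layer

    worker = threading.Thread(target=pusher)
    worker.start()
    for layer in range(num_layers):
        kv = compute_layer(layer)     # next layer computes while earlier ones are in flight
        pending.put((layer, kv))
    pending.put(None)                 # sentinel: no more layers
    worker.join()
```

Because a single pusher drains a FIFO queue, layers reach D in order and the final signal is sent only after the last layer's push has returned.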

Current status: `pd_connector` is functional end-to-end (small-model validated)
with scheduler/worker state, chunk tracking, HND layout handling, and native
`PdRdmaEngine` on top of `pegaflow-transfer` v2. Remaining: production router
path, formal benchmarking, and reliability hardening.

- **Production serving path**
  - Replace `pd_connector.proxy` with `pegaflow-router` as the normal entry point (keep proxy for local debugging).
  - D-first routing: router selects P/D endpoints, sends request to D only with `kv_transfer_params`, D allocates KV slots and triggers P.
  - Support `/v1/completions` and `/v1/chat/completions` with streaming; preserve request IDs across router, D, P.
- **Attention and KV layout coverage**
  - Keep FlashAttention HND as the first stable target.
  - Add support for more attention/KV layouts beyond HND, including MLA-style
    layouts used by DeepSeek-family models.
  - Make layout detection and validation explicit so unsupported attention
    backends fail before serving traffic.
- **Cross-layer KV layout support**
  - Support models where KV layout or address mapping is not a simple
    independent per-layer table.
  - Define connector metadata for cross-layer addressing without hard-coding one
    layout into the RDMA path.
- **Pipeline parallel support**
  - Explore P/D handoff when the serving engine uses pipeline parallelism.
  - Define how handshakes, layer ownership, and RDMA push scheduling map to PP
    stages.
  - Validate that layer-wise push still overlaps with prefill compute when
    layers are split across pipeline stages.
- **Conditional P/D**
  - Policy layer to choose local vs remote prefill per request: prompt length, P/D load, cache state, RDMA health, SLO target.
  - P-side prefix caching re-enabled: D tells P which prefix blocks are available so P only pushes missing KV blocks.
  - Role-flexible workers: any worker can serve prefill or decode on demand.
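
As a rough illustration of the Conditional P/D items, the sketch below pairs a per-request policy with the "push only missing blocks" step. All names and thresholds are assumptions chosen for the example; a real policy would also weigh the SLO target and D-side load, which are left out here.

```python
from dataclasses import dataclass


@dataclass
class RequestSignals:
    prompt_tokens: int
    cached_prefix_tokens: int   # tokens D can already serve from its cache
    prefill_queue_depth: int    # outstanding requests on the candidate P
    rdma_healthy: bool


def choose_prefill(sig, min_remote_tokens=2048, max_queue_depth=8):
    """Return "remote" to prefill on P, or "local" to prefill on D."""
    uncached = sig.prompt_tokens - sig.cached_prefix_tokens
    if not sig.rdma_healthy:
        return "local"      # no working push path
    if uncached < min_remote_tokens:
        return "local"      # too little work to justify a handoff
    if sig.prefill_queue_depth > max_queue_depth:
        return "local"      # P is saturated
    return "remote"


def blocks_to_push(prompt_block_hashes, available_on_d):
    """Blocks P still has to push, given the blocks D reported as available."""
    have = set(available_on_d)
    return [h for h in prompt_block_hashes if h not in have]
```

For example, a long uncached prompt on a healthy link goes remote, and if D already holds the first two of four blocks, P pushes only the last two.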

## Core KV Cache

### P2P KV Cache Sharing

- **MetaServer high availability**
  - Improve MetaServer availability and recovery speed for P2P KV cache sharing.
  - Keep concrete HA implementation choices out of this roadmap draft.
- **MetaServer as a cache orchestration point**
  - Explore using MetaServer's global view of block locations to guide cache
    orchestration decisions.
  - Track cache redundancy: which blocks are over-replicated, under-replicated,
    or concentrated on fragile nodes.
  - Surface global cache health signals that can help future routing,
    replication, cleanup, and recovery policies.
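
One way to picture the redundancy tracking above is a pass over a block-to-nodes map. This is a minimal sketch, assuming MetaServer can export such a map; the replica bounds and function names are hypothetical, and "fragile" is approximated here as "holds the only copy of many blocks".

```python
from collections import Counter


def classify_blocks(block_locations, min_replicas=2, max_replicas=3):
    """Split blocks into under- and over-replicated lists by replica count."""
    under, over = [], []
    for block, nodes in block_locations.items():
        count = len(set(nodes))
        if count < min_replicas:
            under.append(block)
        elif count > max_replicas:
            over.append(block)
    return sorted(under), sorted(over)


def sole_copy_load(block_locations):
    """Count, per node, the blocks for which that node holds the only copy."""
    load = Counter()
    for nodes in block_locations.values():
        unique = set(nodes)
        if len(unique) == 1:
            load[next(iter(unique))] += 1
    return load
```

A node with a high sole-copy count is one whose loss would drop many blocks from the cluster, which is the kind of signal a replication or routing policy could act on.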

## RDMA Transfer

### Unify v1/v2 Transfer Layer

- **v1/v2 convergence**
  - `pegaflow-transfer` currently has a stable v1 path and a more experimental
    v2 path for P/D push. Converge them rather than letting two RDMA APIs grow
    independently.
  - Keep the useful v2 capabilities, especially IMM completion, scatter/paged
    transfers, and explicit routing, but expose them through a stable transfer
    abstraction.
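
To make the idea of a single stable abstraction concrete, here is a hedged sketch of one request type that carries the three v2 capabilities named above. It is written in Python for illustration only and does not reflect the actual `pegaflow-transfer` API; `TransferRequest`, `Transfer`, and `LoopbackTransfer` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Sequence, Tuple


@dataclass(frozen=True)
class TransferRequest:
    # (source offset, destination offset, length) triples: one entry is a
    # contiguous copy, several entries describe a scatter/paged transfer.
    segments: Sequence[Tuple[int, int, int]]
    imm: Optional[int] = None     # completion tag delivered to the receiver
    route: Optional[str] = None   # explicit path choice; None lets the layer pick


class Transfer(Protocol):
    def submit(self, req: TransferRequest) -> None: ...


class LoopbackTransfer:
    """In-memory stand-in that copies between two buffers, for tests only."""

    def __init__(self, src: bytes, dst: bytearray):
        self.src, self.dst = src, dst
        self.received_imms = []

    def submit(self, req: TransferRequest) -> None:
        for src_off, dst_off, length in req.segments:
            self.dst[dst_off:dst_off + length] = self.src[src_off:src_off + length]
        if req.imm is not None:
            self.received_imms.append(req.imm)  # recorded only after all copies
```

With this shape, a v1-style caller submits a single segment with no `imm`, while a P/D push submits many segments plus a completion tag, and both go through the same `submit` call.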

## Ecosystem & Integrations

### vLLM Connector

- **Hybrid KV cache manager (HMA) support**
  - vLLM's `HybridKVCacheCoordinator` splits KV cache into type-specific groups
    for models with mixed attention (sliding window + full, Mamba + full, etc.).
    The PegaFlow connector must implement the `SupportsHMA` interface
    (`request_finished_all_groups`) to be compatible when the hybrid manager is
    enabled.
  - This is non-trivial: the connector must track per-group block IDs, handle
    type-specific eviction semantics, and coordinate save/load across groups that
    have different lifetimes. The prefix-cache hit detection for hybrid models
    involves a fixed-point intersection between full-attention and
    sliding-window/local-attention groups.
- **P/D connector merge**
  - Merge `pegaflow.connector` (save/load path) and `pegaflow.pd_connector`
    (RDMA-push path) into a single unified connector with mode selection.

## User Experience

- **Documentation site (GitHub Pages)**
  - Publish a GitHub Pages site with clean navigation: Concepts
    → Quick Start → Deployment Guides → Configuration Reference → Troubleshooting.
  - Keep docs in the repo (`docs/`) so they version alongside the code. Render
    with a static site generator (e.g. mdBook, VitePress, or Just the Docs).
- **Deployment guides**
  - Write end-to-end deployment recipes in English for the major use cases:
    single-node KV cache offload, multi-node P2P KV sharing, and P/D
    disaggregation.
  - Each guide should cover: prerequisites (NIC, hugepages, NUMA), CLI flags,
    vLLM connector config, verification steps, and common failure modes.
- **User docs vs implementation notes**
  - Separate user-facing docs (server CLI, connector config, metrics reference,
    topology setup) from internal design notes. `docs/` is for users;
    implementation rationale stays in CLAUDE.md or ARCHITECTURE.md.

## Observability

- **RDMA performance monitoring**
  - Detect RDMA performance degradation or blockage before it surfaces as a P/D
    timeout. Explore which RDMA-level signals are available (completion queue
    depth, QP stall indicators, retransmit counters, NIC port counters) and
    surface the actionable ones as Prometheus metrics.
  - Define what "healthy" RDMA looks like numerically so operators can set
    alerting thresholds rather than guessing.
- **Log hygiene**
  - Audit the log surface: every log line should either drive an action or
    provide context that metrics cannot. Errors must include enough structured
    fields (`request_id`, `tp_rank`, `rdma_req_id`, etc.) to trace root cause
    without grep-guessing.
  - Remove or demote info logs whose signal is already covered by metrics.
- **Tracing**
  - The load path already has enough instrumentation to reconstruct a request's
    full trajectory (cache query → prefetch → transfer → GPU ready).
  - Tail-sample traces: emit as structured logs at ~1% probability, and always
    emit when end-to-end latency exceeds a configurable threshold. This gives
    operators a concrete per-request timeline without a full tracing
    infrastructure dependency.
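
The tail-sampling rule above is small enough to sketch directly. The example assumes per-stage timings are already collected for each request; `should_emit_trace`, `trace_record`, and the field names are hypothetical.

```python
import json
import random


def should_emit_trace(latency_ms, threshold_ms, sample_rate=0.01, rng=random.random):
    """Always keep slow requests; keep the rest with probability sample_rate."""
    if latency_ms > threshold_ms:
        return True
    return rng() < sample_rate


def trace_record(request_id, stages_ms):
    """Serialize one request timeline as a single structured log line."""
    return json.dumps(
        {
            "request_id": request_id,
            "stages_ms": stages_ms,
            "total_ms": sum(stages_ms.values()),
        },
        sort_keys=True,
    )
```

The decision is made after the request completes, once its end-to-end latency is known, which is what lets slow requests always be captured. Passing `rng` explicitly keeps the sampling decision deterministic in tests.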


