PegaFlow Roadmap #314

@xiaguan

PegaFlow Roadmap

PegaFlow is a high-performance KV cache storage and transfer engine for LLM
inference. The ecosystem spans the Rust core engine, the Python vLLM connector,
the P/D disaggregation path, pegaflow-transfer (RDMA), and cross-node P2P KV
sharing.

We recommend reading through the high-level plan first to get an overview, then
using this roadmap to look up specific components of interest.

Performance

Each I/O path should have an isolated, reproducible microbenchmark that measures raw
hardware throughput and latency independent of the inference engine. These
benchmarks are the foundation for capacity planning, regression detection, and
optimization work.

  • D2H (GPU → CPU save)
    • Single-GPU benchmark measuring pinned-memory write throughput and latency.
    • Variables: block size, layer count, NUMA placement, hugepages on/off.
    • Output: throughput (GB/s), latency distribution (p50/p99).
  • H2D (CPU → GPU load)
    • Single-GPU benchmark measuring pinned-memory read throughput and latency.
    • Same variable matrix as D2H for direct comparison.
    • Output: throughput (GB/s), latency distribution (p50/p99).
  • SSD (NVMe io_uring)
    • Isolated io_uring write and read (prefetch) benchmarks against the raw block
      device, without the cache tier.
    • Variables: queue depth, inflight count, block size, O_DIRECT vs buffered.
    • Output: read/write IOPS, throughput (GB/s), latency distribution (p50/p99).
  • RDMA (node-to-node)
    • Bare pegaflow-transfer v2 throughput and latency benchmark: one-sided
      RDMA READ/WRITE between two nodes, no inference engine.
    • Variables: NIC count, QPs per peer, transfer size, NUMA affinity
      (same-HCA vs cross-HCA), hugepages on/off.
    • Output: throughput (GB/s), per-NIC throughput, one-sided READ/WRITE latency
      distribution (p50/p99), effective bandwidth utilization vs link speed.
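
Every benchmark above reports the same two outputs, so a shared summary helper keeps the numbers comparable across paths. The following is a minimal sketch, not PegaFlow code: `BenchSummary`, `percentile`, and `summarize` are hypothetical names, GB/s is assumed to mean decimal gigabytes, and percentiles use the nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class BenchSummary:
    throughput_gb_s: float  # decimal GB/s (1 GB = 1e9 bytes), an assumption
    p50_us: float
    p99_us: float


def percentile(sorted_samples, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct * len(sorted_samples) / 100))
    return sorted_samples[rank - 1]


def summarize(latencies_us, bytes_per_op, wall_seconds):
    """Reduce per-operation latencies to throughput plus p50/p99."""
    ordered = sorted(latencies_us)
    total_bytes = bytes_per_op * len(ordered)
    return BenchSummary(
        throughput_gb_s=total_bytes / wall_seconds / 1e9,
        p50_us=percentile(ordered, 50),
        p99_us=percentile(ordered, 99),
    )
```

Throughput is derived from wall-clock time rather than from the sum of latencies, so it stays meaningful when operations overlap (queue depth or inflight count greater than one).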

P/D Disaggregation

Layer-wise RDMA-push Prefill/Decode serving, moving from experimental to stable.
KV movement is pipelined with prefill compute layer by layer — P pushes
layer_i KV to D via RDMA WRITE while computing layer_{i+1}. D resumes decode
after P sends the final IMM.
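
The overlap described above can be sketched with a single background pusher, assuming the three callbacks stand in for prefill compute, RDMA WRITE, and the final IMM. All names here (`run_prefill`, `compute_layer`, `push_layer`, `send_final_imm`) are hypothetical; the real path runs through PdRdmaEngine, whose API is not shown in this roadmap.

```python
import queue
import threading


def run_prefill(num_layers, compute_layer, push_layer, send_final_imm):
    """Push layer i in the background while layer i+1 is being computed."""
    pending = queue.Queue()
    errors = []

    def pusher():
        # Drain finished layers in FIFO order until the sentinel arrives.
        while True:
            item = pending.get()
            if item is None:
                return
            layer, kv = item
            try:
                push_layer(layer, kv)
            except Exception as exc:  # surfaced to the caller after join
                errors.append(exc)

    worker = threading.Thread(target=pusher)
    worker.start()
    try:
        for layer in range(num_layers):
            kv = compute_layer(layer)  # overlaps with earlier pushes
            pending.put((layer, kv))
    finally:
        pending.put(None)
        worker.join()
    if errors:
        raise errors[0]
    # Only signal D once every layer has been pushed.
    send_final_imm()
```

The final signal is sent strictly after the pusher has drained, which mirrors the rule that D resumes decode only after the final IMM.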

Current status: pd_connector is functional end-to-end (small-model validated)
with scheduler/worker state, chunk tracking, HND layout handling, and native
PdRdmaEngine on top of pegaflow-transfer v2. Remaining: production router
path, formal benchmarking, and reliability hardening.

  • Production serving path
    • Replace pd_connector.proxy with pegaflow-router as the normal entry point (keep proxy for local debugging).
    • D-first routing: the router selects P/D endpoints and sends the request only to D, together with kv_transfer_params; D allocates KV slots and triggers P.
    • Support /v1/completions and /v1/chat/completions with streaming; preserve request IDs across router, D, P.
  • Attention and KV layout coverage
    • Keep FlashAttention HND as the first stable target.
    • Add support for more attention/KV layouts beyond HND, including MLA-style
      layouts used by DeepSeek-family models.
    • Make layout detection and validation explicit so unsupported attention
      backends fail before serving traffic.
  • Cross-layer KV layout support
    • Support models where KV layout or address mapping is not a simple
      independent per-layer table.
    • Define connector metadata for cross-layer addressing without hard-coding one
      layout into the RDMA path.
  • Pipeline parallel support
    • Explore P/D handoff when the serving engine uses pipeline parallelism.
    • Define how handshakes, layer ownership, and RDMA push scheduling map to PP
      stages.
    • Validate that layer-wise push still overlaps with prefill compute when
      layers are split across pipeline stages.
  • Conditional P/D
    • Policy layer to choose local vs remote prefill per request, based on prompt length, P/D load, cache state, RDMA health, and SLO target.
    • P-side prefix caching re-enabled: D tells P which prefix blocks are available so P only pushes missing KV blocks.
    • Role-flexible workers: any worker can serve prefill or decode on demand.
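
As an illustration of the Conditional P/D items, here is a minimal sketch of a per-request policy and of the missing-block computation. The field names, thresholds, and the `"local"`/`"remote"` return values are all assumptions made for the example; the roadmap does not define the policy inputs beyond the list above, and `missing_blocks` assumes D reports the block IDs it already holds.

```python
from dataclasses import dataclass


@dataclass
class RequestSignals:
    prompt_tokens: int
    cached_prefix_tokens: int  # tokens already covered by cached KV
    rdma_healthy: bool
    prefill_queue_depth: int   # stand-in for P-side load


@dataclass
class Policy:
    min_remote_tokens: int = 2048  # hypothetical threshold
    max_prefill_queue: int = 8     # hypothetical threshold


def choose_prefill(sig, policy=None):
    """Return "local" or "remote" for one request."""
    policy = policy or Policy()
    uncached = sig.prompt_tokens - sig.cached_prefix_tokens
    if not sig.rdma_healthy:
        return "local"  # never depend on a degraded fabric
    if uncached < policy.min_remote_tokens:
        return "local"  # too little work to justify a transfer
    if sig.prefill_queue_depth > policy.max_prefill_queue:
        return "local"  # remote prefill workers are saturated
    return "remote"


def missing_blocks(request_blocks, available_on_d):
    """Blocks P still has to push, in request order."""
    available = set(available_on_d)
    return [b for b in request_blocks if b not in available]
```

An SLO target could be folded in by making the thresholds a function of the request's latency class rather than fixed constants.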

Core KV Cache

P2P KV Cache Sharing

  • MetaServer high availability
    • Improve MetaServer availability and recovery speed for P2P KV cache sharing.
    • Keep concrete HA implementation choices out of this roadmap draft.
  • MetaServer as a cache orchestration point
    • Explore using MetaServer's global view of block locations to guide cache
      orchestration decisions.
    • Track cache redundancy: which blocks are over-replicated, under-replicated,
      or concentrated on fragile nodes.
    • Surface global cache health signals that can help future routing,
      replication, cleanup, and recovery policies.
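
Redundancy tracking from a global block-location view could look like the sketch below. It assumes the MetaServer can produce a mapping from block ID to the set of nodes holding it; the function name, the replica bounds, and the notion of a caller-supplied "fragile" node set are hypothetical.

```python
def classify_redundancy(block_locations, min_replicas=2, max_replicas=3,
                        fragile_nodes=frozenset()):
    """Group block IDs by redundancy problem.

    block_locations: dict mapping block ID -> iterable of node names.
    Returns a dict with "under", "over", and "fragile" lists.
    """
    report = {"under": [], "over": [], "fragile": []}
    fragile = set(fragile_nodes)
    for block, nodes in sorted(block_locations.items()):
        holders = set(nodes)
        if len(holders) < min_replicas:
            report["under"].append(block)
        elif len(holders) > max_replicas:
            report["over"].append(block)
        # A block is "fragile" when every copy sits on a fragile node.
        if holders and holders <= fragile:
            report["fragile"].append(block)
    return report
```

A block can appear under both "under" and "fragile", which is the case most worth surfacing to a replication policy.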

RDMA Transfer

Unify v1/v2 Transfer Layer

  • v1/v2 convergence
    • pegaflow-transfer currently has a stable v1 path and a more experimental
      v2 path for PD push. The roadmap should converge them instead of letting two
      RDMA APIs grow independently.
    • Keep the useful v2 capabilities, especially IMM completion, scatter/paged
      transfers, and explicit routing, but expose them through a stable transfer
      abstraction.

Ecosystem & Integrations

vLLM Connector

  • Hybrid KV cache manager (HMA) support
    • vLLM's HybridKVCacheCoordinator splits KV cache into type-specific groups
      for models with mixed attention (sliding window + full, Mamba + full, etc.).
      PegaFlow connector must implement the SupportsHMA interface
      (request_finished_all_groups) to be compatible when the hybrid manager is
      enabled.
    • This is non-trivial: the connector must track per-group block IDs, handle
      type-specific eviction semantics, and coordinate save/load across groups that
      have different lifetimes. The prefix-cache hit detection for hybrid models
      involves a fixed-point intersection between full-attention and
      sliding-window/local-attention groups.
  • P/D connector merge
    • Merge pegaflow.connector (save/load path) and pegaflow.pd_connector
      (RDMA-push path) into a single unified connector with mode selection.
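
To make the fixed-point intersection for hybrid prefix-cache hits concrete, here is a simplified sketch with one full-attention group and one sliding-window group. It is not vLLM's implementation: the per-block boolean inputs, the window measured in blocks, and all function names are assumptions. A hit of n blocks is treated as valid when the full-attention group has blocks 0..n-1 cached and the sliding-window group has the last `window` blocks of that prefix cached.

```python
def full_hit(cached, limit):
    """Longest fully cached prefix, capped at `limit` blocks."""
    n = 0
    while n < limit and cached[n]:
        n += 1
    return n


def sliding_hit(cached, window, limit):
    """Largest n <= limit whose trailing `window` blocks are all cached."""
    for n in range(limit, 0, -1):
        if all(cached[max(0, n - window):n]):
            return n
    return 0


def hybrid_hit(full_cached, sliding_cached, window):
    """Shrink the candidate hit length until both groups accept it."""
    n = min(len(full_cached), len(sliding_cached))
    while True:
        n_full = full_hit(full_cached, n)
        n_next = sliding_hit(sliding_cached, window, n_full)
        if n_next == n:
            return n  # fixed point: neither group shrinks it further
        n = n_next
```

The candidate length only ever decreases, so the loop terminates. With more than two groups, the same loop would cycle through every group until a full pass leaves the length unchanged.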

User Experience

  • Documentation site (GitHub Pages)
    • Publish a GitHub Pages site with clean navigation: Concepts
      → Quick Start → Deployment Guides → Configuration Reference → Troubleshooting.
    • Keep docs in the repo (docs/) so they version alongside the code. Render
      with a static site generator (e.g. mdBook, VitePress, or Just the Docs).
  • Deployment guides
    • Write end-to-end deployment recipes in English for the major use cases:
      single-node KV cache offload, multi-node P2P KV sharing, and P/D
      disaggregation.
    • Each guide should cover: prerequisites (NIC, hugepages, NUMA), CLI flags,
      vLLM connector config, verification steps, and common failure modes.
  • User docs vs implementation notes
    • Separate user-facing docs (server CLI, connector config, metrics reference,
      topology setup) from internal design notes. docs/ is for users;
      implementation rationale stays in CLAUDE.md or ARCHITECTURE.md.

Observability

  • RDMA performance monitoring
    • How do we detect RDMA performance degradation or blockage before it surfaces
      as a P/D timeout? Explore what RDMA-level signals are available — completion
      queue depth, QP stall indicators, retransmit counters, NIC port counters —
      and surface the actionable ones as Prometheus metrics.
    • Define what "healthy" RDMA looks like numerically so operators can set
      alerting thresholds rather than guessing.
  • Log hygiene
    • Audit the log surface: every log line should either drive an action or
      provide context that metrics cannot. Errors must include enough structured
      fields (request_id, tp_rank, rdma_req_id, etc.) to trace root cause
      without grep-guessing.
    • Remove or demote info logs whose signal is already covered by metrics.
  • Tracing
    • The load path already has enough instrumentation to reconstruct a request's
      full trajectory (cache query → prefetch → transfer → GPU ready).
    • Tail-sample traces: emit as structured logs at ~1% probability, and always
      emit when end-to-end latency exceeds a configurable threshold. This gives
      operators a concrete per-request timeline without a full tracing
      infrastructure dependency.
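
The tail-sampling rule can be sketched in a few lines. The trace shape (a dict with a `total_ms` field), the function names, and the injectable `rng` and `sink` parameters are assumptions for illustration; only the ~1% rate and the latency-threshold rule come from the item above.

```python
import json
import random


def should_emit(latency_ms, threshold_ms, sample_rate=0.01, rng=random.random):
    """Always keep slow requests; keep the rest with probability sample_rate."""
    return latency_ms > threshold_ms or rng() < sample_rate


def emit_trace(trace, threshold_ms, sample_rate=0.01,
               rng=random.random, sink=print):
    """Write the trace as one structured log line if it is selected."""
    if should_emit(trace["total_ms"], threshold_ms, sample_rate, rng):
        sink(json.dumps(trace, sort_keys=True))
        return True
    return False
```

Passing `rng` and `sink` as parameters keeps the decision deterministic in tests and lets the log line go through whatever logger the server already uses.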
