Releases: novitalabs/pegaflow
v0.22.10
Release of the pegaflow workspace / pegaflow-llm 0.22.10 — 12 commits since v0.22.9, centered on MLA KV-cache storage efficiency, model-aware transfer-backend selection, and cross-node redundancy observability.
English
✨ Features
- MLA KV page-first storage (#360) — store MLA KV cache page-first so per-block metadata collapses, cutting metadata overhead for MLA models.
- Per-layer MLA TP save distribution (#359) — spread MLA tensor-parallel save work across ranks by layer to balance save load.
- Model-aware KV transfer backend (#357) — the connector auto-selects the KV transfer backend per model; the server no longer needs a static backend setting.
- Metaserver block-redundancy metrics (#361) — new `pegaflow_metaserver_block_redundancy{owners="1|2|3|>=4"}` distribution plus `pegaflow_metaserver_block_redundancy_avg` gauge, surfacing the cross-node KV replication factor (how much effective cache capacity shrinks).
- P/D handshake wire schema (#345) — seal the prefill/decode handshake wire schema in `pegaflow-pd-wire`.
- Transfer benchmarks (#349, eb69309) — p2p RDMA fetch example plus native D2H/H2D transfer-path measurement.
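The two redundancy series can be illustrated with a short sketch. It buckets blocks by owner count and averages the replication factor. The sketch assumes the metaserver can count, for each block, how many nodes hold a copy. The function names are hypothetical; only the metric names and bucket labels come from the release note.

```rust
use std::collections::HashMap;

/// Hypothetical sketch: bucket blocks by owner count using the same labels as
/// `pegaflow_metaserver_block_redundancy{owners=...}`, and compute the mean
/// owner count that `pegaflow_metaserver_block_redundancy_avg` would report.
fn redundancy_buckets(owners_per_block: &[usize]) -> (HashMap<&'static str, u64>, f64) {
    let mut buckets: HashMap<&'static str, u64> = HashMap::new();
    let mut total_owners = 0usize;
    let mut owned_blocks = 0usize;
    for &owners in owners_per_block {
        let label = match owners {
            0 => continue, // assumption: blocks with no owner are not counted
            1 => "1",
            2 => "2",
            3 => "3",
            _ => ">=4",
        };
        *buckets.entry(label).or_insert(0) += 1;
        total_owners += owners;
        owned_blocks += 1;
    }
    let avg = if owned_blocks == 0 {
        0.0
    } else {
        total_owners as f64 / owned_blocks as f64
    };
    (buckets, avg)
}

/// If each block is stored `avg_redundancy` times on average, the cluster
/// holds roughly raw_capacity / avg_redundancy distinct blocks.
fn effective_capacity(raw_capacity_blocks: f64, avg_redundancy: f64) -> f64 {
    if avg_redundancy <= 0.0 {
        raw_capacity_blocks
    } else {
        raw_capacity_blocks / avg_redundancy
    }
}
```

For example, if every cached block lives on two nodes, the average is 2.0. A cluster with room for 1000 blocks then holds only about 500 distinct ones.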
🐛 Fixes
- Drop late duplicate saves (#358) — skip late duplicate saves of already-resident blocks, avoiding redundant work.
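Skipping late duplicates amounts to filtering each incoming save batch against the set of blocks already resident. The sketch below is a hypothetical illustration. `ResidentIndex` and `filter_and_admit` are invented names, and keying blocks by a `u64` hash is an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical index of blocks already resident in the cache.
struct ResidentIndex {
    resident: HashSet<u64>, // assumption: blocks are keyed by a u64 hash
}

impl ResidentIndex {
    fn new() -> Self {
        Self { resident: HashSet::new() }
    }

    /// Returns only the hashes that still need saving and marks them resident.
    /// `insert` returns false for a hash already present, so late duplicates
    /// (and repeats within the same batch) are dropped before any copy work.
    fn filter_and_admit(&mut self, incoming: &[u64]) -> Vec<u64> {
        incoming
            .iter()
            .copied()
            .filter(|h| self.resident.insert(*h))
            .collect()
    }
}
```

A real store would more likely mark a block resident only after its write completes, so that a failed save can be retried.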
♻️ Refactors
- Restructure `pd_connector` for maintainability (#355).
- `SealedBlock` owns its `RawBlock` slot (#352).
- Use `usize` for `block_ids`, validated at the RPC boundary (#351).
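Converting once at the RPC boundary lets internal code index with `usize` without re-checking. The following is a minimal sketch. It assumes the wire format carries block ids as signed 64-bit integers; `validate_block_ids` and the `num_blocks` bound are hypothetical.

```rust
/// Hypothetical boundary check: convert wire-format block ids (assumed i64)
/// to `usize` once, rejecting negative or out-of-range ids before they reach
/// internal code that indexes with `usize`.
fn validate_block_ids(raw: &[i64], num_blocks: usize) -> Result<Vec<usize>, String> {
    raw.iter()
        .map(|&id| {
            // try_from fails for negative values instead of wrapping.
            let idx = usize::try_from(id).map_err(|_| format!("negative block id {id}"))?;
            if idx >= num_blocks {
                return Err(format!("block id {idx} out of range (num_blocks = {num_blocks})"));
            }
            Ok(idx)
        })
        // Stops at the first invalid id and returns its error.
        .collect()
}
```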
🔧 Chore
- Bump version `0.22.9` → `0.22.10` (#362).
⚠️ Strict version handshake: client and server must match on `CARGO_PKG_VERSION` at registration, so upgrade both sides together.
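A strict handshake reduces to an exact string comparison at registration. `CARGO_PKG_VERSION` is the crate version Cargo sets at build time. Everything else here is a hypothetical sketch: the function names are invented, and the sketch assumes the client sends its version string in the registration request.

```rust
/// Server-side version string. Under Cargo this is the crate's
/// CARGO_PKG_VERSION; `option_env!` keeps the sketch compiling without Cargo.
fn server_version() -> &'static str {
    option_env!("CARGO_PKG_VERSION").unwrap_or("unknown")
}

/// Hypothetical registration check: any difference, even in the patch
/// component, rejects the client rather than negotiating a compatible subset.
fn check_registration_version(client: &str, server: &str) -> Result<(), String> {
    if client == server {
        Ok(())
    } else {
        Err(format!(
            "version mismatch: client {client}, server {server}; upgrade both sides together"
        ))
    }
}
```

A server would call `check_registration_version(client_version, server_version())` and refuse registration on `Err`.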
Full Changelog: v0.22.9...v0.22.10
v0.22.9
What's Changed
- refactor(core): replace KVCacheRegistration with validated KVCacheLayout by @xiaguan in #340
- feat(pd): support mtp split connector by @GentleCold in #341
- fix(core)!: seal instance layer topology from registered layers by @xiaguan in #346
- feat(connector): register layer-split KV caches from cache config by @feifei-111 in #343
- chore: bump version to 0.22.9 by @xiaguan in #347
Full Changelog: v0.22.8...v0.22.9
v0.22.8
chore: bump version to 0.22.8 (#337)
## Summary
- bump workspace and Python package versions to 0.22.8
- refresh Cargo.lock package versions for workspace crates
## Tests
- `cargo metadata --locked --no-deps`
- `uv run --no-project python -c "import pathlib, tomllib; p=tomllib.loads(pathlib.Path('python/pyproject.toml').read_text()); assert p['project']['version']=='0.22.8'; assert p['tool']['commitizen']['version']=='0.22.8'; print('python metadata version ok')"`
- pre-commit hooks during commit, including `cargo test --release`
v0.22.7
chore(release): prepare 0.22.7
v0.22.6
0.22.5 -> 0.22.6 Comparison
0.22.6 is the first stable P/D disaggregation release. It closes the loop on P/D RDMA push performance, correctness guards, cache-layout compatibility, and package completeness. 0.22.5 is the previous release baseline; 0.22.6 is the recommended release for stable P/D disaggregation.
| Area | 0.22.5 | 0.22.6 |
|---|---|---|
| P/D disaggregation | Baseline release state, with the P/D path still converging | First stable release: P/D RDMA push, scheduler/worker protocol, prefill/decode flow, proxy, and native RDMA binding have been stabilized |
| Performance | RDMA push and layout mapping had not received this round of optimization | Adds MLA cache layout support; optimizes layout mapping; adds connector metrics; improves RDMA push sending, completion signaling, wait behavior, and link-utilization visibility |
| Correctness | Edge cases such as version mismatch, zero-hit query probes, and RDMA-only query paths were under-protected | Fails early on vLLM/server version mismatch; avoids incorrect lease release for zero-hit query probes; cfg-gates RDMA-only query behavior; fails faster when vLLM startup dies |
| Packaging | CUDA 13 wheels used `--no-default-features --features cuda-13`, which disabled the default `rdma` feature | CUDA 13 CI/release wheels now explicitly use `--features cuda-13,rdma`; both cu12 and cu13 packages include RDMA |
| Other fixes | Single SSD cache path | Multiple SSD cache paths; unused Rust dependencies removed; test gates and e2e cargo features aligned |
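Multi-path SSD caching, like the sharded SSD cache from 0.22.4 (#299), spreads blocks over several files. One simple placement is round-robin striping of fixed-size slots across the configured paths. This sketch shows that placement only as an illustration: the striping policy, `ShardedSsdLayout`, and `locate` are assumptions, not the project's documented layout.

```rust
use std::path::PathBuf;

/// Hypothetical layout: fixed-size cache slots striped round-robin across
/// one cache file per configured SSD path.
struct ShardedSsdLayout {
    paths: Vec<PathBuf>,
    block_size: u64,
}

impl ShardedSsdLayout {
    fn new(paths: Vec<PathBuf>, block_size: u64) -> Result<Self, String> {
        if paths.is_empty() {
            return Err("at least one SSD cache path is required".into());
        }
        if block_size == 0 {
            return Err("block_size must be non-zero".into());
        }
        Ok(Self { paths, block_size })
    }

    /// Maps a global slot number to (cache file, byte offset within that file).
    fn locate(&self, slot: u64) -> (&PathBuf, u64) {
        let n = self.paths.len() as u64;
        let file = (slot % n) as usize;
        let offset = (slot / n) * self.block_size;
        (&self.paths[file], offset)
    }
}
```

With two paths and 4 KiB blocks, slot 3 lands in the second file at byte offset 4096. Striping consecutive slots across devices is meant to let sequential saves use every SSD at once.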
Artifacts
- GitHub Release includes 10 wheels: Python 3.10-3.14 for both `pegaflow-llm` and `pegaflow-llm-cu13`.
- PyPI published `pegaflow-llm==0.22.6` and `pegaflow-llm-cu13==0.22.6`.
v0.22.5
chore(release): prepare 0.22.5 (#316)
## Summary
- Bump workspace and Python package versions to 0.22.5
- Refresh Cargo.lock package versions
## Test
- `cargo check --workspace --all-targets --locked`

Release workflow note: `.github/workflows/release.yml` publishes on `v*` tag push after this PR is merged.

Co-authored-by: root <root@host-192-168-172-86.tail58a9b0.ts.net>
0.22.4
✨ Highlights
- Disaggregated P/D over RDMA push (#297) — New `PdConnector` plus a v2 transfer engine that pushes KV prefill→decode layer by layer, overlapping transfer with compute. The extra TTFT from KV transfer is 2–4× lower than with NIXL on H20/Qwen3-8B.
- Query leases (#284, #288) — Pin refcounts replaced by lease-backed query/load/release; query results are Loading/Ready only, with TTL-based reclaim.
- Save-only mode (#300) — New pegaflow.mode lets an instance populate the cache without serving reads.
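The lease lifecycle has three parts: a query grants a lease, a release ends it, and a TTL reclaims abandoned leases. It can be sketched as a small table. This minimal illustration rests on assumptions. `LeaseTable`, `grant`, `release`, and `reclaim_expired` are hypothetical names, and leases are keyed by a counter. `FailedPrecondition` mirrors the status the release RPC returns for unknown or expired leases (#289).

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Mirrors the status reported for bad release requests (#289).
#[derive(Debug, PartialEq)]
enum LeaseError {
    FailedPrecondition(String),
}

/// Hypothetical lease table: lease id -> expiry instant.
struct LeaseTable {
    ttl: Duration,
    next_id: u64,
    leases: HashMap<u64, Instant>,
}

impl LeaseTable {
    fn new(ttl: Duration) -> Self {
        Self { ttl, next_id: 0, leases: HashMap::new() }
    }

    /// Called when a query returns Loading/Ready: grant a lease that expires after `ttl`.
    fn grant(&mut self, now: Instant) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.leases.insert(id, now + self.ttl);
        id
    }

    /// Ends a lease; unknown or already-expired ids are rejected.
    fn release(&mut self, id: u64, now: Instant) -> Result<(), LeaseError> {
        match self.leases.remove(&id) {
            Some(expiry) if now < expiry => Ok(()),
            Some(_) => Err(LeaseError::FailedPrecondition(format!("lease {id} expired"))),
            None => Err(LeaseError::FailedPrecondition(format!("unknown lease {id}"))),
        }
    }

    /// TTL-based reclaim of leases whose holder never released them.
    fn reclaim_expired(&mut self, now: Instant) -> usize {
        let before = self.leases.len();
        self.leases.retain(|_, expiry| *expiry > now);
        before - self.leases.len()
    }
}
```

Passing `now` explicitly keeps the sketch deterministic to test. A real lease would also keep its blocks from being evicted until release or expiry.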
🚀 Features
- Sharded SSD cache across multiple files (#299)
- Per-peer N QPs with WQE-level round-robin, --qps-per-peer (default 2) (#291)
- Metaserver node-lifecycle fencing with heartbeat UUIDs, --node-stale-secs (#285)
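WQE-level round-robin means the queue pair is chosen per posted work request rather than per transfer. This is a hypothetical sketch. `PeerQps` and `pick_qp` are invented names, and QPs are represented only by assumed QP numbers. A real implementation would post to actual verbs queue-pair handles.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical per-peer set of queue pairs (identified here by QP number).
struct PeerQps {
    qp_ids: Vec<u32>,
    next: AtomicUsize,
}

impl PeerQps {
    fn new(qps_per_peer: usize, base_qpn: u32) -> Self {
        assert!(qps_per_peer > 0, "--qps-per-peer must be at least 1");
        Self {
            qp_ids: (0..qps_per_peer as u32).map(|i| base_qpn + i).collect(),
            next: AtomicUsize::new(0),
        }
    }

    /// Picks the QP for the next WQE, cycling through all of the peer's QPs.
    fn pick_qp(&self) -> u32 {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.qp_ids.len();
        self.qp_ids[i]
    }
}
```

With the default `--qps-per-peer` of 2, consecutive WQEs alternate between the two QPs. A relaxed atomic counter is enough because only the spread across QPs matters, not a strict cross-thread order.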
🐛 Fixes
- Preserve non-MLA KV layout registration, e.g. GLM-4.7-FP8 (#295)
- Allocate pinned pools on GPU-local NUMA nodes (#293)
- Handle split physical KV blocks for FlashMLA (#292)
- Allow query lease consume once per worker (multi-worker loads) (#288)
- Validate --nics; fail on RDMA init error instead of silently disabling P2P (#283)
- Remove scheduler save limit (#282); demote cache_lookup_reuse log to debug (#280)
⚡ Performance
- CPU-path benchmarks + long-block save optimizations (#290): query 12.3 → 6.1 ms, save 21.3 → 13.1 ms
- Query API is now Loading/Ready only; pin/unpin semantics removed (#284)
- Release RPC returns FailedPrecondition for unknown/expired leases (#289)
- --nics rejects empty entries and fails on RDMA init error (#283)
- New flags: --qps-per-peer, --node-stale-secs; new config pegaflow.mode
- TinyLFU admission is now off unless explicitly enabled (#287)
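The stricter `--nics` handling in #283 can be illustrated with a small parser that rejects empty entries instead of skipping them. `parse_nics` is hypothetical, and the comma separator is an assumption about the flag's format.

```rust
/// Hypothetical `--nics` parser: split on commas (assumed separator) and
/// fail on any empty entry instead of silently dropping it.
fn parse_nics(arg: &str) -> Result<Vec<String>, String> {
    let mut nics = Vec::new();
    for (i, entry) in arg.split(',').enumerate() {
        let name = entry.trim();
        if name.is_empty() {
            return Err(format!("--nics entry {i} is empty in {arg:?}"));
        }
        nics.push(name.to_string());
    }
    Ok(nics)
}
```

Failing on RDMA init errors follows the same idea. The error is returned to the caller instead of being logged while the process continues without P2P.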
v0.22.3
v0.22.2
Release v0.22.2
v0.22.1
Release v0.22.1