Problem
PegaFlow should support restarting pegaflow-server while the vLLM process continues running.
Today the vLLM connector creates an EngineRpcClient during connector initialization, opens a long-lived session stream, and registers CUDA KV cache contexts against the server. If the server restarts, the server-side in-memory state is lost and the existing client/session/registration state may no longer be valid.
Relevant current code paths:
- python/src/lib.rs eagerly connects EngineRpcClient, reuses a tonic client/channel, and holds the Session stream in start_session_watcher.
- python/pegaflow/connector/__init__.py starts the session watcher once during scheduler connector initialization.
- python/pegaflow/connector/state_manager.py can mark the service unavailable and health-check it, but its current scope is mainly scheduler query fallback.
- python/pegaflow/connector/worker.py registers CUDA KV cache tensors and performs load/save RPCs using the existing client.
Requested feature
Make the connector recover automatically after a pegaflow-server restart without requiring a vLLM restart.
Required behavior:
- Detect server disconnects and failed RPCs as service unavailability.
- Reconnect or recreate the underlying engine client/channel after the server becomes healthy again.
- Re-open the scheduler Session stream after reconnecting.
- Re-register worker KV cache contexts after reconnecting, because the restarted server loses its CUDA IPC registry and engine state.
- During downtime, degrade predictably:
  - scheduler cache queries should behave as misses;
  - load/save paths should fail without hanging;
  - connector state should stay internally consistent.
- After recovery, new requests should be able to query, save, and load through PegaFlow again while the same vLLM process keeps running.
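The required behavior above can be sketched as a small wrapper that drops its client on the first failed RPC and rebuilds everything (client, session, KV cache registration) once the server is healthy again. This is a minimal illustration, not PegaFlow's implementation: FakeServer, FakeClient, RecoveringConnector, ServiceUnavailable, and every method name are hypothetical, and the use of OSError as the transport failure type is an assumption (a real gRPC client would surface its own error type).

```python
import threading
from typing import Callable, Optional


class ServiceUnavailable(Exception):
    """Raised to callers right away instead of blocking while the server is down."""


class FakeServer:
    """In-memory stand-in for pegaflow-server; all state is lost on restart."""

    def __init__(self) -> None:
        self.up, self.epoch = True, 0
        self.registered: list = []  # stands in for the CUDA IPC registry
        self.cache: set = set()

    def stop(self) -> None:
        self.up = False

    def restart(self) -> None:
        self.registered, self.cache = [], set()
        self.up, self.epoch = True, self.epoch + 1


class FakeClient:
    """Hypothetical client surface; the real EngineRpcClient API may differ."""

    def __init__(self, server: FakeServer) -> None:
        self._server, self._epoch = server, server.epoch

    def _rpc(self) -> FakeServer:
        # A client created before a restart is stale, even if the server is up.
        if not self._server.up or self._server.epoch != self._epoch:
            raise ConnectionError("server unreachable or restarted")
        return self._server

    def health(self) -> bool:
        return self._server.up

    def open_session(self) -> None:
        self._rpc()

    def register_kv_caches(self, contexts: list) -> None:
        self._rpc().registered = list(contexts)

    def save(self, key: str) -> None:
        self._rpc().cache.add(key)

    def query(self, key: str) -> bool:
        return key in self._rpc().cache


class RecoveringConnector:
    def __init__(self, factory: Callable[[], FakeClient], kv_contexts: list) -> None:
        self._factory = factory          # builds a fresh client/channel
        self._kv_contexts = kv_contexts  # kept so they can be re-registered
        self._lock = threading.Lock()
        self._client: Optional[FakeClient] = None
        self.try_recover()

    def try_recover(self) -> bool:
        """Meant to be polled by a health-check loop; True once fully reconnected."""
        with self._lock:
            if self._client is not None:
                return True
            try:
                client = self._factory()
                if not client.health():
                    return False
                client.open_session()                         # re-open the session
                client.register_kv_caches(self._kv_contexts)  # re-register contexts
            except OSError:  # ConnectionError is a subclass of OSError
                return False
            self._client = client  # publish only after every step succeeded
            return True

    def _call(self, fn):
        with self._lock:
            client = self._client
        if client is None:
            raise ServiceUnavailable("pegaflow-server is down")
        try:
            return fn(client)
        except OSError as exc:
            with self._lock:
                if self._client is client:  # do not clobber a newer client
                    self._client = None
            raise ServiceUnavailable(str(exc)) from exc

    def query(self, key: str) -> bool:
        try:
            return self._call(lambda c: c.query(key))
        except ServiceUnavailable:
            return False  # scheduler path: downtime looks like a cache miss

    def save(self, key: str) -> None:
        self._call(lambda c: c.save(key))  # worker path: fail fast, caller decides
```

Two choices in this sketch are worth noting. The new client is published only after the session and the registrations have both succeeded, so a half-recovered client is never visible to callers. The sketch also does not model RPC deadlines; in a real client, "fail without hanging" would additionally require a timeout on each call.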
Expected behavior
- Operators can restart or roll pegaflow-server without restarting vLLM.
- Old in-memory cache contents may be lost unless backed by durable storage, but the connector should recover to a correct empty-cache or rebuilt-cache state.
- Recovery should be observable through logs and metrics.
- Add tests that simulate server unavailability and recovery across scheduler and worker paths.
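One way the observability and test expectations could fit together is a tracker that logs and counts only on state transitions, with a unit test that checks both. This is a sketch under stated assumptions: AvailabilityTracker, its counters, and the logger name are hypothetical and do not describe the existing state_manager.py; the counters stand in for whatever metrics backend the project uses.

```python
import logging
import unittest

# Hypothetical logger name, chosen for illustration only.
logger = logging.getLogger("pegaflow.connector.recovery")


class AvailabilityTracker:
    """Records up/down transitions so recovery shows up in logs and counters."""

    def __init__(self) -> None:
        self.available = True
        self.disconnects = 0  # would map to a metrics counter
        self.recoveries = 0   # would map to a metrics counter

    def mark_down(self, reason: str) -> None:
        # Log on the transition only, not once per failed RPC.
        if self.available:
            self.available = False
            self.disconnects += 1
            logger.warning("pegaflow-server unavailable: %s", reason)

    def mark_up(self) -> None:
        if not self.available:
            self.available = True
            self.recoveries += 1
            logger.info("pegaflow-server recovered (recovery #%d)", self.recoveries)


class TrackerTest(unittest.TestCase):
    def test_down_then_up_is_logged_once_each(self) -> None:
        tracker = AvailabilityTracker()
        with self.assertLogs(logger, level="INFO") as captured:
            tracker.mark_down("rpc failed")
            tracker.mark_down("rpc failed again")  # duplicate, must not log
            tracker.mark_up()
        self.assertEqual(len(captured.records), 2)
        self.assertEqual((tracker.disconnects, tracker.recoveries), (1, 1))
        self.assertTrue(tracker.available)
```

Tests for the scheduler and worker paths could follow the same pattern: drive a fake server down and up, then assert on both the connector's behavior and the emitted log records.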