# Support restarting pegaflow-server while vLLM keeps running (#273)

## Problem

PegaFlow should support restarting `pegaflow-server` while the vLLM process continues running.

Today the vLLM connector creates an `EngineRpcClient` during connector initialization, opens a long-lived session stream, and registers CUDA KV cache contexts against the server. If the server restarts, the server-side in-memory state is lost and the existing client/session/registration state may no longer be valid.

Relevant current code paths:
- `python/src/lib.rs` eagerly connects `EngineRpcClient`, reuses a tonic client/channel, and holds the `Session` stream in `start_session_watcher`.
- `python/pegaflow/connector/__init__.py` starts the session watcher once during scheduler connector initialization.
- `python/pegaflow/connector/state_manager.py` can mark the service unavailable and health-check it, but its current scope is mainly scheduler query fallback.
- `python/pegaflow/connector/worker.py` registers CUDA KV cache tensors and performs load/save RPCs using the existing client.

## Requested feature

Make the connector recover automatically after a `pegaflow-server` restart without requiring a vLLM restart.

Required behavior:

- Detect server disconnects and failed RPCs as service unavailability.
- Reconnect or recreate the underlying engine client/channel after the server becomes healthy again.
- Re-open the scheduler `Session` stream after reconnecting.
- Re-register worker KV cache contexts after reconnecting, because the restarted server loses its CUDA IPC registry and engine state.
- During downtime, degrade predictably:
  - scheduler cache queries should behave as misses;
  - load/save paths should fail fast with an error instead of hanging;
  - connector state should stay internally consistent.
- After recovery, new requests should be able to query, save, and load through PegaFlow again while the same vLLM process keeps running.
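One possible shape for the required behavior above is sketched below. This is an illustration under stated assumptions, not the existing connector code: `RecoveringClient`, `ServiceUnavailable`, the `connect` factory, and the `query`/`load` method names on the wrapped client are all hypothetical stand-ins for `EngineRpcClient` and its RPCs. The sketch rebuilds the client after a failure, replays stored registrations on every reconnect, returns a miss for scheduler queries during downtime, and raises immediately on the load path.

```python
import threading
import time
from typing import Any, Callable, List, Optional


class ServiceUnavailable(Exception):
    """Hypothetical error: the server cannot be reached or an RPC failed."""


class RecoveringClient:
    """Rebuilds the underlying client and replays registrations after a restart."""

    def __init__(self, connect: Callable[[], Any], probe_interval_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._connect = connect              # hypothetical factory for a fresh client
        self._probe_interval_s = probe_interval_s
        self._clock = clock
        self._lock = threading.Lock()
        self._client: Optional[Any] = None   # None means "currently unavailable"
        self._next_probe = 0.0               # earliest time of the next reconnect attempt
        self._registrations: List[Callable[[Any], None]] = []
        self.generation = 0                  # bumps on every successful (re)connect

    def register(self, replay: Callable[[Any], None]) -> None:
        """Remember a registration so it can be replayed after each reconnect."""
        with self._lock:
            self._registrations.append(replay)
            if self._client is not None:
                try:
                    replay(self._client)
                except ServiceUnavailable:
                    self._mark_down()

    def _mark_down(self) -> None:
        # Drop the stale client; the next call after the probe interval reconnects.
        self._client = None
        self._next_probe = self._clock() + self._probe_interval_s

    def _ensure_client(self) -> Optional[Any]:
        # Caller must hold self._lock.
        if self._client is not None:
            return self._client
        now = self._clock()
        if now < self._next_probe:
            return None                      # still backing off, do not probe yet
        self._next_probe = now + self._probe_interval_s
        try:
            client = self._connect()
            for replay in self._registrations:
                replay(client)               # the restarted server lost its registry
        except ServiceUnavailable:
            return None
        self._client = client
        self.generation += 1
        return client

    def query(self, key: str) -> int:
        """Scheduler path: downtime behaves as a cache miss (0 hits)."""
        with self._lock:
            client = self._ensure_client()
            if client is None:
                return 0
            try:
                return client.query(key)
            except ServiceUnavailable:
                self._mark_down()
                return 0

    def load(self, key: str) -> Any:
        """Worker path: raise immediately instead of waiting for the server."""
        with self._lock:
            client = self._ensure_client()
            if client is None:
                raise ServiceUnavailable("pegaflow-server is unavailable")
            try:
                return client.load(key)
            except ServiceUnavailable:
                self._mark_down()
                raise
```

The sketch holds one lock across each call for simplicity. A real implementation would also need per-RPC deadlines so that a half-open connection cannot hang the load/save paths, and it would reopen the `Session` stream at the same point where registrations are replayed.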

## Expected behavior

- Operators can restart or roll `pegaflow-server` without restarting vLLM.
- Old in-memory cache contents may be lost unless backed by durable storage, but the connector should recover to a correct empty-cache or rebuilt-cache state.
- Recovery should be observable through logs and metrics.
- Add tests that simulate server unavailability and recovery across scheduler and worker paths.
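The observability and testing points above could be approached along the following lines. This is only a sketch: `AvailabilityTracker`, `FakeServer`, and the logger name are hypothetical and do not exist in the repository. It logs and counts each up/down transition once, and uses an in-memory fake whose `stop()` drops registrations the way a real restart would.

```python
import logging
import unittest

# Hypothetical logger name, chosen for illustration only.
logger = logging.getLogger("pegaflow.connector.recovery")


class AvailabilityTracker:
    """Records up/down transitions so recovery shows up in logs and counters."""

    def __init__(self) -> None:
        self.available = True
        self.disconnects = 0
        self.recoveries = 0

    def report(self, healthy: bool) -> None:
        if healthy == self.available:
            return  # no transition, so nothing to log or count
        self.available = healthy
        if healthy:
            self.recoveries += 1
            logger.info("pegaflow-server recovered (recoveries=%d)", self.recoveries)
        else:
            self.disconnects += 1
            logger.warning("pegaflow-server unavailable (disconnects=%d)", self.disconnects)


class FakeServer:
    """In-memory stand-in for the server; stopping it loses all registrations."""

    def __init__(self) -> None:
        self.up = True
        self.registered = set()

    def stop(self) -> None:
        self.up = False
        self.registered.clear()

    def start(self) -> None:
        self.up = True

    def health(self) -> bool:
        return self.up


class RecoveryObservabilityTest(unittest.TestCase):
    def test_restart_is_logged_and_counted(self) -> None:
        server, tracker = FakeServer(), AvailabilityTracker()
        server.registered.add("worker-0")
        with self.assertLogs(logger, level="INFO") as captured:
            server.stop()
            tracker.report(server.health())
            tracker.report(server.health())  # repeated failure: no duplicate log
            server.start()
            tracker.report(server.health())
        self.assertEqual((tracker.disconnects, tracker.recoveries), (1, 1))
        self.assertEqual(len(captured.records), 2)
        # The restarted server has an empty registry, so workers must re-register.
        self.assertEqual(server.registered, set())
```

Saved to a file, the test can be run with `python -m unittest`. A fuller suite would drive the real scheduler and worker connector paths against such a fake rather than the tracker alone.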

