E2E harness: fail fast on client/server version mismatch instead of a buried log 3 minutes in #338

Description

@xiaguan

Problem

The vLLM E2E gate depends on two artifacts that must come from the same source tree, but nothing checks this before spending minutes on startup:

  1. target/release/pegaflow-server — picked up by python/tests/conftest.py::find_server_binary (the installed-package lookup typically fails with ImportError in dev venvs, so the cargo target binary wins).
  2. The pegaflow extension imported by the spawned vllm serve process — whatever is installed in the active venv, which may be a stale wheel.

When they diverge, the run fails ~3 minutes in, at vLLM server startup, and the actual cause is buried in a log file:

RuntimeError: vLLM server exited with code 1, see /tmp/pytest-of-.../e2e_logs0/pegaflow.log

with the real error only inside that log:

pegaflow.PegaFlowError: register_context_batch RPC failed: code: 'The system is not in
a state required for the operation's execution', message: "PegaFlow version mismatch:
client=0.22.5 server=0.22.8"

Hit on 2026-06-10: the server binary had been rebuilt from a feature branch (0.22.8) while the dev venv still had a 0.22.5 wheel installed. All 4 E2E tests errored at setup. Recovery required knowing to run maturin develop into the right venv: the root repo venv that vllm serve uses, not the python/.venv that pytest itself runs in, which is an extra trap.

Proposed fix

Preflight check in the E2E fixtures, before spawning anything:

  1. Resolve the server binary (existing find_server_binary() logic) and get its version, e.g. via pegaflow-server --version.
  2. Compare against the client extension version (pegaflow.__version__ or equivalent) as resolved by the interpreter that will run vllm serve, not the pytest interpreter.
  3. On mismatch, fail immediately with an actionable message, e.g.:
PegaFlow version mismatch before E2E startup:
  server binary : 0.22.8  (target/release/pegaflow-server)
  client wheel  : 0.22.5  (/data/.../.venv/lib/python3.13/site-packages/pegaflow)
Rebuild the client into the venv vLLM uses:
  cd python && maturin develop --uv --no-default-features --features cuda-13,rdma
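
The three preflight steps could be sketched roughly as below. This is a minimal sketch under assumptions, not the harness's actual code: it assumes pegaflow-server accepts --version and prints the version as the last token of its output, and that the extension exposes pegaflow.__version__ (the issue only says "or equivalent"). The helper names parse_version, server_version, client_version and mismatch_message are hypothetical.

```python
from __future__ import annotations

import subprocess


def parse_version(text: str) -> str:
    """Return the last whitespace-separated token of a --version style output."""
    tokens = text.strip().split()
    if not tokens:
        raise ValueError("empty version output")
    return tokens[-1]


def server_version(binary: str) -> str:
    # Assumption: the binary supports --version and writes it to stdout.
    out = subprocess.run(
        [binary, "--version"],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return parse_version(out.stdout)


def client_version(python_exe: str) -> tuple[str, str]:
    """Ask the interpreter that will run `vllm serve`, not the pytest one."""
    # Assumption: the extension exposes pegaflow.__version__.
    code = (
        "import os, pegaflow;"
        "print(pegaflow.__version__);"
        "print(os.path.dirname(pegaflow.__file__))"
    )
    out = subprocess.run(
        [python_exe, "-c", code],
        capture_output=True, text=True, check=True, timeout=30,
    )
    version, location = out.stdout.strip().splitlines()[:2]
    return version, location


def mismatch_message(
    server_ver: str, server_path: str, client_ver: str, client_path: str
) -> str | None:
    """Return an actionable error message, or None when the versions agree."""
    if server_ver == client_ver:
        return None
    return (
        "PegaFlow version mismatch before E2E startup:\n"
        f"  server binary : {server_ver}  ({server_path})\n"
        f"  client wheel  : {client_ver}  ({client_path})\n"
        "Rebuild the client into the venv vLLM uses:\n"
        "  cd python && maturin develop --uv --no-default-features "
        "--features cuda-13,rdma"
    )
```

A session-scoped fixture could call server_version() on the path returned by find_server_binary(), call client_version() with the interpreter that launches vllm serve, and pass any non-None message to pytest.fail() before anything is spawned.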

Optional hardening, independent of the preflight:

  • VLLMServer._wait_for_ready already knows the log path on failure; tail the last N lines (or grep for PegaFlowError|panicked) into the raised RuntimeError so setup failures are self-explanatory without opening the log.
  • Print the chosen server binary path + version in the pytest header so it's visible which artifact the run is actually exercising.
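
For the first hardening item, a log summary helper might look something like the following. It is only a sketch: summarize_log is a hypothetical name, the default of 20 tail lines is arbitrary, and how it would hook into VLLMServer._wait_for_ready is an assumption, since that method's body is not shown here.

```python
import re
from pathlib import Path

# Patterns named in the issue as worth surfacing from the server log.
ERROR_PATTERN = re.compile(r"PegaFlowError|panicked")


def summarize_log(log_path, tail_lines=20):
    """Return matching error lines plus the last `tail_lines` lines of a log."""
    path = Path(log_path)
    if not path.is_file():
        return f"(log not found: {path})"
    # errors="replace" keeps the helper from raising on stray non-UTF-8 bytes.
    with path.open(encoding="utf-8", errors="replace") as fh:
        lines = fh.read().splitlines()
    matches = [line for line in lines if ERROR_PATTERN.search(line)]
    tail = lines[-tail_lines:] if tail_lines > 0 else []
    parts = []
    if matches:
        parts.append("Matching error lines:")
        parts.extend("  " + line for line in matches)
    parts.append(f"Last {len(tail)} log lines:")
    parts.extend("  " + line for line in tail)
    return "\n".join(parts)
```

The failure path could then append summarize_log(log_path) to the RuntimeError text, so the version mismatch shows up directly in the pytest output instead of only in pegaflow.log.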
