Problem
The vLLM E2E gate depends on two artifacts that must come from the same source tree, but nothing checks this before spending minutes on startup:
- target/release/pegaflow-server: picked up by python/tests/conftest.py::find_server_binary (the installed-package lookup typically fails with ImportError in dev venvs, so the cargo target binary wins).
- The pegaflow extension imported by the spawned vllm serve process: whatever is installed in the active venv, which may be a stale wheel.
When they diverge, the run fails ~3 minutes in, at vLLM server startup, and the actual cause is buried in a log file:
RuntimeError: vLLM server exited with code 1, see /tmp/pytest-of-.../e2e_logs0/pegaflow.log
with the real error only inside that log:
pegaflow.PegaFlowError: register_context_batch RPC failed: code: 'The system is not in
a state required for the operation's execution', message: "PegaFlow version mismatch:
client=0.22.5 server=0.22.8"
Hit on 2026-06-10: rebuilt the server binary from a feature branch (0.22.8) while the dev venv still had a 0.22.5 wheel installed. All 4 E2E tests errored at setup; recovery required knowing to run maturin develop into the right venv (the root repo venv that vllm serve uses, not the python/.venv that pytest itself runs in — an extra trap).
Proposed fix
Preflight check in the E2E fixtures, before spawning anything:
- Resolve the server binary (existing find_server_binary() logic) and get its version, e.g. via pegaflow-server --version.
- Compare against the client extension version (pegaflow.__version__ or equivalent) as resolved by the interpreter that will run vllm serve, not the pytest interpreter.
- On mismatch, fail immediately with an actionable message, e.g.:
PegaFlow version mismatch before E2E startup:
server binary : 0.22.8 (target/release/pegaflow-server)
client wheel : 0.22.5 (/data/.../.venv/lib/python3.13/site-packages/pegaflow)
Rebuild the client into the venv vLLM uses:
cd python && maturin develop --uv --no-default-features --features cuda-13,rdma
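The preflight steps above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it assumes pegaflow-server accepts --version and prints a dotted X.Y.Z version on stdout, and that the extension exposes pegaflow.__version__ (the issue itself says "or equivalent"). The helper names extract_version, run_and_extract, mismatch_message, and assert_versions_match are hypothetical, as is the vllm_python parameter, which stands for the interpreter that will run vllm serve.

```python
import re
import subprocess

# Matches the first dotted X.Y.Z token, e.g. "0.22.8" in "pegaflow-server 0.22.8".
_VERSION_RE = re.compile(r"\d+\.\d+\.\d+")


def extract_version(text: str) -> str:
    """Return the first X.Y.Z token found in text, or raise ValueError."""
    match = _VERSION_RE.search(text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return match.group(0)


def run_and_extract(cmd: list[str]) -> str:
    """Run cmd and extract a version string from its stdout."""
    result = subprocess.run(
        cmd, capture_output=True, text=True, check=True, timeout=30
    )
    return extract_version(result.stdout)


def mismatch_message(server: str, server_bin: str, client: str, vllm_python: str) -> str:
    """Build an actionable error message in the spirit of the example above."""
    return (
        "PegaFlow version mismatch before E2E startup:\n"
        f"  server binary : {server} ({server_bin})\n"
        f"  client wheel  : {client} (as seen by {vllm_python})\n"
        "Rebuild the client into the venv vLLM uses, e.g. with maturin develop."
    )


def assert_versions_match(server_bin: str, vllm_python: str) -> None:
    """Fail fast if the server binary and the client extension disagree."""
    # Assumption: the server binary supports a --version flag.
    server = run_and_extract([server_bin, "--version"])
    # Assumption: the extension exposes __version__. The query runs in the
    # interpreter that will launch vllm serve, not the pytest interpreter.
    client = run_and_extract(
        [vllm_python, "-c", "import pegaflow; print(pegaflow.__version__)"]
    )
    if server != client:
        raise RuntimeError(mismatch_message(server, server_bin, client, vllm_python))
```

A session-scoped fixture could call assert_versions_match(...) before any process is spawned, so a mismatch costs seconds rather than minutes.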
Optional hardening, independent of the preflight:
- VLLMServer._wait_for_ready already knows the log path on failure; tail the last N lines (or grep for PegaFlowError|panicked) into the raised RuntimeError so setup failures are self-explanatory without opening the log.
- Print the chosen server binary path + version in the pytest header so it's visible which artifact the run is actually exercising.
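As a rough sketch of the log-tailing idea, the snippet below summarizes a log by preferring lines that match PegaFlowError|panicked and falling back to the last N lines when nothing matches. The names summarize_lines and startup_error are hypothetical, and how this would be wired into VLLMServer._wait_for_ready is left open, since that method's internals are not shown here.

```python
import re
from collections import deque

# Patterns named in the proposal; extend as needed.
ERROR_RE = re.compile(r"PegaFlowError|panicked")


def summarize_lines(lines, max_lines=20):
    """Return matching error lines if any, else the last max_lines lines."""
    tail = deque(maxlen=max_lines)  # keeps only the most recent lines
    hits = []
    for raw in lines:
        line = raw.rstrip("\n")
        tail.append(line)
        if ERROR_RE.search(line):
            hits.append(line)
    return "\n".join(hits if hits else tail)


def startup_error(exit_code, log_path, max_lines=20):
    """Build a RuntimeError that embeds the relevant part of the log."""
    # errors="replace" avoids a secondary failure on undecodable bytes.
    with open(log_path, errors="replace") as fh:
        summary = summarize_lines(fh, max_lines)
    return RuntimeError(
        f"vLLM server exited with code {exit_code}, see {log_path}\n"
        f"--- log excerpt ---\n{summary}"
    )
```

The failure path would then raise startup_error(code, log_path) in place of the bare message, so the version-mismatch text appears directly in the pytest output.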