[connector/sglang] fix cross-rank collective hang on manager API failures #102
wangxiyu191 merged 2 commits into main from
Conversation
👋 Review Summary
Nice work tracking down and fixing this cross-rank hang -- it's the kind of bug that's easy to miss and extremely painful in production. The approach of wrapping manager API calls in try-except to guarantee collective consistency is the right one, and the multi-rank test harness with timeout-based hang detection is a solid addition.
🛡️ Key Risks & Issues
Cross-rank return-value divergence on finish_write_cache failure
The one material concern is the flag = False override at connector.py:391, which happens after all_reduce has already synchronized the data-transfer result across ranks. When finish_write_cache throws on rank 0, rank 0 reports failure while other ranks report success. MR-4 acknowledges this, but the downstream implication (rank 1's caller believes cache write succeeded, yet the manager never committed it) could confuse higher-level caching logic. See inline comment for two options.
This is non-blocking -- the PR's primary goal (preventing hangs) is achieved, and the inconsistency only manifests on the error path -- but it's worth a conscious decision on whether to address it in this PR or a follow-up.
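The divergence (and the cost of the second-collective option) can be seen in a minimal single-process simulation. All names here are hypothetical stand-ins; the fake `all_reduce_min` mimics a MIN-style all_reduce, where any rank reporting failure makes every rank report failure.

```python
# Minimal single-process sketch of the divergence described above.
# `all_reduce_min` mimics a MIN all_reduce across ranks: every rank
# ends up with the same value after the collective.

def all_reduce_min(per_rank_values):
    m = min(per_rank_values)
    return [m] * len(per_rank_values)

def simulate(finish_write_cache_fails, second_all_reduce):
    world_size = 2
    # Step 1: every rank's data transfer succeeded; all_reduce synchronizes.
    flags = all_reduce_min([1] * world_size)           # -> [1, 1]
    # Step 2: finish_write_cache runs on rank 0 only, AFTER the collective.
    if finish_write_cache_fails:
        flags[0] = 0                                    # the flag = False override
    # Optional fix: a second collective re-synchronizes the final result,
    # at the cost of one extra all_reduce on every (hot-path) call.
    if second_all_reduce:
        flags = all_reduce_min(flags)
    return [bool(f) for f in flags]

print(simulate(finish_write_cache_fails=True, second_all_reduce=False))  # [False, True] -- divergent
print(simulate(finish_write_cache_fails=True, second_all_reduce=True))   # [False, False] -- consistent
```

The simulation shows why the inconsistency only appears on the error path: with no fault injected, both variants return `[True, True]`.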
🧪 Verification Advice
The fault injection and multi-rank tests cover the two most critical paths (start_write_cache failure → MR-3, finish_write_cache failure → MR-4). However, three of the five try-except blocks added in connector.py currently have no direct test coverage:
- `SaveKvCaches` failure (lines 343-366): This is arguably the most likely production failure mode (network issues during large data transfers) and guards the `all_reduce` collective. Consider adding a test that monkeypatches `transfer_client.SaveKvCaches` to raise, verifying that `flag` is correctly set to `False` and the `all_reduce` completes without hanging.
- `save_indices is None` path (lines 302-318): Triggered by an inconsistent `block_mask`. Would need the manager to return malformed data -- may not be easily testable with the current debug service.
- `unmatched == 0` early-return path (lines 322-336): Triggered when all blocks are already cached. Could be tested by calling `batch_set_v1` twice on the same keys, then injecting a FinishWriteCache fault before the second call.
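The first suggested test can be sketched as follows. `TransferClient` and `save_kv` are stand-ins for the connector's internals (the real test would monkeypatch `transfer_client.SaveKvCaches`), and the single-element `min()` stands in for the MIN-style all_reduce that must run on every path.

```python
# Sketch of the suggested SaveKvCaches fault test, using stubs so the
# control flow can be exercised without a real manager or gloo group.

class TransferClient:
    def SaveKvCaches(self, uris, buffers):
        return True  # pretend the transfer succeeded

def save_kv(client, uris, buffers):
    """Mirror the connector's pattern: the collective runs on every path."""
    try:
        client.SaveKvCaches(uris, buffers)
        flag = 1
    except Exception:
        flag = 0  # a raised exception must not skip the all_reduce
    return min([flag])  # stand-in for dist.all_reduce(flag, op=MIN)

def injected_fault(uris, buffers):
    raise RuntimeError("injected network fault")

client = TransferClient()
assert save_kv(client, ["uri"], [b"buf"]) == 1   # healthy path
client.SaveKvCaches = injected_fault             # poor man's monkeypatch
assert save_kv(client, ["uri"], [b"buf"]) == 0   # flag forced to False, no hang
print("SaveKvCaches fault test sketch: ok")
```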
💡 Thoughts & Suggestions
- The `assert len(uris) == len(buffers)` at line 352 sits inside the data-transfer try-except. If it ever fails, the error message will read "Data transfer (SaveKvCaches) failed", which masks the real cause (an invariant violation). Minor, but something to be aware of when debugging.
- The `subprocess.run(f"rm -rf {manager_log_dir}/*", shell=True)` pattern in test.py (line 83) uses `shell=True` with an environment-variable-derived path. In a test file with controlled defaults this is low risk, but switching to `shutil.rmtree` or `shlex.quote()` would be a defensive improvement.
- The `_GlooTPGroup` monkey-patch in multi-rank tests is well-commented and pragmatic. Just be aware it will break if sglang changes how `_TP` is managed internally.
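The suggested shell-free replacement for the `rm -rf` call could look like the following sketch (`manager_log_dir` as in test.py; the helper name is mine):

```python
import os
import shutil

def clean_log_dir(manager_log_dir: str) -> None:
    """Remove the *contents* of manager_log_dir without invoking a shell.

    Equivalent in spirit to `rm -rf {manager_log_dir}/*`, but immune to
    shell injection via the environment-derived path, and it keeps the
    directory itself in place.
    """
    for entry in os.listdir(manager_log_dir):
        path = os.path.join(manager_log_dir, entry)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
```

One behavioral difference worth noting: unlike the shell glob `*`, this also removes dotfiles.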
🤖 Generated by Qoder
…ures

When the KV Cache Manager gRPC API (`start_write_cache`, `SaveKvCaches`, `finish_write_cache`) throws an exception on rank 0, the subsequent broadcast / all_reduce collective operations are skipped, leaving other TP ranks hanging forever. Fix by wrapping each API call in try-except so that:

- `start_write_cache`: on failure, the result is set to None and the broadcast still executes; all ranks return [False].
- `SaveKvCaches` (data transfer): on failure, the flag is set to False and the all_reduce still executes.
- `finish_write_cache` (3 call sites): on failure, errors are logged and the connector returns gracefully.

Note: `finish_write_cache` is rank-0-only and runs after all_reduce, so its failure can cause rank 0 to return [False] while other ranks return [True]. This is an accepted inconsistency documented in the code -- adding a second all_reduce for this rare error path would penalise the hot path.

🤖 Generated with [Qoder](https://qoder.com)
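The first bullet's pattern can be sketched in a few lines. `batch_set`, `Manager`, and `broadcast_object` are hypothetical stand-ins for the connector's internals; the real code uses `torch.distributed.broadcast_object_list` with `src=0`.

```python
# Sketch of the start_write_cache fix: rank 0 calls the manager API, and the
# broadcast executes on every code path, so non-zero ranks can never hang
# waiting for a collective that rank 0 skipped.

def batch_set(rank, manager, broadcast_object, num_keys):
    if rank == 0:
        try:
            result = manager.start_write_cache()
        except Exception:
            result = None            # failure is encoded, never raised past here
    else:
        result = None                # placeholder, replaced by the broadcast
    result = broadcast_object(result)  # executes on EVERY path -> no hang
    if result is None:
        return [False] * num_keys    # all ranks agree on failure
    return [True] * num_keys         # (data transfer elided in this sketch)

class FailingManager:
    def start_write_cache(self):
        raise RuntimeError("manager gRPC fault on rank 0")

class HealthyManager:
    def start_write_cache(self):
        return "write-session-token"

def run_two_ranks(manager):
    published = {}                   # single-process stand-in for the collective
    def broadcast_object(value):
        published.setdefault("v", value)  # rank 0 publishes, rank 1 receives
        return published["v"]
    return [batch_set(rank, manager, broadcast_object, num_keys=2)
            for rank in (0, 1)]

print(run_two_ranks(FailingManager()))   # [[False, False], [False, False]]
print(run_two_ranks(HealthyManager()))   # [[True, True], [True, True]]
```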
Add comprehensive test coverage for the cross-rank hang fix:

- Extract hardcoded paths (manager binary, config files, log dir) as environment variables for portability across environments.
- Add DebugServiceClient that talks to the manager's debug HTTP API to inject/remove faults at runtime.
- Add 7 fault injection tests (FI-1 ~ FI-7): StartWriteCache, GetCacheLocation, FinishWriteCache faults with and without prefix, plus recovery verification.
- Add 5 multi-rank tests (MR-1 ~ MR-5) using torch.multiprocessing with gloo backend to simulate 2 TP ranks on a single GPU:
  - MR-1/2: normal set/get across ranks
  - MR-3: StartWriteCache fault (the main hang scenario)
  - MR-4: FinishWriteCache fault (known inconsistency, no hang)
  - MR-5: recovery after faults
- Process-based timeout detection (join + kill) to catch gloo hangs reliably.
- Move distributed init from module level into __main__ to avoid conflicts with mp.spawn child processes.

🤖 Generated with [Qoder](https://qoder.com)
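The process-based timeout detection described above amounts to a small join-then-kill helper. A runnable sketch (the helper name and return values are mine, not the test file's actual API):

```python
import multiprocessing as mp
import time

def run_with_hang_detection(target, args=(), timeout_s=30.0):
    """Run `target` in a child process; treat exceeding `timeout_s` as a hang.

    This is the join + kill pattern: gloo collectives block forever when a
    peer rank never arrives, so a wall-clock bound on the whole process is
    a reliable way to detect the hang from the outside.
    """
    proc = mp.Process(target=target, args=args)
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():           # still running => presumed hang
        proc.kill()               # SIGKILL; a stuck collective may ignore SIGTERM
        proc.join()
        return "hang"
    return "ok" if proc.exitcode == 0 else "error"

if __name__ == "__main__":
    print(run_with_hang_detection(time.sleep, args=(0,), timeout_s=5.0))   # ok
    print(run_with_hang_detection(time.sleep, args=(60,), timeout_s=0.5))  # hang
```

Using `kill()` rather than `terminate()` matters here: a rank blocked inside a native collective may not get a chance to handle SIGTERM.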
Force-pushed from dd2909c to 7e47727
Summary
- When the KV Cache Manager gRPC APIs (`start_write_cache`, `SaveKvCaches`, `finish_write_cache`) throw exceptions on rank 0, collective operations (broadcast/all_reduce) were skipped, leaving other TP ranks hanging forever. Wrap each API call in try-except to ensure collectives always execute consistently across all ranks.
- Add fault injection tests that inject `StartWriteCache`, `GetCacheLocation`, `FinishWriteCache` faults at runtime, verifying the connector returns graceful errors instead of crashing.
- Add multi-rank tests using `torch.multiprocessing.Process` with the gloo backend to simulate 2 TP ranks on a single GPU, verifying collective operations remain consistent even under fault injection. Includes process-based timeout detection to catch gloo hangs.

Test plan
All basic tests pass (single-rank set/get/exists)
FI-1 ~ FI-7 fault injection tests pass
MR-1 ~ MR-5 multi-rank tests pass
Bug reproduction verified: temporarily reverting the fix causes MR-3 to time out (detecting the hang); restoring the fix resolves it
🤖 Generated with [Qoder](https://qoder.com)