[connector/sglang] fix cross-rank collective hang on manager API failures #102
wangxiyu191 merged 2 commits into main from
Conversation
👋 Review Summary
Nice work tracking down and fixing this cross-rank hang -- it's the kind of bug that's easy to miss and extremely painful in production. The approach of wrapping manager API calls in try-except to guarantee collective consistency is the right one, and the multi-rank test harness with timeout-based hang detection is a solid addition.
🛡️ Key Risks & Issues
Cross-rank return-value divergence on finish_write_cache failure
The one material concern is the flag = False override at connector.py:391, which happens after all_reduce has already synchronized the data-transfer result across ranks. When finish_write_cache throws on rank 0, rank 0 reports failure while other ranks report success. MR-4 acknowledges this, but the downstream implication (rank 1's caller believes cache write succeeded, yet the manager never committed it) could confuse higher-level caching logic. See inline comment for two options.
This is non-blocking -- the PR's primary goal (preventing hangs) is achieved, and the inconsistency only manifests on the error path -- but it's worth a conscious decision on whether to address it in this PR or a follow-up.
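The divergence (and the cost of the second-collective option) can be seen in a minimal single-process simulation. All names here are hypothetical stand-ins; the fake `all_reduce_min` mimics a MIN-style all_reduce, where any rank reporting failure makes every rank report failure.

```python
# Minimal single-process sketch of the divergence described above.
# `all_reduce_min` mimics a MIN all_reduce across ranks: every rank
# ends up with the same value after the collective.

def all_reduce_min(per_rank_values):
    m = min(per_rank_values)
    return [m] * len(per_rank_values)

def simulate(finish_write_cache_fails, second_all_reduce):
    world_size = 2
    # Step 1: every rank's data transfer succeeded; all_reduce synchronizes.
    flags = all_reduce_min([1] * world_size)           # -> [1, 1]
    # Step 2: finish_write_cache runs on rank 0 only, AFTER the collective.
    if finish_write_cache_fails:
        flags[0] = 0                                    # the flag = False override
    # Optional fix: a second collective re-synchronizes the final result,
    # at the cost of one extra all_reduce on every (hot-path) call.
    if second_all_reduce:
        flags = all_reduce_min(flags)
    return [bool(f) for f in flags]

print(simulate(finish_write_cache_fails=True, second_all_reduce=False))  # [False, True] -- divergent
print(simulate(finish_write_cache_fails=True, second_all_reduce=True))   # [False, False] -- consistent
```

The simulation shows why the inconsistency only appears on the error path: with no fault injected, both variants return `[True, True]`.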
🧪 Verification Advice
The fault injection and multi-rank tests cover the two most critical paths (start_write_cache failure → MR-3, finish_write_cache failure → MR-4). However, three of the five try-except blocks added in connector.py currently have no direct test coverage:
- `SaveKvCaches` failure (lines 343-366): This is arguably the most likely production failure mode (network issues during large data transfers) and guards the `all_reduce` collective. Consider adding a test that monkeypatches `transfer_client.SaveKvCaches` to raise, verifying that `flag` is correctly set to `False` and the `all_reduce` completes without hanging.
- `save_indices is None` path (lines 302-318): Triggered by an inconsistent `block_mask`. Would need the manager to return malformed data -- may not be easily testable with the current debug service.
- `unmatched == 0` early-return path (lines 322-336): Triggered when all blocks are already cached. Could be tested by calling `batch_set_v1` twice on the same keys, then injecting a FinishWriteCache fault before the second call.
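The first suggested test can be sketched as follows. `TransferClient` and `save_kv` are stand-ins for the connector's internals (the real test would monkeypatch `transfer_client.SaveKvCaches`), and the single-element `min()` stands in for the MIN-style all_reduce that must run on every path.

```python
# Sketch of the suggested SaveKvCaches fault test, using stubs so the
# control flow can be exercised without a real manager or gloo group.

class TransferClient:
    def SaveKvCaches(self, uris, buffers):
        return True  # pretend the transfer succeeded

def save_kv(client, uris, buffers):
    """Mirror the connector's pattern: the collective runs on every path."""
    try:
        client.SaveKvCaches(uris, buffers)
        flag = 1
    except Exception:
        flag = 0  # a raised exception must not skip the all_reduce
    return min([flag])  # stand-in for dist.all_reduce(flag, op=MIN)

def injected_fault(uris, buffers):
    raise RuntimeError("injected network fault")

client = TransferClient()
assert save_kv(client, ["uri"], [b"buf"]) == 1   # healthy path
client.SaveKvCaches = injected_fault             # poor man's monkeypatch
assert save_kv(client, ["uri"], [b"buf"]) == 0   # flag forced to False, no hang
print("SaveKvCaches fault test sketch: ok")
```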
💡 Thoughts & Suggestions
- The `assert len(uris) == len(buffers)` at line 352 sits inside the data-transfer try-except. If it ever fails, the error message will read "Data transfer (SaveKvCaches) failed", which masks the real cause (an invariant violation). Minor, but something to be aware of when debugging.
- The `subprocess.run(f"rm -rf {manager_log_dir}/*", shell=True)` pattern in test.py (line 83) uses `shell=True` with an environment-variable-derived path. In a test file with controlled defaults this is low risk, but switching to `shutil.rmtree` or `shlex.quote()` would be a defensive improvement.
- The `_GlooTPGroup` monkey-patch in multi-rank tests is well-commented and pragmatic. Just be aware it will break if sglang changes how `_TP` is managed internally.
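The suggested shell-free replacement for the `rm -rf` call could look like the following sketch (`manager_log_dir` as in test.py; the helper name is mine):

```python
import os
import shutil

def clean_log_dir(manager_log_dir: str) -> None:
    """Remove the *contents* of manager_log_dir without invoking a shell.

    Equivalent in spirit to `rm -rf {manager_log_dir}/*`, but immune to
    shell injection via the environment-derived path, and it keeps the
    directory itself in place.
    """
    for entry in os.listdir(manager_log_dir):
        path = os.path.join(manager_log_dir, entry)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
```

One behavioral difference worth noting: unlike the shell glob `*`, this also removes dotfiles.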
🤖 Generated by Qoder
…ures

When the KV Cache Manager gRPC API (`start_write_cache`, `SaveKvCaches`, `finish_write_cache`) throws an exception on rank 0, the subsequent broadcast / all_reduce collective operations are skipped, leaving other TP ranks hanging forever. Fix by wrapping each API call in try-except so that:

- `start_write_cache`: on failure, the result is set to None and the broadcast still executes; all ranks return [False].
- `SaveKvCaches` (data transfer): on failure, the flag is set to False and the all_reduce still executes.
- `finish_write_cache` (3 call sites): on failure, errors are logged and the connector returns gracefully.

Note: `finish_write_cache` is rank-0-only and runs after all_reduce, so its failure can cause rank 0 to return [False] while other ranks return [True]. This is an accepted inconsistency documented in the code -- adding a second all_reduce for this rare error path would penalise the hot path.

🤖 Generated with [Qoder](https://qoder.com)
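The first bullet's pattern can be sketched in a few lines. `batch_set`, `Manager`, and `broadcast_object` are hypothetical stand-ins for the connector's internals; the real code uses `torch.distributed.broadcast_object_list` with `src=0`.

```python
# Sketch of the start_write_cache fix: rank 0 calls the manager API, and the
# broadcast executes on every code path, so non-zero ranks can never hang
# waiting for a collective that rank 0 skipped.

def batch_set(rank, manager, broadcast_object, num_keys):
    if rank == 0:
        try:
            result = manager.start_write_cache()
        except Exception:
            result = None            # failure is encoded, never raised past here
    else:
        result = None                # placeholder, replaced by the broadcast
    result = broadcast_object(result)  # executes on EVERY path -> no hang
    if result is None:
        return [False] * num_keys    # all ranks agree on failure
    return [True] * num_keys         # (data transfer elided in this sketch)

class FailingManager:
    def start_write_cache(self):
        raise RuntimeError("manager gRPC fault on rank 0")

class HealthyManager:
    def start_write_cache(self):
        return "write-session-token"

def run_two_ranks(manager):
    published = {}                   # single-process stand-in for the collective
    def broadcast_object(value):
        published.setdefault("v", value)  # rank 0 publishes, rank 1 receives
        return published["v"]
    return [batch_set(rank, manager, broadcast_object, num_keys=2)
            for rank in (0, 1)]

print(run_two_ranks(FailingManager()))   # [[False, False], [False, False]]
print(run_two_ranks(HealthyManager()))   # [[True, True], [True, True]]
```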
Add comprehensive test coverage for the cross-rank hang fix:

- Extract hardcoded paths (manager binary, config files, log dir) as environment variables for portability across environments.
- Add DebugServiceClient that talks to the manager's debug HTTP API to inject/remove faults at runtime.
- Add 7 fault injection tests (FI-1 ~ FI-7): StartWriteCache, GetCacheLocation, FinishWriteCache faults with and without prefix, plus recovery verification.
- Add 5 multi-rank tests (MR-1 ~ MR-5) using torch.multiprocessing with gloo backend to simulate 2 TP ranks on a single GPU:
  - MR-1/2: normal set/get across ranks
  - MR-3: StartWriteCache fault (the main hang scenario)
  - MR-4: FinishWriteCache fault (known inconsistency, no hang)
  - MR-5: recovery after faults
- Process-based timeout detection (join + kill) to catch gloo hangs reliably.
- Move distributed init from module level into __main__ to avoid conflicts with mp.spawn child processes.

🤖 Generated with [Qoder](https://qoder.com)
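The process-based timeout detection described above amounts to a small join-then-kill helper. A runnable sketch (the helper name and return values are mine, not the test file's actual API):

```python
import multiprocessing as mp
import time

def run_with_hang_detection(target, args=(), timeout_s=30.0):
    """Run `target` in a child process; treat exceeding `timeout_s` as a hang.

    This is the join + kill pattern: gloo collectives block forever when a
    peer rank never arrives, so a wall-clock bound on the whole process is
    a reliable way to detect the hang from the outside.
    """
    proc = mp.Process(target=target, args=args)
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():           # still running => presumed hang
        proc.kill()               # SIGKILL; a stuck collective may ignore SIGTERM
        proc.join()
        return "hang"
    return "ok" if proc.exitcode == 0 else "error"

if __name__ == "__main__":
    print(run_with_hang_detection(time.sleep, args=(0,), timeout_s=5.0))   # ok
    print(run_with_hang_detection(time.sleep, args=(60,), timeout_s=0.5))  # hang
```

Using `kill()` rather than `terminate()` matters here: a rank blocked inside a native collective may not get a chance to handle SIGTERM.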
Force-pushed from dd2909c to 7e47727
Summary
- When the KV Cache Manager gRPC APIs (`start_write_cache`, `SaveKvCaches`, `finish_write_cache`) throw exceptions on rank 0, collective operations (broadcast/all_reduce) were skipped, leaving other TP ranks hanging forever. Wrap each API call in try-except to ensure collectives always execute consistently across all ranks.
- Add fault injection tests that inject `StartWriteCache`, `GetCacheLocation`, `FinishWriteCache` faults at runtime, verifying the connector returns graceful errors instead of crashing.
- Add multi-rank tests using `torch.multiprocessing.Process` with the gloo backend to simulate 2 TP ranks on a single GPU, verifying collective operations remain consistent even under fault injection. Includes process-based timeout detection to catch gloo hangs.

Test plan
All basic tests pass (single-rank set/get/exists)
FI-1 ~ FI-7 fault injection tests pass
MR-1 ~ MR-5 multi-rank tests pass
Bug reproduction verified: temporarily reverting the fix causes MR-3 to time out (detecting the hang); restoring the fix resolves it
🤖 Generated with [Qoder](https://qoder.com)