[connector/sglang] fix cross-rank collective hang on manager API failures#102

Merged
wangxiyu191 merged 2 commits into main from fix/sglang-connector-cross-rank-hang
Apr 8, 2026

Conversation

@wangxiyu191
Collaborator

Summary

  • Fix cross-rank NCCL/gloo hang: When manager gRPC APIs (start_write_cache, SaveKvCaches, finish_write_cache) throw exceptions on rank 0, collective operations (broadcast/all_reduce) were skipped, leaving other TP ranks hanging forever. Wrap each API call in try-except to ensure collectives always execute consistently across all ranks.
  • Add fault injection tests (FI-1 ~ FI-7): Leverage the manager's debug service to inject faults into StartWriteCache, GetCacheLocation, FinishWriteCache at runtime, verifying the connector returns graceful errors instead of crashing.
  • Add multi-rank tests (MR-1 ~ MR-5): Use torch.multiprocessing.Process with gloo backend to simulate 2 TP ranks on a single GPU, verifying collective operations remain consistent even under fault injection. Includes process-based timeout detection to catch gloo hangs.
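
The fix's core pattern can be sketched in a few lines, using threads and a `Barrier` as a stand-in for the NCCL/gloo collective (all names below are hypothetical illustrations, not the connector's actual code):

```python
import threading

class FlakyManager:
    """Hypothetical stand-in for the manager gRPC client."""
    def start_write_cache(self):
        raise RuntimeError("manager unavailable")

def worker(rank, barrier, manager, shared, results):
    if rank == 0:
        try:
            manager.start_write_cache()
            shared["ok"] = True
        except Exception:
            # Publish a failure result instead of skipping the collective,
            # so the other ranks are never left waiting.
            shared["ok"] = False
    # The barrier models broadcast/all_reduce: every rank must reach it.
    barrier.wait(timeout=5)
    results[rank] = shared["ok"]

barrier = threading.Barrier(2)
shared, results = {}, {}
threads = [
    threading.Thread(target=worker, args=(r, barrier, FlakyManager(), shared, results))
    for r in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Both ranks agree on failure instead of rank 1 hanging forever.
```

Without the try-except, rank 0 would raise before reaching the barrier and rank 1 would block at `barrier.wait` indefinitely, which is the analogue of the gloo hang fixed here.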

Test plan

  • All basic tests pass (single-rank set/get/exists)

  • FI-1 ~ FI-7 fault injection tests pass

  • MR-1 ~ MR-5 multi-rank tests pass

  • Bug reproduction verified: temporarily reverting the fix causes MR-3 to time out (detecting the hang); restoring the fix resolves it

    🤖 Generated with [Qoder](https://qoder.com)


@qoderai qoderai bot left a comment


👋 Review Summary

Nice work tracking down and fixing this cross-rank hang -- it's the kind of bug that's easy to miss and extremely painful in production. The approach of wrapping manager API calls in try-except to guarantee collective consistency is the right one, and the multi-rank test harness with timeout-based hang detection is a solid addition.

🛡️ Key Risks & Issues

Cross-rank return-value divergence on finish_write_cache failure

The one material concern is the flag = False override at connector.py:391, which happens after all_reduce has already synchronized the data-transfer result across ranks. When finish_write_cache throws on rank 0, rank 0 reports failure while other ranks report success. MR-4 acknowledges this, but the downstream implication (rank 1's caller believes cache write succeeded, yet the manager never committed it) could confuse higher-level caching logic. See inline comment for two options.

This is non-blocking -- the PR's primary goal (preventing hangs) is achieved, and the inconsistency only manifests on the error path -- but it's worth a conscious decision on whether to address it in this PR or a follow-up.
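
The divergence can be modeled in a few lines (a deliberate simplification of the flow described in the PR; `batch_set_outcome` and its parameters are hypothetical):

```python
def batch_set_outcome(rank, world_size=2, transfer_ok=True, finish_ok=True):
    """Models the per-rank return value of the write path.

    all_reduce(MIN) over the data-transfer flags is modeled as min()
    across ranks; finish_write_cache runs on rank 0 only, *after* the
    reduction has already synchronized the result.
    """
    flag = min([transfer_ok] * world_size)  # post-all_reduce value, identical on all ranks
    if rank == 0 and not finish_ok:
        flag = False  # rank-0-only override after the collective -> divergence
    return flag

# finish_write_cache fails: rank 0 reports False while rank 1 reports True
outcomes = [batch_set_outcome(r, finish_ok=False) for r in (0, 1)]
```

This is the inconsistency MR-4 documents: no hang occurs, but the ranks disagree about whether the cache write succeeded.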

🧪 Verification Advice

The fault injection and multi-rank tests cover the two most critical paths (start_write_cache failure → MR-3, finish_write_cache failure → MR-4). However, three of the five try-except blocks added in connector.py currently have no direct test coverage:

  • SaveKvCaches failure (lines 343-366): This is arguably the most likely production failure mode (network issues during large data transfers) and guards the all_reduce collective. Consider adding a test that monkeypatches transfer_client.SaveKvCaches to raise, verifying that flag is correctly set to False and the all_reduce completes without hanging.
  • save_indices is None path (lines 302-318): Triggered by inconsistent block_mask. Would need the manager to return malformed data -- may not be easily testable with current debug service.
  • unmatched == 0 early-return path (lines 322-336): Triggered when all blocks are already cached. Could be tested by calling batch_set_v1 twice on the same keys, then injecting a FinishWriteCache fault before the second call.
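
The first suggestion can be sketched with `unittest.mock` against a stand-in connector (the real test would patch the actual `transfer_client` object; every class and method name below is illustrative, not the connector's real API):

```python
from unittest import mock

class TransferClient:
    def SaveKvCaches(self, uris, buffers):
        return True  # normally succeeds

class Connector:
    """Minimal stand-in for the connector's save path."""
    def __init__(self):
        self.transfer_client = TransferClient()

    def save(self, uris, buffers):
        flag = True
        try:
            self.transfer_client.SaveKvCaches(uris, buffers)
        except Exception:
            flag = False  # failure recorded; the collective still runs
        # The real code feeds `flag` into all_reduce here.
        return flag

conn = Connector()
with mock.patch.object(conn.transfer_client, "SaveKvCaches",
                       side_effect=RuntimeError("network error")):
    failed = conn.save(["uri"], ["buf"])   # fault injected -> False
ok = conn.save(["uri"], ["buf"])           # patch removed -> True
```

In the actual multi-rank test, the assertion would additionally check that the `all_reduce` returned on every rank within the timeout.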

💡 Thoughts & Suggestions

  • The assert len(uris) == len(buffers) at line 352 sits inside the data-transfer try-except. If it ever fails, the error message will read "Data transfer (SaveKvCaches) failed" which masks the real cause (invariant violation). Minor, but something to be aware of when debugging.
  • The subprocess.run(f"rm -rf {manager_log_dir}/*", shell=True) pattern in test.py (line 83) uses shell=True with an environment-variable-derived path. In a test file with controlled defaults this is low risk, but switching to shutil.rmtree or shlex.quote() would be a defensive improvement.
  • The _GlooTPGroup monkey-patch in multi-rank tests is well-commented and pragmatic. Just be aware it will break if sglang changes how _TP is managed internally.
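
For the `shell=True` point, a shell-free equivalent of `rm -rf {manager_log_dir}/*` could look like this (stdlib only; `clear_log_dir` is a suggested helper, not existing code in the PR):

```python
import os
import shutil
import tempfile

def clear_log_dir(manager_log_dir: str) -> None:
    """Empty a directory's contents without invoking a shell."""
    for entry in os.listdir(manager_log_dir):
        path = os.path.join(manager_log_dir, entry)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)   # subdirectories
        else:
            os.remove(path)       # files and symlinks

# Demonstration on a throwaway directory:
tmp = tempfile.mkdtemp()
open(os.path.join(tmp, "a.log"), "w").close()
os.makedirs(os.path.join(tmp, "sub"))
clear_log_dir(tmp)
```

This sidesteps both shell injection via the environment-derived path and glob surprises (e.g. the literal `*` not expanding when the directory is empty).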

🤖 Generated by QoderView workflow run

@github-actions bot added the "ai reviewed" label (AI has reviewed this PR) on Apr 7, 2026
[connector/sglang] fix cross-rank collective hang on manager API failures

When the KV Cache Manager gRPC API (start_write_cache, SaveKvCaches,
finish_write_cache) throws an exception on rank 0, the subsequent
broadcast / all_reduce collective operations are skipped, leaving
other TP ranks hanging forever.

Fix by wrapping each API call in try-except so that:
- start_write_cache: on failure, result is set to None and the
  broadcast still executes; all ranks return [False].
- SaveKvCaches (data transfer): on failure, flag is set to False
  and the all_reduce still executes.
- finish_write_cache (3 call sites): on failure, errors are logged
  and the connector returns gracefully.

Note: finish_write_cache is rank-0-only and runs after all_reduce,
so its failure can cause rank 0 to return [False] while other ranks
return [True]. This is an accepted inconsistency documented in the
code -- adding a second all_reduce for this rare error path would
penalise the hot path.

🤖 Generated with [Qoder](https://qoder.com)
Add comprehensive test coverage for the cross-rank hang fix:

- Extract hardcoded paths (manager binary, config files, log dir) as
  environment variables for portability across environments.
- Add DebugServiceClient that talks to the manager's debug HTTP API
  to inject/remove faults at runtime.
- Add 7 fault injection tests (FI-1 ~ FI-7): StartWriteCache,
  GetCacheLocation, FinishWriteCache faults with and without prefix,
  plus recovery verification.
- Add 5 multi-rank tests (MR-1 ~ MR-5) using torch.multiprocessing
  with gloo backend to simulate 2 TP ranks on a single GPU:
  - MR-1/2: normal set/get across ranks
  - MR-3: StartWriteCache fault (the main hang scenario)
  - MR-4: FinishWriteCache fault (known inconsistency, no hang)
  - MR-5: recovery after faults
- Process-based timeout detection (join + kill) to catch gloo hangs
  reliably.
- Move distributed init from module level into __main__ to avoid
  conflicts with mp.spawn child processes.

🤖 Generated with [Qoder](https://qoder.com)
@wangxiyu191 force-pushed the fix/sglang-connector-cross-rank-hang branch from dd2909c to 7e47727 on April 7, 2026 13:09
@wangxiyu191 merged commit b793b4a into main on Apr 8, 2026
8 checks passed
@wangxiyu191 deleted the fix/sglang-connector-cross-rank-hang branch on April 8, 2026 02:33