Fix FP8 MoE scale patching for compressed-tensors models #551
danielhanchen wants to merge 6 commits into main
Conversation
The scale patcher was not firing for two reasons:

1. The quant_method comparison failed because transformers returns a QuantizationMethod enum, not the string "compressed-tensors". Normalize to string before comparing.
2. The probe checked only the first routed layer for scale keys, but some layers are in the quantization ignore list and have no scales. Now scans all layers and filters to those that actually have scale keys in the checkpoint.

Also adds support for sharded safetensors checkpoints via model.safetensors.index.json, so this works with both single-file and multi-shard models.

Tested on GLM-4.7-Flash-FP8-Dynamic: 43 FP8 expert layers now correctly get scales attached; the forward pass and a 21-step LoRA training run both pass with no NaNs. Normal (non-FP8) models are unaffected.
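The enum-vs-string fix described above can be sketched as follows. `QuantizationMethod` here is a minimal stand-in for the transformers enum, and `normalize_quant_method` is a hypothetical helper name, not the actual code in the PR:

```python
from enum import Enum

# Minimal stand-in for transformers' QuantizationMethod enum (illustrative).
class QuantizationMethod(str, Enum):
    COMPRESSED_TENSORS = "compressed-tensors"

def normalize_quant_method(quant_method):
    # Hypothetical helper: accept either the enum or a plain string and
    # always compare on the underlying string value.
    return str(getattr(quant_method, "value", quant_method))

# Both spellings now compare equal to "compressed-tensors", so the
# `quant_method != "compressed-tensors"` early-return no longer misfires.
assert normalize_quant_method(QuantizationMethod.COMPRESSED_TENSORS) == "compressed-tensors"
assert normalize_quant_method("compressed-tensors") == "compressed-tensors"
```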
Code Review
This pull request addresses a bug in the FP8 MoE scale patching for compressed-tensors models, which was silently failing. The changes include normalizing the quant_method, scanning all routed layers, supporting sharded checkpoints, and adding helper functions for safetensors file resolution. The code has been reviewed, and suggestions have been made to improve error handling and code clarity.
```python
if any(t is None for t in (gate, gate_scale, up, up_scale, down, down_scale)):
    return False
```
```diff
 try:
-    module_name = "unsloth_cached_moe_utils"
+    module_name = "unsloth_zoo.temporary_patches._cached_moe_utils"
 except Exception:
     return None
```
```python
except RuntimeError:
    return None
except Exception:
    pass
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 001479bec6
| ) | ||
|
|
||
|
|
||
| def forward_moe_backend_fp8(self, hidden_states, top_k_index, top_k_weights): |
There was a problem hiding this comment.
Preserve forward annotations for strict MoE patching
get_forward_moe_backend now routes to forward_moe_backend_fp8, but this function’s signature has no torch.Tensor annotations, and patch_function uses strict fingerprint matching (including annotations) by default. In practice, the unannotated replacement is rejected for annotated expert forwards like DeepseekV3NaiveMoe and Qwen3MoeExperts (patched without force=True), so those models silently skip the backend patch and lose the intended MoE/LoRA path.
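The fix the comment implies is to keep the annotations on the replacement so a strict signature fingerprint still matches. A minimal sketch, in which `fingerprint` is a simplified stand-in for patch_function's strict matching and string annotations stand in for torch.Tensor:

```python
import inspect

# Annotated expert forward, as in DeepseekV3NaiveMoe / Qwen3MoeExperts.
def original(self, hidden_states: "torch.Tensor", top_k_index: "torch.Tensor",
             top_k_weights: "torch.Tensor"):
    ...

# Unannotated replacement: rejected by strict matching.
def replacement_unannotated(self, hidden_states, top_k_index, top_k_weights):
    ...

# Annotated replacement: fingerprint matches, patch is accepted.
def replacement_annotated(self, hidden_states: "torch.Tensor", top_k_index: "torch.Tensor",
                          top_k_weights: "torch.Tensor"):
    ...

def fingerprint(fn):
    # Simplified stand-in for strict fingerprint matching: compare parameter
    # names together with their annotations.
    sig = inspect.signature(fn)
    return tuple((name, param.annotation) for name, param in sig.parameters.items())

assert fingerprint(original) != fingerprint(replacement_unannotated)
assert fingerprint(original) == fingerprint(replacement_annotated)
```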
```python
_CACHED_FORWARD_MOE_BACKEND = None
_CACHED_MOE_UTILS_MODULE = None
_CACHED_MOE_UTILS_FP8_MODULE = None
```
```python
module_name = "unsloth_zoo.temporary_patches._cached_moe_utils_fp8"
module = sys.modules.get(module_name, None)
if module is not None and os.path.abspath(getattr(module, "__file__", "")) == cache_file:
    _CACHED_MOE_UTILS_FP8_MODULE = module
```

```python
module.__package__ = "unsloth_zoo.temporary_patches"
sys.modules[module_name] = module
spec.loader.exec_module(module)
_CACHED_MOE_UTILS_FP8_MODULE = module
```
```python
return _TORCH_SCALED_GROUPED_MM_SUPPORTED
```

```python
if not _TORCH_SCALED_GROUPED_MM_AVAILABLE:
    _TORCH_SCALED_GROUPED_MM_SUPPORTED = False
    return False
if not torch.cuda.is_available():
    _TORCH_SCALED_GROUPED_MM_SUPPORTED = False
```

```python
# context. Keep the FP8 scaled_grouped_mm path off on pre-Hopper parts.
major, _minor = torch.cuda.get_device_capability(torch.cuda.current_device())
if major < 9:
    _TORCH_SCALED_GROUPED_MM_SUPPORTED = False
```
```python
is_2d_input = hidden_states.dim() == 2
if is_2d_input:
    sequence_length, hidden_dim = hidden_states.shape
    batch_size = 1
```
```python
return
try:
    tensor_like.block_size = block_size
except (AttributeError, RuntimeError):
```
```python
return forward_moe_backend_fp8(
    self, hidden_states, top_k_index, top_k_weights
)
except Exception:
```
Summary
Fixes the FP8 MoE scale patching in `moe_utils_fp8.py`, which was silently failing for all compressed-tensors FP8 MoE models (e.g., GLM-4.7-Flash-FP8-Dynamic). Based on the cleanup branch from #548.

The scale patcher (`_maybe_patch_glm4_stacked_moe_fp8_scales`) was not firing due to two bugs:

1. Enum vs string comparison: `quantization_config.quant_method` returns a `QuantizationMethod` enum in newer transformers, not the string `"compressed-tensors"`. The comparison `quant_method != "compressed-tensors"` always returned True, causing an early return.
2. Wrong layer probe: The probe for scale keys in safetensors checked only the first routed layer. Some layers (e.g., layers 1, 39, 46 in GLM-4.7-Flash) are in the quantization ignore list and have no scales. If the first routed layer happened to be ignored, the probe returned False and the entire patcher bailed out.
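The fixed probe can be sketched roughly as follows. The helper name and scale-key format are illustrative, not the actual implementation, and the layer layout assumes layers 1–46 are routed with 1, 39, and 46 in the ignore list:

```python
def layers_with_scales(routed_layers, checkpoint_keys):
    # Hypothetical helper: keep only routed layers whose scale key actually
    # exists in the checkpoint, instead of probing just the first layer.
    patchable = []
    for idx in routed_layers:
        scale_key = f"model.layers.{idx}.mlp.experts.gate_up_proj.weight_scale"
        if scale_key in checkpoint_keys:
            patchable.append(idx)
    return patchable

# Illustrative checkpoint: layers 1, 39, 46 sit in the quantization ignore
# list and therefore have no scale keys.
ignored = {1, 39, 46}
keys = {f"model.layers.{i}.mlp.experts.gate_up_proj.weight_scale"
        for i in range(1, 47) if i not in ignored}

patchable = layers_with_scales(range(1, 47), keys)
assert len(patchable) == 43          # matches the 43 FP8 expert layers
assert not ignored & set(patchable)  # ignored layers are skipped
```

With the old single-layer probe, if layer 1 happened to be checked first, the empty result would have aborted patching for all 43 valid layers.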
Additionally, the patcher only supported single-file `model.safetensors` checkpoints. Sharded checkpoints (`model-00001-of-NNNNN.safetensors`) would crash with `EntryNotFoundError`.

Changes
- Normalize `quant_method` to string before comparison (handles both enum and string)
- Add `_resolve_safetensors_files()` supporting both single-file and sharded layouts via `model.safetensors.index.json`
- Add `_open_safetensors_for_keys()` that opens only the relevant shard(s)

Test results
Tested on NVIDIA RTX PRO 6000 Blackwell (SM 12.0, 98 GB): 43 FP8 expert layers correctly get scales attached; a forward pass and a 21-step LoRA training run both complete with no NaNs.
Backwards compatibility

Normal (non-FP8) models are unaffected.
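For reference, the single-file vs. sharded resolution that `_resolve_safetensors_files()` has to cover can be sketched like this; the function body below is a guess at the general shape based on the standard `model.safetensors.index.json` layout, not the PR's actual implementation:

```python
import json
import os

def resolve_safetensors_files(checkpoint_dir):
    # Sharded layout: model.safetensors.index.json contains a "weight_map"
    # from each tensor name to the shard file that holds it.
    index_path = os.path.join(checkpoint_dir, "model.safetensors.index.json")
    if os.path.exists(index_path):
        with open(index_path) as f:
            weight_map = json.load(f)["weight_map"]
        return sorted(set(weight_map.values()))
    # Single-file layout: everything lives in model.safetensors.
    return ["model.safetensors"]
```

Given an index mapping tensors across two shards, this returns both shard filenames; with no index present, it falls back to the single `model.safetensors` file, matching the two layouts the PR description says are now supported.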