
Add Bnb4bit support for MoE models on transformers v5 - #4032 #527

Open
sensai99 wants to merge 23 commits into unslothai:main from sensai99:moeFix

Conversation

@sensai99

@sensai99 sensai99 commented Mar 2, 2026

Hi!

This PR adds 4-bit quantization support for MoE expert parameters stored as nn.Parameter.

With transformers v5, MoE expert parameters stored as nn.Parameter are not quantized. This PR adds that support by doing the following:

  • Converts the expert nn.Parameter weights to Params4bit
  • Handles quantization and dequantization accordingly for PEFT LoRA compatibility (a rough sketch of the idea follows below)
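To make the flow concrete, here is a rough sketch of the conversion and dequantization (illustrative only, not the PR's actual code; it assumes bitsandbytes is installed, a CUDA device is available, and uses a made-up 3D expert weight shape):

# Illustrative sketch: quantize a 3D MoE expert weight stored as an
# nn.Parameter into bitsandbytes Params4bit, then dequantize it back so
# LoRA-style math can run against the base weight.
import torch
import torch.nn as nn
import bitsandbytes as bnb
from bitsandbytes.nn import Params4bit

# A fused expert weight: (num_experts, in_features, out_features) -- hypothetical shape.
expert_weight = nn.Parameter(torch.randn(8, 128, 256, dtype=torch.float16))

# Wrap the tensor as Params4bit; bitsandbytes quantizes it when the tensor
# is moved to a CUDA device.
qweight = Params4bit(expert_weight.data, requires_grad=False, quant_type="nf4").to("cuda")

# Recover the base weight from the packed 4-bit storage (e.g. for PEFT LoRA).
dequantized = bnb.functional.dequantize_4bit(qweight.data, qweight.quant_state)
print(dequantized.shape)  # expected to match the original 3D expert shape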

Analysis using GLM-4.7-Flash:

[Two screenshots attached showing the analysis results]

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces crucial functionality to enable 4-bit quantization for Mixture-of-Experts (MoE) models within the Transformers library, particularly for versions 5 and above. It addresses the challenge of quantizing MoE expert parameters that are defined as nn.Parameter rather than nn.Linear layers, which were previously not handled. The changes ensure that these parameters are correctly processed, quantized, and dequantized, maintaining full compatibility with PEFT LoRA adaptations by preserving necessary shape information throughout the quantization pipeline.

Highlights

  • MoE Parameter Quantization: Added comprehensive support for 4-bit quantization of Mixture-of-Experts (MoE) parameters, specifically targeting nn.Parameter instances in Transformers v5 models.
  • PEFT LoRA Compatibility: Ensured compatibility with PEFT LoRA by correctly handling the shape and attributes of quantized MoE expert parameters, including preserving _original_shape during quantization and reporting correct 3D shapes.
  • Bitsandbytes Integration: Implemented patches to transformers' bitsandbytes quantization logic to recognize, prepare, and convert MoE expert parameters (e.g., gate_up_proj, down_proj) into Params4bit placeholders.
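As a minimal illustration of the shape issue mentioned above (hypothetical shapes; assumes bitsandbytes and a CUDA device; _original_shape is the attribute name this PR reportedly attaches):

import torch
from bitsandbytes.nn import Params4bit

# A 3D expert weight before quantization (shape is made up for illustration).
weight = torch.randn(8, 128, 256, dtype=torch.float16)

# After quantization, the parameter holds a packed uint8 buffer, so `.shape`
# no longer reflects the logical (num_experts, in, out) dimensions.
qparam = Params4bit(weight, requires_grad=False, quant_type="nf4").to("cuda")
print(qparam.shape)              # packed storage shape, not (8, 128, 256)
print(qparam.quant_state.shape)  # logical shape tracked by bitsandbytes

# Stashing the pre-quantization shape lets PEFT-facing code report 3D dims.
qparam._original_shape = weight.shape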


Changelog
  • unsloth_zoo/temporary_patches/__init__.py
    • Imported the new moe_bnb_transformers module to integrate its patching functionalities.
  • unsloth_zoo/temporary_patches/misc.py
    • Introduced _ParamShapeProxy to correctly expose 3D shapes for 4-bit MoE parameters, ensuring compatibility with PEFT's ParamWrapper.
    • Added patch_peft_param_wrapper_4bit_expert_shape to apply the shape proxy, allowing ParamWrapper to correctly derive dimensions for quantized MoE parameters.
  • unsloth_zoo/temporary_patches/moe_bnb_transformers.py
    • Added a new module dedicated to patching transformers' bitsandbytes quantization for MoE expert parameters.
    • Implemented _is_expert_module to identify MoE expert modules based on nn.Parameter attributes.
    • Created replace_expert_params_with_bnb_params to prepare MoE expert parameters by replacing them with Params4bit placeholders on a meta device before weight loading.
    • Developed patch_bnb4bit_quantize_convert to modify the Bnb4bitQuantize.convert method, ensuring correct quantization and preservation of _original_shape for MoE expert parameters.
    • Included patch_bnb4bit_quantizer_param_needs_quantization to extend Bnb4BitHfQuantizer's logic to recognize Params4bit expert placeholders as needing quantization.
    • Added patch_bnb4bit_quantizer_process_model to integrate the expert parameter replacement into Bnb4BitHfQuantizer._process_model_before_weight_loading.
  • unsloth_zoo/temporary_patches/moe_utils.py
    • Integrated bitsandbytes availability checks and Params4bit import.
    • Modified _get_base_weight to include dequantization logic for Params4bit instances.
    • Updated _is_moe_experts_module to correctly identify 4-bit quantized MoE expert parameters.
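For readers unfamiliar with the proxy pattern referenced above, a hypothetical sketch of the _ParamShapeProxy idea (not the module's actual implementation) could look like this:

import torch

class _ParamShapeProxy:
    # Hypothetical sketch: report the original (num_experts, in, out) shape
    # to callers such as PEFT's ParamWrapper, while delegating every other
    # attribute to the wrapped 4-bit parameter.
    def __init__(self, param, original_shape):
        self._param = param
        self._logical_shape = torch.Size(original_shape)

    @property
    def shape(self):
        return self._logical_shape

    def __getattr__(self, name):
        # Only called when normal lookup fails, so this falls through to the
        # underlying parameter (e.g. .data, .quant_state).
        return getattr(self._param, name)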

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR introduces support for quantization of MoE parameters in transformers v5 by converting expert parameters to nn.Params4bit and handling quantization/dequantization for PEFT LoRA compatibility. It includes a new module moe_bnb_transformers.py with patching functions and modifications to misc.py and __init__.py to integrate the new functionality.

Comment on lines +425 to +428
# If the parameter is a Params4bit, dequantize it
if _check_bnb_available() and isinstance(param, Params4bit):
    # Dequantize the parameter
    return bnb.functional.dequantize_4bit(param.data, param.quant_state)
Contributor


Severity: high

Consider adding a check to ensure param.quant_state is not None before dequantizing. If quant_state is None, it could lead to an error during dequantization.

Suggested change (updated lines, adding the quant_state guard):

# If the parameter is a Params4bit, dequantize it
if _check_bnb_available() and isinstance(param, Params4bit) and param.quant_state is not None:
    # Dequantize the parameter
    return bnb.functional.dequantize_4bit(param.data, param.quant_state)

Comment on lines +78 to +79
except Exception as e:
    return raise_error("transformers.quantizers.quantizers_utils.should_convert_module", e)
Contributor


Severity: medium

Consider catching a more specific exception type instead of a general Exception, so that only the expected error is handled; this prevents masking other potential issues.
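For example, a narrower handler could look roughly like the following (illustrative only; which exceptions are actually worth catching depends on what the patched transformers helper can raise):

def get_expert_attr(module, name):
    # Illustrative: handle only the anticipated failure mode instead of a
    # blanket `except Exception`, so unrelated bugs still surface.
    try:
        return getattr(module, name)
    except AttributeError as err:
        print(f"Unsloth: expected expert attribute {name!r} is missing: {err}")
        return None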

Comment on lines +144 to +145
if not has_been_replaced:
    logger.warning(f"Unsloth: No expert parameters were found to be replaced for {model.name_or_path}")
Contributor


Severity: medium

It's good to log a warning when no expert parameters are found. However, consider adding more context to the warning message, such as the specific layers or modules that were expected to have expert parameters, to aid in debugging.
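For instance, the warning could name the attributes it looked for (a sketch with hypothetical wording; 'gate_up_proj' and 'down_proj' are the expert attribute names mentioned in the changelog above):

import logging

logger = logging.getLogger(__name__)

def warn_no_experts_replaced(model):
    # Sketch: include the model identifier, its class, and the expert
    # attribute names that were searched for, to make debugging easier.
    logger.warning(
        "Unsloth: No expert parameters were replaced for %s (%s); expected "
        "expert attributes such as 'gate_up_proj' or 'down_proj'.",
        getattr(model, "name_or_path", "<unknown>"),
        model.__class__.__name__,
    )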

Comment on lines +207 to +208
except Exception as e:
    logger.warning(f"Unsloth: Error handling expert param quantization for {full_layer_name}: {e}")
Contributor


Severity: medium

The except block catches a general Exception, which might hide unexpected errors. It's better to catch specific exceptions like KeyError or AttributeError that you anticipate and handle them accordingly. This can prevent masking other potential issues.

Comment on lines +256 to +259
# TODO: Can we raise an error here?
logger.warning(
    f"Unsloth: Error checking MoE expert param_needs_quantization for {param_name}: {e}"
)
Contributor


Severity: medium

The TODO comment suggests there might be a better way to handle the exception. Consider raising an error to prevent unexpected behavior or investigate why the exception is occurring in the first place.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f3f2c6eba9



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d5b567c528



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fb69ead7ca

