fix: skip non-attention layers in _get_vllm_state_dict (fixes unslothai/unsloth#4073) #510
stakeswky wants to merge 2 commits into unslothai:main
Conversation
…oundLocalError

Layers without `self_attn` or `cross_attn` (e.g. Mamba/SSM mixer layers in LFM2 models) caused an `UnboundLocalError` because `prefix` was never assigned before being used in `get_state_dict` calls. Add an `else` branch that logs and skips these unsupported layer types. Fixes unslothai/unsloth#4073
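For context, here is a minimal sketch of the patched layer-iteration control flow. It is not the verbatim `vllm_utils.py` code; the prefix format, parameter list, and surrounding extraction calls are assumptions based on this PR's description.

```python
def iterate_layers(layers, state_dict, get_state_dict, logger):
    """Sketch of the patched layer loop in _get_vllm_state_dict (names assumed)."""
    for kk, layer in enumerate(layers):
        if hasattr(layer, "self_attn"):
            prefix = f"model.layers.{kk}.self_attn"   # assumed prefix format
            attn = layer.self_attn
        elif hasattr(layer, "cross_attn"):
            prefix = f"model.layers.{kk}.cross_attn"
            attn = layer.cross_attn
        else:
            # Mamba/SSM mixer layers (e.g. LFM2) land here; without this branch,
            # `prefix` stays unbound and the call below raises UnboundLocalError.
            logger.info(f"Unsloth: Skipping layer {kk} - no self_attn or cross_attn found.")
            continue
        get_state_dict(f"{prefix}.o_proj", 0, state_dict, attn.o_proj)
        # ... MLP and layernorm extraction for this layer follows ...
```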
Activity
Code Review (Gemini Code Assist)
This pull request addresses a crash in _get_vllm_state_dict when processing models with non-attention layers, such as Mamba. The fix introduces an else branch to gracefully skip these layers, preventing the UnboundLocalError by using a continue statement. The change is correct, minimal, and effectively resolves the issue. The added logging is also helpful for users to understand why certain layers are being skipped.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3d71efc652
The suggestion is attached to the added hunk:

```python
# Skip layers that don't have self_attn or cross_attn (e.g. Mamba/SSM layers
# like LFM2's mixer layers). Full Mamba state dict extraction can be added later.
logger.info(f"Unsloth: Skipping layer {kk} — no self_attn or cross_attn found.")
continue
```
Avoid silently dropping full non-attention layer weights
This `continue` skips not just attention projections but also the rest of the layer (MLP, layernorms, etc.), so mixed architectures with non-attention blocks now produce an incomplete `quant_state_dict` without failing. In `convert_vllm_to_huggingface`, missing keys are silently ignored via `if f"{layer_name}.weight" not in quant_state_dict: ... continue` (around vllm_utils.py:1330), which leaves those modules at their initialized values and can yield incorrect model behavior while appearing successful.
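To make such gaps visible, one could run a completeness check after extraction. This is a hedged sketch, not existing Unsloth code; `report_missing_layer_weights` is a hypothetical helper.

```python
# Hypothetical helper (not in Unsloth): report modules whose weights never made it
# into quant_state_dict and would therefore keep their initialized values downstream.
def report_missing_layer_weights(model, quant_state_dict, logger):
    missing = []
    for name, module in model.named_modules():
        # Only modules that actually own a weight tensor are relevant here.
        if getattr(module, "weight", None) is not None and f"{name}.weight" not in quant_state_dict:
            missing.append(name)
    if missing:
        logger.warning(
            f"Unsloth: {len(missing)} modules missing from the extracted state dict "
            f"(kept at initialized values), e.g. {missing[:5]}"
        )
    return missing
```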
danielhanchen
left a comment
Tested on 1x NVIDIA B200 (CUDA 12.8, torch 2.9.1, vLLM 0.15.1) with Qwen3_4B_GRPO notebook -- PASS (training + inference OK).
Code review:
- The fix is correct. `continue` properly skips the `o_proj`, MLP, and layernorm extraction that follow the attention prefix assignment, preventing the `UnboundLocalError` from issue #4073.
- Minimal and safe approach. Mamba/SSM layer state dict extraction can be added separately.
One minor nit: the log message uses an em dash (U+2014) in "Skipping layer {kk} — no self_attn". Consider using a plain ASCII dash for consistency with the rest of the codebase, though this is cosmetic only.
LGTM.
danielhanchen
left a comment
Tested on 1x NVIDIA B200 (CUDA 12.8, torch 2.9.1, vLLM 0.15.1).
Verification:
- Regression test: `unsloth/Llama-3.2-1B-Instruct` with `fast_inference=True` -- PASS. `_get_vllm_state_dict` processes all attention layers correctly, inference produces valid output.
- LFM2 test: `LiquidAI/LFM2.5-1.2B-Thinking` with `fast_inference=True` -- vLLM's BitsAndBytes loader fails before reaching `_get_vllm_state_dict` (vLLM doesn't fully support `Lfm2ForCausalLM` quantization yet). Confirmed the `UnboundLocalError` on `prefix` is NOT hit.
- Unit tests: verified `continue` correctly skips `o_proj` (line 1127), `mlp.gate_up_proj` (line 1129), `mlp.down_proj` (line 1140), and layernorm extraction (lines 1143-1154) for layers without `self_attn`/`cross_attn`. Tested all-Mamba, mixed-ordering, and Mamba-first-layer edge cases (a sketch of this style of test follows below).
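A hedged sketch of the kind of unit test described in the last item; the dummy layer classes and the `collect_skipped` helper are illustrative stand-ins, not the actual test code.

```python
import types

class DummyAttnLayer:
    def __init__(self):
        self.self_attn = types.SimpleNamespace()   # presence is all the branch checks

class DummyMambaLayer:
    def __init__(self):
        self.mixer = types.SimpleNamespace()        # no self_attn / cross_attn attribute

def collect_skipped(layers):
    """Mirror the patched loop's branch selection and return skipped layer indices."""
    skipped = []
    for kk, layer in enumerate(layers):
        if hasattr(layer, "self_attn") or hasattr(layer, "cross_attn"):
            continue  # attention path: `prefix` gets assigned, extraction proceeds
        skipped.append(kk)  # else branch: layer is skipped with a log message
    return skipped

assert collect_skipped([DummyMambaLayer(), DummyMambaLayer()]) == [0, 1]                 # all-Mamba
assert collect_skipped([DummyMambaLayer(), DummyAttnLayer()]) == [0]                     # Mamba-first
assert collect_skipped([DummyAttnLayer(), DummyMambaLayer(), DummyAttnLayer()]) == [1]   # mixed ordering
```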
Code review: Fix is correct and minimal. Two small nits:
- Unclosed parenthesis in comment (line 1122): `(e.g. Mamba/SSM layers` -- missing closing).
- Consider `logger.warning` instead of `logger.info` -- silently dropping an entire layer's weights from the state dict is significant for debugging. A warning makes it more visible when users check extraction completeness. This aligns with the Codex bot's inline comment about avoiding silent weight dropping.
Neither nit blocks merge.
danielhanchen
left a comment
Update: End-to-end verification completed.
Successfully reproduced the bug and confirmed the fix works.
Bug reproduction (unpatched main, 16-bit LFM2):
File "vllm_utils.py", line 1122, in _get_vllm_state_dict
get_state_dict(f"{prefix}.o_proj", 0, state_dict, o_proj)
^^^^^^
UnboundLocalError: cannot access local variable 'prefix' where it is not associated with a value
With PR fix applied (16-bit LFM2):
```
[WARNING] Unsloth: Skipping layer 0 - no self_attn or cross_attn found.
[WARNING] Unsloth: Skipping layer 1 - no self_attn or cross_attn found.
```
The `else: continue` branch fires correctly for the non-attention layers. No more `UnboundLocalError`.
Note: LFM2 then hits a separate, unrelated issue on its attention layers (`Lfm2Attention` uses `out_proj` instead of `o_proj` -- line 1098). That is out of scope for this PR and would need a separate fix for full LFM2 support.
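For illustration only, a hedged sketch of what that separate fix might look like; `get_output_projection` is a hypothetical helper and not part of this PR.

```python
def get_output_projection(attn):
    """Return the attention output projection regardless of attribute naming.

    Covers modules that expose `out_proj` (e.g. Lfm2Attention) rather than `o_proj`.
    """
    return getattr(attn, "o_proj", None) or getattr(attn, "out_proj", None)
```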
Earlier 4-bit test did not reach `_get_vllm_state_dict` because vLLM's BnB loader fails for `Lfm2ForCausalLM`. Loading in 16-bit successfully reaches the function and exercises the fix.
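For reference, a hedged sketch of how such a 16-bit run could be driven; the exact arguments used in the verification above are not shown in the PR, so these values are illustrative.

```python
from unsloth import FastLanguageModel

# Illustrative repro sketch; model name and settings follow the tests described above.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "LiquidAI/LFM2.5-1.2B-Thinking",
    max_seq_length = 2048,
    load_in_4bit   = False,   # 16-bit load; 4-bit fails earlier in vLLM's BnB loader
    fast_inference = True,    # routes loading through vLLM and _get_vllm_state_dict
)
# Unpatched main: UnboundLocalError in _get_vllm_state_dict.
# With this PR: mixer layers are skipped with the log messages shown above.
```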
Problem

`_get_vllm_state_dict` crashes with `UnboundLocalError: cannot access local variable 'prefix'` when processing LFM2/Mamba models (e.g. `LiquidAI/LFM2.5-1.2B-Thinking`). The layer iteration loop only sets `prefix` inside the `if hasattr(layer, "self_attn")` and `elif hasattr(layer, "cross_attn")` branches. Mamba/SSM layers use `mixer` instead, so neither branch executes and `prefix` is never assigned before the subsequent `get_state_dict(f"{prefix}.o_proj", ...)` call.

Fix

Add an `else` branch that logs and skips layers without `self_attn` or `cross_attn`. This is the minimally invasive fix: it prevents the crash and allows models with mixed layer types (attention + Mamba) to still extract the attention layers correctly.
Future work

A more complete fix would add an `elif hasattr(layer, "mixer")` branch that properly extracts Mamba layer state dicts (A, B, C, D matrices, dt projections, etc.) for full round-trip fidelity. This skip-based approach is a safe stopgap (a hypothetical sketch of such a branch follows below).

Fixes unslothai/unsloth#4073
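For illustration of that future work, a hypothetical sketch of a mixer-extraction helper. The submodule and parameter names (`in_proj`, `conv1d`, `dt_proj`, `out_proj`, `A_log`, `D`) follow common Mamba-style mixers and the `get_state_dict` signature is taken from the traceback above; real SSM/LFM2 layers may differ, and none of this is part of the PR.

```python
# Hypothetical sketch only -- a helper that a future `elif hasattr(layer, "mixer")`
# branch in _get_vllm_state_dict could call. All names here are assumptions.
def extract_mixer_state(layer, kk, state_dict, get_state_dict):
    prefix = f"model.layers.{kk}.mixer"
    mixer = layer.mixer
    # Projection-style submodules could reuse the same helper the attention path uses.
    for sub_name in ("in_proj", "conv1d", "dt_proj", "out_proj"):
        sub = getattr(mixer, sub_name, None)
        if sub is not None:
            get_state_dict(f"{prefix}.{sub_name}", 0, state_dict, sub)
    # Bare SSM parameters (state matrices) have no wrapping module.
    for param_name in ("A_log", "D"):
        param = getattr(mixer, param_name, None)
        if param is not None:
            state_dict[f"{prefix}.{param_name}"] = param
```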