
fix: skip non-attention layers in _get_vllm_state_dict (fixes unslothai/unsloth#4073) #510

Open
stakeswky wants to merge 2 commits into unslothai:main from stakeswky:fix/skip-mamba-layers-vllm-state-dict

Conversation

@stakeswky

Problem

_get_vllm_state_dict crashes with UnboundLocalError: cannot access local variable 'prefix' when processing LFM2/Mamba models (e.g. LiquidAI/LFM2.5-1.2B-Thinking).

The layer iteration loop only sets prefix inside the if hasattr(layer, "self_attn") and elif hasattr(layer, "cross_attn") branches. Mamba/SSM layers expose mixer instead, so neither branch executes and prefix is never assigned before the subsequent get_state_dict(f"{prefix}.o_proj", ...) call.
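A condensed sketch of the failing control flow (simplified for illustration; only the shape matters, not the exact extraction code):

  for kk, layer in enumerate(model.model.layers):
      if hasattr(layer, "self_attn"):
          prefix = f"model.layers.{kk}.self_attn"
          # ... extract the attention projections ...
      elif hasattr(layer, "cross_attn"):
          prefix = f"model.layers.{kk}.cross_attn"
          # ... extract the cross-attention projections ...
      # A Mamba/SSM layer exposes `mixer`, so neither branch ran and
      # `prefix` is unbound when the next call reads it:
      get_state_dict(f"{prefix}.o_proj", 0, state_dict, o_proj)  # UnboundLocalError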

Fix

Add an else branch that logs and skips layers without self_attn or cross_attn. This is a minimally invasive fix: it prevents the crash and allows models with mixed layer types (attention + Mamba) to still have their attention layers extracted correctly.
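In sketch form, the added branch has this shape (the exact committed lines are quoted verbatim in the Codex review further down):

  else:
      # Neither self_attn nor cross_attn is present (e.g. a Mamba/SSM
      # mixer layer): log and skip so `prefix` is never read unassigned.
      logger.info(f"Unsloth: Skipping layer {kk} - no self_attn or cross_attn found.")
      continue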

Future work

A more complete fix would add an elif hasattr(layer, "mixer") branch that properly extracts Mamba layer state dicts (A, B, C, D matrices, dt projections, etc.) for full round-trip fidelity. This skip-based approach is a safe stopgap.
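A rough sketch of such a branch; the mixer attribute comes from the PR description, while the parameter names (A_log, D, in_proj, x_proj, dt_proj, out_proj, conv1d) follow the common Hugging Face Mamba layout and are assumptions, not LFM2's actual module names:

  elif hasattr(layer, "mixer"):
      prefix = f"model.layers.{kk}.mixer"
      mixer = layer.mixer
      # Raw SSM tensors (state matrix log and skip connection).
      for name in ("A_log", "D"):
          if hasattr(mixer, name):
              state_dict[f"{prefix}.{name}"] = getattr(mixer, name).data
      # Projection and convolution submodules.
      for name in ("in_proj", "x_proj", "dt_proj", "out_proj", "conv1d"):
          module = getattr(mixer, name, None)
          if module is not None:
              state_dict[f"{prefix}.{name}.weight"] = module.weight.data
              if getattr(module, "bias", None) is not None:
                  state_dict[f"{prefix}.{name}.bias"] = module.bias.data
      continue  # no attention projections to extract for this layer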

Fixes unslothai/unsloth#4073

…oundLocalError

Layers without self_attn or cross_attn (e.g. Mamba/SSM mixer layers in
LFM2 models) caused an UnboundLocalError because 'prefix' was never
assigned before being used in get_state_dict calls.

Add an else branch that logs and skips these unsupported layer types.

Fixes unslothai/unsloth#4073
@gemini-code-assist
Contributor

Summary of Changes

Hello @stakeswky, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug in the _get_vllm_state_dict function that caused crashes when attempting to extract state dictionaries from models utilizing non-attention based layers, such as Mamba/SSM architectures. The fix ensures that the state dictionary extraction process can proceed without error by gracefully skipping unsupported layer types, thereby improving the robustness of the system for diverse model architectures.

Highlights

  • Fix UnboundLocalError in _get_vllm_state_dict: Resolved a crash occurring when processing LFM2/Mamba models due to the prefix variable not being initialized for layers lacking self_attn or cross_attn attributes.
  • Graceful handling of non-attention layers: Implemented an else branch to explicitly skip and log layers that do not possess self_attn or cross_attn modules, preventing crashes and allowing partial state dict extraction for mixed-architecture models.


Changelog
  • unsloth_zoo/vllm_utils.py
    • Added an else block to the layer iteration loop to handle layers without self_attn or cross_attn attributes, preventing an UnboundLocalError.
    • Included logging for skipped layers to provide visibility into the state dictionary extraction process.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request addresses a crash in _get_vllm_state_dict when processing models with non-attention layers, such as Mamba. The fix introduces an else branch to gracefully skip these layers, preventing the UnboundLocalError by using a continue statement. The change is correct, minimal, and effectively resolves the issue. The added logging is also helpful for users to understand why certain layers are being skipped.


@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3d71efc652


  # Skip layers that don't have self_attn or cross_attn (e.g. Mamba/SSM layers
  # like LFM2's mixer layers). Full Mamba state dict extraction can be added later.
  logger.info(f"Unsloth: Skipping layer {kk} — no self_attn or cross_attn found.")
  continue

P1: Avoid silently dropping full non-attention layer weights

This continue skips not just attention projections but also the rest of the layer (mlp, layernorms, etc.), so mixed architectures with non-attention blocks now produce an incomplete quant_state_dict without failing. In convert_vllm_to_huggingface, missing keys are silently ignored via if f"{layer_name}.weight" not in quant_state_dict: ... continue (around vllm_utils.py:1330), which leaves those modules at their initialized values and can yield incorrect model behavior while appearing successful.
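In sketch form (the guard is quoted from the comment above; the logger.warning variant is a hypothetical mitigation, not code in this PR):

  # Current shape around vllm_utils.py:1330: a missing key is skipped
  # silently, leaving the module at its initialized values.
  if f"{layer_name}.weight" not in quant_state_dict:
      continue

  # Hypothetical louder variant:
  if f"{layer_name}.weight" not in quant_state_dict:
      logger.warning(
          f"Unsloth: {layer_name}.weight not found in quant_state_dict; "
          "the module will keep its initialized weights."
      )
      continue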


Contributor

@danielhanchen left a comment


Tested on 1x NVIDIA B200 (CUDA 12.8, torch 2.9.1, vLLM 0.15.1) with Qwen3_4B_GRPO notebook -- PASS (training + inference OK).

Code review:

  • The fix is correct. continue properly skips the o_proj, MLP, and layernorm extraction that follow the attention prefix assignment, preventing the UnboundLocalError from issue #4073.
  • Minimal and safe approach. Mamba/SSM layer state dict extraction can be added separately.

One minor nit: the log message uses an em dash (U+2014) in "Skipping layer {kk} — no self_attn". Consider using a plain ASCII dash for consistency with the rest of the codebase, though this is cosmetic only.

LGTM.

Contributor

@danielhanchen left a comment


Tested on 1x NVIDIA B200 (CUDA 12.8, torch 2.9.1, vLLM 0.15.1).

Verification:

  • Regression test: unsloth/Llama-3.2-1B-Instruct with fast_inference=True -- PASS. _get_vllm_state_dict processes all attention layers correctly, inference produces valid output.
  • LFM2 test: LiquidAI/LFM2.5-1.2B-Thinking with fast_inference=True -- vLLM's BitsAndBytes loader fails before reaching _get_vllm_state_dict (vLLM doesn't fully support Lfm2ForCausalLM quantization yet). Confirmed the UnboundLocalError on prefix is NOT hit.
  • Unit tests: verified continue correctly skips o_proj (line 1127), mlp.gate_up_proj (line 1129), mlp.down_proj (line 1140), and layernorm extraction (lines 1143-1154) for layers without self_attn/cross_attn. Tested all-Mamba, mixed-ordering, and Mamba-first-layer edge cases; a sketch of the test shape follows below.
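A hedged sketch of the kind of test described; make_layer and extract_prefixes are hypothetical stand-ins that exercise the skip logic, not the real _get_vllm_state_dict:

  import types

  def make_layer(has_self_attn):
      layer = types.SimpleNamespace()
      if has_self_attn:
          layer.self_attn = object()   # attention layer stand-in
      else:
          layer.mixer = object()       # Mamba/SSM layer stand-in
      return layer

  def extract_prefixes(layers):
      # Hypothetical reduction of the patched loop: collect the prefixes
      # that would be extracted, skipping non-attention layers.
      prefixes = []
      for kk, layer in enumerate(layers):
          if hasattr(layer, "self_attn"):
              prefixes.append(f"model.layers.{kk}.self_attn")
          elif hasattr(layer, "cross_attn"):
              prefixes.append(f"model.layers.{kk}.cross_attn")
          else:
              continue  # Mamba-style layer: skipped, no prefix assigned
      return prefixes

  # All-Mamba, mixed-ordering, and Mamba-first edge cases:
  assert extract_prefixes([make_layer(False)] * 3) == []
  assert extract_prefixes([make_layer(False), make_layer(True)]) == ["model.layers.1.self_attn"]
  assert extract_prefixes([make_layer(True), make_layer(False), make_layer(True)]) == [
      "model.layers.0.self_attn", "model.layers.2.self_attn",
  ]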

Code review: Fix is correct and minimal. Two small nits:

  1. Unclosed parenthesis in the comment on line 1122: (e.g. Mamba/SSM layers is missing its closing ).
  2. Consider logger.warning instead of logger.info -- silently dropping an entire layer's weights from the state dict is significant for debugging. A warning makes it more visible when users check extraction completeness. This aligns with the codex bot's inline comment about avoiding silent weight dropping.

Neither nit blocks merge.

Contributor

@danielhanchen left a comment


Update: End-to-end verification completed.

Successfully reproduced the bug and confirmed the fix works.

Bug reproduction (unpatched main, 16-bit LFM2):

  File "vllm_utils.py", line 1122, in _get_vllm_state_dict
    get_state_dict(f"{prefix}.o_proj", 0, state_dict, o_proj)
                      ^^^^^^
UnboundLocalError: cannot access local variable 'prefix' where it is not associated with a value

With PR fix applied (16-bit LFM2):

  [WARNING] Unsloth: Skipping layer 0 - no self_attn or cross_attn found.
  [WARNING] Unsloth: Skipping layer 1 - no self_attn or cross_attn found.

The else: continue branch fires correctly for the non-attention layers. No more UnboundLocalError.

Note: LFM2 then hits a separate, unrelated issue on its attention layers (Lfm2Attention uses out_proj instead of o_proj -- line 1098). That is out of scope for this PR and would need a separate fix for full LFM2 support.
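A hypothetical sketch of how a follow-up could tolerate both projection names; the getattr fallback is an assumption, not a planned fix:

  attn = layer.self_attn
  # LFM2's Lfm2Attention names its output projection `out_proj`;
  # most other architectures use `o_proj`.
  o_proj = getattr(attn, "o_proj", getattr(attn, "out_proj", None))
  if o_proj is None:
      raise AttributeError(f"Layer {kk}: no o_proj/out_proj on {type(attn).__name__}")
  get_state_dict(f"{prefix}.o_proj", 0, state_dict, o_proj)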

Earlier 4-bit test did not reach _get_vllm_state_dict because vLLM's BnB loader fails for Lfm2ForCausalLM. Loading in 16-bit successfully reaches the function and exercises the fix.



Development

Successfully merging this pull request may close these issues.

[Feature Request] fast inference for LFM (and Mamba models)
