[studio] Fix VLM detection for transformers v5 #4868

Closed
Datta0 wants to merge 1 commit into unslothai:main from Datta0:studio_vlm_fixes

Conversation

Datta0 (Collaborator) commented Apr 6, 2026

Fixes: #4859

Datta0 (Collaborator, Author) commented Apr 6, 2026

#4859

gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request refactors the VLM (Vision Language Model) detection logic by introducing a centralized _is_vlm_config helper and a fallback mechanism that fetches raw config.json metadata from the Hugging Face Hub or local paths when standard loading fails. It also updates the subprocess-based vision check to return None on failure, allowing the detection logic to proceed to the metadata fallback. Review feedback suggests expanding the list of excluded non-VLM model types (such as T5 and BART) to prevent false positives and renaming a test case to clarify that it refers to Transformers version 5 rather than the T5 model architecture.
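
For orientation, a minimal sketch of what a unified helper along these lines could look like. The name _is_vlm_config, the vision-signal keys (vision_config, img_processor, image_token_index, image_token_id), and _VLM_ARCH_SUFFIXES appear in the PR; the body below is illustrative rather than the actual diff, and _VISION_SIGNAL_KEYS is a hypothetical name:

_VLM_ARCH_SUFFIXES = ("ForConditionalGeneration", "ForVisionText2Text")
_VISION_SIGNAL_KEYS = ("vision_config", "img_processor", "image_token_index", "image_token_id")

def _is_vlm_config(config) -> bool:
    # Accept both AutoConfig objects and raw config.json dicts.
    as_dict = config if isinstance(config, dict) else config.to_dict()
    # Explicit vision signals are the strongest evidence.
    if any(as_dict.get(key) is not None for key in _VISION_SIGNAL_KEYS):
        return True
    # Otherwise fall back to an architecture-suffix heuristic.
    architectures = as_dict.get("architectures") or []
    return any(
        isinstance(arch, str) and arch.endswith(_VLM_ARCH_SUFFIXES)
        for arch in architectures
    )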

"cogvlm2",
"minicpmv",
}
_AUDIO_ONLY_MODEL_TYPES = {"csm", "whisper"}

Severity: medium

The list of excluded model types should be expanded. The ForConditionalGeneration architecture suffix is used by many non-vision Seq2Seq models (such as T5, BART, Marian, etc.), which leads to false positives in VLM detection. Renaming this to a more general _NON_VLM_MODEL_TYPES and including common Seq2Seq families is recommended to improve detection accuracy.

Suggested change
_AUDIO_ONLY_MODEL_TYPES = {"csm", "whisper"}
_NON_VLM_MODEL_TYPES = {"csm", "whisper", "t5", "bart", "marian", "pegasus", "blenderbot", "m2m_100"}

Comment on lines +628 to 629
if model_type in _AUDIO_ONLY_MODEL_TYPES:
return False

Severity: medium

Update the exclusion check to use the renamed and expanded list of non-VLM model types to prevent misidentifying standard Seq2Seq models as vision models.

Suggested change
if model_type in _AUDIO_ONLY_MODEL_TYPES:
return False
if model_type in _NON_VLM_MODEL_TYPES:
return False

}
assert model_config._is_vlm_config(config) is False

def test_is_vision_model_falls_back_to_raw_metadata_for_t5_models(self):

Severity: medium

The test name is ambiguous because 't5' typically refers to the T5 model family, whereas here it refers to models requiring Transformers v5. Renaming it to test_is_vision_model_falls_back_to_raw_metadata_for_v5_models would clarify that the test is about version-based fallbacks rather than the T5 architecture.

Suggested change
def test_is_vision_model_falls_back_to_raw_metadata_for_t5_models(self):
def test_is_vision_model_falls_back_to_raw_metadata_for_v5_models(self):
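
For illustration, one possible shape for the renamed test, assuming _is_vlm_config accepts raw config.json dicts as described above; the model_type, architecture name, and config values below are hypothetical:

def test_is_vision_model_falls_back_to_raw_metadata_for_v5_models(self):
    # Hypothetical example: a raw config.json dict for a model that
    # needs transformers v5, carrying an explicit vision signal.
    raw_config = {
        "model_type": "qwen3_5_vl",  # illustrative value
        "architectures": ["Qwen3_5VLForConditionalGeneration"],
        "vision_config": {"hidden_size": 1280},
    }
    assert model_config._is_vlm_config(raw_config) is True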

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fbfa817091


Comment on lines +631 to +633
if architectures:
if any(x.endswith(_VLM_ARCH_SUFFIXES) for x in architectures):
return True

P2: Restrict conditional-generation suffix before marking VLM

The new _is_vlm_config helper treats any architecture ending with ForConditionalGeneration as vision, which is too broad for the new raw-metadata fallback path in is_vision_model. When load_model_config fails (common with custom/remote-code repos), text-only seq2seq configs can now be classified as vision even if they have no vision_config/img_processor/image_token_index, which can route normal models into vision-specific training/inference flows. This regression is introduced by applying the suffix heuristic to raw config.json dictionaries without an additional vision signal.
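
One way to tighten the heuristic along these lines would be to require an explicit vision signal alongside the broad suffix. A sketch, with _suffix_implies_vlm as a hypothetical helper name:

def _suffix_implies_vlm(as_dict: dict) -> bool:
    # Only trust the broad ForConditionalGeneration suffix when the raw
    # config.json also carries an explicit vision key, so that text-only
    # seq2seq models (t5, bart, ...) are not misclassified.
    architectures = as_dict.get("architectures") or []
    has_suffix = any(
        isinstance(a, str) and a.endswith("ForConditionalGeneration")
        for a in architectures
    )
    has_vision_signal = any(
        as_dict.get(k) is not None
        for k in ("vision_config", "img_processor", "image_token_index", "image_token_id")
    )
    return has_suffix and has_vision_signal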


chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e26a88e121


Comment on lines +682 to +683
if architectures and any(x.endswith("ForVisionText2Text") for x in architectures):
return True

P2: Validate architecture entries before suffix matching

In the new raw-config.json fallback path, _is_vlm_config assumes every architectures element is a string and directly calls x.endswith(...). If a custom/partial config contains a non-string entry (for example null), this raises AttributeError and bubbles out of is_vision_model because the transformers-5 fallback branch does not wrap _is_vlm_config in a try/except. That turns a recoverable detection miss into a hard failure (e.g., /models/config can return 500) instead of returning False/None.
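
A defensive variant of the quoted check, wrapped here in a hypothetical helper for self-containedness, might look like this:

def _architectures_imply_vision(raw_config: dict) -> bool:
    # Skip non-string entries (e.g. null in a hand-edited config.json)
    # instead of letting x.endswith(...) raise AttributeError.
    architectures = raw_config.get("architectures") or []
    return any(
        isinstance(x, str) and x.endswith("ForVisionText2Text")
        for x in architectures
    )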


Fix VLM detection for models requiring transformers v5 (Qwen3.5, Gemma4)
in Unsloth Studio. These models were incorrectly classified as text-only
because the subprocess check failed and no fallback existed.

Changes:
- Add _is_vlm_config() helper for unified VLM detection across config
  types (AutoConfig objects and raw JSON dicts)
- Add _load_model_config_metadata() as raw config.json fallback when
  both in-process AutoConfig and subprocess detection fail
- For needs_transformers_5 models: try subprocess first, fall back to
  raw config.json metadata on transient failure
- Replace ForConditionalGeneration architecture suffix heuristic with
  explicit vision signals (vision_config, img_processor, image_token_index,
  image_token_id) to eliminate seq2seq false positives
- Add comprehensive _VLM_MODEL_TYPES safety net for known VLM model types
- Add _classify_detection_error() for permanent vs transient error
  classification (EntryNotFoundError, RepositoryNotFoundError, etc.)
- Update _VISION_CHECK_SCRIPT subprocess to match new detection logic
- Preserve vision detection cache and error classification from unslothai#4853
- Add tests for _is_vlm_config and raw metadata fallback path

Fixes: unslothai#4859
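
To make the commit's raw-metadata fallback concrete, here is a rough sketch of what _load_model_config_metadata() could do. The function name comes from the commit message above; the body is a sketch under the assumption that the file can be fetched with huggingface_hub's hf_hub_download, and the real implementation may differ:

import json
import os

from huggingface_hub import hf_hub_download

def _load_model_config_metadata(model_name: str) -> dict | None:
    # Raw config.json fallback: used when both in-process AutoConfig
    # loading and the subprocess check fail (e.g. the model requires
    # transformers v5 but the running environment ships v4).
    try:
        local_path = os.path.join(model_name, "config.json")
        if os.path.isfile(local_path):
            config_path = local_path
        else:
            config_path = hf_hub_download(model_name, "config.json")
        with open(config_path, "r", encoding="utf-8") as f:
            return json.load(f)
    except Exception:
        # A failed fetch is a detection miss, not a hard error.
        return None
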
rolandtannous (Collaborator) commented Apr 7, 2026

@Datta0 this conflicts with #4878, which already solves the Gemma4 and Qwen3.5 issues in studio and which I just merged. It also segregates the handling into two separate transformers v5 versions (5.3.0 and 5.5.0). If there are no additional bits in this PR beyond fixing training and inference for these two model families, then maybe we should close this one.

Datta0 (Collaborator, Author) commented Apr 8, 2026

Closing this, as the above-mentioned PR seems to handle 5.3 vs 5.5 as well.

Datta0 closed this Apr 8, 2026


Development

Successfully merging this pull request may close these issues.

[Bug] Can't train Qwen3.5 or Gemma4 on multimodal datasets in Unsloth Studio
