[studio] Fix VLM detection for transformers v5 #4868
Datta0 wants to merge 1 commit into unslothai:main
Conversation
Code Review
This pull request refactors the VLM (Vision Language Model) detection logic by introducing a centralized _is_vlm_config helper and a fallback mechanism that fetches raw config.json metadata from the Hugging Face Hub or local paths when standard loading fails. It also updates the subprocess-based vision check to return None on failure, allowing the detection logic to proceed to the metadata fallback. Review feedback suggests expanding the list of excluded non-VLM model types (such as T5 and BART) to prevent false positives and renaming a test case to clarify that it refers to Transformers version 5 rather than the T5 model architecture.
        "cogvlm2",
        "minicpmv",
    }
    _AUDIO_ONLY_MODEL_TYPES = {"csm", "whisper"}
The list of excluded model types should be expanded. The ForConditionalGeneration architecture suffix is used by many non-vision Seq2Seq models (such as T5, BART, Marian, etc.), which leads to false positives in VLM detection. Renaming this to a more general _NON_VLM_MODEL_TYPES and including common Seq2Seq families is recommended to improve detection accuracy.
Suggested change:

    - _AUDIO_ONLY_MODEL_TYPES = {"csm", "whisper"}
    + _NON_VLM_MODEL_TYPES = {"csm", "whisper", "t5", "bart", "marian", "pegasus", "blenderbot", "m2m_100"}
    if model_type in _AUDIO_ONLY_MODEL_TYPES:
        return False
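The proposed rename can be sketched as follows. The set name and its contents come from the suggestion above; the gate function is a hypothetical illustration of how the exclusion would be applied.

```python
# Broader exclusion set proposed in the review (not the merged code).
_NON_VLM_MODEL_TYPES = {
    "csm", "whisper",                 # audio-only
    "t5", "bart", "marian",           # text-only seq2seq
    "pegasus", "blenderbot", "m2m_100",
}


def passes_vlm_gate(model_type):
    """Hypothetical early-exit gate: False means 'known non-VLM, stop
    checking architecture suffixes for this model type'."""
    return model_type not in _NON_VLM_MODEL_TYPES
```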
    }
    assert model_config._is_vlm_config(config) is False

    def test_is_vision_model_falls_back_to_raw_metadata_for_t5_models(self):
The test name is ambiguous because 't5' typically refers to the T5 model family, whereas here it refers to models requiring Transformers v5. Renaming it to test_is_vision_model_falls_back_to_raw_metadata_for_v5_models would clarify that the test is about version-based fallbacks rather than the T5 architecture.
Suggested change:

    - def test_is_vision_model_falls_back_to_raw_metadata_for_t5_models(self):
    + def test_is_vision_model_falls_back_to_raw_metadata_for_v5_models(self):
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fbfa817091
    if architectures:
        if any(x.endswith(_VLM_ARCH_SUFFIXES) for x in architectures):
            return True
Restrict conditional-generation suffix before marking VLM
The new _is_vlm_config helper treats any architecture ending with ForConditionalGeneration as vision, which is too broad for the new raw-metadata fallback path in is_vision_model. When load_model_config fails (common with custom/remote-code repos), text-only seq2seq configs can now be classified as vision even if they have no vision_config/img_processor/image_token_index, which can route normal models into vision-specific training/inference flows. This regression is introduced by applying the suffix heuristic to raw config.json dictionaries without an additional vision signal.
Force-pushed: 78805d0 to 9b0d793
💡 Codex Review
Reviewed commit: e26a88e121
    if architectures and any(x.endswith("ForVisionText2Text") for x in architectures):
        return True
Validate architecture entries before suffix matching
In the new raw-config.json fallback path, _is_vlm_config assumes every architectures element is a string and directly calls x.endswith(...). If a custom/partial config contains a non-string entry (for example null), this raises AttributeError and bubbles out of is_vision_model because the transformers-5 fallback branch does not wrap _is_vlm_config in a try/except. That turns a recoverable detection miss into a hard failure (e.g., /models/config can return 500) instead of returning False/None.
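A defensive suffix match that tolerates malformed `architectures` entries could look like this. The helper name is illustrative; the point is that a non-string entry degrades to a non-match instead of an AttributeError escaping from `is_vision_model`.

```python
def safe_endswith_any(values, suffixes):
    """Suffix-match over a raw `architectures` list from config.json,
    skipping non-string entries (e.g. null in a hand-edited config) so a
    malformed list yields False rather than raising AttributeError."""
    return any(isinstance(v, str) and v.endswith(suffixes)
               for v in (values or []))
```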
Fix VLM detection for models requiring transformers v5 (Qwen3.5, Gemma4) in Unsloth Studio. These models were incorrectly classified as text-only because the subprocess check failed and no fallback existed.

Changes:
- Add _is_vlm_config() helper for unified VLM detection across config types (AutoConfig objects and raw JSON dicts)
- Add _load_model_config_metadata() as raw config.json fallback when both in-process AutoConfig and subprocess detection fail
- For needs_transformers_5 models: try subprocess first, fall back to raw config.json metadata on transient failure
- Replace ForConditionalGeneration architecture suffix heuristic with explicit vision signals (vision_config, img_processor, image_token_index, image_token_id) to eliminate seq2seq false positives
- Add comprehensive _VLM_MODEL_TYPES safety net for known VLM model types
- Add _classify_detection_error() for permanent vs transient error classification (EntryNotFoundError, RepositoryNotFoundError, etc.)
- Update _VISION_CHECK_SCRIPT subprocess to match new detection logic
- Preserve vision detection cache and error classification from unslothai#4853
- Add tests for _is_vlm_config and raw metadata fallback path

Fixes: unslothai#4859
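The permanent-vs-transient classification described in the commit message can be sketched as below. The exception classes here are local stand-ins for huggingface_hub's `EntryNotFoundError` and `RepositoryNotFoundError`, and the mapping is an assumption about the PR's intent, not its exact code.

```python
class RepositoryNotFoundError(Exception):
    """Stand-in for huggingface_hub's error of the same name (illustrative)."""


class EntryNotFoundError(Exception):
    """Stand-in for huggingface_hub's error of the same name (illustrative)."""


# Errors a retry cannot fix: the repo or the config.json simply is not there.
PERMANENT_ERRORS = (RepositoryNotFoundError, EntryNotFoundError)


def classify_detection_error(exc):
    """Return 'permanent' when retrying cannot help (missing repo/file),
    'transient' otherwise (timeouts, rate limits, flaky subprocess)."""
    return "permanent" if isinstance(exc, PERMANENT_ERRORS) else "transient"
```

Only transient errors should trigger the raw-metadata fallback; permanent ones can be cached as definitive misses.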
Force-pushed: e26a88e to 3f60688
@Datta0 this conflicts with #4878, which already solves the Gemma4 and Qwen3.5 issues in Studio and which I just merged. It also segregates handling into two separate transformers v5 versions (5.3.0 and 5.5.0). If there are no additional bits in this PR beyond fixing training and inference for these two model families, then maybe we should close this one.

Closing this, as the above-mentioned PR handles 5.3 vs 5.5 as well.
Fixes: #4859