Add last_response_only parameter to train_on_responses_only #579

maximedb wants to merge 3 commits into unslothai:main
Conversation
Code Review
This pull request introduces a last_response_only parameter to the train_on_responses_only function in unsloth_zoo/dataset_utils.py, which allows masking all assistant turns except for the final one. The implementation was updated to collect all response spans into a list and then selectively apply labels based on the new flag. I have no feedback to provide.
/gemini review

@codex review

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
Code Review
This pull request introduces a last_response_only parameter to the train_on_responses_only function, allowing users to mask all assistant turns except for the final one in multi-turn conversations. The implementation refactors the logic to collect all response spans into a list before selectively applying labels based on the new flag. I have no feedback to provide.
danielhanchen left a comment
Thank you for the PR! The goal of this PR is to enable training only on the final assistant turn in multi-turn conversations, which is a common SFT scenario when you want supervision limited to the last response given a long conversation history. As a summary, this PR adds a new last_response_only=False keyword argument to train_on_responses_only and refactors the inner masking loop to collect (assistant_k, user_j) spans first, then applies labels from either all spans or spans[-1:] depending on the flag.
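To make the collect-then-apply control flow concrete, here is a simplified, hypothetical sketch (names like `mask_labels` are illustrative only, not the actual `unsloth_zoo` code):

```python
def mask_labels(input_ids, spans, last_response_only=False):
    # Start fully masked; -100 is the ignore index for cross-entropy loss.
    labels = [-100] * len(input_ids)
    # spans: list of (start, end) token ranges, one per assistant turn.
    apply_spans = spans[-1:] if last_response_only else spans
    for start, end in apply_spans:
        labels[start:end] = input_ids[start:end]
    return labels

input_ids = list(range(10))
spans = [(2, 4), (7, 10)]  # two assistant turns

all_turns = mask_labels(input_ids, spans)
last_only = mask_labels(input_ids, spans, last_response_only=True)
print(all_turns)  # [-100, -100, 2, 3, -100, -100, -100, 7, 8, 9]
print(last_only)  # [-100, -100, -100, -100, -100, -100, -100, 7, 8, 9]
```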
We ran 11 independent reviewers (7 from reviewer.py, 1 from gemini-code-assist, and 3 Sonnet subagents). Consolidated findings:
| Reviewers | Severity | Finding |
|---|---|---|
| 3/11 | Medium | last_response_only is not forwarded to UnslothVisionDataCollator in unsloth_zoo/vision_utils.py, so the VLM response-only masking path cannot use the new behavior |
| 3/11 | Low | spans[-1:] on empty spans silently returns [] (correct, but should be documented) |
| 2/11 | Low | Docstring could be clearer about the old_labels interaction when last_response_only=True |
| 2/11 | Low | Parameter alignment style is slightly off (last_response_only is longer than the other parameter names) |
| 1/11 | Low | The default path now allocates a spans list per sample instead of writing labels inline (very minor; dataset.map is one-shot so overhead is not measurable) |
| 1/11 | Low | Adding the new kwarg before num_proc would be safer against any callers passing num_proc positionally (unlikely) |
| 7+/11 | Info | PR description mentions tests/test_dataset_utils.py but no test file is in the actual diff |
Strong consensus across all 11 reviewers: the default path (last_response_only=False) is behaviorally a no-op relative to main in every scenario tested (single-turn, multi-turn, tensor inputs, old_labels present, empty conversations). Multiple reviewers ran differential harnesses comparing pre-PR and post-PR outputs across hundreds of randomized samples and confirmed bit-exact equivalence.
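The kind of differential harness described can be sketched as follows (`mask_v1`/`mask_v2` are hypothetical stand-ins for the pre-PR and post-PR inner loops, not the actual `unsloth_zoo` functions):

```python
import random

def mask_v1(ids, spans):  # pre-PR shape: write labels inline per span
    labels = [-100] * len(ids)
    for s, e in spans:
        labels[s:e] = ids[s:e]
    return labels

def mask_v2(ids, spans, last_response_only=False):  # post-PR shape: collect, then apply
    labels = [-100] * len(ids)
    for s, e in (spans[-1:] if last_response_only else spans):
        labels[s:e] = ids[s:e]
    return labels

random.seed(0)
for _ in range(500):
    n = random.randint(0, 40)
    ids = [random.randint(0, 999) for _ in range(n)]
    # Random sorted cut points paired into non-overlapping (start, end) spans.
    cuts = sorted(random.sample(range(n + 1), k=min(n + 1, 2 * random.randint(0, 3))))
    spans = list(zip(cuts[::2], cuts[1::2]))
    assert mask_v1(ids, spans) == mask_v2(ids, spans)  # default path is bit-exact
```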
The only finding worth fixing in this PR is the VLM integration (Medium, 3/11). I am going to push a small follow-up commit to this PR branch to:
- Forward `last_response_only` to `UnslothVisionDataCollator.__init__` in `unsloth_zoo/vision_utils.py`
- Add an inline comment explaining that `spans[-1:]` is intentionally safe on empty `spans`
See inline comments below for the concrete suggested changes.
```python
tokenizer = None, # Optional
return_function = False, # Useful for iterating over lists
num_proc = None,
last_response_only = False, # Train only on the last assistant turn
```
[3/11 reviewers] The new last_response_only parameter is not forwarded to UnslothVisionDataCollator.__init__ in unsloth_zoo/vision_utils.py. UnslothVisionDataCollator currently calls _train_on_responses_only(None, ..., return_function=True) on line 734 without passing last_response_only, so VLM fine-tuning cannot access the new behavior. Since the function signature is unchanged (kwarg with default), this is a forward-compatible addition. Suggested forwarding in vision_utils.py (separate file, see that file for the actual edit):
No change needed on this line itself — this comment flags the VLM integration to also be added in vision_utils.py in the follow-up.
```python
pass

# Apply labels: only the last assistant turn when last_response_only=True
apply_spans = spans[-1:] if last_response_only else spans
```
[3/11 reviewers] spans[-1:] on an empty list returns [] rather than raising IndexError, which is the correct behavior (sample with no assistant turn stays fully masked), but it is non-obvious. Adding a short inline comment clarifies intent:
```suggestion
# Apply labels: only the last assistant turn when last_response_only=True.
# Note: spans[-1:] safely returns [] when spans is empty (no assistant turn found),
# so a sample with no assistant turn stays fully masked at -100.
apply_spans = spans[-1:] if last_response_only else spans
```
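For readers less familiar with Python slicing semantics: slicing, unlike indexing, never raises on out-of-range bounds, which is why `spans[-1:]` needs no empty-list guard:

```python
spans = []                      # sample where no assistant turn was found
assert spans[-1:] == []         # slice: safe, returns an empty list
# spans[-1] (indexing) would raise IndexError here instead.

spans = [(2, 4), (7, 10)]
assert spans[-1:] == [(7, 10)]  # non-empty case: last span, as a one-element list
```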
unsloth_zoo/dataset_utils.py (outdated)

```python
the labels with -100 for the instruction part.

If last_response_only=True, only the final assistant turn has its
labels unmasked; all earlier assistant turns remain masked at -100.
```
[2/11 reviewers] Minor docstring nit — the phrase "remain masked at -100" is accurate but could be clearer about why they remain masked (they are never written; labels is initialized to [-100] * n on line 255). Also worth noting the old_labels interaction: when last_response_only=True, earlier assistant spans are not copied from old_labels either. Suggested:
```suggestion
If last_response_only=True, only the final assistant turn has its
labels unmasked; all earlier assistant turns remain masked at -100
(they are never written, so they keep the initialized -100 values
and are not copied from old_labels either).
```
Wire the new last_response_only parameter through UnslothVisionDataCollator.__init__ in vision_utils.py so the VLM response-only masking path can use the new last-turn-only behavior. Without this, the kwarg is only usable via the direct dataset_utils entry point, and vision finetuning silently supervises every assistant turn. Also clarify the inline comment around spans[-1:] (safe for empty spans) and expand the docstring to explain that earlier assistant turns remain untouched rather than being actively masked.
/gemini review

@codex review

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6f9f66c783
unsloth_zoo/vision_utils.py (outdated)

```python
last_response_only = False, # Train only on the last assistant turn
completion_only_loss = True,
```
Keep new ctor arg from shifting positional parameters
Adding last_response_only before completion_only_loss in UnslothVisionDataCollator.__init__ is a backward-incompatible API change for positional callers. Any existing code that passed arguments positionally after num_proc will now bind values to the wrong parameters (for example, an old completion_only_loss=False value now sets last_response_only, and later args shift as well), which can silently change label masking or padding behavior. New optional parameters should be appended at the end (or the tail should be keyword-only) to preserve existing call semantics.
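The hazard can be demonstrated with a hypothetical minimal pair of constructors (illustrative names only, not the real `UnslothVisionDataCollator` signature):

```python
def old_init(num_proc=None, completion_only_loss=True, pad_to_multiple_of=None):
    return {"completion_only_loss": completion_only_loss,
            "pad_to_multiple_of": pad_to_multiple_of}

def new_init(num_proc=None,
             last_response_only=False,  # inserted mid-signature: shifts later params
             completion_only_loss=True,
             pad_to_multiple_of=None):
    return {"last_response_only": last_response_only,
            "completion_only_loss": completion_only_loss,
            "pad_to_multiple_of": pad_to_multiple_of}

# A positional caller written against the old signature:
before = old_init(4, False, 8)  # intends completion_only_loss=False, pad_to_multiple_of=8
after  = new_init(4, False, 8)  # False now binds to last_response_only, 8 to completion_only_loss

print(before)  # {'completion_only_loss': False, 'pad_to_multiple_of': 8}
print(after)   # {'last_response_only': False, 'completion_only_loss': 8, 'pad_to_multiple_of': None}
```

Appending the new parameter at the end (or making the tail keyword-only with `*`) avoids this silent rebinding entirely.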
Code Review
This pull request introduces a last_response_only parameter to the train_on_responses_only function and associated vision utilities. This feature enables the masking of all assistant turns except for the final one in multi-turn conversations, allowing training to focus exclusively on the last response. I have no feedback to provide as there were no review comments to evaluate.
danielhanchen left a comment
Round 2 review after the follow-up commit (6f9f66c) that forwarded last_response_only through UnslothVisionDataCollator. We ran the same review pipeline again: 7 reviewers from reviewer.py, 1 gemini-code-assist pass, and 3 fresh Sonnet subagents.
Consolidated findings:
| Reviewers | Severity | Finding |
|---|---|---|
| 7/11 | Medium | In unsloth_zoo/vision_utils.py, last_response_only was inserted between num_proc and completion_only_loss in UnslothVisionDataCollator.__init__. This shifts every subsequent parameter by one position and would silently break any external caller that passed completion_only_loss, pad_to_multiple_of, resize_dimension, or snap_to_patch_size positionally. The only in-repo caller (studio/backend/core/training/trainer.py:3020) passes just (model, tokenizer) positionally, so there is no immediate breakage, but external callers are at risk. |
| 1/11 | Low | In UnslothVisionDataCollator.__init__, last_response_only=True is silently ignored when train_on_responses_only=False. Adding a warning or assertion would improve the UX. Low priority since standard Python kwarg behavior. |
| 1/11 | Low | Docstring paragraph in train_on_responses_only could be trimmed — the parenthetical about old_labels over-explains an implementation detail. The public contract is simply "earlier turns stay masked." |
| 1/11 | Low | Minor alignment inconsistency between dataset_utils.py and vision_utils.py (cosmetic). |
Consensus on the rest of the change set:
- Default path (`last_response_only=False`) is a bit-exact no-op vs `main` — confirmed by multiple differential harnesses.
- `last_response_only=True` correctly masks all earlier assistant turns and leaves only the final one unmasked.
- Closure capture of `last_response_only` is correct.
- The `spans[-1:]` slice is safe on empty spans and the new inline comment documents that.
- `UnslothVisionDataCollator` forwarding threads the flag correctly and does not need `__slots__` updates (the value is consumed in `__init__` and not stored on `self`).
- No cross-version (transformers 4.57.6/5.4.0, TRL 0.22.2/0.27.1/1.0.0) compatibility concerns — pure Python label massaging.
I'm pushing a second follow-up commit that moves last_response_only to the end of UnslothVisionDataCollator.__init__'s parameter list to preserve the positional ABI. The other low-priority nits are not blocking.
unsloth_zoo/vision_utils.py (outdated)

```python
response_part = None,
force_match = True, # Match newlines as well!
num_proc = None,
last_response_only = False, # Train only on the last assistant turn
```
[7/11 reviewers] Inserting last_response_only between num_proc and completion_only_loss shifts every subsequent parameter by one position. Any external caller passing completion_only_loss, pad_to_multiple_of, resize_dimension, or snap_to_patch_size positionally would silently get wrong behavior. Move it to the end of the parameter list instead:
```suggestion
completion_only_loss = True,
pad_to_multiple_of = None,
resize_dimension = 0, # can be 0, 1, 'max' or 'min' (max resizes based on the max of height width, min the min size, 0 the first dim, etc)
snap_to_patch_size = False,
last_response_only = False, # Train only on the last assistant turn
```
Keep the positional ABI of UnslothVisionDataCollator.__init__ stable by appending last_response_only at the end of the parameter list instead of inserting it between num_proc and completion_only_loss. Inserting mid-signature would silently shift completion_only_loss, pad_to_multiple_of, resize_dimension, and snap_to_patch_size by one position for any external caller passing them positionally.
/gemini review

@codex review

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
Code Review
This pull request introduces a last_response_only parameter to the train_on_responses_only function and the UnslothVisionDataCollator class, enabling training to be restricted to the final assistant turn in a conversation. The logic in dataset_utils.py was updated to collect all assistant response spans and apply labels only to the last one when this flag is enabled. I have no feedback to provide.
danielhanchen left a comment
Round 3 review after the positional-ABI fix (deac818) that moved last_response_only to the end of UnslothVisionDataCollator.__init__'s parameter list. Same pipeline: 7 reviewers from reviewer.py, 1 gemini-code-assist pass, 3 fresh Sonnet subagents.
Consolidated result: 10/11 APPROVE, 1/11 COMMENT (cosmetic nits only).
| Reviewers | Severity | Finding |
|---|---|---|
| 1/11 | Low (cosmetic) | Docstring paragraph in train_on_responses_only parenthetical about old_labels is implementation detail that could be trimmed. Non-blocking. |
| 1/11 | Low (cosmetic) | Minor = alignment inconsistency between dataset_utils.py (col 19) and vision_utils.py call site (col 22). Non-blocking. |
| 11/11 | Info | No full-repo test suite exists; no VLM end-to-end run was executed. Pre-existing testing gap, not a PR defect. |
Strong consensus on the current state:
- Positional ABI of `UnslothVisionDataCollator.__init__` is now preserved (the kwarg is the last entry, after all pre-existing params).
- No in-repo caller of `UnslothVisionDataCollator` uses more than 2 positional args, so there is no risk of breakage.
- `last_response_only` is threaded correctly through to `_train_on_responses_only(..., last_response_only=last_response_only, ...)` at `vision_utils.py:743`.
- The `dataset_utils.py` default path (`last_response_only=False`) remains bit-exact with `main` in all differential harnesses across hundreds of synthetic samples.
- `last_response_only=True` correctly masks earlier assistant turns and leaves only the final span unmasked. Verified with single-turn, multi-turn, tensor-backed `input_ids`, and `old_labels`-present cases.
- `spans[-1:]` safely returns `[]` on empty spans (no assistant turn) — documented inline.
- Closure capture of `last_response_only` is correct.
- No `__slots__` update needed (the parameter is consumed in `__init__`, not stored on `self`).
No blocking findings remain. The PR is ready to merge.
Add `last_response_only` parameter to `train_on_responses_only`

What & Why

`train_on_responses_only` currently unmasks the labels for every assistant turn in a multi-turn conversation. This is useful for general SFT, but there are common fine-tuning scenarios — e.g. training a model to follow a final instruction given a conversation history — where you only want to supervise the last assistant response and leave all earlier ones masked.

Change

A single new parameter `last_response_only=False` is added to `train_on_responses_only`. The default is `False`, so existing behaviour is completely unchanged.

Implementation

The inner `_train_on_responses_only` was refactored slightly: instead of immediately writing labels when each `(assistant_start, response_end)` span is found, the loop now collects all spans first, then applies labels from either all of them or just `spans[-1:]` depending on the flag. This avoids duplicating the parsing logic.

The refactor is a no-op when `last_response_only=False`.