
Add last_response_only parameter to train_on_responses_only#579

Open
maximedb wants to merge 3 commits into unslothai:main from maximedb:train_on_last_response_only

Conversation


@maximedb maximedb commented Apr 7, 2026

Add last_response_only parameter to train_on_responses_only

Note: This change was generated by an AI assistant (Claude). The repo owner should review it carefully before merging.


What & Why

train_on_responses_only currently unmasks the labels for every assistant turn in a multi-turn conversation. This is useful for general SFT, but there are common fine-tuning scenarios — e.g. training a model to follow a final instruction given a conversation history — where you only want to supervise the last assistant response and leave all earlier ones masked.

Change

A single new parameter, last_response_only, is added to train_on_responses_only. It defaults to False, so existing behaviour is completely unchanged.

train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",
    response_part="<|im_start|>assistant\n",
    last_response_only=True,  # only the final assistant turn is trained on
)

Implementation

The inner _train_on_responses_only was refactored slightly: instead of immediately writing labels when each (assistant_start, response_end) span is found, the loop now collects all spans first, then applies labels from either all of them or just spans[-1:] depending on the flag. This avoids duplicating the parsing logic.

The refactor is a no-op when last_response_only=False.
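As a rough illustration, the collect-then-apply pattern can be sketched as follows (names and structure are illustrative, not the actual unsloth_zoo internals):

```python
# Hypothetical sketch of the refactored masking step. Labels start fully
# masked at -100; spans are collected first, then either all spans or only
# the last one are unmasked, depending on the flag.
def mask_labels(input_ids, spans, last_response_only=False):
    """spans: list of (start, end) index pairs covering assistant responses."""
    labels = [-100] * len(input_ids)          # everything masked by default
    apply_spans = spans[-1:] if last_response_only else spans
    for start, end in apply_spans:            # unmask only the chosen spans
        labels[start:end] = input_ids[start:end]
    return labels
```

With last_response_only=True only the final span is written; with an empty spans list the sample stays fully masked either way.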


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a last_response_only parameter to the train_on_responses_only function in unsloth_zoo/dataset_utils.py, which allows masking all assistant turns except for the final one. The implementation was updated to collect all response spans into a list and then selectively apply labels based on the new flag. I have no feedback to provide.

@danielhanchen

/gemini review

@danielhanchen

@codex review

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a last_response_only parameter to the train_on_responses_only function, allowing users to mask all assistant turns except for the final one in multi-turn conversations. The implementation refactors the logic to collect all response spans into a list before selectively applying labels based on the new flag. I have no feedback to provide.


@danielhanchen danielhanchen left a comment


Thank you for the PR! The goal of this PR is to enable training only on the final assistant turn in multi-turn conversations, which is a common SFT scenario when you want supervision limited to the last response given a long conversation history. In summary, this PR adds a new last_response_only=False keyword argument to train_on_responses_only and refactors the inner masking loop to collect (assistant_k, user_j) spans first, then applies labels from either all spans or spans[-1:] depending on the flag.

We ran 11 independent reviewers (7 from reviewer.py, 1 from gemini-code-assist, and 3 Sonnet subagents). Consolidated findings:

| Reviewers | Severity | Finding |
| --- | --- | --- |
| 3/11 | Medium | last_response_only is not forwarded to UnslothVisionDataCollator in unsloth_zoo/vision_utils.py, so the VLM response-only masking path cannot use the new behavior |
| 3/11 | Low | spans[-1:] on empty spans silently returns [] (correct, but should be documented) |
| 2/11 | Low | Docstring could be clearer about the old_labels interaction when last_response_only=True |
| 2/11 | Low | Parameter alignment style is slightly off (last_response_only is longer than the other parameter names) |
| 1/11 | Low | The default path now allocates a spans list per sample instead of writing labels inline (very minor; dataset.map is one-shot, so the overhead is not measurable) |
| 1/11 | Low | Adding the new kwarg before num_proc would be safer against any callers passing num_proc positionally (unlikely) |
| 7+/11 | Info | PR description mentions tests/test_dataset_utils.py, but no test file is in the actual diff |

Strong consensus across all 11 reviewers: the default path (last_response_only=False) is behaviorally a no-op relative to main in every scenario tested (single-turn, multi-turn, tensor inputs, old_labels present, empty conversations). Multiple reviewers ran differential harnesses comparing pre-PR and post-PR outputs across hundreds of randomized samples and confirmed bit-exact equivalence.
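A differential check of that kind can be sketched as follows (a toy harness over illustrative masking logic, not the reviewers' actual tooling):

```python
import random

# Illustrative differential harness: compare an "old" inline-write masking
# loop against the "new" collect-then-apply version and assert the default
# path (last_response_only=False) is bit-exact on randomized samples.
def mask_old(ids, spans):
    labels = [-100] * len(ids)
    for s, e in spans:  # old behavior: write labels as each span is found
        labels[s:e] = ids[s:e]
    return labels

def mask_new(ids, spans, last_response_only=False):
    labels = [-100] * len(ids)
    apply_spans = spans[-1:] if last_response_only else spans
    for s, e in apply_spans:
        labels[s:e] = ids[s:e]
    return labels

random.seed(0)
for _ in range(500):  # hundreds of randomized samples, as described above
    n = random.randint(1, 64)
    ids = [random.randint(0, 999) for _ in range(n)]
    cuts = sorted(random.sample(range(n + 1), random.randint(0, min(6, n + 1))))
    spans = list(zip(cuts[::2], cuts[1::2]))  # non-overlapping ascending spans
    assert mask_new(ids, spans) == mask_old(ids, spans)
print("last_response_only=False is a no-op on all samples")
```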

The only finding worth fixing in this PR is the VLM integration (Medium, 3/11). I am going to push a small follow-up commit to this PR branch to:

  1. Forward last_response_only to UnslothVisionDataCollator.__init__ in unsloth_zoo/vision_utils.py
  2. Add an inline comment explaining that spans[-1:] is intentionally safe on empty spans

See inline comments below for the concrete suggested changes.

tokenizer = None, # Optional
return_function = False, # Useful for iterating over lists
num_proc = None,
last_response_only = False, # Train only on the last assistant turn


[3/11 reviewers] The new last_response_only parameter is not forwarded to UnslothVisionDataCollator.__init__ in unsloth_zoo/vision_utils.py. UnslothVisionDataCollator currently calls _train_on_responses_only(None, ..., return_function=True) on line 734 without passing last_response_only, so VLM fine-tuning cannot access the new behavior. Since the function signature is unchanged (kwarg with default), this is a forward-compatible addition. Suggested forwarding in vision_utils.py (separate file, see that file for the actual edit):

Suggested change
last_response_only = False, # Train only on the last assistant turn
last_response_only = False, # Train only on the last assistant turn

No change needed on this line itself — this comment flags the VLM integration to also be added in vision_utils.py in the follow-up.

pass

# Apply labels: only the last assistant turn when last_response_only=True
apply_spans = spans[-1:] if last_response_only else spans


[3/11 reviewers] spans[-1:] on an empty list returns [] rather than raising IndexError, which is the correct behavior (sample with no assistant turn stays fully masked), but it is non-obvious. Adding a short inline comment clarifies intent:

Suggested change
apply_spans = spans[-1:] if last_response_only else spans
# Apply labels: only the last assistant turn when last_response_only=True.
# Note: spans[-1:] safely returns [] when spans is empty (no assistant turn found),
# so a sample with no assistant turn stays fully masked at -100.
apply_spans = spans[-1:] if last_response_only else spans
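For reference, the empty-list slicing behavior the comment documents can be checked directly:

```python
# Slicing with [-1:] never raises, unlike indexing with [-1]:
print([][-1:])                 # → []
print([(0, 3), (5, 9)][-1:])   # → [(5, 9)]
try:
    [][-1]
except IndexError:
    print("indexing an empty list raises IndexError")
```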

the labels with -100 for the instruction part.

If last_response_only=True, only the final assistant turn has its
labels unmasked; all earlier assistant turns remain masked at -100.


[2/11 reviewers] Minor docstring nit — the phrase "remain masked at -100" is accurate but could be clearer about why they remain masked (they are never written; labels is initialized to [-100] * n on line 255). Also worth noting the old_labels interaction: when last_response_only=True, earlier assistant spans are not copied from old_labels either. Suggested:

Suggested change
labels unmasked; all earlier assistant turns remain masked at -100.
If last_response_only=True, only the final assistant turn has its
labels unmasked; all earlier assistant turns remain masked at -100
(they are never written, so they keep the initialized -100 values
and are not copied from old_labels either).

Wire the new last_response_only parameter through
UnslothVisionDataCollator.__init__ in vision_utils.py so the VLM
response-only masking path can use the new last-turn-only behavior.
Without this, the kwarg is only usable via the direct dataset_utils
entry point, and vision finetuning silently supervises every assistant
turn.

Also clarify the inline comment around spans[-1:] (safe for empty
spans) and expand the docstring to explain that earlier assistant
turns remain untouched rather than being actively masked.
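A minimal sketch of that wiring (hypothetical names; the real collator forwards the flag via _train_on_responses_only(..., return_function=True)):

```python
# Hypothetical sketch of forwarding a kwarg into a returned masking closure.
# Names are illustrative, not the actual unsloth_zoo API.
def make_masking_fn(last_response_only=False):
    def mask(ids, spans):
        labels = [-100] * len(ids)
        for s, e in (spans[-1:] if last_response_only else spans):
            labels[s:e] = ids[s:e]
        return labels
    return mask  # the closure captures last_response_only

class VisionCollator:
    def __init__(self, last_response_only=False):
        # the flag is consumed here and captured by the closure, so it is
        # never stored as an attribute on the collator itself
        self._mask = make_masking_fn(last_response_only=last_response_only)

    def collate(self, ids, spans):
        return self._mask(ids, spans)
```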
@danielhanchen

/gemini review

@danielhanchen

@codex review

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6f9f66c783


Comment on lines 659 to 660
last_response_only = False, # Train only on the last assistant turn
completion_only_loss = True,


P1: Keep new ctor arg from shifting positional parameters

Adding last_response_only before completion_only_loss in UnslothVisionDataCollator.__init__ is a backward-incompatible API change for positional callers. Any existing code that passed arguments positionally after num_proc will now bind values to the wrong parameters (for example, an old completion_only_loss=False value now sets last_response_only, and later args shift as well), which can silently change label masking or padding behavior. New optional parameters should be appended at the end (or the tail should be keyword-only) to preserve existing call semantics.
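The hazard can be demonstrated with toy signatures (illustrative only, not the real UnslothVisionDataCollator parameters):

```python
# Toy before/after signatures showing how inserting a defaulted parameter
# mid-signature silently rebinds positional callers.
def init_old(num_proc=None, completion_only_loss=True, pad_to_multiple_of=None):
    return {"completion_only_loss": completion_only_loss,
            "pad_to_multiple_of": pad_to_multiple_of}

def init_new(num_proc=None, last_response_only=False,
             completion_only_loss=True, pad_to_multiple_of=None):
    return {"last_response_only": last_response_only,
            "completion_only_loss": completion_only_loss,
            "pad_to_multiple_of": pad_to_multiple_of}

# An old positional caller intending completion_only_loss=False:
old = init_old(None, False, 8)
new = init_new(None, False, 8)   # False now binds to last_response_only!
print(old["completion_only_loss"], new["completion_only_loss"])  # → False 8
```

Appending the new parameter at the end (or making the tail keyword-only with a bare `*`) avoids this rebinding.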



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a last_response_only parameter to the train_on_responses_only function and associated vision utilities. This feature enables the masking of all assistant turns except for the final one in multi-turn conversations, allowing training to focus exclusively on the last response. I have no feedback to provide as there were no review comments to evaluate.


@danielhanchen danielhanchen left a comment


Round 2 review after the follow-up commit (6f9f66c) that forwarded last_response_only through UnslothVisionDataCollator. We ran the same review pipeline again: 7 reviewers from reviewer.py, 1 gemini-code-assist pass, and 3 fresh Sonnet subagents.

Consolidated findings:

| Reviewers | Severity | Finding |
| --- | --- | --- |
| 7/11 | Medium | In unsloth_zoo/vision_utils.py, last_response_only was inserted between num_proc and completion_only_loss in UnslothVisionDataCollator.__init__. This shifts every subsequent parameter by one position and would silently break any external caller that passed completion_only_loss, pad_to_multiple_of, resize_dimension, or snap_to_patch_size positionally. The only in-repo caller (studio/backend/core/training/trainer.py:3020) passes just (model, tokenizer) positionally, so there is no immediate breakage, but external callers are at risk. |
| 1/11 | Low | In UnslothVisionDataCollator.__init__, last_response_only=True is silently ignored when train_on_responses_only=False. A warning or assertion would improve the UX; low priority since this is standard Python kwarg behavior. |
| 1/11 | Low | The docstring paragraph in train_on_responses_only could be trimmed — the parenthetical about old_labels over-explains an implementation detail. The public contract is simply "earlier turns stay masked." |
| 1/11 | Low | Minor alignment inconsistency between dataset_utils.py and vision_utils.py (cosmetic). |

Consensus on the rest of the change set:

  • Default path (last_response_only=False) is a bit-exact no-op vs main — confirmed by multiple differential harnesses.
  • last_response_only=True correctly masks all earlier assistant turns and leaves only the final one unmasked.
  • Closure capture of last_response_only is correct.
  • The spans[-1:] slice is safe on empty spans and the new inline comment documents that.
  • UnslothVisionDataCollator forwarding threads the flag correctly and does not need __slots__ updates (the value is consumed in __init__ and not stored on self).
  • No cross-version (transformers 4.57.6/5.4.0, TRL 0.22.2/0.27.1/1.0.0) compatibility concerns — pure Python label massaging.

I'm pushing a second follow-up commit that moves last_response_only to the end of UnslothVisionDataCollator.__init__'s parameter list to preserve the positional ABI. The other low-priority nits are not blocking.

response_part = None,
force_match = True, # Match newlines as well!
num_proc = None,
last_response_only = False, # Train only on the last assistant turn


[7/11 reviewers] Inserting last_response_only between num_proc and completion_only_loss shifts every subsequent parameter by one position. Any external caller passing completion_only_loss, pad_to_multiple_of, resize_dimension, or snap_to_patch_size positionally would silently get wrong behavior. Move it to the end of the parameter list instead:

Suggested change
last_response_only = False, # Train only on the last assistant turn
completion_only_loss = True,
pad_to_multiple_of = None,
resize_dimension = 0, # can be 0, 1, 'max' or 'min' (max resizes based on the max of height width, min the min size, 0 the first dim, etc)
snap_to_patch_size = False,
last_response_only = False, # Train only on the last assistant turn

Keep the positional ABI of UnslothVisionDataCollator.__init__ stable
by appending last_response_only at the end of the parameter list
instead of inserting it between num_proc and completion_only_loss.
Inserting mid-signature would silently shift completion_only_loss,
pad_to_multiple_of, resize_dimension, and snap_to_patch_size by one
position for any external caller passing them positionally.
@danielhanchen

/gemini review

@danielhanchen

@codex review

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a last_response_only parameter to the train_on_responses_only function and the UnslothVisionDataCollator class, enabling training to be restricted to the final assistant turn in a conversation. The logic in dataset_utils.py was updated to collect all assistant response spans and apply labels only to the last one when this flag is enabled. I have no feedback to provide.


@danielhanchen danielhanchen left a comment


Round 3 review after the positional-ABI fix (deac818) that moved last_response_only to the end of UnslothVisionDataCollator.__init__'s parameter list. Same pipeline: 7 reviewers from reviewer.py, 1 gemini-code-assist pass, 3 fresh Sonnet subagents.

Consolidated result: 10/11 APPROVE, 1/11 COMMENT (cosmetic nits only).

| Reviewers | Severity | Finding |
| --- | --- | --- |
| 1/11 | Low (cosmetic) | The docstring parenthetical about old_labels in train_on_responses_only is an implementation detail that could be trimmed. Non-blocking. |
| 1/11 | Low (cosmetic) | Minor = alignment inconsistency between dataset_utils.py (col 19) and the vision_utils.py call site (col 22). Non-blocking. |
| 11/11 | Info | No full-repo test suite exists; no VLM end-to-end run was executed. Pre-existing testing gap, not a PR defect. |

Strong consensus on the current state:

  • Positional ABI of UnslothVisionDataCollator.__init__ is now preserved (the kwarg is the last entry, after all pre-existing params).
  • No in-repo caller of UnslothVisionDataCollator uses more than 2 positional args, so there is no risk of breakage.
  • last_response_only is threaded correctly through to _train_on_responses_only(..., last_response_only=last_response_only, ...) at vision_utils.py:743.
  • dataset_utils.py default path (last_response_only=False) remains bit-exact with main in all differential harnesses across hundreds of synthetic samples.
  • last_response_only=True correctly masks earlier assistant turns and leaves only the final span unmasked. Verified with single-turn, multi-turn, tensor-backed input_ids, and old_labels-present cases.
  • spans[-1:] safely returns [] on empty spans (no assistant turn) — documented inline.
  • Closure capture of last_response_only is correct.
  • No __slots__ update needed (parameter is consumed in __init__, not stored on self).

No blocking findings remain. The PR is ready to merge.
