
Add last_response_only parameter to train_on_responses_only#579

Open
maximedb wants to merge 3 commits into unslothai:main from maximedb:train_on_last_response_only

Conversation


@maximedb maximedb commented Apr 7, 2026

Add last_response_only parameter to train_on_responses_only

Note: This change was generated by an AI assistant (Claude). The repo owner should review it carefully before merging.


What & Why

train_on_responses_only currently unmasks the labels for every assistant turn in a multi-turn conversation. This is useful for general SFT, but there are common fine-tuning scenarios — e.g. training a model to follow a final instruction given a conversation history — where you only want to supervise the last assistant response and leave all earlier ones masked.

Change

A single new parameter, last_response_only, is added to train_on_responses_only. It defaults to False, so existing behaviour is completely unchanged.

train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",
    response_part="<|im_start|>assistant\n",
    last_response_only=True,  # only the final assistant turn is trained on
)

Implementation

The inner _train_on_responses_only was refactored slightly: instead of immediately writing labels when each (assistant_start, response_end) span is found, the loop now collects all spans first, then applies labels from either all of them or just spans[-1:] depending on the flag. This avoids duplicating the parsing logic.

The refactor is a no-op when last_response_only=False.
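As a rough illustration, the collect-then-apply pattern can be sketched as follows (names and structure are illustrative, not the actual unsloth_zoo internals):

```python
# Hypothetical sketch of the refactored masking step. Labels start fully
# masked at -100; spans are collected first, then either all spans or only
# the last one are unmasked, depending on the flag.
def mask_labels(input_ids, spans, last_response_only=False):
    """spans: list of (start, end) index pairs covering assistant responses."""
    labels = [-100] * len(input_ids)          # everything masked by default
    apply_spans = spans[-1:] if last_response_only else spans
    for start, end in apply_spans:            # unmask only the chosen spans
        labels[start:end] = input_ids[start:end]
    return labels
```

With last_response_only=True only the final span is written; with an empty spans list the sample stays fully masked either way.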


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a last_response_only parameter to the train_on_responses_only function in unsloth_zoo/dataset_utils.py, which allows masking all assistant turns except for the final one. The implementation was updated to collect all response spans into a list and then selectively apply labels based on the new flag. I have no feedback to provide.

@danielhanchen

/gemini review

@danielhanchen

@codex review

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a last_response_only parameter to the train_on_responses_only function, allowing users to mask all assistant turns except for the final one in multi-turn conversations. The implementation refactors the logic to collect all response spans into a list before selectively applying labels based on the new flag. I have no feedback to provide.


@danielhanchen danielhanchen left a comment


Thank you for the PR! The goal of this PR is to enable training only on the final assistant turn in multi-turn conversations, which is a common SFT scenario when you want supervision limited to the last response given a long conversation history. In summary, this PR adds a new last_response_only=False keyword argument to train_on_responses_only and refactors the inner masking loop to collect (assistant_k, user_j) spans first, then applies labels from either all spans or spans[-1:] depending on the flag.

We ran 11 independent reviewers (7 from reviewer.py, 1 from gemini-code-assist, and 3 Sonnet subagents). Consolidated findings:

| Reviewers | Severity | Finding |
| --- | --- | --- |
| 3/11 | Medium | last_response_only is not forwarded to UnslothVisionDataCollator in unsloth_zoo/vision_utils.py, so the VLM response-only masking path cannot use the new behavior |
| 3/11 | Low | spans[-1:] on empty spans silently returns [] (correct, but should be documented) |
| 2/11 | Low | Docstring could be clearer about the old_labels interaction when last_response_only=True |
| 2/11 | Low | Parameter alignment style is slightly off (last_response_only is longer than the other parameter names) |
| 1/11 | Low | The default path now allocates a spans list per sample instead of writing labels inline (very minor; dataset.map is one-shot, so the overhead is not measurable) |
| 1/11 | Low | Adding the new kwarg before num_proc would be safer against any callers passing num_proc positionally (unlikely) |
| 7+/11 | Info | PR description mentions tests/test_dataset_utils.py, but no test file is in the actual diff |

Strong consensus across all 11 reviewers: the default path (last_response_only=False) is behaviorally a no-op relative to main in every scenario tested (single-turn, multi-turn, tensor inputs, old_labels present, empty conversations). Multiple reviewers ran differential harnesses comparing pre-PR and post-PR outputs across hundreds of randomized samples and confirmed bit-exact equivalence.
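A differential check of that kind can be sketched as follows (a toy harness over illustrative masking logic, not the reviewers' actual tooling):

```python
import random

# Illustrative differential harness: compare an "old" inline-write masking
# loop against the "new" collect-then-apply version and assert the default
# path (last_response_only=False) is bit-exact on randomized samples.
def mask_old(ids, spans):
    labels = [-100] * len(ids)
    for s, e in spans:  # old behavior: write labels as each span is found
        labels[s:e] = ids[s:e]
    return labels

def mask_new(ids, spans, last_response_only=False):
    labels = [-100] * len(ids)
    apply_spans = spans[-1:] if last_response_only else spans
    for s, e in apply_spans:
        labels[s:e] = ids[s:e]
    return labels

random.seed(0)
for _ in range(500):  # hundreds of randomized samples, as described above
    n = random.randint(1, 64)
    ids = [random.randint(0, 999) for _ in range(n)]
    cuts = sorted(random.sample(range(n + 1), random.randint(0, min(6, n + 1))))
    spans = list(zip(cuts[::2], cuts[1::2]))  # non-overlapping ascending spans
    assert mask_new(ids, spans) == mask_old(ids, spans)
print("last_response_only=False is a no-op on all samples")
```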

The only finding worth fixing in this PR is the VLM integration (Medium, 3/11). I am going to push a small follow-up commit to this PR branch to:

  1. Forward last_response_only to UnslothVisionDataCollator.__init__ in unsloth_zoo/vision_utils.py
  2. Add an inline comment explaining that spans[-1:] is intentionally safe on empty spans

See inline comments below for the concrete suggested changes.

tokenizer = None, # Optional
return_function = False, # Useful for iterating over lists
num_proc = None,
last_response_only = False, # Train only on the last assistant turn


[3/11 reviewers] The new last_response_only parameter is not forwarded to UnslothVisionDataCollator.__init__ in unsloth_zoo/vision_utils.py. UnslothVisionDataCollator currently calls _train_on_responses_only(None, ..., return_function=True) on line 734 without passing last_response_only, so VLM fine-tuning cannot access the new behavior. Since the function signature is unchanged (kwarg with default), this is a forward-compatible addition. Suggested forwarding in vision_utils.py (separate file, see that file for the actual edit):

Suggested change
last_response_only = False, # Train only on the last assistant turn
last_response_only = False, # Train only on the last assistant turn

No change needed on this line itself — this comment flags the VLM integration to also be added in vision_utils.py in the follow-up.

pass

# Apply labels: only the last assistant turn when last_response_only=True
apply_spans = spans[-1:] if last_response_only else spans


[3/11 reviewers] spans[-1:] on an empty list returns [] rather than raising IndexError, which is the correct behavior (sample with no assistant turn stays fully masked), but it is non-obvious. Adding a short inline comment clarifies intent:

Suggested change
apply_spans = spans[-1:] if last_response_only else spans
# Apply labels: only the last assistant turn when last_response_only=True.
# Note: spans[-1:] safely returns [] when spans is empty (no assistant turn found),
# so a sample with no assistant turn stays fully masked at -100.
apply_spans = spans[-1:] if last_response_only else spans
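For reference, the empty-list slicing behavior the comment documents can be checked directly:

```python
# Slicing with [-1:] never raises, unlike indexing with [-1]:
print([][-1:])                 # → []
print([(0, 3), (5, 9)][-1:])   # → [(5, 9)]
try:
    [][-1]
except IndexError:
    print("indexing an empty list raises IndexError")
```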

the labels with -100 for the instruction part.

If last_response_only=True, only the final assistant turn has its
labels unmasked; all earlier assistant turns remain masked at -100.


[2/11 reviewers] Minor docstring nit — the phrase "remain masked at -100" is accurate but could be clearer about why they remain masked (they are never written; labels is initialized to [-100] * n on line 255). Also worth noting the old_labels interaction: when last_response_only=True, earlier assistant spans are not copied from old_labels either. Suggested:

Suggested change
labels unmasked; all earlier assistant turns remain masked at -100.
If last_response_only=True, only the final assistant turn has its
labels unmasked; all earlier assistant turns remain masked at -100
(they are never written, so they keep the initialized -100 values
and are not copied from old_labels either).

Wire the new last_response_only parameter through
UnslothVisionDataCollator.__init__ in vision_utils.py so the VLM
response-only masking path can use the new last-turn-only behavior.
Without this, the kwarg is only usable via the direct dataset_utils
entry point, and vision finetuning silently supervises every assistant
turn.

Also clarify the inline comment around spans[-1:] (safe for empty
spans) and expand the docstring to explain that earlier assistant
turns remain untouched rather than being actively masked.
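A minimal sketch of that wiring (hypothetical names; the real collator forwards the flag via _train_on_responses_only(..., return_function=True)):

```python
# Hypothetical sketch of forwarding a kwarg into a returned masking closure.
# Names are illustrative, not the actual unsloth_zoo API.
def make_masking_fn(last_response_only=False):
    def mask(ids, spans):
        labels = [-100] * len(ids)
        for s, e in (spans[-1:] if last_response_only else spans):
            labels[s:e] = ids[s:e]
        return labels
    return mask  # the closure captures last_response_only

class VisionCollator:
    def __init__(self, last_response_only=False):
        # the flag is consumed here and captured by the closure, so it is
        # never stored as an attribute on the collator itself
        self._mask = make_masking_fn(last_response_only=last_response_only)

    def collate(self, ids, spans):
        return self._mask(ids, spans)
```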
@danielhanchen

/gemini review

@danielhanchen

@codex review

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6f9f66c783


Comment on lines 659 to 660
last_response_only = False, # Train only on the last assistant turn
completion_only_loss = True,


P1: Keep new ctor arg from shifting positional parameters

Adding last_response_only before completion_only_loss in UnslothVisionDataCollator.__init__ is a backward-incompatible API change for positional callers. Any existing code that passed arguments positionally after num_proc will now bind values to the wrong parameters (for example, an old completion_only_loss=False value now sets last_response_only, and later args shift as well), which can silently change label masking or padding behavior. New optional parameters should be appended at the end (or the tail should be keyword-only) to preserve existing call semantics.
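The hazard can be demonstrated with toy signatures (illustrative only, not the real UnslothVisionDataCollator parameters):

```python
# Toy before/after signatures showing how inserting a defaulted parameter
# mid-signature silently rebinds positional callers.
def init_old(num_proc=None, completion_only_loss=True, pad_to_multiple_of=None):
    return {"completion_only_loss": completion_only_loss,
            "pad_to_multiple_of": pad_to_multiple_of}

def init_new(num_proc=None, last_response_only=False,
             completion_only_loss=True, pad_to_multiple_of=None):
    return {"last_response_only": last_response_only,
            "completion_only_loss": completion_only_loss,
            "pad_to_multiple_of": pad_to_multiple_of}

# An old positional caller intending completion_only_loss=False:
old = init_old(None, False, 8)
new = init_new(None, False, 8)   # False now binds to last_response_only!
print(old["completion_only_loss"], new["completion_only_loss"])  # → False 8
```

Appending the new parameter at the end (or making the tail keyword-only with a bare `*`) avoids this rebinding.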



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a last_response_only parameter to the train_on_responses_only function and associated vision utilities. This feature enables the masking of all assistant turns except for the final one in multi-turn conversations, allowing training to focus exclusively on the last response. I have no feedback to provide as there were no review comments to evaluate.


@danielhanchen danielhanchen left a comment


Round 2 review after the follow-up commit (6f9f66c) that forwarded last_response_only through UnslothVisionDataCollator. We ran the same review pipeline again: 7 reviewers from reviewer.py, 1 gemini-code-assist pass, and 3 fresh Sonnet subagents.

Consolidated findings:

| Reviewers | Severity | Finding |
| --- | --- | --- |
| 7/11 | Medium | In unsloth_zoo/vision_utils.py, last_response_only was inserted between num_proc and completion_only_loss in UnslothVisionDataCollator.__init__. This shifts every subsequent parameter by one position and would silently break any external caller that passed completion_only_loss, pad_to_multiple_of, resize_dimension, or snap_to_patch_size positionally. The only in-repo caller (studio/backend/core/training/trainer.py:3020) passes just (model, tokenizer) positionally, so there is no immediate breakage, but external callers are at risk. |
| 1/11 | Low | In UnslothVisionDataCollator.__init__, last_response_only=True is silently ignored when train_on_responses_only=False. A warning or assertion would improve the UX; low priority since this is standard Python kwarg behavior. |
| 1/11 | Low | The docstring paragraph in train_on_responses_only could be trimmed — the parenthetical about old_labels over-explains an implementation detail. The public contract is simply "earlier turns stay masked." |
| 1/11 | Low | Minor alignment inconsistency between dataset_utils.py and vision_utils.py (cosmetic). |

Consensus on the rest of the change set:

  • Default path (last_response_only=False) is a bit-exact no-op vs main — confirmed by multiple differential harnesses.
  • last_response_only=True correctly masks all earlier assistant turns and leaves only the final one unmasked.
  • Closure capture of last_response_only is correct.
  • The spans[-1:] slice is safe on empty spans and the new inline comment documents that.
  • UnslothVisionDataCollator forwarding threads the flag correctly and does not need __slots__ updates (the value is consumed in __init__ and not stored on self).
  • No cross-version (transformers 4.57.6/5.4.0, TRL 0.22.2/0.27.1/1.0.0) compatibility concerns — pure Python label massaging.

I'm pushing a second follow-up commit that moves last_response_only to the end of UnslothVisionDataCollator.__init__'s parameter list to preserve the positional ABI. The other low-priority nits are not blocking.

response_part = None,
force_match = True, # Match newlines as well!
num_proc = None,
last_response_only = False, # Train only on the last assistant turn


[7/11 reviewers] Inserting last_response_only between num_proc and completion_only_loss shifts every subsequent parameter by one position. Any external caller passing completion_only_loss, pad_to_multiple_of, resize_dimension, or snap_to_patch_size positionally would silently get wrong behavior. Move it to the end of the parameter list instead:

Suggested change
last_response_only = False, # Train only on the last assistant turn
completion_only_loss = True,
pad_to_multiple_of = None,
resize_dimension = 0, # can be 0, 1, 'max' or 'min' (max resizes based on the max of height width, min the min size, 0 the first dim, etc)
snap_to_patch_size = False,
last_response_only = False, # Train only on the last assistant turn

Keep the positional ABI of UnslothVisionDataCollator.__init__ stable
by appending last_response_only at the end of the parameter list
instead of inserting it between num_proc and completion_only_loss.
Inserting mid-signature would silently shift completion_only_loss,
pad_to_multiple_of, resize_dimension, and snap_to_patch_size by one
position for any external caller passing them positionally.
@danielhanchen

/gemini review

@danielhanchen

@codex review

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a last_response_only parameter to the train_on_responses_only function and the UnslothVisionDataCollator class, enabling training to be restricted to the final assistant turn in a conversation. The logic in dataset_utils.py was updated to collect all assistant response spans and apply labels only to the last one when this flag is enabled. I have no feedback to provide.


@danielhanchen danielhanchen left a comment


Round 3 review after the positional-ABI fix (deac818) that moved last_response_only to the end of UnslothVisionDataCollator.__init__'s parameter list. Same pipeline: 7 reviewers from reviewer.py, 1 gemini-code-assist pass, 3 fresh Sonnet subagents.

Consolidated result: 10/11 APPROVE, 1/11 COMMENT (cosmetic nits only).

| Reviewers | Severity | Finding |
| --- | --- | --- |
| 1/11 | Low (cosmetic) | The docstring parenthetical about old_labels in train_on_responses_only is an implementation detail that could be trimmed. Non-blocking. |
| 1/11 | Low (cosmetic) | Minor = alignment inconsistency between dataset_utils.py (col 19) and the vision_utils.py call site (col 22). Non-blocking. |
| 11/11 | Info | No full-repo test suite exists; no VLM end-to-end run was executed. Pre-existing testing gap, not a PR defect. |

Strong consensus on the current state:

  • Positional ABI of UnslothVisionDataCollator.__init__ is now preserved (the kwarg is the last entry, after all pre-existing params).
  • No in-repo caller of UnslothVisionDataCollator uses more than 2 positional args, so there is no risk of breakage.
  • last_response_only is threaded correctly through to _train_on_responses_only(..., last_response_only=last_response_only, ...) at vision_utils.py:743.
  • dataset_utils.py default path (last_response_only=False) remains bit-exact with main in all differential harnesses across hundreds of synthetic samples.
  • last_response_only=True correctly masks earlier assistant turns and leaves only the final span unmasked. Verified with single-turn, multi-turn, tensor-backed input_ids, and old_labels-present cases.
  • spans[-1:] safely returns [] on empty spans (no assistant turn) — documented inline.
  • Closure capture of last_response_only is correct.
  • No __slots__ update needed (parameter is consumed in __init__, not stored on self).

No blocking findings remain. The PR is ready to merge.
