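Checklist
- areal/api/. (Additive: one new actor.loss_aggregation field; default token_mean is byte-identical to today.)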
Background
AReaL hardcodes the policy-gradient loss to a global token mean. ScaleRL (arXiv:2510.13786, §3.2) treats the loss-aggregation level as a tunable axis that changes which unit dominates the gradient — and reports it as a meaningful knob for stability. AReaL exposes none of it, so users cannot reproduce GRPO-style (per-sequence) or MiniMax-M1-style (per-prompt) aggregation without patching the loss.
Potential Solution
Add actor.loss_aggregation selecting the reduction level:
| level | reduction | unit weighted equally | source |
|---|---|---|---|
| token_mean (default) | Σ(pg·m)/Σm | token | DAPO |
| seq_mean (new) | mean_i(Σ_t pg·m / \|o_i\|) | sequence | GRPO |
| prompt_mean (new) | mean_g(Σ pg·m / Σ m) | prompt-group | MiniMax-M1 |
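
To make the three reduction levels concrete, here is a minimal pure-Python sketch of the formulas in the table. It is not AReaL's implementation: the function names, the nested-list batch layout, and the reading of |o_i| as the number of unmasked tokens in sequence i are assumptions made for illustration, and every sequence is assumed to have at least one valid token.

```python
from typing import List

# Illustrative layout (an assumption, not AReaL's tensor format):
# one inner list per sequence, one entry per token.
Batch = List[List[float]]


def token_mean(pg: Batch, mask: Batch) -> float:
    """Σ(pg·m)/Σm: every valid token in the batch counts once."""
    num = sum(p * m for ps, ms in zip(pg, mask) for p, m in zip(ps, ms))
    den = sum(m for ms in mask for m in ms)
    return num / den


def seq_mean(pg: Batch, mask: Batch) -> float:
    """mean_i(Σ_t pg·m / |o_i|): every sequence counts once."""
    per_seq = [
        sum(p * m for p, m in zip(ps, ms)) / sum(ms)
        for ps, ms in zip(pg, mask)
    ]
    return sum(per_seq) / len(per_seq)


def prompt_mean(pg: Batch, mask: Batch, group_size: int) -> float:
    """mean_g(Σ pg·m / Σ m): every block of group_size consecutive sequences counts once."""
    if len(pg) % group_size != 0:
        raise ValueError("prompt_mean needs whole groups")
    per_group = [
        token_mean(pg[i:i + group_size], mask[i:i + group_size])
        for i in range(0, len(pg), group_size)
    ]
    return sum(per_group) / len(per_group)


# Four sequences of different lengths; the 9.0 is masked out.
pg = [[1.0, 1.0, 9.0], [3.0, 3.0, 3.0, 3.0], [2.0], [4.0, 4.0, 4.0]]
mask = [[1, 1, 0], [1, 1, 1, 1], [1], [1, 1, 1]]

print(token_mean(pg, mask))
print(seq_mean(pg, mask))
print(prompt_mean(pg, mask, group_size=2))
```

On this toy batch the three levels give about 2.8, 2.5, and 2.917 respectively, which shows how the choice of unit shifts weight between long and short sequences.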
- token_mean is byte-identical to today.
- One seam, no new machinery. seq_mean and prompt_mean share a single per-unit reduction in aggregate_pg_loss (unit = one sequence, or group_size consecutive sequences). Paired with a _make_loss_weight_fn that returns the unit count, AReaL's existing engine contract Σ(loss_mb·w_mb)/Σw_mb realizes the exact global mean over that unit: each mode is just a (reduction, weight) pair, with no cross-microbatch denominators.
- Single source of truth. For prompt_mean the prompt-group size is gconfig.n_samples, so group_size is derived (not a separate knob), mb_spec.granularity is auto-bumped to a multiple of it, and rollout.min_valid_group_size is raised to it (prompt_mean groups positionally and needs whole groups). The only user knob is actor.loss_aggregation.
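
The (reduction, weight) pairing can be checked on a toy batch. The sketch below is a hedged illustration, not the code in the PR: make_mode, engine_reduce, and round_up_to_multiple are hypothetical stand-ins for aggregate_pg_loss, _make_loss_weight_fn, and the granularity bump, the list-of-lists layout is assumed, and rounding up to the next multiple is one plausible reading of "auto-bumped".

```python
from typing import Callable, List, Tuple

Seq = List[float]       # per-token pg loss, valid tokens only (assumed layout)
MicroBatch = List[Seq]  # consecutive sequences, grouped positionally
Fn = Callable[[MicroBatch], float]


def _token_mean(seqs: MicroBatch) -> float:
    return sum(sum(s) for s in seqs) / sum(len(s) for s in seqs)


def make_mode(mode: str, group_size: int = 1) -> Tuple[Fn, Fn]:
    """Return a hypothetical (loss_fn, weight_fn) pair for one aggregation mode."""
    if mode == "token_mean":
        # Weight is the token count, so the engine recovers the global token mean.
        return _token_mean, lambda mb: float(sum(len(s) for s in mb))
    if mode == "seq_mean":
        unit = 1
    elif mode == "prompt_mean":
        unit = group_size
    else:
        raise ValueError(f"unknown mode: {mode}")

    def loss_fn(mb: MicroBatch) -> float:
        # Shared per-unit reduction: unit = one sequence or one whole group.
        if len(mb) % unit != 0:
            raise ValueError("microbatch must contain whole units")
        units = [_token_mean(mb[i:i + unit]) for i in range(0, len(mb), unit)]
        return sum(units) / len(units)

    def weight_fn(mb: MicroBatch) -> float:
        return float(len(mb) // unit)  # unit count

    return loss_fn, weight_fn


def engine_reduce(mbs: List[MicroBatch], loss_fn: Fn, weight_fn: Fn) -> float:
    """Engine contract from the text: Σ(loss_mb · w_mb) / Σ w_mb."""
    pairs = [(loss_fn(mb), weight_fn(mb)) for mb in mbs]
    return sum(l * w for l, w in pairs) / sum(w for _, w in pairs)


def round_up_to_multiple(granularity: int, group_size: int) -> int:
    """Smallest multiple of group_size that is >= granularity (assumed rule)."""
    return -(-granularity // group_size) * group_size


seqs = [[1.0, 1.0], [3.0, 3.0, 3.0, 3.0], [2.0], [4.0, 4.0, 4.0]]
split = [seqs[:2], seqs[2:]]  # two microbatches, cut on a group boundary

for mode in ("token_mean", "seq_mean", "prompt_mean"):
    loss_fn, weight_fn = make_mode(mode, group_size=2)
    whole = engine_reduce([seqs], loss_fn, weight_fn)
    parts = engine_reduce(split, loss_fn, weight_fn)
    print(mode, abs(whole - parts) < 1e-9)
```

Because each microbatch is weighted by its own unit count, the weighted average over microbatches equals the single-batch value for all three modes, provided prompt_mean microbatches are cut on group boundaries. That boundary condition is why the granularity has to be a multiple of group_size.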
Additional Information
Implemented in #1417 (draft). Because prompt_mean groups positionally, it depends on the partial-group reward/advantage fix in #1416 and is stacked on it. Mutation-verified tests in tests/test_prompt_mean_loss.py (per-mode values, packed==padded, and a three-mode pairing invariant tying loss_fn/loss_weight_fn to the engine reduction).