## Summary
When using `use_liger_kernel=True` with TRL v0.27.2's `GRPOTrainer` and vLLM, `grad_norm` is ~100x larger than in the non-Liger-Kernel path. The primary cause is that the vLLM importance sampling (IS) correction is not applied in the Liger-Kernel loss path; several other silent differences exist as well.
## List of differences (TRL v0.27.2)

| # | Difference | TRL (non-Liger-Kernel) | Liger-Kernel | Impact | Silent? |
|---|---|---|---|---|---|
| 1 | vLLM IS correction | `per_token_loss *= importance_sampling_ratio` (L2351-L2352) | Not applied | ~100-300x (primary cause) | Yes |
| 2 | dapo/cispo normalizer | `num_items_in_batch / num_processes` (total tokens across the entire generation batch) (L2371) | `all_reduce(sum(attention_mask)) / world_size` (current micro-batch only), then divided by `current_gradient_accumulation_steps` (Liger-Kernel L116-117, TRL L2194) | Several-fold, depending on length variance | Yes |
| 3 | tool_mask | `completion_mask * tool_mask` (L2243) | `completion_mask` only (TRL L2178) | Proportional to the tool-token ratio | Yes |
| 4 | use_bias_correction_kl | `per_token_kl *= coef_1` (L2315-L2316) | Not implemented | KL term only | Yes |
| 5 | delta (ratio clamping) | `coef_1 = clamp(coef_1, max=delta)` (L2326-L2327) | Not implemented | Changes clipping behavior | Yes |
| 6 | off_policy_mask, top_entropy_quantile, sequence-level IS, sapo | Supported | Not supported | - | No (raises error) |
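To make the magnitude of difference #1 concrete, here is a toy reduction in plain Python. All numbers are invented for illustration; the real code operates on PyTorch tensors inside `GRPOTrainer`'s loss, and the function below only mimics the apply-IS-then-mask-then-mean order.

```python
per_token_loss = [2.0, 1.5, 3.0, 0.5]   # hypothetical per-token PPO loss values
completion_mask = [1, 1, 1, 0]          # last token is padding
is_ratio = [0.005, 0.010, 0.003, 0.0]   # hypothetical pi_train / pi_vllm ratios

def reduce_loss(loss, mask, is_ratio=None):
    """Optionally apply the vLLM IS correction, then mask and mean."""
    if is_ratio is not None:
        loss = [l * r for l, r in zip(loss, is_ratio)]
    masked = [l * m for l, m in zip(loss, mask)]
    return sum(masked) / max(sum(mask), 1)

with_is = reduce_loss(per_token_loss, completion_mask, is_ratio)  # TRL path
without_is = reduce_loss(per_token_loss, completion_mask)         # Liger path
print(without_is / with_is)  # ≈ 191x blow-up with these toy values
```

With IS ratios on the order of 0.003–0.01, skipping the correction inflates the loss (and hence `grad_norm`) by roughly the reciprocal of the mean ratio.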
## Where to fix

| # | Difference | Where to fix | Notes |
|---|---|---|---|
| 1 | vLLM IS correction | Liger-Kernel + TRL | Liger-Kernel: add a parameter to apply `per_token_loss *= is_ratio` before reduction. TRL: pass it from `compute_liger_loss`. |
| 2 | dapo/cispo normalizer | Liger-Kernel + TRL | Liger-Kernel: accept an external normalizer. TRL: pass `num_items_in_batch`. |
| 3 | tool_mask | TRL only | Pass `completion_mask * tool_mask` in `compute_liger_loss`. |
| 4 | use_bias_correction_kl | Liger-Kernel + TRL | Liger-Kernel: add a flag to `ppo_loss_fn`. TRL: pass it. |
| 5 | delta (ratio clamping) | Liger-Kernel + TRL | Liger-Kernel: add a clamp argument for `coef_1`. TRL: pass it. |
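One way fixes #1 and #2 could be wired together is sketched below. The keyword names (`importance_sampling_ratio`, `global_normalizer`) are hypothetical, not Liger-Kernel's actual signature; the point is only that both the IS correction and the normalizer must enter at the reduction step, which is why a Liger-Kernel-side change is needed.

```python
def ppo_reduction_sketch(per_token_loss, mask,
                         importance_sampling_ratio=None,
                         global_normalizer=None):
    """Sketch of the proposed reduction; argument names are hypothetical,
    and Liger-Kernel's real chunked loss works on tensors, not lists."""
    # Fix for #1: the IS correction is applied per token, before any
    # sum/mean, so it cannot be bolted on from the TRL side afterwards.
    if importance_sampling_ratio is not None:
        per_token_loss = [l * r for l, r in
                          zip(per_token_loss, importance_sampling_ratio)]
    masked = [l * m for l, m in zip(per_token_loss, mask)]
    # Fix for #2: let the caller supply the generation-batch-wide token
    # count instead of deriving the denominator from the micro-batch.
    denom = global_normalizer if global_normalizer is not None else max(sum(mask), 1)
    return sum(masked) / denom
```

TRL's `compute_liger_loss` would then forward the `importance_sampling_ratio` tensor and the `num_items_in_batch / num_processes` normalizer it already computes for the non-Liger path.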
## Observed behavior

With DAPO loss on a large model using vLLM in `GRPOTrainer`:

| Metric (step 1) | Liger-Kernel (`use_liger_kernel=True`) | Non-Liger-Kernel |
|---|---|---|
| grad_norm | 31.09 | 0.29 |
| loss | 0.0187 | -0.0001 |

The IS ratio mean was ~0.003–0.01, so the non-Liger-Kernel path scales the loss down by that factor, while the Liger-Kernel path passes it through unscaled.
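As a quick sanity check, the observed blow-up sits inside the range implied by the measured IS ratio mean:

```python
grad_norm_ratio = 31.09 / 0.29                    # observed Liger / non-Liger
implied_low, implied_high = 1 / 0.01, 1 / 0.003   # 100x–333x from IS mean 0.003–0.01
assert implied_low <= grad_norm_ratio <= implied_high
print(round(grad_norm_ratio, 1))  # → 107.2
```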
## Conclusion

The primary cause is #1: the vLLM IS correction is not applied in Liger-Kernel. Since the IS ratio must be multiplied in before reduction, this cannot be fixed on the TRL side alone; a Liger-Kernel change is necessary. Only #3 can be resolved with a one-line fix on the TRL side.