Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation

Xixiang He1, Qiyao Sun1, Ao Cheng1, Xingming Li1, Xuanyu Ji1, Hailun Lu2, Runke Huang3, Qingyong Hu2

1 National University of Defense Technology 2 Intelligent Game and Decision Lab, Beijing 3 The Chinese University of Hong Kong, Shenzhen
Accepted at the International Conference on Machine Learning (ICML) 2026 · Acceptance Rate: 26.6%

AVSPO turns homogeneous reward groups from wasted GRPO updates into usable training signal through real-time collapse diagnosis and virtual reward samples.

Reinforcement Learning from Verifiable Rewards
  • 58-63% relative ACR reduction
  • +4-6 accuracy points over GRPO
  • 0.5B-14B Qwen2.5 model scales
  • 0 additional LLM rollouts
Overview of AVSPO. The policy samples a group of responses, the verifier produces binary rewards, and the ACR monitor detects homogeneous groups. When collapse is triggered, AVSPO injects virtual rewards only into the normalization statistics, restoring non-zero advantages without extra model rollouts.
Problem

Abstract

Group Relative Policy Optimization (GRPO), a prominent algorithm within Reinforcement Learning from Verifiable Rewards (RLVR), has achieved strong results in improving the reasoning capabilities of large language models. However, GRPO is prone to advantage collapse: when all responses in a group receive homogeneous rewards, such as all correct or all incorrect answers, the group produces near-zero advantages and vanishing gradients.

This work introduces the Advantage Collapse Rate (ACR), a diagnostic metric that quantifies the proportion of training batches with ineffective gradients. Across models from 0.5B to 14B parameters on mathematical reasoning benchmarks, early ACR strongly predicts training stagnation and final performance.

We propose Adaptive Virtual Sample Policy Optimization (AVSPO), a lightweight extension of GRPO that injects virtual reward samples guided by real-time ACR monitoring. AVSPO enables learning from homogeneous groups without additional rollouts, reducing ACR by 58-63% relative to GRPO and yielding consistent accuracy gains of 4-6 percentage points across model scales.

Contributions

From Diagnosis to Intervention

The paper reframes a hidden GRPO failure mode as a measurable training signal.

01

Collapse Becomes Observable

ACR monitors reward variance already computed during GRPO, exposing batches where gradients are effectively wasted.

02

Early Warning Signal

Across 245 configurations, early ACR explains 62% of final MATH-500 performance variance, allowing poor settings to be detected before long training runs finish.

03

Virtual Samples, Real Gradients

AVSPO adds synthetic reward values to normalization statistics only. The virtual samples do not contribute policy gradients.

04

Drop-in Cost Profile

The algorithm keeps the same per-iteration complexity as GRPO and adds less than 0.01% measured overhead in the reported profiling.

Diagnostic Metric

ACR Predicts Stagnation Before Accuracy Does

Advantage collapse occurs whenever within-group reward variance vanishes. In binary RLVR tasks, this happens in two symmetric regimes: uniformly failed groups on hard problems and uniformly solved groups on easy problems. Both cases produce no useful GRPO advantage despite valid verifier feedback.
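To see why, recall the standard GRPO group-normalized advantage (standard formulation; the notation here is ours, not copied from the paper):

$$\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \ldots, r_G)}{\mathrm{std}(r_1, \ldots, r_G) + \epsilon}$$

When every reward in the group is identical, the numerator is zero for all $i$, so every advantage vanishes and the group contributes nothing to the policy gradient. This is exactly the event ACR counts: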

$$\mathrm{ACR} = \frac{1}{N} \sum_{j=1}^{N} \mathbb{1}\left(\sigma(R_j) < \tau\right)$$

where $N$ is the number of response groups in the batch, $\sigma(R_j)$ is the standard deviation of rewards within group $j$, and $\tau$ is a small collapse threshold.
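A minimal sketch of this computation, assuming rewards arrive as one NumPy array per group; the function name, variable names, and the default `tau` are illustrative, not taken from the paper's code:

```python
import numpy as np

def advantage_collapse_rate(group_rewards, tau=1e-6):
    """Fraction of groups whose within-group reward std falls below tau.

    `group_rewards` is a list of 1-D arrays, one array of verifier
    rewards per sampled group.
    """
    collapsed = [float(np.std(rewards) < tau) for rewards in group_rewards]
    return sum(collapsed) / len(collapsed)

# Example batch: an all-correct group, an all-wrong group, and a mixed group.
batch = [np.ones(8), np.zeros(8), np.array([1., 0., 0., 1., 0., 1., 1., 0.])]
print(advantage_collapse_rate(batch))  # 0.666...: two of three groups collapsed
```

Note that both homogeneous groups count toward ACR regardless of whether the verifier judged them all correct or all incorrect.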
Figure 1. Early-stage ACR negatively correlates with final accuracy across 245 training configurations ($R^2 = 0.617$).
Method

Adaptive Virtual Sample Policy Optimization

A minimal intervention at the reward-statistics level, designed to preserve GRPO's critic-free efficiency.

1

Sample & Verify

For each prompt, GRPO samples G responses and receives binary rewards from the verifier.

2

Monitor Collapse

ACR is computed from group-level reward standard deviations in the current training batch.

3

Inject Virtual Rewards

When ACR exceeds the adaptive threshold, AVSPO adds K virtual reward samples to collapsed groups.

4

Recompute Advantages

Only real samples receive gradients; virtual rewards restore variance for advantage normalization.

Core update: $K = \max\left(1, \min\left(G, \left\lceil G \cdot \mathrm{ACR}^{\alpha} \right\rceil\right)\right)$

The number of virtual samples $K$ scales with collapse severity. The paper uses α = 0.5 and an adaptive trigger threshold, balancing under-intervention against excessive augmentation; a minimal sketch of the resulting computation follows below.
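To make the mechanism concrete, here is a sketch of the collapsed-group advantage computation under these rules. The alternating 0/1 virtual reward values and all names are our illustrative assumptions; the paper's exact injection scheme may differ:

```python
import numpy as np

def avspo_advantages(rewards, acr, alpha=0.5, tau=1e-6, virtual_values=(0.0, 1.0)):
    """Advantages for one group under AVSPO-style virtual-sample injection.

    `rewards`: the G binary verifier rewards of the real responses.
    `acr`: the batch-level Advantage Collapse Rate.
    `virtual_values`: illustrative virtual rewards, cycled to length K.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    G = len(rewards)

    # Healthy group: standard GRPO normalization over the real samples.
    if rewards.std() >= tau:
        return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Collapsed group: K virtual samples, scaled by collapse severity.
    K = max(1, min(G, int(np.ceil(G * acr ** alpha))))

    # Virtual rewards enter the normalization statistics only.
    virtual = np.resize(np.asarray(virtual_values), K)
    pool = np.concatenate([rewards, virtual])
    mu, sigma = pool.mean(), pool.std()

    # Only the G real samples receive advantages, hence gradients.
    return (rewards - mu) / (sigma + 1e-8)

# An all-correct group with batch ACR 0.25: advantages become small and
# positive rather than exactly zero.
print(avspo_advantages(np.ones(8), acr=0.25))
```

Because the virtual rewards only shift the mean and restore a nonzero standard deviation, the real samples regain informative advantages without any synthetic response ever receiving a gradient.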

Results

Consistent Gains Across Model Scales

All methods are trained for 500 steps on Level3-500 with group size G = 8 and sampling temperature T = 1.0, then evaluated on seven benchmarks.

  • GRPO ACR: 0.28-0.45
  • AVSPO ACR: 0.11-0.18
  • +6.1 avg. gain on Qwen2.5-Math-1.5B
  • +2.9 avg. over DCPO

Main Performance Snapshot

Avg. accuracy over MATH, GSM8K, Minerva, Olympiad, AMC, AIME, MMLU-Pro
| Model | GRPO ACR | AVSPO ACR | GRPO Avg. | AVSPO Avg. | Gain |
|---|---|---|---|---|---|
| Qwen2.5-0.5B | 0.45 | 0.18 | 16.5 | 21.0 | +4.5 |
| Qwen2.5-3B | 0.37 | 0.14 | 27.9 | 32.2 | +4.3 |
| Qwen2.5-3B-Instruct | 0.35 | 0.13 | 39.7 | 43.4 | +3.7 |
| Qwen2.5-14B | 0.28 | 0.11 | 49.9 | 54.5 | +4.6 |
| Qwen2.5-Math-1.5B | 0.40 | 0.15 | 33.5 | 39.6 | +6.1 |
| Qwen2.5-Math-7B | 0.33 | 0.14 | 42.2 | 45.9 | +3.7 |
Analysis

When Does Collapse Happen?

ACR responds predictably to model capacity, sampling temperature, group size, and problem difficulty.

Figure 3. ACR sensitivity analysis. Moderate temperatures, group sizes near 6-8, and intermediate problem difficulty provide healthier reward diversity.

Capacity Helps, But Not Monotonically

Larger models tend to reduce ACR, but task-model matching still matters: the 7B setting reaches a lower ACR than the 14B setting in the reported sensitivity study.

Low ACR Is Necessary, Not Sufficient

Higher sampling temperature lowers ACR, but accuracy peaks at moderate temperatures because excessive randomness harms solution quality.

Difficulty Has a U Shape

Very easy and very hard problems both cause homogeneous rewards. Intermediate difficulty yields the most useful reward diversity.

Efficiency

Negligible Runtime Overhead

AVSPO operates entirely on reward statistics. In the reported profiling for Qwen2.5-Math-1.5B, LLM generation and input preparation account for more than 96% of computation. ACR monitoring and virtual sample generation add roughly 3 ms per step, less than 0.01% overhead.

  • No additional model rollouts
  • No critic network
  • No step-level process annotations
  • Memory overhead limited to scalar virtual rewards
Figure 4. Per-step training time is dominated by model generation and input preparation; AVSPO's extra computation is effectively negligible.
Validation

Multi-seed Stability

Representative multi-seed experiments show non-overlapping GRPO and AVSPO performance distributions.

Figure 12. Boxplots over 5 seeds. AVSPO improves both MATH-500 and GSM8K for Qwen2.5-Math-1.5B and Qwen2.5-3B while showing narrower distributions.
Team

Authors

Author order follows the project specification.

Xixiang He

National University of Defense Technology

Qiyao Sun

National University of Defense Technology

Ao Cheng

National University of Defense Technology

Xingming Li

National University of Defense Technology

Xuanyu Ji

National University of Defense Technology

Hailun Lu

Intelligent Game and Decision Lab, Beijing

Runke Huang

The Chinese University of Hong Kong, Shenzhen

Qingyong Hu

Intelligent Game and Decision Lab, Beijing

Contact

For questions about the paper, method, evaluation, or project page, please reach out to the corresponding author.

Qingyong Hu

Corresponding Author  ·  Intelligent Game and Decision Lab

huqingyong15@outlook.com

We welcome questions about AVSPO, advantage collapse in GRPO/RLVR training, and potential collaboration opportunities in reinforcement learning from verifiable rewards. For code release or project-page updates, please reach out directly.

Citation

BibTeX

@inproceedings{he2026avspo,
  title     = {Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation},
  author    = {He, Xixiang and Sun, Qiyao and Cheng, Ao and Li, Xingming and Ji, Xuanyu and Lu, Hailun and Huang, Runke and Hu, Qingyong},
  booktitle = {International Conference on Machine Learning},
  year      = {2026}
}
