Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation

Xixiang He1, Qiyao Sun1, Ao Cheng1, Xingming Li1, Xuanyu Ji1, Hailun Lu2, Runke Huang3, Qingyong Hu2

1 National University of Defense Technology 2 Intelligent Game and Decision Lab, Beijing 3 The Chinese University of Hong Kong, Shenzhen
Accepted at the International Conference on Machine Learning (ICML) 2026 · Acceptance Rate: 26.6%

AVSPO turns homogeneous reward groups from wasted GRPO updates into usable training signal through real-time collapse diagnosis and virtual reward samples.

Reinforcement Learning from Verifiable Rewards
  • 58-63% relative ACR reduction
  • +4-6 accuracy points over GRPO
  • 0.5B-14B Qwen2.5 model scales
  • 0 additional LLM rollouts
Overview of AVSPO. The policy samples a group of responses, the verifier produces binary rewards, and the ACR monitor detects homogeneous groups. When collapse is triggered, AVSPO injects virtual rewards only into the normalization statistics, restoring non-zero advantages without extra model rollouts.
Problem

Abstract

Group Relative Policy Optimization (GRPO), a prominent algorithm within Reinforcement Learning from Verifiable Rewards (RLVR), has achieved strong results in improving the reasoning capabilities of large language models. However, GRPO is prone to advantage collapse: when all responses in a group receive homogeneous rewards, such as all correct or all incorrect answers, the group produces near-zero advantages and vanishing gradients.

This work introduces the Advantage Collapse Rate (ACR), a diagnostic metric that quantifies the proportion of training batches with ineffective gradients. Across models from 0.5B to 14B parameters on mathematical reasoning benchmarks, early ACR strongly predicts training stagnation and final performance.

We propose Adaptive Virtual Sample Policy Optimization (AVSPO), a lightweight extension of GRPO that injects virtual reward samples guided by real-time ACR monitoring. AVSPO enables learning from homogeneous groups without additional rollouts, reducing ACR by 58-63% relative to GRPO and yielding consistent accuracy gains of 4-6 percentage points across model scales.

Contributions

From Diagnosis to Intervention

The paper reframes a hidden GRPO failure mode as a measurable training signal.

01

Collapse Becomes Observable

ACR monitors reward variance already computed during GRPO, exposing batches where gradients are effectively wasted.

02

Early Warning Signal

Across 245 configurations, early ACR explains 62% of final MATH-500 performance variance, allowing poor settings to be detected before long training runs finish.

03

Virtual Samples, Real Gradients

AVSPO adds synthetic reward values to normalization statistics only. The virtual samples do not contribute policy gradients.

04

Drop-in Cost Profile

The algorithm keeps the same per-iteration complexity as GRPO and adds less than 0.01% measured overhead in the reported profiling.

Diagnostic Metric

ACR Predicts Stagnation Before Accuracy Does

Advantage collapse occurs whenever within-group reward variance vanishes. In binary RLVR tasks, this happens in two symmetric regimes: uniformly failed groups on hard problems and uniformly solved groups on easy problems. Both cases produce no useful GRPO advantage despite valid verifier feedback.
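To see why, recall the standard GRPO group-normalized advantage (standard formulation; the notation here is ours, not copied from the paper):

$$\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \ldots, r_G)}{\mathrm{std}(r_1, \ldots, r_G) + \epsilon}$$

When every reward in the group is identical, the numerator is zero for all $i$, so every advantage vanishes and the group contributes nothing to the policy gradient. This is exactly the event ACR counts: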

$$\mathrm{ACR} = \frac{1}{N} \sum_{j=1}^{N} \mathbb{1}\left(\sigma(R_j) < \tau\right)$$

where $N$ is the number of response groups in the batch, $\sigma(R_j)$ is the standard deviation of rewards within group $j$, and $\tau$ is a small collapse threshold.
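A minimal sketch of this computation, assuming rewards arrive as one NumPy array per group; the function name, variable names, and the default `tau` are illustrative, not taken from the paper's code:

```python
import numpy as np

def advantage_collapse_rate(group_rewards, tau=1e-6):
    """Fraction of groups whose within-group reward std falls below tau.

    `group_rewards` is a list of 1-D arrays, one array of verifier
    rewards per sampled group.
    """
    collapsed = [float(np.std(rewards) < tau) for rewards in group_rewards]
    return sum(collapsed) / len(collapsed)

# Example batch: an all-correct group, an all-wrong group, and a mixed group.
batch = [np.ones(8), np.zeros(8), np.array([1., 0., 0., 1., 0., 1., 1., 0.])]
print(advantage_collapse_rate(batch))  # 0.666...: two of three groups collapsed
```

Note that both homogeneous groups count toward ACR regardless of whether the verifier judged them all correct or all incorrect.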
Figure 1. Early-stage ACR negatively correlates with final accuracy across 245 training configurations ($R^2 = 0.617$).
Method

Adaptive Virtual Sample Policy Optimization

A minimal intervention at the reward-statistics level, designed to preserve GRPO's critic-free efficiency.

1

Sample & Verify

For each prompt, GRPO samples G responses and receives binary rewards from the verifier.

2

Monitor Collapse

ACR is computed from group-level reward standard deviations in the current training batch.

3

Inject Virtual Rewards

When ACR exceeds the adaptive threshold, AVSPO adds K virtual reward samples to collapsed groups.

4

Recompute Advantages

Only real samples receive gradients; virtual rewards restore variance for advantage normalization.

Core update: $K = \max\left(1, \min\left(G, \left\lceil G \cdot \mathrm{ACR}^{\alpha} \right\rceil\right)\right)$

The number of virtual samples $K$ scales with collapse severity. The paper uses α = 0.5 and an adaptive trigger threshold, balancing under-intervention against excessive augmentation; a minimal sketch of the resulting computation follows below.
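To make the mechanism concrete, here is a sketch of the collapsed-group advantage computation under these rules. The alternating 0/1 virtual reward values and all names are our illustrative assumptions; the paper's exact injection scheme may differ:

```python
import numpy as np

def avspo_advantages(rewards, acr, alpha=0.5, tau=1e-6, virtual_values=(0.0, 1.0)):
    """Advantages for one group under AVSPO-style virtual-sample injection.

    `rewards`: the G binary verifier rewards of the real responses.
    `acr`: the batch-level Advantage Collapse Rate.
    `virtual_values`: illustrative virtual rewards, cycled to length K.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    G = len(rewards)

    # Healthy group: standard GRPO normalization over the real samples.
    if rewards.std() >= tau:
        return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Collapsed group: K virtual samples, scaled by collapse severity.
    K = max(1, min(G, int(np.ceil(G * acr ** alpha))))

    # Virtual rewards enter the normalization statistics only.
    virtual = np.resize(np.asarray(virtual_values), K)
    pool = np.concatenate([rewards, virtual])
    mu, sigma = pool.mean(), pool.std()

    # Only the G real samples receive advantages, hence gradients.
    return (rewards - mu) / (sigma + 1e-8)

# An all-correct group with batch ACR 0.25: advantages become small and
# positive rather than exactly zero.
print(avspo_advantages(np.ones(8), acr=0.25))
```

Because the virtual rewards only shift the mean and restore a nonzero standard deviation, the real samples regain informative advantages without any synthetic response ever receiving a gradient.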

Results

Consistent Gains Across Model Scales

All methods are trained for 500 steps on Level3-500 with group size G = 8 and sampling temperature T = 1.0, then evaluated on seven benchmarks.

  • GRPO ACR: 0.28-0.45
  • AVSPO ACR: 0.11-0.18
  • +6.1 avg. gain on Qwen2.5-Math-1.5B
  • +2.9 avg. over DCPO

Main Performance Snapshot

Avg. accuracy over MATH, GSM8K, Minerva, Olympiad, AMC, AIME, MMLU-Pro
| Model | GRPO ACR | AVSPO ACR | GRPO Avg. | AVSPO Avg. | Gain |
|---|---|---|---|---|---|
| Qwen2.5-0.5B | 0.45 | 0.18 | 16.5 | 21.0 | +4.5 |
| Qwen2.5-3B | 0.37 | 0.14 | 27.9 | 32.2 | +4.3 |
| Qwen2.5-3B-Instruct | 0.35 | 0.13 | 39.7 | 43.4 | +3.7 |
| Qwen2.5-14B | 0.28 | 0.11 | 49.9 | 54.5 | +4.6 |
| Qwen2.5-Math-1.5B | 0.40 | 0.15 | 33.5 | 39.6 | +6.1 |
| Qwen2.5-Math-7B | 0.33 | 0.14 | 42.2 | 45.9 | +3.7 |
Analysis

When Does Collapse Happen?

ACR responds predictably to model capacity, sampling temperature, group size, and problem difficulty.

Figure 3. ACR sensitivity analysis. Moderate temperatures, group sizes near 6-8, and intermediate problem difficulty provide healthier reward diversity.

Capacity Helps, But Not Monotonically

Larger models tend to reduce ACR, but task-model matching still matters: the 7B setting reaches a lower ACR than the 14B setting in the reported sensitivity study.

Low ACR Is Necessary, Not Sufficient

Higher sampling temperature lowers ACR, but accuracy peaks at moderate temperatures because excessive randomness harms solution quality.

Difficulty Has a U Shape

Very easy and very hard problems both cause homogeneous rewards. Intermediate difficulty yields the most useful reward diversity.

Efficiency

Negligible Runtime Overhead

AVSPO operates entirely on reward statistics. In the reported profiling for Qwen2.5-Math-1.5B, LLM generation and input preparation account for more than 96% of computation. ACR monitoring and virtual sample generation add roughly 3 ms per step, less than 0.01% overhead.

  • No additional model rollouts
  • No critic network
  • No step-level process annotations
  • Memory overhead limited to scalar virtual rewards
Figure 4. Per-step training time is dominated by model generation and input preparation; AVSPO's extra computation is effectively negligible.
Validation

Multi-seed Stability

Representative multi-seed experiments show non-overlapping GRPO and AVSPO performance distributions.

Figure 12. Boxplots over 5 seeds. AVSPO improves both MATH-500 and GSM8K for Qwen2.5-Math-1.5B and Qwen2.5-3B while showing narrower distributions.
Team

Authors

Author order follows the project specification.

Xixiang He

National University of Defense Technology

Qiyao Sun

National University of Defense Technology

Ao Cheng

National University of Defense Technology

Xingming Li

National University of Defense Technology

Xuanyu Ji

National University of Defense Technology

Hailun Lu

Intelligent Game and Decision Lab, Beijing

Runke Huang

The Chinese University of Hong Kong, Shenzhen

Qingyong Hu

Intelligent Game and Decision Lab, Beijing

Contact

For questions about the paper, method, evaluation, or project page, please reach out to the corresponding author.

Qingyong Hu

Corresponding Author  ·  Intelligent Game and Decision Lab

huqingyong15@outlook.com

We welcome questions about AVSPO, advantage collapse in GRPO/RLVR training, and potential collaboration opportunities in reinforcement learning from verifiable rewards. For code release or project-page updates, please reach out directly.

Citation

BibTeX

@inproceedings{he2026avspo,
  title     = {Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation},
  author    = {He, Xixiang and Sun, Qiyao and Cheng, Ao and Li, Xingming and Ji, Xuanyu and Lu, Hailun and Huang, Runke and Hu, Qingyong},
  booktitle = {International Conference on Machine Learning},
  year      = {2026}
}
