PAPO: Perception-Aware Policy Optimization for Multimodal Reasoning

Learning to perceive while learning to reason!

1University of Illinois Urbana-Champaign, 2Alibaba Group
*Equal Contribution, Corresponding Author

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error (67%) in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which can be seamlessly plugged into mainstream RLVR algorithms such as GRPO and DAPO. Notably, PAPO does not rely on additional data curation, reward models, or stronger teacher models. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO.

Perception Bottleneck in Multimodal Reasoning

teaser
Figure 1: Comprehensive error-type breakdown and an inference example comparing GRPO and PAPO.

We raise two fundamental research questions: (1) Are there unique challenges in multimodal reasoning that do not arise in text-only settings and cannot be addressed solely through data- or reward-level modifications? (2) If so, how can we address them by designing a new RLVR optimization objective that is better grounded in multimodal domains? To explore the essential causes of failures in multimodal reasoning, we follow a typical GRPO pipeline to train Qwen2.5-VL-3B on ViRL39K and manually examine and categorize error types over 200 error instances sampled from four benchmarks. We find that the majority of errors, 67.0%, stem from poor perception. We hypothesize that this perception bottleneck arises because the GRPO objective provides no incentive for the model to generate visually grounded responses.

PAPO Algorithm

method overview
Figure 2: Overview of PAPO objective.

We propose Perception-Aware Policy Optimization (PAPO), a novel policy gradient algorithm that enhances multimodal reasoning through visually grounded optimization. PAPO can serve as a direct drop-in replacement for GRPO or DAPO without any additional assumptions. The key idea of PAPO is to encourage the model to learn to perceive while learning to reason by introducing an Implicit Perception Loss in the form of a KL divergence term that can be seamlessly plugged into mainstream RLVR algorithms. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models.
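
The description above implies an objective of the following form. This is a sketch in our own notation (the exact per-token aggregation and expectation follow the paper): the GRPO surrogate is augmented with the Implicit Perception Loss, a KL term between the policy conditioned on the original image I and the policy conditioned on a masked image I_mask, weighted by γ and taken in expectation over questions q, images I, and rollouts o.

\mathcal{J}_{\mathrm{PAPO}}(\theta) \;=\; \mathcal{J}_{\mathrm{GRPO}}(\theta) \;+\; \gamma\, \mathbb{E}\Big[ D_{\mathrm{KL}}\big( \pi_\theta(o \mid q, I) \,\big\|\, \pi_\theta(o \mid q, I_{\mathrm{mask}}) \big) \Big]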

objective comparison
Figure 3: Comparison of GRPO and PAPO objectives.

Implicit Perception Loss

PAPO introduces an Implicit Perception Loss, a KL divergence term that maximizes the divergence between the model's output distribution conditioned on the original image and its distribution conditioned on a corrupted (masked) version. This encourages the model to rely on meaningful visual content when generating responses, addressing the core perception bottleneck in multimodal reasoning.
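
As a concrete illustration, the following is a minimal PyTorch-style sketch of how such a per-token KL term could be computed. The tensor names are hypothetical, and the released implementation may aggregate the divergence differently (e.g., over sampled tokens only).

import torch
import torch.nn.functional as F

def implicit_perception_loss(logits_orig, logits_masked, response_mask):
    """KL( pi(.|q, I) || pi(.|q, I_mask) ), averaged over response tokens.

    logits_orig:   [batch, seq, vocab] logits conditioned on the original image
    logits_masked: [batch, seq, vocab] logits conditioned on the masked image
    response_mask: [batch, seq] with 1 for response tokens, 0 for prompt/padding
    """
    log_p = F.log_softmax(logits_orig, dim=-1)    # log pi(.|q, I)
    log_q = F.log_softmax(logits_masked, dim=-1)  # log pi(.|q, I_mask)
    # Token-level KL divergence, summed over the vocabulary.
    kl_per_token = (log_p.exp() * (log_p - log_q)).sum(dim=-1)
    # Average over valid response tokens only.
    return (kl_per_token * response_mask).sum() / response_mask.sum().clamp(min=1)

Because this term is maximized, it would enter the training loss with a negative sign, e.g. loss = grpo_loss - gamma * implicit_perception_loss(...).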

Double Entropy Loss

To further enhance the training stability of PAPO, we introduce Double Entropy Loss, an effective regularizer that prevents collapse while preserving performance. This idea stems from our observation that rising rollout entropy in both π_θ and π_θ^mask is an early sign of collapse. Double Entropy Loss encourages the model to keep both entropies low.
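
A minimal sketch of this regularizer, under our reading of the description above (the function and tensor names are assumptions, not the released code):

import torch
import torch.nn.functional as F

def double_entropy_loss(logits_orig, logits_masked, response_mask):
    """Penalizes the token-level entropy of both pi_theta and pi_theta^mask."""
    def masked_entropy(logits):
        log_p = F.log_softmax(logits, dim=-1)
        entropy = -(log_p.exp() * log_p).sum(dim=-1)  # [batch, seq]
        return (entropy * response_mask).sum() / response_mask.sum().clamp(min=1)

    # Keeping both entropies low means adding both terms, positively signed, to the loss.
    return masked_entropy(logits_orig) + masked_entropy(logits_masked)

Added to the total loss with a small positive weight, this discourages both the original-image and masked-image policies from drifting toward the high-entropy, unrelated-token generations that precede collapse.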

Results

Main Results

main results
Table 1: Performance (avg@8 acc %) comparison of Qwen2.5-VL-3B and 7B models trained with GRPO, DAPO, and PAPO on general and more vision-dependent multimodal reasoning tasks. MathVerseV refers to the vision-centric subset of MathVerse. Δ%rel indicates the averaged relative gain over the baseline for each task. We observe consistent improvements over both GRPO and DAPO, with gains approaching 8%-19%, especially on tasks with high vision dependency.
  • 4.4%-17.5% overall improvement
  • 8.0%-19.1% gains on vision-dependent tasks
  • 30.5% reduction in perception errors
  • Consistent improvements over both GRPO and DAPO baselines

PAPO consistently outperforms both GRPO and DAPO across all benchmarks, with particularly pronounced improvements on vision-dependent tasks. The results demonstrate that our simple yet effective approach successfully addresses the perception bottleneck in multimodal reasoning without requiring additional computational resources or external models. PAPO can serve as a direct drop-in replacement for both GRPO and DAPO.

Ablation Studies

Impact of Implicit Perception Loss Weighting (γ)

Method | General AVG | General Δ%rel | Vision AVG | Vision Δ%rel | Overall AVG | Overall Δ%rel
GRPO | 51.89 | – | 42.97 | – | 47.92 | –
PAPO @0.005 | 52.40 | ↑ 1.19 | 43.73 | ↑ 1.92 | 48.55 | ↑ 1.51
PAPO @0.01 | 52.53 | ↑ 1.73 | 45.17 | ↑ 4.52 | 49.26 | ↑ 2.97
PAPO @0.02 | 53.39 | ↑ 3.38 | 45.57 | ↑ 5.60 | 49.92 | ↑ 4.36
PAPO @0.04 (collapsed) | 31.24 | ↓ 43.15 | 38.31 | ↓ 14.09 | 34.38 | ↓ 28.46

Key Findings

  • A larger γ, up to 0.02, tends to result in more pronounced improvements, especially on more vision-dependent tasks.
  • γ should not be set too large (e.g., 0.04), as it causes severe model collapse that cannot be regularized away (detailed in later sections).
  • Without additional regularization, γ = 0.02 for 3B models and γ = 0.01 for 7B models serve as good defaults (a hypothetical configuration sketch follows this list).
  • Larger models are more sensitive to higher γ values and require earlier regularization.
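
A hypothetical configuration sketch of these defaults; the names are ours, not from the released code:

# Default Implicit Perception Loss weights suggested by the ablation above,
# for the setting without additional regularization.
PAPO_GAMMA_DEFAULTS = {
    "qwen2.5-vl-3b": 0.02,  # the smaller model tolerates a larger gamma
    "qwen2.5-vl-7b": 0.01,  # the larger model is more sensitive; regularize earlier
}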

Impact of Masking Strategy and Ratio

Model Size | Method | General AVG | General Δ%rel | Vision AVG | Vision Δ%rel | Overall AVG | Overall Δ%rel
3B | random @0.6 | 52.53 | ↑ 1.73 | 45.17 | ↑ 4.52 | 49.26 | ↑ 2.97
3B | semantic @0.6 | 52.13 | ↑ 0.34 | 43.78 | ↑ 1.88 | 48.42 | ↑ 1.02
7B | random @0.6 | 63.56 | ↑ 1.91 | 57.49 | ↑ 5.37 | 60.86 | ↑ 3.55
7B | semantic @0.6 | 63.39 | ↑ 1.48 | 56.83 | ↑ 3.89 | 60.47 | ↑ 2.55

Impact of Masking Ratio

Model Size | Masking Ratio | General AVG | General Δ%rel | Vision AVG | Vision Δ%rel | Overall AVG | Overall Δ%rel
3B | random @0.4 | 52.51 | ↑ 1.6 | 44.12 | ↑ 2.3 | 48.78 | ↑ 1.9
3B | random @0.6 | 52.53 | ↑ 1.7 | 45.17 | ↑ 4.5 | 49.26 | ↑ 3.0
3B | random @0.8 | 52.57 | ↑ 1.5 | 44.24 | ↑ 2.7 | 48.87 | ↑ 2.0
3B | random @1.0 | 52.13 | ↑ 0.7 | 43.98 | ↑ 2.3 | 48.51 | ↑ 1.4

Key Findings

  • Despite its simplicity, random masking empirically outperforms semantic-aware masking (a minimal masking sketch follows this list).
  • A sufficiently large masking ratio (0.6-0.8) yields the best performance.
  • Complete blackening (ratio 1.0) is less effective and more prone to KLprcp hacking.
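
The random masking above could look like the following minimal sketch; the patch size and zero-fill are assumptions, and the paper's preprocessing may differ:

import torch

def random_patch_mask(image, mask_ratio=0.6, patch_size=14):
    """image: [C, H, W] tensor; returns a copy with ~mask_ratio of patches zeroed out."""
    c, h, w = image.shape
    num_h, num_w = h // patch_size, w // patch_size
    num_patches = num_h * num_w
    num_masked = int(mask_ratio * num_patches)

    # Pick which non-overlapping patches to blank out, uniformly at random.
    masked_idx = torch.randperm(num_patches)[:num_masked]
    masked = image.clone()
    for idx in masked_idx:
        row, col = divmod(idx.item(), num_w)
        top, left = row * patch_size, col * patch_size
        masked[:, top:top + patch_size, left:left + patch_size] = 0.0
    return masked

At ratio 1.0 this degenerates into a fully blackened image, which, as noted above, is less effective and more prone to KLprcp hacking.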

PAPO + Remove Reference KL

Model Size | Method | General AVG | General Δ%rel | Vision AVG | Vision Δ%rel | Overall AVG | Overall Δ%rel
3B | GRPO | 51.89 | – | 42.97 | – | 47.92 | –
3B | GRPO + No KLref | 53.96 | ↑ 4.75 | 45.46 | ↑ 5.37 | 50.18 | ↑ 5.03
3B | PAPO + No KLref | 56.21 | ↑ 9.26 | 49.33 | ↑ 13.60 | 53.15 | ↑ 11.19
7B | GRPO | 62.51 | – | 54.11 | – | 58.78 | –
7B | GRPO + No KLref | 63.99 | ↑ 2.05 | 57.94 | ↑ 5.36 | 61.30 | ↑ 3.53
7B | PAPO + No KLref | 63.31 | ↑ 1.15 | 59.18 | ↑ 7.54 | 61.47 | ↑ 3.99

Combining PAPO with removing the reference KL penalty leads to further improvements, particularly pronounced on 3B models with an average relative gain of over 11%. The Double Entropy Loss becomes crucial in this setting to prevent KLprcp Hacking.

DAPO Baseline Regularization Analysis

DAPO regularization comparison
Figure: Training dynamics comparison between DAPO baseline with entropy regularization and PAPOD with Double Entropy Loss. While adding entropy loss to DAPO delays collapse, PAPOD maintains stable training throughout and achieves superior performance.
Method | General AVG | General Δ%rel | Vision AVG | Vision Δ%rel | Overall AVG | Overall Δ%rel
DAPO-7B | 57.58 | – | 51.79 | – | 55.01 | –
DAPO-7B + Entropy Loss | 64.77 | ↑ 13.5 | 59.14 | ↑ 17.7 | 62.27 | ↑ 15.9
PAPOD-7B (w/ Double Ent) | 65.83 | ↑ 15.6 | 59.82 | ↑ 19.1 | 63.16 | ↑ 17.5

Key Findings

  • Adding a single entropy loss to DAPO successfully delays collapse and improves performance, but does not fully prevent training instability
  • PAPOD with Double Entropy Loss consistently outperforms both DAPO variants and completely prevents collapse throughout training
  • The perception-aware optimization combined with robust regularization demonstrates superior effectiveness over simple entropy regularization

Qualitative Examples

qualitative example 1
qualitative example 2
qualitative example 3

Additional qualitative examples demonstrate how PAPO improves multimodal reasoning through better perception.

Implicit Perception Loss (KLprcp) Hacking

Empirically, we observe a unique form of model collapse that can occur when training with an excessively large KLprcp loss weighting (γ), which we refer to as KLprcp Hacking. In this section we provide an in-depth analysis of KLprcp Hacking, characterizing the model's collapsing behavior when hacking occurs, the key factors that lead to collapse, its early signs, and prevention strategies. Intuitively, because we maximize a KL divergence that is theoretically unbounded, the model may "hack" KLprcp to drive the policy gradient loss toward zero. When hacking occurs, the model takes progressively larger update steps and suffers a drastic drop in rewards. To better regularize PAPO in high-γ and no-reference-KL settings, we introduce Double Entropy Loss, an effective regularizer that prevents collapse while preserving performance.

In-Depth Analysis of KLprcp Hacking

Collapsing Behavior

collapsing behavior analysis
Figure 4: Collapsing behavior. A distinctive generation pattern in collapsed models is the production of irrelevant tokens. We verify this quantitatively by prompting GPT-4.1-mini to assign relatedness scores from 0 to 10 to the responses of the GRPO and collapsed PAPO 7B models. We further compare the variance of KLprcp over the response tokens. As illustrated, the collapsed PAPO model exhibits significantly lower relatedness scores and higher variance of KLprcp across tokens.

Collapsed models exhibit significantly degraded coherence, with relatedness scores approximately 18% lower than baseline models. The variance in KLprcp loss across response tokens is about 8.4 times higher, indicating that the model learns to exploit the objective by generating highly unpredictable token sequences.

Collapsing Behavior Example

collapsing behavior example
Example: Demonstration of how collapsed models generate unrelated tokens during reasoning.

Concrete Example: Find x

Question: Find x.
Ground Truth: √21
❌ Collapsed Model Output:
<think> In a right triangle, the Pythagorean theorem can be used to find the length of the hypotenuse. The theorem states that in a right triangle, the square of the length of the hypotenuse (the side opposite the right angle) is equal to the sum of the squares of the lengths of the other two sides. Here, the sides are 2, 5, and x. The side x is the hypotenuse because it is the longest side and the angle between the sides of 2 and 5 is a right angle. So, according to the Pythagorean theorem: x² = 2² + 5² = 4 + 25 = 29, x = √29. Therefore, the value of x is √29.</think>
\boxed{ \@ifundefined{FontU}{}{ \ifCLASSOPTIONcaptionscolor\colorbox{blue!10!white}{\normalfont$29$}% \else\@ifundefined{color}{\normalfont$29$}{\normalfont$29$}% \fi }@ The final answer is \boxed{ \@ifundefined{FontU}{}{ \ifCLASSOPTIONcaptionscolor\colorbox{blue!10!white}{\normalfont$x = \sqrt{29}$}% \else\@ifundefined{color}{\normalfont$x = \sqrt{29}$}{\normalfont$x = \sqrt{29}$}% \fi }@} However, the correct final answer without the box is \boxed{ \@ifundefined{FontU}{}{ \ifCLASSOPTIONcaptionscolor\colorbox{blue!10!white}{\normalfont$x = \sqrt{29}$}% \else\@ifundefined{color}{\normalfont$x = \sqrt{29}$}{\normalfont$x = \sqrt{29}$}% \fi }@} ...
✅ No Collapse (Correct) Output:
<think> In a right triangle, according to the Pythagorean theorem, the square of the hypotenuse is equal to the sum of the squares of the two right-angled sides. Here, the hypotenuse is 5, and one right-angle side is 2 and the other is x. Therefore: 5² = 2² + x², x = √21 </think>
\boxed{√21}

Key Observation: The collapsed model generates extensive unrelated LaTeX formatting tokens and repetitive content, demonstrating how KLprcp hacking leads to degraded output quality and coherence.

Early Signs of Collapse

collapsing behavior example
  • Implicit Perception Loss: drastic decrease
  • Clipping Ratio (High): continuously increases
  • Entropy Loss: increases in both policies

Influential Factors

collapsing behavior example

Loss Weighting (γ)

Higher KLprcp weighting (e.g., 0.04) is more likely to lead to collapse. We recommend γ ≤ 0.02 for safe training.

Model Size

Larger models (7B) are more sensitive to hacking under the same configuration compared to smaller models (3B).

Masking Ratio

Extreme masking ratios (e.g., 1.0, complete blackening) lead to faster collapse than moderate ratios (0.6-0.8).

Prevention Methods

collapsing behavior example
Method | General AVG | General Δ%rel | Vision AVG | Vision Δ%rel | Overall AVG | Overall Δ%rel
GRPO | 62.51 | – | 54.11 | – | 58.78 | –
PAPO_H w/ Inc KLref | 63.14 | ↑ 1.12 | 57.03 | ↑ 3.99 | 60.42 | ↑ 2.40
PAPO_H w/ Single Ent | 63.34 | ↑ 1.53 | 58.36 | ↑ 5.96 | 61.12 | ↑ 3.50
PAPO_H w/ Double Ent | 63.50 | ↑ 1.53 | 59.37 | ↑ 7.96 | 61.66 | ↑ 4.39

Key Findings

Among the tested regularization methods, Double Entropy Loss achieves the best overall improvement of 4.4% while successfully preventing collapse. This method encourages the model to keep both the original and masked policy entropies low, providing effective regularization against KLprcp hacking.

Resources

Code

💻 Complete implementation of PAPO algorithm with training scripts and evaluation tools.

GitHub

Models & Data

🍉 Models & datasets in Hugging Face Collection: Qwen2.5-VL models (3B and 7B) fine-tuned with PAPO.

Hugging Face

Paper

📄 Detailed technical paper with comprehensive results and discussions.

arXiv

Citation

@article{wang2025perception,
  title={Perception-Aware Policy Optimization for Multimodal Reasoning},
  author={Wang, Zhenhailong and Guo, Xuehang and Stoica, Sofia and Xu, Haiyang and Wang, Hongru and Ha, Hyeonjeong and Chen, Xiusi and Chen, Yangyi and Yan, Ming and Huang, Fei and others},
  journal={arXiv preprint arXiv:2507.06448},
  year={2025}
}