PAPO: Perception-Aware Policy Optimization for Multimodal Reasoning

Learning to perceive while learning to reason!

1University of Illinois Urbana-Champaign, 2Alibaba Group
*Equal Contribution, Corresponding Author

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO.

Perception Bottleneck in Multimodal Reasoning

Figure 1: Comprehensive error-type breakdown and an inference example comparing GRPO and PAPO.

We raise two fundamental research questions: "(1) Are there unique challenges in multimodal reasoning that do not arise in text-only settings? (2) If so, how can we design a better RLVR algorithm grounded in multimodal principles?" To explore the root causes of failure in multimodal reasoning, we follow a typical GRPO pipeline to train Qwen2.5-VL-3B on ViRL39K and manually examine and categorize error types across 200 error instances sampled from four benchmarks. We find that the majority of errors, 67.0%, stem from poor perception. We hypothesize that this perception bottleneck arises because the GRPO objective provides no incentive for the model to generate visually grounded responses.

PAPO Algorithm

Figure 2: Overview of the PAPO objective.

We propose Perception-Aware Policy Optimization (PAPO), a novel RLVR algorithm that enhances multimodal reasoning through visually grounded optimization. PAPO is a simple extension of GRPO that directly incorporates a perception-aware internal supervision signal into its training objective. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models.

Figure 3: Comparison of GRPO and PAPO objectives.

Key Innovation: Implicit Perception Loss

PAPO introduces an Implicit Perception Loss, a KL divergence term between the model's output distributions conditioned on the original image and on a corrupted (masked) version of it. Maximizing this divergence encourages the model to rely on meaningful visual content when generating responses, addressing the core perception bottleneck in multimodal reasoning.
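
To make this concrete, below is a minimal PyTorch-style sketch of how such a per-token KL term could be computed from two forward passes of the same policy, one conditioned on the original image and one on the masked image. The function name, tensor shapes, and the use of the full-vocabulary forward KL are our assumptions for illustration; the released implementation may use a different estimator.

```python
import torch
import torch.nn.functional as F

def implicit_perception_kl(logits_orig, logits_masked, response_mask):
    """Per-token KL( pi_theta(. | q, I) || pi_theta(. | q, I_mask) ),
    averaged over the response tokens of a sampled rollout.

    logits_orig, logits_masked: [B, T, V] logits for the same sampled response,
        conditioned on the original and the masked image respectively.
    response_mask: [B, T], 1 on response tokens, 0 on prompt/padding.
    """
    logp_orig = F.log_softmax(logits_orig, dim=-1)
    logp_mask = F.log_softmax(logits_masked, dim=-1)
    kl = (logp_orig.exp() * (logp_orig - logp_mask)).sum(dim=-1)   # [B, T]
    return (kl * response_mask).sum() / response_mask.sum().clamp(min=1)

# PAPO *maximizes* this divergence, so it enters the minimized training loss
# with a negative weight:  loss = grpo_loss - gamma * implicit_perception_kl(...)
```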

Results

Main Results

Table 1: Performance (avg@8 accuracy, %) of Qwen2.5-VL-3B and 7B models trained with GRPO and PAPO on general and more vision-dependent multimodal reasoning tasks. MathVerseV refers to the vision-centric subset of MathVerse. PAPO_H denotes the variant with a higher KLprcp weighting (γ = 0.02); the default PAPO uses γ = 0.01. Δ%rel indicates the averaged relative gain over GRPO for each task. We observe consistent improvements with both PAPO and PAPO_H, with gains approaching 8% on heavily vision-dependent tasks.
  • 4.4% overall improvement
  • 8.0% improvement on vision-dependent tasks
  • 30.5% reduction in perception errors

PAPO consistently outperforms GRPO across all benchmarks, with particularly pronounced improvements on vision-dependent tasks. The results demonstrate that our simple yet effective approach successfully addresses the perception bottleneck in multimodal reasoning without requiring additional computational resources or external models.

Ablation Studies

Impact of Implicit Perception Loss Weighting (γ)

Method | General AVG | General Δ%rel | Vision AVG | Vision Δ%rel | Overall AVG | Overall Δ%rel
GRPO | 51.89 | – | 42.97 | – | 47.92 | –
PAPO (γ = 0.005) | 52.40 | ↑ 1.19 | 43.73 | ↑ 1.92 | 48.55 | ↑ 1.51
PAPO (γ = 0.01) | 52.53 | ↑ 1.73 | 45.17 | ↑ 4.52 | 49.26 | ↑ 2.97
PAPO (γ = 0.02) | 53.39 | ↑ 3.38 | 45.57 | ↑ 5.60 | 49.92 | ↑ 4.36
PAPO (γ = 0.04, collapsed) | 31.24 | ↓ 43.15 | 38.31 | ↓ 14.09 | 34.38 | ↓ 28.46

Key Findings

  • A larger γ, up to 0.02, tends to yield more pronounced improvements, especially on more vision-dependent tasks.
  • γ should not be set too large (e.g., 0.04), as this causes severe model collapse that cannot be effectively regularized (detailed in later sections).
  • Without additional regularization, setting γ = 0.02 for 3B models and γ = 0.01 for 7B models serves as a good default (a suggested configuration is sketched below).
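
For reference, the defaults above can be summarized in a small configuration helper. This is purely illustrative; the function and field names are hypothetical and do not mirror the released training scripts.

```python
def papo_config(model_size: str) -> dict:
    """Hypothetical defaults collected from the findings above."""
    assert model_size in {"3B", "7B"}
    return {
        "gamma": 0.02 if model_size == "3B" else 0.01,  # Implicit Perception Loss weight
        "mask_strategy": "random",                      # random patch masking worked best
        "mask_ratio": 0.6,                              # 0.6-0.8 reported as a good range
        "double_entropy_loss": False,                   # only needed in high-gamma settings
    }
```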

Impact of Masking Strategy and Ratio

Model Size | Method | General AVG | General Δ%rel | Vision AVG | Vision Δ%rel | Overall AVG | Overall Δ%rel
3B | random @0.6 | 52.53 | ↑ 1.73 | 45.17 | ↑ 4.52 | 49.26 | ↑ 2.97
3B | semantic @0.6 | 52.13 | ↑ 0.34 | 43.78 | ↑ 1.88 | 48.42 | ↑ 1.02
7B | random @0.6 | 63.56 | ↑ 1.91 | 57.49 | ↑ 5.37 | 60.86 | ↑ 3.55
7B | semantic @0.6 | 63.39 | ↑ 1.48 | 56.83 | ↑ 3.89 | 60.47 | ↑ 2.55

Key Findings

  • Despite its simplicity, random masking empirically outperforms semantic-aware masking (a sketch of random patch masking follows this list).
  • A sufficiently large masking ratio (0.6-0.8) yields the best performance.
  • Increasing γ up to 0.02 generally improves performance, while an excessively large γ leads to model collapse.
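
Below is a minimal sketch of random patch masking as it could be used to build the masked input I_mask, assuming square patches and image dimensions divisible by the patch size. The function name and the patch-size default are assumptions for illustration, not the released implementation.

```python
import torch

def random_patch_mask(pixel_values, mask_ratio=0.6, patch_size=14):
    """Zero out a random subset of image patches (random masking @0.6).

    pixel_values: [B, C, H, W], with H and W assumed divisible by patch_size
    (14 matches common ViT backbones; this default is an assumption).
    """
    B, _, H, W = pixel_values.shape
    gh, gw = H // patch_size, W // patch_size
    # one keep/drop decision per patch, dropped with probability mask_ratio
    keep = (torch.rand(B, 1, gh, gw, device=pixel_values.device) >= mask_ratio).float()
    keep = keep.repeat_interleave(patch_size, dim=2).repeat_interleave(patch_size, dim=3)
    return pixel_values * keep   # dropped patches become black (zeros)
```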

PAPO + Remove Reference KL

Model Size | Method | General AVG | General Δ%rel | Vision AVG | Vision Δ%rel | Overall AVG | Overall Δ%rel
3B | GRPO | 51.89 | – | 42.97 | – | 47.92 | –
3B | GRPO + No KLref | 53.96 | ↑ 4.75 | 45.46 | ↑ 5.37 | 50.18 | ↑ 5.03
3B | PAPO + No KLref | 56.21 | ↑ 9.26 | 49.33 | ↑ 13.60 | 53.15 | ↑ 11.19
7B | GRPO | 62.51 | – | 54.11 | – | 58.78 | –
7B | GRPO + No KLref | 63.99 | ↑ 2.05 | 57.94 | ↑ 5.36 | 61.30 | ↑ 3.53
7B | PAPO + No KLref | 63.31 | ↑ 1.15 | 59.18 | ↑ 7.54 | 61.47 | ↑ 3.99

Combining PAPO with removing the reference KL penalty leads to further improvements, particularly pronounced on 3B models with an average relative gain of over 11%. The Double Entropy Loss becomes crucial in this setting to prevent KLprcp Hacking.
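
The variants in the table correspond to toggling the reference-KL and Implicit Perception terms in the overall loss. The sketch below is a hypothetical composition of scalar loss terms under that reading, not the exact code used in the paper.

```python
def papo_total_loss(policy_loss, kl_prcp, kl_ref=None, gamma=0.01, beta=0.0):
    """Compose the minimized loss: the GRPO policy loss, minus the Implicit
    Perception KL (which PAPO maximizes), plus an optional reference KL.
    Passing kl_ref=None or beta=0 corresponds to the "No KLref" variants above."""
    loss = policy_loss - gamma * kl_prcp
    if kl_ref is not None and beta > 0:
        loss = loss + beta * kl_ref
    return loss
```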

Qualitative Examples

[Qualitative examples 1-3]

Additional qualitative examples demonstrate how PAPO improves multimodal reasoning with better perception.

Implicit Perception Loss (KLprcp) Hacking

Empirically, we observe a unique form of model collapse that can occur when training with an excessively large KLprcp loss weighting (γ), which we refer to as KLprcp Hacking. In this section, we provide an in-depth analysis of KLprcp Hacking, characterizing how the model collapses when hacking occurs, the key factors that lead to collapse, its early signs, and prevention strategies. Intuitively, because we maximize a KL divergence that is theoretically unbounded, the model may "hack" KLprcp to drive the policy gradient loss toward zero. When hacking occurs, the model takes progressively larger update steps and suffers a drastic drop in rewards. To better regularize KLprcp in high-γ settings, we introduce Double Entropy Loss, an effective regularizer that prevents collapse while preserving performance.

Double Entropy Loss

This idea stems from our observation that rising rollout entropy in both π_θ and π_θ^mask is an early sign of collapse. Double Entropy Loss encourages the model to keep both entropies low, and can be expressed as:

    L(θ) = … + H(π_θ) + H(π_θ^mask),

where H denotes the entropy loss, computed as the negative log probability of the generated sequence, and the … part is identical to the PAPO objective shown above.
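
A minimal sketch of this regularizer, treating H as the negative log probability of the sampled response under each policy as described above; the function names and the per-token averaging are our assumptions.

```python
import torch

def sequence_entropy(logprobs, response_mask):
    """H as described above: negative log probability of the generated tokens,
    averaged over response positions. logprobs: [B, T] token log-probabilities."""
    return (-logprobs * response_mask).sum() / response_mask.sum().clamp(min=1)

def double_entropy_loss(logprobs_orig, logprobs_masked, response_mask):
    # Keep rollout entropy low under both pi_theta and pi_theta^mask by adding
    # both terms to the loss being minimized.
    return (sequence_entropy(logprobs_orig, response_mask)
            + sequence_entropy(logprobs_masked, response_mask))
```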

In-Depth Analysis of KLprcp Hacking

Collapsing Behavior

Figure 4: Collapsing behavior. A distinctive generation pattern in collapsed models is the production of irrelevant tokens. We verify this quantitatively by prompting GPT-4.1-mini to assign relatedness scores (0-10) to responses from the GRPO and the collapsed PAPO 7B models. We further compare the variance of KLprcp over the response tokens. As illustrated, the collapsed PAPO model exhibits significantly lower relatedness scores and higher variance in KLprcp across tokens.

Collapsed models exhibit significantly degraded coherence, with relatedness scores approximately 18% lower than baseline models. The variance in KLprcp loss across response tokens is about 8.4 times higher, indicating that the model learns to exploit the objective by generating highly unpredictable token sequences.

Collapsing Behavior Example

Example: Demonstration of how collapsed models generate unrelated tokens during reasoning.

Concrete Example: Find x

Question: Find x.
Ground Truth: √21
❌ Collapsed Model Output:
<think> In a right triangle, the Pythagorean theorem can be used to find the length of the hypotenuse. The theorem states that in a right triangle, the square of the length of the hypotenuse (the side opposite the right angle) is equal to the sum of the squares of the lengths of the other two sides. Here, the sides are 2, 5, and x. The side x is the hypotenuse because it is the longest side and the angle between the sides of 2 and 5 is a right angle. So, according to the Pythagorean theorem: x² = 2² + 5² = 4 + 25 = 29, x = √29. Therefore, the value of x is √29.</think>
\boxed{ \@ifundefined{FontU}{}{ \ifCLASSOPTIONcaptionscolor\colorbox{blue!10!white}{\normalfont$29$}% \else\@ifundefined{color}{\normalfont$29$}{\normalfont$29$}% \fi }@ The final answer is \boxed{ \@ifundefined{FontU}{}{ \ifCLASSOPTIONcaptionscolor\colorbox{blue!10!white}{\normalfont$x = \sqrt{29}$}% \else\@ifundefined{color}{\normalfont$x = \sqrt{29}$}{\normalfont$x = \sqrt{29}$}% \fi }@} However, the correct final answer without the box is \boxed{ \@ifundefined{FontU}{}{ \ifCLASSOPTIONcaptionscolor\colorbox{blue!10!white}{\normalfont$x = \sqrt{29}$}% \else\@ifundefined{color}{\normalfont$x = \sqrt{29}$}{\normalfont$x = \sqrt{29}$}% \fi }@} ...
✅ No Collapse (Correct) Output:
<think> In a right triangle, according to the Pythagorean theorem, the square of the hypotenuse is equal to the sum of the squares of the two right-angled sides. Here, the hypotenuse is 5, and one right-angle side is 2 and the other is x. Therefore: 5² = 2² + x², x = √21 </think>
\boxed{√21}

Key Observation: The collapsed model generates extensive unrelated LaTeX formatting tokens and repetitive content, demonstrating how KLprcp hacking leads to degraded output quality and coherence.

Early Signs of Collapse

  • Implicit Perception Loss: drastic decrease
  • Clipping Ratio (high): continuously increases
  • Entropy Loss: increases in both policies

Influential Factors


Loss Weighting (γ)

Higher KLprcp weighting (e.g., 0.04) is more likely to lead to collapse. We recommend γ ≤ 0.02 for safe training.

Model Size

Larger models (7B) are more sensitive to hacking under the same configuration compared to smaller models (3B).

Masking Ratio

Extreme masking ratios (e.g., 1.0, i.e., fully blacking out the image) lead to faster collapse than moderate ratios (0.6-0.8).

Prevention Methods

Method | General AVG | General Δ%rel | Vision AVG | Vision Δ%rel | Overall AVG | Overall Δ%rel
GRPO | 62.51 | – | 54.11 | – | 58.78 | –
PAPO_H w/ Inc. KLref | 63.14 | ↑ 1.12 | 57.03 | ↑ 3.99 | 60.42 | ↑ 2.40
PAPO_H w/ Single Ent. | 63.34 | ↑ 1.53 | 58.36 | ↑ 5.96 | 61.12 | ↑ 3.50
PAPO_H w/ Double Ent. | 63.50 | ↑ 1.53 | 59.37 | ↑ 7.96 | 61.66 | ↑ 4.39

Key Findings

Among the tested regularization methods, Double Entropy Loss achieves the best overall improvement of 4.4% while successfully preventing collapse. This method encourages the model to keep both the original and masked policy entropies low, providing effective regularization against KLprcp hacking.

Resources

Code

💻 Complete implementation of the PAPO algorithm with training scripts and evaluation tools.

GitHub

Models & Data

🍉 Models & datasets in Hugging Face Collection: Qwen2.5-VL models (3B and 7B) fine-tuned with PAPO.

Hugging Face

Paper

📄 Detailed technical paper with comprehensive results and discussions.

arXiv

Citation

@misc{wang2025perceptionawarepolicyoptimizationmultimodal,
  title={Perception-Aware Policy Optimization for Multimodal Reasoning},
  author={Zhenhailong Wang and Xuehang Guo and Sofia Stoica and Haiyang Xu and Hongru Wang and Hyeonjeong Ha and Xiusi Chen and Yangyi Chen and Ming Yan and Fei Huang and Heng Ji},
  year={2025},
  eprint={2507.06448},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.06448},
}