Modern conversational agents like ChatGPT and Alexa+ rely on predefined policies specifying metadata, response styles, and tool-usage rules. As these LLM-based systems expand to support diverse business and user queries, such policies, often implemented as in-context prompts, are becoming increasingly complex and lengthy, making faithful adherence difficult and imposing large fixed computational costs. With the rise of multimodal agents, policies that govern visual and multimodal behaviors are critical but remain understudied. Prior prompt-compression work mainly shortens task templates and demonstrations, while existing policy-alignment studies focus only on text-based safety rules. We introduce Multimodal Policy Internalization (MPI), a new task that internalizes reasoning-intensive multimodal policies into model parameters, enabling stronger policy-following without including the policy during inference. MPI poses unique data and algorithmic challenges. We build two datasets spanning synthetic and real-world decision-making and tool-using tasks and propose TriMPI, a three-stage training framework. TriMPI first injects policy knowledge via continual pretraining, then performs supervised finetuning, and finally applies PolicyRollout, a GRPO-style reinforcement learning extension that augments rollouts with policy-aware responses for grounded exploration. TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting. As the first work on multimodal policy internalization, we provide datasets, training recipes, and comprehensive evaluations to foster future research.
Motivation of the proposed Multimodal Policy Internalization task. The goal is to enhance the policy-following abilities of a large multimodal model (LMM) without requiring the policy to be provided in-context during inference, thereby improving both performance and efficiency.
ClevrPolicy, a new dataset focused on reasoning-intensive, visually dependent decision-making. Left: Illustration of policy generation, where a decision tree is first generated and converted into natural language instructions. Right: Example input-output pair corresponding to the policy. The policy is available only during training and not during inference.
GTAPolicy, a new dataset focused on complex tool-using policies with real-world images and queries. Left: Illustration of the policy, which consists of two major parts: tool descriptions and tool-calling rules. Right: Example input-output pair corresponding to the policy. The visual input can contain multiple images.
We observe a key limitation in the baseline methods: they focus solely on learning the expected behavior without direct access to the original policy. This limitation becomes more evident for complex policies, where directly learning output patterns is increasingly difficult. This raises the question of how to make direct use of the original policy during training. To address this challenge, we propose TriMPI, a three-stage training framework that (1) warms up the model via continual pretraining on the original policy, and (2) introduces PolicyRollout, a new RL algorithm that enables more policy-aware exploration (detailed in the following sections).
TriMPI consists of three stages: (1) Visually-Masked Continual Pretraining (VM-CPT); (2) Supervised Finetuning with Chain-of-Thought (CoT SFT); (3) Reinforcement Learning (RL) with PolicyRollout. An illustration is shown above.
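To make the pipeline concrete, below is a minimal sketch of the three stages as a simple Python configuration. The stage names follow the paper; the objective strings, the policy_in_input flag, and the driver loop are illustrative assumptions rather than the released training recipe.

from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    policy_in_input: bool   # whether the policy text appears in the training input
    objective: str

TRIMPI_STAGES = [
    # 1) Continual pretraining on the original policy text to inject policy knowledge.
    Stage("VM-CPT", policy_in_input=True, objective="language-modeling loss on the policy"),
    # 2) Supervised finetuning on chain-of-thought targets with the policy removed
    #    from the prompt, matching the inference-time input.
    Stage("CoT SFT", policy_in_input=False, objective="cross-entropy on CoT + final answer"),
    # 3) GRPO-style RL whose rollout group mixes no-policy and policy-aware samples
    #    (PolicyRollout; see the sketch further below).
    Stage("RL + PolicyRollout", policy_in_input=False, objective="group-relative advantage"),
]

for stage in TRIMPI_STAGES:
    print(f"{stage.name}: policy_in_input={stage.policy_in_input}, objective={stage.objective}")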
From our initial experiments, we observe that although GRPO and DAPO yield substantial empirical gains, exploration in the RL stage remains insufficiently grounded in the policy. This limitation becomes more pronounced with complex policies, where ungrounded exploration rarely produces positive rewards. Simply adding the policy in-context during training is ineffective, since the model lacks access to the policy at inference time, introducing a misalignment between training and inference. In the next section, we present our solution through a modified rollout phase.
To further improve the effectiveness of the RL stage for MPI, we introduce PolicyRollout (PoRo), a simple yet effective extension to GRPO-style algorithms that augments the rollout space with policy-aware responses. Specifically, during the rollout stage, for each sampled instance, we construct a variant by inserting the policy in-context. We then allow the current policy model to generate an additional set of responses conditioned on the query Q, the image I, and the policy P. These policy-aware responses are concatenated with the no-policy responses to form the rollout space, after which we proceed with group-based advantage estimation. An illustration is shown above.
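The two pieces described above, building the mixed rollout group and computing group-based advantages over it, can be sketched in a few lines of Python. This is a simplified illustration, not the paper's implementation: the prompt template, the sample counts, and the binary rewards are assumptions, and the image input is omitted for brevity.

import statistics

def build_rollout_prompts(query: str, policy: str, n_plain: int, n_policy: int):
    """Build the inputs for one rollout group: n_plain samples use the
    inference-time prompt (no policy), and n_policy samples use a variant
    with the policy inserted in-context (prompt template is an assumption)."""
    plain = [query] * n_plain
    policy_aware = [f"{policy}\n\n{query}"] * n_policy
    return plain + policy_aware

def group_advantages(rewards):
    """GRPO-style group-based advantage estimation over the combined
    (no-policy + policy-aware) rollout group: subtract the group mean and
    divide by the group standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Toy usage: 4 no-policy rollouts and 4 policy-aware rollouts, each scored with a
# binary task reward; advantages are computed over all 8 responses together.
rewards = [0.0, 1.0, 0.0, 0.0] + [1.0, 1.0, 0.0, 1.0]
print([round(a, 2) for a in group_advantages(rewards)])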
End-task Performance. We present the evaluation results on ClevrPolicy and GTAPolicy in terms of task performance. Our best-performing model achieves up to 70.7% and 79.4% absolute gains in accuracy over the CoT SFT baseline and the in-context setting, respectively. Comprehensive ablation studies show that the key components proposed in TriMPI, including the RL stage, the VM-CPT stage, and the PolicyRollout algorithm, are all crucial to the final performance.
Efficiency. We also report efficiency metrics as shown above, including the number of prompt tokens and the prefill inference time (i.e., the first forward pass on the input prompt) before and after internalization. All metrics are computed on Qwen2.5-VL-7B. With the policy removed from the prompt, we observe reductions of up to 93.9% in prompt tokens and 85.7% in prefill inference time.
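For reference, both metrics can be measured with a few lines of code. The sketch below is illustrative only: it uses a small text-only stand-in checkpoint and a synthetic policy string rather than the paper's Qwen2.5-VL-7B setup with image inputs.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B"   # placeholder checkpoint for the sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

POLICY = "Policy rule: " + "if condition then action. " * 400   # stand-in for a long policy
QUERY = "Which tool should be used to answer the user's request?"

@torch.no_grad()
def prefill_stats(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt")
    start = time.perf_counter()
    model(**inputs)                         # prefill: one forward pass over the prompt
    elapsed = time.perf_counter() - start   # on GPU, call torch.cuda.synchronize() first
    return inputs["input_ids"].shape[1], elapsed

tok_in_context, t_in_context = prefill_stats(POLICY + "\n" + QUERY)   # policy in prompt
tok_internal, t_internal = prefill_stats(QUERY)                       # policy internalized
print(f"prompt-token reduction: {1 - tok_internal / tok_in_context:.1%}")
print(f"prefill-time reduction: {1 - t_internal / t_in_context:.1%}")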
Qualitative Analysis. On the left, we show the inputs and the ground-truth reasoning trajectory annotated with the original policy sections. On the right, the CoT SFT model fails to correctly recall a policy condition, and the CoT SFT + GRPO model makes an incorrect decision at the fifth condition; both lead to an incorrect final outcome. In contrast, the proposed TriMPI correctly recalls all policy conditions and performs reasoning consistent with the input image.
Further Evaluations Beyond End-task Performance. We also conduct comprehensive evaluations of generalization to policy updates, the quality of the injected policy knowledge, and robustness to catastrophic forgetting. We refer interested readers to the paper for details.
@misc{wang2025multimodalpolicyinternalizationconversational,
title={Multimodal Policy Internalization for Conversational Agents},
author={Zhenhailong Wang and Jiateng Liu and Amin Fazel and Ritesh Sarkhel and Xing Fan and Xiang Li and Chenlei Guo and Heng Ji and Ruhi Sarikaya},
year={2025},
eprint={2510.09474},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.09474},
}