VisionCoach: Reinforcing Grounded Video
Reasoning via Visual-Perception Prompting

University of North Carolina, Chapel Hill
{daeun, shoubin, yuezhan, mbansal}@cs.unc.edu

📌 TL;DR

VisionCoach is an input-adaptive RL framework that improves spatio-temporal grounding in video reasoning through training-time visual prompting. Visual prompts are selectively applied to challenging inputs to amplify question-relevant evidence and suppress distractors, enabling grounded reasoning directly on raw videos without external tools at inference.

Abstract

Video reasoning requires models to locate and track question-relevant evidence across frames. While reinforcement learning (RL) with verifiable rewards improves accuracy, it still struggles to achieve reliable spatio-temporal grounding during the reasoning process. Moreover, improving grounding typically relies on scaled training data or inference-time perception tools, which increases annotation or computational costs. To address this challenge, we propose VisionCoach, an input-adaptive RL framework that improves spatio-temporal grounding through visual prompting as training-time guidance. During RL training, visual prompts are selectively applied to challenging inputs to amplify question-relevant evidence and suppress distractors. The model then internalizes these improvements through self-distillation, enabling grounded reasoning directly on raw videos without visual prompting at inference. VisionCoach consists of two components: (1) Visual Prompt Selector, which predicts appropriate prompt types conditioned on the video and question, and (2) Spatio-Temporal Reasoner, optimized with RL under visual prompt guidance and object-aware grounding rewards that enforce object identity consistency and multi-region bounding-box IoU. Extensive experiments demonstrate that VisionCoach achieves state-of-the-art performance across diverse video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA), while maintaining a single efficient inference pathway without external tools.

🔧 Method Overview

VisionCoach training pipeline:

(1) Visual Prompt Selector: Takes video and question as input to predict appropriate visual prompt types. Applies prompts selectively only to challenging inputs.

(2) Spatio-Temporal Reasoner: Optimized with RL under visual prompt guidance, using object identity consistency and multi-region bounding-box IoU as rewards.

(3) Self-Distillation: The model internalizes improved reasoning from visual prompts, enabling grounded reasoning on raw videos without external tools at inference.
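The three-stage loop above can be sketched as follows. This is a minimal illustration, not the paper's released code: the method names (`rollout`, `rl_update`, `distill`, `predict`, `apply_prompt`) and the difficulty threshold are hypothetical stand-ins for however the actual pipeline is implemented.

```python
DIFFICULTY_THRESHOLD = 0.5  # assumed cutoff separating easy from hard samples


def train_step(model, selector, video, question, answer, score, apply_prompt):
    """One input-adaptive RL step (sketch): prompt only hard inputs,
    then self-distill the guided behavior back onto the raw video."""
    raw_out = model.rollout(video, question)
    if score(raw_out, answer) >= DIFFICULTY_THRESHOLD:
        # Easy sample: standard RL update on the raw video.
        model.rl_update(video, question, raw_out)
        return "easy"
    # Hard sample: the selector picks a prompt type conditioned on the
    # video and question, and the prompt is rendered onto the frames.
    prompt_type = selector.predict(video, question)
    prompted = apply_prompt(video, prompt_type)
    guided_out = model.rollout(prompted, question)
    # RL update under visual prompt guidance.
    model.rl_update(prompted, question, guided_out)
    # Self-distillation: imitate the guided trace on the raw video,
    # so no visual prompt is needed at inference time.
    model.distill(video, question, target=guided_out)
    return "hard"
```

The key design point is that the prompted video appears only on the RL branch, while the distillation target is replayed against the raw video, which is what lets the model drop the prompts at inference.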

Figure. Overall method overview.
Figure. Reinforcement learning and visual prompting algorithm.
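The object-aware grounding reward combines object identity consistency with multi-region bounding-box IoU. One plausible formulation is sketched below; the multiplicative combination and the name-matching scheme are assumptions for illustration, not the paper's exact specification.

```python
def box_iou(a, b):
    """IoU between two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred, gold):
    """pred/gold: dicts mapping object name -> list of boxes (one per frame).
    Reward = identity-consistency term * mean multi-region IoU (sketch)."""
    if not gold:
        return 0.0
    # Identity consistency: fraction of gold objects the model names.
    matched = [name for name in gold if name in pred]
    id_score = len(matched) / len(gold)
    if not matched:
        return 0.0
    # Multi-region IoU: average frame-wise IoU over matched objects.
    ious = []
    for name in matched:
        for g, p in zip(gold[name], pred[name]):
            ious.append(box_iou(g, p))
    iou_score = sum(ious) / len(ious) if ious else 0.0
    return id_score * iou_score
```

Under this sketch, a trace that names the right objects but localizes them poorly, or localizes well under the wrong identity, both receive a reduced reward, which is the behavior the object-aware reward is meant to enforce.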

📊 Main Results

VisionCoach achieves state-of-the-art performance across diverse benchmarks including V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA.

Key results:
• Improved spatio-temporal grounding via training-time visual prompting
• No external tools at inference through self-distillation
• Enhanced object identity consistency and IoU via object-aware grounding rewards

📊 Detailed Results

V-STAR (Spatio-Temporal Reasoning)


Table 1. Performance on the V-STAR benchmark. VisionCoach achieves strong spatio-temporal reasoning across all evaluation dimensions.

General Video Understanding & Reasoning


Table 2. Performance across general video understanding and reasoning benchmarks. * indicates results from our implementation.


Figure. Inference efficiency comparison. VisionCoach consistently outperforms both text-centric (Qwen2.5-VL, Video-R1) and tool-calling (EgoR1, LongVT-RL) baselines, while operating at substantially lower inference latency than external tool-based approaches.

Qualitative Examples


Qualitative examples. Representative success cases where VisionCoach produces grounded and detailed video reasoning responses, illustrating how training-time visual prompting guides the model toward accurate spatio-temporal evidence and reduces hallucinations.

📈 Analysis


Spatio-temporal attention map. Visual prompting (VP) improves grounding in both temporal and spatial dimensions. The green box indicates the key frame, while the red box highlights the corresponding spatial region. Temporally, VP increases attention on the correct key frame containing the cowboy. Spatially, VP concentrates attention on the region corresponding to the queried visual attributes (e.g., the cowboy wearing specific clothing), while suppressing irrelevant regions.


Statistics of adaptive visual prompting. We analyze the distribution of sample difficulty labels (Easy vs. Hard), the visual prompts selected for hard samples by VP-SELECTOR, and the resulting reward gain from each prompt. 58% of samples are identified as hard, and for these hard instances VP-SELECTOR dynamically chooses among multiple visual prompting types. Applying these prompts consistently leads to reward improvements, with gains ranging from +56% to +66% across prompt types, indicating that adaptive visual prompting provides effective guidance and meaningful training signals on challenging examples.
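Statistics of this kind can be tallied from training logs with a short script. The record field names below (`difficulty`, `prompt_type`, `reward_before`, `reward_after`) are hypothetical stand-ins for however such logs are actually stored.

```python
from collections import defaultdict


def prompting_stats(records):
    """Return (fraction of hard samples, mean reward gain per prompt type)."""
    n_hard = sum(1 for r in records if r["difficulty"] == "hard")
    hard_frac = n_hard / len(records)
    # Collect reward gains (after - before prompting) for hard samples only.
    gains = defaultdict(list)
    for r in records:
        if r["difficulty"] == "hard":
            gains[r["prompt_type"]].append(r["reward_after"] - r["reward_before"])
    avg_gain = {p: sum(g) / len(g) for p, g in gains.items()}
    return hard_frac, avg_gain
```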

BibTeX

@article{lee2025visioncoach,
  title={VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting},
  author={Daeun Lee and Shoubin Yu and Yue Zhang and Mohit Bansal},
  journal={arXiv preprint},
  year={2026},
  note={University of North Carolina, Chapel Hill}
}