Papers
1–4 of 4

Are Video Reasoning Models Ready to Go Outside?
In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantia...
VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting
Video reasoning requires models to locate and track question-relevant evidence across frames. While reinforcement learning (RL) with verifiable rewards improves accuracy, it still struggles to achieve...
Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning
Vision-Language Models have excelled at textual reasoning, but they often struggle with fine-grained spatial understanding and continuous action planning, failing to simulate the dynamics required for...
Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning
Multi-modal Large Language Models (MLLMs) have significantly advanced video reasoning, yet Video Question Answering (VideoQA) remains challenging due to its demand for temporal causal reasoning and ev...