State of the Field
Current research on vision-language models (VLMs) increasingly focuses on inherent limitations such as cross-modal misalignment, object hallucination, and limited adaptability in real-world applications. Recent work introduces frameworks like Confusion-Aware Prompt Tuning, which improves model discriminability by learning from misclassifications, and Spatial Credit Redistribution, which mitigates hallucination by redistributing activation credit to contextual information. The integration of large language models with VLMs for tasks such as forest change analysis highlights the potential for interactive, user-driven applications in environmental monitoring. Meanwhile, dynamic suppression of language priors aims to reduce object hallucinations, and open-set test-time adaptation strategies are being developed to improve robustness against distribution shifts. Together, these efforts signal a shift toward practical, user-friendly systems capable of handling complex multimodal tasks, with significant implications for sectors such as remote sensing, e-commerce, and autonomous systems.
Papers (1–10 of 26)
Chatting with Images for Introspective Visual Thinking
Current large vision-language models (LVLMs) typically rely on text-only reasoning based on a single-pass visual encoding, which often leads to loss of fine-grained visual information. Recently the pr...
Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis
The increasing availability of high-resolution satellite imagery, together with advances in deep learning, creates new opportunities for enhancing forest monitoring workflows. Two central challenges i...
CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment
Vision-language models like CLIP have achieved remarkable progress in cross-modal representation learning, yet suffer from systematic misclassifications among visually and semantically similar categor...
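The truncated abstract above only names the idea of learning from misclassifications; CAPT's actual objective is not given here. As a minimal sketch of that ingredient, the snippet below mines the most-confused class pairs from a batch of zero-shot predictions, which a prompt-tuning loss could then emphasize. The function name and the pair-mining heuristic are illustrative assumptions, not the paper's method.

```python
import numpy as np

def confusion_pairs(preds, labels, num_classes, top_k=2):
    """Count off-diagonal confusions and return the most-confused class pairs.

    Illustrative helper, not CAPT's actual algorithm: it only identifies
    which category pairs a tuned prompt should learn to separate.
    """
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for p, t in zip(preds, labels):
        cm[t, p] += 1
    np.fill_diagonal(cm, 0)          # ignore correct predictions
    sym = cm + cm.T                  # (a, b) and (b, a) count as one pair
    pairs = [(i, j, int(sym[i, j]))
             for i in range(num_classes)
             for j in range(i + 1, num_classes) if sym[i, j] > 0]
    pairs.sort(key=lambda x: -x[2])
    return pairs[:top_k]

# toy run: classes 0 and 1 are systematically confused
preds  = [1, 1, 0, 2, 0, 1]
labels = [0, 0, 1, 2, 0, 1]
print(confusion_pairs(preds, labels, num_classes=3))  # → [(0, 1, 3)]
```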
Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models
Vision-language models (VLMs) frequently hallucinate objects absent from the input image. We trace this failure to spatial credit collapse: activation credit concentrating on sparse visual patches in ...
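The abstract describes the failure mode (credit collapsing onto a few dominant patches) but not the paper's redistribution rule. A generic smoothing heuristic conveys the shape of the idea: cap each patch's share of attribution credit and spread the excess over all patches. The cap value and the uniform spreading are assumptions for illustration only.

```python
import numpy as np

def redistribute_credit(scores, cap=0.2):
    """Cap dominant patch scores and share the excess over all patches.

    A generic smoothing heuristic, not the paper's algorithm: the output
    remains a valid credit distribution (non-negative, sums to 1).
    """
    scores = np.asarray(scores, dtype=float)
    scores = scores / scores.sum()               # normalize to a distribution
    excess = np.clip(scores - cap, 0.0, None).sum()
    capped = np.minimum(scores, cap)
    return capped + excess / len(scores)         # excess shared uniformly

collapsed = [0.9, 0.05, 0.03, 0.02]              # credit on one patch
print(redistribute_credit(collapsed).round(3))   # → [0.375 0.225 0.205 0.195]
```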
ProtoDCS: Towards Robust and Efficient Open-Set Test-Time Adaptation for Vision-Language Models
Large-scale Vision-Language Models (VLMs) exhibit strong zero-shot recognition, yet their real-world deployment is challenged by distribution shifts. While Test-Time Adaptation (TTA) can mitigate this...
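ProtoDCS's specifics are cut off above, but a common building block in test-time adaptation pipelines is confidence filtering: keep only the low-entropy test samples before adapting on them. The sketch below shows that generic step; the quantile threshold is an assumed hyperparameter, and nothing here reflects the paper's prototype-based design.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def select_confident(logits, quantile=0.5):
    """Keep the lowest-entropy (most confident) test samples, a common
    filtering step before test-time adaptation updates."""
    p = softmax(np.asarray(logits, dtype=float))
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)
    threshold = np.quantile(entropy, quantile)
    return np.where(entropy <= threshold)[0]

logits = [[5.0, 0.0, 0.0],   # confident
          [1.0, 0.9, 0.8],   # near-uniform, uncertain
          [4.0, 0.1, 0.0],   # confident
          [0.1, 0.0, 0.1]]   # near-uniform, uncertain
print(select_confident(logits))  # → [0 2]
```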
Recursive Belief Vision Language Model
Current vision-language-action (VLA) models struggle with long-horizon manipulation under partial observability. Most existing approaches remain observation-driven, relying on short context windows or...
NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
Object hallucination is a critical issue in Large Vision-Language Models (LVLMs), where outputs include objects that do not appear in the input image. A natural question arises from this phenomenon: W...
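NoLan's exact suppression mechanism is not visible in the snippet above. A well-known way to down-weight language priors at decoding time is a contrastive logit adjustment: penalize tokens whose score comes mostly from a text-only pass rather than from visual evidence. The sketch below shows that general idea; the `alpha` weight and the toy vocabulary are illustrative assumptions.

```python
import numpy as np

def suppress_prior(logits_with_image, logits_text_only, alpha=1.0):
    """Contrastive-style adjustment: boost tokens supported by the image,
    penalize tokens the language prior alone would already predict."""
    li = np.asarray(logits_with_image, dtype=float)
    lt = np.asarray(logits_text_only, dtype=float)
    return (1 + alpha) * li - alpha * lt

# toy vocab: ["dog", "cat", "car"]
with_img  = [2.0, 2.1, 0.5]   # image-conditioned pass barely prefers "cat"
text_only = [0.0, 2.0, 0.0]   # but the prior alone already favors "cat"
print(suppress_prior(with_img, text_only).argmax())  # → 0 ("dog")
```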
MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models
As vision-language models (VLMs) tackle increasingly complex and multimodal tasks, the rapid growth of Key-Value (KV) cache imposes significant memory and computational bottlenecks during inference. W...
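The KV-cache bottleneck mentioned above is easy to make concrete with back-of-the-envelope arithmetic: standard multi-head attention caches full K and V tensors per layer and token, while a latent-attention design (the MLA family) caches one compressed latent per token instead. All dimensions below are illustrative, not DeepSeek's or any real model's configuration.

```python
def kv_cache_bytes(layers, tokens, heads, head_dim, dtype_bytes=2):
    """Standard multi-head attention: cache K and V per layer and token."""
    return 2 * layers * tokens * heads * head_dim * dtype_bytes

def latent_cache_bytes(layers, tokens, latent_dim, dtype_bytes=2):
    """Latent-attention style: cache one compressed latent per token."""
    return layers * tokens * latent_dim * dtype_bytes

# illustrative fp16 model: 32 layers, 8K context
mha = kv_cache_bytes(layers=32, tokens=8192, heads=32, head_dim=128)
mla = latent_cache_bytes(layers=32, tokens=8192, latent_dim=512)
print(f"MHA: {mha / 2**20:.0f} MiB, latent: {mla / 2**20:.0f} MiB")
# → MHA: 4096 MiB, latent: 256 MiB
```

At these toy dimensions the latent cache is 16x smaller, which is the kind of saving that motivates migrating existing VLMs to latent attention.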
Visual Persuasion: What Influences Decisions of Vision-Language Models?
The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, decidin...
Global Context Compression with Interleaved Vision-Text Transformation
Recent achievements of vision-language models in end-to-end OCR point to a new avenue for low-loss compression of textual information. This motivates earlier works that render the Transformer's input ...