Vision-Language Models

Trending
26 papers
6.0 viability
+89% (30d)

State of the Field

Current research on vision-language models (VLMs) increasingly targets inherent limitations such as cross-modal misalignment, object hallucination, and limited adaptability in real-world applications. Recent work has introduced frameworks like Confusion-Aware Prompt Tuning, which improves discriminability by learning from a model's own misclassifications, and Spatial Credit Redistribution, which mitigates hallucination by redistributing activation credit toward contextual information. The integration of large language models with VLMs for tasks like forest change analysis highlights the potential of interactive, user-driven applications in environmental monitoring. Meanwhile, dynamic suppression of language priors aims to reduce object hallucinations, and open-set test-time adaptation strategies are being developed to improve robustness against distribution shifts. Together, these efforts signal a shift toward practical, user-friendly systems capable of handling complex multimodal tasks, with significant implications for remote sensing, e-commerce, and autonomous systems.
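
The teaser abstracts below stop short of the methods themselves, so a concrete sketch may help. The following is a minimal, hypothetical rendering of the confusion-aware idea, not CAPT's published algorithm: reweight a prompt-tuning cross-entropy loss so that classes the model frequently confuses contribute more gradient. The function name and weighting scheme are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def confusion_weighted_loss(logits, targets, confusion):
        # logits: (B, C) image-to-class similarities from a frozen VLM head
        # targets: (B,) ground-truth class indices
        # confusion: (C, C) running confusion matrix from past predictions
        # Hypothetical weighting: classes with more off-diagonal (confused)
        # mass get larger weights, pushing learned prompts to separate them.
        off_diag = confusion.sum(dim=1) - confusion.diagonal()
        weights = 1.0 + off_diag / (confusion.sum(dim=1) + 1e-8)
        ce = F.cross_entropy(logits, targets, reduction="none")
        return (weights[targets] * ce).mean()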

Last updated Mar 4, 2026

Papers

1–10 of 26
Research Paper · Feb 11, 2026

Chatting with Images for Introspective Visual Thinking

Current large vision-language models (LVLMs) typically rely on text-only reasoning based on a single-pass visual encoding, which often leads to loss of fine-grained visual information. Recently the pr...

8.0 viability
Research Paper · Jan 21, 2026

Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis

The increasing availability of high-resolution satellite imagery, together with advances in deep learning, creates new opportunities for enhancing forest monitoring workflows. Two central challenges i...

8.0 viability
Research Paper · Mar 3, 2026

CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment

Vision-language models like CLIP have achieved remarkable progress in cross-modal representation learning, yet suffer from systematic misclassifications among visually and semantically similar categor...

8.0 viability
Research Paper · Feb 25, 2026

Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models

Vision-language models (VLMs) frequently hallucinate objects absent from the input image. We trace this failure to spatial credit collapse: activation credit concentrating on sparse visual patches in ...

8.0 viability
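
The abstract above is cut off before the method, but the stated idea of moving credit away from a few dominant patches can be illustrated with a toy operation (an assumption for illustration, not the paper's mechanism): cap each patch's attention share and return the clipped mass uniformly to all patches, so contextual patches regain credit.

    import torch

    def redistribute_credit(attn, cap=0.05):
        # attn: (num_patches,) attention over visual patches, summing to 1
        clipped = torch.clamp(attn, max=cap)    # trim dominant patches
        excess = attn.sum() - clipped.sum()     # mass taken from them
        return clipped + excess / attn.numel()  # spread it uniformly
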
Research Paper · Feb 27, 2026

ProtoDCS: Towards Robust and Efficient Open-Set Test-Time Adaptation for Vision-Language Models

Large-scale Vision-Language Models (VLMs) exhibit strong zero-shot recognition, yet their real-world deployment is challenged by distribution shifts. While Test-Time Adaptation (TTA) can mitigate this...

7.0 viability
Research Paper · Feb 24, 2026

Recursive Belief Vision Language Model

Current vision-language-action (VLA) models struggle with long-horizon manipulation under partial observability. Most existing approaches remain observation-driven, relying on short context windows or...

7.0 viability
Research Paper · Feb 25, 2026

NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors

Object hallucination is a critical issue in Large Vision-Language Models (LVLMs), where outputs include objects that do not appear in the input image. A natural question arises from this phenomenon: W...

7.0 viability
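
NoLan's exact mechanism is not visible in the truncated abstract above. One common decoding-time way to suppress language priors, offered here purely as a hedged illustration in the spirit of contrastive decoding, is to penalize next-token logits that a text-only forward pass would favor:

    def suppress_language_prior(logits_with_image, logits_text_only, alpha=1.0):
        # Both inputs: (vocab_size,) next-token logits from the same LVLM,
        # one pass conditioned on the image, one with the image dropped.
        # Tokens driven mostly by the language prior get pushed down.
        return (1.0 + alpha) * logits_with_image - alpha * logits_text_only
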
Research Paper · Jan 16, 2026

MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models

As vision-language models (VLMs) tackle increasingly complex and multimodal tasks, the rapid growth of Key-Value (KV) cache imposes significant memory and computational bottlenecks during inference. W...

7.0 viability
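
To make the KV-cache pressure mentioned above concrete, here is back-of-the-envelope arithmetic with illustrative numbers (assumptions, not figures from the paper): a 32-layer model with 32 key-value heads of dimension 128 holding a 32k-token multimodal context in fp16 needs roughly 16 GiB of cache per sequence.

    def kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                       seq_len=32768, batch=1, bytes_per_value=2):
        # Factor of 2 covers keys and values; bytes_per_value=2 is fp16/bf16.
        return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

    print(kv_cache_bytes() / 2**30)  # 16.0 GiB under the assumed settings
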
Research Paper · Feb 17, 2026

Visual Persuasion: What Influences Decisions of Vision-Language Models?

The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, decidin...

6.0 viability
Research Paper · Jan 15, 2026

Global Context Compression with Interleaved Vision-Text Transformation

Recent achievements of vision-language models in end-to-end OCR point to a new avenue for low-loss compression of textual information. This motivates revisiting earlier works that render the Transformer's input ...

6.0 viability
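
The entry above describes rendering text as images so that a vision encoder can recover it OCR-style. As a sketch of the rendering step only (the helper below is hypothetical, not the paper's pipeline), a string can be rasterized into a grayscale image with Pillow:

    from PIL import Image, ImageDraw

    def render_text_as_image(text, chars_per_line=100, width=896, line_height=16):
        # Hypothetical helper: rasterize text so a vision-language model
        # can 'read' it via OCR-like decoding instead of text tokens.
        lines = [text[i:i + chars_per_line]
                 for i in range(0, len(text), chars_per_line)]
        img = Image.new("L", (width, line_height * max(len(lines), 1)), 255)
        draw = ImageDraw.Draw(img)
        for i, line in enumerate(lines):
            draw.text((4, i * line_height), line, fill=0)
        return img
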
Page 1 of 3