Current research on vision-language models (VLMs) increasingly targets inherent limitations such as misalignment, hallucination, and poor adaptability in real-world applications. Recent work introduces frameworks like Confusion-Aware Prompt Tuning, which sharpens model discriminability by learning from misclassifications, and Spatial Credit Redistribution, which mitigates hallucination by redistributing activation credit toward contextual information rather than a few dominant patches. Additionally, the integration of large language models with VLMs for tasks like forest change analysis highlights the potential for interactive, user-driven applications in environmental monitoring. Meanwhile, dynamic suppression of language priors aims to reduce object hallucinations, and open-set test-time adaptation strategies are being developed to improve robustness against distribution shifts. Together, these efforts indicate a shift toward more practical, user-friendly systems capable of handling complex multimodal tasks, with significant implications for sectors such as remote sensing, e-commerce, and autonomous systems.
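To make the language-prior suppression idea concrete, here is a minimal sketch in the style of contrastive decoding; it is not NoLan's actual method, and the function name, the `alpha` weight, and the two-pass setup are illustrative assumptions. The intuition: score the next token once conditioned on the image and once on the prompt alone, then penalize tokens the text-only prior would emit anyway, since those are a common source of object hallucination.

```python
import torch

def prior_suppressed_logits(logits_with_image: torch.Tensor,
                            logits_text_only: torch.Tensor,
                            alpha: float = 1.0) -> torch.Tensor:
    """Contrastive adjustment of next-token logits (hypothetical sketch).

    logits_with_image: [vocab_size] logits conditioned on (image, prompt).
    logits_text_only:  [vocab_size] logits conditioned on the prompt alone,
                       i.e., the language prior.
    alpha:             suppression strength; alpha=0 recovers plain decoding.
    """
    # Amplify image-grounded evidence and subtract the language-only prior,
    # downweighting tokens the model favors without looking at the image.
    return (1 + alpha) * logits_with_image - alpha * logits_text_only


# Usage: greedy next-token choice with the adjusted scores.
if __name__ == "__main__":
    vocab_size = 8
    logits_img = torch.randn(vocab_size)  # stand-in for a real VLM forward pass
    logits_txt = torch.randn(vocab_size)  # stand-in for a text-only forward pass
    next_token = prior_suppressed_logits(logits_img, logits_txt).argmax().item()
    print(f"next token id: {next_token}")
```

The linear combination is only one possible design; published variants differ in how the text-only distribution is obtained (e.g., masking or distorting the image) and in how aggressively low-probability tokens are filtered before the subtraction.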
Top papers
- CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment (8.0)
- Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis (8.0)
- Chatting with Images for Introspective Visual Thinking (8.0)
- Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models (8.0)
- ProtoDCS: Towards Robust and Efficient Open-Set Test-Time Adaptation for Vision-Language Models (7.0)
- NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors (7.0)
- Recursive Belief Vision Language Model (7.0)
- MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models (7.0)
- Global Context Compression with Interleaved Vision-Text Transformation (6.0)
- StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues (6.0)
- Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models (6.0)
- Can Vision-Language Models Understand Construction Workers? An Exploratory Study (6.0)
- Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models (6.0)
- Visual Persuasion: What Influences Decisions of Vision-Language Models? (6.0)
- Let's Roll a BiFTA: Bi-refinement for Fine-grained Text-visual Alignment in Vision-Language Models (6.0)
- ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models (6.0)
- Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks (5.0)
- ReasonEdit: Editing Vision-Language Models using Human Reasoning (5.0)
- Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation (5.0)
- Vision-Language Models Unlock Task-Centric Latent Actions (5.0)
- Beyond Static Cropping: Layer-Adaptive Visual Localization and Decoding Enhancement (5.0)
- Narrow fine-tuning erodes safety alignment in vision-language agents (5.0)
- ATA: Bridging Implicit Reasoning with Attention-Guided and Action-Guided Inference for Vision-Language Action Models (5.0)
- VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation (5.0)
- ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport (5.0)
- Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs (5.0)
- Understanding the Fine-Grained Knowledge Capabilities of Vision-Language Models (4.0)