State of the Field
Recent advancements in multimodal AI are increasingly focused on enhancing model efficiency and robustness across diverse tasks. Techniques like scalable training-free data selection are streamlining the training process for vision-language models, allowing for significant performance retention with reduced data usage. Meanwhile, novel frameworks are being developed to address issues such as object hallucination and cross-modal interference, which have long plagued multimodal large language models. Approaches like causal decoding and modality-adaptive decoding are reshaping how models generate outputs, ensuring greater fidelity and relevance in responses. Additionally, new architectures are being introduced that leverage specialized routing pathways for different modalities, enhancing both modality-specific learning and cross-modal understanding. This shift towards more sophisticated, controlled interactions among modalities not only improves the reliability of AI systems but also opens avenues for practical applications in areas like autonomous systems, content generation, and interactive AI, where accurate and contextually aware outputs are paramount.
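To make the decoding-level ideas above concrete, here is a minimal sketch of contrastive, modality-adaptive logit adjustment, one common recipe for suppressing hallucinated tokens in multimodal models. The function name, the `alpha` weight, and the two-pass interface are illustrative assumptions, not any specific paper's method.

```python
import torch

def modality_contrastive_logits(logits_with_image: torch.Tensor,
                                logits_text_only: torch.Tensor,
                                alpha: float = 0.5) -> torch.Tensor:
    """Illustrative decoding adjustment: amplify the evidence the visual
    input contributes to next-token logits by contrasting a conditioned
    and an unconditioned forward pass. A sketch of the general idea, not
    a specific published algorithm.
    """
    # Tokens whose probability rises only when the image is present are
    # boosted; tokens the text prior alone favors are damped, which is
    # the intuition behind hallucination-reducing contrastive decoding.
    return (1 + alpha) * logits_with_image - alpha * logits_text_only

# Hypothetical usage at each decoding step, assuming a model that can be
# run with and without the visual input:
#   logits = modality_contrastive_logits(model(image, text), model(None, text))
```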
Papers
ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning
Large-scale Visual Instruction Tuning (VIT) has become a key paradigm for advancing the performance of vision-language models (VLMs) across various multimodal tasks. However, training on the large-sca...
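The preview above mentions training-free data selection; a generic way to realize that idea is coverage-based subset selection over precomputed embeddings. The sketch below uses greedy k-center selection as a stand-in; ScalSelect's actual scoring criterion is not reproduced here.

```python
import numpy as np

def select_diverse_subset(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedy k-center selection: a generic training-free heuristic that
    keeps the k samples best covering the embedding space. Illustrative
    only; not ScalSelect's published method.
    """
    # Normalize so distances reflect direction, not magnitude.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected = [0]  # seed with an arbitrary first sample
    dists = np.linalg.norm(emb - emb[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))  # farthest point from the selected set
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(emb - emb[nxt], axis=1))
    return selected
```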
MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts
We present MoST (Mixture of Speech and Text), a novel multimodal large language model that seamlessly integrates speech and text processing through our proposed Modality-Aware Mixture of Experts (MAMo...
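As a rough illustration of the modality-aware mixture-of-experts pattern named in this abstract, the toy layer below gives each modality its own router over a shared expert pool, so experts can specialize per modality while remaining reachable cross-modally. Every class name and dimension here is an assumption for illustration; this is not MoST's published MAMoE design.

```python
import torch
import torch.nn as nn

class ModalityAwareMoE(nn.Module):
    """Toy modality-aware MoE layer: one lightweight router per modality
    (e.g., 0 = text, 1 = speech), a single shared pool of FFN experts,
    and dense soft mixing for simplicity (real MoE layers typically use
    sparse top-k routing).
    """
    def __init__(self, dim: int, n_experts: int = 4, n_modalities: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                           nn.Linear(4 * dim, dim)) for _ in range(n_experts)]
        )
        self.routers = nn.ModuleList(
            [nn.Linear(dim, n_experts) for _ in range(n_modalities)]
        )

    def forward(self, x: torch.Tensor, modality: int) -> torch.Tensor:
        # Route with the modality-specific router, then mix expert outputs.
        weights = torch.softmax(self.routers[modality](x), dim=-1)      # (..., E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (..., D, E)
        return (expert_out * weights.unsqueeze(-2)).sum(dim=-1)         # (..., D)
```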
GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods have begun to address multi-image grou...
FiLoRA: Focus-and-Ignore LoRA for Controllable Feature Reliance
Multimodal foundation models integrate heterogeneous signals across modalities, yet it remains poorly understood how their predictions depend on specific internal feature groups and whether such relia...
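FiLoRA builds on low-rank adaptation, so a vanilla LoRA linear layer is useful context. The sketch below is the standard frozen-base-plus-low-rank-update formulation; the focus-and-ignore control that gives FiLoRA its name is not shown.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Standard LoRA adapter: a frozen base weight W is augmented with a
    low-rank update (alpha / r) * B @ A. Only the vanilla building block
    that FiLoRA extends, not its controllable-reliance mechanism.
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Low-rank path starts at zero (B is zero-initialized), so training
        # begins from the unmodified base model.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```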
Unlocking Cognitive Capabilities and Analyzing the Perception-Logic Trade-off
Recent advancements in Multimodal Large Language Models (MLLMs) pursue omni-perception capabilities, yet integrating robust sensory grounding with complex reasoning remains a challenge, particularly f...
LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning
Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in ...
Phi-4-reasoning-vision-15B Technical Report
We present Phi-4-reasoning-vision-15B, a compact open-weight multimodal reasoning model, and share the motivations, design choices, experiments, and learnings that informed its development. Our goal i...
C^2ROPE: Causal Continuous Rotary Positional Encoding for 3D Large Multimodal-Models Reasoning
Recent advances in 3D Large Multimodal Models (LMMs) built on Large Language Models (LLMs) have established the alignment of 3D visual features with LLM representations as the dominant paradigm. Howev...
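Since C^2ROPE extends rotary positional encoding, a plain 1D RoPE sketch shows the mechanism being generalized. This is the common half-split formulation; the causal, continuous 3D variant proposed in the paper differs and is not reproduced here.

```python
import torch

def apply_rope(x: torch.Tensor, positions: torch.Tensor,
               base: float = 10000.0) -> torch.Tensor:
    """Standard 1D rotary positional encoding (RoPE): rotate each pair of
    channels by an angle proportional to token position.
    x: (..., seq, dim) with even dim; positions: (seq,).
    """
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = positions.float()[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotation in each 2D subspace (x1[i], x2[i]) by angle freq_i * pos.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```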
Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models
Humans construct internal world models and reason by manipulating the concepts within these models. Recent advances in AI, particularly chain-of-thought (CoT) reasoning, approximate such human cogniti...
What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge
We study data curation for multimodal reasoning through the NeurIPS 2025 Data Curation for Vision-Language Reasoning (DCVLR) challenge, which isolates dataset selection by fixing the model and trainin...