Multimodal AI

20 papers
5.8 viability

State of the Field

Recent advancements in multimodal AI are increasingly focused on enhancing model efficiency and robustness across diverse tasks. Techniques like scalable training-free data selection are streamlining the training process for vision-language models, allowing for significant performance retention with reduced data usage. Meanwhile, novel frameworks are being developed to address issues such as object hallucination and cross-modal interference, which have long plagued multimodal large language models. Approaches like causal decoding and modality-adaptive decoding are reshaping how models generate outputs, ensuring greater fidelity and relevance in responses. Additionally, new architectures are being introduced that leverage specialized routing pathways for different modalities, enhancing both modality-specific learning and cross-modal understanding. This shift towards more sophisticated, controlled interactions among modalities not only improves the reliability of AI systems but also opens avenues for practical applications in areas like autonomous systems, content generation, and interactive AI, where accurate and contextually aware outputs are paramount.
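The "specialized routing pathways for different modalities" idea can be made concrete with a minimal sketch of modality-aware mixture-of-experts routing. This is an illustrative toy, not the method of any listed paper: the expert pools, the `route` function, and all dimensions are assumptions chosen for clarity. Each token is routed only among its own modality's experts, and the top-k expert outputs are mixed by renormalized softmax weights.

```python
import numpy as np

rng = np.random.default_rng(0)

D, E = 8, 4  # hidden size, experts per modality (illustrative values)

# Hypothetical setup: a separate expert pool and router per modality,
# so speech and text tokens follow modality-specific pathways.
experts = {
    m: [rng.standard_normal((D, D)) * 0.1 for _ in range(E)]
    for m in ("text", "speech")
}
routers = {m: rng.standard_normal((D, E)) * 0.1 for m in ("text", "speech")}

def route(tokens, modality, top_k=2):
    """Send each token through its modality's top-k experts,
    mixing expert outputs by softmax weights over the selected experts."""
    logits = tokens @ routers[modality]            # (N, E) router scores
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of top-k experts
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        sel = top[i]
        w = np.exp(logits[i, sel])
        w /= w.sum()                               # renormalize over selected
        for weight, e in zip(w, sel):
            out[i] += weight * (experts[modality][e] @ tok)
    return out

text_tokens = rng.standard_normal((3, D))
y = route(text_tokens, "text")
print(y.shape)  # (3, 8)
```

Keeping per-modality routers is one way to limit cross-modal interference at the routing stage while a shared backbone elsewhere still supports cross-modal understanding.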

Last updated Feb 28, 2026

Papers

1–10 of 20
Research Paper·Feb 12, 2026

ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning

Large-scale Visual Instruction Tuning (VIT) has become a key paradigm for advancing the performance of vision-language models (VLMs) across various multimodal tasks. However, training on the large-sca...

8.0 viability
Research Paper·Jan 15, 2026

MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts

We present MoST (Mixture of Speech and Text), a novel multimodal large language model that seamlessly integrates speech and text processing through our proposed Modality-Aware Mixture of Experts (MAMo...

8.0 viability
Research Paper·Jan 8, 2026

GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods have begun to address multi-image grou...

8.0 viability
Research Paper·Feb 2, 2026

FiLoRA: Focus-and-Ignore LoRA for Controllable Feature Reliance

Multimodal foundation models integrate heterogeneous signals across modalities, yet it remains poorly understood how their predictions depend on specific internal feature groups and whether such relia...

8.0 viability
Research Paper·Feb 27, 2026

Unlocking Cognitive Capabilities and Analyzing the Perception-Logic Trade-off

Recent advancements in Multimodal Large Language Models (MLLMs) pursue omni-perception capabilities, yet integrating robust sensory grounding with complex reasoning remains a challenge, particularly f...

7.0 viability
Research Paper·Jan 15, 2026

LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning

Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in ...

7.0 viability
Research Paper·Mar 4, 2026

Phi-4-reasoning-vision-15B Technical Report

We present Phi-4-reasoning-vision-15B, a compact open-weight multimodal reasoning model, and share the motivations, design choices, experiments, and learnings that informed its development. Our goal i...

7.0 viability
Research Paper·Feb 11, 2026

C^2ROPE: Causal Continuous Rotary Positional Encoding for 3D Large Multimodal-Models Reasoning

Recent advances in 3D Large Multimodal Models (LMMs) built on Large Language Models (LLMs) have established the alignment of 3D visual features with LLM representations as the dominant paradigm. Howev...

6.0 viability
Research Paper·Jan 27, 2026

Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models

Humans construct internal world models and reason by manipulating the concepts within these models. Recent advances in AI, particularly chain-of-thought (CoT) reasoning, approximate such human cogniti...

6.0 viability
Research Paper·Jan 16, 2026

What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge

We study data curation for multimodal reasoning through the NeurIPS 2025 Data Curation for Vision-Language Reasoning (DCVLR) challenge, which isolates dataset selection by fixing the model and trainin...

6.0 viability