State of the Field
Recent developments in multimodal learning increasingly focus on model efficiency and adaptability in real-world conditions. A notable trend is the shift toward self-improvement frameworks that leverage unlabeled data, exemplified by recent work introducing a co-evolutionary model for vision-language tasks that eliminates the need for costly human annotations. Frameworks for continual missing modality learning are also gaining traction, using decomposed low-rank expert architectures to handle incomplete data streams without interference between competing modalities. This is complemented by efforts to integrate knowledge graph embeddings with vision-language models for more robust cross-modal reasoning, and by training methodologies that refine how images and texts are aligned and fused, improving performance across benchmarks. Work on generalized homography estimation likewise aims to keep geometric alignment accurate across imaging modalities. Collectively, these efforts target practical challenges such as reducing annotation cost, coping with incomplete or heterogeneous inputs, and improving data utilization in applications from healthcare to autonomous systems.
Papers
V-Zero: Self-Improving Multimodal Reasoning with Zero Annotation
Recent advances in multimodal learning have significantly enhanced the reasoning capabilities of vision-language models (VLMs). However, state-of-the-art approaches rely heavily on large-scale human-annotated...
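The teaser does not spell out the training loop, so as a point of reference, here is a minimal sketch of generic self-training (pseudo-labeling), the pattern self-improving frameworks typically build on: the model labels its own unlabeled pool, keeps only high-confidence predictions, and retrains on them. Everything here, from the toy classifier to the 0.9 confidence threshold, is an illustrative assumption rather than V-Zero's method; note also that classic self-training still bootstraps from a small labeled seed, which is precisely where a zero-annotation, co-evolutionary approach would differ.

```python
# Minimal self-training sketch (generic pseudo-labeling, NOT V-Zero's algorithm).
# The model labels its own unlabeled pool, keeps high-confidence predictions,
# and retrains on them -- no new human annotations enter the loop.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins (assumptions): 2-class classifier, small labeled seed, unlabeled pool.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x_seed = torch.randn(64, 16)           # tiny labeled seed set (assumption)
y_seed = (x_seed[:, 0] > 0).long()     # synthetic labels for the demo
x_pool = torch.randn(1024, 16)         # unlabeled pool

CONF_THRESHOLD = 0.9                   # keep only confident pseudo-labels (assumption)

for round_idx in range(3):             # a few self-improvement rounds
    # 1) Pseudo-label the unlabeled pool with the current model.
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(x_pool), dim=-1)
        conf, pseudo_y = probs.max(dim=-1)
        keep = conf >= CONF_THRESHOLD

    # 2) Retrain on the seed data plus the confident pseudo-labels.
    x_train = torch.cat([x_seed, x_pool[keep]])
    y_train = torch.cat([y_seed, pseudo_y[keep]])
    model.train()
    for _ in range(50):
        opt.zero_grad()
        loss = F.cross_entropy(model(x_train), y_train)
        loss.backward()
        opt.step()
    print(f"round {round_idx}: kept {int(keep.sum())} pseudo-labels, loss {loss.item():.3f}")
```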
DeLo: Dual Decomposed Low-Rank Experts Collaboration for Continual Missing Modality Learning
Adapting Large Multimodal Models (LMMs) to real-world scenarios poses the dual challenges of learning from sequential data streams while handling frequent modality incompleteness, a task known as Continual Missing Modality Learning...
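The title names "decomposed low-rank experts"; the sketch below shows the generic building block such designs usually adapt: a LoRA-style low-rank update on a frozen base layer, with one small expert per modality or task so new streams can be learned without overwriting earlier experts. The class names, rank, and routing are illustrative assumptions, not DeLo's architecture.

```python
# Generic low-rank (LoRA-style) expert sketch -- NOT DeLo's actual design.
# A frozen base weight gets a per-expert low-rank update W + B @ A, so each
# modality/task in a continual stream adds its own small expert without
# touching the shared backbone.
import torch
import torch.nn as nn

class LowRankExpert(nn.Module):
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        # Low-rank factors: (dim x rank) @ (rank x dim) == a rank-`rank` update.
        self.A = nn.Parameter(torch.randn(rank, dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(dim, rank))  # zero-init => no-op at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.A.T @ self.B.T  # x @ W_delta.T with W_delta = B @ A

class AdaptedLinear(nn.Module):
    def __init__(self, dim: int, num_experts: int = 2):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.requires_grad_(False)  # shared backbone stays frozen
        self.experts = nn.ModuleList([LowRankExpert(dim) for _ in range(num_experts)])

    def forward(self, x: torch.Tensor, expert_id: int) -> torch.Tensor:
        # Route to one expert; a missing modality simply skips its expert.
        return self.base(x) + self.experts[expert_id](x)

layer = AdaptedLinear(dim=64)
out = layer(torch.randn(4, 64), expert_id=0)
print(out.shape)  # torch.Size([4, 64])
```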
VL-KGE: Vision-Language Models Meet Knowledge Graph Embeddings
Real-world multimodal knowledge graphs (MKGs) are inherently heterogeneous, modeling entities that are associated with diverse modalities. Traditional knowledge graph embedding (KGE) methods excel at ...
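The teaser cuts off before saying what KGE methods excel at; as background for readers new to the area, the sketch below shows a classic translational KGE score (TransE), where a triple (head, relation, tail) is plausible when head + relation lands near tail in embedding space. How VL-KGE combines such scores with vision-language features is the paper's contribution and is not reproduced here; all dimensions below are illustrative.

```python
# Classic TransE scoring sketch (background on KGE, not VL-KGE's model).
# A triple (h, r, t) is scored by how well h + r approximates t:
# plausible triples have a small translation error ||h + r - t||.
import torch
import torch.nn as nn

num_entities, num_relations, dim = 1000, 50, 128  # toy sizes (assumptions)
entity_emb = nn.Embedding(num_entities, dim)
relation_emb = nn.Embedding(num_relations, dim)

def transe_score(heads, relations, tails):
    """Negative L2 distance: higher score = more plausible triple."""
    h = entity_emb(heads)
    r = relation_emb(relations)
    t = entity_emb(tails)
    return -(h + r - t).norm(p=2, dim=-1)

# Score a small batch of (head, relation, tail) index triples.
h = torch.tensor([3, 17])
r = torch.tensor([5, 5])
t = torch.tensor([42, 99])
print(transe_score(h, r, t))  # tensor of 2 plausibility scores
```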
ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion
Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We p...
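The baseline ITO starts from is standard image-text contrastive pretraining; for reference, the sketch below implements the symmetric InfoNCE (CLIP-style) objective, in which matched image-text pairs sit on the diagonal of a batch similarity matrix. The temperature and batch size are illustrative, and ITO's multiple-alignment and training-time fusion additions are not shown.

```python
# Standard symmetric image-text contrastive (InfoNCE / CLIP-style) loss sketch.
# This is the baseline objective the teaser refers to, not ITO itself.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Normalize so dot products become cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    # logits[i, j] = similarity of image i and text j; matches lie on the diagonal.
    logits = img_emb @ txt_emb.T / temperature
    targets = torch.arange(img_emb.size(0))
    # Symmetric: pick the right text for each image, and vice versa.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2

# Toy batch of 8 paired image/text embeddings.
loss = clip_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```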
Towards Generalized Multimodal Homography Estimation
Supervised and unsupervised homography estimation methods depend on image pairs tailored to specific modalities to achieve high accuracy. However, their performance deteriorates substantially when applied...
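For context on the task itself, the sketch below shows the classic single-modality baseline: recovering a 3x3 homography from point correspondences with OpenCV's RANSAC-based solver. The synthetic points and warp are illustrative; a generalized multimodal method would replace these hand-given correspondences with matches that survive across imaging modalities.

```python
# Baseline homography recovery with OpenCV (the classic pipeline, not the
# paper's multimodal method). We synthesize correspondences by warping points
# with a known homography, then recover it with RANSAC.
import numpy as np
import cv2

rng = np.random.default_rng(0)

# Ground-truth 3x3 homography (mild perspective warp, chosen for the demo).
H_true = np.array([[1.0,   0.02, 5.0],
                   [-0.01, 1.0, -3.0],
                   [1e-4,  2e-4, 1.0]])

# Random source points and their warped destinations.
src = rng.uniform(0, 640, size=(50, 2)).astype(np.float32)
src_h = np.hstack([src, np.ones((50, 1), dtype=np.float32)])  # homogeneous coords
dst_h = src_h @ H_true.T
dst = (dst_h[:, :2] / dst_h[:, 2:]).astype(np.float32)        # back to 2-D

# Recover the homography; RANSAC tolerates outlier correspondences.
H_est, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
print(np.allclose(H_est / H_est[2, 2], H_true / H_true[2, 2], atol=1e-2))  # True
```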