Papers
Cognitively-Inspired Tokens Overcome Egocentric Bias in Multimodal Models
Multimodal language models (MLMs) perform well on semantic vision-language tasks but fail at spatial reasoning that requires adopting another agent's visual perspective. These errors reflect a persist...
Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models
Current research in multimodal models faces a key challenge where enhancing generative capabilities often comes at the expense of understanding, and vice versa. We analyzed this trade-off and identify...
Innovator-VL: A Multimodal Large Language Model for Scientific Discovery
We present Innovator-VL, a scientific multimodal large language model designed to advance understanding and reasoning across diverse scientific domains while maintaining excellent performance on gener...
Pseudo Contrastive Learning for Diagram Comprehension in Multimodal Models
Recent multimodal models such as Contrastive Language-Image Pre-training (CLIP) have shown remarkable ability to align visual and linguistic representations. However, domains where small visual differ...