Papers (1–4 of 4)

Hyperdimensional Cross-Modal Alignment of Frozen Language and Image Models for Efficient Image Captioning
Large unimodal foundation models for vision and language encode rich semantic structures, yet aligning them typically requires computationally intensive multimodal fine-tuning. Such approaches depend ...
VIVECaption: A Split Approach to Caption Quality Improvement
Caption quality has emerged as a critical bottleneck in training high-quality text-to-image (T2I) and text-to-video (T2V) generative models. While vision-language models (VLMs) are commonly deployed t...
Imagine How To Change: Explicit Procedure Modeling for Change Captioning
Change captioning generates descriptions that explicitly state the differences between two visually similar images. Existing methods operate on static image pairs, thus ignoring the rich temporal d...
CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
Image captioning remains a fundamental task for vision-language understanding, yet ground-truth supervision still relies predominantly on human-annotated references. Because human annotations reflect ...