Papers 1–4 of 4

PhaseCoder: Microphone Geometry-Agnostic Spatial Audio Understanding for Multimodal LLMs
Current multimodal LLMs process audio as a mono stream, ignoring the rich spatial information essential for embodied AI. Existing spatial audio models, conversely, are constrained to fixed microphone ...
AudioCapBench: Quick Evaluation on Audio Captioning across Sound, Music, and Speech
We introduce AudioCapBench, a benchmark for evaluating the audio captioning capabilities of large multimodal models. AudioCapBench covers three distinct audio domains, including environmental sound, music, and ...
Towards Explicit Acoustic Evidence Perception in Audio LLMs for Speech Deepfake Detection
Speech deepfake detection (SDD) focuses on identifying whether a given speech signal is genuine or has been synthetically generated. Existing audio large language model (LLM)-based methods excel in co...
Spatial Audio Question Answering and Reasoning on Dynamic Source Movements
Spatial audio understanding aims to enable machines to interpret complex auditory scenes, particularly when sound sources move over time. In this work, we study Spatial Audio Question Answering (Spati...