State of the Field
Recent work in audio processing focuses on making speech technologies more efficient and more accurate. One notable trend is dynamic tokenization: variable-frame-rate methods that improve the quality of speech resynthesis while reducing the number of tokens needed. In sound source localization, new methods target real-world deployment by rectifying imbalances in data distribution, which improves localization accuracy. New frameworks for speech bandwidth extension use neural codecs to restore high-frequency content more effectively, improving clarity. Shape-gain decomposition in neural audio codecs improves bitrate-distortion performance and makes these systems more robust. Together, these efforts aim to deliver higher-quality audio at lower computational cost, with potential applications in telecommunications and media.
Papers
Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs
Speech editing achieves semantic inversion by performing fine-grained segment-level manipulation on original utterances, while preserving global perceptual naturalness. Existing detection studies main...
Analytic Incremental Learning For Sound Source Localization With Imbalance Rectification
Sound source localization (SSL) demonstrates remarkable results in controlled settings but struggles in real-world deployment due to dual imbalance challenges: intra-task imbalance arising from long-t...
Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization
Neural audio codecs are at the core of modern conversational speech technologies, converting continuous speech into sequences of discrete tokens that can be processed by LLMs. However, existing codecs...
CodecFlow: Efficient Bandwidth Extension via Conditional Flow Matching in Neural Codec Latent Space
Speech Bandwidth Extension improves clarity and intelligibility by restoring/inferring appropriate high-frequency content for low-bandwidth speech. Existing methods often rely on spectrogram or wavefo...
The Equalizer: Introducing Shape-Gain Decomposition in Neural Audio Codecs
Neural audio codecs (NACs) typically encode the short-term energy (gain) and normalized structure (shape) of speech/audio signals jointly within the same latent space. As a result, they are poorly rob...
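For readers unfamiliar with the gain/shape terminology, the following is a minimal sketch of classical shape-gain decomposition on a single frame of samples. It is not the paper's method: the function names, the use of the L2 norm as the gain, and the `eps` guard for silent frames are all assumptions made for illustration.

```python
import math

def shape_gain_decompose(frame, eps=1e-12):
    """Split a frame into a scalar gain and a unit-norm shape vector.

    Assumption: gain is the L2 norm of the frame. `eps` is a hypothetical
    guard that avoids division by zero on silent frames.
    """
    gain = math.sqrt(sum(x * x for x in frame))
    scale = max(gain, eps)
    shape = [x / scale for x in frame]
    return gain, shape

def shape_gain_reconstruct(gain, shape):
    """Recombine gain and shape into the original frame."""
    return [gain * s for s in shape]

# The same waveform at two loudness levels yields the same shape;
# only the gain differs.
quiet = [3.0, 4.0]
loud = [30.0, 40.0]
quiet_gain, quiet_shape = shape_gain_decompose(quiet)
loud_gain, loud_shape = shape_gain_decompose(loud)
restored = shape_gain_reconstruct(quiet_gain, quiet_shape)
```

In this sketch, scaling the input changes only the gain, which illustrates why handling the two components separately can make a representation less sensitive to input level.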