State of the Field
Current research in speech recognition is increasingly focused on addressing the limitations of existing systems, particularly in low-resource and high-stakes environments. Recent work has highlighted the challenges of dialectal variability in languages like Taiwanese Hakka and the need for inclusive technologies for speech-impaired individuals, such as those speaking Akan. Additionally, studies have shown that mainstream ASR systems struggle with short, critical utterances, prompting the development of synthetic data generation techniques to enhance accuracy for non-English speakers. Innovations like BBPE16 aim to streamline multilingual tokenization, while connector-sharing strategies based on linguistic family membership improve efficiency in multilingual ASR applications. Furthermore, frameworks like VibeVoice-ASR are designed to handle long-form audio and multi-speaker scenarios more effectively, integrating various speech processing tasks into a single pipeline. Collectively, these advancements signal a shift toward more robust, inclusive, and context-aware speech recognition systems capable of meeting diverse commercial needs.
Papers
Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing
Taiwanese Hakka is a low-resource, endangered language that poses significant challenges for automatic speech recognition (ASR), including high dialectal variability and the presence of two distinct w...
"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most
Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode i...
Enabling Automatic Disordered Speech Recognition: An Impaired Speech Dataset in the Akan Language
The lack of impaired speech data hinders advancements in the development of inclusive speech technologies, particularly in low-resource languages such as Akan. To address this gap, this study presents...
A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment
Despite being one of the most widely spoken languages globally, Bangla remains a low-resource language in the field of Natural Language Processing (NLP). Mainstream Automatic Speech Recognition (ASR) ...
BBPE16: UTF-16-based byte-level byte-pair encoding for improved multilingual speech recognition
Multilingual automatic speech recognition (ASR) requires tokenization that efficiently covers many writing systems. Byte-level BPE (BBPE) using UTF-8 is widely adopted for its language-agnostic design...
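The trade-off this abstract alludes to can be seen directly in the raw byte sequences a byte-level tokenizer consumes before any BPE merges. The sketch below is illustrative and independent of the paper's actual implementation (the `byte_sequence` helper is hypothetical); it only shows how the choice of UTF-8 versus UTF-16 changes per-character sequence length across scripts, which is the lever a UTF-16-based BBPE would exploit.

```python
# Illustrative comparison (not BBPE16's implementation): byte-level
# tokenizers operate on a text's raw byte sequence, so the character
# encoding determines sequence length before BPE merges are applied.
def byte_sequence(text: str, encoding: str) -> list[int]:
    """Return the raw byte values a byte-level tokenizer would see."""
    return list(text.encode(encoding))

ascii_text = "hello"
cjk_text = "語音認識"  # "speech recognition" in kanji, all BMP characters

# ASCII: UTF-8 uses 1 byte per character, UTF-16-LE uses 2.
print(len(byte_sequence(ascii_text, "utf-8")))      # 5
print(len(byte_sequence(ascii_text, "utf-16-le")))  # 10

# CJK: UTF-8 uses 3 bytes per character, UTF-16-LE only 2.
print(len(byte_sequence(cjk_text, "utf-8")))        # 12
print(len(byte_sequence(cjk_text, "utf-16-le")))    # 8
```

For Latin-script text UTF-8 yields shorter byte sequences, but for CJK and many other non-Latin scripts UTF-16 is more compact, which is the kind of multilingual coverage imbalance a UTF-16-based byte-level BPE aims to rebalance.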
GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialectal ASR
Automatic Speech Recognition (ASR) in dialect-heavy settings remains challenging due to strong regional variation and limited labeled data. We propose GLoRIA, a parameter-efficient adaptation framewor...
Language Family Matters: Evaluating LLM-Based ASR Across Linguistic Boundaries
Large Language Model (LLM)-powered Automatic Speech Recognition (ASR) systems achieve strong performance with limited resources by linking a frozen speech encoder to a pretrained LLM via a lightweight...
VibeVoice-ASR Technical Report
This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice, designed to address the persistent challenges of context fragmentation and multi-speaker comp...
WESR: Scaling and Evaluating Word-level Event-Speech Recognition
Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. While semantic transcription is well-studied, the precise localization of non-verbal e...