Speech Recognition

9 papers
4.7 viability
+25% 30d

State of the Field

Current research in speech recognition is increasingly focused on addressing the limitations of existing systems, particularly in low-resource and high-stakes environments. Recent work has highlighted the challenges of dialectal variability in languages like Taiwanese Hakka and the need for inclusive technologies for speech-impaired individuals, such as those speaking Akan. Additionally, studies have shown that mainstream ASR systems struggle with short, critical utterances, prompting the development of synthetic data generation techniques to enhance accuracy for non-English speakers. Innovations like BBPE16 aim to streamline multilingual tokenization, while connector-sharing strategies based on linguistic family membership improve efficiency in multilingual ASR applications. Furthermore, frameworks like VibeVoice-ASR are designed to handle long-form audio and multi-speaker scenarios more effectively, integrating various speech processing tasks into a single pipeline. Collectively, these advancements signal a shift toward more robust, inclusive, and context-aware speech recognition systems capable of meeting diverse commercial needs.

Last updated Mar 2, 2026

Papers

1–9 of 9
Research Paper·Feb 26, 2026

Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing

Taiwanese Hakka is a low-resource, endangered language that poses significant challenges for automatic speech recognition (ASR), including high dialectal variability and the presence of two distinct w...

7.0 viability
Research Paper·Feb 12, 2026

"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most

Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode i...

6.0 viability
Research Paper·Feb 5, 2026·Healthcare·Education

Enabling Automatic Disordered Speech Recognition: An Impaired Speech Dataset in the Akan Language

The lack of impaired speech data hinders advancements in the development of inclusive speech technologies, particularly in low-resource languages such as Akan. To address this gap, this study presents...

5.0 viability
Research Paper·Feb 26, 2026

A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment

Despite being one of the most widely spoken languages globally, Bangla remains a low-resource language in the field of Natural Language Processing (NLP). Mainstream Automatic Speech Recognition (ASR) ...

5.0 viability
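The CTC alignment this paper's pipeline builds on rests on a simple decoding rule: merge consecutive repeated labels, then drop blanks. A minimal sketch of that collapse step, with illustrative token ids and blank symbol (not the paper's actual vocabulary):

```python
# Greedy CTC collapse: merge consecutive repeats, then remove blanks.
# BLANK id and frame ids below are illustrative, not from the paper.
BLANK = 0

def ctc_collapse(frame_ids):
    """Collapse a per-frame label sequence into an output token sequence."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != BLANK:
            out.append(t)
        prev = t
    return out

# Per-frame argmax ids over a toy utterance; the blank between the two 7s
# is what lets CTC emit a genuinely repeated token.
print(ctc_collapse([0, 7, 7, 0, 7, 3, 3, 0]))  # → [7, 7, 3]
```

Because each collapsed token traces back to the frames that produced it, the same rule yields word-level timestamps, which is what makes CTC useful for alignment as well as transcription.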
Research Paper·Feb 2, 2026

BBPE16: UTF-16-based byte-level byte-pair encoding for improved multilingual speech recognition

Multilingual automatic speech recognition (ASR) requires tokenization that efficiently covers many writing systems. Byte-level BPE (BBPE) using UTF-8 is widely adopted for its language-agnostic design...

5.0 viability
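The byte-count asymmetry motivating a UTF-16-based BBPE is easy to see directly: UTF-8 spends three bytes per CJK or Hangul character where UTF-16 spends two, at the cost of doubling ASCII. A quick check (sample strings are illustrative, not from the paper):

```python
# Compare encoded lengths of the same text under UTF-8 and UTF-16.
# utf-16-le is used to avoid counting a BOM.
samples = {
    "English": "speech recognition",
    "Korean": "음성 인식",
    "Chinese": "语音识别",
}

for lang, text in samples.items():
    utf8_len = len(text.encode("utf-8"))
    utf16_len = len(text.encode("utf-16-le"))
    print(f"{lang}: {len(text)} chars, {utf8_len} UTF-8 bytes, {utf16_len} UTF-16 bytes")
```

Shorter byte sequences for non-Latin scripts mean fewer byte-level merge steps and shorter token sequences for those languages, which is the efficiency a UTF-16 byte-level BPE would be trading for.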
Research Paper·Mar 2, 2026

GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialectal ASR

Automatic Speech Recognition (ASR) in dialect-heavy settings remains challenging due to strong regional variation and limited labeled data. We propose GLoRIA, a parameter-efficient adaptation framewor...

5.0 viability
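A gated low-rank adapter in this spirit modifies a frozen weight as W' = W + g·(B·A), where A and B are the low-rank factors and g is a gate that can scale the dialect-specific update down to zero. A toy numeric sketch; the shapes, gate placement, and values are assumptions for illustration, not GLoRIA's actual parameterization:

```python
# Toy gated low-rank update: W' = W + g * (B @ A), pure-Python matrices.
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

d, r = 4, 1  # model dim, low rank
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base weight
A = [[0.5, 0.0, 0.5, 0.0]]        # r x d down-projection
B = [[1.0], [0.0], [1.0], [0.0]]  # d x r up-projection
g = 0.1                           # scalar gate: g = 0 recovers the base model

delta = matmul(B, A)              # d x d low-rank update
W_adapted = [[W[i][j] + g * delta[i][j] for j in range(d)] for i in range(d)]
print(W_adapted[0])
```

The gate is what makes the adaptation interpretable and parameter-efficient: only A, B, and g are trained, and inspecting g shows how strongly each adapter fires for a given dialect.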
Research Paper·Jan 26, 2026

Language Family Matters: Evaluating LLM-Based ASR Across Linguistic Boundaries

Large Language Model (LLM)-powered Automatic Speech Recognition (ASR) systems achieve strong performance with limited resources by linking a frozen speech encoder to a pretrained LLM via a lightweight...

4.0 viability
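The connector-sharing idea reduces to a lookup policy: instead of one projection per language, languages in the same family reuse a single connector between the frozen encoder and the LLM. A minimal sketch of that policy; the family map, class name, and omitted projection weights are illustrative assumptions:

```python
# One shared connector per language family (hypothetical family map).
FAMILY = {"bn": "indo-aryan", "hi": "indo-aryan", "de": "germanic", "nl": "germanic"}

class Connector:
    """Stand-in for a lightweight projection from encoder dim to LLM dim."""
    def __init__(self, family):
        self.family = family

_connectors = {}

def get_connector(lang):
    fam = FAMILY[lang]
    if fam not in _connectors:      # instantiate once per family, then reuse
        _connectors[fam] = Connector(fam)
    return _connectors[fam]

assert get_connector("bn") is get_connector("hi")      # same family → shared
assert get_connector("bn") is not get_connector("de")  # different family → separate
```

Sharing by family rather than per language cuts the number of trained connectors while keeping typologically distant languages from competing for the same projection.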
Research Paper·Jan 26, 2026

VibeVoice-ASR Technical Report

This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice, designed to address the persistent challenges of context fragmentation and multi-speaker comp...

3.0 viability
Research Paper·Jan 8, 2026

WESR: Scaling and Evaluating Word-level Event-Speech Recognition

Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. While semantic transcription is well-studied, the precise localization of non-verbal e...

2.0 viability