Speech Recognition Comparison Hub
11 papers - avg viability 4.8
Current research in speech recognition centers on two challenges: supporting low-resource and dialect-heavy languages, and improving long-form transcription accuracy. Recent work introduces dialect-aware modeling and metadata conditioning to improve performance in diverse linguistic contexts such as Taiwanese Hakka and Bangla, while methods like Whisper-CD improve long-form recognition with training-free contrastive decoding that raises throughput and lowers word error rates. Other studies target real-world reliability, documenting substantial transcription errors in high-stakes scenarios, particularly for non-English speakers, and responding with synthetic data generation to improve accuracy. Integrating large language models via efficient connector-sharing strategies points toward more scalable multilingual ASR. Collectively, these efforts aim at robust, inclusive, and context-aware speech technologies that operate effectively across varied linguistic landscapes.
Top Papers
- Whisper-CD: Accurate Long-Form Speech Recognition using Multi-Negative Contrastive Decoding (8.0)
Whisper-CD enhances long-form speech recognition accuracy and speed by using a training-free contrastive decoding method, making it a drop-in replacement for existing Whisper systems.
- Nwāchā Munā: A Devanagari Speech Corpus and Proximal Transfer Benchmark for Nepal Bhasha ASR (7.0)
Enable digital access for the endangered Nepal Bhasha language with a speech recognition API leveraging proximal transfer learning.
- Exploring the potential and limitations of Model Merging for Multi-Domain Adaptation in ASR (7.0)
A novel model merging algorithm, BoostedTSV-M, improves multi-domain ASR performance and out-of-distribution generalization, offering a scalable alternative to fine-tuning.
- Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing (7.0)
Create a dialect-aware ASR tool tailored for the low-resource Taiwanese Hakka language, significantly reducing error rates.
- "Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most (6.0)
Improve transcription accuracy for high-stakes voice applications using synthetic data augmentation.
- BBPE16: UTF-16-based byte-level byte-pair encoding for improved multilingual speech recognition (5.0)
BBPE16 offers an efficient multilingual tokenization solution for ASR, improving performance for non-Latin scripts without increasing computational costs.
- GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialectal ASR (5.0)
Develop a metadata-gated adaptation framework to enhance ASR systems in dialect-heavy settings.
- Enabling Automatic Disordered Speech Recognition: An Impaired Speech Dataset in the Akan Language (5.0)
Develop a speech recognition tool tailored to disordered speech in the Akan language, based on a new impaired-speech dataset.
- A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment (5.0)
A scalable Bangla ASR and speaker diarization tool for long-form audio with enhanced VAD and CTC alignment.
- When Denoising Hinders: Revisiting Zero-Shot ASR with SAM-Audio and Whisper (5.0)
Improve zero-shot ASR performance by understanding the limitations of denoising preprocessing.
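The contrastive-decoding idea behind entries like Whisper-CD can be sketched generically: score each candidate token by the expert model's log-probability minus the (averaged) log-probabilities of one or more weaker "negative" models, so tokens the expert uniquely prefers are boosted. This is a minimal sketch of contrastive decoding in general, not Whisper-CD's actual multi-negative formulation; the function name, the averaging over negatives, and the `alpha` weight are illustrative assumptions.

```python
import numpy as np

def contrastive_scores(expert_logits, negative_logits_list, alpha=1.0):
    """Generic contrastive-decoding score: expert log-prob minus the mean
    log-prob of the negative models (sketch, not Whisper-CD's method)."""
    # convert raw logits to log-probabilities via log-sum-exp normalization
    expert_logp = expert_logits - np.logaddexp.reduce(expert_logits)
    neg_logp = np.mean(
        [l - np.logaddexp.reduce(l) for l in negative_logits_list], axis=0
    )
    # boost tokens the expert likes but the negatives do not
    return expert_logp - alpha * neg_logp

# toy vocabulary: the expert favors "the"; the negatives lean toward "teh"
vocab = ["the", "teh", "cat"]
expert = np.array([2.0, 0.1, 1.5])
negatives = [np.array([1.0, 1.2, 0.5]), np.array([0.8, 1.0, 0.6])]
best = vocab[int(np.argmax(contrastive_scores(expert, negatives)))]
print(best)  # "the" — the expert's preference survives the penalty
```

Because the scoring is a pure post-hoc combination of logits, it needs no retraining, which is what makes this family of methods attractive as a drop-in decoding change.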
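The motivation behind UTF-16-based byte-level BPE (as in the BBPE16 entry) can be illustrated with raw byte lengths: byte-level tokenizers operate on encoded bytes, and many non-Latin scripts (Devanagari, CJK, and others) need three UTF-8 bytes per code point but only two in UTF-16, so sequences shrink before any BPE merges are applied. A minimal illustration of that encoding-length difference, not the paper's tokenizer:

```python
# Byte-length comparison for byte-level tokenization: UTF-16 lowers the
# per-character byte cost for many non-Latin scripts relative to UTF-8.
# (Illustrative only; not the BBPE16 tokenizer itself.)
samples = {
    "latin": "hello",          # 1 UTF-8 byte per character
    "devanagari": "नमस्ते",      # 6 code points, 3 UTF-8 bytes each
}

for label, s in samples.items():
    utf8_len = len(s.encode("utf-8"))
    utf16_len = len(s.encode("utf-16-le"))  # little-endian, no BOM
    print(f"{label}: utf-8={utf8_len} bytes, utf-16={utf16_len} bytes")
```

Note the trade-off visible even in this toy example: UTF-16 doubles the byte count for Latin text while cutting it by a third for Devanagari, so the net benefit depends on the script mix of the training data.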