Text-to-Speech Comparison Hub

Fish Audio S2 Technical Report (9.0)

Fish Audio S2 is an open-source text-to-speech system that supports multi-speaker, instruction-following audio generation.

DeepASMR: LLM-Based Zero-Shot ASMR Speech Generation for Anyone of Any Voice (8.0)

DeepASMR enables anyone to synthesize zero-shot ASMR speech from ordinary samples, leveraging a new dataset and advanced LLM techniques.

Causal Prosody Mediation for Text-to-Speech: Counterfactual Training of Duration, Pitch, and Energy in FastSpeech2 (7.0)

A novel framework for expressive text-to-speech synthesis that enhances emotional prosody control using causal learning principles.

Bolbosh: Script-Aware Flow Matching for Kashmiri Text-to-Speech (7.0)

Bolbosh is an open-source neural TTS system for Kashmiri, addressing the lack of speech technology for this underserved language.

Learning-free L2-Accented Speech Generation using Phonological Rules (7.0)

Generates accented speech without accented training data by combining phonological rules with a multilingual TTS model, enabling fine-grained accent control.

ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis (6.0)

ZeSTA improves personalized speech synthesis by enhancing speaker similarity through zero-shot TTS augmentation and domain-conditioned training.

When Fine-Tuning Fails and when it Generalises: Role of Data Diversity and Mixed Training in LLM-based TTS (4.0)

Improves voice cloning in LLM-based TTS systems through LoRA fine-tuning.
