Text-to-Speech

Trending

7 papers

6.9 viability

+500% 30d

Papers

Research Paper·Mar 9, 2026

Fish Audio S2 Technical Report

We introduce Fish Audio S2, an open-sourced text-to-speech system featuring multi-speaker, multi-turn generation, and, most importantly, instruction-following control via natural-language descriptions...

9.0 viability

Research Paper·Jan 22, 2026

DeepASMR: LLM-Based Zero-Shot ASMR Speech Generation for Anyone of Any Voice

While modern Text-to-Speech (TTS) systems achieve high fidelity for read-style speech, they struggle to generate Autonomous Sensory Meridian Response (ASMR), a specialized, low-intensity speech style ...

8.0 viability

Research Paper·Mar 12, 2026

Causal Prosody Mediation for Text-to-Speech: Counterfactual Training of Duration, Pitch, and Energy in FastSpeech2

We propose a novel causal prosody mediation framework for expressive text-to-speech (TTS) synthesis. Our approach augments the FastSpeech2 architecture with explicit emotion conditioning and introduce...

7.0 viability

Research Paper·Mar 8, 2026

Bolbosh: Script-Aware Flow Matching for Kashmiri Text-to-Speech

Kashmiri is spoken by around 7 million people but remains critically underserved in speech technology, despite its official status and rich linguistic heritage. The lack of robust Text-to-Speech (TTS)...

7.0 viability

Research Paper·Mar 8, 2026

Learning-free L2-Accented Speech Generation using Phonological Rules

Accent plays a crucial role in speaker identity and inclusivity in speech technologies. Existing accented text-to-speech (TTS) systems either require large-scale accented datasets or lack fine-grained...

7.0 viability

Research Paper·Mar 4, 2026

ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

We investigate the use of zero-shot text-to-speech (ZS-TTS) as a data augmentation source for low-resource personalized speech synthesis. While synthetic augmentation can provide linguistically rich a...

6.0 viability

Research Paper·Mar 11, 2026

When Fine-Tuning Fails and when it Generalises: Role of Data Diversity and Mixed Training in LLM-based TTS

Large language models are increasingly adopted as semantic backbones for neural text-to-speech systems. However, frozen LLM representations are insufficient for modeling speaker-specific acoustic and ...

4.0 viability
