TTS Paper Demo

Adaptive Condition Optimization for Text-to-Speech via Inference-Time Gradient Guidance

Main architecture overview of Aco-TTS and its adaptive condition optimization process.

Abstract

Existing zero-shot text-to-speech systems typically adopt a two-stage architecture that combines a large language model with flow matching. However, flow matching follows an open-loop inference process with no real-time correction during sampling, which leads to speech distortion and unstable quality. A key limitation is that the condition is computed once before sampling and then held fixed, so it cannot adapt as the generation trajectory evolves. To address this, we propose Adaptive Condition Optimization for Text-to-Speech (Aco-TTS), a gradient-guided framework for inference-time optimization that dynamically refines the condition during sampling, using textual alignment and perceptual quality as guidance. This enables online correction and better constrains the flow trajectory. Aco-TTS is training-free: it improves generation purely through gradient-based feedback at inference time. Experiments on SeedTTS, LibriSpeech, and AIShell-1 demonstrate significant reductions in word error rate and improvements in audio quality.
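The idea of refining the condition during sampling can be illustrated with a toy sketch. This is not the paper's implementation: `velocity`, `reward`, and `sample` are hypothetical names, the velocity field is a simple analytic stand-in for a flow-matching network, and the quadratic reward stands in for an ASR-alignment or MOS score. The update rule (one gradient-ascent step on the condition per Euler step, using a one-step estimate of the final sample) is an assumption made for illustration.

```python
import numpy as np

def velocity(x, t, c):
    """Toy stand-in for a flow-matching velocity network: pulls x toward c."""
    return c - x

def reward(x1_hat, target):
    """Toy differentiable reward (higher is better); a placeholder for an ASR or MOS score."""
    return -np.sum((x1_hat - target) ** 2)

def sample(c0, target, steps=20, lr=0.0, seed=0):
    """Euler sampling from noise; if lr > 0, the condition c is refined at every step."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(c0.shape)   # start from noise at t = 0
    c = c0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        if lr > 0.0:
            # One-step estimate of the final sample from the current state.
            x1_hat = x + (1.0 - t) * velocity(x, t, c)
            # Analytic gradient of the toy reward w.r.t. c,
            # using d(x1_hat)/dc = (1 - t) * I for this toy velocity field.
            grad_c = -2.0 * (1.0 - t) * (x1_hat - target)
            c = c + lr * grad_c          # gradient ascent on the reward
        x = x + dt * velocity(x, t, c)   # Euler step with the current condition
    return x, c

c0 = np.zeros(4)        # initial (static) condition
target = np.ones(4)     # what the toy reward prefers
x_base, _ = sample(c0, target, lr=0.0)      # open-loop: condition stays fixed
x_guided, _ = sample(c0, target, lr=0.5)    # closed-loop: condition is refined
print(reward(x_base, target), reward(x_guided, target))
```

With `lr=0.0` the loop reduces to ordinary open-loop sampling with a fixed condition; with `lr > 0` the reward gradient feeds back into the condition at each step. In a real system the gradient would come from automatic differentiation through the reward model rather than a closed-form expression.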

Audio Cases

Comparisons use the CosyVoice2 backbone baseline together with the two reward-guided variants from the paper: Aco-TTS-ASR and Aco-TTS-MOS. The English samples come from SeedTTS test-en, and the Chinese samples come from SeedTTS test-zh.
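For readers who want to score the samples themselves, word error rate is conventionally the word-level edit distance between a reference text and an ASR transcript, divided by the number of reference words. The sketch below is a minimal version under stated assumptions: the `normalize` and `wer` helpers are hypothetical, the text normalization is deliberately simple, and the hypothesis string is an invented transcript, not an output of any system on this page.

```python
import re

def normalize(text):
    """Lowercase and drop punctuation (apostrophes kept) so only words are compared."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()

def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)

# Reference is the Case 1 target text; the hypothesis is a made-up transcript.
print(wer("Her old man was Doc Mitchell.", "her old man was dock mitchell"))
```

Here one of six reference words is substituted, so the score is 1/6.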

Case 1

English Demo

Prompt text: You've got the Mayor and Pullman backed against a wall.
Target text: Her old man was Doc Mitchell.
Audio samples: Prompt, Ground Truth, CosyVoice2, Aco-TTS-ASR, Aco-TTS-MOS

Case 2

English Demo

Prompt text: I shall be neither more nor less meritorious.
Target text: Tyler, Lucy, Michelle, we're going to space!
Audio samples: Prompt, Ground Truth, CosyVoice2, Aco-TTS-ASR, Aco-TTS-MOS

Case 3

English Demo

Prompt text: On the second day, the boy climbed to the top of a cliff near the camp.
Target text: The area was swirling in dust so intense that it hid the moon from view.
Audio samples: Prompt, Ground Truth, CosyVoice2, Aco-TTS-ASR, Aco-TTS-MOS

Case 4

Chinese Demo

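Prompt text: 我挣扎了好久,把闹钟定在了八点。 (English: I struggled for a long time, then set the alarm for eight o'clock.)
Target text: 自动驾驶将大幅提升出行安全,效率。 (English: Autonomous driving will greatly improve travel safety and efficiency.)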
Audio samples: Prompt, Ground Truth, CosyVoice2, Aco-TTS-ASR, Aco-TTS-MOS

Case 5

Chinese Demo

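Prompt text: 真正落地成为产品服务进入每个人的生活。 (English: Truly taking shape as products and services that enter everyone's life.)
Target text: 就在营救进行的同时,楼下却不断地发出欢呼,起哄声。 (English: Even as the rescue was under way, cheers and jeers kept rising from downstairs.)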
Audio samples: Prompt, Ground Truth, CosyVoice2, Aco-TTS-ASR, Aco-TTS-MOS