TTS Paper Demo

Adaptive Condition Optimization for Text-to-Speech via Inference-Time Gradient Guidance

Project: Aco-TTS Demo Page
Baseline: CosyVoice2
Demo Dataset: SeedTTS Eval

Abstract

Existing zero-shot text-to-speech systems typically adopt a two-stage architecture combining a large language model with flow matching. However, flow matching follows an open-loop inference process with no real-time correction during sampling, which leads to speech distortion and unstable quality. A key limitation is that the condition is computed once before sampling and held fixed throughout inference, so it cannot adapt as the generation trajectory evolves. To address this, we propose Adaptive Condition Optimization for Text-to-Speech (Aco-TTS), a gradient-guided framework for inference-time optimization that dynamically refines the condition during sampling, using textual alignment and perceptual quality as guidance. This enables online correction and better constrains the flow trajectory. Aco-TTS is training-free and improves generation through gradient-based feedback at inference time. Experiments on SeedTTS, LibriSpeech, and AISHELL-1 demonstrate significant reductions in word error rate and improvements in audio quality.
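
The idea of refining the condition during sampling can be illustrated with a toy sketch. This is not the paper's implementation: the velocity field, the endpoint estimate, the squared-distance reward, the finite-difference gradient, and all names (`velocity`, `predict_endpoint`, `reward`, `sample`, `guidance_lr`) are hypothetical stand-ins chosen so the example runs with the Python standard library only. In Aco-TTS the guidance comes from textual alignment and perceptual quality rather than a known target.

```python
# Toy illustration only: every function here is an assumed stand-in,
# not the CosyVoice2 flow-matching model or the Aco-TTS reward models.

def velocity(x, t, cond):
    """Toy velocity field that pulls the state toward the condition."""
    return [c - xi for xi, c in zip(x, cond)]

def predict_endpoint(x, t, cond):
    """One-step estimate of the final sample from the current state."""
    v = velocity(x, t, cond)
    return [xi + (1.0 - t) * vi for xi, vi in zip(x, v)]

def reward(sample_vec, target):
    """Stand-in reward: negative squared distance to a target vector."""
    return -sum((s - g) ** 2 for s, g in zip(sample_vec, target))

def reward_grad_wrt_cond(x, t, cond, target, eps=1e-4):
    """Central finite-difference gradient of the reward w.r.t. the condition."""
    grad = []
    for i in range(len(cond)):
        hi = list(cond)
        lo = list(cond)
        hi[i] += eps
        lo[i] -= eps
        r_hi = reward(predict_endpoint(x, t, hi), target)
        r_lo = reward(predict_endpoint(x, t, lo), target)
        grad.append((r_hi - r_lo) / (2.0 * eps))
    return grad

def sample(x0, cond, target, steps=20, guidance_lr=0.0):
    """Euler sampling; guidance_lr=0 keeps the condition fixed (open loop)."""
    x, c = list(x0), list(cond)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        if guidance_lr > 0.0:
            # Closed loop: nudge the condition uphill on the reward
            # before taking the next sampling step.
            g = reward_grad_wrt_cond(x, t, c, target)
            c = [ci + guidance_lr * gi for ci, gi in zip(c, g)]
        v = velocity(x, t, c)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

if __name__ == "__main__":
    x0, cond, target = [0.0, 0.0], [1.0, -1.0], [1.5, -0.5]
    open_loop = sample(x0, cond, target)
    guided = sample(x0, cond, target, guidance_lr=0.2)
    print("open-loop reward:", reward(open_loop, target))
    print("guided reward:   ", reward(guided, target))
```

With `guidance_lr=0.0` the loop reduces to ordinary fixed-condition sampling, so the two settings can be compared on the same starting point. The finite-difference gradient is used here only to keep the sketch dependency-free; a real system would more plausibly backpropagate through differentiable reward models.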

Audio Cases

Each case compares the CosyVoice2 baseline with the two reward-guided variants from the paper, Aco-TTS-ASR and Aco-TTS-MOS. The English samples come from SeedTTS test-en, and the Chinese samples come from SeedTTS test-zh.

Case 1

English Demo

Prompt text: You've got the Mayor and Pullman backed against a wall.

Target text: Her old man was Doc Mitchell.

Audio samples: Prompt, Ground Truth, CosyVoice2, Aco-TTS-ASR, Aco-TTS-MOS

Case 2

English Demo

Prompt text: I shall be neither more nor less meritorious.

Target text: Tyler, Lucy, Michelle, we're going to space!

Audio samples: Prompt, Ground Truth, CosyVoice2, Aco-TTS-ASR, Aco-TTS-MOS

Case 3

English Demo

Prompt text: On the second day, the boy climbed to the top of a cliff near the camp.

Target text: The area was swirling in dust so intense that it hid the moon from view.

Audio samples: Prompt, Ground Truth, CosyVoice2, Aco-TTS-ASR, Aco-TTS-MOS

Case 4

Chinese Demo

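Prompt text: 我挣扎了好久，把闹钟定在了八点。 (English: I struggled for a long time, then set the alarm for eight o'clock.)

Target text: 自动驾驶将大幅提升出行安全，效率。 (English: Autonomous driving will greatly improve travel safety and efficiency.)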

Audio samples: Prompt, Ground Truth, CosyVoice2, Aco-TTS-ASR, Aco-TTS-MOS

Case 5

Chinese Demo

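Prompt text: 真正落地成为产品服务进入每个人的生活。 (English: Truly being put into practice as products and services that enter everyone's life.)

Target text: 就在营救进行的同时，楼下却不断地发出欢呼，起哄声。 (English: While the rescue was under way, cheers and jeers kept coming from downstairs.)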

Audio samples: Prompt, Ground Truth, CosyVoice2, Aco-TTS-ASR, Aco-TTS-MOS