Emotional Text to Speech with CosyVoice
Make your text sound human. CosyVoice generates expressive speech across five emotions and follows natural-language instructions to control style, dialect, speed, emphasis and breathing.
Expressive, controllable speech
Five core emotions
Render text as happy, sad, angry, fearful or surprised, in both Chinese and English.
Instruction control
Steer delivery with plain-language prompts like “speak slowly and gently” or “sound excited”.
Fine-grained markers
Place breaths, add emphasis and adjust pace at the word level for precise direction.
Consistent identity
Keep the same speaker identity across every emotion, style and speed.
Where expressive TTS shines
Games & characters
Voice NPCs and characters with emotion that matches the scene.
Video & social content
Add lively narration that holds viewer attention.
Conversational AI
Give assistants an empathetic, situation-aware tone.
Audiobooks & drama
Perform dialogue with believable emotional range.
Emotional TTS FAQ
What emotions does CosyVoice support?
CosyVoice can generate happy, sad, angry, fearful and surprised speech, plus neutral delivery, in both Chinese and English.
How do I control emotion and style?
Guide CosyVoice with natural-language instructions, for example “speak in a cheerful tone”, or use fine-grained markers for emphasis, pauses and speed.
Can I control speaking speed and emphasis?
Yes. CosyVoice supports fast and slow speed control plus word-level emphasis and breath markers for precise delivery.
Is emotional text to speech free to try?
Yes. Try expressive synthesis in the playground on this page. CosyVoice is also open source under the Apache-2.0 license, so you can self-host it.