VibeVoice
VibeVoice: Multi-Speaker Text-to-Speech Podcast Generator
VibeVoice is an open-source framework from Microsoft for generating long-form, multi-speaker text-to-speech audio in English and Chinese. It supports up to 4 speakers, expressive and emotional delivery, and switching between the two languages within a conversation, which makes it suited to podcast drafts, audiobooks, educational content, and more. Its features include context-aware expression, long-form synthesis (up to 90 minutes), and consistent speaker voices. VibeVoice uses a next-token diffusion process to generate realistic, dynamic speech while maintaining coherence over long sessions.
2025-09-06
VibeVoice Product Information
VibeVoice: Turn Text Into 90-Minute Multi-Speaker Podcasts
What's VibeVoice
VibeVoice is Microsoft's open-source framework designed to turn written text into multi-speaker audio, specifically for long-form content like podcasts and audiobooks. It supports up to four distinct speakers and can generate dialogues lasting up to 90 minutes. Ideal for content creators, educators, and researchers, VibeVoice allows seamless multi-speaker conversations in both English and Chinese, making it a versatile tool for generating dynamic, natural-sounding audio.
Features
Long-Form Conversational Synthesis
VibeVoice can produce up to 90 minutes of continuous, coherent dialogue. Whether you’re creating a podcast, audiobook, or any other long-form content, VibeVoice maintains natural speaker identity and conversation flow over extended periods.
Multi-Speaker Support
Generate dialogues with up to 4 distinct speakers. Each speaker maintains consistent voice characteristics throughout the conversation, ensuring a natural multi-speaker dynamic for podcasts and audiobooks.
Context-Aware Expression
VibeVoice can produce spontaneous emotion, and even singing, where the context calls for it. Expressive intonation and emotional shifts make the conversations sound more realistic.
Cross-Lingual Speech
With support for both English and Chinese, VibeVoice can seamlessly switch between languages within a single conversation, offering a versatile solution for bilingual content generation.
Ultra-Low Frame Rate Tokenizer
The tokenizer runs at an ultra-low frame rate of 7.5 Hz, so each second of audio is represented by only a few frames. This keeps sequences short and reduces computational load, which makes long-form audio generation practical without compromising on quality.
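To make the 7.5 Hz figure concrete, the arithmetic below counts how many tokenizer frames cover a given duration. This is a minimal sketch: the function name is hypothetical, and it counts audio frames only, not text tokens or anything else the model may place in its context.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate quoted in this section


def frames_for_duration(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of tokenizer frames needed to cover `minutes` of audio."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    # minutes -> seconds -> frames
    return round(minutes * 60 * frame_rate_hz)


if __name__ == "__main__":
    for minutes in (1, 45, 90):
        print(f"{minutes:>3} min -> {frames_for_duration(minutes):>6} frames")
```

At 7.5 Hz, 90 minutes of audio is 40,500 frames. For comparison, a hypothetical tokenizer running at 50 Hz would need 270,000 frames for the same duration.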
Use Cases
Podcast Prototyping
Content creators can use VibeVoice to quickly prototype podcast episodes, experimenting with formats, pacing, and speaker dynamics before committing to full-scale production.
Audiobook Narration
Authors and publishers can generate multi-character audiobook recordings with up to four distinct voices, making character dialogue more engaging and consistent throughout the story.
Educational Content & Training
VibeVoice can bring educational materials to life by generating engaging, dynamic dialogues between professors and students, enhancing e-learning experiences for auditory learners.
Language Learning & Bilingual Content
With native English and Chinese support, VibeVoice is ideal for creating bilingual dialogues for language practice, listening comprehension, and cultural exchanges.
Game Development & Interactive Stories
Game developers can use VibeVoice to prototype in-game dialogues and test narrative pacing, tone, and emotional delivery without the need for voice actors.
Accessibility & Assistive Technology
VibeVoice helps make content more accessible by converting long documents, articles, and reports into engaging audio, making it easier for visually impaired users to consume content.
FAQ
How long can VibeVoice generate speech?
VibeVoice can generate up to 90 minutes of audio with the 1.5B model and up to 45 minutes with the 7B model, so the choice of model trades maximum length against speech quality and performance.
How many speakers can I include in one recording?
VibeVoice supports up to 4 distinct speakers in a conversation, each with its own voice prompt to keep timbre and role identity consistent.
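The 4-speaker limit can be checked before any audio is generated. The sketch below assumes a script format of one turn per line, written as "Speaker N: text". That format, the function name, and the error messages are illustrative assumptions, not something this page specifies.

```python
import re

MAX_SPEAKERS = 4  # limit stated in the answer above

# Assumed (hypothetical) turn format: "Speaker 1: Hello there"
TURN_RE = re.compile(r"^\s*(Speaker\s+\d+)\s*:\s*(.+?)\s*$")


def parse_script(text: str) -> list[tuple[str, str]]:
    """Parse a script into (speaker, utterance) turns and enforce the speaker limit."""
    turns = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines between turns
        match = TURN_RE.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: expected 'Speaker N: text'")
        speaker = " ".join(match.group(1).split())  # normalise inner whitespace
        turns.append((speaker, match.group(2)))
    speakers = {speaker for speaker, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(
            f"script uses {len(speakers)} speakers, limit is {MAX_SPEAKERS}"
        )
    return turns


if __name__ == "__main__":
    demo = "Speaker 1: Welcome to the show.\nSpeaker 2: Glad to be here."
    for speaker, utterance in parse_script(demo):
        print(f"{speaker} says: {utterance}")
```

Because each line is exactly one turn, the parsed result is a strictly sequential list of turns.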
Does VibeVoice generate background music or sound effects?
No, VibeVoice focuses purely on speech synthesis. While some faint background music artifacts may occur due to training data, they are unintended and uncontrollable.
Which languages does VibeVoice support?
VibeVoice supports English and Chinese natively. Other languages may work, but results may be unstable or unintelligible.
Can VibeVoice run on consumer hardware?
Yes, but the hardware requirements depend on the model. The 1.5B model requires 7–10 GB of VRAM, while the 7B model requires 18–24 GB. Generating long audio is computationally expensive and takes significant time.
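As a rough guide, the VRAM ranges above can be turned into a quick check for a given GPU. This is a sketch only: the function name and the three labels ("comfortable", "marginal", "insufficient") are my own assumptions, and real memory use will vary with audio length and settings.

```python
# VRAM ranges in GB, as quoted in the answer above: (low, high) per model.
VRAM_RANGES_GB = {
    "1.5B": (7, 10),
    "7B": (18, 24),
}


def vram_fit(model: str, available_gb: float) -> str:
    """Classify how well `available_gb` of VRAM covers the quoted range for `model`."""
    low, high = VRAM_RANGES_GB[model]
    if available_gb >= high:
        return "comfortable"   # meets the upper end of the quoted range
    if available_gb >= low:
        return "marginal"      # inside the quoted range; may be tight
    return "insufficient"      # below the lower end of the quoted range


if __name__ == "__main__":
    for gb in (8, 12, 24):
        summary = ", ".join(f"{m}: {vram_fit(m, gb)}" for m in VRAM_RANGES_GB)
        print(f"{gb} GB -> {summary}")
```

A 12 GB consumer card, for example, clears the upper figure for the 1.5B model but falls below the lower figure for the 7B model.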
Can I use VibeVoice for commercial projects?
VibeVoice is open-source (MIT License), but its creators recommend using it for research and prototyping only. Commercial deployment should include ethical safeguards and disclosure practices.
Does VibeVoice support overlapping speech?
No, VibeVoice currently supports only sequential, turn-taking dialogue and does not model simultaneous speech or interruptions.
How does VibeVoice compare to other TTS services like ElevenLabs or Google TTS?
Unlike commercial services, VibeVoice is open-source, runs locally, and is specialized in long-form, multi-speaker content. It offers greater flexibility for research and creative experimentation but may not provide the same real-time speed or language support as commercial alternatives.