Meituan LongCat-AudioDiT Revolutionizes Zero-Shot TTS Voice Cloning by Eliminating Intermediate Mel-Spectrogram Representations
Research Breakthrough, TTS, Voice Cloning, Diffusion Models

The Meituan LongCat technical team has released LongCat-AudioDiT, a model for zero-shot Text-to-Speech (TTS) voice cloning. It redesigns the synthesis pipeline by dropping intermediate representations such as Mel-spectrograms and generating speech directly in a waveform latent space with a Diffusion Transformer (DiT). According to the team, modeling the audio signal itself avoids the cascaded errors that accumulate across multi-stage conversion, which they describe as a key bottleneck in audio generation. The result is a shorter pipeline that aims to replicate a speaker's voice more accurately without speaker-specific training data.

Meituan Technology Team

Key Takeaways

  • Elimination of Intermediate Steps: LongCat-AudioDiT removes the Mel-spectrogram, which TTS systems traditionally use as a bridge between text and waveform.
  • Waveform Latent Space Focus: The model operates directly in a latent space derived from the waveform, so it learns the structure of the audio signal rather than a hand-designed spectral summary.
  • Diffusion Transformer Architecture: It uses a diffusion model with a Transformer backbone (AudioDiT) to generate speech from text input.
  • Reduction of Cascaded Errors: With no text-to-spectrogram and spectrogram-to-waveform handoff, errors from those conversions can no longer accumulate and degrade cloning quality.
  • Zero-Shot Capability: The architecture is designed to raise the achievable quality of zero-shot voice cloning.

In-Depth Analysis

Overcoming the Bottleneck of Cascaded Errors

In traditional Text-to-Speech (TTS) architectures, converting text into audible speech usually involves several intermediate stages. The most common is the generation of a Mel-spectrogram, a time-frequency representation of the signal whose frequency axis is compressed onto the perceptual mel scale. This works, but it introduces a significant technical challenge: cascaded errors. Each transition, first from text to spectrogram and then from spectrogram to waveform via a vocoder, can lose information and introduce artifacts, and errors made in the first stage are carried into the second.

The Meituan LongCat team identified this as a primary bottleneck limiting the quality of voice cloning. With LongCat-AudioDiT, the team has moved toward a "direct-to-waveform" philosophy. By abandoning the Mel-spectrogram entirely, the model keeps the synthesis process within a single, continuous latent space. This removes the spectrogram handoffs where conversion errors arise, which is intended to help the model stay closer to the target voice during cloning.
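
To make the lossy step concrete, here is a minimal Python sketch of the mel projection that a conventional pipeline applies to each spectrum frame. It is not taken from LongCat-AudioDiT. It uses the widely used HTK-style mel formula, and the frame size (513 frequency bins), band count (80), and sample rate (24 kHz) are assumed typical values, not figures from the source.

```python
import math

def hz_to_mel(f_hz):
    # HTK-style mel scale: roughly linear below 1 kHz, logarithmic above.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_bins, sample_rate):
    """Triangular filters mapping n_bins linear-frequency bins to n_mels bands."""
    nyquist = sample_rate / 2.0
    mel_max = hz_to_mel(nyquist)
    # n_mels + 2 band edges, evenly spaced on the mel scale.
    edges = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [nyquist * k / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def to_mel(bank, magnitude_frame):
    # One weighted sum per mel band.
    return [sum(w * s for w, s in zip(row, magnitude_frame)) for row in bank]

# Assumed typical settings (not from the source).
N_BINS, N_MELS, SAMPLE_RATE = 513, 80, 24000
bank = mel_filterbank(N_MELS, N_BINS, SAMPLE_RATE)
frame = [1.0] * N_BINS            # a flat magnitude spectrum for one frame
mel_frame = to_mel(bank, frame)
print(len(frame), "->", len(mel_frame))
```

Because 513 values are squeezed into 80 bands, and phase has already been discarded to obtain the magnitude spectrum, a vocoder has to estimate the missing detail. That estimate is one place where errors can enter a cascaded pipeline.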

The Power of Diffusion in Waveform Latent Space

At the core of LongCat-AudioDiT is a Diffusion Transformer (DiT) applied to the waveform latent space. Rather than mapping text to a simplified frequency representation, the model is trained to capture the structure of the audio signal itself. Operating in a latent space derived from the actual waveform lets it model timbre, pitch, and rhythm jointly instead of through a fixed spectral summary.

The "AudioDiT" framework generates audio by iterative refinement: it starts from noise and moves step by step toward a structured latent that matches the input text and the target voice profile. This is particularly useful for zero-shot voice cloning, where the model is given only a very brief audio sample and must replicate a voice it never encountered during training. By learning patterns in the audio itself rather than only surface-level spectral features, LongCat-AudioDiT aims to raise the performance ceiling for zero-shot synthesis.
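
The iterative refinement described above can be sketched as a short sampling loop. This is a toy illustration under stated assumptions, not the LongCat-AudioDiT sampler: the source does not specify the update rule, so the sketch uses plain Euler steps along a straight path from noise to data, and `toy_denoiser`, `cond`, and `target_latent` are hypothetical stand-ins for a trained DiT and its text and voice-prompt conditioning.

```python
import random

def toy_denoiser(x_t, t, cond):
    """Hypothetical stand-in for a trained DiT.

    A real model would predict the clean latent from the noisy latent,
    the timestep, the text, and the reference audio. Here the "prediction"
    is simply a fixed target stored in `cond`, so the loop is runnable.
    """
    return cond["target_latent"]

def sample_latent(cond, dim, steps, seed=0):
    rng = random.Random(seed)
    # Start from pure Gaussian noise in the latent space.
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for i in range(steps):
        t = i / steps                      # time runs from 0 (noise) toward 1 (data)
        x_clean = toy_denoiser(x, t, cond)
        # Velocity that carries x to the predicted clean latent by t = 1.
        v = [(c - xi) / (1.0 - t) for c, xi in zip(x_clean, x)]
        # One Euler step of size 1 / steps.
        x = [xi + vi / steps for xi, vi in zip(x, v)]
    return x

# Hypothetical conditioning: the text plus a latent standing in for the target voice.
cond = {"text": "hello", "target_latent": [0.5, -1.0, 0.25, 2.0]}
latent = sample_latent(cond, dim=4, steps=8)
print(latent)
```

In a real system the denoiser's prediction would change at every step as the latent becomes less noisy, and a separate decoder would turn the final latent into a waveform.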

Industry Impact

The release of LongCat-AudioDiT signals a possible shift in the technical direction of AI audio. If bypassing intermediate representations like Mel-spectrograms proves as effective as the team reports, it gives other developers a reason to reconsider spectrogram-based TTS designs and may encourage a broader move toward latent waveform modeling. Applications that need high-precision voice replication, such as personalized digital assistants, content creation, and real-time translation, stand to gain more natural-sounding output. Removing the spectrogram stage also shortens the pipeline, which could reduce the complexity of deploying high-quality voice cloning at scale, although the source gives no figures on computational cost.

Frequently Asked Questions

Question: What makes LongCat-AudioDiT different from traditional TTS models?

Traditional models usually convert text into a Mel-spectrogram and then use a vocoder to turn that spectrogram into sound. LongCat-AudioDiT skips the Mel-spectrogram step entirely: a diffusion model generates speech directly in the waveform latent space, which avoids the errors introduced by those conversions.

Question: Why is the removal of Mel-spectrograms important for voice cloning?

Mel-spectrograms are an intermediate representation that can lead to "cascaded errors," where mistakes in the first stage of generation carry over into, and can be amplified in, the final audio. Removing them lets the model learn from the audio signal directly, which the team reports yields more accurate, higher-quality voice clones.

Question: What is the benefit of using a Diffusion Transformer (DiT) in this model?

The Diffusion Transformer generates audio by gradually refining noise into a clean signal. Applied to the waveform latent space, it helps the model capture the complex characteristics of a specific voice, which is essential for high-quality zero-shot cloning.
