Meituan LongCat Team Open-Sources WBench: The First Systematic Multi-Round Benchmark for Interactive Video World Models
Research Breakthrough, World Models, Meituan, AI Benchmarking


The Meituan LongCat team has released and open-sourced WBench, an evaluation framework for measuring the capabilities of interactive video world models. Described as the first systematic multi-round benchmark of its kind, WBench is meant to act as a diagnostic "CT scanner" for the AI industry: it pinpoints the specific technical hurdles models face when moving from passive video generation to active, multi-round interaction. By evaluating performance across diverse scenarios, ranging from lunar exploration to complex cyber-city environments, WBench aims to set a standard for assessing how world models understand and react to interactive prompts. The open-source release is intended to give researchers the tools to identify where current models fail and how to push the boundaries of interactive artificial intelligence.

Meituan Technical Team

Key Takeaways

  • Pioneering Benchmark: Meituan's LongCat team has introduced WBench, the first systematic multi-round evaluation benchmark specifically for interactive video world models.
  • Diagnostic Capabilities: Described as a "CT scanner," WBench is designed to precisely locate the technical bottlenecks that prevent world models from achieving seamless interaction.
  • Shift to Active Interaction: The benchmark focuses on the transition from "passive viewing" (simple video generation) to "active interaction" (responding to multi-round inputs).
  • Open-Source Contribution: By open-sourcing WBench, Meituan provides the global research community with a standardized tool to measure and improve world model boundaries.
  • Diverse Testing Scenarios: The framework evaluates models across a wide spectrum of environments, including "Moonwalks" and "Cyber Cities," to test the limits of spatial and temporal consistency.

In-Depth Analysis

The Diagnostic Evolution: WBench as a "CT Scanner" for AI

In the rapidly evolving landscape of generative AI, world models have emerged as a critical frontier. However, evaluating these models has historically been a challenge due to the lack of standardized metrics for interactivity. The Meituan LongCat team addresses this gap with WBench. By framing the benchmark as a "CT scanner," the team emphasizes a shift from holistic, often subjective, assessments to precise, diagnostic evaluations. Just as a medical scanner identifies internal structural issues, WBench is engineered to identify exactly where a world model's logic or consistency breaks down during the interactive process. This level of granularity is essential for researchers who need to understand whether a model's failure stems from a lack of physical common sense, poor temporal coherence, or an inability to process sequential user commands.
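To make the diagnostic idea concrete, here is a minimal Python sketch of how per-round scores might be grouped by failure type so that the weakest dimension stands out. This is not WBench's actual API or scoring scheme. The function name `weakest_dimension`, the three dimension names, and the 0-to-1 score scale are assumptions made purely for illustration.

```python
from statistics import mean

def weakest_dimension(round_scores):
    """Return the dimension with the lowest average score, plus all averages.

    round_scores: list of dicts, one per interaction round, each mapping a
    (hypothetical) dimension name to a score in [0, 1].
    """
    dimensions = round_scores[0].keys()
    averages = {d: mean(r[d] for r in round_scores) for d in dimensions}
    worst = min(averages, key=averages.get)
    return worst, averages

# Illustrative scores only; these are not results from any real model.
example_scores = [
    {"physical_common_sense": 0.9, "temporal_coherence": 0.6, "command_following": 0.8},
    {"physical_common_sense": 0.8, "temporal_coherence": 0.4, "command_following": 0.7},
]
worst, averages = weakest_dimension(example_scores)
print(worst)  # temporal_coherence
```

Reporting a score per dimension, rather than one overall number, is what lets a researcher tell a physics failure apart from a command-following failure.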

Bridging the Gap: From Passive Viewing to Active Interaction

Most current video generation models excel at "passive viewing": creating a single, coherent video clip from a static prompt. The true potential of a "world model," however, lies in its ability to act as a simulator that users can interact with in real time. WBench is specifically designed to measure this transition, and its "multi-round" design is its most significant innovation. Instead of testing a single output, WBench evaluates how well a model maintains consistency and logic over several rounds of interaction. This simulates real-world applications where an AI must navigate a changing environment, such as a lunar landscape or a futuristic city, while responding to continuous user input. By measuring the boundaries of these interactions, WBench highlights the current limitations of AI in maintaining a stable "world state" over time.
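The multi-round idea can be sketched as a simple loop: feed one command per round, score each transition, and record the first round where the score drops. The Python below is a toy illustration under stated assumptions, not WBench's implementation. `run_multi_round`, the `step` and `score` callables, the threshold, and the integer "world state" are all hypothetical stand-ins for a real video model and a real consistency metric.

```python
from typing import Callable, List, Optional, Tuple

def run_multi_round(
    step: Callable[[object, str], object],
    score: Callable[[object, object, str], float],
    initial_state: object,
    commands: List[str],
    threshold: float = 0.5,
) -> Tuple[List[float], Optional[int]]:
    """Apply commands one round at a time and score every transition.

    Returns the per-round scores and the 1-based index of the first round
    whose score falls below `threshold` (None if no round does).
    """
    scores: List[float] = []
    first_failure: Optional[int] = None
    state = initial_state
    for round_no, command in enumerate(commands, start=1):
        next_state = step(state, command)
        s = score(state, next_state, command)
        scores.append(s)
        if first_failure is None and s < threshold:
            first_failure = round_no
        state = next_state
    return scores, first_failure

def toy_step(position, command):
    # Toy "model": always moves forward, so it ignores "back" commands.
    return position + 1

def toy_score(prev, curr, command):
    # Toy metric: 1.0 if the new position matches the commanded move, else 0.0.
    expected = prev + (1 if command == "forward" else -1)
    return 1.0 if curr == expected else 0.0

scores, first_failure = run_multi_round(
    toy_step, toy_score, 0, ["forward", "forward", "back", "forward"]
)
print(scores, first_failure)  # [1.0, 1.0, 0.0, 1.0] 3
```

Because the state from each round is carried into the next, an early mistake stays visible in later rounds. Single-clip evaluation cannot capture that effect.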

Industry Impact

Standardizing World Model Evaluation

The introduction of WBench marks a significant step toward the standardization of world model research. In an industry where "world model" is often used as a broad marketing term, WBench provides a rigorous, systematic framework that defines what successful interaction actually looks like. By providing a clear set of criteria and multi-round testing protocols, it allows different research teams to compare their models' performance on a level playing field. This standardization is likely to accelerate the development of more robust AI simulators for robotics, gaming, and autonomous systems.

Accelerating Open-Source Innovation

By choosing to open-source WBench, Meituan is positioning itself as a key contributor to the global AI infrastructure. Open-sourcing such a benchmark lowers the barrier to entry for smaller research teams and academic institutions, allowing them to test their models against a shared standard without developing their own proprietary evaluation tools. This collaborative approach is expected to foster a more transparent research environment where failures are as well documented as successes, ultimately leading to faster iteration and more reliable interactive AI technologies.

Frequently Asked Questions

Question: What makes WBench different from existing video generation benchmarks?

Unlike traditional benchmarks that focus on the visual quality of a single generated video (passive viewing), WBench is the first to offer a systematic, multi-round evaluation. It specifically tests how well a model handles ongoing interaction and maintains consistency across multiple steps, which is the core requirement for a true "world model."

Question: Why does the Meituan team refer to WBench as a "CT scanner"?

The term "CT scanner" is used as a metaphor for the benchmark's diagnostic precision. Rather than just giving a model a "pass" or "fail" grade, WBench is designed to pinpoint the exact stage or round where a model's interactive capabilities break down, allowing developers to see the "internal" logic errors of the model.

Question: What kind of scenarios does WBench use for testing?

WBench utilizes a variety of complex scenarios to test the boundaries of AI world models. These include diverse environments such as "Moonwalks" (testing low-gravity physics and unique environments) and "Cyber Cities" (testing complex urban structures and high-density visual data), ensuring the models are evaluated against a wide range of physical and architectural logic.

Related News

LARYBench Released: Defining the ImageNet for Embodied Action Representations via Large-Scale Human Video Learning

The Meituan Technical Team has introduced LARYBench (Latent Action Representation Yielding Benchmark), a systematic evaluation framework designed to guide the learning of general latent action representations from extensive visual datasets. Positioned as the 'ImageNet' for embodied AI, LARYBench provides a standardized method for measuring how models understand and execute physical actions. Experimental findings reveal a significant shift in AI development: general vision models demonstrate superior performance in action generalization and control precision compared to specialized action expert models. Furthermore, the benchmark proves that embodied action representations can effectively emerge from large-scale human video data, suggesting that specialized robotic data may not be the only path to achieving high-level embodied intelligence.

Meituan LongCat Team Unveils LongCat-AudioDiT: Redefining Zero-Shot Voice Cloning via Waveform Latent Space

Meituan's LongCat team has announced a significant advancement in speech synthesis with the release of LongCat-AudioDiT. This new model aims to overcome the limitations of traditional zero-shot Text-to-Speech (TTS) systems by eliminating intermediate representations like Mel-spectrograms. Instead, it utilizes a diffusion-based approach operating directly within the waveform latent space. This method is designed to prevent the accumulation of cascade errors that often occur during multi-stage data conversion. By allowing the AI to learn the inherent patterns of sound directly, LongCat-AudioDiT pushes the boundaries of high-fidelity voice cloning and streamlined audio generation, marking a technical shift in how AI models interpret and replicate human vocal characteristics.

Meituan's ACL 2026 Research Breakthroughs: From Large Model Evaluation to Complex Reasoning Optimization

Meituan's technical team has achieved significant recognition at ACL 2026, with six papers accepted into this prestigious computational linguistics conference. The research spans a broad spectrum of cutting-edge AI fields, including large model evaluation, complex process reasoning, and the optimization of competition-level mathematical thinking. Furthermore, the papers explore advancements in reinforcement learning and the emerging field of generative recommendation. This collection of work underscores Meituan's strategic focus on refining generative paradigms and enhancing the practical capabilities of AI models in solving intricate problems and providing personalized user experiences. By addressing both theoretical benchmarks and practical application challenges, Meituan is positioning itself at the forefront of the next generation of natural language processing and artificial intelligence development.