Meituan LongCat Team Launches WBench: The First Multi-Round Benchmark for Interactive Video World Models
Research Breakthrough, Meituan, World Models, AI Benchmarking

The Meituan LongCat team has introduced and open-sourced WBench, the first systematic multi-round evaluation benchmark designed for interactive video world models. Positioned as a diagnostic tool analogous to a "CT scanner," WBench is built to pinpoint the limitations that AI models encounter as they move from passive viewing to active, multi-turn interaction. Its structured assessment framework aims to clarify the boundaries of current world models and gives the research community a way to identify where models fail to maintain consistency and responsiveness during interactive tasks. The release is a step toward standardized world model evaluation, with a focus on dynamic, user-driven environments.

Meituan Technical Team

Key Takeaways

  • Introduction of WBench: Meituan's LongCat team has developed and open-sourced WBench, the first systematic benchmark for interactive video world models.
  • Focus on Interaction: Unlike traditional benchmarks, WBench emphasizes multi-round evaluation, moving beyond passive video generation to active user interaction.
  • Diagnostic Precision: The tool is described as a "CT scanner" for AI, capable of identifying specific bottlenecks in how models process and respond to interactive prompts.
  • Open-Source Contribution: By making WBench open-source, Meituan provides a standardized framework for the global AI research community to measure world model performance.

In-Depth Analysis

Transitioning from Passive Viewing to Active Interaction

The emergence of world models has largely been defined by their ability to generate realistic video sequences from static prompts. However, the Meituan LongCat team identifies a significant gap in the current landscape: the transition from "passive viewing" to "active interaction." While existing models can create visually impressive content, their ability to function as a truly interactive "world," in which user inputs lead to consistent, logical, multi-stage consequences, remains a primary challenge.

WBench is designed to address this specific boundary. By focusing on interactive video world models, the benchmark shifts the evaluation criteria from mere visual fidelity to functional interactivity. This transition is crucial for the development of AI systems that can serve as simulators for real-world or virtual environments, where the model must not only predict the next frame but also respond dynamically to changing variables introduced by a user over multiple rounds of engagement.

The "CT Scanner" for World Models

The Meituan technical team describes WBench as a "CT scanner." The metaphor implies that the benchmark does more than provide a pass/fail grade: it offers a diagnostic look at the logic and consistency of a world model's behavior. In multi-round interactions, a model might perform well in the initial stage but lose coherence as the interaction progresses.

WBench’s systematic approach allows researchers to see exactly where the "breakage" occurs. Whether the model fails to maintain spatial consistency, loses track of object permanence, or fails to adhere to the causal logic of the interaction, WBench provides the data necessary to pinpoint these failures. This level of granularity is essential for iterative development, allowing engineers to move beyond trial-and-error and toward targeted improvements in model architecture and training data.
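As a rough illustration of this kind of per-round diagnosis, the sketch below takes per-round scores for a few failure dimensions and reports the first round in which each one drops below a threshold. The dimension names, the score values, and the 0.5 threshold are illustrative assumptions, not WBench's actual metrics or output format.

```python
from typing import Dict, List, Optional

# Hypothetical per-round scores in [0, 1]. The dimension names and values
# are made up for illustration and do not come from WBench.
ROUND_SCORES: Dict[str, List[float]] = {
    "spatial_consistency": [0.92, 0.88, 0.61, 0.43],
    "object_permanence": [0.95, 0.90, 0.87, 0.82],
    "causal_logic": [0.89, 0.47, 0.40, 0.35],
}


def first_failing_round(scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """Return the 1-based index of the first round scoring below `threshold`.

    Returns None if every round stays at or above the threshold.
    """
    for round_index, score in enumerate(scores, start=1):
        if score < threshold:
            return round_index
    return None


def diagnose(all_scores: Dict[str, List[float]], threshold: float = 0.5) -> Dict[str, Optional[int]]:
    """Map each dimension to the round where it first breaks, if any."""
    return {
        dimension: first_failing_round(scores, threshold)
        for dimension, scores in all_scores.items()
    }


if __name__ == "__main__":
    for dimension, failing_round in diagnose(ROUND_SCORES).items():
        status = "no failure" if failing_round is None else f"breaks at round {failing_round}"
        print(f"{dimension}: {status}")
```

A report of this shape says more than a single aggregate score: it separates a model that fails early on causal logic from one that slowly loses spatial consistency.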

Systematic Multi-Round Evaluation Framework

The core innovation of WBench lies in its "multi-round" evaluation structure. Traditional benchmarks often evaluate a single output based on a single input. However, a true world model must be able to sustain a narrative or a physical simulation over time. WBench tests the model's endurance and consistency across sequential interactions, which is a much higher bar for performance.
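The difference between single-shot and multi-round evaluation can be sketched as a loop in which each round builds on the state left by the previous one. The `ToyWorldModel` class, its `step` method, and the position-error score below are hypothetical stand-ins chosen to keep the example runnable; they are not WBench's interface or scoring method.

```python
from dataclasses import dataclass
from typing import List, Tuple

Action = Tuple[float, float]


@dataclass
class ToyWorldModel:
    """Hypothetical stand-in for an interactive world model.

    The state is a 2D position. Each step adds a fixed error on the x axis
    to mimic the drift a real model might accumulate over rounds.
    """
    drift_per_step: float = 0.1
    position: Tuple[float, float] = (0.0, 0.0)

    def step(self, action: Action) -> Tuple[float, float]:
        x, y = self.position
        dx, dy = action
        self.position = (x + dx + self.drift_per_step, y + dy)
        return self.position


def evaluate_multi_round(model: ToyWorldModel, actions: List[Action]) -> List[float]:
    """Apply actions in sequence and return the per-round position error.

    The ground truth is the drift-free sum of actions from the model's
    starting position, so the error reflects accumulated inconsistency.
    """
    expected_x, expected_y = model.position
    errors: List[float] = []
    for dx, dy in actions:
        expected_x += dx
        expected_y += dy
        predicted_x, predicted_y = model.step((dx, dy))
        errors.append(abs(predicted_x - expected_x) + abs(predicted_y - expected_y))
    return errors


if __name__ == "__main__":
    actions = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
    for round_index, error in enumerate(evaluate_multi_round(ToyWorldModel(), actions), start=1):
        print(f"round {round_index}: error = {error:.2f}")
```

A single-round evaluation of this toy model would see only the small first-round error. The growth of the error across rounds becomes visible only when the interaction is evaluated as a sequence.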

By open-sourcing this benchmark, Meituan is setting a new standard for how the industry perceives "world models." It suggests that a model cannot be considered a true world model if it cannot handle the complexities of a multi-turn dialogue or action sequence within a visual environment. This systematic framework provides a clear roadmap for what the next generation of AI models must achieve to be considered truly interactive.

Industry Impact

The release of WBench by the Meituan LongCat team has significant implications for the AI industry, particularly in the fields of autonomous systems, gaming, and virtual simulation. By providing a specialized tool to measure the boundaries of world models, Meituan is facilitating a more rigorous scientific approach to AI development.

Firstly, WBench fills a void in the evaluation ecosystem. As more companies claim to have developed "world models," the industry has lacked a common yardstick to verify these claims, especially regarding interactivity. WBench provides that yardstick. Secondly, the open-source nature of the project encourages collaborative improvement. Researchers worldwide can now use WBench to compare different architectures, leading to faster breakthroughs in how AI understands and simulates physical and digital reality. Finally, by identifying the specific points where models "get stuck," WBench helps prioritize research efforts toward solving the most critical bottlenecks in interactive AI, potentially accelerating the path toward more sophisticated and reliable AI-driven simulations.

Frequently Asked Questions

Question: What makes WBench different from other AI benchmarks?

Unlike standard benchmarks that focus on static image or video generation, WBench is the first to focus specifically on systematic, multi-round evaluation for interactive video world models. It evaluates how well a model can handle ongoing interactions rather than just a single prompt-to-video task.

Question: Why does the Meituan team refer to WBench as a "CT scanner"?

The term "CT scanner" is used to highlight the benchmark's ability to provide a detailed diagnostic analysis. It doesn't just measure performance; it identifies the specific points and rounds where a model's interactive capabilities fail, allowing for precise technical adjustments.

Question: Is WBench available for public use?

Yes, the Meituan LongCat team has open-sourced WBench, making it available for the broader AI research community to use, evaluate, and build upon for the advancement of interactive world models.

Related News

LongCat Open Sources VitaBench 2.0: A Pioneering Benchmark for Long-Term Dynamic User Modeling in AI Agents
Research Breakthrough

The Meituan technical team has officially open-sourced VitaBench 2.0, a groundbreaking benchmark developed under the LongCat project. This new framework is the first of its kind to focus on long-term dynamic user modeling within real-life scenarios. VitaBench 2.0 is designed to systematically evaluate the capabilities of Large Language Models (LLMs) in maintaining personalization and demonstrating proactivity throughout extended, evolving interactions. By shifting the focus from static, short-term tasks to complex, real-world user relationships, VitaBench 2.0 sets a new standard for the industry. It provides a rigorous methodology for assessing how AI agents adapt to user needs over time, ensuring that the next generation of AI is not only reactive but also deeply personalized and capable of taking initiative in dynamic environments.

LARYBench: Redefining Embodied Action Representation Through Large-Scale Human Video Learning
Research Breakthrough

The Meituan Technical Team has introduced LARYBench (Latent Action Representation Yielding Benchmark), a systematic evaluation framework designed to guide the development of general latent action representations from massive visual datasets. This benchmark serves as a critical milestone, often compared to an 'ImageNet' for embodied actions. The research findings reveal a significant shift in AI development: general-purpose vision models demonstrate superior performance in action generalization and control precision when compared to specialized embodied AI expert models. Most notably, the study confirms that embodied action representations can naturally emerge from large-scale human video data, suggesting that the vast library of human motion can be a primary source for training sophisticated robotic control systems without the need for exclusive robotic telemetry.

Meituan LongCat Team Releases General 365 Benchmark Revealing Significant Reasoning Gaps in Leading AI Models
Research Breakthrough

The Meituan LongCat team has officially introduced General 365, a new benchmark designed to evaluate the reasoning capabilities of large language models (LLMs). In a comprehensive assessment of 26 mainstream models, the results indicate a challenging landscape for current AI technology. Even Gemini 3 Pro, currently regarded as one of the most powerful models available, achieved an accuracy rate of only 62.8%. The benchmark results further reveal that the vast majority of tested models failed to reach a 60% accuracy threshold, which is often considered a basic passing grade. This release by Meituan's technical team establishes a rigorous new standard for measuring AI reasoning, highlighting that most current models still struggle with complex logical tasks.