Meituan Tech Team Launches LARYBench: A New Benchmark for General Latent Action Representation in Embodied AI
Research Breakthrough | Embodied AI | LARYBench | Computer Vision

The Meituan Technology Team has released LARYBench (Latent Action Representation Yielding Benchmark), a systematic evaluation framework designed to guide the learning of general latent action representations from large-scale visual data. Positioned as the 'ImageNet' of embodied action, LARYBench provides a standardized way to measure how well models learn action representations from human video datasets. Experiments reported with the benchmark indicate that general-purpose vision models significantly outperform specialized embodied AI expert models in both action generalization and control precision. According to the team, the results show that embodied action representations can emerge from massive human video data, a finding that could change how researchers approach robotic control and autonomous system training.

Meituan Technology Team

Key Takeaways

  • Introduction of LARYBench: A systematic evaluation benchmark designed to define and measure general latent action representations in embodied AI.
  • Superiority of General Models: Experimental results show that general vision models outperform specialized action expert models in generalization and precision.
  • Emergence from Human Video: The benchmark demonstrates that embodied action representations can emerge from large-scale human video data without specialized robotic training.
  • The 'ImageNet' Moment: LARYBench aims to provide the same level of standardization for embodied AI that ImageNet provided for computer vision.

In-Depth Analysis

Defining the 'ImageNet' for Embodied Action Representation

The release of LARYBench (Latent Action Representation Yielding Benchmark) by the Meituan Technology Team represents a significant milestone in the evolution of embodied AI. For years, the field has lacked a unified, systematic benchmark to evaluate how well models can translate visual information into actionable, latent representations. By positioning LARYBench as the 'ImageNet' for embodied action, the researchers are establishing a foundational framework that allows for the objective measurement of general latent action representations. This system focuses on learning from large-scale visual data, providing a structured path for models to bridge the gap between seeing an action and understanding the underlying mechanics required to perform it.

General Vision Models vs. Specialized Action Experts

One of the most striking results from the LARYBench experiments is the performance gap between general-purpose vision models and specialized embodied AI action expert models. Traditionally, the industry has leaned toward developing 'expert' models, that is, systems architected and trained specifically for narrow robotic tasks. The LARYBench results point the other way: general vision models, which are trained on broader and more diverse datasets, exhibit significantly better action generalization and control precision. This indicates that the features learned by general models transfer to embodied tasks more robustly than the features learned by narrowly trained experts. The ability of these general models to maintain high precision while adapting to new environments is an important finding for building embodied AI that scales.
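The article does not describe how LARYBench scores a model, so the following is only a toy sketch of one common way to compare frozen representations: a probe. It assumes a latent action is the difference between the embeddings of two consecutive frames, and it uses a nearest-neighbor probe on synthetic 2-D data. Every name here (`latent_action`, `probe_error`, `full_encoder`, `lossy_encoder`) is hypothetical, and the two encoders are stand-ins, not models evaluated by the benchmark.

```python
import math
import random

def latent_action(encoder, before, after):
    """Assumed definition: latent action = embedding(after) - embedding(before)."""
    z0, z1 = encoder(before), encoder(after)
    return [b - a for a, b in zip(z0, z1)]

def probe_predict(train_latents, train_actions, query):
    """Return the action paired with the training latent closest to the query."""
    nearest = min(range(len(train_latents)),
                  key=lambda i: math.dist(train_latents[i], query))
    return train_actions[nearest]

def probe_error(encoder, train, test):
    """Mean squared action error of the probe on held-out transitions."""
    train_latents = [latent_action(encoder, b, a) for b, a, _ in train]
    train_actions = [action for _, _, action in train]
    total = 0.0
    for before, after, action in test:
        query = latent_action(encoder, before, after)
        pred = probe_predict(train_latents, train_actions, query)
        total += sum((p - t) ** 2 for p, t in zip(pred, action))
    return total / len(test)

def make_transitions(n, rng):
    """Synthetic data: a 'frame' is a 2-D position, an action is a displacement."""
    data = []
    for _ in range(n):
        before = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        action = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        after = [before[0] + action[0], before[1] + action[1]]
        data.append((before, after, action))
    return data

def full_encoder(frame):
    """Toy encoder that keeps both coordinates."""
    return list(frame)

def lossy_encoder(frame):
    """Toy encoder that discards the second coordinate."""
    return [frame[0]]

rng = random.Random(0)
train, test = make_transitions(200, rng), make_transitions(50, rng)
full_err = probe_error(full_encoder, train, test)
lossy_err = probe_error(lossy_encoder, train, test)
print(f"full encoder error: {full_err:.4f}  lossy encoder error: {lossy_err:.4f}")
```

In this toy setting the encoder that preserves more of the scene yields the lower probe error, which is the kind of comparison a representation benchmark makes at a much larger scale. The actual LARYBench metrics and protocol may differ.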

The Emergence of Action from Human Video Data

LARYBench provides empirical evidence for a concept that has long been theorized: the emergence of embodied action representations from large-scale human video data. Rather than requiring exclusively robotic or synthetic data, the benchmark shows that models can extract meaningful action representations simply by observing human behavior in videos. This 'emergence' suggests that the fundamental laws of motion, interaction, and spatial awareness are embedded within the vast quantities of human video data available today. By leveraging this data, LARYBench demonstrates that models can develop a sophisticated understanding of action that is both generalizable and precise, potentially reducing the reliance on expensive, specialized robotic data collection.

Industry Impact

The introduction of LARYBench could reshape how embodied intelligence is developed and evaluated by giving the field a common standard. By reporting that general vision models outperform specialized ones on the benchmark, it encourages a shift in resource allocation toward large-scale general model training. This could accelerate the development of more versatile robots capable of performing a wide array of tasks in unstructured environments. Furthermore, the ability to learn from human video data lowers the barrier to entry for training embodied systems, since it draws on existing, massive datasets rather than requiring specialized hardware for data generation. LARYBench provides metrics for tracking progress in this direction, with the aim of making 'action' as measurable and scalable as 'recognition' became in the previous decade.

Frequently Asked Questions

Question: What is the primary purpose of LARYBench?

LARYBench is a systematic evaluation benchmark designed to guide and measure the learning of general latent action representations from large-scale visual data, acting as a standard for the embodied AI field.

Question: Why do general vision models perform better than specialized expert models in this benchmark?

According to the experimental results, general vision models demonstrate superior action generalization and control precision because they learn more robust and adaptable features from diverse data, whereas specialized models may be too narrow to handle varied embodied tasks effectively.

Question: Can AI models learn how to move just by watching human videos?

Yes, the LARYBench findings indicate that embodied action representations can emerge from large-scale human video data, allowing models to learn generalized action patterns without needing specialized robotic training data.
