LARYBench Released: Defining the ImageNet for Embodied Action Representation and Measuring Generalization from Human Videos
Research Breakthrough | Embodied AI | Computer Vision | Robotics

The Meituan Technology Team has released LARYBench (Latent Action Representation Yielding Benchmark), a systematic evaluation framework designed to guide the learning of general latent action representations from large-scale visual data. The benchmark is described as the 'ImageNet' for embodied action representation. Experimental results reported with the benchmark show that general vision models significantly outperform specialized embodied AI expert models in both action generalization and control precision. The results also indicate that embodied action representations can emerge from large-scale human video data, which points to a pathway toward more versatile and precise robotic control systems that do not rely solely on specialized expert demonstrations.

Meituan Technology Team

Key Takeaways

  • Introduction of LARYBench: A new systematic benchmark designed to evaluate and guide the development of general latent action representations from vast visual datasets.
  • Superiority of General Models: Findings reveal that general-purpose vision models exceed the performance of specialized embodied AI expert models in critical areas like action generalization and control precision.
  • Emergence from Human Video: The results indicate that embodied action representations can emerge from large-scale human video data, suggesting a shift away from reliance on narrow, expert-only training data.
  • Standardizing Embodied AI: LARYBench aims to serve as the 'ImageNet' for the field of embodied action, providing a unified metric for measuring how well models understand and execute physical movements.

In-Depth Analysis

Defining the 'ImageNet' for Embodied AI

The release of LARYBench (Latent Action Representation Yielding Benchmark) by the Meituan Technology Team is an attempt to standardize how the field evaluates embodied intelligence. Computer vision was transformed by ImageNet, which provided a massive, standardized dataset for object recognition, and LARYBench seeks to play a similar role for physical actions. Its systematic evaluation framework lets researchers measure how effectively a model learns 'latent action representations' (compact encodings of the movement and interaction that take place in a scene) from raw visual data. This standardization matters for a field that has often struggled with fragmented evaluation metrics and specialized, non-transferable models.
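One common way to score a frozen representation is a linear probe: fit a simple linear map from the latent vectors to ground-truth actions and check how well it predicts held-out actions. The sketch below illustrates that general idea only. It is not LARYBench's actual protocol, which the announcement does not describe; the data is synthetic, and the function name `linear_probe_r2`, the dimensions, and the two toy "encoders" are all assumptions made for illustration.

```python
import numpy as np


def linear_probe_r2(train_z, train_a, test_z, test_a, ridge=1e-3):
    """Fit a ridge-regularised linear map from latents to actions.

    Returns the R^2 score on the held-out split (1.0 is a perfect fit,
    values near or below 0 mean the latents carry little action signal).
    """
    def with_bias(z):
        # Append a constant column so the probe can learn an offset.
        return np.hstack([z, np.ones((z.shape[0], 1))])

    x_train, x_test = with_bias(train_z), with_bias(test_z)
    gram = x_train.T @ x_train + ridge * np.eye(x_train.shape[1])
    weights = np.linalg.solve(gram, x_train.T @ train_a)

    pred = x_test @ weights
    ss_res = np.sum((test_a - pred) ** 2)
    ss_tot = np.sum((test_a - test_a.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-ins (hypothetical): 400 clips, 16-d latents, 7-d actions.
rng = np.random.default_rng(0)
n, d_latent, d_action, split = 400, 16, 7, 300
actions = rng.normal(size=(n, d_action))

# Toy "informative" encoder: latents are a noisy linear mix of the actions.
mix = rng.normal(size=(d_action, d_latent))
informative = actions @ mix + 0.1 * rng.normal(size=(n, d_latent))

# Toy "uninformative" encoder: latents are unrelated to the actions.
uninformative = rng.normal(size=(n, d_latent))

r2_good = linear_probe_r2(informative[:split], actions[:split],
                          informative[split:], actions[split:])
r2_bad = linear_probe_r2(uninformative[:split], actions[:split],
                         uninformative[split:], actions[split:])
print(f"informative latents R^2:   {r2_good:.3f}")
print(f"uninformative latents R^2: {r2_bad:.3f}")
```

Keeping the probe linear is a deliberate choice in this kind of evaluation: a weak readout means a high score reflects what the representation already encodes rather than what the probe itself learned.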

Generalization vs. Specialization: A New Performance Leader

One of the most striking revelations from the LARYBench experimental results is the performance gap between general vision models and specialized embodied AI expert models. For years, the prevailing wisdom suggested that to master specific robotic or embodied tasks, one needed 'expert models' trained specifically on those tasks. However, LARYBench demonstrates that general vision models, which are trained on broader and more diverse visual information, actually exhibit significantly better action generalization. This means they can adapt to new, unseen scenarios more effectively than their specialized counterparts. Furthermore, these general models showed higher control precision, indicating that the breadth of visual understanding contributes directly to the accuracy of physical execution.

The Emergence of Action from Human Video Data

The research highlights a notable result in data utilization: embodied action representations can emerge from large-scale human video data. Traditionally, training robots has required labor-intensive expert demonstrations or simulated environments. The LARYBench results indicate that, by observing human movements in ordinary video, AI models can internalize much of the complexity of physical action. This 'emergence' suggests that the latent structure of how humans interact with the world is already embedded in the vast amounts of video data available today. Leveraging this data could ease the bottleneck of specialized data collection and allow embodied intelligence to scale through general-purpose visual learning.

Industry Impact

The introduction of LARYBench and its subsequent findings are poised to reshape the AI industry in several ways. First, it validates the trend toward 'foundation models' in robotics, suggesting that the path to better robots lies in better general vision systems rather than more narrow, task-specific ones. This could lead to a consolidation of research efforts toward large-scale visual pre-training.

Second, the discovery that human video data is a viable source for action representation lowers the barrier to entry for developing embodied AI. Companies can now look toward massive video repositories as a primary training resource. Finally, by providing a standardized benchmark, LARYBench will likely accelerate the pace of innovation, as it gives the global research community a clear target and a consistent way to measure progress in the quest for truly autonomous and capable embodied agents.

Frequently Asked Questions

Question: What exactly is LARYBench?

LARYBench stands for Latent Action Representation Yielding Benchmark. It is a systematic evaluation system developed by the Meituan Technology Team to measure and guide how AI models learn general action representations from large-scale visual data, essentially acting as a standardized testing ground for embodied AI.

Question: Why do general vision models perform better than specialized expert models in this benchmark?

According to the LARYBench results, general vision models possess superior action generalization and control precision. This is likely because their exposure to a wider variety of visual data allows them to develop a more robust and flexible understanding of movement and spatial relationships, which translates better to diverse embodied tasks than the narrow training of expert models.

Question: Can robots really learn to move just by watching human videos?

The findings from LARYBench indicate that embodied action representations can 'emerge' from large-scale human video data. This means that the fundamental principles of how to act and interact in a physical space are present in human videos, and general models are capable of extracting this information to improve their own control and generalization capabilities.
