LARYBench Release: Defining the ImageNet for Embodied Action Representations and Measuring Generalization from Human Videos
Research Breakthrough, Embodied AI, Computer Vision, LARYBench

The Meituan Technical Team has released LARYBench (Latent Action Representation Yielding Benchmark), a systematic evaluation framework for general latent action representations learned from large-scale visual data. The benchmark gives embodied AI a standardized way to measure how well models learn actions from human video. Its experiments report a notable result: general-purpose vision models significantly outperform specialized embodied action expert models in both action generalization and control precision. The research also finds that embodied action representations can emerge from large-scale human video datasets, which suggests a path toward training autonomous agents without relying solely on narrow, task-specific datasets.

Meituan Technical Team

Key Takeaways

  • Introduction of LARYBench: The Latent Action Representation Yielding Benchmark is a systematic benchmark for evaluating general latent action representations derived from large-scale visual data.
  • Superiority of General Models: Experimental results show that general vision models outperform specialized embodied AI expert models in both action generalization and control precision.
  • Emergence from Human Video: The benchmark results indicate that embodied action representations can emerge from large-scale human video data, rather than requiring specialized robotic data alone.
  • Systematic Evaluation: LARYBench provides a structured methodology for measuring how well models translate visual information into action representations usable by embodied agents.

In-Depth Analysis

The Shift from Specialized Experts to General Vision Models

The release of LARYBench highlights a turning point in the development of embodied AI. Traditionally, the field has relied on "action expert models": AI systems designed and trained specifically for narrow embodied tasks. However, the experimental results reported with LARYBench indicate that general vision models, which are trained on broader and more diverse visual datasets, now achieve better results.

This performance gap is particularly evident in two key metrics: action generalization and control precision. Generalization refers to the model's ability to apply learned actions to new, unseen environments or tasks, while control precision measures the accuracy of the physical movements executed by the agent. The fact that general vision models excel in these areas suggests that the underlying features learned from diverse visual data are more robust and adaptable than the specialized features learned by niche expert models. LARYBench serves as the first systematic tool to quantify this advantage, effectively acting as an "ImageNet" for the field of embodied action.
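
As a rough illustration of how these two quantities could be scored, the sketch below uses stand-in definitions that are assumptions, not LARYBench's published protocol: control precision is approximated by the mean squared error between predicted and reference action vectors, and action generalization by classification accuracy on tasks held out from training. The function names and the toy data are hypothetical.

```python
from statistics import mean


def control_mse(predicted, reference):
    """Mean squared error over every dimension of every action vector.

    Assumption: an illustrative stand-in for "control precision"
    (lower is better), not the benchmark's actual metric.
    """
    squared_errors = [
        (p - r) ** 2
        for pred_vec, ref_vec in zip(predicted, reference)
        for p, r in zip(pred_vec, ref_vec)
    ]
    return mean(squared_errors)


def accuracy(predicted_labels, true_labels):
    """Fraction of action labels predicted correctly (higher is better)."""
    hits = [p == t for p, t in zip(predicted_labels, true_labels)]
    return sum(hits) / len(hits)


# Toy data, invented for illustration only.
# "Seen" tasks appeared during training; "unseen" tasks were held out.
seen_acc = accuracy(["pick", "push", "pour"], ["pick", "push", "pour"])
unseen_acc = accuracy(["open", "pick", "twist"], ["open", "close", "twist"])

# Assumed proxy for generalization: how much accuracy drops on unseen tasks.
generalization_gap = seen_acc - unseen_acc

# Two predicted 2-D action vectors compared with their references.
mse = control_mse([[0.5, 0.0], [1.0, 1.0]], [[0.5, 1.0], [1.0, 0.0]])

print(f"seen accuracy:      {seen_acc:.2f}")
print(f"unseen accuracy:    {unseen_acc:.2f}")
print(f"generalization gap: {generalization_gap:.2f}")
print(f"control MSE:        {mse:.2f}")
```

Under these assumed definitions, a smaller gap between seen and unseen accuracy would indicate better generalization, and a lower error would indicate more precise control.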

Emergence of Action Representations from Human Video Data

One of the most significant findings facilitated by LARYBench is the confirmation that embodied action representations can "emerge" from large-scale human video data. This implies that AI models do not necessarily need to be trained exclusively on robotic telemetry or specialized embodied datasets to understand the mechanics of action. By observing human movements and interactions within vast video libraries, these models can internalize latent representations of how actions are performed in the physical world.

LARYBench provides the metrics to measure this emergence, showing that the transition from passive observation (watching videos) to active representation (understanding actions) is both possible and effective. This finding supports the use of massive, unlabelled human video datasets as a primary resource for training the next generation of embodied AI, potentially reducing the reliance on expensive and difficult-to-collect robotic demonstration data.

Industry Impact

The introduction of LARYBench is poised to reshape the research and development priorities within the AI and robotics industries. By establishing a systematic benchmark, it allows researchers to move away from anecdotal evidence of model performance and toward a standardized, data-driven evaluation of latent action representations.

The finding that general vision models are superior to specialized experts suggests that the path to advanced robotics may lie in scaling general-purpose foundation models rather than building fragmented, task-specific systems. This could lead to a consolidation of efforts around large-scale visual pre-training. Furthermore, the ability to leverage human video data for action learning opens up a nearly inexhaustible source of training material, which could significantly accelerate the deployment of embodied AI in complex, real-world environments. LARYBench provides the necessary yardstick to measure progress in this new direction, ensuring that developments in action generalization and control precision are accurately tracked and optimized.

Frequently Asked Questions

Question: What exactly is LARYBench?

LARYBench stands for Latent Action Representation Yielding Benchmark. It is a systematic evaluation system designed to measure how effectively general latent action representations can be learned from large-scale visual data, specifically focusing on their application in embodied AI.

Question: Why are general vision models performing better than specialized action experts?

According to the experimental results from LARYBench, general vision models demonstrate significantly better action generalization and control precision. This is likely due to the broader range of visual features and contexts these models encounter during training, which allows them to develop more robust representations that translate better to various embodied tasks compared to models trained on narrow, specialized datasets.

Question: Can robots really learn to move just by watching human videos?

The findings associated with LARYBench indicate that embodied action representations can indeed emerge from large-scale human video data. This means that the fundamental understanding of how actions are structured and executed can be derived from observing human behavior, which can then be applied to robotic control and embodied intelligence.
