LARYBench Released: Defining the ImageNet for Embodied Action Representation and Measuring Generalization from Human Videos
Research Breakthrough · Embodied AI · Computer Vision · Machine Learning

The Meituan Technical Team has officially released LARYBench (Latent Action Representation Yielding Benchmark), a systematic evaluation framework designed to guide the learning of general latent action representations from large-scale visual data. The benchmark is positioned as an 'ImageNet' for action representation in embodied AI. Its experimental findings indicate that general vision models significantly outperform specialized embodied AI action expert models in both action generalization and control precision. The research also demonstrates that embodied action representations can emerge directly from large-scale human video data. Together, these results offer a new way to measure how AI systems translate visual observation into physical action capabilities.

Meituan Technical Team

Key Takeaways

  • Introduction of LARYBench: A systematic benchmark designed to evaluate latent action representations learned from large-scale visual datasets.
  • Superiority of General Models: General vision models demonstrate significantly better performance in action generalization and control precision compared to specialized embodied AI expert models.
  • Emergence from Human Videos: The benchmark results indicate that embodied action representations can emerge from observing large-scale human video data without explicit action labels.
  • A New Industry Standard: LARYBench is positioned as the 'ImageNet' for the embodied AI field, providing a standardized metric for generalization and precision.

In-Depth Analysis

The Framework of LARYBench

LARYBench (Latent Action Representation Yielding Benchmark) offers a systematic way to evaluate embodied intelligence. By focusing on "latent action representation," the benchmark addresses the gap between seeing an action and capturing the underlying mechanics required to replicate it. It is designed to guide learning from massive visual datasets, turning passive observation into representations that can support action. Its evaluation protocol lets researchers measure how effectively a model extracts action-oriented features from raw pixels, a capability that is fundamental to autonomous agents and robotics.
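To make the idea of measuring action-oriented features concrete, here is a minimal sketch of one common way to probe frozen representations. It is not LARYBench's actual protocol, which the announcement does not describe. Every name here is hypothetical: `latent_actions` assumes a latent action can be approximated by the difference between consecutive frame embeddings, and `probe_accuracy` assumes a nearest-centroid classifier as the probe.

```python
from collections import defaultdict
from math import dist


def latent_actions(frame_embeddings):
    """Assumed latent action: difference between consecutive frame embeddings."""
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(frame_embeddings, frame_embeddings[1:])
    ]


def fit_centroids(features, labels):
    """Average the feature vectors of each action label (the probe)."""
    groups = defaultdict(list)
    for vec, label in zip(features, labels):
        groups[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def predict(centroids, vec):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], vec))


def probe_accuracy(train_x, train_y, test_x, test_y):
    """Fit on one split and report the fraction of held-out samples labeled correctly."""
    centroids = fit_centroids(train_x, train_y)
    hits = sum(predict(centroids, x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)


# Toy 2-D "embeddings" of a short clip: move right twice, then move up twice.
clip = [[0.0, 0.0], [1.0, 0.1], [2.1, 0.0], [2.0, 1.0], [2.1, 2.1]]
actions = latent_actions(clip)
labels = ["right", "right", "up", "up"]

# Train on the first sample of each action, test on the second.
accuracy = probe_accuracy(
    [actions[0], actions[2]], [labels[0], labels[2]],
    [actions[1], actions[3]], [labels[1], labels[3]],
)
print(accuracy)
```

In a real evaluation the embeddings would come from a frozen vision encoder, and the probe would be scored on held-out scenes or embodiments. A higher held-out score would then suggest that the encoder's features carry more action information.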

General Vision Models vs. Specialized Experts

One of the most striking results from the LARYBench experiments is the performance gap between general-purpose vision models and specialized embodied AI action expert models. Traditionally, the industry has leaned toward niche models trained specifically for robotic control or embodied tasks. However, LARYBench shows that general vision models, which are trained on broad and diverse visual data, generalize actions across different scenarios better than the specialized models do. They also exhibit higher control precision. This suggests that the foundational visual features learned by large-scale general models are more robust and adaptable for embodied tasks than the features captured by models with a narrower, task-specific focus.

Action Representation Emergence from Human Videos

The benchmark provides empirical evidence that embodied action representations can emerge from human video data. This implies that AI models do not necessarily require direct robotic telemetry or specialized sensor data to understand physical movement. Instead, by processing large-scale videos of humans performing various tasks, these models can form a latent understanding of action. This "emergence" is a critical finding: it suggests that the large body of existing human video content could serve as a primary training ground for embodied AI, which could significantly lower the barrier to training sophisticated robotic systems.

Industry Impact

The release of LARYBench could reshape the development trajectory of embodied AI. By providing a standardized metric, much as ImageNet did for computer vision, it allows objective comparisons between different architectural approaches. The finding that general vision models excel in this domain may consolidate research efforts, shifting the focus from building specialized action models to fine-tuning large-scale general vision models for physical tasks. This could accelerate the deployment of more precise and adaptable robots in real-world environments as the industry moves toward using human video data as a scalable resource for learning complex physical interactions.

Frequently Asked Questions

Question: What is the primary purpose of LARYBench?

LARYBench is a systematic evaluation benchmark designed to measure and guide the learning of general latent action representations from large-scale visual data, serving as a standard for the embodied AI industry.

Question: Why are general vision models performing better than specialized models in this benchmark?

According to the experimental results, general vision models show significantly better action generalization and control precision, suggesting that broad visual training provides a more robust foundation for understanding actions than specialized, task-specific training.

Question: Can AI learn to move just by watching human videos?

At the level of representation, yes: LARYBench demonstrates that embodied action representations can emerge from large-scale human video data, allowing models to learn the latent structures of action through visual observation.
