Meituan LongCat Releases General 365 Reasoning Benchmark: Top Models Struggle to Surpass 63% Accuracy

The Meituan LongCat team has officially open-sourced General 365, a new benchmark designed to evaluate the reasoning capabilities of large language models. In a comprehensive assessment involving 26 mainstream AI models, the results highlight a significant performance gap in complex reasoning. Gemini 3 Pro, currently the top-performing model in this evaluation, achieved an accuracy rate of only 62.8%. Notably, the vast majority of the models tested failed to reach the 60% accuracy threshold, which is considered the passing mark for this benchmark. This release aims to establish a more rigorous standard for AI reasoning, exposing the current limitations of even the most advanced models in the industry.

Meituan Technical Team

Key Takeaways

  • New Reasoning Benchmark: Meituan's LongCat team has officially released and open-sourced "General 365," a specialized tool for evaluating AI reasoning.
  • Comprehensive Testing: The benchmark was used to assess 26 mainstream large language models to determine their logical and reasoning proficiency.
  • Performance Ceiling: Gemini 3 Pro emerged as the leader in the test, yet it only managed an accuracy rate of 62.8%.
  • Widespread Underperformance: Most models involved in the study were unable to reach the 60% passing threshold, indicating a significant challenge in current AI reasoning capabilities.

In-Depth Analysis

The Emergence of General 365: A New Standard in Reasoning

The Meituan LongCat team has introduced General 365 at a critical juncture in the evolution of artificial intelligence. As large language models (LLMs) become increasingly integrated into complex workflows, the need for a rigorous, specialized evaluation of their reasoning capabilities has become paramount. By open-sourcing General 365, the LongCat team is providing the global AI community with a new "yardstick" to measure progress. This benchmark is specifically designed to move beyond simple knowledge retrieval and focus on the intricate logical processes that define true reasoning.

Testing 26 mainstream models gives a broad cross-section of the current AI landscape, so the findings are not tied to a single architecture or provider and instead reflect the general state of the industry. The results suggest that General 365 sets a high bar, challenging models in ways that existing benchmarks might not and revealing how deep their reasoning abilities actually go.

Analyzing the Performance Gap and the 60% Threshold

The data released by the LongCat team reveals a significant performance gap in AI reasoning. Gemini 3 Pro, a model recognized for its advanced capabilities, achieved only a 62.8% accuracy rate, which means the strongest model tested still answered more than a third of the items incorrectly. This score represents the current "ceiling" of performance on the General 365 benchmark, suggesting that even the industry's most sophisticated models have a long way to go before mastering complex reasoning tasks.

Perhaps more concerning is the observation that the vast majority of the 26 models tested could not even reach the 60% mark. In many academic and professional contexts, 60% is considered the minimum passing grade. The failure of most mainstream models to hit this target on General 365 indicates that the benchmark has successfully identified a widespread limitation in current LLM development. This "60% barrier" serves as a clear indicator that while models are becoming more fluent and knowledgeable, their ability to consistently apply logic and reason through complex problems remains a significant hurdle.
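The arithmetic behind these figures can be illustrated with a minimal Python sketch. It assumes accuracy is simply the fraction of benchmark items answered correctly and that a score "reaches" the threshold when it is at least 60%; the function names and the toy grading list are hypothetical and are not taken from the General 365 release.

```python
from typing import Sequence

PASS_THRESHOLD = 0.60  # the 60% passing mark cited in the article


def accuracy(graded: Sequence[bool]) -> float:
    """Return the fraction of items marked correct (assumed definition)."""
    if not graded:
        raise ValueError("no graded items")
    return sum(graded) / len(graded)


def passes(score: float, threshold: float = PASS_THRESHOLD) -> bool:
    """Return True when a score meets or exceeds the passing mark."""
    return score >= threshold


# Hypothetical 5-item toy run for illustration only (not General 365 data).
toy_run = [True, False, True, True, False]
toy_score = accuracy(toy_run)  # 3 correct out of 5 items

# The only reported figure used here: the top score of 62.8%.
reported_top_score = 0.628
top_model_passes = passes(reported_top_score)
```

Under these assumptions, the reported top score clears the passing mark by only 2.8 percentage points.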

Industry Impact

The introduction of General 365 is poised to have a lasting impact on the AI industry by shifting the focus of model evaluation. For a long time, the industry has prioritized scale and general knowledge, but Meituan's new benchmark highlights that reasoning is the next major frontier. By making General 365 open-source, the LongCat team is encouraging transparency and healthy competition among AI developers.

This benchmark provides a clear target for research teams worldwide. The reported data points, a 62.8% top score and most models falling below 60%, provide a baseline that will likely drive future innovations in model architecture and training methodologies. As developers work to improve on the scores recorded on General 365, we can expect a renewed focus on logical consistency and multi-step reasoning, which are essential for the next generation of AI applications.

Frequently Asked Questions

Question: What is General 365 and who developed it?

General 365 is a reasoning evaluation benchmark developed and open-sourced by the Meituan LongCat team. It is designed to provide a rigorous standard for testing the logical reasoning capabilities of large language models.

Question: How did the top AI models perform on this benchmark?

According to the test results of 26 mainstream models, Gemini 3 Pro was the top performer with an accuracy of 62.8%. However, the majority of the other models tested failed to reach the 60% accuracy threshold, highlighting a general struggle with the reasoning tasks presented in the benchmark.

Question: Why is General 365 considered a "new yardstick" for the industry?

It is considered a new yardstick because it sets a high difficulty level that current mainstream models struggle to meet. By focusing specifically on reasoning and revealing that most models score below 60%, it establishes a more challenging standard for evaluating the reasoning ability of AI systems.
