Meituan LongCat Releases General 365: A New Benchmark Highlighting Reasoning Gaps in Leading AI Models
Industry News | Meituan | LongCat | AI Benchmarking

The Meituan LongCat team has officially launched General 365, a rigorous new benchmark designed to evaluate the reasoning capabilities of large language models. An initial assessment of 26 mainstream AI models reveals a significant performance gap across the industry. Even Gemini 3 Pro, described in the release as the strongest model currently available, achieved an accuracy of only 62.8%. The vast majority of the tested models failed to reach 60%, the score the release treats as the passing line. The benchmark sets a new, more demanding standard for AI reasoning and suggests that current models still face substantial hurdles in complex logical processing.

Source: Meituan Technical Team

Key Takeaways

  • Official Release of General 365: Meituan's LongCat team has introduced a new evaluation standard named General 365, specifically targeting AI reasoning performance.
  • Extensive Model Testing: The benchmark was used to evaluate 26 mainstream models to provide a comprehensive overview of the current state of AI capabilities.
  • Gemini 3 Pro Performance: Despite being identified as the strongest model currently available, Gemini 3 Pro only secured a 62.8% accuracy rate on the benchmark.
  • Industry-Wide Reasoning Deficit: Most models evaluated in the study scored below 60%, indicating a widespread struggle to clear the passing line under this new standard.

In-Depth Analysis

The Launch of General 365 and the 26-Model Evaluation

The Meituan LongCat team has officially entered the AI evaluation space with the release of General 365. This benchmark arrives at a critical time when the industry is seeking more accurate ways to measure the true cognitive and reasoning depths of large language models (LLMs). By testing 26 mainstream models, the LongCat team has provided a broad horizontal comparison that highlights where the industry stands today. The decision to test such a wide array of models suggests that General 365 is intended to be a universal yardstick, capable of differentiating between models that may perform well on standard conversational tasks but falter when faced with rigorous reasoning requirements.

The scope of this evaluation is significant. Covering 26 models reduces the risk that conclusions rest on a handful of systems, and it gives the comparison a better chance of reflecting the diverse architectures and training methodologies in use across the AI sector. The results from this testing phase serve as the foundational data for General 365, positioning it as a high-bar standard that challenges the status quo of AI performance metrics.

Analyzing the Performance Ceiling: Gemini 3 Pro and the 60% Threshold

One of the most striking findings from the LongCat team's report is the performance of Gemini 3 Pro. As the model currently recognized as the "strongest on Earth" (地表最强), its accuracy rate of 62.8% provides a clear indication of the benchmark's difficulty. If the industry leader is only able to clear the 60% mark by a narrow margin, it implies that General 365 targets reasoning complexities that are at the very edge of current AI capabilities. This 62.8% score sets a new ceiling for what is considered top-tier performance in the context of this specific benchmark.

Perhaps more concerning for the industry is the fact that the majority of the 26 models tested did not even reach the 60% "passing line." In many academic and professional contexts, 60% is viewed as the minimum threshold for competency. The failure of most mainstream models to hit this mark suggests that there is a significant "reasoning gap" in current AI development. While models are becoming increasingly proficient at generating fluent text and following instructions, their ability to navigate the specific reasoning challenges presented by General 365 remains limited. This data point underscores the necessity of the LongCat team's new standard, as it exposes weaknesses that other, perhaps more lenient, benchmarks might overlook.
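The arithmetic behind the "passing line" comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`accuracy`, `margin_over_passing`, `share_passing`) are hypothetical, the article does not describe how General 365 is actually scored, and the only figures taken from the source are the 60% threshold and Gemini 3 Pro's 62.8%.

```python
# Illustrative sketch only; not the LongCat team's scoring code.
PASSING_LINE = 60.0  # the "passing line" cited in the article, in percent


def accuracy(correct: int, total: int) -> float:
    """Return accuracy as a percentage of correctly answered items."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def margin_over_passing(acc_percent: float, passing: float = PASSING_LINE) -> float:
    """Return how many percentage points a score sits above (+) or below (-) the line."""
    return round(acc_percent - passing, 1)


def share_passing(scores: dict[str, float], passing: float = PASSING_LINE):
    """Return the names of models at or above the line and the fraction that passed."""
    passed = [name for name, score in scores.items() if score >= passing]
    return passed, len(passed) / len(scores)


# Gemini 3 Pro's reported score clears the line by a narrow margin.
print(margin_over_passing(62.8))  # 2.8
```

With a full score table for the 26 models (not reproduced in this article), `share_passing` would return the fraction of models at or above 60%, which the article describes only qualitatively as a minority.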

Industry Impact

The introduction of General 365 by Meituan's LongCat team is likely to have a profound impact on how AI reasoning is perceived and tested within the industry. By setting a standard where even the most advanced models struggle to achieve high scores, Meituan is pushing the AI community to move beyond surface-level performance and focus on deep logical processing.

This benchmark serves as a reality check for AI developers. The fact that most models failed to reach 60% accuracy indicates that the next frontier of LLM development must prioritize reasoning accuracy over mere scale or fluency. If General 365 becomes a recognized yardstick (标尺), it will likely drive a new wave of optimization focused on the specific types of logical failures identified during these tests. Furthermore, the transparency provided by the LongCat team regarding the performance of 26 different models encourages a more competitive and data-driven approach to model refinement across the global AI landscape.

Frequently Asked Questions

Question: What is General 365?

General 365 is a new reasoning evaluation benchmark released by Meituan's LongCat team. It is designed to act as a "new yardstick" for measuring the reasoning capabilities of AI models, focusing on high-difficulty tasks that challenge current mainstream architectures.

Question: How did the top AI models perform on this benchmark?

According to the data released by the Meituan technical team, performance was lower than many might expect. The top-performing model, Gemini 3 Pro, achieved an accuracy of 62.8%. Most of the other models among the 26 tested failed to reach 60% accuracy.

Question: Why is the 60% mark significant in the General 365 test?

The 60% mark is described as the "passing line" (及格线). The fact that most models failed to reach this score highlights a significant deficiency in the reasoning abilities of current mainstream AI, even those that are widely used today.

Related News

Meituan Launches LongCat-2.0: A 1.6 Trillion Parameter Model Trained on 50,000 Domestic Computing Cards
Industry News

Meituan has officially announced the release of LongCat-2.0, a pioneering trillion-parameter large language model. This model represents a major technological milestone as the first in the industry to complete its entire training and inference lifecycle on a domestic computing cluster featuring 50,000 cards. LongCat-2.0 boasts a total of 1.6 trillion parameters, with an average activation of approximately 48 billion and a dynamic range of 33 billion to 56 billion. Pre-trained from scratch, the model natively supports a 1-million-token long context window. Its architecture is specifically designed to optimize Agentic Coding tasks, focusing on the efficient and stable understanding, generation, and execution of code in real-world scenarios.

Meituan Technical Team Showcases Machine Learning Research Excellence at ICML 2026 International Conference
Industry News

The Meituan Technical Team has announced its selection of academic papers for the 2026 International Conference on Machine Learning (ICML), one of the world's most prestigious forums for AI research. ICML serves as a critical platform for addressing the future challenges and core issues within the machine learning landscape. By evaluating research based on both theoretical depth and practical influence, the conference aims to steer the direction of global technological advancement. Meituan's participation underscores its commitment to contributing high-value research to the international community. This selection highlights the team's focus on bridging the gap between cutting-edge theory and real-world application, reinforcing its position as a significant contributor to the evolution of machine learning and its future research trajectories.

Meituan Technical Team Presents Six Research Papers at ACL 2026 Focusing on Large Model Evaluation and Reasoning Optimization
Industry News

Meituan's technical team has announced that six of its research papers have been accepted for ACL 2026, a premier international conference in the field of computational linguistics and natural language processing (NLP). The research spans several critical frontiers of artificial intelligence, including large model evaluation, complex process reasoning, and competition-level mathematical thinking optimization. Additionally, the papers explore advancements in reinforcement learning optimization and generative recommendation systems. This collection of work represents Meituan's strategic push toward building a new paradigm for generative AI, focusing on enhancing the reasoning capabilities and evaluation frameworks of modern large language models to meet the demands of complex, real-world applications.