
Meituan LongCat Releases General 365: A New Benchmark for AI Reasoning Evaluation
Meituan's LongCat team has released General 365, a benchmark designed to evaluate the reasoning capabilities of large language models. In a test of 26 mainstream models, even the top performer, Gemini 3 Pro, reached an accuracy of only 62.8%, and the vast majority of models fell short of 60%, the mark this evaluation treats as a passing score. The results indicate that complex reasoning remains a major hurdle for even the most advanced AI systems currently available.
Key Takeaways
- Meituan's LongCat team has introduced General 365, a new standard for evaluating AI reasoning.
- Evaluation of 26 mainstream models shows that most current AI systems struggle with complex reasoning tasks.
- Gemini 3 Pro was the top performer but achieved only a 62.8% accuracy rate.
- The majority of tested models failed to reach the 60% 'passing' threshold, indicating a significant industry-wide challenge.
In-Depth Analysis
Establishing a New Benchmark for Reasoning
The LongCat team at Meituan built General 365 to measure the reasoning depth of modern AI models. Traditional benchmarks for large language models (LLMs) often fail to capture the nuances of logical deduction and complex problem-solving. General 365 aims to fill this gap with a more stringent scale for assessing how models handle intricate reasoning scenarios. Meituan is positioning it as a tool that helps developers find the true limits of their models beyond simple pattern recognition or data retrieval.
The Performance Gap in Mainstream Models
The initial evaluation covered 26 of the industry's most prominent AI models, and the results show that high-level reasoning remains out of reach for most of them. Gemini 3 Pro, widely regarded as one of the most powerful models available, led the evaluation with an accuracy of just 62.8%. That score, the highest in the group, suggests that even state-of-the-art models have significant room for improvement on the challenges General 365 poses.
More striking, the vast majority of the 26 models tested did not reach 60%, which this benchmark treats as the baseline for a 'passing' grade. That most mainstream models fall below this line points to a widespread weakness in reasoning across the current AI ecosystem. The data suggests that while models are becoming more conversational and versatile, their ability to reason consistently and logically is not yet mature.
Industry Impact
The release of General 365 by Meituan's LongCat team could influence how AI models are developed and marketed. With a benchmark on which even the strongest model only narrowly clears the passing mark, Meituan is pushing attention from the quantity of data toward the quality of reasoning. The benchmark serves as a reality check for the industry, offering a metric intended to separate models that merely simulate intelligence from those that can reason through problems. As developers work to improve their scores on General 365, more research is likely to focus on strengthening the reasoning abilities of future AI systems.
Frequently Asked Questions
Question: What is the primary purpose of Meituan's General 365?
General 365 is a new benchmark for evaluating the reasoning capabilities of AI models. It is intended to be a more difficult and more accurate measure of logical performance than previous benchmarks.
Question: How did the top AI models perform on the General 365 test?
Of the 26 mainstream models tested, Gemini 3 Pro performed best, with a 62.8% accuracy rate. Most of the other models failed to reach the 60% passing threshold.
Question: Why is the 60% score significant in this benchmark?
The 60% mark is considered the 'passing line' for General 365. The fact that most models failed to reach it highlights that current AI technology still faces major hurdles in mastering complex reasoning tasks.


