
Meituan LongCat Launches General 365: New Reasoning Benchmark Reveals AI Performance Gaps
Meituan's LongCat team has released General 365, an evaluation benchmark designed to measure the reasoning capabilities of large language models. In an assessment of 26 mainstream AI models, the benchmark showed that complex reasoning tasks remain difficult across the industry. Gemini 3 Pro was the top performer, but with an accuracy of only 62.8%. The vast majority of the models tested fell short of 60% accuracy, which is considered the passing mark. The release sets a more demanding standard for AI evaluation and shows that even the most advanced models currently available face substantial challenges in logical reasoning.
Key Takeaways
- Meituan's LongCat team has introduced General 365, a specialized benchmark for evaluating AI reasoning.
- Testing of 26 mainstream models shows that reasoning remains a significant challenge for current AI technology.
- Gemini 3 Pro recorded the highest accuracy at 62.8%, only 2.8 percentage points above the 60% passing line.
- The majority of tested models failed to achieve a 60% accuracy rate, falling below the benchmark's passing line.
In-Depth Analysis
The Introduction of General 365 by Meituan LongCat
The Meituan LongCat team has officially entered the AI evaluation space with the release of General 365. This benchmark is positioned as a new "ruler" or standard for measuring the reasoning capabilities of large language models (LLMs). By focusing specifically on reasoning, Meituan aims to provide a more nuanced understanding of how models process complex logic rather than just retrieving information. The launch of General 365 comes at a time when the industry is seeking more rigorous ways to differentiate between models that can simulate conversation and those that can truly perform logical deduction.
Analyzing the Performance of Mainstream Models
The initial data released alongside General 365 provides a sobering look at the current state of artificial intelligence. The LongCat team tested 26 mainstream models, and the results show that most of them struggle with the reasoning tasks in the General 365 suite.
Even Gemini 3 Pro, which the report identifies as the strongest model currently available, achieved an accuracy of only 62.8%. This figure is telling because it is the highest score any of the tested models reached on this benchmark. Perhaps more significant is the finding that the 60% "passing line" was out of reach for the vast majority of the 26 models tested. This suggests that while AI has made strides in natural language processing, consistent, high-level reasoning has not yet been achieved.
Industry Impact
Setting a New Standard for AI Reasoning
The release of General 365 is significant for the AI industry because it shifts the focus from general performance to reasoning depth. By establishing a benchmark on which the top model only narrowly clears the 60% threshold and most others fall short of it, Meituan is challenging developers to move beyond superficial improvements. This "new ruler" gives developers a clear metric for progress on more complex cognitive tasks.
Identifying the Reasoning Bottleneck
The fact that most of the 26 mainstream models tested failed to reach the 60% passing line highlights a critical bottleneck in AI development. The data suggests that current training methodologies may be reaching a plateau in terms of logical reasoning. For the industry, this serves as a call to refine how models are taught to reason, rather than just how they are taught to predict the next word in a sequence. Gemini 3 Pro's score of 62.8% sets a baseline that other developers will now aim to surpass, potentially accelerating the next wave of reasoning-focused AI research.
Frequently Asked Questions
What is the General 365 benchmark?
General 365 is a reasoning evaluation benchmark released by Meituan's LongCat team. It is designed to test the logical and reasoning capabilities of large language models through a series of rigorous assessments.
How did Gemini 3 Pro perform on this benchmark?
Gemini 3 Pro was the highest-scoring model among the 26 mainstream models tested, achieving an accuracy rate of 62.8%. While it was the top performer, its score highlights the difficulty of the General 365 reasoning tasks.
Why did most models fail the General 365 test?
According to the findings from the Meituan LongCat team, the majority of the 26 mainstream models tested could not reach the 60% accuracy mark. This indicates that complex reasoning remains a major weakness for most current AI models, regardless of their general popularity or performance in other areas.


