
Meituan LongCat Team Launches General 365: A New Benchmark Revealing Critical Gaps in AI Reasoning Capabilities
The Meituan LongCat team has released General 365, a new benchmark designed to evaluate the reasoning capabilities of AI models. An initial assessment of 26 mainstream models reveals a significant performance gap across the industry. Even Gemini 3 Pro, the strongest model in the test, achieved an accuracy of only 62.8%, and most of the models tested failed to reach the 60% threshold traditionally considered a passing grade. The release offers a new standard for measuring logical depth in AI and highlights the substantial room for improvement in complex reasoning tasks.
Key Takeaways
- New Evaluation Standard: Meituan's LongCat team has introduced General 365, a benchmark specifically focused on reasoning performance.
- Industry-Wide Testing: The benchmark was applied to 26 mainstream AI models to provide a comprehensive overview of current capabilities.
- Gemini 3 Pro Leads: Currently the top performer in this evaluation, Gemini 3 Pro reached an accuracy of 62.8%.
- Widespread Failure: Most models tested were unable to achieve a 60% accuracy rate, indicating a general struggle with the benchmark's requirements.
In-Depth Analysis
The Launch of General 365 by Meituan LongCat
The Meituan LongCat team has officially entered the AI evaluation space with the release of General 365. The benchmark arrives as the industry shifts its focus from simple conversational fluency to deep logical reasoning. By testing 26 mainstream models, Meituan has provided a broad cross-section of the current state of artificial intelligence. General 365 is positioned as a "new yardstick" (标尺) for the industry, which suggests that existing benchmarks may not be challenging or specific enough to differentiate the reasoning ability of high-tier models.
Analyzing the Performance Gap: The 60% Threshold
The data released alongside General 365 provides a sobering look at the current limitations of large language models. Gemini 3 Pro, the strongest model in the test, managed a score of only 62.8%, which suggests that the benchmark is exceptionally rigorous.
Perhaps more significant is the finding that most of the 26 models failed to reach the 60% mark. In many academic and professional contexts, 60% represents the minimum standard for passing or basic competency. The failure of most mainstream models to reach this mark indicates that, while AI has made strides in many areas, complex reasoning remains a significant hurdle. This "passing line" serves as a clear indicator of both the difficulty of the General 365 evaluation set and the current ceiling for AI reasoning.
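The arithmetic behind the "passing line" can be sketched in a few lines of Python. This is only an illustration of how an accuracy figure compares with the 60% threshold, not the LongCat team's evaluation code. The function names are hypothetical, and the only score taken from the report is Gemini 3 Pro's 62.8%; the other entries are placeholders invented to make the example run.

```python
PASSING_LINE = 60.0  # percent; the "passing grade" discussed in the article


def clears_passing_line(accuracy_pct, threshold=PASSING_LINE):
    """Return True if an accuracy (in percent) meets or exceeds the threshold."""
    return accuracy_pct >= threshold


def margin_over_line(accuracy_pct, threshold=PASSING_LINE):
    """Return how many percentage points an accuracy sits above the threshold."""
    return round(accuracy_pct - threshold, 1)


def share_below(scores, threshold=PASSING_LINE):
    """Return the fraction of models whose accuracy falls below the threshold."""
    below = sum(1 for score in scores.values() if score < threshold)
    return below / len(scores)


# Only the Gemini 3 Pro figure comes from the report.
# The placeholder entries are NOT real results; they exist purely for illustration.
scores = {
    "Gemini 3 Pro": 62.8,
    "placeholder_model_a": 55.0,
    "placeholder_model_b": 48.0,
}

print(clears_passing_line(scores["Gemini 3 Pro"]))  # True
print(margin_over_line(scores["Gemini 3 Pro"]))     # 2.8
print(share_below(scores))
```

On the reported figure, the top model clears the passing line by only 2.8 percentage points, which is why the article treats 60% as the current ceiling rather than a floor.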
Meituan's Role in AI Standardization
By releasing this benchmark through its technical team, Meituan is asserting itself as a key player in the infrastructure of AI development. General 365 does not just rank models; it defines criteria for what constitutes successful reasoning. Testing 26 different models suggests that the benchmark is meant as a general assessment of the industry's progress rather than one tailored to a specific architecture. The results indicate that the path to truly "intelligent" reasoning is still in its early stages, with even the market leaders having significant room for growth.
Industry Impact
The release of General 365 is likely to have a multi-faceted impact on the AI industry. First, it provides a transparent and difficult target for model developers. When the "strongest" model only scores 62.8%, it creates a competitive drive for other labs to optimize for these specific reasoning challenges.
Second, it shifts the narrative away from general performance toward specialized reasoning. As more companies integrate AI into complex decision-making processes, benchmarks like General 365 become essential for determining which models can handle logical tasks reliably. Meituan's contribution highlights that the next frontier of AI development is not just about more data, but about higher-quality logical processing and the ability to clear the 60% "competency" hurdle.
Frequently Asked Questions
Question: What is General 365?
General 365 is a new reasoning evaluation benchmark released by the Meituan LongCat team. It is designed to test the logical and reasoning capabilities of AI models, providing a standardized metric for the industry.
Question: How did the top AI models perform on this benchmark?
According to the report from Meituan, Gemini 3 Pro is currently the strongest performer with an accuracy of 62.8%. However, most of the 26 mainstream models tested failed to reach a 60% accuracy rate.
Question: Why is the 60% score significant in the General 365 test?
The 60% score is often viewed as a basic passing grade or a threshold for competency. The fact that most mainstream models failed to reach this level underscores the high difficulty of the General 365 benchmark and the current limitations of AI reasoning.


