
Meituan LongCat Releases General 365 Reasoning Benchmark as Leading AI Models Struggle to Pass
The Meituan LongCat team has launched General 365, a new benchmark designed to evaluate the reasoning capabilities of large language models (LLMs). In a test of 26 mainstream AI models, most failed to reach 60% accuracy, the level traditionally considered a passing grade. Even Gemini 3 Pro, currently regarded as one of the most capable models available, achieved an accuracy of only 62.8%. The release sets a challenging new standard for AI reasoning and shows that current frontier models still face substantial hurdles on complex logical tasks.
Key Takeaways
- Launch of General 365: Meituan's LongCat team has introduced a new evaluation standard specifically focused on reasoning capabilities.
- Widespread Performance Gap: Out of 26 mainstream models tested, most failed to achieve a 60% accuracy rate.
- Gemini 3 Pro Results: The industry-leading Gemini 3 Pro took the top spot but managed only 62.8% accuracy.
- New Industry Benchmark: General 365 is positioned as a high-bar metric that exposes the limitations of current large language models in complex reasoning scenarios.
In-Depth Analysis
The Challenge of General 365
The introduction of General 365 by the Meituan LongCat team raises the standard against which AI reasoning is measured. By testing 26 of the most prominent models on the market, the benchmark offers a sobering look at the state of AI development. That a majority of these models could not surpass the 60% mark suggests that General 365 tests depth and logical consistency rather than simple pattern matching. Difficulty at this level helps separate models that reason reliably from those that rely on surface-level heuristics.
Benchmarking the Frontier: Gemini 3 Pro
One of the most significant findings in the Meituan report is the performance of Gemini 3 Pro. Despite its reputation as a leading model in the global AI landscape, it reached only 62.8% accuracy on General 365. That score placed it first among the 26 models tested, but it cleared the 60% threshold by just 2.8 percentage points, which indicates that even the most advanced systems have considerable room for improvement. The result underscores the rigor of the General 365 evaluation and suggests that the "reasoning" capabilities of modern LLMs are still at a relatively early stage when held to such stringent criteria.
The 60% Threshold and Model Failure Rates
The report highlights a concerning trend: the 60% "passing line" remains out of reach for most of the AI industry. Of the 26 mainstream models under review, the majority fell short of this threshold, which suggests a systemic limitation in current model architectures or training methodologies. Meituan's findings imply that while models are becoming more conversational and versatile, their ability to handle the logical complexity demanded by General 365 remains a significant bottleneck. This points future research toward more robust reasoning.
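The arithmetic behind the "passing line" can be made concrete with a minimal sketch. It assumes accuracy is simply the fraction of questions answered correctly; the report does not describe General 365's actual scoring procedure, so the function names and the small list of graded answers below are hypothetical and for illustration only. The only figures taken from the report are the 60% threshold and Gemini 3 Pro's 62.8%.

```python
from typing import Sequence

# The 60% "passing line" cited in the report.
PASSING_LINE = 0.60


def accuracy(graded: Sequence[bool]) -> float:
    """Fraction of questions answered correctly (assumed scoring rule)."""
    if not graded:
        raise ValueError("no graded answers")
    return sum(graded) / len(graded)


def passes(acc: float, threshold: float = PASSING_LINE) -> bool:
    """True when the accuracy meets or exceeds the threshold."""
    return acc >= threshold


def margin_points(acc: float, threshold: float = PASSING_LINE) -> float:
    """Distance from the threshold in percentage points, to one decimal."""
    return round((acc - threshold) * 100, 1)


# Hypothetical graded answers, not real benchmark data.
sample = [True, True, False, True, False]
sample_accuracy = accuracy(sample)

# The one model score stated in the report: Gemini 3 Pro at 62.8%.
gemini_margin = margin_points(0.628)
```

Under these assumptions, `passes(0.628)` is true and `gemini_margin` is 2.8 percentage points, which is the narrow margin the article describes.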
Industry Impact
The release of General 365 by Meituan is likely to have a notable impact on the AI research community. By establishing a benchmark on which even the strongest models struggle, Meituan has effectively raised the bar for what counts as "advanced" reasoning. This encourages developers to move beyond traditional benchmarks that may have become saturated or prone to data contamination.
Furthermore, the transparency of these results, which show that most mainstream models fall below 60% accuracy, provides a realistic baseline for enterprise expectations. As companies look to integrate AI into complex decision-making processes, benchmarks like General 365 may offer a more accurate reflection of a model's reliability in high-stakes reasoning tasks. This will likely drive a new wave of optimization focused on the logical gaps identified by the LongCat team.
Frequently Asked Questions
Question: What is the primary focus of the General 365 benchmark?
General 365 is an open-source benchmark released by the Meituan LongCat team specifically designed to evaluate and set a new standard for the reasoning capabilities of large language models.
Question: How did the top-performing models fare on this benchmark?
According to the test results of 26 mainstream models, Gemini 3 Pro was the top performer with an accuracy of 62.8%. However, the majority of the other models tested failed to reach the 60% accuracy mark.
Question: Why is the 60% accuracy mark significant in this report?
The 60% mark is described as the "passing line." The fact that most mainstream models failed to reach this level highlights the extreme difficulty of the General 365 benchmark and the current limitations of AI reasoning.

