
Meituan LongCat Releases General 365: A New Reasoning Benchmark Where Most AI Models Fail to Pass
The Meituan LongCat team has officially open-sourced 'General 365,' a rigorous new benchmark designed to evaluate the reasoning capabilities of large language models. An initial assessment of 26 mainstream AI models shows that current systems still fall well short on these tasks. Even Gemini 3 Pro, the top performer in the test, achieved an accuracy of only 62.8%, and the vast majority of the models tested were unable to reach the 60% passing threshold. The release gives the industry a new standard and shows that complex reasoning remains a substantial challenge for even the most advanced AI systems currently available.
Key Takeaways
- Meituan's LongCat team has officially released and open-sourced the General 365 reasoning benchmark.
- Evaluation of 26 mainstream models reveals that most current AI systems struggle with complex reasoning tasks.
- Gemini 3 Pro emerged as the top performer with an accuracy of 62.8%, yet this remains relatively low for a leading model.
- The majority of tested models failed to reach 60% accuracy, which underscores how difficult the benchmark is.
In-Depth Analysis
The Launch of General 365
The Meituan LongCat team has introduced General 365 as a specialized tool for evaluating the reasoning depth of artificial intelligence. By open-sourcing this benchmark, the team provides the global AI community with a new metric to measure progress beyond simple linguistic fluency. The focus of General 365 is specifically on 'General' reasoning, suggesting a broad application across various logical domains. The release comes at a time when the industry is shifting its focus from model size to the quality of logical output and problem-solving efficiency.
Performance Gap in Mainstream Models
The results released alongside the benchmark provide a sobering look at the current state of AI. Out of 26 mainstream models tested, the performance was notably lower than what is typically seen on standard benchmarks. The fact that Gemini 3 Pro, currently regarded as one of the most capable models globally, only secured a 62.8% accuracy rate indicates that General 365 contains highly challenging reasoning problems. This data point serves as a critical indicator that even the 'strongest' models have significant room for improvement when faced with the specific criteria set by the LongCat team.
The 60% Passing Threshold
A striking finding from the LongCat team's report is that the vast majority of the 26 models failed to reach the 60% mark. In many academic and professional contexts, 60% is considered the minimum threshold for a 'passing' grade. That most mainstream models fall below this baseline suggests that current large language models (LLMs) may still struggle to reason consistently. The gap between current performance and the 'passing' line highlights the rigor of General 365 as an evaluative standard.
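As a rough illustration of the arithmetic behind the 'passing' line, the sketch below computes accuracy as the share of correctly answered items and compares it with 60%. The function names and the second model entry are hypothetical, and the article does not describe how General 365 is actually scored; only the 62.8% figure for Gemini 3 Pro comes from the reported results.

```python
# Minimal sketch of the pass/fail arithmetic, assuming accuracy is simply
# correct answers divided by total items (the article does not specify this).

PASS_THRESHOLD = 0.60  # the 60% 'passing' line discussed above


def accuracy(num_correct: int, num_total: int) -> float:
    """Return the fraction of benchmark items answered correctly."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return num_correct / num_total


def passes(acc: float, threshold: float = PASS_THRESHOLD) -> bool:
    """Return True when an accuracy meets or exceeds the threshold."""
    return acc >= threshold


# Reported figure from the article: Gemini 3 Pro at 62.8%.
scores = {"Gemini 3 Pro": 0.628}

# Hypothetical entry for illustration only; not a real result.
scores["hypothetical-model"] = accuracy(55, 100)

for name, acc in scores.items():
    status = "pass" if passes(acc) else "below threshold"
    print(f"{name}: {acc:.1%} ({status})")
```

Under these assumptions, the reported 62.8% clears the 60% line by less than three percentage points, while the hypothetical 55% entry falls below it.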
Industry Impact
The introduction of General 365 by Meituan is significant for the AI industry as it establishes a more demanding yardstick for reasoning. By making the benchmark open-source, Meituan allows other developers to stress-test their models against the same 26-model baseline. This could lead to a shift in development priorities, moving away from general knowledge retrieval and toward the enhancement of internal logic and multi-step reasoning. As models strive to exceed the 62.8% mark set by Gemini 3 Pro, General 365 will likely become a key reference point for future iterations of large language models.
Frequently Asked Questions
Question: What is the significance of the 62.8% score achieved by Gemini 3 Pro?
Within the context of the General 365 benchmark, 62.8% represents the highest accuracy among 26 mainstream models. While it leads the field, the score suggests that even top-tier AI models face difficulty with the reasoning tasks included in this specific evaluation.
Question: Why did most models fail to reach the 60% mark on General 365?
That most models scored below 60% indicates that General 365 was designed to be difficult and to target weaknesses in current AI reasoning capabilities. In doing so, it sets a more demanding standard for the industry.
Question: Is General 365 available for public use?
Yes, the Meituan LongCat team has officially open-sourced General 365, allowing the broader technology community to use it for evaluating and improving AI model reasoning.

