
Meituan LongCat Releases General 365: A New Benchmark Highlighting Reasoning Gaps in Leading AI Models
The Meituan LongCat team has officially launched General 365, a rigorous new benchmark designed to evaluate the reasoning capabilities of large language models. An initial assessment of 26 mainstream AI models shows that even the strongest systems leave substantial room for improvement. Gemini 3 Pro, which the report describes as the most capable model currently available, achieved an accuracy of only 62.8%. The vast majority of the tested models failed to reach 60%, the benchmark's passing line. The release sets a more demanding standard for AI reasoning and suggests that current models still face substantial hurdles in complex logical processing.
Key Takeaways
- Official Release of General 365: Meituan's LongCat team has introduced a new evaluation standard named General 365, specifically targeting AI reasoning performance.
- Extensive Model Testing: The benchmark was used to evaluate 26 mainstream models to provide a comprehensive overview of the current state of AI capabilities.
- Gemini 3 Pro Performance: Although it is identified as the strongest model currently available, Gemini 3 Pro scored only 62.8% accuracy on the benchmark.
- Industry-Wide Reasoning Deficit: Most models evaluated in the study failed to achieve a score of 60%, indicating a widespread struggle to meet basic reasoning requirements under this new standard.
In-Depth Analysis
The Launch of General 365 and the 26-Model Evaluation
The Meituan LongCat team has released General 365 at a time when the industry is seeking more accurate ways to measure the reasoning depth of large language models (LLMs). By testing 26 mainstream models, the LongCat team has provided a broad side-by-side comparison of where the industry stands today. The decision to test such a wide array of models suggests that General 365 is intended as a universal yardstick, capable of separating models that merely perform well on standard conversational tasks from those that also hold up under rigorous reasoning requirements.
The scope of this evaluation is significant. Testing 26 models reduces the risk that conclusions rest on a handful of systems, and it covers a range of the architectures and training methodologies in current use. The results from this testing phase serve as the foundational data for General 365, positioning it as a high-bar standard that challenges existing AI performance metrics.
Analyzing the Performance Ceiling: Gemini 3 Pro and the 60% Threshold
One of the most striking findings from the LongCat team's report is the performance of Gemini 3 Pro. The model is described as the "strongest on Earth" (地表最强), so its accuracy of 62.8% gives a clear indication of the benchmark's difficulty. If the industry leader clears the 60% mark by only 2.8 percentage points, General 365 is evidently targeting reasoning problems at the edge of current AI capabilities. On this benchmark, 62.8% is the current ceiling for top-tier performance.
Perhaps more concerning for the industry is the fact that the majority of the 26 models tested did not even reach the 60% "passing line." In many academic and professional contexts, 60% is viewed as the minimum threshold for competency. The failure of most mainstream models to hit this mark suggests that there is a significant "reasoning gap" in current AI development. While models are becoming increasingly proficient at generating fluent text and following instructions, their ability to navigate the specific reasoning challenges presented by General 365 remains limited. This data point underscores the necessity of the LongCat team's new standard, as it exposes weaknesses that other, perhaps more lenient, benchmarks might overlook.
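As a rough illustration of how an accuracy score relates to the passing line, the following Python sketch computes accuracy from a list of graded answers and compares it with 60%. The function names, the list-of-booleans input, and the 1,000-question run are hypothetical and chosen only for the arithmetic; the report summarized here does not describe General 365's question count or scoring code. Only the 62.8% and 60% figures come from the report.

```python
PASSING_LINE = 60.0  # percent; the "passing line" cited in the report


def accuracy_percent(graded):
    """Return the share of correct answers as a percentage."""
    if not graded:
        raise ValueError("graded must contain at least one result")
    # True counts as 1 and False as 0 when summed.
    return 100.0 * sum(graded) / len(graded)


def passes(score, passing_line=PASSING_LINE):
    """Return True if the score meets or exceeds the passing line."""
    return score >= passing_line


def margin_over_passing(score, passing_line=PASSING_LINE):
    """Return the gap to the passing line in percentage points."""
    return round(score - passing_line, 1)


# Hypothetical run: 628 correct answers out of 1,000 questions,
# picked so the accuracy matches the 62.8% reported for Gemini 3 Pro.
graded = [True] * 628 + [False] * 372
score = accuracy_percent(graded)

print(f"accuracy: {score:.1f}%")
print(f"passes: {passes(score)}")
print(f"margin: {margin_over_passing(score)} points")
```

Under these assumptions, the top score sits 2.8 percentage points above the passing line, and any model scoring below 60.0 would be counted as failing.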
Industry Impact
The introduction of General 365 by Meituan's LongCat team is likely to have a profound impact on how AI reasoning is perceived and tested within the industry. By setting a standard where even the most advanced models struggle to achieve high scores, Meituan is pushing the AI community to move beyond surface-level performance and focus on deep logical processing.
This benchmark serves as a reality check for AI developers. The fact that most models failed to reach 60% accuracy indicates that the next stage of LLM development must prioritize reasoning accuracy over mere scale or fluency. If General 365 becomes a recognized yardstick (标尺), it is likely to drive a new wave of optimization focused on the specific types of logical failures identified during these tests. Furthermore, by publishing results for 26 different models, the LongCat team encourages a more competitive and data-driven approach to model refinement across the global AI landscape.
Frequently Asked Questions
Question: What is General 365?
General 365 is a new reasoning evaluation benchmark released by Meituan's LongCat team. It is designed to act as a "new yardstick" for measuring the reasoning capabilities of AI models, focusing on high-difficulty tasks that challenge current mainstream architectures.
Question: How did the top AI models perform on this benchmark?
According to the data released by the Meituan technical team, performance was lower than many might expect. The top-performing model, Gemini 3 Pro, achieved an accuracy of 62.8%. Most of the 26 mainstream models tested failed to reach 60% accuracy.
Question: Why is the 60% mark significant in the General 365 test?
The 60% mark is described as the "passing line" (及格线). The fact that most models failed to reach this score highlights a significant deficiency in the reasoning abilities of current mainstream AI, even those that are widely used today.


