
Meituan LongCat Team Launches General 365: A Rigorous New Benchmark for AI Reasoning Evaluation
The Meituan LongCat team has officially released General 365, a new benchmark designed to evaluate the reasoning capabilities of large language models (LLMs). An initial assessment of 26 mainstream models showed how far current systems fall short on its tasks. Gemini 3 Pro, currently regarded as one of the most advanced models, posted the highest accuracy, at only 62.8%. More strikingly, most of the models tested failed to reach 60% accuracy, the mark traditionally considered a passing grade. The release sets a more demanding standard for measuring AI reasoning and shows that current models still face substantial challenges in complex logical tasks.
Key Takeaways
- New Evaluation Standard: Meituan's LongCat team has introduced General 365, specifically designed to test the reasoning limits of AI models.
- Industry-Wide Testing: The benchmark was applied to 26 mainstream models to provide a comprehensive overview of current AI capabilities.
- Performance Ceiling: Gemini 3 Pro emerged as the top performer but only managed an accuracy rate of 62.8%.
- Reasoning Deficit: Most tested models failed to achieve a 60% score, indicating a widespread struggle with the reasoning tasks presented in General 365.
In-Depth Analysis
The Introduction of General 365
The Meituan LongCat team has officially open-sourced General 365, positioning it as a new yardstick for evaluating artificial intelligence. Unlike traditional benchmarks that may focus on general knowledge or linguistic fluency, General 365 appears to target the core cognitive function of reasoning. By releasing this tool, the LongCat team gives the developer community a rigorous framework for identifying the strengths and weaknesses of various LLMs in logical processing.
The decision to open-source this benchmark suggests a move toward greater transparency and standardization in how AI progress is measured. As models become more sophisticated, the industry requires more difficult and nuanced testing environments to differentiate between superficial pattern matching and genuine logical reasoning.
Benchmarking the Leaders: Gemini 3 Pro and Beyond
In initial testing by the LongCat team, 26 mainstream models were evaluated. The results offer a sobering look at the current state of AI development. Gemini 3 Pro, regarded as one of the strongest models in the field, reached an accuracy of 62.8%, the highest score among them. While this represents the leading edge of current technology, it also leaves a significant margin for improvement.
The results point to a drop-off in performance beyond the top-tier models. Most of the 26 models could not reach 60% accuracy, a level often considered the minimum standard for competency, which suggests that General 365 is a highly challenging benchmark. The gap underscores the difficulty of the reasoning tasks in the set and indicates that many current LLMs may still struggle with complex, multi-step logical requirements.
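To make the arithmetic behind these figures concrete, here is a minimal sketch of how an accuracy score and a 60% pass line could be computed for a set of models. It is not the LongCat team's evaluation code: the `ModelResult` type, the `summarize` helper, the model labels, and the question counts are all hypothetical, chosen only so that one entry lands at 62.8%.

```python
from dataclasses import dataclass

# The 60% "passing grade" discussed in the article.
PASS_THRESHOLD = 0.60


@dataclass
class ModelResult:
    """Hypothetical record of one model's benchmark run."""
    name: str
    correct: int  # questions answered correctly
    total: int    # questions attempted

    @property
    def accuracy(self) -> float:
        # Accuracy is the fraction of questions answered correctly.
        return self.correct / self.total if self.total else 0.0


def summarize(results, threshold=PASS_THRESHOLD):
    """Rank models by accuracy and report which ones clear the threshold."""
    if not results:
        raise ValueError("results must not be empty")
    ranked = sorted(results, key=lambda r: r.accuracy, reverse=True)
    passing = [r.name for r in ranked if r.accuracy >= threshold]
    return {
        "top": ranked[0].name,
        "top_accuracy": ranked[0].accuracy,
        "passing": passing,
        "share_below": 1 - len(passing) / len(ranked),
    }


# Illustrative numbers only; these are not real General 365 results.
demo = [
    ModelResult("model-a", 628, 1000),
    ModelResult("model-b", 550, 1000),
    ModelResult("model-c", 410, 1000),
]
summary = summarize(demo)
print(summary["top"], f"{summary['top_accuracy']:.1%}", summary["passing"])
```

With a summary like this, "most models failed to reach 60%" corresponds to `share_below` being greater than 0.5.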
Industry Impact
The release of General 365 is significant for the AI industry because it shifts the focus from simple performance metrics to deep reasoning capabilities. By setting a benchmark on which even the top-scoring model only narrowly clears the 60% mark, Meituan is effectively raising the bar for what constitutes a "high-performing" model. This encourages AI researchers and developers to move beyond optimizing for existing, potentially saturated benchmarks and instead focus on the fundamental challenges of machine reasoning.
Furthermore, the benchmark serves as a reality check for the industry. While marketing for AI models often emphasizes human-like capabilities, the General 365 results demonstrate that there is still a long way to go before AI can consistently master complex reasoning tasks. This new standard will likely drive a new wave of innovation focused on cognitive depth rather than just model size or data volume.
Frequently Asked Questions
Question: What is General 365?
General 365 is a new reasoning evaluation benchmark released by Meituan's LongCat team. It is designed to provide a rigorous standard for testing the logical reasoning capabilities of large language models.
Question: How did mainstream models perform on this benchmark?
In a test of 26 mainstream models, scores were generally low. Gemini 3 Pro led the group with a 62.8% accuracy rate, and most of the other models failed to reach 60%.
Question: Why is the 60% score significant in this context?
The 60% mark is often viewed as a basic passing grade or a threshold for competency. The fact that most models fell below this line indicates that General 365 is a particularly difficult test that exposes the reasoning limitations of current AI technology.

