
Meituan LongCat Team Launches General 365: A New Benchmark Revealing AI Reasoning Limitations
The Meituan LongCat team has released General 365, an evaluation benchmark designed to measure the reasoning capabilities of large language models. In a test of 26 mainstream models, the benchmark exposed a significant performance gap: Gemini 3 Pro emerged as the top performer, yet reached an accuracy of only 62.8%. The vast majority of the tested models fell short of 60%, a score commonly treated as a passing grade. The results suggest that, while AI has made strides in general tasks, complex reasoning remains a formidable challenge even for the most advanced systems currently available.
Key Takeaways
- Meituan's LongCat team has introduced General 365, a rigorous new benchmark for evaluating AI reasoning.
- A comprehensive test of 26 mainstream models shows that most current AI systems struggle with complex reasoning tasks.
- Gemini 3 Pro recorded the highest accuracy at 62.8%, setting the current ceiling for the benchmark.
- The majority of tested models failed to achieve a 60% accuracy score, indicating a widespread "reasoning gap" in the industry.
In-Depth Analysis
The Debut of General 365
The Meituan LongCat team has officially entered the AI evaluation space with the release of General 365. This benchmark is positioned as a new yardstick for measuring the logical and reasoning depth of large language models (LLMs). As the AI industry moves beyond basic text generation and information retrieval, the ability to perform multi-step reasoning and maintain logical consistency has become the new frontier. General 365 aims to provide a standardized metric to quantify these high-level cognitive abilities, offering a clearer picture of how models perform when faced with complex problem-solving scenarios.
Benchmarking the Giants: Gemini 3 Pro and Others
To establish the baseline for General 365, the LongCat team conducted empirical tests on 26 of the most prominent mainstream models currently available. The results serve as a reality check for the state of artificial intelligence. Gemini 3 Pro, which is widely regarded as one of the most capable models in the world, achieved an accuracy rate of 62.8%. While this score placed it at the top of the leaderboard, the figure itself suggests that even the industry's leading models have significant room for improvement. The fact that the highest score sits just above 60% underscores the difficulty of the General 365 evaluation criteria.
The 60% Threshold: A Critical Performance Gap
One of the most significant findings from the Meituan LongCat report is the failure of most models to reach the 60% accuracy mark. In many academic and professional contexts, 60% is viewed as the minimum threshold for competency or a "passing grade." The discovery that the majority of mainstream models could not reach this level on General 365 highlights a critical deficiency in current AI development. It suggests that while models are becoming increasingly proficient at mimicking human language, their underlying reasoning engines are not yet robust enough to handle the complexities presented by this new benchmark. This gap between linguistic fluency and logical reasoning is a primary hurdle that the next generation of AI models will need to overcome.
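To make the arithmetic behind the 60% threshold concrete, here is a minimal sketch of how accuracy and a pass/fail split could be computed from a leaderboard. It is not the LongCat team's evaluation code. The function names, the `PASS_THRESHOLD` constant, and the two "hypothetical" model entries are assumptions made for illustration; only the 62.8% figure for Gemini 3 Pro comes from the reported results.

```python
from typing import Dict, List

# Assumption: the 60% "passing grade" cutoff discussed in the article, in percent.
PASS_THRESHOLD = 60.0


def accuracy(correct: int, total: int) -> float:
    """Return accuracy as a percentage of correctly answered questions."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def below_threshold(scores: Dict[str, float],
                    threshold: float = PASS_THRESHOLD) -> List[str]:
    """Return the names of models scoring strictly below the threshold, sorted."""
    return sorted(name for name, score in scores.items() if score < threshold)


# Only the Gemini 3 Pro score is from the article. The other two entries are
# made-up placeholders, NOT real General 365 results.
scores = {
    "Gemini 3 Pro": 62.8,
    "model_a (hypothetical)": 55.0,
    "model_b (hypothetical)": 48.5,
}

top_model = max(scores, key=scores.get)
failing = below_threshold(scores)

print(f"Top model: {top_model} at {scores[top_model]}%")
print(f"{len(failing)} of {len(scores)} models are below {PASS_THRESHOLD}%")
```

With real per-model scores substituted for the placeholders, the same two calls would reproduce the headline numbers: the top score and the count of models under the passing mark.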
Industry Impact
The release of General 365 by Meituan's LongCat team could have a notable impact on how AI models are developed and marketed. By providing a benchmark on which even top-tier models like Gemini 3 Pro score only slightly above 60%, Meituan gives the industry a more demanding standard against which reasoning claims can be checked. This may encourage AI researchers to shift their focus from increasing parameter counts to improving the quality of machine reasoning. General 365 also offers enterprises a common yardstick for judging which models can handle sophisticated logic-based tasks, which could influence future investment and adoption decisions across the tech sector.
Frequently Asked Questions
Question: What is the primary purpose of the General 365 benchmark?
General 365 was developed by the Meituan LongCat team to specifically evaluate and set a new standard for the reasoning capabilities of large language models, moving beyond general performance metrics.
Question: How did the top-performing model fare on General 365?
Gemini 3 Pro was the highest-scoring model among the 26 tested, achieving an accuracy rate of 62.8%. Even so, that score sits only slightly above the 60% mark, which the majority of the other tested models failed to reach.
Question: What does the failure of most models to reach 60% accuracy signify?
It indicates that complex reasoning remains a major weakness for the majority of mainstream AI models. The results suggest that current AI technology still faces substantial challenges in performing logical tasks that meet a basic threshold of competency as defined by the General 365 benchmark.


