Meituan LongCat General 365: New AI Reasoning Benchmark

The Meituan LongCat team has released General 365, a new benchmark designed to evaluate the reasoning capabilities of large language models. In an assessment of 26 mainstream models, the results show how far current systems remain from reliable reasoning. Gemini 3 Pro, currently regarded as one of the most capable models, achieved the top score, yet reached only 62.8%. Most other models failed to reach 60% accuracy, which the team identifies as the 'passing mark.' The release sets a more demanding standard and suggests that complex reasoning remains a major hurdle for even the most advanced AI systems.

Key Takeaways

New Evaluation Standard: Meituan's LongCat team has introduced General 365 as a specialized benchmark for testing AI reasoning.
Industry-Wide Performance Gap: Of the 26 mainstream models tested, most failed to reach 60% accuracy.
Leading Model Performance: Gemini 3 Pro emerged as the top performer but only managed a score of 62.8%.
Raising the Bar: General 365 is positioned as a 'new ruler' that exposes the limitations of current large language models in complex reasoning tasks.

In-Depth Analysis

The Emergence of General 365 as a New Benchmark

The Meituan LongCat team has introduced General 365, a benchmark that aims to change how the industry measures reasoning in artificial intelligence. By positioning the tool as a 'new ruler' (新标尺), the team suggests that existing evaluation methods may not sufficiently challenge the current generation of large language models. The release comes as the AI industry shifts its focus from simple generative tasks to complex logical reasoning, which calls for more rigorous and precise measurement tools. The benchmark's name hints at a focus on comprehensive, perhaps year-round or all-encompassing, reasoning capabilities that mainstream models must now strive to meet.

Analyzing the Performance of 26 Mainstream Models

The initial data released alongside General 365 provides a sobering look at the current state of AI. The LongCat team tested 26 mainstream models to verify the benchmark's difficulty and the models' relative strengths. Most of these models could not reach the 60% accuracy mark, which is traditionally considered the 'passing grade.' That so many models fall short of this line suggests that General 365 targets reasoning weaknesses that are common across the industry rather than specific to any one model.

Gemini 3 Pro and the Ceiling of Current Reasoning

Even the most advanced models currently available find General 365 a significant challenge. Gemini 3 Pro, which the report describes as the strongest model on the planet ('地表最强'), achieved an accuracy rate of 62.8%. That score puts it at the top of the pack, but only 2.8 percentage points above the 60% passing line. Such a narrow margin for a leading model underscores the difficulty of the General 365 tasks. It also suggests that even the industry's flagship models struggle to reach high accuracy on this kind of reasoning, which points to a need for further advances in how AI handles logical sequences and complex problem-solving.
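The comparison against the passing line can be made concrete with a short sketch. The two figures (62.8% and the 60% passing line) come from the report; the function names and the choice to treat a score exactly at 60% as passing are illustrative assumptions, not part of General 365 itself.

```python
# Minimal sketch: compare a reported accuracy against the 60% passing line.
# PASS_MARK and the 62.8 score are from the report; the helper names and the
# ">=" rule for passing are assumptions made for illustration.

PASS_MARK = 60.0  # passing line, in percent


def margin_over_pass_mark(accuracy_pct: float, pass_mark: float = PASS_MARK) -> float:
    """Return the gap in percentage points (positive means above the line)."""
    return round(accuracy_pct - pass_mark, 1)


def passes(accuracy_pct: float, pass_mark: float = PASS_MARK) -> bool:
    """Assumed rule: a score at or above the pass mark counts as passing."""
    return accuracy_pct >= pass_mark


gemini_3_pro = 62.8  # top score reported among the 26 models

print(passes(gemini_3_pro))                 # True
print(margin_over_pass_mark(gemini_3_pro))  # 2.8
```

The same helpers would apply to any of the other 25 models once their individual scores are known; the report summarized here gives only the top score.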

Industry Impact

The introduction of General 365 by Meituan's LongCat team could influence how AI models are developed and marketed. By establishing a benchmark on which the 'strongest' model scores only 62.8%, the LongCat team has raised the bar for AI researchers. Clearing the 60% threshold may itself become a near-term objective for developers. The benchmark also gives a transparent view of the limitations of current technology, encouraging the industry to look past surface-level performance and focus on the deep reasoning capabilities required for more sophisticated, real-world applications. If more teams adopt General 365 as a standard, it could lead to more honest and rigorous AI evaluation.

Frequently Asked Questions

Question: What is the significance of the 60% score in the General 365 benchmark?

In the General 365 release, the 60% mark is described as the 'passing line' (及格线). That most of the 26 mainstream models failed to reach this score indicates that the benchmark is very difficult and that current AI reasoning capabilities still have a long way to go.

Question: How did the top-performing model fare on this new test?

Gemini 3 Pro was identified as the top-performing model among those tested, yet it only achieved an accuracy rate of 62.8%. This suggests that even the most advanced AI models currently on the market have significant room for improvement when it comes to the specific reasoning challenges posed by General 365.

Question: Who developed the General 365 benchmark?

General 365 was developed and released by the Meituan LongCat team. It was introduced through the Meituan Technology Team's official channels as a new standard for evaluating the reasoning performance of large language models.
