Meituan General 365: New AI Reasoning Benchmark Results

The Meituan LongCat team has officially released General 365, a new open-source benchmark designed to evaluate the reasoning capabilities of large language models (LLMs). In an initial assessment of 26 mainstream models, the results highlight a significant gap in current AI reasoning performance. Gemini 3 Pro, currently regarded as one of the most powerful models globally, achieved an accuracy rate of only 62.8%. Furthermore, the vast majority of the models tested failed to reach the 60% threshold, which is traditionally considered a passing grade. This release by Meituan's technical team sets a rigorous new standard for the industry, emphasizing that complex reasoning remains a formidable challenge even for the most advanced artificial intelligence systems.

Key Takeaways

New Evaluation Standard: Meituan's LongCat team has open-sourced General 365, a benchmark specifically focused on general reasoning capabilities.
Performance Gap: Out of 26 mainstream models tested, the industry-leading Gemini 3 Pro only managed a 62.8% accuracy rate.
Widespread Underperformance: Most current AI models failed to reach the 60% accuracy mark on the General 365 benchmark.
Open Source Contribution: The release provides the AI community with a "new ruler" to measure and improve reasoning logic in large language models.

In-Depth Analysis

The Launch of General 365 and the Reasoning Challenge

The Meituan LongCat team has introduced General 365 at a critical juncture in AI development. As large language models evolve, the focus is shifting from simple information retrieval to complex logical reasoning. By open-sourcing General 365, Meituan is providing a structured framework to evaluate how models handle multi-step logic and problem-solving. The title of the release, "Setting a New Ruler for Reasoning Evaluation," suggests that existing benchmarks may not be challenging or comprehensive enough to distinguish the reasoning depth of modern LLMs. General 365 aims to fill this gap by offering a more rigorous testing ground.

Analyzing the Performance of Mainstream Models

The data released alongside General 365 provides a sobering look at the current state of artificial intelligence. The LongCat team tested 26 mainstream models, representing a broad cross-section of the industry's current capabilities. The results indicate that reasoning is still a significant hurdle. Even Gemini 3 Pro, which is described as the "strongest on Earth" (地表最强), achieved an accuracy of only 62.8%. This score, while leading the pack, means that even the top-ranked model failed 37.2% of the reasoning tasks in the General 365 suite.

Perhaps more telling is the performance of the remaining 25 models. The report notes that the vast majority of them did not reach the 60% "passing line." This widespread shortfall indicates that, while AI has made strides in natural language processing, consistent multi-step reasoning remains out of reach for most current models. The results give the industry a reference point for where current LLMs fall short.
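The figures in this section reduce to simple arithmetic. As a minimal sketch, assuming accuracy is the fraction of benchmark items answered correctly (the report does not spell out its scoring method), the Python snippet below shows how a 62.8% score maps to an error rate and how a 60% passing line would be applied. The function names and the toy grading list are hypothetical and are not part of General 365.

```python
# Illustrative only: names and toy data are assumptions, not General 365 code.

PASSING_LINE = 0.60  # the 60% "passing line" cited in the report


def accuracy(graded):
    """Return the fraction of correct answers in a list of booleans."""
    if not graded:
        raise ValueError("no graded items")
    return sum(graded) / len(graded)


def passes(acc, threshold=PASSING_LINE):
    """Return True if an accuracy value meets or exceeds the threshold."""
    return acc >= threshold


# The only concrete score given in the report: Gemini 3 Pro at 62.8%.
reported_accuracy = 0.628
error_rate = 1 - reported_accuracy
print(f"error rate: {error_rate:.1%}")      # error rate: 37.2%
print("passes:", passes(reported_accuracy))  # passes: True

# Hypothetical per-item grading for a toy run (not real benchmark data).
toy_run = [True, True, False, True, False]
print("toy accuracy:", accuracy(toy_run))    # toy accuracy: 0.6
```

Under these assumptions, a model sits exactly on the passing line when it answers three of every five items correctly, which puts the top reported score only 2.8 percentage points above that line.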

Industry Impact

Redefining Success in AI Development

The introduction of General 365 is likely to shift the industry's focus toward more rigorous reasoning benchmarks. By demonstrating that even the most advanced models like Gemini 3 Pro have significant room for improvement, Meituan is encouraging a move away from superficial performance metrics toward deeper logical consistency. This "new ruler" provides a clear target for AI researchers, emphasizing that high-quality reasoning is the next frontier for model optimization.

Encouraging Transparency through Open Source

By open-sourcing the General 365 benchmark, the Meituan LongCat team is fostering a more transparent and competitive environment. Developers can now use this tool to identify specific weaknesses in their models' reasoning chains. As more teams adopt this benchmark, it could lead to a standardized way of reporting reasoning capabilities, making it easier for the industry to track progress and for users to understand the actual limitations of the AI tools they employ.

Frequently Asked Questions

Question: What is the primary purpose of Meituan's General 365?

General 365 is an open-source benchmark released by the Meituan LongCat team specifically designed to evaluate and set a new standard for the reasoning capabilities of large language models.

Question: How did top-tier models perform on this benchmark?

In tests involving 26 mainstream models, Gemini 3 Pro achieved the highest accuracy at 62.8%. However, most other models failed to reach a 60% accuracy rate, indicating that reasoning remains a major challenge for current AI technology.

Question: Why is the 60% accuracy mark significant in this report?

The report uses the 60% mark as a metaphorical "passing line." The fact that most models failed to reach this level suggests that current AI reasoning capabilities are not yet reliable for the complex tasks that the General 365 benchmark contains.
