
Meituan LongCat Team Releases General 365 Benchmark Revealing Significant Reasoning Gaps in Leading AI Models
The Meituan LongCat team has released General 365, a new benchmark for evaluating the reasoning capabilities of large language models (LLMs). In an assessment of 26 mainstream models, even Gemini 3 Pro, currently regarded as one of the most powerful models available, achieved an accuracy of only 62.8%. The vast majority of the tested models failed to reach 60% accuracy, a level often treated as a basic passing grade. The results set a demanding bar for measuring AI reasoning and suggest that most current models still struggle with complex logical tasks.
Key Takeaways
- New Evaluation Standard: Meituan’s LongCat team has launched General 365, a benchmark specifically focused on testing the reasoning limits of AI models.
- Gemini 3 Pro Performance: As the top-performing model in the initial test, Gemini 3 Pro reached an accuracy of 62.8%, setting the current ceiling for the benchmark.
- Industry-Wide Challenge: Out of 26 mainstream models tested, the majority were unable to achieve a 60% accuracy rate, indicating a widespread struggle with the benchmark's requirements.
- Rigorous Benchmarking: The results suggest that General 365 serves as a high-bar metric, exposing the limitations of even the most advanced current large language models.
In-Depth Analysis
The Launch of General 365 and the Reasoning Frontier
The release of General 365 by the Meituan LongCat team reflects a shift in emphasis in how AI performance is measured. While many existing benchmarks focus on general knowledge or linguistic fluency, General 365 targets reasoning. By testing 26 mainstream models, the LongCat team has provided a cross-sectional view of the industry's current capabilities. The breadth of models included suggests the team is aiming for a general standard that can distinguish surface-level pattern matching from deeper logical reasoning.
Analyzing the Performance Gap
The data from the Meituan technical team points to a clear performance gap. Gemini 3 Pro, widely regarded as among the current state of the art, led the group but scored only 62.8%. That score is telling when compared with the rest of the field: the majority of the 26 models failed to reach the 60% mark. This 60% threshold serves as a symbolic passing grade, and the failure of most models to meet it suggests that General 365 is designed to be exceptionally difficult. It exposes a gap between the perceived intelligence of modern LLMs and their measured performance on rigorous reasoning tasks. The results imply that although models are becoming more sophisticated, complex multi-step reasoning remains a primary bottleneck.
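To make the pass-threshold arithmetic concrete, here is a minimal Python sketch of how accuracy scores could be compared against the 60% line. It is not the LongCat team's evaluation code. The function names (`accuracy`, `summarize`) are hypothetical, and every score except Gemini 3 Pro's reported 62.8% is an invented placeholder, not a reported result.

```python
PASS_THRESHOLD = 60.0  # the symbolic "passing grade" from the report, in percent


def accuracy(correct: int, total: int) -> float:
    """Return accuracy as a percentage of questions answered correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def summarize(scores: dict[str, float], threshold: float = PASS_THRESHOLD) -> dict:
    """Split models into those at or above the threshold and those below it."""
    passed = sorted(name for name, score in scores.items() if score >= threshold)
    failed = sorted(name for name, score in scores.items() if score < threshold)
    return {
        "top": max(scores, key=scores.get),  # highest-scoring model
        "passed": passed,
        "failed": failed,
        "pass_rate": len(passed) / len(scores),
    }


scores = {
    "Gemini 3 Pro": 62.8,         # figure reported in the article
    "placeholder_model_a": 55.0,  # invented for illustration only
    "placeholder_model_b": 41.5,  # invented for illustration only
}

summary = summarize(scores)
print(summary["top"], summary["passed"], summary["failed"])
```

With the full set of 26 reported scores in place of the placeholders, the same summary would show directly how many models clear the threshold.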
Industry Impact
The introduction of General 365 is likely to push the AI industry toward more specialized and more difficult evaluation metrics. As models saturate general-purpose benchmarks with high scores, the industry needs more discriminating tools like General 365 to identify real progress in reasoning. Meituan's findings serve as a reality check for developers and researchers, showing that even acclaimed models such as Gemini 3 Pro have substantial room for improvement. The benchmark may encourage research focused specifically on logical consistency and reasoning depth, since few models currently pass by this standard.
Frequently Asked Questions
Question: What is the General 365 benchmark?
General 365 is a reasoning evaluation benchmark released by the Meituan LongCat team. It is designed to test the reasoning capabilities of large language models and has recently been used to evaluate 26 mainstream models in the industry.
Question: How did the top AI models perform on this benchmark?
According to the report from Meituan, Gemini 3 Pro was the top performer with an accuracy rate of 62.8%. However, the majority of the 26 models tested did not reach the 60% accuracy threshold, highlighting the difficulty of the benchmark.
Question: Why is the 60% accuracy mark significant in this report?
The 60% mark is described as a "passing grade." The fact that most mainstream models failed to reach it indicates that current AI reasoning capabilities still fall short on the specific challenges posed by General 365.


