
Meituan LongCat Team Unveils General 365: A Rigorous New Benchmark for Evaluating AI Reasoning Capabilities
The Meituan LongCat team has officially released General 365, a new evaluation benchmark designed to test the reasoning limits of large language models. In an initial assessment of 26 mainstream models, the benchmark revealed how far the industry still has to go. Gemini 3 Pro, currently regarded as the most powerful model, scored highest but reached an accuracy of only 62.8%. Most of the other models fell short of the benchmark's 60% passing line, which underscores how difficult General 365 is. With this release, Meituan aims to establish a more demanding standard for reasoning and to push the AI industry beyond general knowledge toward more complex cognitive processing and problem-solving.
Key Takeaways
- New Benchmark Release: Meituan's LongCat team has launched General 365, a specialized benchmark for reasoning evaluation.
- Industry-Wide Testing: The benchmark was used to test 26 mainstream AI models to assess their logical and reasoning performance.
- Leading Performance: Gemini 3 Pro emerged as the top performer but only managed an accuracy rate of 62.8%.
- High Difficulty Level: The majority of the 26 models tested failed to reach the 60% accuracy mark, which is considered the passing grade for this benchmark.
In-Depth Analysis
The Emergence of General 365 as a Reasoning Standard
The release of General 365 by the Meituan LongCat team reflects a shift in emphasis in how artificial intelligence is evaluated. While many existing benchmarks focus on broad knowledge or linguistic fluency, General 365 is positioned specifically as a "new ruler" for reasoning. By focusing on the cognitive depth of models, Meituan is addressing a need in the AI community: telling apart models that produce fluent, plausible text from models that can work through complex, multi-step reasoning. That the benchmark comes from a major technology team like Meituan suggests growing demand among industry teams for standards that can accurately measure progress in high-level AI development.
Analyzing the Performance Gap: Gemini 3 Pro and the 60% Threshold
The results of the initial testing phase provide a sobering look at the current state of AI reasoning. Gemini 3 Pro, which is currently described as the strongest model available, achieved a score of 62.8%. While this places it at the top of the 26 models tested, it clears the benchmark's "passing line" of 60% by only 2.8 percentage points. This suggests that even the most advanced systems are only just beginning to master the types of reasoning tasks presented in General 365.
Furthermore, the fact that most of the 26 mainstream models did not reach the 60% mark points to a widespread struggle with complex reasoning across the industry. General 365 is not a routine test; it is a high-bar evaluation that exposes the limitations of current large language models (LLMs). The gap between general performance and reasoning-specific performance suggests that while models are becoming more conversational, their underlying logic and problem-solving still require significant refinement.
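The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration, not the LongCat team's scoring code: the 60% passing line and Gemini 3 Pro's 62.8% come from the report, while the other model names and scores below are hypothetical placeholders, since per-model results are not given here.

```python
PASSING_LINE = 60.0  # passing threshold in percent, as reported


def margin_over_passing_line(accuracy_pct, passing_line=PASSING_LINE):
    """Percentage points above (+) or below (-) the passing line."""
    return round(accuracy_pct - passing_line, 1)


def pass_rate(scores, passing_line=PASSING_LINE):
    """Fraction of models whose accuracy reaches the passing line."""
    if not scores:
        raise ValueError("scores must not be empty")
    passed = sum(1 for acc in scores.values() if acc >= passing_line)
    return passed / len(scores)


# Only the Gemini 3 Pro figure is reported; the rest are made-up placeholders.
example_scores = {
    "Gemini 3 Pro": 62.8,
    "hypothetical-model-a": 58.1,
    "hypothetical-model-b": 51.4,
    "hypothetical-model-c": 44.9,
}

print(margin_over_passing_line(example_scores["Gemini 3 Pro"]))
print(pass_rate(example_scores))
```

Reporting the margin in percentage points rather than as a relative change keeps the comparison with the 60% line unambiguous.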
The Significance of the 26-Model Comparison
By testing 26 different models, the Meituan LongCat team has provided a broad cross-section of the AI landscape. This scope makes it less likely that the results are an anomaly and more likely that they reflect the current state of the technology. Because the 26 models represent the mainstream of the industry, their results offer a useful picture of where the field stands. The collective struggle to meet the 60% accuracy requirement serves as a call to action for AI researchers to prioritize reasoning architectures over simple parameter scaling.
Industry Impact
The introduction of General 365 is likely to influence the AI industry in several key ways. First, it sets a new, higher standard for what constitutes "passing" in terms of reasoning. By establishing a 60% threshold that most current models cannot meet, Meituan has created a clear target for future development. This will likely encourage AI labs to focus more on the quality of reasoning rather than just the quantity of data or the size of the model.
Second, the benchmark provides a transparent look at the performance of leading models like Gemini 3 Pro in a specialized context. This transparency is vital for enterprises and developers who need to know the true capabilities of the models they are integrating into their systems. As reasoning becomes a core requirement for AI applications in fields like engineering, law, and medicine, benchmarks like General 365 will become essential tools for vetting and selecting the right technology.
Finally, Meituan's contribution to the open-source or public evaluation space reinforces the importance of independent, rigorous testing in an industry often characterized by rapid, unverified claims of "human-level" performance.
Frequently Asked Questions
Question: What is the primary purpose of the General 365 benchmark?
General 365 was developed by the Meituan LongCat team to serve as a new standard for evaluating the reasoning capabilities of large language models. It aims to provide a more rigorous and accurate measure of a model's ability to perform complex logical tasks compared to traditional benchmarks.
Question: How did the top-performing models fare on General 365?
According to the results released by Meituan, Gemini 3 Pro was the highest-performing model among the 26 tested, achieving an accuracy rate of 62.8%. However, the majority of the other mainstream models failed to reach the 60% passing mark, indicating the high difficulty of the benchmark.
Question: Why is the 60% accuracy mark significant in this context?
The 60% mark is considered the "passing line" for the General 365 benchmark. The fact that most mainstream models failed to reach this score suggests that current AI technology still has significant room for improvement in the area of complex reasoning and logical problem-solving.


