
Meituan LongCat Unveils General 365: A Rigorous New Standard for AI Reasoning Evaluation
Meituan's LongCat team has officially released General 365, a new benchmark designed to evaluate the reasoning capabilities of artificial intelligence models. The initial testing phase covered 26 mainstream models and showed how far current systems remain from mastering the benchmark. The top-performing model, Gemini 3 Pro, achieved an accuracy of only 62.8%, and the vast majority of the models tested failed to reach 60% accuracy, a level commonly treated as a basic passing mark. With this release, Meituan aims to provide a more challenging and accurate measure of how well modern AI handles complex reasoning tasks.
Key Takeaways
- New Benchmark Release: Meituan's LongCat team has introduced General 365, a specialized evaluation tool for AI reasoning.
- Industry Performance Gap: Out of 26 mainstream models tested, most failed to reach a 60% accuracy rate.
- Top Performer Results: Gemini 3 Pro leads the current rankings but only managed a score of 62.8%.
- A New Standard: General 365 is positioned as a "new ruler" or benchmark for measuring the true reasoning depth of large language models.
In-Depth Analysis
The Challenge of General 365
The release of General 365 by the Meituan LongCat team adds a demanding new entry to the field of AI benchmarking. By testing 26 of the most prominent models currently available, the team has provided a broad snapshot of the industry's reasoning capabilities. The core finding, that the majority of these models cannot achieve 60% accuracy, suggests that General 365 is significantly more rigorous than existing benchmarks. The 60% "passing grade" serves as a useful reference point: falling short of it indicates that current models still struggle with complex, multi-step reasoning tasks that go beyond simple pattern matching or data retrieval.
Benchmarking the Best: Gemini 3 Pro's Performance
One of the most notable aspects of the General 365 release is the performance of Gemini 3 Pro. Despite being recognized as one of the most powerful models globally, it achieved an accuracy of 62.8%. While this score places it at the top of the 26 models tested, it cleared the 60% threshold by only 2.8 percentage points. The fact that the strongest model sits only slightly above what is commonly regarded as a basic passing level underscores the difficulty of the reasoning tasks in General 365 and shows that even the industry leaders have substantial room for improvement. This data point offers a realistic view of the current state of artificial intelligence, tempering expectations about reasoning proficiency with measured results.
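The margin discussed above is simple arithmetic, and a minimal Python sketch makes it concrete. Only the 62.8% score and the 60% threshold come from the article. The function names are hypothetical, and the way accuracy is computed (correct answers divided by total items) is an assumption, since the article does not describe how General 365 is scored.

```python
# Illustrative sketch only: the helper names are hypothetical, and the
# accuracy formula is an assumption (the article does not describe scoring).
PASS_THRESHOLD = 60.0  # percent; the "passing mark" cited in the article


def accuracy_percent(correct: int, total: int) -> float:
    """Return accuracy as a percentage of correctly answered items."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def margin_over_threshold(score: float, threshold: float = PASS_THRESHOLD) -> float:
    """Return how many percentage points a score sits above (+) or below (-) the threshold."""
    return round(score - threshold, 1)


# Gemini 3 Pro's reported score, taken from the article.
print(margin_over_threshold(62.8))  # prints 2.8
```

A negative result from `margin_over_threshold` would mean a model falls short of the passing mark, which the article reports is the case for most of the 26 models tested.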
Redefining Evaluation Metrics
Meituan's decision to make General 365 publicly available (the release is referred to as "Open General 365") indicates a move toward standardized, transparent evaluation. By establishing a "new ruler" (标尺), the LongCat team is challenging the AI community to look beyond high scores on older, perhaps saturated, benchmarks. The focus is on "General" reasoning, implying broad applicability across different domains. The results suggest that reasoning ability may not improve at the same rate as model size and complexity, which is why tools like General 365 are needed to expose these specific weaknesses.
Industry Impact
The introduction of General 365 is likely to influence how AI models are developed and marketed. For the AI industry, the benchmark serves as a wake-up call, demonstrating that current "state-of-the-art" models still struggle with fundamental reasoning when held to a higher standard. It shifts attention from overall performance to reasoning accuracy specifically. By setting a benchmark where most models currently fall short, Meituan has also created a new target for developers. This will likely drive further research aimed at closing the reasoning gap, as companies work to move their models past the 60% mark and eventually surpass the 62.8% high score set by Gemini 3 Pro.
Frequently Asked Questions
Question: What is Meituan's General 365?
General 365 is a reasoning evaluation benchmark released by Meituan's LongCat team. It is designed to test the reasoning capabilities of mainstream AI models and is positioned as a rigorous new standard for the industry.
Question: How did mainstream AI models perform on the General 365 benchmark?
In a test of 26 mainstream models, most failed to reach a 60% accuracy rate. The highest-scoring model, Gemini 3 Pro, achieved an accuracy of 62.8%, indicating that the benchmark is highly challenging for current AI technology.
Question: Why is the 60% accuracy mark significant in this report?
The report notes that most models failed to reach the 60% mark, which is often viewed as a basic "passing" threshold. This highlights a significant gap in the reasoning abilities of current large language models when faced with the General 365 evaluation criteria.


