
Meituan LongCat Releases General 365: A Challenging New Benchmark for AI Reasoning Evaluation
Meituan's LongCat team has open-sourced General 365, a new benchmark designed to measure the reasoning capabilities of large language models (LLMs). A test of 26 mainstream models revealed a significant gap in current AI reasoning performance: even the top-performing model, Gemini 3 Pro, achieved an accuracy of only 62.8%, and the vast majority of tested models fell short of the 60% passing mark. The release aims to establish a more rigorous standard for the industry by showing how even the most advanced AI systems struggle with complex reasoning tasks. By providing a transparent and difficult metric, Meituan seeks to drive the development of AI with stronger logical reasoning.
Key Takeaways
- New Benchmark Released: Meituan's LongCat team has open-sourced General 365, a specialized benchmark for reasoning evaluation.
- Low Overall Performance: Out of 26 mainstream models tested, most failed to reach a 60% accuracy threshold.
- Top Performer: Gemini 3 Pro currently leads the benchmark but only achieved a score of 62.8%.
- Industry Standard: General 365 is positioned as a new, more difficult "yardstick" for measuring logical reasoning in AI.
In-Depth Analysis
The Launch of General 365
The Meituan LongCat team introduced General 365 with the specific intent of setting a new standard for reasoning evaluation. Many current models perform exceptionally well on standard benchmarks that focus on knowledge retrieval or basic linguistic tasks. General 365, by contrast, is designed to probe the logical and reasoning abilities of these models more deeply. By open-sourcing the benchmark, Meituan gives the global developer community a rigorous framework for testing the limits of LLMs.
Performance Gap in Mainstream Models
The initial results released alongside the benchmark provide a sobering look at the current state of AI reasoning. Meituan conducted empirical tests on 26 of the most prominent models in the industry. The findings indicate that reasoning remains a significant hurdle for AI development. Gemini 3 Pro, recognized as one of the most powerful models currently available, secured the top spot but only managed an accuracy rate of 62.8%.
More strikingly, the vast majority of the 26 models tested did not reach the 60% mark, which is often considered a basic "passing" grade in academic and professional evaluations. This suggests that while AI has made strides in many areas, complex reasoning (the ability to follow logical chains and solve intricate problems) remains difficult even for the most advanced systems. The General 365 results indicate that AI is still far from consistently mastering high-level reasoning tasks.
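To make the two figures above concrete (per-model accuracy and the 60% passing mark), here is a minimal sketch of how such a summary could be computed. It is not Meituan's evaluation code: the function names, the `PASS_MARK` constant, and the toy data with placeholder model names are all assumptions made for illustration, and the numbers are not General 365 results.

```python
from typing import Dict, List

# Assumption: the 60% "passing" line from the article, expressed as a fraction.
PASS_MARK = 0.60


def accuracy(graded: List[bool]) -> float:
    """Return the fraction of questions graded as correct."""
    if not graded:
        raise ValueError("no graded answers")
    return sum(graded) / len(graded)


def summarize(results: Dict[str, List[bool]]) -> Dict[str, object]:
    """Compute per-model accuracy, the top model, and models below the pass mark."""
    scores = {name: accuracy(graded) for name, graded in results.items()}
    best = max(scores, key=scores.get)
    below = sorted(name for name, score in scores.items() if score < PASS_MARK)
    return {"scores": scores, "best": best, "below_pass_mark": below}


# Hypothetical graded answers (True = correct); not real benchmark data.
toy_results = {
    "model_a": [True, True, True, False, False],
    "model_b": [True, False, False, False, True],
    "model_c": [True, True, True, True, False],
}

summary = summarize(toy_results)
print(summary["best"], summary["below_pass_mark"])
```

In this toy setup a model that scores exactly 60% counts as reaching the mark; whether a real evaluation treats the boundary that way is a design choice the article does not specify.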
Industry Impact
The introduction of General 365 is significant for the AI industry because it shifts the focus from general performance to reasoning capability specifically. By establishing a benchmark on which even the "strongest" models score relatively low, Meituan is challenging the industry to move beyond superficial improvements.
This benchmark serves as a reality check for AI researchers and developers. It provides a clear, quantifiable metric that exposes weaknesses in current models' logical reasoning. If more organizations adopt General 365 for internal testing, it could influence the direction of model training, pushing developers to prioritize reasoning depth and logical consistency. As an open-source project, it also encourages transparency and collaborative improvement across the AI ecosystem, setting a high bar for what counts as a "capable" reasoning model.
Frequently Asked Questions
Question: What is the primary purpose of Meituan's General 365?
General 365 is an open-source benchmark created by the Meituan LongCat team specifically to evaluate and set a new standard for the reasoning capabilities of large language models.
Question: How did the top AI models perform on this new benchmark?
Performance was generally low across the board. Out of 26 mainstream models, Gemini 3 Pro performed the best with a 62.8% accuracy rate, while most other models failed to reach the 60% threshold.
Question: Why is General 365 considered a "new yardstick" for the industry?
It is considered a new yardstick because it focuses on complex reasoning tasks that current models find difficult, providing a more rigorous and challenging evaluation than many existing benchmarks.

