
Meituan LongCat Open-Sources General 365: A New Benchmark Revealing AI Reasoning Challenges
Meituan's LongCat team has released General 365, an open-source benchmark for evaluating the reasoning capabilities of AI models. In an assessment of 26 mainstream models, Gemini 3 Pro was the top performer with an accuracy of 62.8%, making it one of the few models to surpass the 60% mark. Most of the models tested fell short of that level, which the benchmark treats as a passing line, underscoring how difficult advanced reasoning remains for current systems. The benchmark gives the AI community a new, demanding tool for measuring and improving reasoning performance.
Key Takeaways
- New Benchmark Released: Meituan's LongCat team has open-sourced General 365, a specialized tool for evaluating AI reasoning.
- Industry-Wide Testing: The benchmark was used to test 26 mainstream AI models to assess their logical capabilities.
- Gemini 3 Pro Leads: Gemini 3 Pro was the strongest model in the evaluation, with an accuracy rate of 62.8%.
- Performance Gap: The vast majority of tested models failed to reach a 60% accuracy threshold, indicating a widespread struggle with complex reasoning.
In-Depth Analysis
The Introduction of General 365
The Meituan LongCat team has introduced General 365 as an open-source reasoning evaluation benchmark. It aims to provide a demanding standard for measuring how well large language models handle complex logical tasks. By open-sourcing the benchmark, Meituan gives developers and researchers a transparent framework for testing their own models against criteria intended to reflect real-world reasoning challenges.
Evaluation of Mainstream Models
In the initial rollout of General 365, the LongCat team evaluated 26 mainstream AI models. The results offer a sobering look at the current state of AI reasoning. Even the strongest model in the evaluation, Gemini 3 Pro, reached an accuracy rate of only 62.8%. That score leads the field, yet it suggests that even the most advanced systems have significant room for improvement in deep reasoning.
The 60% Accuracy Threshold
One of the most striking findings in the LongCat team's report concerns the broader field. According to the data, the vast majority of the 26 models tested did not reach 60% accuracy, the level this benchmark treats as its "passing line." That most mainstream models fell short of it points to a bottleneck in AI development: models are becoming increasingly proficient at language generation, but their ability to apply logic and reasoning consistently remains underdeveloped.
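As a rough illustration of how a score relates to the passing line, the sketch below computes an accuracy percentage from graded answers and compares it with 60%. This is not the LongCat team's evaluation code: the function names, the toy graded results, and the choice to count a score of exactly 60% as passing are all assumptions made for illustration. The only figure taken from the article is Gemini 3 Pro's 62.8%.

```python
PASSING_LINE = 60.0  # percent; the passing line described in the article


def accuracy_percent(graded):
    """Return the share of correct answers as a percentage.

    `graded` is a list of booleans, one per benchmark question
    (True means the answer was judged correct). Hypothetical helper.
    """
    if not graded:
        raise ValueError("graded must contain at least one result")
    return 100.0 * sum(graded) / len(graded)


def clears_passing_line(score_percent, passing_line=PASSING_LINE):
    """Return True when a score meets or exceeds the passing line.

    Treating exactly 60% as a pass is an assumption of this sketch.
    """
    return score_percent >= passing_line


# The only score reported in the article: Gemini 3 Pro at 62.8%.
print("Gemini 3 Pro clears the line:", clears_passing_line(62.8))

# Toy graded results for a made-up model (not real benchmark data).
toy_graded = [True, False, True, False, False]
toy_score = accuracy_percent(toy_graded)
print("Toy model score:", toy_score, "clears the line:", clears_passing_line(toy_score))
```

Keeping the threshold as a named constant makes it easy to see how many models would pass under a stricter or looser line.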
Industry Impact
The release of General 365 and its accompanying results have significant implications for the AI industry. By establishing a benchmark on which even the top model only narrowly exceeds 60% accuracy, Meituan has set a more rigorous standard for what counts as "strong" reasoning. This may shift the industry's focus toward improving models' reasoning abilities rather than simply increasing parameter counts or conversational fluency. As an open-source project, General 365 also offers a shared metric that can support more transparent comparison among AI developers worldwide.
Frequently Asked Questions
Question: What is the primary purpose of Meituan's General 365?
General 365 is an open-source benchmark created by the Meituan LongCat team specifically to evaluate and set a new standard for the reasoning capabilities of AI models.
Question: Which model performed the best on the General 365 benchmark?
Gemini 3 Pro performed the best among the 26 mainstream models tested, achieving an accuracy rate of 62.8%.
Question: How did most AI models fare in the reasoning tests?
Most of the 26 mainstream models tested failed to reach the 60% accuracy threshold, which is considered the passing line for the benchmark.

