
Meituan LongCat Releases General 365 Reasoning Benchmark: Most AI Models Fail to Reach Passing Grade
The Meituan LongCat team has officially open-sourced "General 365," a new evaluation benchmark designed to measure the reasoning capabilities of AI models. In a comprehensive test involving 26 mainstream models, the results revealed a significant gap in current AI reasoning performance. Even the industry-leading Gemini 3 Pro achieved an accuracy rate of only 62.8%, while the vast majority of tested models failed to reach the 60% threshold. This release aims to establish a more rigorous standard for evaluating complex reasoning tasks in the AI industry, highlighting the ongoing challenges in developing truly capable reasoning engines. By open-sourcing this tool, Meituan provides a new yardstick for the global AI community to assess and improve logical depth in large language models.
Key Takeaways
- Meituan LongCat Open-Sources General 365: A new benchmark specifically designed to evaluate the reasoning capabilities of AI models.
- Widespread Performance Gap: Out of 26 mainstream models tested, the majority failed to reach a 60% accuracy rate, which is considered a basic passing grade.
- Gemini 3 Pro Leads the Field: Currently the top performer on this benchmark, Gemini 3 Pro achieved an accuracy of 62.8%.
- New Industry Standard: General 365 sets a high bar for reasoning, suggesting that current AI models still struggle with complex logical tasks.
In-Depth Analysis
The Launch of General 365 by Meituan LongCat
The Meituan LongCat team has officially introduced General 365, an open-source benchmark that aims to redefine how the industry evaluates the reasoning performance of Large Language Models (LLMs). In the current AI landscape, where many models excel at creative writing or basic information retrieval, the ability to perform consistent and complex logical reasoning remains a significant differentiator. General 365 was developed to address this specific need, providing a standardized and rigorous framework for testing. By making this benchmark open-source, Meituan is inviting the global research community to subject their models to a more demanding set of criteria, potentially exposing the limitations of systems that otherwise perform well on less specialized tests.
Analyzing the Reasoning Performance Gap
The initial evaluation conducted by the LongCat team involved 26 of the most prominent AI models currently available. The results are telling: the vast majority of these models were unable to reach a 60% accuracy level. In many academic and professional settings, 60% is viewed as the minimum threshold for competency, or a "passing grade." The fact that most mainstream models fell below this line suggests that the tasks within General 365 are considerably harder for current systems than those in many existing tests. This widespread failure suggests that while AI has become highly proficient at pattern recognition and linguistic fluency, the transition to robust, reliable reasoning is still in its early stages. The benchmark serves as a reality check for the industry, highlighting that "intelligence" in AI is not yet synonymous with "reasoning."
Gemini 3 Pro and the Current State of the Art
Among the 26 models evaluated, Gemini 3 Pro emerged as the leader, yet its performance further illustrates the difficulty of the General 365 benchmark. With an accuracy rate of 62.8%, Gemini 3 Pro tops the leaderboard, but even this score leaves significant room for improvement. A score of 62.8% means that 37.2% of the reasoning tasks, nearly four out of every ten, were handled incorrectly or incompletely. This result marks the current "ceiling" for this generation of AI, showing that even the most advanced models from leading tech giants are only just beginning to cross the threshold of basic reasoning competency. The data provided by Meituan suggests that the path to achieving human-level reasoning in AI will require more than just incremental updates; it may require fundamental shifts in how these models process logical structures.
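The arithmetic behind the "nearly four out of ten" figure can be checked with a minimal Python sketch. It uses only the two numbers reported above (62.8% accuracy and the 60% passing grade); the function and constant names are hypothetical and are not part of the General 365 tooling.

```python
# Figures reported in the article; all names here are illustrative assumptions.
TOP_ACCURACY = 0.628       # Gemini 3 Pro's reported accuracy on General 365
PASSING_THRESHOLD = 0.60   # the conventional "passing grade" the article uses


def error_rate(accuracy: float) -> float:
    """Return the share of tasks that were not answered correctly."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return 1.0 - accuracy


def margin_over_threshold(accuracy: float,
                          threshold: float = PASSING_THRESHOLD) -> float:
    """Return how far an accuracy sits above (+) or below (-) the threshold."""
    return accuracy - threshold


if __name__ == "__main__":
    print(f"Error rate: {error_rate(TOP_ACCURACY):.1%}")
    print(f"Margin over passing grade: {margin_over_threshold(TOP_ACCURACY):+.1%}")
```

Under these assumptions, the top score sits only 2.8 percentage points above the passing line, while 37.2% of tasks remain unsolved.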
Industry Impact
The introduction of General 365 could shift the AI industry's focus from general performance to specialized reasoning capabilities. As developers and researchers strive to climb the General 365 leaderboard, we can expect a renewed emphasis on architectural innovations that prioritize logic and multi-step problem-solving. Furthermore, Meituan's decision to open-source the benchmark allows it to become a transparent and evolving standard that the community can inspect and update, which may help guard against benchmark overfitting, where models are trained specifically to pass certain tests without gaining actual underlying capability. For the broader industry, these results serve as a call to action to address the "reasoning deficit" that currently exists even in the most sophisticated AI systems.
Frequently Asked Questions
What is the General 365 benchmark?
General 365 is an open-source evaluation tool released by the Meituan LongCat team. It is specifically designed to measure and benchmark the reasoning capabilities of AI models, providing a more rigorous standard than many existing general-purpose tests.
How did the top AI models perform on this test?
In a test of 26 mainstream models, performance was generally low. Gemini 3 Pro was the top performer with a 62.8% accuracy rate. The majority of the other models tested failed to reach the 60% accuracy mark, which is commonly regarded as a basic passing grade.
Why did Meituan LongCat open-source this tool?
By open-sourcing General 365, Meituan aims to provide the AI community with a transparent and standardized way to evaluate reasoning. This encourages the development of models that are not just good at generating text, but are also capable of complex, logical thought processes.


