Meituan General 365: New AI Reasoning Benchmark Results

The Meituan LongCat team has officially launched General 365, a new benchmark designed to evaluate the reasoning capabilities of artificial intelligence models. An initial assessment of 26 mainstream models shows how far current systems remain from reliable reasoning. Google's Gemini 3 Pro, the top scorer, achieved an accuracy rate of only 62.8%, and the vast majority of the models tested failed to reach the 60% passing threshold. The results underline the difficulty of the General 365 evaluation and suggest that current large language models still face substantial hurdles in complex reasoning scenarios.

Key Takeaways

New Evaluation Standard: Meituan's LongCat team has introduced General 365, a benchmark specifically focused on reasoning capabilities.
Industry Performance Gap: Out of 26 mainstream models tested, the majority failed to achieve a score of 60%.
Top Performer: Gemini 3 Pro currently leads the benchmark but only managed an accuracy rate of 62.8%.
Rigorous Testing: The benchmark is positioned as a "new yardstick" for reasoning evaluation, and the low scores across the field point to its difficulty.

In-Depth Analysis

The Launch of General 365 and the Reasoning Challenge

The Meituan LongCat team has officially released General 365, positioning it as a critical new benchmark for the AI industry. The primary objective of this tool is to provide a "new yardstick" for reasoning evaluation, moving beyond simple task completion to test the underlying logic and cognitive depth of large language models. The introduction of General 365 comes at a time when the industry is seeking more nuanced ways to differentiate between models that can perform basic functions and those that truly possess advanced reasoning skills.

According to the LongCat team, the benchmark is intentionally designed to be challenging. By focusing on reasoning, Meituan is targeting one of the most difficult frontiers in AI development. The name "General 365" may hint at a comprehensive, all-encompassing approach to testing, though the core focus remains strictly on the accuracy of reasoning outputs across a wide variety of scenarios.

Comparative Performance of Mainstream Models

The initial testing phase of General 365 involved 26 of the most prominent AI models currently available in the market. The results of these tests serve as a sobering reality check for the state of AI reasoning. Even the most advanced models struggled to maintain high accuracy levels when subjected to the General 365 criteria.

Gemini 3 Pro, which is identified as the current industry leader in terms of raw performance, reached an accuracy of 62.8%. While this score places it at the top of the list among the 26 models tested, it also highlights how much room for improvement remains. Perhaps more significant is the finding that the "passing line" of 60% was out of reach for the vast majority of models. This failure to meet a basic 60% threshold suggests that many current AI architectures, while proficient in language generation, still lack the robust reasoning frameworks required to navigate the complexities presented by the General 365 benchmark.
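To make the arithmetic behind an accuracy rate and a passing line concrete, here is a minimal Python sketch. Only the 60% passing line and the 62.8% top score come from the reported results. The function names, the toy list of graded answers, and the choice to count a score exactly at the line as a pass are illustrative assumptions, not part of the General 365 methodology.

```python
PASS_LINE = 60.0  # passing threshold in percent, as reported for General 365


def accuracy_pct(graded):
    """Return the percentage of answers marked correct (hypothetical helper)."""
    if not graded:
        raise ValueError("no graded answers")
    return 100.0 * sum(1 for ok in graded if ok) / len(graded)


def passes(score_pct, pass_line=PASS_LINE):
    """Assumption: a score exactly at the pass line counts as passing."""
    return score_pct >= pass_line


def margin_over_pass_line(score_pct, pass_line=PASS_LINE):
    """Percentage points above (positive) or below (negative) the pass line."""
    return round(score_pct - pass_line, 1)


# Hypothetical toy run: 5 graded answers, 3 of them correct.
toy_run = [True, False, True, True, False]
print(accuracy_pct(toy_run))        # 60.0

# The reported top score clears the line by a narrow margin.
print(passes(62.8))                 # True
print(margin_over_pass_line(62.8))  # 2.8
```

On these numbers the leading model sits only 2.8 percentage points above the passing line, which is why the result reads as a narrow pass rather than a comfortable one.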

Industry Impact

The release of General 365 by Meituan's LongCat team could influence how AI models are developed and marketed. A benchmark on which even the strongest model barely exceeds 60% accuracy puts pressure on the industry to shift its focus. Developers may now be incentivized to prioritize reasoning and logical consistency over mere fluency or parameter count.

Furthermore, the fact that a major technology player like Meituan is contributing to the evaluation ecosystem suggests a move toward more transparent and standardized testing. As models continue to evolve, benchmarks like General 365 will be essential for identifying which systems are truly capable of handling complex, real-world problem-solving. This benchmark sets a high bar, serving as both a challenge to current AI leaders and a roadmap for future research and development in the field of artificial intelligence reasoning.

Frequently Asked Questions

Question: What is the General 365 benchmark?

General 365 is a new reasoning evaluation benchmark released by the Meituan LongCat team. It is designed to serve as a rigorous standard for testing the logical and reasoning capabilities of mainstream AI models.

Question: Which model performed the best on the General 365 test?

According to the initial results, Gemini 3 Pro is the top-performing model on the General 365 benchmark, achieving an accuracy rate of 62.8%.

Question: How did most models perform on this new benchmark?

The majority of the 26 mainstream models tested failed to reach the 60% accuracy mark, which is considered the passing line for the General 365 evaluation.
