Meituan LongCat General 365: New AI Reasoning Benchmark Results

Meituan's LongCat team has officially launched General 365, a new evaluation benchmark designed to set a higher standard for measuring AI reasoning. In a test of 26 mainstream models, even the top scorer, Gemini 3 Pro, achieved only a 62.8% accuracy rate, and the vast majority of tested models failed to reach the 60% threshold. The results highlight how far large language models remain from high reasoning accuracy, and the benchmark gives the industry a diagnostic tool for measuring progress beyond simple linguistic fluency.

Key Takeaways

Meituan's LongCat team has officially released the General 365 benchmark to evaluate AI reasoning capabilities.
In a rigorous test of 26 mainstream models, Gemini 3 Pro emerged as the top performer with an accuracy rate of 62.8%.
The majority of models tested failed to reach the 60% accuracy mark, which is considered the benchmark's passing threshold.
General 365 establishes a new, more difficult standard for measuring the logical and reasoning progress of large language models.

In-Depth Analysis

The Introduction of General 365 by Meituan LongCat

The Meituan LongCat team has introduced General 365, a benchmark designed to raise the bar for how AI reasoning is measured. At a time when many models report high scores on standard tests, General 365 is intended as a more granular and challenging assessment of a model's logical processing. By focusing on reasoning, Meituan is addressing a critical gap in the current AI landscape: the difference between producing fluent text and reasoning correctly.

The release of this benchmark by a major technical team like Meituan points to a shift toward more rigorous, industry-led evaluation standards. As AI moves from general-purpose assistants to specialized tools requiring high reliability, the "yardstick" used to measure them must become more demanding. General 365 is positioned as that new standard, challenging the current generation of models on complex scenarios. It also serves as a diagnostic tool that shows where current architectures succeed and, more importantly, where their reasoning breaks down.

Analyzing the Performance Gap: The 60% Barrier

The initial results released alongside General 365 provide a sobering look at the state of modern artificial intelligence. Across the 26 mainstream models evaluated, scores were notably lower than the figures typically highlighted in marketing materials. Gemini 3 Pro, widely regarded as one of the most powerful models available, achieved an accuracy rate of 62.8%. That placed it at the top of the leaderboard, but only 2.8 percentage points above the 60% passing mark, which shows that even the "best" models are only slightly above a basic level of proficiency on this specific benchmark.

Perhaps more telling is the performance of the rest of the field. The Meituan LongCat team reported that the vast majority of the 26 models failed to reach the 60% "passing" mark. This suggests that for most current large language models, the reasoning tasks in General 365 represent a significant difficulty spike. There are two plausible readings of this result: current AI development may be hitting a plateau in reasoning, or General 365 may have identified a specific type of logic that current architectures struggle to master. Either way, the data serves as a reality check for the industry: there is still a long way to go before AI can consistently handle complex reasoning tasks with high reliability.
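The comparison against the passing mark is simple arithmetic, and a minimal sketch may make it concrete. The function names below are hypothetical and are not part of General 365. The only figures taken from the results are the 62.8% score and the 60% passing mark, and treating a score of exactly 60% as a pass is an assumption.

```python
# Illustrative sketch only: these helpers are hypothetical, not part of General 365.
PASSING_LINE = 60.0  # percent; the "passing" mark cited in the results


def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage of correctly answered questions."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def passes(score_pct: float, passing_line: float = PASSING_LINE) -> bool:
    """Assumption: a score equal to the passing line counts as a pass."""
    return score_pct >= passing_line


def margin(score_pct: float, passing_line: float = PASSING_LINE) -> float:
    """Percentage points above (+) or below (-) the passing line."""
    return round(score_pct - passing_line, 1)


# The one reported score: Gemini 3 Pro at 62.8%.
gemini_margin = margin(62.8)
gemini_passes = passes(62.8)
```

Here `gemini_margin` works out to 2.8 percentage points, which is the slim lead over the passing mark described above.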

Industry Impact

The launch of General 365 is likely to influence how AI research is prioritized and evaluated. By exposing the limitations of even the most advanced models like Gemini 3 Pro, the benchmark pushes the industry to look beyond simple parameter scaling and toward architectural improvements that enhance logical reasoning. It also gives developers and enterprises a common metric for comparing models on reasoning.

As companies look to integrate AI into critical business processes, benchmarks like General 365 help set realistic expectations of performance. The finding that most models cannot yet reach 60% accuracy on this benchmark's reasoning tasks will likely lead to a more cautious and focused approach to AI deployment in sectors where precision is paramount. Furthermore, Meituan's contribution to the open-source and research community with this benchmark encourages a more competitive and transparent environment for model development, where the focus shifts from "chatting" to "thinking."

Frequently Asked Questions

Question: What is the General 365 benchmark released by Meituan?

General 365 is a new evaluation benchmark developed by Meituan's LongCat team specifically designed to test the reasoning capabilities of large language models. It aims to set a higher and more rigorous standard for AI performance measurement than existing benchmarks.

Question: Which model performed the best on the General 365 benchmark?

According to the initial results released by the Meituan technical team, Gemini 3 Pro was the top-performing model among the 26 mainstream models tested, achieving an accuracy rate of 62.8%.

Question: How did the majority of AI models fare in the General 365 evaluation?

The majority of the 26 mainstream models tested failed to reach an accuracy rate of 60%. This 60% mark is described by the Meituan LongCat team as the "passing line" for the benchmark, indicating that most current models struggle with the reasoning tasks it presents.
