Meituan LongCat General 365: New AI Reasoning Benchmark

The Meituan LongCat team has officially released General 365, a sophisticated evaluation benchmark designed to measure the reasoning capabilities of large language models (LLMs). In an initial assessment of 26 mainstream models, the benchmark revealed a significant performance gap across the industry. Gemini 3 Pro, currently regarded as one of the most capable models, achieved an accuracy rate of only 62.8%. More strikingly, the vast majority of the models tested failed to reach the 60% threshold, which is considered a basic passing grade. This release by Meituan sets a new, more challenging standard for AI evaluation, highlighting that complex reasoning remains a major hurdle for even the most advanced artificial intelligence systems today.

Key Takeaways

New Benchmark Release: Meituan's LongCat team has introduced General 365, a benchmark specifically focused on evaluating the reasoning performance of AI models.
Industry-Wide Testing: The benchmark was used to evaluate 26 mainstream models to provide a comprehensive overview of the current state of AI reasoning.
Gemini 3 Pro Performance: Even the top-performing model in the test, Gemini 3 Pro, only reached an accuracy of 62.8%.
Low Success Rates: Most models evaluated failed to achieve a 60% accuracy score, indicating that current AI reasoning capabilities fall well short of the standard this benchmark sets.

In-Depth Analysis

The Introduction of General 365

The Meituan LongCat team has officially entered the AI evaluation space with the release of General 365. This benchmark is designed to address the growing need for more rigorous testing of reasoning capabilities in large language models. As AI development shifts from simple conversational tasks to complex problem-solving, the industry requires benchmarks that can accurately differentiate between surface-level pattern matching and deep logical reasoning. General 365 appears to be positioned as a "high bar" for the industry, focusing on areas where current models still struggle significantly.

Analyzing the Performance Gap

The results released alongside the benchmark provide a sobering look at the current state of artificial intelligence. By testing 26 mainstream models, the LongCat team has established a broad baseline for performance. The fact that Gemini 3 Pro, a model recognized for its advanced capabilities, managed a score of only 62.8% suggests that General 365 contains tasks that are significantly more difficult than those found in traditional benchmarks.

Furthermore, the observation that the majority of models could not reach the 60% "passing line" highlights a critical bottleneck in AI development. This failure rate suggests that while models are becoming better at generating fluent text, their reasoning is not yet robust enough to handle the specific challenges posed by General 365. It also suggests that the industry may have been overestimating the reasoning maturity of current LLMs based on older, less demanding benchmarks.

Setting a New Standard for Reasoning

By establishing a benchmark where even the "strongest" models are barely passing, Meituan is effectively recalibrating the expectations for AI performance. General 365 serves as a diagnostic tool that identifies the limits of current technology. The 60% threshold mentioned by the LongCat team acts as a symbolic barrier, separating models that possess basic reasoning competency from those that do not. This rigorous approach is essential for guiding future research and development, as it provides a clear target for engineers looking to improve the logical consistency and problem-solving depth of their models.
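To make the arithmetic behind the "passing line" concrete, here is a minimal Python sketch of how an accuracy score could be computed and compared against a 60% threshold. The function names, the sample answers, and the choice to count a score of exactly 60% as passing are assumptions made for illustration; they are not taken from the General 365 release, which may score models differently.

```python
from typing import Sequence

# The 60% "passing line" cited in the article.
PASS_THRESHOLD = 0.60


def accuracy(results: Sequence[bool]) -> float:
    """Return the fraction of questions graded as correct."""
    if not results:
        raise ValueError("no graded answers")
    return sum(results) / len(results)


def passes(score: float, threshold: float = PASS_THRESHOLD) -> bool:
    """Return True when the score meets or exceeds the threshold.

    Treating a score equal to the threshold as a pass is an assumption.
    """
    return score >= threshold


# Hypothetical graded answers, for illustration only (not benchmark data).
graded = [True, False, True, False]
score = accuracy(graded)
print(f"{score:.1%} -> {'pass' if passes(score) else 'fail'}")

# The one figure reported in the article: Gemini 3 Pro at 62.8%.
print(passes(0.628))
```

Under these assumptions, the reported 62.8% clears the line by only 2.8 percentage points, which is why the article describes the strongest model as barely passing.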

Industry Impact

The release of General 365 is likely to have a profound impact on how AI models are marketed and developed. For years, the industry has relied on benchmarks where top models frequently score in the 80% to 90% range, leading to a perception that reasoning is a "solved" problem. General 365 challenges this perception by showing that when the difficulty is increased, performance drops sharply. This will likely push AI labs to focus more on the quality of reasoning rather than just the scale of the models.

Additionally, Meituan's involvement underscores the importance of real-world application providers in the AI ecosystem. As a company that relies on AI for complex logistics and consumer services, Meituan has a vested interest in ensuring that the models it uses are truly capable of logical deduction. General 365 provides a transparent metric that both developers and enterprise users can use to assess the true utility of an AI model in high-stakes reasoning scenarios.

Frequently Asked Questions

Question: What is the General 365 benchmark?

General 365 is a new evaluation benchmark released by the Meituan LongCat team. It is specifically designed to test and measure the reasoning capabilities of mainstream large language models, providing a more rigorous standard than many existing evaluations.

Question: How did the top models perform on General 365?

According to the initial results, Gemini 3 Pro was the top performer with an accuracy rate of 62.8%. However, the vast majority of the 26 mainstream models tested failed to reach a 60% accuracy score, which is considered the passing threshold for the benchmark.

Question: Why is General 365 significant for the AI industry?

It is significant because it reveals a major gap in the reasoning abilities of current AI models. By setting a high difficulty level where most models fail to pass, it provides a more accurate and challenging metric for the next generation of AI development, moving beyond simpler benchmarks where models already achieve high scores.
