Meituan LongCat Team Launches General 365: A New Benchmark Revealing AI Reasoning Limitations
Industry News | Meituan | AI Benchmarking | Reasoning

The Meituan LongCat team has officially released General 365, a new evaluation benchmark specifically designed to measure the reasoning capabilities of large language models. In an extensive test involving 26 mainstream models, the benchmark has highlighted a significant performance gap in the current AI landscape. According to the results, Gemini 3 Pro emerged as the top performer but only managed an accuracy rate of 62.8%. Strikingly, the vast majority of the tested models failed to reach the 60% threshold, which is typically considered a passing grade. This development suggests that while AI has made strides in general tasks, complex reasoning remains a formidable challenge for even the most advanced systems currently available on the market.

Meituan Technical Team

Key Takeaways

  • Meituan's LongCat team has introduced General 365, a rigorous new benchmark for evaluating AI reasoning.
  • A comprehensive test of 26 mainstream models shows that most current AI systems struggle with complex reasoning tasks.
  • Gemini 3 Pro recorded the highest accuracy at 62.8%, setting the current ceiling for the benchmark.
  • The majority of tested models failed to achieve a 60% accuracy score, indicating a widespread "reasoning gap" in the industry.

In-Depth Analysis

The Debut of General 365

The Meituan LongCat team has officially entered the AI evaluation space with the release of General 365. This benchmark is positioned as a new yardstick for measuring the logical and reasoning depth of large language models (LLMs). As the AI industry moves beyond basic text generation and information retrieval, the ability to perform multi-step reasoning and maintain logical consistency has become the new frontier. General 365 aims to provide a standardized metric to quantify these high-level cognitive abilities, offering a clearer picture of how models perform when faced with complex problem-solving scenarios.

Benchmarking the Giants: Gemini 3 Pro and Others

To establish the baseline for General 365, the LongCat team conducted empirical tests on 26 of the most prominent mainstream models currently available. The results serve as a reality check for the state of artificial intelligence. Gemini 3 Pro, which is widely regarded as one of the most capable models in the world, achieved an accuracy rate of 62.8%. While this score placed it at the top of the leaderboard, the figure itself suggests that even the industry's leading models have significant room for improvement. The fact that the highest score sits just above 60% underscores the difficulty of the General 365 evaluation criteria.

The 60% Threshold: A Critical Performance Gap

One of the most significant findings from the Meituan LongCat report is that most models failed to reach the 60% accuracy mark. In many academic and professional contexts, 60% is treated as the minimum threshold for competency, or a "passing grade." That the majority of mainstream models could not reach this level on General 365 points to a clear weakness in current AI development. It suggests that while models are becoming increasingly proficient at producing fluent, human-like language, their reasoning capabilities are not yet robust enough to handle the complexities presented by this new benchmark. This gap between linguistic fluency and logical reasoning is a primary hurdle that the next generation of AI models will need to overcome.
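The arithmetic behind this comparison is simple and can be sketched in a few lines. The snippet below is a minimal illustration, not the LongCat team's scoring code: the model names and counts are hypothetical placeholders rather than General 365 results, and the function names (`accuracy`, `summarize`) are assumptions made for this example. It computes each model's accuracy as correct answers divided by total questions, then reports which models clear the 60% mark.

```python
# Illustrative sketch only: placeholder data, NOT actual General 365 results.

PASS_THRESHOLD = 0.60  # the 60% "passing grade" discussed in the article


def accuracy(correct: int, total: int) -> float:
    """Return the fraction of questions answered correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


def summarize(results: dict[str, tuple[int, int]]) -> dict[str, object]:
    """Score each model and compare it against the pass threshold.

    `results` maps a model name to (correct_answers, total_questions).
    """
    scores = {name: accuracy(c, t) for name, (c, t) in results.items()}
    passing = sorted(name for name, s in scores.items() if s >= PASS_THRESHOLD)
    best = max(scores, key=scores.get)
    share_below = (len(scores) - len(passing)) / len(scores)
    return {
        "scores": scores,
        "passing": passing,
        "best": best,
        "share_below": share_below,
    }


# Hypothetical models and counts, chosen only to exercise the code.
toy_results = {
    "model_a": (7, 10),
    "model_b": (5, 10),
    "model_c": (4, 10),
    "model_d": (3, 10),
}

summary = summarize(toy_results)
print("Top model:", summary["best"])
print("Models at or above threshold:", summary["passing"])
print(f"Share below threshold: {summary['share_below']:.0%}")
```

With real per-model counts in place of `toy_results`, the same summary would show how many of the 26 tested models fall below the threshold.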

Industry Impact

The release of General 365 by Meituan's LongCat team is expected to influence how AI models are developed and marketed. By providing a benchmark on which even top-tier models like Gemini 3 Pro score only slightly above 60%, Meituan is pushing the industry toward a more rigorous evaluation standard. This may encourage AI researchers to shift their focus from increasing parameter counts to improving the quality of machine reasoning. General 365 also gives enterprises a reference point for judging which models can handle sophisticated logic-based tasks, potentially influencing future investment and adoption decisions across the tech sector.

Frequently Asked Questions

Question: What is the primary purpose of the General 365 benchmark?

General 365 was developed by the Meituan LongCat team to specifically evaluate and set a new standard for the reasoning capabilities of large language models, moving beyond general performance metrics.

Question: How did the top-performing model fare on General 365?

Gemini 3 Pro was the highest-scoring model among the 26 tested, achieving an accuracy rate of 62.8%. Even so, that score sits only slightly above the 60% mark, which the vast majority of the other tested models failed to reach.

Question: What does the failure of most models to reach 60% accuracy signify?

It indicates that complex reasoning remains a major weakness for the majority of mainstream AI models. The results suggest that current AI technology still faces substantial challenges in performing logical tasks that meet a basic threshold of competency as defined by the General 365 benchmark.

Related News

Meituan LongCat Team Open-Sources WBench: The First Systematic Multi-Round Benchmark for Interactive Video World Models
Industry News

The Meituan LongCat team has officially introduced and open-sourced WBench, a pioneering evaluation framework designed to test the limits of interactive video world models. Positioned as the first systematic multi-round benchmark in its category, WBench functions as a diagnostic tool (likened to a "CT scanner") to identify specific technical hurdles as AI transitions from passive video generation to active, interactive environmental simulation. By focusing on the boundaries between "passive viewing" and "active interaction," WBench provides a rigorous methodology for assessing how models maintain consistency across complex, multi-step scenarios. This open-source contribution aims to standardize the evaluation of world models, offering insights into their performance in diverse settings ranging from lunar landscapes to futuristic urban environments.

Meituan's Breakthroughs at ACL 2026: Redefining Generative Paradigms through Evaluation and Reasoning Optimization
Industry News

Meituan's technical team has achieved a significant milestone at ACL 2026, the premier international conference for computational linguistics and natural language processing. With six papers accepted, Meituan's research spans critical frontiers including large model evaluation, complex process reasoning, competition-level mathematical thinking optimization, reinforcement learning, and generative recommendation systems. These contributions highlight a strategic shift toward building a new generation of AI paradigms that emphasize both the robustness of model assessment and the depth of logical reasoning. By addressing high-level challenges such as mathematical problem-solving and the evolution of recommendation engines, Meituan is bridging the gap between theoretical academic research and practical industrial application, setting a new standard for generative AI development.

Managing AI Coding with Agent Evaluation Logic: Lessons from a 310,000-Line AI Refactoring Project
Industry News

As AI-generated code accounts for over 90% of system development, the primary challenge has shifted from production speed to the effective constraint of AI capabilities. Without unified standards, AI risks exponentially increasing system chaos. This analysis explores the practice of the Meituan technical team in refactoring 310,000 lines of code by applying Agent evaluation logic to AI coding management. By implementing a structured framework consisting of technical debt sorting, rule construction, Refactoring Standard Operating Procedures (SOPs), and Pre-PR mechanisms, the team successfully transformed high-cost refactoring into a continuous, iterative daily process. This approach ensures that AI-driven development remains orderly and sustainable, preventing the accumulation of unmanaged technical debt while maintaining high code quality across large-scale systems.