Meituan LongCat Releases General 365: A New Benchmark for AI Reasoning Evaluation
Industry News | Meituan | AI Benchmarking | Reasoning Models


Meituan's LongCat team has officially launched General 365, a rigorous new benchmark designed to evaluate the reasoning capabilities of large language models. In a test of 26 mainstream models, even the top performer, Gemini 3 Pro, reached an accuracy of only 62.8%, and the vast majority of the models fell short of 60%, the score this evaluation treats as the passing mark. The release sets a demanding new standard for AI development and shows that complex reasoning remains a major hurdle for even the most advanced systems currently available.

Meituan Technical Team

Key Takeaways

  • Meituan's LongCat team has introduced General 365, a new standard for evaluating AI reasoning.
  • Evaluation of 26 mainstream models shows that most current AI systems struggle with complex reasoning tasks.
  • Gemini 3 Pro emerged as the top performer but only achieved a 62.8% accuracy rate.
  • The majority of tested models failed to reach the 60% 'passing' threshold, indicating a significant industry-wide challenge.

In-Depth Analysis

Establishing a New Benchmark for Reasoning

The LongCat team at Meituan has officially released General 365, a benchmark specifically engineered to measure the reasoning depth of modern artificial intelligence. In the rapidly evolving landscape of large language models (LLMs), traditional benchmarks often fail to capture the nuances of logical deduction and complex problem-solving. General 365 aims to fill this gap by providing a more stringent and accurate scale for assessing how models handle intricate reasoning scenarios. By focusing on these high-level cognitive tasks, Meituan is positioning General 365 as a critical tool for developers to identify the true limits of their models beyond simple pattern recognition or data retrieval.

The Performance Gap in Mainstream Models

The initial testing phase of General 365 involved 26 of the most prominent AI models in the industry today. The results were telling, revealing that high-level reasoning remains an elusive goal for most AI architectures. Gemini 3 Pro, which is currently recognized as one of the most powerful models globally, led the evaluation but only managed to secure a 62.8% accuracy rate. This score, while the highest among the group, suggests that even the industry's 'state-of-the-art' models have significant room for improvement when faced with the specific challenges posed by the General 365 benchmark.

Perhaps more striking is the fact that the vast majority of the 26 models tested were unable to reach the 60% mark. In the context of this benchmark, 60% is treated as the baseline for a 'passing' grade. The failure of most mainstream models to meet this basic requirement underscores a widespread deficiency in reasoning capabilities across the current AI ecosystem. This data suggests that while models are becoming more conversational and versatile, their ability to perform consistent, logical reasoning is not yet at a mature level.
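As a rough illustration of the arithmetic behind these figures, the Python sketch below checks a score against the 60% passing mark. The function names and the constant are hypothetical and are not part of General 365 or any Meituan tooling; only the 62.8% and 60% figures come from the reported results.

```python
# Hypothetical helper names; only the 60% and 62.8% figures come from the article.
PASSING_THRESHOLD = 60.0  # percent, the 'passing' mark reported for General 365


def accuracy_percent(num_correct: int, num_questions: int) -> float:
    """Share of questions answered correctly, expressed as a percentage."""
    if num_questions <= 0:
        raise ValueError("num_questions must be positive")
    return 100.0 * num_correct / num_questions


def passes(score_percent: float, threshold: float = PASSING_THRESHOLD) -> bool:
    """True when the score meets or exceeds the passing mark."""
    return score_percent >= threshold


def margin_over_passing(score_percent: float,
                        threshold: float = PASSING_THRESHOLD) -> float:
    """Percentage points above (positive) or below (negative) the passing mark."""
    return round(score_percent - threshold, 1)


top_score = 62.8  # Gemini 3 Pro's reported accuracy
print(passes(top_score))               # True
print(margin_over_passing(top_score))  # 2.8
```

Read this way, the leading model clears the passing mark by only 2.8 percentage points, which is why the article describes the result as a narrow pass.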

Industry Impact

The release of General 365 by Meituan's LongCat team is likely to influence how AI models are developed and marketed. By setting a benchmark on which even the strongest model only narrowly passes, Meituan is encouraging a shift in focus from the quantity of data to the quality of reasoning. The benchmark serves as a reality check for the industry, offering a metric intended to separate models that merely appear intelligent from those that can reason through problems reliably. As developers work to improve their scores on General 365, a new wave of research aimed at strengthening the logical reasoning of future AI systems can be expected.

Frequently Asked Questions

Question: What is the primary purpose of Meituan's General 365?

General 365 is designed to be a new benchmark for evaluating the reasoning capabilities of AI models, providing a more difficult and accurate measure of logical performance than previous standards.

Question: How did the top AI models perform on the General 365 test?

Out of 26 mainstream models tested, Gemini 3 Pro performed the best with a 62.8% accuracy rate. However, most other models failed to reach the 60% passing threshold.

Question: Why is the 60% score significant in this benchmark?

The 60% mark is considered the 'passing line' for General 365. The fact that most models failed to reach it highlights that current AI technology still faces major hurdles in mastering complex reasoning tasks.

Related News

Meituan Showcases AI Innovations at ACL 2026: Advancing LLM Evaluation, Reasoning, and Generative Recommendations
Industry News


The Meituan technical team has achieved significant recognition at the ACL 2026 conference, with six papers accepted into this premier international forum for computational linguistics and natural language processing. These research contributions span critical frontiers in the AI landscape, including large language model (LLM) capability evaluation, complex process reasoning, and the optimization of competition-level mathematical thinking. Additionally, the papers explore advancements in reinforcement learning and the evolution of generative recommendation systems. By addressing these diverse technical directions, Meituan is actively shaping a new paradigm for generative AI, focusing on bridging the gap between theoretical research and practical industrial applications. This selection of papers highlights Meituan's commitment to enhancing model intelligence and reasoning capabilities to solve sophisticated real-world problems.

Managing AI-Driven Development: Meituan’s Strategy for Refactoring 310,000 Lines of Code Using Agent Evaluation Logic
Industry News


Meituan's technical team has shared a comprehensive analysis of their experience refactoring 310,000 lines of code in an environment where over 90% of code is AI-generated. The core insight is that while AI significantly accelerates code production, it can also amplify technical debt and systemic chaos without proper constraints. To mitigate this, the team adopted an 'Agent evaluation' mindset to manage AI coding. By implementing a framework consisting of technical debt sorting, rule construction, standardized operating procedures (SOPs), and a Pre-PR (Pull Request) mechanism, they successfully transformed large-scale refactoring from a high-cost, specialized effort into a continuous, daily iterative process. This approach ensures that AI remains a productive tool rather than a source of unmanaged complexity.

Meituan BI Evolution: Leveraging Metric Platforms and Enhanced Computing for Data Consistency and Performance
Industry News


Meituan's data platform team has introduced a next-generation Business Intelligence (BI) architecture centered on a unified metric platform. This strategic shift addresses critical challenges inherent in traditional BI models, specifically the data definition discrepancies and poor query performance resulting from fragmented, personalized datasets. By integrating "automatic semantics" and "enhanced computing," Meituan has developed a system that streamlines data interpretation and accelerates processing. This evolution represents a significant step in ensuring data accuracy and operational efficiency within large-scale data environments, providing a robust framework for metric-driven decision-making and solving the long-standing issue of inconsistent data definitions across the organization.