Meituan LongCat Releases General 365: A New Benchmark for AI Reasoning Evaluation
Industry News | Meituan | AI Benchmarking | Reasoning Models

Meituan's LongCat team has released General 365, a benchmark designed to evaluate the reasoning capabilities of large language models. In a test of 26 mainstream models, even the top performer, Gemini 3 Pro, reached an accuracy of only 62.8%, and the vast majority of the others fell short of 60%, the passing mark for this evaluation. The results set a demanding standard for AI development and indicate that complex reasoning remains a major hurdle even for the most advanced systems currently available.

Meituan Technical Team

Key Takeaways

  • Meituan's LongCat team has introduced General 365, a new standard for evaluating AI reasoning.
  • Evaluation of 26 mainstream models shows that most current AI systems struggle with complex reasoning tasks.
  • Gemini 3 Pro emerged as the top performer but only achieved a 62.8% accuracy rate.
  • The majority of tested models failed to reach the 60% 'passing' threshold, indicating a significant industry-wide challenge.

In-Depth Analysis

Establishing a New Benchmark for Reasoning

The LongCat team at Meituan has released General 365, a benchmark built to measure how deeply modern AI models can reason. Traditional benchmarks often fail to capture the nuances of logical deduction and complex problem-solving in large language models (LLMs). General 365 aims to fill this gap with a more stringent scale for assessing how models handle intricate reasoning scenarios. Meituan is positioning it as a tool that helps developers find the limits of their models beyond simple pattern recognition or data retrieval.

The Performance Gap in Mainstream Models

The initial evaluation covered 26 of the industry's most prominent AI models, and the results indicate that high-level reasoning remains out of reach for most of them. Gemini 3 Pro, regarded as one of the most powerful models available, led the evaluation with an accuracy of 62.8%. That score, although the highest in the group, suggests that even state-of-the-art models have significant room for improvement on the challenges General 365 poses.

Perhaps more striking, the vast majority of the 26 models did not reach 60%, which this benchmark treats as the baseline for a 'passing' grade. That shortfall points to a widespread weakness in reasoning across the current AI ecosystem: models are becoming more conversational and versatile, but their ability to reason consistently and logically is not yet mature.
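
The pass/fail reading above comes down to simple arithmetic. As a minimal sketch, accuracy is the fraction of items answered correctly, and a model passes when that fraction reaches 60%. The function names and the example counts are hypothetical, chosen for illustration; they are not taken from Meituan's evaluation code or data, and treating exactly 60% as a pass is an assumption.

```python
PASS_THRESHOLD = 0.60  # the 'passing' line reported for General 365


def accuracy(correct: int, total: int) -> float:
    """Fraction of benchmark items answered correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    return correct / total


def passes(acc: float, threshold: float = PASS_THRESHOLD) -> bool:
    """True when accuracy meets or exceeds the passing line (assumed inclusive)."""
    return acc >= threshold


# Reported top score from the article: 62.8% clears the line, narrowly.
print(passes(0.628))  # True

# Hypothetical counts for illustration only: 55 correct out of 100 items.
print(passes(accuracy(55, 100)))  # False
```

Seen this way, the top reported score sits only 2.8 percentage points above the passing line.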

Industry Impact

The release of General 365 by Meituan's LongCat team could influence how AI models are developed and marketed. A benchmark on which even the strongest model barely passes pushes attention from the quantity of data toward the quality of reasoning. It serves as a reality check for the industry, offering a metric intended to separate models that simulate intelligence from those that can reason through problems. As developers work to improve their General 365 scores, more research is likely to focus on strengthening the logical reasoning of future AI systems.

Frequently Asked Questions

Question: What is the primary purpose of Meituan's General 365?

General 365 is designed to be a new benchmark for evaluating the reasoning capabilities of AI models, providing a more difficult and accurate measure of logical performance than previous standards.

Question: How did the top AI models perform on the General 365 test?

Out of 26 mainstream models tested, Gemini 3 Pro performed the best with a 62.8% accuracy rate. However, most other models failed to reach the 60% passing threshold.

Question: Why is the 60% score significant in this benchmark?

The 60% mark is considered the 'passing line' for General 365. The fact that most models failed to reach it highlights that current AI technology still faces major hurdles in mastering complex reasoning tasks.
