Meituan LongCat Unveils General 365: A Rigorous New Standard for AI Reasoning Evaluation
Industry News, Meituan, AI Benchmarking, Reasoning Models

Meituan's LongCat team has released General 365, a new benchmark designed to evaluate the reasoning capabilities of AI models. The initial evaluation covered 26 mainstream models. The top performer, Gemini 3 Pro, reached an accuracy of only 62.8%, and the vast majority of the models tested fell short of 60%, the level commonly treated as a basic passing mark. Meituan positions the release as a more demanding and more accurate measure of how well modern AI handles complex reasoning tasks; the results indicate that even the most advanced systems struggle with it.

Meituan Technical Team (美团技术团队)

Key Takeaways

  • New Benchmark Release: Meituan's LongCat team has introduced General 365, a specialized evaluation tool for AI reasoning.
  • Industry Performance Gap: Out of 26 mainstream models tested, most failed to reach a 60% accuracy rate.
  • Top Performer Results: Gemini 3 Pro leads the current rankings but only managed a score of 62.8%.
  • A New Standard: General 365 is positioned as a "new ruler" or benchmark for measuring the true reasoning depth of large language models.

In-Depth Analysis

The Challenge of General 365

The release of General 365 by the Meituan LongCat team gives the industry a new reference point for AI benchmarking. By testing 26 prominent models, the team has produced a broad snapshot of current reasoning capabilities. The core finding, that most of these models cannot reach 60% accuracy, suggests that General 365 is considerably more demanding than existing benchmarks. The 60% "passing grade" is the key indicator here: so few models reach it that current AI development may be plateauing on complex, multi-step reasoning tasks that go beyond simple pattern matching or data retrieval.

Benchmarking the Best: Gemini 3 Pro's Performance

One of the most notable results in the General 365 release is the performance of Gemini 3 Pro. Despite being regarded as one of the most powerful models available, it achieved an accuracy of 62.8%. That score puts it first among the 26 models tested, but it clears the 60% threshold by only 2.8 percentage points. Even the industry leader therefore has substantial room for improvement. That the "strongest" model sits only slightly above what Meituan considers a basic level of competency underscores how difficult the reasoning tasks in General 365 are, and it gives a more sober picture of current reasoning proficiency than headline scores on other benchmarks.
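The arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the rule that a score exactly at the threshold counts as passing is an assumption, and the only figures used are the 60% passing mark and the 62.8% score reported in the article.

```python
# Illustrative sketch only; not part of General 365 or any Meituan tooling.
PASS_THRESHOLD = 60.0  # percent, the "passing mark" cited in the article


def margin_over_threshold(accuracy_pct: float,
                          threshold_pct: float = PASS_THRESHOLD) -> float:
    """Percentage points above (+) or below (-) the threshold."""
    return round(accuracy_pct - threshold_pct, 1)


def passes(accuracy_pct: float,
           threshold_pct: float = PASS_THRESHOLD) -> bool:
    """Assumption: a score exactly at the threshold counts as passing."""
    return accuracy_pct >= threshold_pct


gemini_3_pro = 62.8  # accuracy reported in the article

print(passes(gemini_3_pro))                 # True
print(margin_over_threshold(gemini_3_pro))  # 2.8
print(round(100 - gemini_3_pro, 1))         # 37.2 (percent of items missed)
```

Read this way, the top score sits 2.8 points above the passing mark while still missing 37.2% of the benchmark's items.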

Redefining Evaluation Metrics

Meituan's decision to release General 365 openly (the source refers to it as "Open General 365") signals a move toward standardized, transparent evaluation. By establishing a "new ruler" (标尺), the LongCat team is challenging the AI community to look beyond high scores on older, possibly saturated benchmarks. The "General" in the name points to reasoning that applies broadly across domains. The results suggest that reasoning ability does not necessarily improve at the same rate as model size and complexity, which is why tools like General 365 are needed to expose these specific weaknesses.

Industry Impact

The introduction of General 365 is likely to influence how AI models are developed and marketed. For the AI industry, the benchmark is a wake-up call: current "state-of-the-art" models still struggle with fundamental reasoning when held to a higher standard. It shifts attention from overall performance to reasoning accuracy specifically. By setting a benchmark on which most models currently fall short, Meituan has also created a new target for developers. This will likely drive research aimed at closing the reasoning gap, as companies try to move their models past the 60% mark and eventually beat the 62.8% top score set by Gemini 3 Pro.

Frequently Asked Questions

Question: What is Meituan's General 365?

General 365 is a reasoning evaluation benchmark released by Meituan's LongCat team. It is designed to test the reasoning capabilities of mainstream AI models and is positioned as a rigorous new standard for the industry.

Question: How did mainstream AI models perform on the General 365 benchmark?

In a test of 26 mainstream models, most failed to reach a 60% accuracy rate. The highest-scoring model, Gemini 3 Pro, achieved an accuracy of 62.8%, indicating that the benchmark is highly challenging for current AI technology.

Question: Why is the 60% accuracy mark significant in this report?

The report notes that most models failed to reach the 60% mark, which is often viewed as a basic "passing" threshold. This highlights a significant gap in the reasoning abilities of current large language models when faced with the General 365 evaluation criteria.

Related News

Managing AI Coding with Agent Evaluation Logic: Insights from a 310,000-Line Code Refactoring Practice
Industry News

As AI-generated code begins to comprise over 90% of modern systems, the technical challenge shifts from speed to governance. Meituan's technical team has shared a comprehensive framework for managing AI coding based on their experience refactoring 310,000 lines of code. The core of their approach involves using an 'Agent evaluation' mindset to prevent AI from amplifying system chaos. By implementing technical debt sorting, rule construction, standardized operating procedures (SOPs), and a Pre-PR mechanism, the team successfully transitioned large-scale refactoring from a high-cost, specialized project into a sustainable, daily iterative process. This shift emphasizes that the ultimate trajectory of a system is determined by the constraints placed on AI rather than the speed of code generation.

LongCat Powers OpenClaw with Efficiency Engine: Boosting Automation Performance by 30% via Official API
Industry News

The LongCat team has officially introduced a stable and compliant free API for OpenClaw, aimed at significantly enhancing the efficiency of automated tasks. By providing a direct official channel, LongCat addresses the inherent risks associated with third-party subscriptions, such as account security vulnerabilities and service instability. This new efficiency engine allows developers to optimize their automation workflows, potentially increasing speed by 30%. The initiative by the Meituan Technical Team emphasizes the importance of using official, secure pathways to maintain the integrity of developer tools and ensure consistent service performance in complex automation environments.

Meituan Data Platform Revolutionizes BI Architecture with Metric-Centric Design and Enhanced Computing Capabilities
Industry News

Meituan's technical team has unveiled a new generation of Business Intelligence (BI) architecture centered on a dedicated metric platform. Its two core capabilities, automatic semantics and enhanced computing, address long-standing problems in traditional BI systems: inconsistent metric definitions (数据口径, the statistical scope or "caliber" of a metric) and degraded query performance caused by fragmented, personalized datasets. The shift aims to unify data logic and improve computational efficiency so that business decisions rest on accurate, fast data analysis. It marks a move from traditional dataset-driven models to a metric-driven framework within Meituan's data ecosystem, targeting the core pain points of data chaos and slow response times in large-scale enterprise environments.