Meituan LongCat Team Launches General 365: A Rigorous New Benchmark for AI Reasoning Evaluation
Industry News | Meituan | AI Benchmarking | Reasoning

The Meituan LongCat team has officially released General 365, a new benchmark designed to evaluate the reasoning capabilities of large language models (LLMs). In an initial assessment of 26 mainstream models, Gemini 3 Pro, currently regarded as one of the most advanced models, posted the highest accuracy, and even that was only 62.8%. More strikingly, the vast majority of the models tested fell short of 60% accuracy, the mark traditionally considered a passing grade. The release sets a more demanding standard for measuring AI reasoning and shows that current models still face substantial challenges in complex logical tasks.

Meituan Technical Team (美团技术团队)

Key Takeaways

  • New Evaluation Standard: Meituan's LongCat team has introduced General 365, specifically designed to test the reasoning limits of AI models.
  • Industry-Wide Testing: The benchmark was applied to 26 mainstream models to provide a comprehensive overview of current AI capabilities.
  • Performance Ceiling: Gemini 3 Pro emerged as the top performer but only managed an accuracy rate of 62.8%.
  • Reasoning Deficit: Most tested models failed to achieve a 60% score, indicating a widespread struggle with the reasoning tasks presented in General 365.

In-Depth Analysis

The Introduction of General 365

The Meituan LongCat team has officially open-sourced General 365, positioning it as a new yardstick for the evaluation of artificial intelligence. Unlike traditional benchmarks that may focus on general knowledge or linguistic fluency, General 365 appears to target the core cognitive function of reasoning. By releasing this tool, the LongCat team provides the developer community with a rigorous framework to identify the strengths and weaknesses of various large language models (LLMs) in logical processing.

The decision to open-source this benchmark suggests a move toward greater transparency and standardization in how AI progress is measured. As models become more sophisticated, the industry requires more difficult and nuanced testing environments to differentiate between superficial pattern matching and genuine logical reasoning.

Benchmarking the Leaders: Gemini 3 Pro and Beyond

In the initial testing phase conducted by the LongCat team, 26 mainstream models were evaluated. The results offer a sobering look at the current state of AI development. Gemini 3 Pro, the top scorer and widely regarded as one of the most advanced models available, reached an accuracy of 62.8%. That figure represents the leading edge of current technology, yet it still leaves more than a third of the benchmark unanswered or answered incorrectly.

Performance drops off below the top of the ranking. Most of the 26 models could not reach 60% accuracy, a level often treated as the minimum standard for competency, which suggests that General 365 is a highly challenging benchmark. The gap underscores the difficulty of the reasoning tasks in the set and indicates that many current LLMs still struggle with complex, multi-step logical requirements.
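To make the arithmetic behind these figures concrete, the following is a minimal sketch of how an accuracy score and a 60% pass line might be computed. It is not the LongCat team's evaluation code. The function names and the two placeholder models are hypothetical; only the 62.8% figure for Gemini 3 Pro comes from the reported results.

```python
from typing import Dict, List

# The 60% "passing grade" referred to in the article.
PASS_THRESHOLD = 0.60


def accuracy(correct: int, total: int) -> float:
    """Return the fraction of benchmark items answered correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


def models_below_threshold(
    scores: Dict[str, float], threshold: float = PASS_THRESHOLD
) -> List[str]:
    """Return the names of models scoring below the threshold, sorted by name."""
    return sorted(name for name, acc in scores.items() if acc < threshold)


# Only the Gemini 3 Pro value is taken from the article. The other two
# entries are made-up placeholders used purely to exercise the function.
scores = {
    "Gemini 3 Pro": 0.628,
    "placeholder-model-a": 0.55,
    "placeholder-model-b": 0.41,
}

failing = models_below_threshold(scores)
print(failing)
```

Under this reading, a model "passes" only if its accuracy is at least 0.60, so Gemini 3 Pro clears the line by a narrow margin while the placeholder entries do not.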

Industry Impact

The release of General 365 is significant for the AI industry as it shifts the focus from simple performance metrics to deep reasoning capabilities. By setting a benchmark where even the most advanced models score near the 60% mark, Meituan is effectively raising the bar for what constitutes a "high-performing" model. This encourages AI researchers and developers to move beyond optimizing for existing, potentially saturated benchmarks and instead focus on the fundamental challenges of machine reasoning.

Furthermore, the benchmark serves as a reality check for the industry. While marketing for AI models often emphasizes human-like capabilities, the General 365 results demonstrate that there is still a long way to go before AI can consistently master complex reasoning tasks. This new standard will likely drive a new wave of innovation focused on cognitive depth rather than just model size or data volume.

Frequently Asked Questions

Question: What is General 365?

General 365 is a new reasoning evaluation benchmark released by Meituan's LongCat team. It is designed to provide a rigorous standard for testing the logical reasoning capabilities of large language models.

Question: How did mainstream models perform on this benchmark?

In a test of 26 mainstream models, the performance was generally low. Gemini 3 Pro led the group with a 62.8% accuracy rate, but the majority of models failed to reach a 60% score.

Question: Why is the 60% score significant in this context?

The 60% mark is often viewed as a basic passing grade or a threshold for competency. The fact that most models fell below this line indicates that General 365 is a particularly difficult test that exposes the reasoning limitations of current AI technology.
