Meituan LongCat Team Launches General 365: A New Benchmark Revealing Critical Gaps in AI Reasoning Capabilities

The Meituan LongCat team has officially released General 365, a rigorous new benchmark designed to evaluate the reasoning capabilities of modern artificial intelligence. In an initial assessment of 26 mainstream models, the results reveal a significant performance gap across the industry. Even Gemini 3 Pro, currently identified as the most powerful model in the test, achieved an accuracy rate of only 62.8%. Furthermore, the vast majority of the models tested failed to reach the 60% threshold, which is traditionally considered a passing grade. This release by Meituan's technical team establishes a new standard for measuring logical depth in AI and highlights the substantial room for improvement in complex reasoning tasks.

Meituan Technical Team

Key Takeaways

  • New Evaluation Standard: Meituan's LongCat team has introduced General 365, a benchmark specifically focused on reasoning performance.
  • Industry-Wide Testing: The benchmark was applied to 26 mainstream AI models to provide a comprehensive overview of current capabilities.
  • Gemini 3 Pro Leads: Currently the top performer in this evaluation, Gemini 3 Pro reached an accuracy of 62.8%.
  • Widespread Failure: Most models tested were unable to achieve a 60% accuracy rate, indicating a general struggle with the benchmark's requirements.

In-Depth Analysis

The Launch of General 365 by Meituan LongCat

The Meituan LongCat team has officially entered the AI evaluation space with the release of General 365. This benchmark arrives at a time when the industry is shifting its focus from simple conversational fluency to deep logical reasoning. By testing 26 mainstream models, Meituan has provided a broad cross-section of the current state of artificial intelligence. General 365 is positioned as a "new yardstick" for the industry, suggesting that existing benchmarks may not be challenging or specific enough to differentiate the reasoning ability of top-tier models.

Analyzing the Performance Gap: The 60% Threshold

The data released alongside General 365 provides a sobering look at the current limitations of large language models. Gemini 3 Pro, the strongest of the 26 models tested, scored only 62.8%, which suggests that the General 365 benchmark is exceptionally rigorous.

Perhaps more significant is the finding that the majority of the 26 models failed to reach the 60% mark. In many academic and professional contexts, 60% represents the minimum standard for passing or basic competency. That most mainstream models fell short of this mark indicates that, while AI has made strides in many areas, complex reasoning remains a significant hurdle. This "passing line" serves as a clear indicator of the difficulty of the General 365 evaluation set and of the current ceiling on AI reasoning performance.

Meituan's Role in AI Standardization

By releasing this benchmark through its technical team, Meituan is asserting itself as a key player in the infrastructure of AI development. General 365 does not just rank models; it defines the criteria for what constitutes successful reasoning. Testing 26 different models suggests that the benchmark is meant as a general assessment of the industry's progress rather than one tailored to a specific architecture. The results suggest that the path to truly "intelligent" reasoning is still in its early stages, with even the market leaders having significant room for growth.

Industry Impact

The release of General 365 is likely to have a multi-faceted impact on the AI industry. First, it provides a transparent and difficult target for model developers. When the "strongest" model only scores 62.8%, it creates a competitive drive for other labs to optimize for these specific reasoning challenges.

Second, it shifts the narrative away from general performance toward specialized reasoning. As more companies integrate AI into complex decision-making processes, benchmarks like General 365 become essential for determining which models can actually handle logical tasks reliably. Meituan’s contribution highlights that the next frontier of AI development is not just about more data, but about higher-quality logical processing and the ability to clear the 60% "competency" hurdle.

Frequently Asked Questions

Question: What is General 365?

General 365 is a new reasoning evaluation benchmark released by the Meituan LongCat team. It is designed to test the logical and reasoning capabilities of AI models, providing a standardized metric for the industry.

Question: How did the top AI models perform on this benchmark?

According to the report from Meituan, Gemini 3 Pro is currently the strongest performer with an accuracy of 62.8%. However, most of the 26 mainstream models tested failed to reach a 60% accuracy rate.

Question: Why is the 60% score significant in the General 365 test?

The 60% score is often viewed as a basic passing grade or a threshold for competency. The fact that most mainstream models failed to reach this level underscores the high difficulty of the General 365 benchmark and the current limitations of AI reasoning.
