Meituan LongCat Team Launches General 365 Benchmark: Gemini 3 Pro Leads with 62.8% Accuracy
Industry News | Meituan | LongCat | AI Benchmarking

The Meituan LongCat team has introduced General 365, a new benchmark for evaluating the reasoning capabilities of large language models. In an assessment of 26 mainstream models, Gemini 3 Pro ranked first with an accuracy of 62.8%, while the vast majority of the tested models failed to reach 60%. The results position General 365 as a demanding measure of reasoning and underscore how limited complex reasoning remains, even in the most advanced AI systems.

Meituan Technical Team

Key Takeaways

  • New Evaluation Standard: Meituan's LongCat team has released General 365, a benchmark designed specifically to measure the reasoning ability of large language models.
  • Gemini 3 Pro Performance: Among the 26 mainstream models tested, Gemini 3 Pro ranked first with an accuracy of 62.8%.
  • Industry-Wide Challenge: Most of the tested models scored below 60% accuracy, indicating a significant gap in current reasoning capabilities.
  • Scope of Testing: The reported results cover 26 mainstream AI models evaluated on the same benchmark.

In-Depth Analysis

The Introduction of General 365 by Meituan LongCat

The Meituan LongCat team has officially unveiled General 365, a new evaluation benchmark that aims to set a more rigorous standard for the artificial intelligence industry. As large language models (LLMs) continue to proliferate, the need for precise and challenging evaluation tools has become paramount. General 365 is positioned as a "new yardstick" for reasoning, focusing on the ability of models to process complex logical tasks rather than simply retrieving information or performing basic linguistic functions. By introducing this benchmark, the Meituan technical team provides a structured framework to differentiate between models that possess genuine reasoning depth and those that do not.

Analyzing the Performance of Gemini 3 Pro

In the initial testing phase conducted by the LongCat team, 26 mainstream models were evaluated on General 365. The results identify Google's Gemini 3 Pro as the leader in reasoning accuracy, with a score of 62.8%. That figure also shows how difficult the benchmark is: while it places Gemini 3 Pro at the top of the field, it suggests that even the most advanced models have significant room to improve on complex reasoning. The 62.8% score is the high-water mark among the tested models, yet it remains far from perfect, highlighting the "frontier" nature of AI reasoning tasks.

The 60% Accuracy Threshold

Perhaps the most significant finding from the release of General 365 is the overall distribution of scores across the 26 tested models. The Meituan LongCat team noted that the vast majority of these mainstream models were unable to reach 60% accuracy. In many academic and professional contexts, 60% is treated as a "passing" grade or a minimum standard of competency. That most models fell short of this threshold suggests that complex reasoning remains a formidable obstacle for the current generation of AI. It also underscores the value of benchmarks like General 365, which can expose the limitations of models that might otherwise appear highly capable in simpler evaluation scenarios.
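The threshold comparison above comes down to simple arithmetic. The following is a minimal sketch, not the LongCat team's evaluation code: the function names are hypothetical, and apart from the Gemini 3 Pro figure, the scores are made-up placeholders rather than reported results.

```python
from typing import Mapping

PASS_THRESHOLD = 60.0  # percent; the threshold discussed in the article


def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage of correctly answered items."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def models_below(scores: Mapping[str, float],
                 threshold: float = PASS_THRESHOLD) -> list[str]:
    """Names of models whose accuracy is strictly below the threshold."""
    return sorted(name for name, score in scores.items() if score < threshold)


# Only the Gemini 3 Pro figure comes from the article; the other entries
# are hypothetical placeholders, not reported results.
example_scores = {
    "Gemini 3 Pro": 62.8,
    "placeholder-model-a": 55.0,
    "placeholder-model-b": 41.5,
}

below = models_below(example_scores)
print(f"{len(below)} of {len(example_scores)} models fall below {PASS_THRESHOLD}%")
```

A strict comparison is used here, so a model scoring exactly 60% would count as reaching the threshold; the article does not say how the team treats that boundary.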

Industry Impact

The release of General 365 is likely to influence how AI development is prioritized. By establishing a benchmark on which even the strongest model scores only 62.8%, Meituan is pushing the industry toward more difficult and more meaningful evaluations. This shift encourages developers to move beyond optimizing for simpler benchmarks and to focus instead on core logical reasoning. Furthermore, testing 26 different models on the same benchmark gives a clearer picture of where the industry stands as a whole. As models strive to surpass the 60% mark and challenge Gemini 3 Pro's lead, General 365 can serve as a tool for tracking the evolution of AI reasoning.

Frequently Asked Questions

Question: What is the primary purpose of the General 365 benchmark?

General 365 was developed by the Meituan LongCat team to serve as a new standard for evaluating the reasoning capabilities of large language models, providing a more accurate measure of AI intelligence.

Question: How many models were tested in the initial General 365 report?

The Meituan LongCat team tested 26 mainstream AI models to provide a comprehensive overview of the current state of reasoning performance in the industry.

Question: Which model currently holds the highest score on General 365?

Gemini 3 Pro currently holds the highest accuracy score on the General 365 benchmark, with a recorded accuracy of 62.8%.
