Meituan LongCat Team Unveils General 365: A Rigorous New Benchmark for Evaluating AI Reasoning Capabilities
Industry News | Meituan | AI Benchmarking | Reasoning

The Meituan LongCat team has officially released General 365, a new evaluation benchmark designed to test the reasoning limits of large language models. In an initial assessment of 26 mainstream models, even the top scorer barely cleared the benchmark's passing line: Gemini 3 Pro, described as the strongest model currently available, reached an accuracy of only 62.8%. Most of the other models fell short of the 60% passing threshold, which underscores how difficult General 365 is. With this release, Meituan aims to establish a more demanding standard for reasoning and to push the AI industry beyond general knowledge toward complex problem-solving.

Meituan Technical Team

Key Takeaways

  • New Benchmark Release: Meituan's LongCat team has launched General 365, a specialized benchmark for reasoning evaluation.
  • Industry-Wide Testing: The benchmark was used to test 26 mainstream AI models to assess their logical and reasoning performance.
  • Leading Performance: Gemini 3 Pro emerged as the top performer but only managed an accuracy rate of 62.8%.
  • High Difficulty Level: The majority of the 26 models tested failed to reach the 60% accuracy mark, which is considered the passing grade for this benchmark.

In-Depth Analysis

The Emergence of General 365 as a Reasoning Standard

The release of General 365 by the Meituan LongCat team reflects a shift in how artificial intelligence is evaluated. While many existing benchmarks focus on broad knowledge or linguistic fluency, General 365 is positioned specifically as a "new ruler" for reasoning. By focusing on the cognitive depth of models, Meituan is addressing a need in the AI community: distinguishing models that produce fluent output from those that can carry out complex, multi-step reasoning. That a major technology company such as Meituan built this benchmark suggests growing demand for standards that can accurately measure progress in high-level AI development.

Analyzing the Performance Gap: Gemini 3 Pro and the 60% Threshold

The results of the initial testing phase provide a sobering look at the current state of AI reasoning. Gemini 3 Pro, which is currently described as the strongest model available, achieved a score of 62.8%. While this places it at the top of the 26 models tested, it clears the benchmark's "passing line" of 60% by only 2.8 percentage points. This suggests that even the most advanced systems are only just beginning to master the types of reasoning tasks presented in General 365.

Furthermore, the fact that the vast majority of the 26 mainstream models fell short of the 60% mark indicates a widespread struggle with complex reasoning across the industry. General 365 is therefore not a routine test but a high-bar evaluation that exposes the limitations of current large language models (LLMs). The gap between general performance and reasoning-specific performance suggests that while models are becoming more conversational, their underlying logic and problem-solving still require significant refinement.
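The comparison against the passing line is simple arithmetic, and a minimal Python sketch may make it concrete. Only the 62.8% score and the 60% threshold come from the reported results; the function names and the two placeholder entries are hypothetical and exist purely for illustration, not as actual General 365 scores.

```python
PASSING_LINE = 60.0  # percent; the passing threshold reported for General 365


def margin_over_passing(accuracy_pct: float, passing_line: float = PASSING_LINE) -> float:
    """Percentage points above (+) or below (-) the passing line."""
    return round(accuracy_pct - passing_line, 1)


def passes(accuracy_pct: float, passing_line: float = PASSING_LINE) -> bool:
    """True if the score meets or exceeds the passing line."""
    return accuracy_pct >= passing_line


def share_passing(scores: dict[str, float], passing_line: float = PASSING_LINE) -> float:
    """Fraction of models whose score meets the passing line."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(passes(s, passing_line) for s in scores.values()) / len(scores)


# Only the Gemini 3 Pro figure is from the article.
# The other two entries are made-up placeholders, NOT real results.
scores = {
    "Gemini 3 Pro": 62.8,
    "placeholder-model-a": 55.0,
    "placeholder-model-b": 48.5,
}

if __name__ == "__main__":
    for name, score in scores.items():
        print(f"{name}: {score}% ({margin_over_passing(score):+} pts, pass={passes(score)})")
    print(f"share passing: {share_passing(scores):.0%}")
```

With the reported figure, `margin_over_passing(62.8)` gives 2.8 percentage points, which is the slim margin discussed above.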

The Significance of the 26-Model Comparison

By testing 26 different models, the Meituan LongCat team has provided a broad cross-section of the AI landscape. This scope makes it less likely that the results are an anomaly and more likely that they reflect the current technological ceiling. Because the 26 models represent the mainstream of the industry, the results offer a useful basis for understanding where the industry stands. The collective struggle to meet the 60% accuracy requirement serves as a call to action for AI researchers to prioritize reasoning ability over simple parameter scaling.

Industry Impact

The introduction of General 365 is likely to influence the AI industry in several key ways. First, it sets a new, higher standard for what constitutes "passing" in terms of reasoning. By establishing a 60% threshold that most current models cannot meet, Meituan has created a clear target for future development. This will likely encourage AI labs to focus more on the quality of reasoning rather than just the quantity of data or the size of the model.

Second, the benchmark provides a transparent look at the performance of leading models like Gemini 3 Pro in a specialized context. This transparency is vital for enterprises and developers who need to know the true capabilities of the models they are integrating into their systems. As reasoning becomes a core requirement for AI applications in fields like engineering, law, and medicine, benchmarks like General 365 will become essential tools for vetting and selecting the right technology.

Finally, Meituan's contribution to the public evaluation space reinforces the importance of independent, rigorous testing in an industry often characterized by rapid, unverified claims of "human-level" performance.

Frequently Asked Questions

Question: What is the primary purpose of the General 365 benchmark?

General 365 was developed by the Meituan LongCat team to serve as a new standard for evaluating the reasoning capabilities of large language models. It aims to provide a more rigorous and accurate measure of a model's ability to perform complex logical tasks compared to traditional benchmarks.

Question: How did the top-performing models fare on General 365?

According to the results released by Meituan, Gemini 3 Pro was the highest-performing model among the 26 tested, achieving an accuracy rate of 62.8%. However, the majority of the other mainstream models failed to reach the 60% passing mark, indicating the high difficulty of the benchmark.

Question: Why is the 60% accuracy mark significant in this context?

The 60% mark is considered the "passing line" for the General 365 benchmark. The fact that most mainstream models failed to reach this score suggests that current AI technology still has significant room for improvement in the area of complex reasoning and logical problem-solving.

Related News

Meituan Unveils AI Breakthroughs at ACL 2026: Advancing Evaluation, Reasoning, and Generative Paradigms
Industry News

Meituan's technical team has achieved a significant milestone at ACL 2026, the premier international conference for computational linguistics and natural language processing. With six papers accepted, Meituan's research spans a wide array of cutting-edge AI domains, including large-scale model evaluation, complex process reasoning, and competition-level mathematical thinking optimization. The research also delves into reinforcement learning and generative recommendation systems. These contributions are centered on establishing a new paradigm for generative AI, aiming to enhance the intelligence, reliability, and practical utility of large language models. By addressing both theoretical challenges and optimization strategies, Meituan continues to push the boundaries of how AI systems reason and interact within complex environments.

Managing AI Coding Through Agent Evaluation: A Case Study of Refactoring 310,000 Lines of Code
Industry News

The Meituan technical team has introduced a groundbreaking approach to managing AI-driven development, centered on the refactoring of 310,000 lines of code. As AI now generates over 90% of code in certain environments, the team argues that the primary challenge is no longer the speed of generation but the constraints placed upon the AI to prevent systemic chaos. By adopting 'Agent evaluation thinking,' Meituan has implemented a structured framework involving technical debt sorting, rule construction, a standardized refactoring SOP, and a Pre-PR mechanism. This strategy successfully transforms high-cost, specialized refactoring projects into sustainable, daily iterative actions, ensuring that AI-generated code remains organized, maintainable, and aligned with technical standards.

Meituan Technical Team Explores New Generation BI Architecture via Metric Platforms and Enhanced Computing Engines
Industry News

Meituan's data platform team has unveiled a transformative approach to Business Intelligence (BI) by constructing a new generation architecture centered on a unified Metric Platform. This initiative specifically targets the systemic failures of traditional BI frameworks, which often suffer from inconsistent data definitions—referred to as data caliber confusion—and degraded query performance when handling diverse, personalized datasets. By implementing two core technical pillars, "Automatic Semantics" and "Enhanced Computing," Meituan has successfully streamlined its data operations. This shift ensures that business logic is centralized and computational efficiency is maximized, providing a robust foundation for high-concurrency and high-precision data analysis across the organization's expansive ecosystem.