Meituan LongCat Launches General 365: New Reasoning Benchmark Reveals AI Performance Gaps
Industry News | AI Benchmarking | Meituan | Reasoning Models

Meituan's LongCat team has officially released General 365, a new evaluation benchmark specifically designed to measure the reasoning capabilities of large language models. In a comprehensive assessment of 26 mainstream AI models, the benchmark revealed a significant struggle across the industry to handle complex reasoning tasks. According to the results, Gemini 3 Pro emerged as the top performer but only managed an accuracy rate of 62.8%. Most notably, the vast majority of the models tested failed to reach the 60% accuracy threshold, which is considered the passing mark. This release by Meituan's technical team establishes a more rigorous standard for AI evaluation, highlighting that even the most advanced models currently available face substantial challenges in logical reasoning.

Meituan Technical Team

Key Takeaways

  • Meituan's LongCat team has introduced General 365, a specialized benchmark for evaluating AI reasoning.
  • Testing of 26 mainstream models shows that reasoning remains a significant challenge for current AI technology.
  • Gemini 3 Pro recorded the highest accuracy at 62.8%, yet this remains relatively low for a top-tier model.
  • The majority of tested models failed to achieve a 60% accuracy rate, falling below the benchmark's passing line.

In-Depth Analysis

The Introduction of General 365 by Meituan LongCat

The Meituan LongCat team has officially entered the AI evaluation space with the release of General 365. This benchmark is positioned as a new "ruler" or standard for measuring the reasoning capabilities of large language models (LLMs). By focusing specifically on reasoning, Meituan aims to provide a more nuanced understanding of how models process complex logic rather than just retrieving information. The launch of General 365 comes at a time when the industry is seeking more rigorous ways to differentiate between models that can simulate conversation and those that can truly perform logical deduction.

Analyzing the Performance of Mainstream Models

The initial data released alongside General 365 provides a sobering look at the current state of artificial intelligence. The LongCat team conducted practical tests on 26 of the most prominent models in the industry. The results indicate a widespread inability to master the reasoning tasks presented in the General 365 suite.

Even Gemini 3 Pro, which the report describes as the "strongest on Earth" model, achieved an accuracy rate of only 62.8%. This figure is particularly telling because it represents the ceiling of current performance within this specific testing framework. Perhaps more significant is the finding that the "passing line" of 60% was out of reach for the vast majority of the 26 models tested. This suggests that while AI has made strides in natural language processing, the leap to consistent, high-level reasoning is still in progress.
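To make the arithmetic behind the "passing line" concrete, here is a minimal Python sketch. It assumes accuracy is simply the share of benchmark items answered correctly and that a model passes when its accuracy meets or exceeds 60%; the article does not describe General 365's actual scoring procedure, so the function names and the pass rule are illustrative assumptions. The only figures taken from the report are the 60% threshold and Gemini 3 Pro's 62.8%.

```python
PASSING_LINE = 0.60  # passing threshold reported in the article


def accuracy(num_correct: int, num_total: int) -> float:
    """Fraction of items answered correctly (assumed scoring rule)."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return num_correct / num_total


def passes(acc: float, passing_line: float = PASSING_LINE) -> bool:
    """True when accuracy meets or exceeds the passing line (assumed rule)."""
    return acc >= passing_line


# The only score the article reports: Gemini 3 Pro at 62.8%.
gemini_3_pro_acc = 0.628
margin_points = (gemini_3_pro_acc - PASSING_LINE) * 100

print(passes(gemini_3_pro_acc))   # whether the top model clears the line
print(round(margin_points, 1))    # margin over the line, in percentage points
```

Under these assumptions, the top-scoring model clears the passing line by fewer than three percentage points, which is why the article treats 62.8% as a narrow pass rather than a comfortable one.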

Industry Impact

Setting a New Standard for AI Reasoning

The release of General 365 is significant for the AI industry as it shifts the focus from general performance to specific reasoning depth. By establishing a benchmark where even the leading models struggle to pass a 60% threshold, Meituan is challenging developers to move beyond superficial improvements. This "new ruler" provides a clear metric for progress, forcing a shift toward solving more complex cognitive tasks.

Identifying the Reasoning Bottleneck

The fact that 26 mainstream models were tested and most failed to reach the 60% passing line highlights a critical bottleneck in AI development. The data suggests that current training methodologies may be reaching a plateau in terms of logical reasoning. For the industry, this serves as a call to action to refine how models are taught to think and reason, rather than just how they are taught to predict the next word in a sequence. Gemini 3 Pro's 62.8% sets a baseline that other developers will now aim to surpass, potentially accelerating the next wave of reasoning-focused AI research.

Frequently Asked Questions

What is the General 365 benchmark?

General 365 is a reasoning evaluation benchmark released by Meituan's LongCat team. It is designed to test the logical and reasoning capabilities of large language models through a series of rigorous assessments.

How did Gemini 3 Pro perform on this benchmark?

Gemini 3 Pro was the highest-scoring model among the 26 mainstream models tested, achieving an accuracy rate of 62.8%. While it was the top performer, its score highlights the difficulty of the General 365 reasoning tasks.

Why did most models fail the General 365 test?

According to the findings from the Meituan LongCat team, the majority of the 26 mainstream models tested could not reach the 60% accuracy mark. This indicates that complex reasoning remains a major weakness for most current AI models, regardless of their general popularity or performance in other areas.

Related News

Meituan Launches LongCat-2.0: A 1.6 Trillion Parameter Model Trained on Domestic Computing Clusters
Industry News

Meituan's technology team has officially announced the release of LongCat-2.0, a massive 1.6 trillion parameter model. This release marks a significant milestone as the industry's first model of this scale to complete its entire training and inference lifecycle on a domestic computing cluster consisting of 50,000 cards. LongCat-2.0 was pre-trained from scratch and features a dynamic activation architecture, with an average of 48B parameters active during operation. Designed with a native 1 million (1M) token ultra-long context window, the model is specifically optimized for Agentic Coding tasks. Its core objective is to provide superior stability and efficiency in code understanding, generation, and execution, addressing the complex needs of modern software development environments.

Meituan Technical Team Presents Selected Academic Research at ICML 2026
Industry News

The Meituan Technical Team has announced its participation in the International Conference on Machine Learning (ICML) 2026, showcasing a selection of academic papers. As one of the most influential international academic conferences in the field, ICML serves as a premier platform for discussing the critical challenges and core issues facing the future of machine learning. Meituan's involvement highlights its commitment to contributing to frontier research that possesses both significant theoretical value and practical impact. By engaging with this global community, the Meituan Technical Team aims to help drive the development of the field and influence future research directions through the evaluation and dissemination of high-impact research results.

Meituan Technical Team Showcases Cutting-Edge AI Research in Search and Recommendation at Top Global Conferences
Industry News

Meituan's Business R&D Platform/Search & Recommendation ASX (Agentic System X) team has recently shared insights from their latest research published at premier AI conferences. Focusing on the development of an Agent technology system powered by Large Language Models (LLMs), the team has made significant strides in LLM post-training, Agentic Reinforcement Learning, and multi-modal understanding. With dozens of papers accepted by prestigious venues such as ICLR, NeurIPS, CVPR, and AAAI, Meituan is positioning itself at the forefront of AI innovation. This special feature highlights six selected papers that demonstrate the team's commitment to advancing search and recommendation technologies through sophisticated agentic systems and multi-modal integration, providing valuable insights for the broader AI research community.