Meituan LongCat Open Sources General 365: A New Benchmark Revealing the Reasoning Limits of Modern AI
Industry News, Meituan, AI Reasoning, Benchmarking

The Meituan LongCat team has officially released General 365, a new open-source benchmark designed to evaluate the reasoning capabilities of large language models (LLMs). In an initial assessment of 26 mainstream models, the results highlight a significant gap in current AI reasoning performance. Gemini 3 Pro, currently regarded as one of the most powerful models globally, achieved an accuracy rate of only 62.8%. Furthermore, the vast majority of the models tested failed to reach the 60% threshold, which is traditionally considered a passing grade. This release by Meituan's technical team sets a rigorous new standard for the industry, emphasizing that complex reasoning remains a formidable challenge even for the most advanced artificial intelligence systems.

Meituan Technical Team

Key Takeaways

  • New Evaluation Standard: Meituan's LongCat team has open-sourced General 365, a benchmark specifically focused on general reasoning capabilities.
  • Performance Gap: Out of 26 mainstream models tested, the industry-leading Gemini 3 Pro only managed a 62.8% accuracy rate.
  • Widespread Underperformance: Most current AI models failed to reach the 60% accuracy mark on the General 365 benchmark.
  • Open Source Contribution: The release provides the AI community with a "new ruler" to measure and improve reasoning logic in large language models.

In-Depth Analysis

The Launch of General 365 and the Reasoning Challenge

The Meituan LongCat team has introduced General 365 at a critical juncture in AI development. As large language models evolve, the focus is shifting from simple information retrieval to complex logical reasoning. By open-sourcing General 365, Meituan is providing a structured framework to evaluate how models handle multi-step logic and problem-solving. The title of the release, "Setting a New Ruler for Reasoning Evaluation," suggests that existing benchmarks may not be challenging or comprehensive enough to distinguish the reasoning depth of modern LLMs. General 365 aims to fill this gap by offering a more rigorous testing ground.

Analyzing the Performance of Mainstream Models

The data released alongside General 365 provides a sobering look at the current state of artificial intelligence. The LongCat team tested 26 mainstream models, representing a broad cross-section of the industry's current capabilities. The results indicate that reasoning is still a significant hurdle. Even Gemini 3 Pro, which is described as the "strongest on Earth" (地表最强), achieved an accuracy of only 62.8%. This score, while leading the pack, means that even the top-ranked model answered roughly 37% of the reasoning tasks in the General 365 suite incorrectly.

Perhaps more telling is the performance of the remaining 25 models. The report notes that the vast majority of them did not reach the 60% "passing line." This widespread shortfall indicates that, although AI has made strides in natural language processing, consistent multi-step reasoning remains immature in most models. The results give the industry a reference point that highlights the specific areas where current LLMs fall short.
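The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration, not the LongCat team's scoring code: it assumes accuracy is the plain fraction of correctly answered items, and the function names (`accuracy_pct`, `error_rate_pct`, `passes`) are hypothetical. The only figures taken from the report are the 62.8% score and the 60% passing line.

```python
# Minimal sketch of the accuracy arithmetic discussed above.
# Assumption: accuracy = correct answers / total answers. The report
# does not describe General 365's actual grading procedure.

PASSING_LINE = 60.0  # percent; the "passing line" cited in the report


def accuracy_pct(graded: list[bool]) -> float:
    """Percentage of items marked correct in a list of graded answers."""
    if not graded:
        raise ValueError("no graded answers")
    return 100.0 * sum(graded) / len(graded)


def error_rate_pct(accuracy: float) -> float:
    """Share of items answered incorrectly, given an accuracy in percent."""
    return 100.0 - accuracy


def passes(accuracy: float, threshold: float = PASSING_LINE) -> bool:
    """True if the accuracy meets or exceeds the passing line."""
    return accuracy >= threshold


# The one per-model figure the report gives: Gemini 3 Pro at 62.8%.
reported = 62.8
print(f"error rate: {error_rate_pct(reported):.1f}%")  # error rate: 37.2%
print(f"passes 60% line: {passes(reported)}")          # passes 60% line: True
```

Under this reading, the top score clears the passing line by only 2.8 percentage points, which is why the report treats 60% as the meaningful dividing line.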

Industry Impact

Redefining Success in AI Development

The introduction of General 365 is likely to shift the industry's focus toward more rigorous reasoning benchmarks. By demonstrating that even the most advanced models like Gemini 3 Pro have significant room for improvement, Meituan is encouraging a move away from superficial performance metrics toward deeper logical consistency. This "new ruler" provides a clear target for AI researchers, emphasizing that high-quality reasoning is the next frontier for model optimization.

Encouraging Transparency through Open Source

By open-sourcing the General 365 benchmark, the Meituan LongCat team is fostering a more transparent and competitive environment. Developers can now use this tool to identify specific weaknesses in their models' reasoning chains. As more teams adopt this benchmark, it could lead to a standardized way of reporting reasoning capabilities, making it easier for the industry to track progress and for users to understand the actual limitations of the AI tools they employ.

Frequently Asked Questions

Question: What is the primary purpose of Meituan's General 365?

General 365 is an open-source benchmark released by the Meituan LongCat team specifically designed to evaluate and set a new standard for the reasoning capabilities of large language models.

Question: How did top-tier models perform on this benchmark?

In tests involving 26 mainstream models, Gemini 3 Pro achieved the highest accuracy at 62.8%. However, most other models failed to reach a 60% accuracy rate, indicating that reasoning remains a major challenge for current AI technology.

Question: Why is the 60% accuracy mark significant in this report?

The report uses the 60% mark as a metaphorical "passing line." The fact that most models failed to reach this level suggests that current AI reasoning capabilities are not yet reliable for the complex tasks in the General 365 benchmark.
