Meituan LongCat Team Launches General 365: A Challenging New Benchmark for AI Reasoning
Industry News, Meituan, LongCat, AI Benchmarking

The Meituan LongCat team has officially released General 365, a new benchmark designed to evaluate the reasoning capabilities of large language models. In a comprehensive assessment of 26 mainstream models, the results highlight a significant gap in current AI reasoning performance. Gemini 3 Pro, currently regarded as one of the most capable models, achieved a top score of only 62.8%. Most other models failed to reach the 60% accuracy threshold, which the team identifies as the 'passing mark.' This release establishes a more rigorous standard for the industry, suggesting that complex reasoning remains a major hurdle for even the most advanced artificial intelligence systems.

Meituan Technology Team

Key Takeaways

  • New Evaluation Standard: Meituan's LongCat team has introduced General 365 as a specialized benchmark for testing AI reasoning.
  • Industry-Wide Performance Gap: Out of 26 mainstream models tested, the vast majority failed to reach a 60% accuracy rate.
  • Leading Model Performance: Gemini 3 Pro emerged as the top performer but only managed a score of 62.8%.
  • Raising the Bar: General 365 is positioned as a 'new ruler' that exposes the limitations of current large language models in complex reasoning tasks.

In-Depth Analysis

The Emergence of General 365 as a New Benchmark

The Meituan LongCat team has officially introduced General 365, a benchmark that aims to redefine how the industry measures reasoning in artificial intelligence. By positioning this tool as a 'new ruler' (新标尺), the team suggests that existing evaluation methods may not sufficiently challenge the current generation of large language models. The release comes as the AI industry shifts its focus from simple generative tasks to complex logical reasoning, which calls for more rigorous and precise measurement tools. The name suggests a focus on broad, general-purpose reasoning, although the announcement summarized here does not explain it.

Analyzing the Performance of 26 Mainstream Models

The initial data released alongside General 365 provides a sobering look at the current state of AI. The LongCat team tested 26 mainstream models to verify the benchmark's difficulty and the models' relative strengths. The majority of these models could not reach the 60% accuracy mark, which the team treats as the 'passing line' (及格线). Such a widespread shortfall suggests that General 365 targets reasoning weaknesses that are common across the industry rather than specific to a few models.

Gemini 3 Pro and the Ceiling of Current Reasoning

Even the most advanced models currently available find the General 365 benchmark a significant challenge. Gemini 3 Pro, which the report describes as the strongest model available ('地表最强', literally 'strongest on Earth'), achieved an accuracy rate of 62.8%. This score places it at the top of the pack, yet it sits only 2.8 percentage points above the 60% passing line. That narrow margin for a leading model underscores the difficulty of the General 365 tasks. It also suggests that flagship models still struggle to achieve high accuracy on complex reasoning, pointing toward a need for further breakthroughs in how AI handles logical sequences and complex problem-solving.
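The margin described above is simple arithmetic, and a minimal sketch makes it concrete. Only the 62.8% score and the 60% passing line come from the article; the function and variable names below are hypothetical, and no other model scores are assumed.

```python
# Illustrative sketch only. The 60% passing line and the 62.8% score are
# taken from the article; all names here are hypothetical.
PASSING_LINE = 60.0  # percent, the 'passing mark' cited by the LongCat team


def margin_over_passing_line(accuracy_pct: float,
                             passing_line: float = PASSING_LINE) -> float:
    """Return percentage points above (+) or below (-) the passing line."""
    # Round to one decimal to avoid floating-point noise in the difference.
    return round(accuracy_pct - passing_line, 1)


def passes(accuracy_pct: float, passing_line: float = PASSING_LINE) -> bool:
    """Return True when a score meets or exceeds the passing line."""
    return accuracy_pct >= passing_line


gemini_3_pro = 62.8  # reported top score
print(margin_over_passing_line(gemini_3_pro))  # 2.8
print(passes(gemini_3_pro))                    # True
```

The same two helpers could be applied to any other reported score once per-model results are available.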

Industry Impact

The introduction of General 365 by Meituan's LongCat team could influence how AI models are developed and marketed. By establishing a benchmark on which the 'strongest' model scores only 62.8%, the LongCat team has raised the bar for AI researchers. This creates a competitive landscape in which simply passing the 60% threshold becomes a meaningful objective for developers. The benchmark also exposes the limitations of current technology, encouraging the industry to move beyond surface-level performance and focus on the deep reasoning capabilities required for more sophisticated, real-world applications. If more teams adopt General 365 as a standard, it could lead to more honest and rigorous AI evaluation.

Frequently Asked Questions

Question: What is the significance of the 60% score in the General 365 benchmark?

In the context of the General 365 release, the 60% mark is described as the 'passing line' (及格线). The fact that most of the 26 mainstream models failed to reach this score indicates that the benchmark is exceptionally difficult and that current AI reasoning capabilities are still in a relatively early stage of development.

Question: How did the top-performing model fare on this new test?

Gemini 3 Pro was identified as the top-performing model among those tested, yet it only achieved an accuracy rate of 62.8%. This suggests that even the most advanced AI models currently on the market have significant room for improvement when it comes to the specific reasoning challenges posed by General 365.

Question: Who developed the General 365 benchmark?

General 365 was developed and released by the Meituan LongCat team. It was introduced through the Meituan Technology Team's official channels as a new standard for evaluating the reasoning performance of large language models.
