Meituan LongCat Open-Sources General 365: A Rigorous New Benchmark for AI Reasoning Performance
Industry News · Meituan · AI Benchmarking · Reasoning Models

Meituan's LongCat team has released General 365, a new open-source benchmark designed to evaluate the reasoning capabilities of large language models (LLMs). Its debut has drawn attention across the AI community by exposing a significant performance gap in current models. In a test of 26 mainstream models, even the industry-leading Gemini 3 Pro reached an accuracy of only 62.8%, and the vast majority of the models tested fell short of the 60% threshold that is typically considered a passing grade. This release by the Meituan Technical Team sets a more challenging standard for AI reasoning and suggests that current models still face substantial hurdles in complex cognitive tasks.

Meituan Technical Team

Key Takeaways

  • New Evaluation Standard: Meituan's LongCat team has launched General 365, an open-source benchmark specifically focused on AI reasoning.
  • Gemini 3 Pro Performance: Gemini 3 Pro, the top performer in the test and a model widely regarded as among the strongest available, achieved an accuracy of 62.8% on the benchmark.
  • Widespread Failure to Pass: Most of the 26 mainstream models tested failed to reach a 60% accuracy score, highlighting a significant deficiency in current reasoning capabilities.
  • Industry Benchmark Shift: General 365 aims to set a more rigorous bar for evaluating how models handle complex reasoning compared to existing metrics.

In-Depth Analysis

The Launch of General 365 and the Reasoning Crisis

The release of General 365 by Meituan's LongCat team marks a pivotal moment in the evolution of AI benchmarking. For years, the industry has relied on a variety of metrics to measure the progress of large language models. However, as models become more sophisticated, many existing benchmarks have begun to suffer from saturation, where top-tier models achieve near-perfect scores, making it difficult to distinguish true reasoning ability from pattern matching or data memorization.

General 365 addresses this by introducing a framework that appears significantly more demanding than its predecessors. By open-sourcing this tool, Meituan is providing the global developer community with a "reality check." The initial data provided by the LongCat team suggests that the industry is currently facing a "reasoning crisis." When 26 of the most prominent models are put to the test and the majority cannot even secure a 60% accuracy rate, it indicates that the path toward true artificial general intelligence (AGI) is still fraught with fundamental challenges in logical processing and multi-step reasoning.

Analyzing the Performance of Gemini 3 Pro

The most telling data point from the General 365 release is the performance of Gemini 3 Pro. As a model widely recognized as one of the most capable in the world, its score of 62.8% marks the current "ceiling" of AI reasoning on this benchmark. While 62.8% is the top of the class in this specific evaluation, it is a modest figure in absolute terms.

This score suggests that even the most advanced architectures are struggling with the specific types of reasoning tasks curated in General 365. The fact that the "strongest" model is only slightly above the 60% mark implies that General 365 is designed to expose the edge cases and complex logical dependencies where current LLMs typically fail. It shifts the narrative from how well models can generate text to how accurately they can navigate complex problem-solving environments. For researchers, the 62.8% mark is not just a score; it is a target that defines the current frontier of the industry.

The 60% Threshold: A New Baseline for AI Maturity

Perhaps the most alarming revelation from the LongCat team's report is that the vast majority of mainstream models failed to reach the 60% accuracy threshold. In many academic and professional contexts, 60% is the baseline for a passing grade. The failure of most models to reach this level on General 365 suggests that many current AI solutions may be less reliable in high-stakes reasoning scenarios than previously thought.

This widespread underperformance highlights a potential over-optimization of models for conversational fluency at the expense of deep reasoning. As Meituan sets this new "ruler" for the industry, it forces a re-evaluation of what constitutes a "capable" model. If a model can write poetry but cannot pass a basic reasoning threshold on General 365, its utility in technical, legal, or scientific fields may be limited. The benchmark effectively separates models that are merely good at language from those that possess genuine analytical depth.
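
To make the arithmetic behind the pass mark concrete, here is a minimal sketch of how an accuracy score can be computed and compared against the 60% threshold. It is not the LongCat team's evaluation code: the model labels, the item counts, and the helper names (`Result`, `passes`, `summarize`) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# The 60% "passing grade" mentioned in the article.
PASS_THRESHOLD = 0.60


@dataclass
class Result:
    model: str    # hypothetical model label, not a real leaderboard entry
    correct: int  # number of benchmark items answered correctly
    total: int    # number of benchmark items attempted

    @property
    def accuracy(self) -> float:
        """Fraction of items answered correctly."""
        if self.total <= 0:
            raise ValueError("total must be positive")
        return self.correct / self.total


def passes(result: Result, threshold: float = PASS_THRESHOLD) -> bool:
    """True if the model's accuracy meets or exceeds the threshold."""
    return result.accuracy >= threshold


def summarize(results):
    """Rank results by accuracy (best first) and list models below the threshold."""
    ranked = sorted(results, key=lambda r: r.accuracy, reverse=True)
    below = [r.model for r in ranked if not passes(r)]
    return ranked, below


# Hypothetical counts for illustration only; these are not General 365 data.
results = [
    Result("model_b", 571, 1000),
    Result("model_a", 628, 1000),
    Result("model_c", 455, 1000),
]

ranked, below = summarize(results)
for r in ranked:
    status = "pass" if passes(r) else "below threshold"
    print(f"{r.model}: {r.accuracy:.1%} ({status})")
print(f"{len(below)} of {len(results)} models fall below {PASS_THRESHOLD:.0%}")
```

With these made-up counts, only the first-ranked model clears the threshold, which mirrors the shape of the reported result (one leader slightly above 60%, most models below it) without reproducing any actual scores.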

Industry Impact

The introduction of General 365 is likely to influence the AI industry in several key ways. First, it provides a transparent, open-source metric that discourages "benchmark gaming," as the difficulty level is high enough to reveal true performance variances. Second, it places pressure on major AI labs to improve the logical consistency of their models rather than just increasing parameter counts or training data volume.

Furthermore, Meituan's decision to open-source the benchmark allows smaller research teams to align their development with industry-leading standards. By showing that even the best model scores only slightly above 60%, General 365 defines the next phase of the AI arms race: the quest for robust, reliable reasoning. This may lead to a shift in training methodologies, with greater emphasis on synthetic reasoning data and on reinforcement learning from human feedback (RLHF) focused on logical accuracy.

Frequently Asked Questions

Question: What is General 365?

General 365 is an open-source reasoning evaluation benchmark released by the Meituan LongCat team. It is designed to test the complex reasoning capabilities of mainstream large language models through a rigorous set of evaluations.

Question: How did the top AI models perform on this benchmark?

According to the report, Gemini 3 Pro, the top performer among the models tested, achieved an accuracy of 62.8%. However, the majority of the 26 mainstream models tested failed to reach an accuracy of 60%.

Question: Why is the 60% score significant in this context?

The 60% score is significant because it is often viewed as the minimum threshold for a "passing" grade. The fact that most models failed to reach this mark suggests that current AI technology still has a long way to go in mastering complex reasoning tasks.
