Meituan LongCat Unveils General 365: A Rigorous New Standard for AI Reasoning Evaluation
Industry News, Meituan, AI Benchmarking, Reasoning Models

Meituan's LongCat team has officially released General 365, a new benchmark designed to evaluate the reasoning capabilities of artificial intelligence models. The initial round of testing covered 26 mainstream models and exposed how far current systems are from mastering the benchmark. The top-performing model, Gemini 3 Pro, achieved an accuracy of only 62.8%, and the vast majority of the models tested fell short of 60% accuracy, the level treated as a basic passing mark. With this release, Meituan aims to provide a more challenging and accurate measure of how well modern AI handles complex reasoning tasks.

Meituan Technical Team

Key Takeaways

  • New Benchmark Release: Meituan's LongCat team has introduced General 365, a specialized evaluation tool for AI reasoning.
  • Industry Performance Gap: Out of 26 mainstream models tested, most failed to reach a 60% accuracy rate.
  • Top Performer Results: Gemini 3 Pro leads the current rankings but only managed a score of 62.8%.
  • A New Standard: General 365 is positioned as a "new ruler" or benchmark for measuring the true reasoning depth of large language models.

In-Depth Analysis

The Challenge of General 365

With General 365, the Meituan LongCat team tested 26 mainstream models and produced a broad snapshot of the industry's reasoning capabilities. The core finding is that the majority of these models cannot reach 60% accuracy, which suggests that General 365 is considerably more demanding than existing benchmarks. The 60% "passing grade" serves as a useful reference point: most current models fall below it on complex, multi-step reasoning tasks that go beyond simple pattern matching or data retrieval.

Benchmarking the Best: Gemini 3 Pro's Performance

One of the most notable aspects of the General 365 release is the performance of Gemini 3 Pro. Despite being recognized as one of the most powerful models globally, it achieved an accuracy of 62.8%. While this score places it at the top of the 26 models tested, the narrow margin by which it cleared the 60% threshold is telling. It highlights that even the industry leaders have substantial room for improvement. The fact that the "strongest" model is only slightly above what Meituan considers a basic level of competency on this benchmark underscores the difficulty of the reasoning tasks included in General 365. This data point provides a realistic perspective on the current state of artificial intelligence, tempering expectations with hard data regarding reasoning proficiency.
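The margin described above is simple arithmetic, and a minimal sketch makes it concrete. The function names here are hypothetical, and treating a score exactly at 60% as a pass is an assumption; the article only states the 60% passing mark and the 62.8% top score.

```python
# Illustrative only: helper names are hypothetical, not part of General 365.
PASSING_THRESHOLD = 60.0  # percent; the "passing mark" cited in the article


def margin_over_threshold(accuracy_pct: float,
                          threshold_pct: float = PASSING_THRESHOLD) -> float:
    """Return how many percentage points a score sits above (+) or below (-) the threshold."""
    # Round to one decimal place to avoid floating-point noise.
    return round(accuracy_pct - threshold_pct, 1)


def passes(accuracy_pct: float,
           threshold_pct: float = PASSING_THRESHOLD) -> bool:
    """Assumption: a score equal to the threshold counts as passing."""
    return accuracy_pct >= threshold_pct


gemini_3_pro_accuracy = 62.8  # top score reported in the article
print(margin_over_threshold(gemini_3_pro_accuracy))  # 2.8
print(passes(gemini_3_pro_accuracy))                 # True
```

In other words, the leading model clears the passing mark by 2.8 percentage points.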

Redefining Evaluation Metrics

Meituan's decision to open-source General 365 signals a move toward standardized, transparent evaluation. By establishing a "new ruler" (标尺, meaning a yardstick), the LongCat team is challenging the AI community to look beyond high scores on older, possibly saturated benchmarks. The focus is on "General" reasoning, implying broad applicability across different domains. The results suggest that reasoning ability does not necessarily improve at the same rate as model size and complexity, which is why tools like General 365 are needed to pinpoint these weaknesses.

Industry Impact

General 365 is likely to influence how AI models are developed and marketed. For the AI industry, the benchmark is a wake-up call: current "state-of-the-art" models still struggle with reasoning when held to a higher standard. It shifts attention from general performance to reasoning accuracy specifically. By setting a benchmark on which most models currently fall short, Meituan has also given developers a new target. This will likely drive research aimed at closing the reasoning gap, as companies work to move their models past the 60% mark and eventually surpass the 62.8% top score set by Gemini 3 Pro.

Frequently Asked Questions

Question: What is Meituan's General 365?

General 365 is a reasoning evaluation benchmark released by Meituan's LongCat team. It is designed to test the reasoning capabilities of mainstream AI models and is positioned as a more rigorous standard for the industry.

Question: How did mainstream AI models perform on the General 365 benchmark?

In a test of 26 mainstream models, most failed to reach a 60% accuracy rate. The highest-scoring model, Gemini 3 Pro, achieved an accuracy of 62.8%, indicating that the benchmark is highly challenging for current AI technology.

Question: Why is the 60% accuracy mark significant in this report?

The report notes that most models failed to reach the 60% mark, which is often viewed as a basic "passing" threshold. This highlights a significant gap in the reasoning abilities of current large language models when faced with the General 365 evaluation criteria.

Related News

Meituan LongCat Releases General 365: A Challenging New Benchmark for AI Reasoning Evaluation
Industry News

Meituan's LongCat team has officially open-sourced General 365, a new evaluation benchmark designed to measure the reasoning capabilities of large language models (LLMs). In a comprehensive test involving 26 mainstream models, the results revealed a significant gap in current AI reasoning performance. Even the top-performing model, Gemini 3 Pro, achieved an accuracy of only 62.8%, while the vast majority of tested models failed to reach the 60% passing mark. This release aims to establish a more rigorous standard for the industry, highlighting the current limitations of even the most advanced AI systems in complex reasoning tasks. By providing a transparent and difficult metric, Meituan seeks to drive the development of more logically capable artificial intelligence.

Managing AI Coding with Agent Evaluation Thinking: Meituan's Practice in Refactoring 310,000 Lines of Code
Industry News

As AI-generated code now accounts for over 90% of development in certain environments, the primary challenge has shifted from generation speed to the effective management and constraint of AI capabilities. Meituan's technical team recently shared their experience refactoring 310,000 lines of code using a strategy centered on "Agent evaluation thinking." By implementing technical debt assessment, standardized rules, a specialized Refactoring SOP, and a Pre-PR (Pull Request) mechanism, they have successfully transformed large-scale refactoring from a high-cost, periodic project into a continuous, daily operational task. This approach ensures that AI-driven development does not amplify systemic chaos but instead adheres to unified technical standards, maintaining long-term code quality and system stability in an AI-dominated coding era.

Meituan Technical Team Releases LARYBench: A New Benchmark for Universal Latent Action Representation in Embodied AI
Industry News

The Meituan Technical Team has officially introduced LARYBench (Latent Action Representation Yielding Benchmark), a systematic evaluation framework designed to guide the learning of universal latent action representations from large-scale visual data. This benchmark marks a significant milestone in embodied AI by providing a standardized way to measure how models learn actions from visual inputs. Experimental results from the benchmark reveal that general vision models significantly outperform specialized embodied action expert models in both action generalization and control precision. Furthermore, the research demonstrates that embodied action representations can naturally emerge from large-scale human video data, suggesting that broad visual training is a viable path toward achieving more sophisticated and adaptable robotic control systems.