Meituan LongCat Unveils General 365: A Rigorous New Benchmark for AI Reasoning Capabilities
Industry News | Meituan | AI Benchmarking | LongCat

Meituan's LongCat team has officially launched General 365, a new evaluation benchmark designed to set a higher standard for measuring AI reasoning. In a test of 26 mainstream models, even the top performer, Gemini 3 Pro, achieved only a 62.8% accuracy rate, while the vast majority of tested models failed to reach the 60% threshold. The release highlights how far large language models still are from high-accuracy reasoning, and it gives the industry a new diagnostic tool for measuring progress beyond linguistic fluency.

Source: Meituan Technical Team (美团技术团队)

Key Takeaways

  • Meituan's LongCat team has officially released the General 365 benchmark to evaluate AI reasoning capabilities.
  • In a rigorous test of 26 mainstream models, Gemini 3 Pro emerged as the top performer with an accuracy rate of 62.8%.
  • The majority of models tested failed to reach the 60% accuracy mark, which the LongCat team describes as the benchmark's passing line.
  • General 365 establishes a new, more difficult standard for measuring the logical and reasoning progress of large language models.

In-Depth Analysis

The Introduction of General 365 by Meituan LongCat

The Meituan LongCat team has introduced General 365, a benchmark specifically designed to push the boundaries of how AI reasoning is measured. At a time when many models report high scores on standard tests, General 365 is intended to provide a more challenging assessment of a model's logical processing. By focusing on reasoning, Meituan is addressing a critical gap in the current AI landscape: the difference between fluent language and correct reasoning.

The release of this benchmark by a major technical team like Meituan signifies a shift toward more rigorous, industry-led evaluation standards. As AI moves from general-purpose assistants to specialized tools requiring high reliability, the "yardstick" used to measure them must become more demanding. General 365 is positioned as that new standard, challenging the current generation of models to prove their worth in complex scenarios. The benchmark serves as a diagnostic tool that identifies where current architectures are succeeding and, more importantly, where they are failing to meet basic logical requirements.

Analyzing the Performance Gap: The 60% Barrier

The initial results released alongside General 365 provide a sobering look at the state of modern artificial intelligence. Across the 26 mainstream models evaluated, scores were notably lower than the figures often cited in vendor marketing materials. Gemini 3 Pro, currently regarded as one of the most powerful models globally, achieved an accuracy rate of 62.8%. That put it at the top of the leaderboard, but only 2.8 percentage points above the 60% passing line, which shows that even the best-performing model barely clears a basic level of proficiency on this benchmark.

Perhaps more telling is the performance of the rest of the field. The Meituan LongCat team reported that the vast majority of the 26 models failed to reach the 60% "passing" mark. This suggests that for most current large language models, the reasoning tasks presented in General 365 represent a significant difficulty spike. The fact that so many models "failed" to hit the 60% threshold indicates that current AI development may be hitting a plateau in reasoning, or that General 365 has successfully identified a specific type of logic that current architectures struggle to master. This data point serves as a critical reality check for the industry, emphasizing that there is still a long way to go before AI can consistently handle complex reasoning tasks with high reliability.
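The comparison against the passing line is simple arithmetic, and a minimal sketch makes the margin concrete. The function names below are hypothetical and chosen only for illustration. The only figures used are the two reported in the article: the 60% passing line and Gemini 3 Pro's 62.8% accuracy. This is not the LongCat team's scoring code.

```python
# Passing line in percent, as reported for General 365 in this article.
PASS_LINE = 60.0


def margin_over_pass_line(accuracy_pct: float, pass_line: float = PASS_LINE) -> float:
    """Percentage points above (+) or below (-) the passing line.

    Hypothetical helper; rounds to one decimal to avoid float noise.
    """
    return round(accuracy_pct - pass_line, 1)


def passes(accuracy_pct: float, pass_line: float = PASS_LINE) -> bool:
    """True if the accuracy meets or exceeds the passing line."""
    return accuracy_pct >= pass_line


gemini_3_pro_accuracy = 62.8  # the only per-model score reported in the article

print(margin_over_pass_line(gemini_3_pro_accuracy))  # 2.8
print(passes(gemini_3_pro_accuracy))                 # True
```

Under these assumptions, the top model clears the line by 2.8 percentage points. Any model scoring below 60.0 would return a negative margin and `False`.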

Industry Impact

The launch of General 365 is likely to influence how AI research is prioritized and evaluated. By exposing the limitations of even the most advanced models, such as Gemini 3 Pro, the benchmark pushes the industry to look beyond simple parameter scaling and toward architectural improvements that enhance logical reasoning. It also gives developers and enterprises a common metric for comparing models on reasoning accuracy.

As companies look to integrate AI into critical business processes, benchmarks like General 365 offer a realistic expectation of performance. The realization that most models cannot yet reliably pass a 60% accuracy threshold in complex reasoning will likely lead to a more cautious and focused approach to AI deployment in sectors where precision is paramount. Furthermore, Meituan's contribution to the open-source and research community with this benchmark encourages a more competitive and transparent environment for model development, where the focus shifts from "chatting" to "thinking."

Frequently Asked Questions

Question: What is the General 365 benchmark released by Meituan?

General 365 is a new evaluation benchmark developed by Meituan's LongCat team specifically designed to test the reasoning capabilities of large language models. It aims to set a higher and more rigorous standard for AI performance measurement than existing benchmarks.

Question: Which model performed the best on the General 365 benchmark?

According to the initial results released by the Meituan technical team, Gemini 3 Pro was the top-performing model among the 26 mainstream models tested, achieving an accuracy rate of 62.8%.

Question: How did the majority of AI models fare in the General 365 evaluation?

The majority of the 26 mainstream models tested failed to reach an accuracy rate of 60%. This 60% mark is described by the Meituan LongCat team as the "passing line" for the benchmark, indicating that most current models struggle with the reasoning tasks it presents.
