Meituan LongCat Open Sources General 365: A New Benchmark Revealing the Reasoning Limits of Modern AI
Industry News | Meituan | AI Reasoning | Benchmarking


The Meituan LongCat team has officially released General 365, a new open-source benchmark designed to evaluate the reasoning capabilities of large language models (LLMs). In an initial assessment of 26 mainstream models, the results highlight a significant gap in current AI reasoning performance. Gemini 3 Pro, currently regarded as one of the most powerful models globally, achieved an accuracy rate of only 62.8%. Furthermore, the vast majority of the models tested failed to reach the 60% threshold, which is traditionally considered a passing grade. This release by Meituan's technical team sets a rigorous new standard for the industry, emphasizing that complex reasoning remains a formidable challenge even for the most advanced artificial intelligence systems.

Meituan Technical Team

Key Takeaways

  • New Evaluation Standard: Meituan's LongCat team has open-sourced General 365, a benchmark specifically focused on general reasoning capabilities.
  • Performance Gap: Out of 26 mainstream models tested, the industry-leading Gemini 3 Pro only managed a 62.8% accuracy rate.
  • Widespread Underperformance: Most current AI models failed to reach the 60% accuracy mark on the General 365 benchmark.
  • Open Source Contribution: The release provides the AI community with a "new ruler" to measure and improve reasoning logic in large language models.

In-Depth Analysis

The Launch of General 365 and the Reasoning Challenge

The Meituan LongCat team has introduced General 365 as the focus of large language model development shifts from simple information retrieval to complex logical reasoning. By open-sourcing General 365, Meituan is providing a structured framework for evaluating how models handle multi-step logic and problem-solving. The title of the release, "Setting a New Ruler for Reasoning Evaluation," suggests that existing benchmarks may not be challenging or comprehensive enough to distinguish the reasoning depth of modern LLMs. General 365 aims to fill this gap by offering a more rigorous testing ground.

Analyzing the Performance of Mainstream Models

The data released alongside General 365 gives a sobering picture of the current state of AI reasoning. The LongCat team tested 26 mainstream models, a broad cross-section of the industry's current capabilities, and the results indicate that reasoning is still a significant hurdle. Even Gemini 3 Pro, which the report describes as the "strongest on Earth" (地表最强), achieved an accuracy of only 62.8%. That score leads the pack, yet it means that even the top-ranked model failed 37.2% of the reasoning tasks in the General 365 suite.

Perhaps more telling is the performance of the remaining 25 models. The report notes that the vast majority of them did not even reach the 60% "passing line." This widespread shortfall indicates that, while AI has made strides in natural language processing, consistent multi-step reasoning remains out of reach for most current models. The results give the industry a baseline for measuring how far current LLMs fall short on reasoning.
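The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration, not the LongCat team's evaluation code: the function names and the 3-out-of-5 example are hypothetical, and the only figures taken from the report are the 62.8% accuracy and the 60% passing line.

```python
# Illustrative sketch only; names are hypothetical, not from General 365.
PASSING_LINE = 60.0  # percent, the "passing line" cited in the report


def accuracy_percent(correct: int, total: int) -> float:
    """Return the share of correctly answered items as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def error_percent(accuracy: float) -> float:
    """Return the share of items that were not answered correctly."""
    return 100.0 - accuracy


def passes(accuracy: float, threshold: float = PASSING_LINE) -> bool:
    """Check whether an accuracy value reaches the passing line."""
    return accuracy >= threshold


# Reported figure for Gemini 3 Pro.
gemini_accuracy = 62.8
print(round(error_percent(gemini_accuracy), 1))  # share of tasks failed
print(passes(gemini_accuracy))                   # clears the 60% line?

# Hypothetical model that answers 3 of 5 items correctly.
print(accuracy_percent(3, 5))
```

Under these assumptions, a model at 62.8% clears the passing line by less than three percentage points, which is why the article treats even the top score as evidence of a reasoning gap.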

Industry Impact

Redefining Success in AI Development

The introduction of General 365 is likely to shift the industry's focus toward more rigorous reasoning benchmarks. By demonstrating that even the most advanced models like Gemini 3 Pro have significant room for improvement, Meituan is encouraging a move away from superficial performance metrics toward deeper logical consistency. This "new ruler" provides a clear target for AI researchers, emphasizing that high-quality reasoning is the next frontier for model optimization.

Encouraging Transparency through Open Source

By open-sourcing the General 365 benchmark, the Meituan LongCat team is fostering a more transparent and competitive environment. Developers can now use this tool to identify specific weaknesses in their models' reasoning chains. As more teams adopt this benchmark, it could lead to a standardized way of reporting reasoning capabilities, making it easier for the industry to track progress and for users to understand the actual limitations of the AI tools they employ.

Frequently Asked Questions

Question: What is the primary purpose of Meituan's General 365?

General 365 is an open-source benchmark released by the Meituan LongCat team. It is designed to evaluate the reasoning capabilities of large language models and to set a more rigorous standard for that evaluation.

Question: How did top-tier models perform on this benchmark?

In tests involving 26 mainstream models, Gemini 3 Pro achieved the highest accuracy at 62.8%. However, most other models failed to reach a 60% accuracy rate, indicating that reasoning remains a major challenge for current AI technology.

Question: Why is the 60% accuracy mark significant in this report?

The report uses the 60% mark as a metaphorical "passing line." The fact that most models failed to reach it suggests that current AI reasoning is not yet reliable on the complex tasks that make up the General 365 benchmark.
