Meituan LongCat Releases General 365 Reasoning Benchmark as Leading AI Models Struggle to Pass
Industry News, Meituan, Artificial Intelligence, Benchmarking

The Meituan LongCat team has launched General 365, a new benchmark designed to evaluate the reasoning capabilities of large language models (LLMs). In a test of 26 mainstream AI models, even Gemini 3 Pro, regarded as one of the most capable models currently available, reached an accuracy of only 62.8%, and most of the tested models fell short of 60%, the level conventionally treated as a passing grade. The release from Meituan's technology team sets a demanding new standard for AI reasoning and indicates that current frontier models still struggle with complex logical tasks.

Meituan Technology Team

Key Takeaways

  • Launch of General 365: Meituan's LongCat team has introduced a new evaluation standard specifically focused on reasoning capabilities.
  • Widespread Performance Gap: Out of 26 mainstream models tested, most failed to achieve a 60% accuracy rate.
  • Gemini 3 Pro Results: The industry-leading Gemini 3 Pro secured the top spot but only managed a 62.8% accuracy score.
  • New Industry Benchmark: General 365 is positioned as a high-bar metric that exposes the limitations of current large language models in complex reasoning scenarios.

In-Depth Analysis

The Challenge of General 365

With General 365, the Meituan LongCat team offers a harder yardstick for measuring AI reasoning. Tested against 26 of the most prominent models on the market, the benchmark gives a sobering picture of the state of AI development. That most of these models could not surpass the 60% mark suggests General 365 is designed to test depth and logical consistency rather than simple pattern matching. This level of difficulty helps separate genuinely capable reasoning systems from those that rely on surface-level heuristics.

Benchmarking the Frontier: Gemini 3 Pro

One of the most significant findings from the Meituan report is the performance of Gemini 3 Pro. Despite its reputation as a leading model in the global AI landscape, its accuracy on the General 365 benchmark was limited to 62.8%. That score placed it first among the 26 models tested, but it cleared the 60% threshold by only 2.8 percentage points, which indicates that even the most advanced systems have considerable room for improvement. This data point underscores the rigor of the General 365 evaluation framework and suggests that the "reasoning" capabilities of modern LLMs are still at a relatively early stage when subjected to such stringent criteria.

The 60% Threshold and Model Failure Rates

The report highlights a concerning trend: the 60% "passing line" remains out of reach for most of the models tested. With 26 mainstream models under review, the failure of the majority to reach this threshold suggests a systemic challenge in current model architectures or training methodologies. Meituan's findings imply that while models are becoming more conversational and versatile, their ability to handle the logical complexity demanded by General 365 remains a significant bottleneck. This points to a direction for future research: more robust reasoning frameworks.
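The threshold arithmetic behind these figures can be sketched in a few lines of Python. This is an illustrative sketch only, not the LongCat team's evaluation code: the function names, the model names, and the question counts below are hypothetical, and the report as summarized here does not state how many questions the benchmark contains. Only the 60% passing line comes from the report.

```python
PASSING_LINE = 0.60  # the 60% "passing line" cited in the report


def accuracy(correct: int, total: int) -> float:
    """Fraction of questions answered correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


def margin_in_points(acc: float, threshold: float = PASSING_LINE) -> float:
    """Signed distance from the threshold, in percentage points."""
    return round((acc - threshold) * 100, 1)


def passes(acc: float, threshold: float = PASSING_LINE) -> bool:
    """True when the accuracy meets or exceeds the threshold."""
    return acc >= threshold


# Hypothetical counts, chosen only to illustrate the calculation.
hypothetical_results = {
    "model_a": (628, 1000),
    "model_b": (550, 1000),
}

for name, (correct, total) in hypothetical_results.items():
    acc = accuracy(correct, total)
    print(name, f"{acc:.1%}", margin_in_points(acc), passes(acc))
```

Reporting the margin in percentage points, rather than only a pass or fail flag, makes it clear how close a model sits to the line, which is the point the article makes about the top score.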

Industry Impact

The release of General 365 by Meituan is likely to influence the AI research community. By establishing a benchmark on which even the strongest models struggle, Meituan has effectively raised the bar for what counts as "advanced" reasoning. This encourages developers to move beyond traditional benchmarks that may have become saturated or prone to data contamination.

Furthermore, the transparency of these results, which show most mainstream models falling below 60% accuracy, provides a realistic baseline for enterprise expectations. As companies look to integrate AI into complex decision-making processes, benchmarks like General 365 offer a more accurate reflection of a model's reliability in high-stakes reasoning tasks. This will likely drive a new wave of optimization focused on the logical gaps identified by the LongCat team.

Frequently Asked Questions

Question: What is the primary focus of the General 365 benchmark?

General 365 is an open-source benchmark released by the Meituan LongCat team specifically designed to evaluate and set a new standard for the reasoning capabilities of large language models.

Question: How did the top-performing models fare on this benchmark?

According to the test results of 26 mainstream models, Gemini 3 Pro was the top performer with an accuracy of 62.8%. However, the majority of the other models tested failed to reach the 60% accuracy mark.

Question: Why is the 60% accuracy mark significant in this report?

The 60% mark is described as the "passing line." The fact that most mainstream models failed to reach this level highlights the extreme difficulty of the General 365 benchmark and the current limitations of AI reasoning.
