Meituan LongCat Releases General 365: A New Reasoning Benchmark Where Most AI Models Fail to Pass
Industry News | Artificial Intelligence | Benchmarking | Meituan

The Meituan LongCat team has officially open-sourced 'General 365,' a rigorous new benchmark designed to evaluate the reasoning capabilities of large language models. In an initial assessment of 26 mainstream AI models, the results highlight a significant gap in current cognitive performance. Even Gemini 3 Pro, identified as the top performer in the test, achieved an accuracy rate of only 62.8%. Furthermore, the vast majority of the models tested were unable to reach the 60% passing threshold. This release by Meituan's technology team provides a new standard for the industry, revealing that complex reasoning remains a substantial challenge for even the most advanced artificial intelligence systems currently available.

Meituan Technology Team (美团技术团队)

Key Takeaways

  • Meituan's LongCat team has officially released and open-sourced the General 365 reasoning benchmark.
  • Evaluation of 26 mainstream models reveals that most current AI systems struggle with complex reasoning tasks.
  • Gemini 3 Pro emerged as the top performer with an accuracy of 62.8%, yet this remains relatively low for a leading model.
  • The majority of tested models failed to reach a 60% accuracy score, underscoring how difficult the benchmark is.

In-Depth Analysis

The Launch of General 365

The Meituan LongCat team has introduced General 365 as a specialized tool for evaluating the reasoning depth of artificial intelligence. By open-sourcing this benchmark, the team provides the global AI community with a new metric to measure progress beyond simple linguistic fluency. The focus of General 365 is specifically on 'General' reasoning, suggesting a broad application across various logical domains. The release comes at a time when the industry is shifting its focus from model size to the quality of logical output and problem-solving efficiency.

Performance Gap in Mainstream Models

The results released alongside the benchmark provide a sobering look at the current state of AI. Out of 26 mainstream models tested, the performance was notably lower than what is typically seen on standard benchmarks. The fact that Gemini 3 Pro, currently regarded as one of the most capable models globally, only secured a 62.8% accuracy rate indicates that General 365 contains highly challenging reasoning problems. This data point serves as a critical indicator that even the 'strongest' models have significant room for improvement when faced with the specific criteria set by the LongCat team.

The 60% Passing Threshold

A striking finding from the LongCat team's report is that the vast majority of the 26 models failed to reach the 60% mark. In many academic and professional contexts, 60% is considered the minimum threshold for a 'passing' grade. The failure of most mainstream models to meet this baseline suggests that current LLM (Large Language Model) architectures may still lack the robust logical frameworks required for consistent reasoning. This gap between current performance and the 'passing' line highlights the rigorous nature of General 365 as an evaluative standard.
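For readers who want to see the arithmetic behind a "passing" score, the sketch below shows one simple way accuracy could be computed and compared against the 60% line. It is an illustration only, not the LongCat team's evaluation code: the function names, the boolean grading format, and the toy results are all assumptions, and the only figure taken from the report is the 62.8% accuracy cited for Gemini 3 Pro.

```python
# Illustrative sketch only; not the General 365 evaluation harness.
PASS_THRESHOLD = 0.60  # the 60% "passing" line discussed in the article


def accuracy(graded):
    """Return the fraction of correct answers in a list of booleans."""
    if not graded:
        raise ValueError("no graded items")
    return sum(graded) / len(graded)


def passes(acc, threshold=PASS_THRESHOLD):
    """Return True when an accuracy value meets or exceeds the threshold."""
    return acc >= threshold


# Hypothetical toy results (NOT real General 365 data).
toy_results = {
    "model_a": [True, True, True, True, False],    # 4 of 5 correct
    "model_b": [True, False, False, False, True],  # 2 of 5 correct
}

for name, graded in toy_results.items():
    acc = accuracy(graded)
    verdict = "pass" if passes(acc) else "fail"
    print(f"{name}: {acc:.1%} -> {verdict}")

# The one reported figure reused here: 62.8% for Gemini 3 Pro.
print("Gemini 3 Pro passes:", passes(0.628))
```

Under this reading, a score of 62.8% clears the threshold by a narrow margin, which matches the article's point that only the top performer sits comfortably above the line.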

Industry Impact

The introduction of General 365 by Meituan is significant for the AI industry as it establishes a more demanding yardstick for reasoning. By making the benchmark open-source, Meituan allows other developers to stress-test their models against the same 26-model baseline. This could lead to a shift in development priorities, moving away from general knowledge retrieval and toward the enhancement of internal logic and multi-step reasoning. As models strive to exceed the 62.8% mark set by Gemini 3 Pro, General 365 will likely become a key reference point for future iterations of large language models.

Frequently Asked Questions

Question: What is the significance of the 62.8% score achieved by Gemini 3 Pro?

Within the context of the General 365 benchmark, 62.8% represents the highest accuracy among 26 mainstream models. While it leads the field, the score suggests that even top-tier AI models face difficulty with the reasoning tasks included in this specific evaluation.

Question: Why did most models fail to reach the 60% mark on General 365?

General 365 is designed to be highly difficult and to target weaknesses in current AI reasoning capabilities. That most of the 26 models fell short of 60% reflects this design and sets a more demanding standard for the industry.

Question: Is General 365 available for public use?

Yes, the Meituan LongCat team has officially open-sourced General 365, allowing the broader technology community to use it for evaluating and improving AI model reasoning.
