Meituan LongCat Releases General 365: A New Rigorous Benchmark for AI Reasoning Evaluation
Industry News | Meituan | AI Benchmarking | Reasoning

The Meituan LongCat team has released General 365, a benchmark designed to evaluate the reasoning capabilities of large language models. In an initial assessment of 26 mainstream AI models, the top scorer, Gemini 3 Pro, reached an accuracy of only 62.8%, and the vast majority of tested models fell short of 60%, the mark commonly treated as a basic passing grade. The results indicate that complex reasoning remains a formidable challenge even for the most advanced systems currently available.

Meituan Technical Team (美团技术团队)

Key Takeaways

  • Launch of General 365: Meituan's LongCat team has introduced a new evaluation standard focused on AI reasoning.
  • Industry-Wide Testing: The benchmark was applied to 26 mainstream models to assess the current state of the industry.
  • Gemini 3 Pro Performance: As the top-performing model in the test, Gemini 3 Pro reached an accuracy of 62.8%.
  • The 60% Threshold: Most models evaluated failed to achieve a 60% accuracy score, indicating a widespread struggle with complex reasoning tasks.

In-Depth Analysis

The Launch of General 365 by Meituan LongCat

With General 365, the Meituan LongCat team offers a new "yardstick" for reasoning, an area where many large language models (LLMs) still face significant hurdles. By focusing on reasoning rather than knowledge retrieval or linguistic fluency, the benchmark aims to give a closer look at the logical processing capabilities of modern AI. Its release by a major technology company's research team suggests a growing need for specialized tools that can distinguish surface-level performance from genuine reasoning ability.

Performance Disparity Among Mainstream Models

In the inaugural round of testing, the LongCat team evaluated 26 mainstream models, and the results are sobering. Gemini 3 Pro, the top performer and a model often described as the "strongest on earth," reached an accuracy of only 62.8%. That score puts it at the head of the field, but it also shows how much room for improvement remains in complex reasoning. When the leading model clears the 60% mark by less than three percentage points, it suggests that the tasks in General 365 are built to push models close to the limits of their reasoning ability.

The Reasoning Performance Gap

Perhaps the most striking finding from the Meituan LongCat report is that the vast majority of models failed to reach the 60% accuracy threshold. In many academic and professional settings, 60% is considered the minimum passing grade. The failure of most mainstream models to meet this benchmark indicates a systemic gap in the reasoning capabilities of current AI architectures. This data suggests that while AI has made massive strides in natural language processing and creative generation, the ability to maintain logical consistency and solve complex multi-step reasoning problems remains an elusive goal for the bulk of the industry's current offerings.
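The arithmetic behind these figures is simple, and a minimal sketch may help make the 60% comparison concrete. The function names, question counts, and the two placeholder models below are hypothetical and are not taken from the LongCat report; only the 62.8% figure for Gemini 3 Pro and the 60% passing mark come from the article.

```python
PASS_THRESHOLD = 0.60  # the 60% passing mark cited in the article


def accuracy(num_correct: int, num_total: int) -> float:
    """Fraction of benchmark questions answered correctly."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return num_correct / num_total


def models_below_threshold(scores: dict[str, float],
                           threshold: float = PASS_THRESHOLD) -> list[str]:
    """Names of models whose accuracy falls short of the threshold, sorted."""
    return sorted(name for name, acc in scores.items() if acc < threshold)


# Only the Gemini 3 Pro score is reported in the article.
# The other two entries are made-up placeholders for illustration.
scores = {
    "Gemini 3 Pro": 0.628,
    "hypothetical-model-a": accuracy(55, 100),
    "hypothetical-model-b": accuracy(41, 100),
}

failing = models_below_threshold(scores)
print(f"{len(failing)} of {len(scores)} models scored below {PASS_THRESHOLD:.0%}")
```

With the real per-model results from the 26-model evaluation in place of the placeholders, the same comparison would show which models fall under the passing mark.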

Industry Impact

The release of General 365 is likely to have a significant impact on how AI models are developed and marketed. By establishing a benchmark where even the industry leaders struggle, Meituan is shifting the focus toward logical precision and cognitive depth. This may encourage AI labs to move beyond scaling parameters and instead focus on architectural innovations that enhance reasoning. Furthermore, as more companies look to integrate AI into decision-making processes, benchmarks like General 365 will become essential for identifying which models can actually handle complex, real-world logic versus those that merely simulate intelligence through pattern matching.

Frequently Asked Questions

Question: What is the primary focus of the General 365 benchmark?

Answer: General 365 is specifically designed to evaluate the reasoning capabilities of AI models, serving as a new standard for measuring logical depth and problem-solving accuracy.

Question: Which model performed the best on the General 365 evaluation?

Answer: Gemini 3 Pro achieved the highest accuracy among the 26 models tested, with a score of 62.8%.

Question: How did the majority of mainstream AI models perform?

Answer: Most of the 26 mainstream models tested were unable to reach the 60% accuracy mark, which is commonly treated as the minimum passing grade.

Related News

Meituan LongCat Releases General 365: A Challenging New Benchmark for AI Reasoning Evaluation
Industry News

Meituan's LongCat team has officially open-sourced General 365, a new evaluation benchmark designed to measure the reasoning capabilities of large language models (LLMs). In a comprehensive test involving 26 mainstream models, the results revealed a significant gap in current AI reasoning performance. Even the top-performing model, Gemini 3 Pro, achieved an accuracy of only 62.8%, while the vast majority of tested models failed to reach the 60% passing mark. This release aims to establish a more rigorous standard for the industry, highlighting the current limitations of even the most advanced AI systems in complex reasoning tasks. By providing a transparent and difficult metric, Meituan seeks to drive the development of more logically capable artificial intelligence.

Managing AI Coding with Agent Evaluation Thinking: Meituan's Practice in Refactoring 310,000 Lines of Code
Industry News

As AI-generated code now accounts for over 90% of development in certain environments, the primary challenge has shifted from generation speed to the effective management and constraint of AI capabilities. Meituan's technical team recently shared their experience refactoring 310,000 lines of code using a strategy centered on "Agent evaluation thinking." By implementing technical debt assessment, standardized rules, a specialized Refactoring SOP, and a Pre-PR (Pull Request) mechanism, they have successfully transformed large-scale refactoring from a high-cost, periodic project into a continuous, daily operational task. This approach ensures that AI-driven development does not amplify systemic chaos but instead adheres to unified technical standards, maintaining long-term code quality and system stability in an AI-dominated coding era.

Meituan Technical Team Releases LARYBench: A New Benchmark for Universal Latent Action Representation in Embodied AI
Industry News

The Meituan Technical Team has officially introduced LARYBench (Latent Action Representation Yielding Benchmark), a systematic evaluation framework designed to guide the learning of universal latent action representations from large-scale visual data. This benchmark marks a significant milestone in embodied AI by providing a standardized way to measure how models learn actions from visual inputs. Experimental results from the benchmark reveal that general vision models significantly outperform specialized embodied action expert models in both action generalization and control precision. Furthermore, the research demonstrates that embodied action representations can naturally emerge from large-scale human video data, suggesting that broad visual training is a viable path toward achieving more sophisticated and adaptable robotic control systems.