Meituan LongCat Open Sources General 365: A New Benchmark Revealing AI Reasoning Challenges
Industry News, Meituan, Artificial Intelligence, Open Source

Meituan's LongCat team has officially released General 365, an open-source benchmark designed to evaluate the reasoning capabilities of modern AI models. In an assessment of 26 mainstream models, the team found that most of the field struggles with the benchmark. Gemini 3 Pro emerged as the top performer with an accuracy rate of 62.8%, making it one of the few models to surpass the 60% mark. The majority of the models tested failed to reach that level, which the benchmark treats as a passing line, highlighting how difficult advanced reasoning remains for current AI systems. The benchmark gives the AI community a new tool for measuring and improving reasoning, and it sets a high bar for future model development.

Meituan Technical Team

Key Takeaways

  • New Benchmark Released: Meituan's LongCat team has open-sourced General 365, a specialized tool for evaluating AI reasoning.
  • Industry-Wide Testing: The benchmark was used to test 26 mainstream AI models to assess their logical capabilities.
  • Gemini 3 Pro Leads: Currently identified as the strongest model, Gemini 3 Pro achieved an accuracy rate of 62.8%.
  • Performance Gap: The vast majority of tested models failed to reach a 60% accuracy threshold, indicating a widespread struggle with complex reasoning.

In-Depth Analysis

The Introduction of General 365

The Meituan LongCat team has officially introduced General 365 to the global AI community. As an open-source reasoning evaluation benchmark, General 365 aims to provide a more accurate and demanding standard for measuring how well large language models can handle complex logical tasks. By open-sourcing this tool, Meituan is providing a transparent framework that allows developers and researchers to test their models against a set of criteria that reflects real-world reasoning challenges.

Evaluation of Mainstream Models

In the initial rollout of General 365, the LongCat team evaluated 26 of the most prominent AI models currently on the market. The results offer a sobering look at the current state of artificial intelligence. Even the strongest model in this evaluation, Gemini 3 Pro, reached an accuracy rate of only 62.8%. This score leads the field, yet it suggests that even the most advanced systems have significant room for improvement when it comes to deep reasoning.

The 60% Accuracy Threshold

One of the most striking findings from the LongCat team's report is the performance of the broader field of AI models. According to the data, the vast majority of the 26 models tested were unable to reach the 60% accuracy mark. In the context of this benchmark, the 60% level is viewed as a basic passing grade, or "passing line." The fact that most mainstream models failed to meet this standard highlights a critical bottleneck in AI development: while models are becoming increasingly proficient at language generation, their ability to apply logic and reasoning consistently remains underdeveloped.
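To make the arithmetic behind the passing line concrete, here is a minimal Python sketch of how accuracy might be compared against a 60% threshold. It is an illustration rather than the LongCat team's evaluation code: the `ModelResult` class, the `summarize` helper, the item count of 1000, and the two placeholder models are all assumptions made for the example. Only the 62.8% figure for Gemini 3 Pro comes from the report, and treating a score of exactly 60% as a pass is likewise an assumption.

```python
from dataclasses import dataclass

# The 60% "passing line" described in the article.
PASSING_LINE = 0.60


@dataclass
class ModelResult:
    name: str
    correct: int  # hypothetical count of correctly answered items
    total: int    # hypothetical number of benchmark items

    @property
    def accuracy(self) -> float:
        return self.correct / self.total

    @property
    def passed(self) -> bool:
        # Assumption: a score exactly at the line counts as passing.
        return self.accuracy >= PASSING_LINE


def summarize(results):
    """Return results sorted by accuracy (best first) and the share that passed."""
    ranked = sorted(results, key=lambda r: r.accuracy, reverse=True)
    pass_rate = sum(r.passed for r in ranked) / len(ranked)
    return ranked, pass_rate


# Illustrative numbers only: 628/1000 reproduces the reported 62.8%.
# "model-b" and "model-c" are made-up placeholders, not reported results.
results = [
    ModelResult("model-b", 550, 1000),
    ModelResult("Gemini 3 Pro", 628, 1000),
    ModelResult("model-c", 410, 1000),
]

ranked, pass_rate = summarize(results)
for r in ranked:
    status = "pass" if r.passed else "below passing line"
    print(f"{r.name}: {r.accuracy:.1%} ({status})")
print(f"Share of models at or above the line: {pass_rate:.1%}")
```

With real per-model counts substituted for the placeholders, the same comparison would show how many of the 26 models clear the line.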

Industry Impact

The release of General 365 and the subsequent performance data have significant implications for the AI industry. By establishing a benchmark where even the top-tier models struggle to exceed 60% accuracy, Meituan has set a new, more rigorous standard for what constitutes "strong" reasoning. This will likely shift the industry's focus toward improving the underlying logical architectures of models rather than simply increasing parameter counts or conversational fluency. Furthermore, as an open-source project, General 365 provides a standardized metric that can foster more honest and transparent competition among AI developers worldwide.

Frequently Asked Questions

Question: What is the primary purpose of Meituan's General 365?

General 365 is an open-source benchmark created by the Meituan LongCat team to evaluate the reasoning capabilities of AI models and to set a more demanding standard for them.

Question: Which model performed the best on the General 365 benchmark?

Gemini 3 Pro performed the best among the 26 mainstream models tested, achieving an accuracy rate of 62.8%.

Question: How did most AI models fare in the reasoning tests?

Most of the 26 mainstream models tested failed to reach the 60% accuracy threshold, which is considered the passing line for the benchmark.

Related News

Meituan LongCat Team Unveils WBench: The First Systematic Multi-Round Benchmark for Interactive Video World Models
Industry News

The Meituan LongCat team has announced the release and open-sourcing of WBench, a pioneering systematic multi-round evaluation benchmark specifically designed for interactive video world models. Positioned as a diagnostic "CT scanner" for AI, WBench aims to provide precise insights into the technical bottlenecks that occur during the transition from passive video generation to active user interaction. By evaluating models across diverse scenarios—ranging from lunar walks to futuristic cyber cities—WBench addresses the critical need for standardized metrics in the evolving field of world models. This benchmark represents a significant step in identifying where current AI systems struggle to maintain consistency and logic during complex, multi-stage interactive sequences, offering a roadmap for future development in the industry.

Meituan at ACL 2026: Advancing Generative AI Through Evaluation, Reasoning, and Optimization
Industry News

The Meituan Technical Team has announced that six of its research papers have been accepted for ACL 2026, a premier international conference in computational linguistics and natural language processing (NLP). These papers represent a significant contribution to the field, covering a diverse range of cutting-edge topics including large language model (LLM) evaluation, complex process reasoning, and competition-level mathematical thinking optimization. Furthermore, the research explores advancements in reinforcement learning and the emerging field of generative recommendation systems. By focusing on these critical areas, Meituan aims to establish a new paradigm for generative AI, bridging the gap between theoretical research and practical industry applications. This selection underscores Meituan's growing influence in the global AI research community and its commitment to solving complex technical challenges in the NLP domain.

Anthropic-Cybersecurity-Skills: 817 Structured AI Agent Capabilities Mapped to Global Security Frameworks
Industry News

A significant new repository titled 'Anthropic-Cybersecurity-Skills' has been released, providing a comprehensive library of 817 structured cybersecurity skills specifically designed for AI agents. This initiative utilizes the agentskills.io standard to ensure interoperability across more than 20 major platforms, including Claude Code, GitHub Copilot, and Gemini CLI. The skills are meticulously mapped to six essential industry frameworks: MITRE ATT&CK, NIST CSF 2.0, MITRE ATLAS, D3FEND, NIST AI RMF, and MITRE F3 (Fight Fraud). By bridging the gap between AI automation and standardized security protocols, this project offers a structured roadmap for deploying AI agents in complex security environments, focusing on threat detection, risk management, and fraud prevention.