Meituan LongCat Releases General 365: A Challenging New Benchmark for AI Reasoning Evaluation
Industry News, Meituan, LongCat, AI Benchmark

Meituan's LongCat team has officially open-sourced General 365, a new evaluation benchmark designed to measure the reasoning capabilities of large language models (LLMs). In a comprehensive test involving 26 mainstream models, the results revealed a significant gap in current AI reasoning performance. Even the top-performing model, Gemini 3 Pro, achieved an accuracy of only 62.8%, while the vast majority of tested models failed to reach the 60% passing mark. This release aims to establish a more rigorous standard for the industry, highlighting the current limitations of even the most advanced AI systems in complex reasoning tasks. By providing a transparent and difficult metric, Meituan seeks to drive the development of more logically capable artificial intelligence.

Meituan Technical Team

Key Takeaways

  • New Benchmark Released: Meituan's LongCat team has open-sourced General 365, a specialized benchmark for reasoning evaluation.
  • Low Overall Performance: Out of 26 mainstream models tested, most failed to reach a 60% accuracy threshold.
  • Top Performer: Gemini 3 Pro currently leads the benchmark but only achieved a score of 62.8%.
  • Industry Standard: General 365 is positioned as a new, more difficult "yardstick" for measuring logical reasoning in AI.

In-Depth Analysis

The Launch of General 365

The Meituan LongCat team has introduced General 365 with the specific intent of setting a new benchmark for reasoning evaluation. In the current landscape of artificial intelligence, many models perform exceptionally well on standard benchmarks that focus on knowledge retrieval or basic linguistic tasks. However, General 365 is designed to probe deeper into the logical and reasoning structures of these models. By open-sourcing this tool, Meituan provides the global developer community with a rigorous framework to test the limits of large language models (LLMs).

Performance Gap in Mainstream Models

The initial results released alongside the benchmark provide a sobering look at the current state of AI reasoning. Meituan conducted empirical tests on 26 of the most prominent models in the industry. The findings indicate that reasoning remains a significant hurdle for AI development. Gemini 3 Pro, recognized as one of the most powerful models currently available, secured the top spot but only managed an accuracy rate of 62.8%.

More strikingly, the vast majority of the 26 models tested did not even reach the 60% mark, which is often treated as a basic "passing" grade in academic and professional evaluations. This suggests that while AI has made strides in many areas, complex reasoning (the ability to follow logical chains and solve intricate problems) remains an area where even the most advanced systems struggle. The data from General 365 indicates that AI still has a long way to go before it can consistently master high-level reasoning tasks.
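To make the accuracy figures and the 60% "passing" line concrete, here is a minimal Python sketch of how such a score could be computed. The article does not describe how General 365 actually grades answers, so the exact-match comparison, the function names, and the toy answers below are all illustrative assumptions rather than the benchmark's real method or data.

```python
from typing import Sequence

# The 60% "passing" line mentioned in the article.
PASS_MARK = 0.60


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that match the reference answer.

    Assumption: answers are compared by case-insensitive exact match after
    trimming whitespace. The real benchmark may grade differently.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


def passes(score: float, pass_mark: float = PASS_MARK) -> bool:
    """Return True if the score reaches the pass mark."""
    return score >= pass_mark


# Hypothetical toy data, NOT taken from General 365.
toy_references = ["42", "B", "yes", "7", "north"]
toy_predictions = ["42", "b", "no", "8", "south"]

toy_score = accuracy(toy_predictions, toy_references)
print(f"toy accuracy: {toy_score:.1%}, passes: {passes(toy_score)}")

# The only real figure used here is the reported top score of 62.8%.
print(f"reported top score passes: {passes(0.628)}")
```

Under these assumptions, the reported top score of 62.8% clears the 60% line by fewer than three percentage points, which is why the article describes most of the other 25 models as falling short of it.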

Industry Impact

The introduction of General 365 is significant for the AI industry as it shifts the focus from general performance to specialized reasoning capabilities. By establishing a benchmark where even the "strongest" models score relatively low, Meituan is challenging the industry to move beyond superficial improvements.

This benchmark serves as a reality check for AI researchers and developers. It provides a clear, quantifiable metric that exposes the weaknesses in current logical processing. As more organizations adopt General 365 for internal testing, it is likely to influence the direction of model training, pushing developers to prioritize reasoning depth and logical consistency. Furthermore, as an open-source project, it encourages transparency and collaborative improvement across the AI ecosystem, setting a high bar for what constitutes a "capable" reasoning model.

Frequently Asked Questions

Question: What is the primary purpose of Meituan's General 365?

General 365 is an open-source benchmark created by the Meituan LongCat team specifically to evaluate and set a new standard for the reasoning capabilities of large language models.

Question: How did the top AI models perform on this new benchmark?

Performance was generally low across the board. Out of 26 mainstream models, Gemini 3 Pro performed the best with a 62.8% accuracy rate, while most other models failed to reach the 60% threshold.

Question: Why is General 365 considered a "new yardstick" for the industry?

It is considered a new yardstick because it focuses on complex reasoning tasks that current models find difficult, providing a more rigorous and challenging evaluation than many existing benchmarks.
