Meituan LongCat Team Launches General 365: A Rigorous New Benchmark for AI Reasoning
Research Breakthrough, Meituan, LongCat, AI Benchmarking

The Meituan LongCat team has released General 365, an evaluation benchmark designed to measure the reasoning capabilities of large language models (LLMs). In an initial assessment of 26 mainstream models, the benchmark showed that even leading systems struggle with its tasks. Gemini 3 Pro, currently regarded as one of the most capable models, achieved an accuracy of only 62.8%. More strikingly, the vast majority of the models tested failed to reach the 60% threshold, which is treated as a basic passing grade. The release sets a more demanding standard for AI evaluation and indicates that complex reasoning remains a major hurdle for even the most advanced AI systems.

Meituan Technical Team

Key Takeaways

  • New Benchmark Release: Meituan's LongCat team has introduced General 365, a benchmark specifically focused on evaluating the reasoning performance of AI models.
  • Industry-Wide Testing: The benchmark was used to evaluate 26 mainstream models to provide a comprehensive overview of the current state of AI reasoning.
  • Gemini 3 Pro Performance: Even the top-performing model in the test, Gemini 3 Pro, only reached an accuracy of 62.8%.
  • Low Success Rates: Most models evaluated failed to achieve a 60% accuracy score, indicating that current AI reasoning capabilities fall well short of the standard this benchmark sets.

In-Depth Analysis

The Introduction of General 365

The Meituan LongCat team has officially entered the AI evaluation space with the release of General 365. This benchmark is designed to address the growing need for more rigorous testing of reasoning capabilities in large language models. As AI development shifts from simple conversational tasks to complex problem-solving, the industry requires benchmarks that can accurately differentiate between surface-level pattern matching and deep logical reasoning. General 365 appears to be positioned as a "high bar" for the industry, focusing on areas where current models still struggle significantly.

Analyzing the Performance Gap

The results released alongside the benchmark provide a sobering look at the current state of artificial intelligence. By testing 26 mainstream models, the LongCat team has established a broad baseline for performance. The fact that Gemini 3 Pro, a model recognized for its advanced capabilities, managed a score of only 62.8% suggests that General 365 contains tasks that are significantly more difficult than those found in traditional benchmarks.

Furthermore, the observation that the majority of models could not reach the 60% "passing line" highlights a critical bottleneck in AI development. This failure rate suggests that while models are becoming better at generating fluent text, their underlying reasoning is not yet robust enough to handle the specific challenges posed by General 365. These results suggest that the industry may have been overestimating the reasoning maturity of current LLMs based on older, less demanding benchmarks.

Setting a New Standard for Reasoning

By establishing a benchmark where even the "strongest" models are barely passing, Meituan is effectively recalibrating the expectations for AI performance. General 365 serves as a diagnostic tool that identifies the limits of current technology. The 60% threshold mentioned by the LongCat team acts as a symbolic barrier, separating models that possess basic reasoning competency from those that do not. This rigorous approach is essential for guiding future research and development, as it provides a clear target for engineers looking to improve the logical consistency and problem-solving depth of their models.
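To make the arithmetic behind a "passing line" concrete, here is a minimal Python sketch of how per-model accuracy could be compared against the 60% threshold. The model labels, question counts, and the `ModelResult` and `split_by_passing_line` names are hypothetical and chosen only for illustration. They are not General 365 data, and the source does not describe the team's actual scoring procedure.

```python
from dataclasses import dataclass

PASSING_LINE = 0.60  # the 60% threshold cited in the article


@dataclass
class ModelResult:
    name: str      # hypothetical model label, not a real leaderboard entry
    correct: int   # number of questions answered correctly (made-up)
    total: int     # number of questions attempted (made-up)

    @property
    def accuracy(self) -> float:
        # Accuracy is the fraction of questions answered correctly.
        return self.correct / self.total


def split_by_passing_line(results, passing_line=PASSING_LINE):
    """Return (passed, failed) lists of model names for a given threshold."""
    passed = [r.name for r in results if r.accuracy >= passing_line]
    failed = [r.name for r in results if r.accuracy < passing_line]
    return passed, failed


# Illustrative counts only; they are not taken from the benchmark.
results = [
    ModelResult("model_a", 314, 500),  # 62.8% accuracy
    ModelResult("model_b", 280, 500),  # 56.0% accuracy
    ModelResult("model_c", 205, 500),  # 41.0% accuracy
]

passed, failed = split_by_passing_line(results)
for r in results:
    print(f"{r.name}: {r.accuracy:.1%}")
print("passed:", passed)  # prints: passed: ['model_a']
print("failed:", failed)  # prints: failed: ['model_b', 'model_c']
```

In this made-up example only one of three models clears the line, which mirrors the pattern the article describes: a top score just above 60% and most others below it.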

Industry Impact

The release of General 365 is likely to have a profound impact on how AI models are marketed and developed. For years, the industry has relied on benchmarks where top models frequently score in the 80% to 90% range, leading to a perception that reasoning is a "solved" problem. General 365 challenges this perception by showing that when the difficulty is increased, performance drops sharply. This will likely push AI labs to focus more on the quality of reasoning rather than just the scale of the models.

Additionally, Meituan's involvement underscores the importance of real-world application providers in the AI ecosystem. As a company that relies on AI for complex logistics and consumer services, Meituan has a vested interest in ensuring that the models it uses are truly capable of logical deduction. General 365 provides a transparent metric that both developers and enterprise users can use to assess the practical utility of an AI model in high-stakes reasoning scenarios.

Frequently Asked Questions

Question: What is the General 365 benchmark?

General 365 is a new evaluation benchmark released by the Meituan LongCat team. It is specifically designed to test and measure the reasoning capabilities of mainstream large language models, providing a more rigorous standard than many existing evaluations.

Question: How did the top models perform on General 365?

According to the initial results, Gemini 3 Pro was the top performer with an accuracy rate of 62.8%. However, the vast majority of the 26 mainstream models tested failed to reach a 60% accuracy score, which is considered the passing threshold for the benchmark.

Question: Why is General 365 significant for the AI industry?

It is significant because it reveals a major gap in the reasoning abilities of current AI models. By setting a high difficulty level where most models fail to pass, it provides a more accurate and challenging metric for the next generation of AI development, moving beyond simpler benchmarks where models already achieve high scores.

Related News

LARYBench: Defining the ImageNet for Embodied Action Representation and Generalization
Research Breakthrough

The Meituan Technical Team has introduced LARYBench (Latent Action Representation Yielding Benchmark), a systematic evaluation framework designed to measure general latent action representations derived from large-scale visual data. This benchmark marks a significant milestone in embodied AI, often compared to the 'ImageNet' moment for action representation. Experimental findings reveal that general vision models significantly outperform specialized embodied AI expert models in both action generalization and control precision. Crucially, the research demonstrates that embodied action representations can effectively emerge from large-scale human video data, suggesting a new paradigm for training AI to understand and execute physical movements without relying solely on specialized robotic datasets.

Meituan LongCat Team Unveils LongCat-AudioDiT: Advancing Zero-Shot TTS Voice Cloning via Waveform Latent Space Diffusion
Research Breakthrough

The Meituan LongCat team has officially announced the release of LongCat-AudioDiT, a specialized model designed to push the boundaries of zero-shot Text-to-Speech (TTS) voice cloning. By fundamentally rethinking the audio synthesis pipeline, the team has moved away from traditional intermediate representations such as Mel-spectrograms. Instead, LongCat-AudioDiT operates directly within the waveform latent space using a diffusion-based framework. This strategic shift is intended to eliminate the cascade errors that typically arise during multi-stage data conversion processes in conventional TTS systems. By allowing the AI to learn the inherent patterns of sound directly, the model aims to achieve a higher level of fidelity and accuracy in voice cloning, representing a significant technical breakthrough in the field of generative audio.

Challenging Anthropomorphism: Why Age of Empires II Might Have Human-Like Attributes if LLMs Do
Research Breakthrough

A provocative research paper by Adrian de Wynter, titled 'If LLMs Have Human-Like Attributes, Then So Does Age of Empires II,' challenges the prevailing tendency in AI research to ascribe anthropomorphic qualities to Large Language Models (LLMs). The study argues that attributes such as morality or natural language understanding, often assumed to emerge in LLMs, are empirically non-unique. By training a simple neural network on the classic videogame Age of Empires II, de Wynter demonstrates that if these attributes are granted to LLMs, they could logically be attributed to any entity within a sufficiently powerful substrate, including LEGO or even the Greater Boston Area. The paper calls for explicit measurement criteria in AI evaluation and proposes a 'null assumption' of non-uniqueness to prevent circular or uninformative conclusions in the field of computation and language.