Meituan LongCat Team Releases General 365 Benchmark Revealing Significant Reasoning Gaps in Leading AI Models
Research Breakthrough | Meituan | LongCat | AI Benchmarking

The Meituan LongCat team has introduced General 365, a new benchmark designed to evaluate the reasoning capabilities of large language models (LLMs). The team assessed 26 mainstream models, and the results show how demanding the benchmark is for current systems. Even Gemini 3 Pro, regarded as one of the most powerful models available, achieved an accuracy of only 62.8%. Most of the tested models failed to reach 60% accuracy, a level often treated as a basic passing grade. The release gives Meituan's technical team a demanding new yardstick for AI reasoning and indicates that most current models still struggle with complex logical tasks.

Meituan Technical Team

Key Takeaways

  • New Evaluation Standard: Meituan’s LongCat team has launched General 365, a benchmark specifically focused on testing the reasoning limits of AI models.
  • Gemini 3 Pro Performance: As the top-performing model in the initial test, Gemini 3 Pro reached an accuracy of 62.8%, setting the current ceiling for the benchmark.
  • Industry-Wide Challenge: Out of 26 mainstream models tested, the majority were unable to achieve a 60% accuracy rate, indicating a widespread struggle with the benchmark's requirements.
  • Rigorous Benchmarking: The results suggest that General 365 serves as a high-bar metric, exposing the limitations of even the most advanced current large language models.

In-Depth Analysis

The Launch of General 365 and the Reasoning Frontier

With General 365, the Meituan LongCat team shifts the emphasis of evaluation toward reasoning. While many existing benchmarks focus on general knowledge or linguistic fluency, General 365 targets a model's ability to reason. By testing 26 mainstream models, the LongCat team has provided a cross-sectional view of the industry's current capabilities. Evaluating that many models at launch suggests the team intends General 365 as a common standard that can separate surface-level pattern matching from deeper logical reasoning.

Analyzing the Performance Gap

The data provided by the Meituan technical team highlights a stark reality in the AI field. Gemini 3 Pro, which represents the current state-of-the-art in model development, led the group but only managed a score of 62.8%. This score is particularly telling when compared to the rest of the field; the majority of the 26 models failed to reach the 60% mark. This "60% threshold" serves as a symbolic passing grade, and the failure of most models to meet it suggests that General 365 is designed to be exceptionally difficult. It exposes a significant gap between the perceived intelligence of modern LLMs and their actual performance on rigorous reasoning tasks. The results imply that while models are becoming more sophisticated, their ability to handle complex, multi-step reasoning remains a primary bottleneck.
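To make the arithmetic behind these figures concrete, here is a minimal Python sketch of how an accuracy score and a 60% pass line could be computed. It is an illustration under stated assumptions, not the LongCat team's evaluation code: the function names, the graded answers, and every score except Gemini 3 Pro's reported 62.8% are hypothetical placeholders.

```python
from typing import Dict, List

# The article's symbolic "passing grade".
PASS_THRESHOLD = 0.60


def accuracy(graded: List[bool]) -> float:
    """Return the fraction of benchmark items answered correctly."""
    if not graded:
        raise ValueError("no graded items")
    return sum(graded) / len(graded)


def models_below_threshold(
    scores: Dict[str, float], threshold: float = PASS_THRESHOLD
) -> List[str]:
    """Return the names of models whose accuracy falls below the threshold."""
    return sorted(name for name, score in scores.items() if score < threshold)


# Hypothetical graded answers for one model (True = correct).
graded_example = [True, False, True, True, False]

# Only the Gemini 3 Pro figure comes from the article; the other two
# entries are made-up placeholders, not reported results.
scores_example = {
    "Gemini 3 Pro": 0.628,
    "placeholder_model_b": 0.55,
    "placeholder_model_c": 0.41,
}

if __name__ == "__main__":
    print(f"example accuracy: {accuracy(graded_example):.1%}")
    print("below pass line:", models_below_threshold(scores_example))
```

Under this simple definition, a model passes only if its share of correct answers reaches the threshold, which is why a top score of 62.8% leaves little margin above the 60% line.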

Industry Impact

The introduction of General 365 is likely to push the AI industry toward more specialized and more difficult evaluation metrics. As general-purpose benchmarks become saturated with high-performing models, the industry needs harder tools like General 365 to identify real progress in reasoning. Meituan's findings serve as a reality check for developers and researchers, showing that even acclaimed models such as Gemini 3 Pro have substantial room for improvement. The benchmark may drive a new wave of research focused on logical consistency and reasoning depth, since most models tested still fall below the 60% pass line.

Frequently Asked Questions

Question: What is the General 365 benchmark?

General 365 is a reasoning evaluation benchmark released by the Meituan LongCat team. It is designed to test the reasoning capabilities of large language models and has recently been used to evaluate 26 mainstream models in the industry.

Question: How did the top AI models perform on this benchmark?

According to the report from Meituan, Gemini 3 Pro was the top performer with an accuracy rate of 62.8%. However, the majority of the 26 models tested did not reach the 60% accuracy threshold, highlighting the difficulty of the benchmark.

Question: Why is the 60% accuracy mark significant in this report?

The 60% mark is described as a "passing grade". Because most mainstream models failed to reach it, the report concludes that current AI reasoning capabilities are still insufficient for the specific challenges General 365 presents.

Related News

Meituan LongCat Team Unveils WBench: The First Systematic Multi-Round Benchmark for Interactive Video World Models
Research Breakthrough

The Meituan LongCat team has officially introduced and open-sourced WBench, a pioneering evaluation benchmark designed to measure the capabilities of interactive video world models. As the first systematic framework for multi-round interaction assessment, WBench serves as a diagnostic tool—likened to a 'CT scanner'—to identify the specific technical hurdles AI models face when transitioning from passive observation to active, multi-stage interaction. By testing models across diverse scenarios ranging from lunar environments to futuristic urban settings, WBench establishes a new standard for defining the boundaries of world models. This release marks a significant step in providing the AI research community with the tools necessary to pinpoint and resolve the bottlenecks currently limiting the development of truly interactive artificial intelligence.

LARYBench Launch: Defining the ImageNet for Embodied Action Representations and Measuring Generalization from Human Video Data
Research Breakthrough

The Meituan Technical Team has introduced LARYBench (Latent Action Representation Yielding Benchmark), a systematic evaluation framework designed to guide the learning of general latent action representations from large-scale visual data. This benchmark serves as a foundational tool, akin to ImageNet for computer vision, but specifically tailored for embodied intelligence. Experimental results from the benchmark reveal a significant discovery: general vision models demonstrate superior performance in action generalization and control precision compared to specialized action expert models designed specifically for embodied AI. This indicates that sophisticated embodied action representations can emerge naturally from training on extensive human video datasets, suggesting a new pathway for developing robotic control systems through general-purpose visual learning.

Meituan LongCat Team Unveils LongCat-AudioDiT: Revolutionizing Zero-Shot TTS Voice Cloning via Waveform Latent Space Diffusion
Research Breakthrough

The Meituan LongCat team has officially released LongCat-AudioDiT, a pioneering model designed to overcome the technical limitations of zero-shot Text-to-Speech (TTS) voice cloning. By fundamentally redesigning the synthesis pipeline, the model abandons traditional intermediate representations such as Mel-spectrograms. Instead, it operates directly within the waveform latent space using a diffusion-based framework. This strategic shift is intended to eliminate cascade errors caused by multi-stage data conversion, allowing the AI to learn the inherent laws of sound directly from the source. LongCat-AudioDiT represents a significant advancement in audio synthesis, offering a more streamlined and high-fidelity approach to replicating human voices without the need for extensive target-specific training, thereby setting a new benchmark for the industry.