Meituan LongCat Releases General 365 Reasoning Benchmark: Most AI Models Fail to Reach Passing Grade
Research Breakthrough, Meituan, AI Benchmarking, Reasoning Models

The Meituan LongCat team has officially open-sourced "General 365," a new evaluation benchmark designed to measure the reasoning capabilities of AI models. In a comprehensive test involving 26 mainstream models, the results revealed a significant gap in current AI reasoning performance. Even the industry-leading Gemini 3 Pro achieved an accuracy rate of only 62.8%, while the vast majority of tested models failed to reach the 60% threshold. This release aims to establish a more rigorous standard for evaluating complex reasoning tasks in the AI industry, highlighting the ongoing challenges in developing truly capable reasoning engines. By open-sourcing this tool, Meituan provides a new yardstick for the global AI community to assess and improve logical depth in large language models.

Meituan Technology Team

Key Takeaways

  • Meituan LongCat Open-Sources General 365: A new benchmark specifically designed to evaluate the reasoning capabilities of AI models.
  • Widespread Performance Gap: Out of 26 mainstream models tested, the majority failed to reach a 60% accuracy rate, which is considered a basic passing grade.
  • Gemini 3 Pro Leads the Field: Currently the top performer on this benchmark, Gemini 3 Pro achieved an accuracy of 62.8%.
  • New Industry Standard: General 365 sets a high bar for reasoning, suggesting that current AI models still struggle with complex logical tasks.

In-Depth Analysis

The Launch of General 365 by Meituan LongCat

The Meituan LongCat team has officially introduced General 365, an open-source benchmark that aims to redefine how the industry evaluates the reasoning performance of Large Language Models (LLMs). In the current AI landscape, where many models excel at creative writing or basic information retrieval, the ability to perform consistent and complex logical reasoning remains a significant differentiator. General 365 was developed to address this specific need, providing a standardized and rigorous framework for testing. By making this benchmark open-source, Meituan is inviting the global research community to subject their models to a more demanding set of criteria, potentially exposing the limitations of systems that otherwise perform well on less specialized tests.

Analyzing the Reasoning Performance Gap

The initial evaluation conducted by the LongCat team involved 26 of the most prominent AI models currently available. The results of these tests are telling: the vast majority of these models were unable to reach a 60% accuracy level. In many academic and professional settings, 60% is viewed as the minimum threshold for competency, or a "passing grade." The fact that most mainstream models fell below this line indicates that the tasks within General 365 are specifically designed to challenge the logical foundations of these systems. This widespread failure suggests that while AI has become highly proficient at pattern recognition and linguistic fluency, the transition to robust, reliable reasoning is still in its early stages. The benchmark serves as a reality check for the industry, highlighting that "intelligence" in AI is not yet synonymous with "reasoning."

Gemini 3 Pro and the Current State of the Art

Among the 26 models evaluated, Gemini 3 Pro emerged as the leader, yet its performance further illustrates the difficulty of the General 365 benchmark. Its accuracy of 62.8% is the highest score reported, but it still leaves significant room for improvement: 37.2% of the reasoning tasks, nearly four in every ten, were not answered correctly. This result marks the current ceiling on the benchmark, showing that even the most advanced models from leading tech companies only narrowly clear the 60% passing line. The data provided by Meituan suggests that achieving human-level reasoning in AI will require more than incremental updates; it may require fundamental shifts in how these models process logical structures.
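The arithmetic behind these figures can be checked with a minimal Python sketch. The only inputs taken from the article are the 62.8% accuracy and the 60% passing line; the function and variable names are hypothetical and are not part of the General 365 tooling.

```python
# Illustrative only: names below are assumptions, not General 365 APIs.
PASSING_THRESHOLD = 0.60  # the 60% "passing grade" cited in the article


def error_rate(accuracy: float) -> float:
    """Return the fraction of tasks not answered correctly."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return 1.0 - accuracy


def passes(accuracy: float, threshold: float = PASSING_THRESHOLD) -> bool:
    """Return True if the accuracy meets or exceeds the passing threshold."""
    return accuracy >= threshold


gemini_3_pro_accuracy = 0.628  # reported top score
missed_per_ten = round(error_rate(gemini_3_pro_accuracy) * 10, 1)

print(missed_per_ten)                 # 3.7
print(passes(gemini_3_pro_accuracy))  # True
```

On these numbers, the top model misses about 3.7 tasks in every ten and clears the passing line by only 2.8 percentage points.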

Industry Impact

The introduction of General 365 is poised to have a significant impact on the AI industry by shifting the focus from general performance to specialized reasoning capabilities. As developers and researchers strive to climb the General 365 leaderboard, we can expect a renewed emphasis on architectural innovations that prioritize logic and multi-step problem-solving. Furthermore, Meituan's decision to open-source the benchmark allows it to become a transparent and evolving standard, which may help guard against benchmark overfitting, where models are trained specifically to pass certain tests without gaining actual underlying capability. For the broader industry, these results serve as a call to action to address the "reasoning deficit" that currently exists even in the most sophisticated AI systems.

Frequently Asked Questions

What is the General 365 benchmark?

General 365 is an open-source evaluation tool released by the Meituan LongCat team. It is specifically designed to measure and benchmark the reasoning capabilities of AI models, providing a more rigorous standard than many existing general-purpose tests.

How did the top AI models perform on this test?

In a test of 26 mainstream models, performance was generally low. Gemini 3 Pro was the top performer with a 62.8% accuracy rate, while the majority of the other models tested failed to reach the 60% accuracy mark, which is conventionally regarded as a basic passing grade.

Why did Meituan LongCat open-source this tool?

By open-sourcing General 365, Meituan aims to provide the AI community with a transparent and standardized way to evaluate reasoning. This encourages the development of models that are not just good at generating text, but are also capable of complex, logical thought processes.
