Meituan LongCat Releases General 365: A New Benchmark for AI Reasoning Evaluation
Industry News | Meituan | AI Benchmarking | Reasoning Models

The Meituan LongCat team has released General 365, a new benchmark for evaluating the reasoning capabilities of AI models. In an initial assessment of 26 mainstream models, Google's Gemini 3 Pro, the strongest performer, reached an accuracy of only 62.8%, and the vast majority of the models tested fell short of the 60% passing threshold. The results underline how difficult General 365 is and suggest that current large language models still face substantial hurdles in complex reasoning scenarios.

Meituan Technical Team

Key Takeaways

  • New Evaluation Standard: Meituan's LongCat team has introduced General 365, a benchmark specifically focused on reasoning capabilities.
  • Industry Performance Gap: Out of 26 mainstream models tested, the majority failed to achieve a score of 60%.
  • Top Performer: Gemini 3 Pro currently leads the benchmark but only managed an accuracy rate of 62.8%.
  • Rigorous Testing: The benchmark is positioned as a "new yardstick" for reasoning evaluation, and the low scores across the field point to a high level of difficulty.

In-Depth Analysis

The Launch of General 365 and the Reasoning Challenge

The Meituan LongCat team has officially released General 365, positioning it as a critical new benchmark for the AI industry. The primary objective of this tool is to provide a "new yardstick" for reasoning evaluation, moving beyond simple task completion to test the underlying logic and cognitive depth of large language models. The introduction of General 365 comes at a time when the industry is seeking more nuanced ways to differentiate between models that can perform basic functions and those that truly possess advanced reasoning skills.

According to the LongCat team, the benchmark is intentionally designed to be challenging. By focusing on reasoning, Meituan is targeting one of the most difficult frontiers in AI development. The name "General 365" may suggest a comprehensive, all-encompassing approach to testing, though the core focus remains the accuracy of reasoning outputs across a wide variety of scenarios.

Comparative Performance of Mainstream Models

The initial testing phase of General 365 involved 26 of the most prominent AI models currently available in the market. The results of these tests serve as a sobering reality check for the state of AI reasoning. Even the most advanced models struggled to maintain high accuracy levels when subjected to the General 365 criteria.

Gemini 3 Pro, identified as the current leader, reached an accuracy of 62.8%. While this score places it first among the 26 models tested, it also shows how much room for improvement remains. Perhaps more significant, the 60% "passing line" was out of reach for the vast majority of models. This suggests that many current models, while proficient in language generation, still lack the robust reasoning needed for the problems General 365 presents.

Industry Impact

The release of General 365 by Meituan's LongCat team could influence how AI models are developed and marketed. On a benchmark where even the strongest model tested barely exceeds 60% accuracy, developers may be incentivized to prioritize reasoning and logical consistency over mere fluency or parameter count.

Furthermore, the fact that a major technology player like Meituan is contributing to the evaluation ecosystem suggests a move toward more transparent and standardized testing. As models continue to evolve, benchmarks like General 365 will be essential for identifying which systems are truly capable of handling complex, real-world problem-solving. This benchmark sets a high bar, serving as both a challenge to current AI leaders and a roadmap for future research and development in the field of artificial intelligence reasoning.

Frequently Asked Questions

Question: What is the General 365 benchmark?

General 365 is a new reasoning evaluation benchmark released by the Meituan LongCat team. It is designed to serve as a rigorous standard for testing the logical and reasoning capabilities of mainstream AI models.

Question: Which model performed the best on the General 365 test?

According to the initial results, Gemini 3 Pro is the top-performing model on the General 365 benchmark, achieving an accuracy rate of 62.8%.

Question: How did most models perform on this new benchmark?

The majority of the 26 mainstream models tested failed to reach the 60% accuracy mark, which is considered the passing line for the General 365 evaluation.
