Ornith-1.0: New Open-Source Self-Improving Models Set State-of-the-Art Benchmarks for Agentic Coding Tasks
Product Launch, Open Source, AI Agents, Coding

Ornith-1.0 is a suite of self-improving open-source models built for agentic coding. Developed by deepreinforce-ai, the models range from 9B-Dense to 397B-MoE and are post-trained on top of Gemma 4 and Qwen 3.5. Using a Reinforcement Learning (RL) framework that jointly optimizes solution rollouts and the scaffolds that drive them, Ornith-1.0 achieves state-of-the-art performance on major benchmarks such as SWE-bench and Terminal-Bench 2.1. The project is released under the MIT license, making it globally accessible and free from regional limitations. The models improve on existing baselines in complex coding tasks, repository-level understanding, and multilingual support, a notable step forward for open-source AI agents in software engineering.

Hacker News

Key Takeaways

  • Diverse Model Architectures: Ornith-1.0 is available in multiple configurations, including 9B-Dense, 31B-Dense, 35B-MoE, and 397B-MoE, catering to different computational needs.
  • Self-Improving RL Framework: The models utilize Reinforcement Learning to optimize not just the final code solutions (rollouts) but also the underlying scaffolds that guide the generation process.
  • State-of-the-Art Performance: Ornith-1.0 models outperform comparable open-source models like Qwen 3.5 and Gemma 4 across critical benchmarks such as SWE-bench and Terminal-Bench 2.1.
  • MIT Licensed: The project is fully open-source, globally accessible, and free from regional restrictions, encouraging widespread adoption and community contribution.
  • Advanced Post-Training: The models are built upon high-performance foundations, specifically post-trained on top of Gemma 4 and Qwen 3.5 architectures.

In-Depth Analysis

The Self-Improving Training Framework and Scaffold Optimization

At the core of Ornith-1.0's success is its innovative self-improving training framework. Unlike traditional models that focus solely on generating the final output, Ornith-1.0 employs Reinforcement Learning (RL) to learn the generation of both the solution rollouts and the scaffolds that drive those rollouts. This dual optimization approach allows the model to discover superior search trajectories. By jointly refining the scaffold—the structural logic or steps taken to reach a solution—and the resulting code, the model generates higher-quality solutions that are more robust and efficient.

This method addresses a common bottleneck in agentic coding: the quality of the reasoning path. By treating the scaffold as a learnable component, Ornith-1.0 can adapt its internal logic to better handle complex, multi-step coding problems. This results in a model that doesn't just "guess" the code but follows a learned, optimized trajectory to solve repository-level issues.
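The joint optimization described above can be illustrated with a deliberately tiny sketch. It is not deepreinforce-ai's training code: the two-scaffold, two-rollout setup, the `REWARD` table, and every function name are hypothetical, and plain REINFORCE with a running baseline stands in for whatever RL algorithm the project actually uses. The only point is that the scaffold choice and the rollout choice are both updated from the same reward signal.

```python
import math
import random

# Hypothetical toy rewards: chance that a (scaffold, rollout) pair passes tests.
REWARD = [
    [0.1, 0.2],  # scaffold 0
    [0.3, 0.9],  # scaffold 1
]

def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng):
    """Draw an index from a categorical distribution."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def train(steps=5000, lr=0.1, seed=0):
    """Jointly update a scaffold policy and a scaffold-conditioned rollout policy."""
    rng = random.Random(seed)
    scaffold_logits = [0.0, 0.0]
    rollout_logits = [[0.0, 0.0], [0.0, 0.0]]  # one row per scaffold
    baseline = 0.0  # running average reward, used to reduce variance
    for _ in range(steps):
        sp = softmax(scaffold_logits)
        s = sample(sp, rng)
        rp = softmax(rollout_logits[s])
        a = sample(rp, rng)
        reward = 1.0 if rng.random() < REWARD[s][a] else 0.0
        advantage = reward - baseline
        baseline += 0.01 * (reward - baseline)
        # REINFORCE: the gradient of log softmax is (one-hot - probabilities).
        for i in range(2):
            scaffold_logits[i] += lr * advantage * ((1.0 if i == s else 0.0) - sp[i])
            rollout_logits[s][i] += lr * advantage * ((1.0 if i == a else 0.0) - rp[i])
    return softmax(scaffold_logits), [softmax(row) for row in rollout_logits]

if __name__ == "__main__":
    scaffold_probs, rollout_probs = train()
    print("scaffold policy:", scaffold_probs)
    print("rollout policy given scaffold 1:", rollout_probs[1])
```

In this toy setting the scaffold policy drifts toward the scaffold whose rollouts earn more reward, which is the intuition behind treating the scaffold as a learnable component rather than a fixed prompt or harness.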

Benchmarking Performance: 9B and 35B Model Comparisons

Ornith-1.0 is benchmarked against baselines of comparable size. In the 9B category, Ornith-1.0-9B leads both Qwen3.5-9B and Gemma4-12B. On Terminal-Bench 2.1 (Terminus-2), Ornith-1.0-9B scored 43.1, roughly double Qwen3.5-9B's 21.3 and Gemma4-12B's 21. On SWE-bench Verified, Ornith-1.0-9B reached 69.4, ahead of Qwen3.5-9B (53.2) and Gemma4-12B (44.2).

Among the larger 35B models, Ornith-1.0-35B also comes out ahead. On SWE-bench Verified it scored 75.6, surpassing Qwen3.5-35B (70.0) and Qwen3.6-35B (73.4). Ornith-1.0-35B is also competitive with the much larger Qwen3.5-397B: the two are near parity on NL2Repo (34.6 vs 36.8), and Ornith-1.0-35B leads clearly on the SWE Atlas metrics. In the SWE Atlas - QnA category, Ornith-1.0-35B scored 37.1, about 2.8 times the score of Qwen3.5-35B (13.2) and about 1.8 times that of the 397B variant (20.4).
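The multiples quoted for SWE Atlas - QnA can be checked with a few lines of arithmetic. The scores below are the ones given in this article; the dictionary and the `ratio_vs` helper are names made up for this illustration.

```python
# SWE Atlas - QnA scores as quoted in the article.
SWE_ATLAS_QNA = {
    "Ornith-1.0-35B": 37.1,
    "Qwen3.5-35B": 13.2,
    "Qwen3.5-397B": 20.4,
}

def ratio_vs(baseline, model="Ornith-1.0-35B", scores=SWE_ATLAS_QNA):
    """Return the model's score as a multiple of the baseline's score."""
    return round(scores[model] / scores[baseline], 2)

for name in ("Qwen3.5-35B", "Qwen3.5-397B"):
    print(f"Ornith-1.0-35B vs {name}: {ratio_vs(name)}x")
# prints:
# Ornith-1.0-35B vs Qwen3.5-35B: 2.81x
# Ornith-1.0-35B vs Qwen3.5-397B: 1.82x
```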

These results suggest that the RL-based scaffold optimization provides a significant efficiency boost, allowing smaller Ornith models to compete with or exceed the performance of significantly larger traditional models. The consistency across Terminal-Bench, SWE-bench (Pro and Multilingual), and Claw-eval highlights the model's versatility in handling various programming languages and complex agentic tasks.

Industry Impact

The release of Ornith-1.0 represents a pivotal moment for the open-source AI community, particularly in the niche of agentic coding. By providing models that achieve state-of-the-art results on benchmarks like SWE-bench, deepreinforce-ai is narrowing the gap between open-source and proprietary coding assistants.

The use of the MIT license is particularly significant. It removes barriers to entry for developers and enterprises globally, allowing for the integration of high-performance coding agents into various workflows without the concerns of regional limitations or restrictive licensing fees. This could accelerate the development of autonomous software engineering tools, where AI agents can independently navigate repositories, fix bugs, and implement features.

Furthermore, the focus on "agentic" coding—where the model acts as an agent capable of using terminals and navigating complex file structures—moves the industry beyond simple code completion. Ornith-1.0's ability to optimize its own search trajectories suggests a future where AI models are not static but continuously improve their problem-solving methodologies through specialized training frameworks.

Frequently Asked Questions

Question: What makes Ornith-1.0 different from other coding models?

Ornith-1.0 distinguishes itself through its self-improving RL framework. Instead of just learning to produce code, it learns to optimize the "scaffold" or the search trajectory that leads to the code. This joint optimization results in higher-quality solutions and better performance on complex, multi-step agentic tasks compared to standard post-trained models.

Question: Which base models were used to develop Ornith-1.0?

Ornith-1.0 models are post-trained on top of two primary high-performance architectures: Gemma 4 and Qwen 3.5. This allows the Ornith suite to leverage the foundational strengths of these models while adding specialized agentic coding capabilities through reinforcement learning.

Question: Is Ornith-1.0 free to use for commercial purposes?

Yes. Ornith-1.0 is released under the MIT license. This means it is globally accessible, free from regional limitations, and can be used, modified, and distributed for both private and commercial projects without the restrictions often found in proprietary or more restrictive open-source licenses.
