Technology · AI · Reinforcement Learning · Machine Learning

Google Cloud and UCLA Introduce Supervised Reinforcement Learning (SRL) to Empower Smaller AI Models with Advanced Multi-Step Reasoning Capabilities

Researchers from Google Cloud and UCLA have unveiled Supervised Reinforcement Learning (SRL), a novel reinforcement learning framework designed to significantly improve language models' ability to tackle complex multi-step reasoning tasks. SRL reframes problem-solving as a sequence of logical actions, providing rich learning signals throughout training. This allows smaller, more cost-effective models to master intricate problems previously beyond the reach of conventional training methods. Experiments demonstrate SRL's superior performance on mathematical reasoning benchmarks and its effective generalization to agentic software engineering tasks. Unlike traditional reinforcement learning with verifiable rewards (RLVR), which offers only sparse, outcome-based feedback, SRL provides granular, step-level feedback. This addresses the learning bottleneck models face on difficult problems, where correct solutions are rarely found within a limited number of attempts, and lets models learn from partially correct steps.

VentureBeat

Researchers at Google Cloud and UCLA have proposed a new reinforcement learning framework called Supervised Reinforcement Learning (SRL) that significantly improves the ability of language models to learn very challenging multi-step reasoning tasks. SRL reformulates problem-solving as a sequence of logical “actions,” providing rich learning signals throughout the training process. This innovative approach enables smaller models to learn complex problems that were previously out of reach for other common training techniques.
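To make the reformulation concrete, here is a minimal sketch in Python of what step-wise supervision can look like. The line-by-line decomposition and the similarity-based reward (via `difflib.SequenceMatcher`) are illustrative assumptions, not the authors' exact implementation.

```python
import difflib

def decompose_solution(solution: str) -> list[str]:
    """Illustrative: split a solution into a sequence of logical
    'actions' (here, simply one action per non-empty line)."""
    return [step.strip() for step in solution.splitlines() if step.strip()]

def step_reward(model_action: str, expert_action: str) -> float:
    """Illustrative dense reward: similarity in [0, 1] between the
    model's action and the expert's action at the same step."""
    return difflib.SequenceMatcher(None, model_action, expert_action).ratio()

# Each step yields its own learning signal, so a trajectory that is
# only partially correct still produces useful (nonzero) rewards.
expert = "Expand (x+1)^2 to x^2 + 2x + 1\nSet x^2 + 2x + 1 = 4\nSolve: x = 1 or x = -3"
model = "Expand (x+1)^2 to x^2 + 2x + 1\nSet x^2 + 2x + 1 = 4\nSolve: x = 1"
rewards = [step_reward(m, e) for m, e in zip(decompose_solution(model),
                                             decompose_solution(expert))]
print(rewards)  # full credit on the correct early steps, partial credit on the last
```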

Experiments have shown that SRL not only excels on math reasoning benchmarks but also generalizes effectively to agentic software engineering tasks. This highlights SRL's versatility as a training framework capable of elevating smaller and less expensive models to higher reasoning abilities.

Recent advancements in training large language models (LLMs) for reasoning have largely been driven by reinforcement learning with verifiable rewards (RLVR). RLVR is a method where a model receives a reward based on the correctness of its final answer. Through repeated attempts to solve problems and receiving feedback on the final outcome, the model gradually learns effective problem-solving strategies.
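In code, an RLVR-style reward reduces to a binary check on the final answer. The sketch below is a simplification: `extract_final_answer` is a hypothetical helper, and real verifiers normalize answers rather than comparing raw strings.

```python
def extract_final_answer(completion: str) -> str:
    """Hypothetical helper: pull the final answer out of the model's
    full completion (e.g., the text after the last 'Answer:')."""
    return completion.rsplit("Answer:", 1)[-1].strip()

def rlvr_reward(completion: str, reference_answer: str) -> float:
    """Outcome-based verifiable reward: 1.0 if the final answer matches
    the reference, 0.0 otherwise. Intermediate steps are never
    inspected, so correct partial work earns nothing."""
    return 1.0 if extract_final_answer(completion) == reference_answer else 0.0
```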

However, the success of this outcome-based approach is contingent on the model's ability to discover a correct solution within a limited number of attempts, often referred to as "rollouts." Each rollout is computationally expensive, meaning models cannot attempt solutions indefinitely. This method encounters a significant limitation when problems are so difficult that the model rarely, if ever, finds the right answer within its allocated budget.
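A rough illustration of why this budget matters: if none of the K sampled rollouts verifies, every reward in the batch is zero and the policy update carries no information about that problem. The sampler below is a stand-in assumption, not an actual model call.

```python
import random

def sample_rollouts(problem: str, k: int, p_solve: float) -> list[float]:
    """Stand-in for K expensive model rollouts on one problem.
    p_solve approximates how often the model finds a verified answer."""
    return [1.0 if random.random() < p_solve else 0.0 for _ in range(k)]

# On a hard problem the per-rollout success rate is tiny, so with a
# realistic budget (e.g., K = 8) all rewards are usually 0.0 and the
# policy update learns nothing about this problem.
rewards = sample_rollouts("hard olympiad problem", k=8, p_solve=0.01)
print(rewards, "-> usable signal:", any(r > 0 for r in rewards))
```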

This creates a critical learning bottleneck. In many multi-step reasoning problems, a model might correctly solve several steps but then make a single mistake that leads to an incorrect final answer. With RLVR, this entire effort receives a negative reward, and the model learns nothing from its partially correct work. It operates as an all-or-nothing approach that fails to provide granular feedback and offers only sparse rewards, hindering learning on complex tasks.
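To put a number on the bottleneck, suppose a model gets four of five reasoning steps right but the final answer is wrong. Under an outcome-only reward the whole attempt scores zero, while a step-level scheme still credits the correct work. The per-step scores below are hypothetical.

```python
step_correct = [1.0, 1.0, 1.0, 1.0, 0.0]  # four good steps, one fatal mistake

# Simplifying assumption: the final answer is correct only if every step is.
outcome_reward = 1.0 if all(step_correct) else 0.0        # RLVR: all-or-nothing
stepwise_reward = sum(step_correct) / len(step_correct)   # SRL-style: partial credit

print(outcome_reward)   # 0.0 -> the model learns nothing from this attempt
print(stepwise_reward)  # 0.8 -> the correct steps still produce a learning signal
```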

Related News

Technology

NVIDIA Earth-2 and CorrDiff Achieve 50x Speedup in Weather Prediction with Gen AI Super-Resolution for Scalable AI Models

Generative AI super-resolution is significantly accelerating weather prediction, achieving a 50x speedup through the integration of NVIDIA Earth-2 and CorrDiff. This advancement enables the development of low-compute, scalable AI models, leading to faster training times and the capability for real-time predictions. The technology promises to revolutionize how weather forecasts are generated and delivered, making them more efficient and accessible.

Technology

New Foundational AI Model Leverages Supercomputing for Early Detection of Rare Cancers from 3D Medical Imaging Data

A new foundational AI model, developed by a team at TU/e using the SPIKE-1 supercomputer, can adapt to identify early signs of rare cancers. Medical imaging generates vast amounts of 3D data that are difficult to analyze comprehensively for disease detection, particularly for rare cancer types. Using SPIKE-1, which offers roughly 100 times the computing power of its predecessor, the team built a versatile AI model trained on more than 250,000 CT scans. This innovation aims to enable faster and more accurate cancer detection. TU/e is also releasing these state-of-the-art tools as open source to foster global collaboration and significantly advance rare cancer research and healthcare innovation worldwide.

Technology

Meta Tech Podcast Explores How Open Hardware and AI Drive Environmental Sustainability, Featuring OCP Summit 2025 Announcements and Net Zero Goals

The latest Meta Tech Podcast episode features Pascal Hartig, Dharmesh, and Lisa discussing the environmental benefits of open-source software and the emerging field of open hardware. The discussion highlights Meta's key announcements from the 2025 Open Compute Project (OCP) Summit, including a new open methodology utilizing AI to analyze Scope 3 emissions. The podcast delves into OCP's history and its growth to over 400 contributing companies. Listeners will learn how AI and open hardware are instrumental in Meta's pursuit of net-zero emissions by 2030, specifically mentioning AI's role in developing innovative concrete mixes for data center construction. The episode is available on Spotify, Apple Podcasts, and Pocket Casts.