Technology · AI · Reinforcement Learning · Machine Learning

Google Cloud and UCLA Introduce Supervised Reinforcement Learning (SRL) to Empower Smaller AI Models with Advanced Multi-Step Reasoning Capabilities

Researchers from Google Cloud and UCLA have unveiled Supervised Reinforcement Learning (SRL), a novel reinforcement learning framework designed to significantly improve the ability of language models to tackle complex multi-step reasoning tasks. SRL reformulates problem-solving as a sequence of logical actions, providing rich learning signals during training. This approach allows smaller, more cost-effective models to master intricate problems previously beyond the reach of conventional training methods. Experiments demonstrate SRL's strong performance on mathematical reasoning benchmarks and its effective generalization to agentic software engineering tasks. Unlike traditional Reinforcement Learning with Verifiable Rewards (RLVR), which offers only sparse, outcome-based feedback, SRL provides granular feedback, addressing the learning bottleneck faced by models on difficult problems where correct solutions are rarely found within a limited number of attempts. This lets models learn from partially correct steps, enabling stronger reasoning in less expensive models.

VentureBeat

Researchers at Google Cloud and UCLA have proposed a new reinforcement learning framework called Supervised Reinforcement Learning (SRL) that significantly improves the ability of language models to learn very challenging multi-step reasoning tasks. SRL reformulates problem-solving as a sequence of logical “actions,” providing rich learning signals throughout the training process. This innovative approach enables smaller models to learn complex problems that were previously out of reach for other common training techniques.

Experiments have shown that SRL not only excels on math reasoning benchmarks but also generalizes effectively to agentic software engineering tasks. This highlights SRL's versatility as a training framework that can raise the reasoning abilities of smaller, less expensive models.

Recent advancements in training large language models (LLMs) for reasoning have largely been driven by reinforcement learning with verifiable rewards (RLVR). RLVR is a method where a model receives a reward based on the correctness of its final answer. Through repeated attempts to solve problems and receiving feedback on the final outcome, the model gradually learns effective problem-solving strategies.
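To make the outcome-based setup concrete, here is a minimal sketch of an RLVR-style reward. The reward function, the answer-extraction helper, and the "Answer:" response format are illustrative assumptions, not the researchers' actual implementation:

```python
# Minimal sketch of an RLVR-style outcome reward (illustrative, not the paper's code).
# The model only learns from whether its final answer matches the reference.

def extract_final_answer(rollout_text: str) -> str:
    """Hypothetical helper: pull the model's final answer out of its full response.

    Assumes the model ends its response with a line like "Answer: 42".
    """
    for line in reversed(rollout_text.strip().splitlines()):
        if line.lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return ""

def rlvr_reward(rollout_text: str, reference_answer: str) -> float:
    """Binary, outcome-only reward: 1.0 if the final answer is correct, else 0.0."""
    return 1.0 if extract_final_answer(rollout_text) == reference_answer.strip() else 0.0

# A rollout with correct intermediate steps but a wrong final answer
# still receives zero reward, so none of the correct work is credited.
rollout = "Step 1: 12 * 4 = 48\nStep 2: 48 + 5 = 53\nAnswer: 35"
print(rlvr_reward(rollout, "53"))  # 0.0
```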

However, the success of this outcome-based approach is contingent on the model's ability to discover a correct solution within a limited number of attempts, often referred to as "rollouts." Each rollout is computationally expensive, meaning models cannot attempt solutions indefinitely. This method encounters a significant limitation when problems are so difficult that the model rarely, if ever, finds the right answer within its allocated budget.
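As a back-of-the-envelope illustration of why this matters (the success rates and budgets below are hypothetical, not figures from the paper), when a model solves a hard problem on only a tiny fraction of rollouts, the chance of seeing any positive reward within a modest rollout budget stays close to zero:

```python
# Hypothetical illustration of the rollout-budget problem (not figures from the paper).
# If each rollout independently succeeds with probability p, the chance of at least
# one correct rollout in k attempts is 1 - (1 - p) ** k.

def p_any_success(p: float, k: int) -> float:
    return 1.0 - (1.0 - p) ** k

for p in (0.10, 0.01, 0.001):   # assumed per-rollout success rates
    for k in (8, 16):           # assumed rollout budgets per problem
        print(f"p={p:.3f}, k={k:2d} -> P(any correct rollout) = {p_any_success(p, k):.3f}")

# With p = 0.001 and k = 16, the probability is about 0.016: on most problems of
# this difficulty, every rollout is wrong and the model receives no learning signal.
```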

This creates a critical learning bottleneck. In many multi-step reasoning problems, a model might correctly solve several steps but then make a single mistake that leads to an incorrect final answer. With RLVR, this entire effort receives a negative reward, and the model learns nothing from its partially correct work. It operates as an all-or-nothing approach that fails to provide granular feedback and offers only sparse rewards, hindering learning on complex tasks.
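The contrast with a step-level signal can be sketched as follows. This is a simplified illustration in the spirit of SRL's granular feedback, not the actual SRL training objective; the step format and the prefix-matching scoring rule are assumptions made for the example:

```python
# Illustrative contrast between outcome-only and step-level feedback.
# Simplified sketch only; the actual SRL framework is described in the article
# at a high level and is more involved than this.

def outcome_reward(model_steps: list[str], expert_steps: list[str]) -> float:
    """All-or-nothing: reward only if every step (and hence the final answer) matches."""
    return 1.0 if model_steps == expert_steps else 0.0

def stepwise_reward(model_steps: list[str], expert_steps: list[str]) -> float:
    """Partial credit: fraction of expert steps the model reproduced before diverging."""
    correct = 0
    for model_step, expert_step in zip(model_steps, expert_steps):
        if model_step != expert_step:
            break
        correct += 1
    return correct / len(expert_steps)

expert = ["12 * 4 = 48", "48 + 5 = 53", "answer = 53"]
attempt = ["12 * 4 = 48", "48 + 5 = 53", "answer = 35"]  # one slip at the end

print(outcome_reward(attempt, expert))   # 0.0   -> the partially correct work earns nothing
print(stepwise_reward(attempt, expert))  # ~0.67 -> the two correct steps are credited
```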

Related News

Technology

Google Unveils Antigravity: A New AI-Powered Autonomous Platform for End-to-End Software Development, Integrating with Gemini 3 for Agentic Coding

Google has launched Antigravity, a novel platform designed for "AI agent-led development," moving beyond traditional IDEs. This autonomous agent collaboration system enables AI to independently plan, execute, and verify complete software development tasks. Deeply integrated with the Gemini 3 model, Antigravity represents Google's key product in "Agentic Coding." It addresses limitations of previous AI tools, which were primarily assistive and required manual operation and step-by-step human prompts. Antigravity allows AI to work across editors, terminals, and browsers, plan complex multi-step tasks, automatically execute actions via tool calls, and self-check results. It shifts the development paradigm from human-operated tools to AI-operated tools with human supervision and collaboration. The platform's core philosophy revolves around Trust, Autonomy, Feedback, and Self-Improvement, providing transparency into AI's decision-making, enabling autonomous cross-environment operations, facilitating real-time human feedback, and allowing AI to learn from past experiences.

Technology

Google Vids Unlocks Advanced AI Features for All Gmail Users: Free Access to AI Voiceovers, Redundancy Removal, and Image Editing

Google has made several advanced AI features in its Vids video editing platform available to all users with a Gmail account, previously exclusive to paid subscribers. These newly accessible tools include AI voiceovers, automatic removal of redundant speech, and AI image editing. The transcription trimming feature automatically eliminates filler words like "um" and "ah," along with long pauses, significantly enhancing video quality. Users can also generate professional-grade voiceovers from text scripts, choosing from seven different voice options, many of which sound natural. Additionally, the AI image editing tool allows for easy modifications such as background removal, descriptive editing, and transforming static photos into dynamic videos. Google aims to empower both beginners and experienced creators to produce high-quality video content, anticipating significant growth in the video editing market despite Vids being in its early stages.

Technology

Quora's Poe AI Platform Launches Group Chat Feature Supporting Up to 200 Users for Enhanced Collaborative AI Interactions

Quora has introduced a new group chat feature for its AI platform, Poe, allowing up to 200 users to collaborate with various AI models and bots in a single conversation. This innovation supports multi-modal interactions including text, image, video, and audio generation. The launch coincides with OpenAI's ChatGPT piloting similar group chat functionalities in select markets, signaling a shift in AI interaction methods. Quora highlights that this feature will offer new interactive experiences for AI users, such as family trip planning using Gemini 2.5 and o3 Deep Research, or team brainstorming with image models to create mood boards. Users can also engage in intellectual games with Q&A bots. Group chats can be created from Poe's homepage, with real-time synchronization across devices, ensuring seamless transitions between desktop and mobile. Quora developed this feature over six months and plans to optimize it based on user feedback, emphasizing the unexplored potential for group interaction and collaboration in AI. Poe also enables users to create and share custom bots.