Technology · AI · Innovation · Gaming AI

DeepMind Unveils SIMA 2: A Gemini-Powered AI Agent Capable of Reasoning, Learning, and Playing in Diverse 3D Virtual Worlds, Advancing Towards Embodied AGI

DeepMind has launched SIMA 2, an advanced version of its Scalable Instructable Multiworld Agent and a significant evolution of its predecessor. While SIMA 1 could execute over 600 language instructions across various 3D virtual worlds by observing the screen and using a virtual keyboard and mouse, SIMA 2, powered by the Gemini large language model, goes beyond mere execution. It can now reason about user goals, explain its plans and thought processes, learn new behaviors, and generalize experience across multiple virtual environments. This leap is driven by a Gemini-integrated core that combines language, vision, and reasoning, enabling SIMA 2 to understand high-level tasks, translate natural language into action plans, and explain its decisions in real time. Trained on human demonstrations and AI self-supervision, SIMA 2 demonstrates remarkable cross-game generalization, applying learned concepts to new tasks and operating in previously unseen commercial open-world games. It also supports multimodal instructions and can autonomously navigate and complete tasks in dynamically generated 3D worlds, with a self-improvement loop that enables continuous learning without human feedback. DeepMind positions SIMA 2 as a significant step towards Embodied General Intelligence.

Xiaohu.AI Daily

DeepMind has introduced SIMA 2, a new iteration of its Scalable Instructable Multiworld Agent, building on the foundation laid by SIMA 1. Last year, SIMA 1 demonstrated the ability to execute over 600 language instructions, such as "turn left," "open map," or "climb ladder," across multiple 3D virtual worlds. Its significance lay in proving that AI could interact with games the way humans do: by observing the screen and operating a virtual keyboard and mouse, rather than accessing game APIs directly. However, SIMA 1 was primarily an "executor," mechanically following commands.

SIMA 2 represents a significant evolution, powered by the Gemini large language model. The core upgrade is that SIMA 2 can not only execute tasks but also reason about user goals, converse with users to explain its plans and thought processes, learn new behaviors, and generalize experience across multiple worlds. DeepMind summarizes this advancement as a shift from an AI that merely "obeys" to one that "thinks."

At the heart of SIMA 2 is its deep integration with the Gemini large language model, which provides complex reasoning, semantic understanding, and long-term goal planning. This enables the agent to understand high-level task goals, translate natural-language instructions into executable action plans, and explain its behavior and decision logic in real time. SIMA 2 can process various input forms, including natural language, game screen images, visual symbols (such as emojis), and multiple languages. The model perceives its environment through screen-based observation, mimicking human visual perception without direct access to game-engine data, which makes it more general and transferable across game environments. SIMA 2 understands not just "what to do" but also "why," and can articulate its reasoning.
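
To make this concrete, here is a minimal sketch of what such a perceive-reason-act loop could look like, assuming it reduces to: capture the screen, ask the model for a rationale and a plan, then emit keyboard and mouse events. The function names (capture_screen, gemini_reason, send_input) are illustrative stand-ins, not DeepMind's actual API:

```python
# A minimal sketch of a SIMA-style perceive-reason-act loop.
# All names here are illustrative stand-ins, not DeepMind's actual API.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str        # e.g. "key_press" or "mouse_click"
    argument: str    # e.g. "W" or "640,360"

def capture_screen() -> bytes:
    # Stand-in: a real agent would grab the rendered game frame here.
    return b"<raw RGB frame>"

def gemini_reason(goal: str, frame: bytes, history: list) -> tuple:
    # Stand-in for the Gemini core: maps (goal, pixels, history) to a
    # natural-language rationale plus a short plan of low-level inputs.
    rationale = f"To '{goal}', I will walk forward and interact."
    plan = [Action("key_press", "W"), Action("key_press", "E")]
    return rationale, plan

def send_input(action: Action) -> None:
    # Stand-in: would emit a virtual keyboard/mouse event to the game.
    print(f"  -> {action.kind}({action.argument})")

def agent_step(goal: str, history: list) -> None:
    frame = capture_screen()                # perceive: screen pixels only
    rationale, plan = gemini_reason(goal, frame, history)
    print("Agent explains:", rationale)     # the agent can articulate "why"
    for action in plan:                     # act: human-like inputs
        send_input(action)
        history.append(action)

agent_step("open the map", history=[])
```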

SIMA 2 is trained on a combination of human demonstration videos with language labels for basic behaviors, after which Gemini automatically generates new annotations to expand the dataset. This process allows SIMA 2 to reason about and express its own action plans, making interaction feel more like a partnership than simply issuing commands to an AI assistant. SIMA 2 handles abstract concepts and logical commands by reasoning about its environment and the user's intentions.
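
As a rough illustration, the bootstrapping described above might be sketched as a two-stage pipeline: a small seed of human-labeled demonstrations, expanded by model-generated annotations. Everything below (gemini_auto_label, the trajectory dictionaries) is hypothetical:

```python
# A toy sketch of the two-stage labeling pipeline described above,
# assuming it reduces to: (1) human demos with language labels, then
# (2) a model auto-annotating extra trajectories. All names hypothetical.

def gemini_auto_label(trajectory: dict) -> str:
    # Stand-in for Gemini generating a language annotation for a clip.
    return f"demonstration of {trajectory['behavior']}"

# Stage 1: a small seed of human demonstration videos with labels.
human_demos = [
    {"frames": "...", "behavior": "climbing", "label": "climb the ladder"},
]

# Stage 2: expand the dataset by auto-labeling unlabeled trajectories.
unlabeled = [
    {"frames": "...", "behavior": "mining"},
    {"frames": "...", "behavior": "building"},
]
training_set = human_demos + [
    {**traj, "label": gemini_auto_label(traj)} for traj in unlabeled
]

for example in training_set:
    print(example["label"])
```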

A remarkable aspect of SIMA 2 is its cross-game generalization. It can quickly understand and complete tasks even in games it has never encountered: for instance, it can transfer the concept of "mining" learned in one game to a "harvesting" task in another. It can also follow longer, more complex multi-step instructions, performing exploration, construction, and collection in commercial open-world games such as ASKA (a Viking survival game), MineDojo (a Minecraft-based research environment), Valheim, and No Man's Sky, all without prior training.

Furthermore, SIMA 2 supports multimodal instruction understanding, recognizing and executing mixed instructions containing text, images, or symbols. For example, it can combine a text command like "Build a bridge across the river" with an image of a bridge, or emojis like 🏠 or 🌲 as building or resource indicators, integrating signals from the different modalities into a unified task plan.
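
A minimal sketch of how such mixed-modality signals might be folded into one plan follows; the Instruction container and the fuse_instruction helper are hypothetical stand-ins for the model's grounding step:

```python
# A sketch of fusing mixed-modality instructions into one task plan.
# The Instruction container and fuse_instruction() are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Instruction:
    text: str = ""       # e.g. "Build a bridge across the river"
    image: bytes = b""   # e.g. a reference photo of a bridge
    symbols: list = field(default_factory=list)  # e.g. ["🌲", "🏠"]

def fuse_instruction(instr: Instruction) -> list:
    # Stand-in: a real agent would let the multimodal model ground each
    # signal (text, pixels, emoji) into one coherent plan.
    plan = []
    if instr.text:
        plan.append(f"parse goal from text: {instr.text!r}")
    if instr.image:
        plan.append("use the reference image as the build target")
    for symbol in instr.symbols:
        plan.append(f"treat {symbol} as a resource or location marker")
    return plan

instruction = Instruction(text="Build a bridge across the river",
                          image=b"<bridge.png>", symbols=["🌲"])
for step in fuse_instruction(instruction):
    print("-", step)
```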

In a surprising experiment, DeepMind connected SIMA 2 to Genie 3, a model capable of generating 3D worlds in real-time from text or images. SIMA 2 was then placed into these newly generated virtual worlds. The results showed that it could understand the environment structure, parse goals, plan reasonable paths, and autonomously complete tasks. Researchers describe this as SIMA 2 naturally learning to survive in entirely new worlds.

SIMA 2's most notable feature is its capacity for self-improvement. During training, it runs a continuous learning cycle: Gemini proposes initial tasks and reward estimates; SIMA 2 executes the tasks and records its experiences; the system uses this self-generated data to retrain the next-generation model; and the new, stronger version then generates further experiences, repeating the cycle. This self-improvement mechanism lets SIMA 2 raise its performance without human feedback or additional game data, as demonstrated by its ability, after several generations of training, to complete tasks it initially failed.
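
The loop shape can be sketched in a few lines. Only the cycle itself (propose tasks, act, retrain on the self-generated data, repeat) follows the article; the skill model and numbers below are invented for illustration:

```python
# A sketch of the self-improvement cycle described above. The reward
# numbers and skill model are invented; only the loop shape
# (propose tasks -> act -> retrain -> repeat) follows the text.
import random

def propose_tasks(generation: int) -> list:
    # Stand-in for Gemini proposing tasks (with reward estimates).
    return [f"gen{generation}-task-{i}" for i in range(3)]

def attempt(skill: float, task: str) -> dict:
    # Stand-in rollout: success becomes likelier as the model improves.
    return {"task": task, "success": random.random() < skill}

def retrain(skill: float, experiences: list) -> float:
    # Stand-in: each successful self-generated trajectory nudges the
    # next-generation model upward; no human feedback is involved.
    wins = sum(e["success"] for e in experiences)
    return min(1.0, skill + 0.1 * wins)

skill = 0.2
for generation in range(5):
    experiences = [attempt(skill, t) for t in propose_tasks(generation)]
    skill = retrain(skill, experiences)
    print(f"generation {generation}: estimated skill {skill:.2f}")
```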

DeepMind emphasizes that SIMA 2's significance extends beyond a mere "game AI," viewing it as a crucial step towards "Embodied General Intelligence (Embodied AGI)." This is because the real world is inherently a complex, multi-task, dynamic, and interactive 3D environment. Official details are available on the DeepMind blog.

Related News

Technology

Google Unveils Antigravity: A New AI-Powered Autonomous Platform for End-to-End Software Development, Integrating with Gemini 3 for Agentic Coding

Google has launched Antigravity, a novel platform designed for "AI agent-led development," moving beyond traditional IDEs. This autonomous agent collaboration system enables AI to independently plan, execute, and verify complete software development tasks. Deeply integrated with the Gemini 3 model, Antigravity represents Google's key product in "Agentic Coding." It addresses limitations of previous AI tools, which were primarily assistive and required manual operation and step-by-step human prompts. Antigravity allows AI to work across editors, terminals, and browsers, plan complex multi-step tasks, automatically execute actions via tool calls, and self-check results. It shifts the development paradigm from human-operated tools to AI-operated tools with human supervision and collaboration. The platform's core philosophy revolves around Trust, Autonomy, Feedback, and Self-Improvement, providing transparency into AI's decision-making, enabling autonomous cross-environment operations, facilitating real-time human feedback, and allowing AI to learn from past experiences.
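
As a rough sketch, such an agent reduces to a plan-execute-verify loop; the tool names and self-check policy below are assumptions for illustration, not Google's actual design:

```python
# A minimal sketch of the plan-execute-verify loop the article attributes
# to Antigravity. Tool names and the self-check policy are assumptions.

def plan(task: str) -> list:
    # Stand-in for the model decomposing a task into tool calls.
    return [("editor", "write a failing test"),
            ("editor", "implement the feature"),
            ("terminal", "run the test suite")]

def execute(tool: str, action: str) -> str:
    # Stand-in for dispatching a call to the editor/terminal/browser.
    print(f"[{tool}] {action}")
    return "tests passed" if action.startswith("run") else "ok"

def verify(result: str) -> bool:
    # Stand-in self-check: the agent inspects its own results.
    return result in ("ok", "tests passed")

task = "add input validation to the signup form"
for tool, action in plan(task):
    if not verify(execute(tool, action)):
        print("self-check failed; the agent would replan here")
        break
```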

Technology

Google Vids Unlocks Advanced AI Features for All Gmail Users: Free Access to AI Voiceovers, Redundancy Removal, and Image Editing

Google has made several advanced AI features in its Vids video editing platform, previously exclusive to paid subscribers, available to all users with a Gmail account. These newly accessible tools include AI voiceovers, automatic removal of redundant speech, and AI image editing. The transcription trimming feature automatically eliminates filler words like "um" and "ah," along with long pauses, noticeably improving video quality. Users can also generate professional-grade voiceovers from text scripts, choosing from seven voice options, many of which sound natural. Additionally, the AI image editing tool allows easy modifications such as background removal, description-based edits, and turning static photos into dynamic videos. Google aims to empower both beginners and experienced creators to produce high-quality video content, and anticipates significant growth in the video editing market even though Vids is still in its early stages.

Technology

Quora's Poe AI Platform Launches Group Chat Feature Supporting Up to 200 Users for Enhanced Collaborative AI Interactions

Quora has introduced a new group chat feature for its AI platform, Poe, allowing up to 200 users to collaborate with various AI models and bots in a single conversation. The feature supports multimodal interactions, including text, image, video, and audio generation. The launch coincides with OpenAI piloting similar group chat functionality for ChatGPT in select markets, signaling a shift in how people interact with AI. Quora highlights the new interactive experiences the feature enables, such as planning a family trip with Gemini 2.5 and o3 Deep Research, team brainstorming with image models to create mood boards, or engaging in quiz-style games with Q&A bots. Group chats can be created from Poe's homepage and synchronize in real time across devices, ensuring seamless transitions between desktop and mobile. Quora developed the feature over six months and plans to refine it based on user feedback, emphasizing the untapped potential of group interaction and collaboration in AI products. Poe also lets users create and share custom bots.