DeepMind Unveils SIMA 2: A Gemini-Powered AI Agent Capable of Reasoning, Learning, and Playing in Diverse 3D Virtual Worlds, Advancing Towards Embodied AGI
DeepMind has launched SIMA 2, an advanced version of its Scalable Instructable Multiworld Agent, significantly evolving from its predecessor. While SIMA 1 could execute over 600 language instructions across various 3D virtual worlds by observing the screen and using a virtual keyboard and mouse, SIMA 2, powered by the Gemini large language model, goes beyond mere execution. It can now reason about user goals, explain its plans and thought processes, learn new behaviors, and generalize experiences across multiple virtual environments. This leap is driven by a Gemini-integrated core that combines language, vision, and reasoning, enabling SIMA 2 to understand high-level tasks, translate natural language into action plans, and explain its decisions in real time. Trained through human demonstrations and AI self-supervision, SIMA 2 demonstrates remarkable cross-game generalization, applying learned concepts to new tasks and operating in previously unseen commercial open-world games. It also supports multimodal instructions and can autonomously navigate and complete tasks in dynamically generated 3D worlds, using a self-improvement loop to keep learning without human feedback. DeepMind positions SIMA 2 as a significant step towards Embodied General Intelligence.
DeepMind has introduced SIMA 2, a new iteration of its Scalable Instructable Multiworld Agent, building on the foundation laid by SIMA 1. Last year, SIMA 1 demonstrated the ability to execute over 600 language instructions, such as "turn left," "open map," or "climb ladder," across multiple 3D virtual worlds. Its significance lay in proving that AI could interact with games the way humans do: by observing the screen and operating a virtual keyboard and mouse rather than directly accessing game APIs. SIMA 1, however, was primarily an "executor," mechanically following commands.
SIMA 2 represents a significant evolution, powered by the Gemini large language model. The core upgrade is that SIMA 2 not only executes tasks but also reasons about user goals, converses with users to explain its plans and thought processes, learns new behaviors, and generalizes experience across multiple worlds. DeepMind summarizes this advancement as a shift from an AI that merely "obeys" to one that "thinks."
At the heart of SIMA 2 is its deep integration with the Gemini large language model, which provides complex reasoning, semantic understanding, and long-term goal planning. This enables the agent to understand high-level task goals, translate natural language instructions into executable action plans, and explain its behavior and decision logic in real time. SIMA 2 can process varied inputs, including natural language, game screen images, visual symbols (such as emojis), and multiple languages. The model perceives its environment through screen-based observation, mimicking human visual perception rather than reading game engine data directly, which improves its generality and transferability across game environments. SIMA 2 understands not just "what to do" but also "why," and can articulate its reasoning.
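The loop this describes (observe pixels, reason over the goal with a language-and-vision core, act through virtual inputs) can be sketched roughly as below. All names here, SimaAgent, Action, run_episode, and the stubbed plan method, are hypothetical illustrations; DeepMind has not published SIMA 2's interface.

```python
# Minimal sketch of a screen-based perception-action loop.
# Hypothetical names throughout; not DeepMind's published API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Action:
    """A low-level control the agent can emit, mirroring human input."""
    kind: str      # "key", "mouse_move", or "mouse_click"
    payload: dict  # e.g. {"key": "w"} or {"x": 400, "y": 300}

class SimaAgent:
    """Hypothetical agent: sees only pixels, acts only via keyboard/mouse."""

    def __init__(self, goal: str):
        self.goal = goal  # high-level natural-language task

    def plan(self, frame: bytes) -> List[Action]:
        # In the real system a Gemini-based core would map (goal, frame)
        # to reasoning text plus an action plan; here we stub it out.
        return [Action("key", {"key": "w"}),
                Action("mouse_click", {"x": 400, "y": 300})]

def run_episode(agent: SimaAgent,
                capture_screen: Callable[[], bytes],
                send_input: Callable[[Action], None],
                steps: int = 3) -> None:
    """Drive the loop: observe pixels -> reason/plan -> emit virtual inputs."""
    for _ in range(steps):
        frame = capture_screen()          # screen pixels, no engine access
        for action in agent.plan(frame):  # language+vision reasoning step
            send_input(action)            # virtual keyboard/mouse only

if __name__ == "__main__":
    run_episode(SimaAgent("find the red house"), lambda: b"", print)
```

Because the agent touches only pixels and synthetic input events, the same loop can in principle run against any game that renders to a screen, which is the source of the transferability noted above.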
SIMA 2's training begins with human demonstration videos carrying language labels for basic behaviors; Gemini then automatically generates new annotations to expand the dataset. This process allows SIMA 2 to reason for itself and articulate action plans, fostering an interaction that feels more like a partnership than simply issuing commands to an AI assistant. SIMA 2 handles abstract concepts and logical commands by reasoning about its environment and the user's intentions.
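A rough sketch of that two-stage data pipeline follows: human demonstrations arrive already labeled, and a model-annotation step expands coverage. The Trajectory structure and label_with_gemini function are stand-ins; the actual annotation prompts and data formats are not public.

```python
# Sketch of the described two-stage training data pipeline (names hypothetical).
from dataclasses import dataclass
from typing import List

@dataclass
class Trajectory:
    frames: list      # recorded gameplay frames
    actions: list     # recorded keyboard/mouse actions
    instruction: str  # language label describing the behavior

def label_with_gemini(unlabeled: Trajectory) -> Trajectory:
    """Stand-in for Gemini auto-annotation: attach a language label
    to recorded gameplay. Real prompting details are not public."""
    unlabeled.instruction = "<model-generated description of the behavior>"
    return unlabeled

def build_training_set(human_demos: List[Trajectory],
                       raw_gameplay: List[Trajectory]) -> List[Trajectory]:
    # Stage 1: human demonstrations already carry language labels.
    dataset = list(human_demos)
    # Stage 2: expand with model-annotated gameplay to widen coverage.
    dataset += [label_with_gemini(t) for t in raw_gameplay]
    return dataset

demos = [Trajectory(frames=[], actions=[], instruction="open the map")]
raw = [Trajectory(frames=[], actions=[], instruction="")]
print(len(build_training_set(demos, raw)))  # -> 2
```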
A remarkable aspect of SIMA 2 is its cross-game generalization: it can quickly understand and complete tasks even in games it has never encountered before. For instance, it can transfer the concept of "mining" learned in one game to a "harvesting" task in another. It also comprehends longer, more complex multi-step instructions, performing exploration, construction, and collection in commercial open-world games such as ASKA (a Viking survival game), MineDojo (a research version of Minecraft), Valheim, and No Man's Sky without prior training in them.
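One way to picture the "mining to harvesting" transfer is routing different surface instructions to a shared abstract concept whose routine the agent already knows. The toy lookup below is purely illustrative; the real agent presumably relies on Gemini's learned semantics rather than tables.

```python
# Toy illustration of cross-game concept transfer (all names hypothetical):
# an unseen instruction maps to an abstract concept the agent already knows,
# and the concept's routine is reused in the new game.
CONCEPT_ROUTINES = {
    "extract_resource": ["approach target", "equip tool", "use tool repeatedly"],
    "traverse": ["face target", "move forward", "avoid obstacles"],
}

# Stand-in for learned semantic mapping; the real system would use
# Gemini's understanding, not a lookup table.
INSTRUCTION_TO_CONCEPT = {
    "mine the ore": "extract_resource",      # learned in game A
    "harvest the wheat": "extract_resource",  # generalizes in game B
}

def plan(instruction: str) -> list:
    concept = INSTRUCTION_TO_CONCEPT[instruction]
    return CONCEPT_ROUTINES[concept]

# The unseen "harvesting" task reuses the routine learned for "mining".
assert plan("harvest the wheat") == plan("mine the ore")
print(plan("harvest the wheat"))
```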
Furthermore, SIMA 2 supports multimodal instructions, recognizing and executing mixed commands that contain text, images, or symbols. For example, it can combine a text command like "Build a bridge across the river" with a reference image of a bridge, or treat emojis like 🏠 or 🌲 as building or resource indicators, integrating signals from different modalities into a unified task plan.
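Conceptually, such a mixed instruction can be modeled as one structure whose modalities are fused into a single plan. The MultimodalInstruction type and to_task_plan function below are hypothetical, shown only to make the idea concrete.

```python
# Sketch of normalizing mixed-modality instructions into one task plan.
# The structure is hypothetical, not SIMA 2's actual input format.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MultimodalInstruction:
    text: str                      # "Build a bridge across the river"
    image: Optional[bytes] = None  # e.g. a reference photo of a bridge
    symbols: List[str] = field(default_factory=list)  # e.g. ["🌲"] as a cue

def to_task_plan(instr: MultimodalInstruction) -> List[str]:
    """Fuse the modalities into one ordered plan (illustrative only)."""
    plan = [f"parse goal from text: {instr.text!r}"]
    if instr.image is not None:
        plan.append("extract target structure and appearance from image")
    for symbol in instr.symbols:
        plan.append(f"treat {symbol} as a resource or location indicator")
    plan.append("emit unified action sequence toward the goal")
    return plan

for step in to_task_plan(MultimodalInstruction(
        text="Build a bridge across the river", symbols=["🌲"])):
    print(step)
```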
In a surprising experiment, DeepMind connected SIMA 2 to Genie 3, a model capable of generating 3D worlds in real-time from text or images. SIMA 2 was then placed into these newly generated virtual worlds. The results showed that it could understand the environment structure, parse goals, plan reasonable paths, and autonomously complete tasks. Researchers describe this as SIMA 2 naturally learning to survive in entirely new worlds.
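The experiment amounts to composing two models: a world generator and an agent. A minimal sketch of that composition, with hypothetical stand-ins for both (neither Genie 3 nor SIMA 2 exposes a public API like this), might look as follows.

```python
# Sketch of composing a world model with an agent (hypothetical interfaces).
class WorldModel:
    """Stand-in for Genie 3: text prompt -> interactive 3D environment."""
    def generate(self, prompt: str) -> dict:
        return {"description": prompt, "frame": b""}  # placeholder world state

class Agent:
    """Stand-in for SIMA 2: observes frames, pursues a language goal."""
    def act(self, world: dict, goal: str) -> str:
        return f"navigate generated world ({world['description']}) to: {goal}"

world = WorldModel().generate("a foggy mountain village with a river")
print(Agent().act(world, "find shelter before nightfall"))
```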
SIMA 2's most notable feature is its capacity for self-improvement. During training it runs a continuous learning cycle: Gemini provides initial tasks and reward estimations; SIMA 2 executes the tasks and records its experiences; the system uses this self-generated data to retrain the next-generation model; and the new, stronger version generates further experiences, repeating the cycle. This mechanism allows SIMA 2 to improve without human feedback or additional game data, as demonstrated by its ability, after several generations of training, to complete tasks it initially failed.
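The cycle can be summarized in a few lines of Python. Every component below (propose_tasks, estimate_reward, retrain, and the scalar skill proxy) is a simplified stand-in for the model-based machinery the paragraph describes.

```python
# Sketch of the described self-improvement cycle: Gemini proposes tasks
# and scores attempts, the agent's experience becomes training data for
# its successor, and the loop repeats with no human feedback.
import random

def propose_tasks(n: int = 4) -> list:
    """Stand-in for Gemini generating tasks (and reward criteria)."""
    return [f"task-{i}" for i in range(n)]

def estimate_reward(task: str, skill: float) -> float:
    """Stand-in for Gemini scoring an attempt; real scoring is model-based."""
    return min(1.0, skill + random.uniform(-0.1, 0.1))

def retrain(skill: float, experiences: list) -> float:
    """Stand-in for training the next-generation agent on self-generated data."""
    avg_reward = sum(r for _, r in experiences) / len(experiences)
    return skill + 0.1 * avg_reward  # better data -> stronger successor

skill = 0.2  # crude scalar proxy for agent competence
for generation in range(5):
    experiences = [(t, estimate_reward(t, skill)) for t in propose_tasks()]
    skill = retrain(skill, experiences)
    print(f"generation {generation}: skill ~ {skill:.2f}")
```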
DeepMind emphasizes that SIMA 2 is more than a "game AI," viewing it as a crucial step towards Embodied General Intelligence (Embodied AGI): the real world is itself a complex, multi-task, dynamic, and interactive 3D environment. Official details are available on the DeepMind blog.