Technology · AI · Innovation · Gaming AI

DeepMind Unveils SIMA 2: A Gemini-Powered AI Agent Capable of Reasoning, Learning, and Playing in Diverse 3D Virtual Worlds, Advancing Towards Embodied AGI

DeepMind has launched SIMA 2, an advanced version of its Scalable Instructable Multiworld Agent. While SIMA 1 could execute over 600 language instructions across various 3D virtual worlds by observing the screen and using a virtual keyboard and mouse, SIMA 2, powered by the Gemini large language model, goes beyond mere execution. It can now reason about user goals, explain its plans and thought processes, learn new behaviors, and generalize experience across multiple virtual environments. This leap is driven by a Gemini-integrated core that combines language, vision, and reasoning, enabling SIMA 2 to understand high-level tasks, translate natural language into action plans, and explain its decisions in real time. Trained on human demonstrations and AI self-supervision, SIMA 2 demonstrates strong cross-game generalization, applying learned concepts to new tasks and operating in previously unseen commercial open-world games. It also supports multimodal instructions, can autonomously navigate and complete tasks in dynamically generated 3D worlds, and runs a self-improvement loop that continues learning without human feedback. DeepMind positions SIMA 2 as a significant step towards embodied general intelligence.

Xiaohu.AI Daily

DeepMind has introduced SIMA 2, a new iteration of its Scalable Instructable Multiworld Agent, building upon the foundation laid by SIMA 1. Last year, SIMA 1 demonstrated the ability to execute over 600 language instructions, such as "turn left," "open map," or "climb ladder," across multiple 3D virtual worlds. Its significance lay in proving that AI could interact with games like humans, by observing the screen and using virtual keyboard and mouse operations, rather than directly accessing game APIs. However, SIMA 1 was primarily an "executor," mechanically following commands.

SIMA 2 represents a significant evolution, powered by the Gemini large language model. The core upgrade is that SIMA 2 can now not only execute tasks but also reason about user goals, converse to explain its plans and thought processes, learn new behaviors, and generalize experiences across multiple worlds. DeepMind summarizes this advancement as a shift from an AI that merely "obeys" to one that "thinks."

At the heart of SIMA 2 is its deep integration with the Gemini large language model, providing it with complex reasoning, semantic understanding, and long-term goal planning capabilities. This enables the agent to understand high-level task goals, translate natural language instructions into executable action plans, and explain its behavior and decision logic in real-time. SIMA 2 can process various input forms, including natural language, game screen images, visual symbols (like emojis), and multi-language inputs. The model perceives its environment through "screen-based observation," mimicking human perception via visual input without direct access to game engine data, which enhances its universality and transferability across different game environments. SIMA 2 understands not just "what to do," but also "why," and can articulate its reasoning.
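The screen-in, actions-out loop described above can be sketched in a few lines. This is purely an illustrative toy, not DeepMind's architecture or API: every class, function, and the placeholder scene caption below are invented to show the shape of "observe pixels, reason with a language model, emit keyboard/mouse events".

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # "key" or "mouse"
    payload: str   # e.g. "W", "click:ladder"

def plan_actions(goal: str, scene_description: str) -> list[Action]:
    """Stand-in for the Gemini-powered planner: map a natural-language
    goal plus a description of the observed screen to low-level inputs."""
    if "ladder" in goal and "ladder" in scene_description:
        return [Action("mouse", "click:ladder"), Action("key", "W")]
    return [Action("key", "W")]  # default: walk forward and explore

def agent_step(goal: str, screen_pixels: bytes) -> list[Action]:
    # 1. Perceive: a vision model would caption the raw frame here.
    scene = "a ladder against the tower wall"  # placeholder caption
    # 2. Reason and act: translate the goal into keyboard/mouse events.
    return plan_actions(goal, scene)

actions = agent_step("climb ladder", screen_pixels=b"")
print([a.payload for a in actions])  # the planned input sequence
```

The key design point the sketch mirrors is that nothing reads game-engine state: the only inputs are the goal string and (in a real system) the screen pixels, which is what makes the approach transferable across games.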

SIMA 2's training involves a combination of human demonstration videos with language labels for basic behaviors, followed by Gemini automatically generating new data annotations for expansion. This process allows SIMA 2 to self-reason and express action plans, fostering an interaction that feels more like a partnership than simply giving commands to an AI assistant. SIMA 2 explains abstract concepts and logical commands by reasoning about its environment and user intentions.
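The two-stage data pipeline above (hand-labelled human demonstrations, then model-generated annotations to expand the set) can be sketched as follows. The clip names, labels, and the `auto_label` stand-in for Gemini are all hypothetical.

```python
# Stage 1: human demonstrations carry hand-written language labels.
human_demos = [
    {"clip": "demo_001", "label": "chop the tree with the axe"},
    {"clip": "demo_002", "label": "open the map"},
]

# Stage 2: unlabelled gameplay clips get model-generated annotations.
unlabelled_clips = ["clip_101", "clip_102"]

def auto_label(clip_id: str) -> str:
    """Stand-in for Gemini generating a new annotation for a clip."""
    return f"auto-generated description for {clip_id}"

# The training set is the union of both sources.
training_set = list(human_demos) + [
    {"clip": c, "label": auto_label(c)} for c in unlabelled_clips
]
print(len(training_set))  # human-labelled plus auto-labelled examples
```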

A remarkable aspect of SIMA 2 is its cross-game generalization. It can quickly understand and complete tasks even in games it has never encountered before, transferring, for instance, the concept of "mining" learned in one game to a "harvesting" task in another. It can also follow longer, more complex multi-step instructions and perform exploration, construction, and collection in commercial open-world games it was never trained on, such as ASKA (a Viking survival game), MineDojo (a research version of Minecraft), Valheim, and No Man's Sky.

Furthermore, SIMA 2 supports multimodal instruction understanding, recognizing and executing mixed instructions containing text, images, or symbols. For example, it can combine a text command like "Build a bridge across the river" with an image of a bridge or emojis like 🏠 or 🌲 as building or resource indicators, integrating different modal signals into a unified task plan.
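Fusing mixed-modality signals into a single plan can be illustrated with a toy parser. The emoji-to-concept table and the plan format below are invented for the example; a real system would use learned embeddings, not a lookup table.

```python
# Hypothetical mapping from emoji indicators to task concepts.
EMOJI_CONCEPTS = {"🏠": "build:house", "🌲": "gather:wood"}

def parse_instruction(text: str, symbols: str = "") -> list[str]:
    """Merge a text command and emoji indicators into one task plan."""
    plan = []
    if text:
        plan.append(f"goal:{text}")
    for ch in symbols:
        if ch in EMOJI_CONCEPTS:
            plan.append(EMOJI_CONCEPTS[ch])
    return plan

print(parse_instruction("Build a bridge across the river", "🌲"))
# ['goal:Build a bridge across the river', 'gather:wood']
```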

In a surprising experiment, DeepMind connected SIMA 2 to Genie 3, a model capable of generating 3D worlds in real-time from text or images. SIMA 2 was then placed into these newly generated virtual worlds. The results showed that it could understand the environment structure, parse goals, plan reasonable paths, and autonomously complete tasks. Researchers describe this as SIMA 2 naturally learning to survive in entirely new worlds.

SIMA 2's most notable feature is its capacity for self-improvement. During training, it runs a continuous learning cycle: Gemini proposes initial tasks and estimates rewards; SIMA 2 executes the tasks and records its experiences; the system retrains the next-generation model on this self-generated data; and the new, stronger version generates further experience, repeating the cycle. This lets SIMA 2 improve without human feedback or additional game data: after several generations of training, it completed tasks it had initially failed.
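The generational cycle above can be sketched as a small loop. Everything numeric here is a placeholder: the skill value, reward rule, and training update are invented to show the structure (propose, execute, record, retrain, repeat), not DeepMind's actual training procedure.

```python
def propose_task(generation: int) -> str:
    return f"task-{generation}"          # Gemini-style task proposal

def estimate_reward(skill: float) -> float:
    return min(1.0, skill)               # Gemini-style reward estimate

def train_on(attempts: list[float], skill: float) -> float:
    # Retraining on self-generated data nudges skill toward the best
    # recorded attempt: the "stronger next generation".
    return skill + 0.5 * (max(attempts) - skill)

skill = 0.2            # generation-0 competence (arbitrary starting point)
history = []
for generation in range(5):
    task = propose_task(generation)
    # Execute the task a few times and record the estimated rewards.
    attempts = [estimate_reward(skill + 0.1 * i) for i in range(3)]
    # Retrain the next generation on its own recorded experience.
    skill = train_on(attempts, skill)
    history.append(round(skill, 3))
print(history)  # skill rises across generations without human feedback
```

The loop is deliberately closed: no step consumes human labels or external game data, which is what the article means by improvement "without human feedback".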

DeepMind emphasizes that SIMA 2's significance extends beyond a mere "game AI," viewing it as a crucial step towards "Embodied General Intelligence (Embodied AGI)." This is because the real world is inherently a complex, multi-task, dynamic, and interactive 3D environment. Official details are available on the DeepMind blog.

Related News

Technology

Qwen-Edit-2509-Multi-angle Lighting LoRA Released by Qwen for Enhanced Image Editing Capabilities

Qwen has announced the release of 'Qwen-Edit-2509-Multi-angle lighting LoRA,' a LoRA (Low-Rank Adaptation) model designed to enhance image editing. The announcement was made on Twitter by @Qwen. The model, credited to '大雄' and associated with @Ali_TongyiLab, is available for download from Hugging Face at https://huggingface.co/dx8152/Qwen-Edit-2509-Multi-Angle-Lighting.

Technology

Elon Musk Announces 'Just Grok 4': AI Demonstrates Emergent Intelligence by Redesigning Edison Lightbulb Filament

Elon Musk, via Twitter, announced 'This is just Grok 4,' highlighting a significant advancement in AI. The announcement followed a demonstration in which Grok analyzed Thomas Edison's 1890 lightbulb patent, then determined and implemented a superior filament design that successfully lit a bulb. This emergent intelligence, described as unique among current AI models, has been noted for its potential to revolutionize education and enable robots to build physical objects.

Technology

Saudi AI Startup Humain Unveils 'Humain One' AI Operating System, Revolutionizing Computer Interaction with Natural Language Commands

Saudi Arabian AI startup Humain has officially launched 'Humain One,' a new AI operating system, at the 9th Future Investment Initiative conference in Riyadh. This system aims to replace traditional icon-based operating systems like Windows and macOS, allowing users to interact with computers through natural language commands to complete various tasks. Humain CEO Tariq Amin stated that the company is redefining enterprise computing by creating an AI partner that understands user goals, anticipates needs, and autonomously executes tasks. The operating system, driven by Humain's agent orchestration engine and powered by the Arabic-centric language model 'Allam,' is designed to enhance productivity and creativity across enterprise roles. This launch aligns with Saudi Arabia's accelerated push for AI development, aiming for a leading global market position. Humain, established in May by the country's sovereign wealth fund with Crown Prince Mohammed bin Salman as chairman, is part of Saudi Arabia's 'Vision 2030' to become a 'global AI powerhouse.'