Soul Player C64: Implementing a Real 25,000 Parameter Transformer on a 1 MHz Commodore 64
Soul Player C64 is a groundbreaking project that brings modern AI architecture to vintage hardware. It features a 2-layer decoder-only transformer, the same architecture family that powers ChatGPT and Claude, running on an unmodified 1 MHz Commodore 64. Implemented in hand-written 6502/6510 assembly, the model uses ~25,000 int8 parameters and fits entirely on a floppy disk. Despite these constraints, it performs real multi-head causal self-attention, softmax, and RMSNorm. A key technical breakthrough in softmax score normalization allows the model to produce meaningful attention weights on 8-bit hardware. While processing takes approximately 60 seconds per token, the project demonstrates that the fundamental principles of Large Language Models can be scaled down to the most constrained computing environments.
Key Takeaways
- Modern Architecture on Retro Hardware: A real 2-layer decoder-only transformer running on an unmodified 1 MHz Commodore 64.
- Technical Specifications: Features ~25,000 int8 parameters, 4 attention heads, and 32-dimensional embeddings, all written in 6502/6510 assembly.
- Mathematical Breakthrough: Solved integer-based attention issues by adjusting softmax score normalization (shifting by 14 bits instead of 17) to provide sufficient dynamic range.
- User Experience: The model processes at a rate of roughly 60 seconds per token, signaling progress via flashing borders and SID chip audio blips.
- Customizable Training: Users can train their own models using a Python-based pipeline and deploy them via .d64 floppy disk images.
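The specifications above roughly account for the quoted parameter budget. The following back-of-the-envelope sketch tallies a plausible breakdown; the vocabulary size, context length, and untied output head are assumptions for illustration, not published figures.

```python
# Rough parameter budget for the specs above (2 layers, 4 heads,
# 32-dim embeddings, 64 FFN hidden units). VOCAB and CTX are
# hypothetical values, not figures from the project.
VOCAB = 40        # assumed: lowercase letters, space, basic punctuation
D, FFN, LAYERS = 32, 64, 2
CTX = 64          # assumed context length

tok_emb = VOCAB * D              # token embedding table
pos_emb = CTX * D                # positional embeddings
attn = 4 * D * D                 # Wq, Wk, Wv, Wo per layer
ffn = 2 * D * FFN                # up- and down-projection per layer
norms = 2 * D                    # two RMSNorm gain vectors per layer
per_layer = attn + ffn + norms
head = VOCAB * D                 # output projection (assumed untied)

total = tok_emb + pos_emb + LAYERS * per_layer + D + head
print(total)  # lands on the same order of magnitude as the quoted ~25,000
```

With these assumed sizes the tally comes to roughly 21,000 weights; the real model's ~25,000 presumably reflects its actual vocabulary, context length, and any extra tables.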
In-Depth Analysis
Architecture and Assembly Implementation
Soul Player C64 represents a significant feat in low-level programming. By implementing a decoder-only transformer—the standard architecture for modern LLMs—entirely in hand-written 6502/6510 assembly, the developer has bypassed the need for modern operating systems or high-level abstractions. The model consists of 2 layers with 4 attention heads each, 32-dimensional embeddings, and 64 hidden units in the Feed-Forward Network (FFN). To fit within the C64's memory and processing constraints, the ~25,000 parameters are quantized to int8 with per-tensor shift scaling. This allows the entire system, including the model weights and the inference engine, to reside on a single floppy disk.
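The per-tensor shift scaling mentioned above can be sketched as follows. This is a hedged illustration of the general idea, not the project's actual code: restricting each tensor's scale to a power of two means the 6502 can rescale values with cheap arithmetic shifts instead of multiplications.

```python
import numpy as np

def quantize_shift(w: np.ndarray):
    """Quantize a float tensor to int8 with a power-of-two scale.

    A sketch of per-tensor shift scaling: instead of an arbitrary float
    scale, the scale is 2**shift, so dequantization (or rescaling of
    int products) reduces to a bit shift on the 6502.
    """
    max_abs = np.max(np.abs(w))
    # Largest shift such that w * 2**shift still fits in [-127, 127].
    shift = int(np.floor(np.log2(127.0 / max_abs)))
    q = np.clip(np.round(w * (2.0 ** shift)), -127, 127).astype(np.int8)
    return q, shift

def dequantize_shift(q: np.ndarray, shift: int) -> np.ndarray:
    return q.astype(np.float32) / (2.0 ** shift)

w = np.array([0.5, -0.25, 0.03, -0.9], dtype=np.float32)
q, shift = quantize_shift(w)
err = np.max(np.abs(dequantize_shift(q, shift) - w))
```

The worst-case rounding error is half a quantization step, i.e. `0.5 / 2**shift`, which is the trade the project accepts in exchange for multiply-free dequantization.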
Overcoming Integer Constraints
A critical challenge in porting transformers to 8-bit hardware is the precision of mathematical operations, particularly the softmax function. The developer identified that standard normalization led to uniform attention scores, effectively making the model "blind." The breakthrough involved fixing the softmax score normalization by shifting attention scores by 14 bits rather than 17. The smaller shift gave the 128-entry exponent lookup table enough dynamic range to produce meaningful attention weights, proving that complex transformer mathematics can be successfully approximated using integer arithmetic on a 1 MHz processor.
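The failure mode and the fix can be demonstrated with a small integer-softmax sketch. The 128-entry table size and the 14-versus-17-bit shift come from the project's description; the table's resolution, the fixed-point layout of the raw scores, and the output scale are illustrative assumptions.

```python
import numpy as np

# 128-entry exponent lookup table: LUT[i] ≈ 255 * exp(-i / T).
# T and the 0-255 output scale are assumptions for illustration.
T = 16.0
LUT = np.round(255.0 * np.exp(-np.arange(128) / T)).astype(np.int32)

def int_softmax(scores: np.ndarray, shift: int) -> np.ndarray:
    """Integer softmax over raw fixed-point attention scores.

    Scores are normalized by an arithmetic right shift; subtracting
    from the maximum turns each entry into a non-negative LUT index.
    If `shift` is too large, every score collapses toward the same
    index and the weights come out nearly uniform -- the "blind
    attention" failure described above.
    """
    s = scores >> shift                       # numpy shift is arithmetic
    idx = np.clip(s.max() - s, 0, 127)        # index 0 = largest score
    e = LUT[idx]
    return (e * 256) // max(int(e.sum()), 1)  # weights scaled to ~[0, 256)

raw = np.array([900_000, 300_000, -200_000, 50_000], dtype=np.int64)
sharp = int_softmax(raw, 14)  # 14-bit shift: clearly differentiated weights
flat = int_softmax(raw, 17)   # 17-bit shift: weights bunch toward uniform
```

With the 14-bit shift the top score dominates; with the 17-bit shift the same inputs produce nearly flat weights, matching the "blind" behavior the developer diagnosed.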
Performance and Interaction
Operating the Soul Player C64 is a slow but authentic experience. Running at approximately 60 seconds per token, the Commodore 64 provides visual and auditory feedback during the inference process: the screen border flashes while the processor "thinks," and the SID chip emits a blip for every token generated. The model supports lowercase letters, spaces, and basic punctuation. While the speed is a far cry from modern GPU-accelerated AI, the project serves as a functional proof of concept for the portability of transformer logic.
Industry Impact
The Soul Player C64 project highlights how far transformer architectures can be scaled down. It demonstrates that the core logic of modern AI is not inherently tied to massive clusters or high-precision floating-point units, but can be distilled into fundamental assembly instructions. For the AI industry, this underscores the potential of extreme quantization and optimization, suggesting that LLM-like capabilities could eventually be embedded in highly constrained IoT devices or legacy industrial systems. It also serves as an educational milestone, demystifying the "magic" of transformers by showing their operation at the most basic level of computing.
Frequently Asked Questions
Question: How fast does the model generate text?
Each token takes approximately 60 seconds to process. A full response typically takes several minutes to complete on the 1 MHz hardware.
Question: Can I train my own model for the Commodore 64?
Yes. The project includes a training pipeline using Python, NumPy, and Torch. Users can create a corpus in a specific <SEP> format, train the model, and then build a floppy disk image (.d64) to run on the C64 or an emulator.
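Corpus preparation might look like the following sketch. Only the `<SEP>` separator and the supported character set (lowercase letters, spaces, basic punctuation) come from the source; the example texts, file name, and exact formatting rules are hypothetical.

```python
# Hypothetical corpus-preparation sketch: training examples joined by
# the <SEP> marker the project's pipeline expects. The file name
# "corpus.txt" and the whitespace handling are assumptions.
examples = [
    "hello world",
    "the quick brown fox",
    "vintage computing lives on",
]

# Stay within the character set the C64 model supports:
# lowercase letters, spaces, and basic punctuation.
allowed = set("abcdefghijklmnopqrstuvwxyz .,!?'")
assert all(set(e) <= allowed for e in examples)

corpus = "<SEP>".join(examples)
with open("corpus.txt", "w") as f:  # hypothetical file name
    f.write(corpus)
```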
Question: What are the hardware requirements?
It runs on an unmodified Commodore 64. For those without physical hardware, the VICE emulator is recommended for loading the soulplayer.d64 disk image.
