Needle: Distilling Gemini 3.1 into a 26M Parameter Simple Attention Network for High-Speed Edge Device Tool Calling
Cactus Compute has unveiled Needle, a remarkably compact 26-million parameter model designed to bring Gemini 3.1's tool-calling capabilities to edge devices. Built on a "Simple Attention Network" architecture, Needle reaches 6000 tokens/sec prefill and 1200 tokens/sec decode on Cactus infrastructure. Despite its tiny footprint, it outperforms significantly larger models such as FunctionGemma-270m and Qwen-0.6B on single-shot function-calling tasks. The project is fully open source, providing weights and dataset generation tools. Needle is specifically optimized for consumer hardware such as smartwatches and glasses, and ships with a local finetuning UI that lets developers customize the model for specific tools on a standard PC or Mac without requiring massive compute resources.
Key Takeaways
- Extreme Efficiency: Needle is a 26-million parameter model distilled from Gemini 3.1, optimized for high-speed tool calling on consumer devices.
- Superior Performance: In single-shot function calling, Needle outperforms larger models including FunctionGemma-270m, Qwen-0.6B, Granite-350m, and LFM2.5-350m.
- High-Speed Inference: The model achieves prefill speeds of 6000 tokens/sec and decode speeds of 1200 tokens/sec when running on Cactus infrastructure.
- Open Source Accessibility: Weights and dataset generation scripts are fully open-sourced via the Cactus-Compute/needle GitHub repository.
- Local Customization: Includes a built-in web UI for local testing and finetuning on personal computers (Mac/PC), specifically targeting edge hardware like phones and wearables.
In-Depth Analysis
The Architecture of a 26M Parameter Simple Attention Network
Needle represents a significant shift in model design, moving away from the massive parameter counts of traditional LLMs toward a highly specialized "Simple Attention Network." The architecture is meticulously structured to maximize efficiency without sacrificing the logic required for tool calling. It uses a model dimension of d=512 with 8 attention heads (8H) and 4 key-value heads (4KV) for grouped-query attention, paired with a Byte Pair Encoding (BPE) vocabulary of 8192 tokens.
The model's internal structure is divided into a 12-layer encoder and an 8-layer decoder. The encoder processes the text query through ZCRMSNorm, Self-Attention with Grouped Query Attention (GQA) and Rotary Positional Embeddings (RoPE), and Gated Residual connections. Notably, the architecture omits the standard Feed-Forward Network (FFN) in certain sections to maintain its slim profile. The encoder's output is fed into the decoder via Cross-Attention, and each of the decoder's 8 layers features Masked Self-Attention, RoPE, and Gated Residuals. This encoder-decoder flow is specifically tuned to transform a text query into a precise tool call, using tied linear layers and shared embeddings to keep the parameter count at a lean 26 million.
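These figures can be sanity-checked with a back-of-envelope parameter count. The sketch below assumes standard projection shapes for grouped-query attention, no FFN, and a tied embedding matrix, and it ignores the tiny contribution of norm gains and residual gates; it is an inference from the article's numbers, not an official breakdown.

```python
# Rough parameter count for the reported Needle configuration:
# d=512, 8 query heads, 4 KV heads, 8192-token vocabulary,
# 12 encoder layers, 8 decoder layers, no FFN, tied embeddings.

D, VOCAB = 512, 8192
HEADS, KV_HEADS = 8, 4
HEAD_DIM = D // HEADS          # 64
KV_DIM = KV_HEADS * HEAD_DIM   # 256 (GQA shrinks the K/V projections)

# One attention block: Q and output projections are d x d;
# K and V projections are d x kv_dim under grouped-query attention.
attn = 2 * D * D + 2 * D * KV_DIM          # 786,432 per block

encoder = 12 * attn                        # self-attention only
decoder = 8 * 2 * attn                     # masked self-attn + cross-attn
embedding = VOCAB * D                      # tied input/output embedding

total = encoder + decoder + embedding
print(f"{total / 1e6:.1f}M parameters")    # prints "26.2M parameters"
```

The total lands almost exactly on the advertised 26M, which suggests the omitted FFNs are what makes the published figure possible.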
Training Methodology and Benchmarking Results
The development of Needle involved a two-stage training process that emphasizes data quality over sheer volume. The model was pre-trained on 16 TPU v6e units for 27 hours, consuming 200 billion tokens. Following this foundational phase, it underwent a focused post-training period of just 45 minutes, during which it was exposed to 2 billion tokens of a specialized single-shot function call dataset. This distillation from Gemini 3.1 allows the tiny model to inherit sophisticated tool-calling logic that would otherwise only emerge spontaneously in a much larger architecture.
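The article does not detail the distillation objective itself, but a common recipe is to train the student on temperature-softened teacher output distributions. A minimal, generic sketch of that loss follows; it illustrates standard knowledge distillation, not Cactus Compute's confirmed method.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(l / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    as in standard knowledge distillation. Zero when the student
    matches the teacher exactly."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# A perfectly matched student incurs (near-)zero loss:
assert abs(distill_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])) < 1e-9
```

In practice the teacher logits would come from Gemini 3.1's responses over the generated function-call dataset, with the 26M student trained to reproduce them.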
When benchmarked against other "tiny" models, Needle shows exceptional specialization. It beats FunctionGemma-270m, Qwen-0.6B, Granite-350m, and LFM2.5-350m specifically in the domain of single-shot function calls for personal AI. While the developers acknowledge that larger models (in the 270M to 600M range) possess more conversational scope and capacity, Needle’s focus on the specific task of tool calling makes it a superior choice for targeted applications where speed and local execution are paramount. In production environments on Cactus, the model demonstrates blistering speeds, reaching 6000 tokens/sec for prefill and 1200 tokens/sec for decoding, making it nearly instantaneous for the end-user.
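The reported throughput figures translate directly into per-request latency. A back-of-envelope estimate, ignoring scheduling and network overhead:

```python
PREFILL_TPS = 6000   # tokens/sec prefill, as reported on Cactus infrastructure
DECODE_TPS = 1200    # tokens/sec decode

def latency_ms(prompt_tokens, output_tokens):
    """End-to-end latency estimate in milliseconds: time to prefill the
    prompt plus time to decode the tool call, overheads ignored."""
    return 1000 * (prompt_tokens / PREFILL_TPS + output_tokens / DECODE_TPS)

# A 600-token prompt plus a 30-token tool call:
# 600/6000 + 30/1200 = 0.125 s
print(latency_ms(600, 30))  # prints 125.0
```

At roughly an eighth of a second for a realistic prompt, the "nearly instantaneous" characterization holds up for typical tool-call payloads.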
Redefining AI for Consumer Edge Devices
The primary objective of the Needle project is to redefine what is possible for AI on consumer devices such as smartphones, smartwatches, and augmented reality glasses. By keeping the parameter count at 26M, the model can reside in the limited memory of wearable tech while providing the "intelligence" needed to interface with device APIs via tool calls.
To facilitate developer adoption, Cactus Compute has included a "Quickstart" path and a local playground. By running a simple set of commands, developers can launch a web UI at http://127.0.0.1:7860. This interface allows for immediate testing of custom tools and provides a "click of a button" finetuning experience. This democratizes the ability to create specialized AI agents, as the finetuning can be performed locally on a standard Mac or PC, bypassing the need for expensive cloud GPU clusters. This local-first approach ensures that developers can iterate quickly on specific tool-calling schemas tailored to their unique hardware or application requirements.
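In practice, customizing a tool-calling model comes down to defining tool schemas and parsing the model's structured output. The snippet below uses a hypothetical `set_timer` tool for a smartwatch, written in the common JSON-Schema style; the exact schema format and output shape Needle's finetuning UI expects may differ.

```python
import json

# Hypothetical tool schema in the common JSON-Schema style.
SET_TIMER = {
    "name": "set_timer",
    "description": "Start a countdown timer on the device.",
    "parameters": {
        "type": "object",
        "properties": {
            "minutes": {"type": "integer", "description": "Duration in minutes"},
        },
        "required": ["minutes"],
    },
}

def parse_tool_call(raw: str) -> tuple:
    """Parse a model-emitted JSON tool call into (name, arguments)."""
    call = json.loads(raw)
    return call["name"], call["arguments"]

# A single-shot model turns "set a 5 minute timer" into one structured call:
name, args = parse_tool_call('{"name": "set_timer", "arguments": {"minutes": 5}}')
```

Finetuning then amounts to generating query/call pairs for schemas like this and letting the local UI adapt the model to them.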
Industry Impact
The release of Needle signals a growing trend in the AI industry toward "extreme distillation" and task-specific miniaturization. As the industry moves beyond the "bigger is better" phase, Needle demonstrates that high-quality distillation from frontier models like Gemini 3.1 can produce highly capable, specialized agents that are small enough to run on a watch. This has profound implications for privacy and latency, as tool-calling logic can now remain entirely on-device. Furthermore, by open-sourcing the weights and the dataset generation process, Cactus Compute is providing a blueprint for other developers to create hyper-efficient models, potentially accelerating the deployment of sophisticated AI features in the Internet of Things (IoT) and wearable sectors.
Frequently Asked Questions
Question: How does Needle achieve such high speeds compared to other models?
Needle's speed (6000 tokens/sec prefill) is a result of its extremely small parameter count (26M) and its "Simple Attention Network" architecture, which eliminates certain heavy components like standard FFNs in favor of a more streamlined encoder-decoder structure optimized for Cactus infrastructure.
Question: Can Needle handle general conversational tasks like ChatGPT?
While Needle is highly effective at single-shot function calling, the developers note that it is an experimental run geared toward tiny AI. Models with more parameters, such as Qwen-0.6B or FunctionGemma-270m, still perform better in general conversational settings due to their larger capacity and scope.
Question: What hardware is required to finetune Needle?
One of Needle's key features is its accessibility; it is designed to be finetuned locally on standard consumer hardware, including Macs and PCs. The provided web UI simplifies this process, allowing developers to adapt the model to their own tools without specialized AI servers.