Needle: Distilling Gemini 3.1 into a 26M Parameter Simple Attention Network for High-Speed Edge Device Tool Calling
Open Source · TinyML · LLM Distillation · Tool Calling


Cactus Compute has unveiled Needle, a remarkably compact 26-million-parameter model designed to bring Gemini 3.1's tool-calling capabilities to edge devices. Built on a "Simple Attention Network" architecture, Needle reaches 6000 tokens/sec prefill and 1200 tokens/sec decode on Cactus infrastructure. Despite its tiny footprint, it outperforms significantly larger models such as FunctionGemma-270m and Qwen-0.6B on single-shot function-calling tasks. The project is fully open source, providing weights and dataset-generation tools. Needle is specifically optimized for consumer hardware such as smartwatches and glasses, and ships with a local finetuning UI that lets developers customize the model for specific tools on a standard PC or Mac, without requiring massive compute resources.

Hacker News

Key Takeaways

  • Extreme Efficiency: Needle is a 26-million parameter model distilled from Gemini 3.1, optimized for high-speed tool calling on consumer devices.
  • Superior Performance: In single-shot function calling, Needle outperforms larger models including FunctionGemma-270m, Qwen-0.6B, Granite-350m, and LFM2.5-350m.
  • High-Speed Inference: The model achieves 6000 tokens/sec prefill and 1200 tokens/sec decode speeds when running on Cactus infrastructure.
  • Open Source Accessibility: Weights and dataset generation scripts are fully open-sourced via the Cactus-Compute/needle GitHub repository.
  • Local Customization: Includes a built-in web UI for local testing and finetuning on personal computers (Mac/PC), specifically targeting edge hardware like phones and wearables.

In-Depth Analysis

The Architecture of a 26M Parameter Simple Attention Network

Needle represents a significant shift in model design, moving away from the massive parameter counts of general-purpose LLMs toward a highly specialized "Simple Attention Network." The architecture is structured to maximize efficiency without sacrificing the logic required for tool calling: a hidden width of d=512, 8 attention heads (8H) with 4 key-value heads (4KV), and a Byte Pair Encoding (BPE) vocabulary of 8,192 tokens.
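The stated hyperparameters can be collected into a small configuration sketch. This is purely illustrative: the field names below are my own assumptions, not identifiers from the Needle repository, which defines the authoritative configuration.

```python
from dataclasses import dataclass

# Hypothetical config mirroring the hyperparameters quoted in the article;
# field names are illustrative, not taken from the Needle repo.
@dataclass
class NeedleConfig:
    d_model: int = 512          # hidden width (d=512)
    n_heads: int = 8            # query heads (8H)
    n_kv_heads: int = 4         # key/value heads for GQA (4KV)
    vocab_size: int = 8192      # BPE vocabulary size
    n_encoder_layers: int = 12
    n_decoder_layers: int = 8

    @property
    def head_dim(self) -> int:
        # Per-head dimension, assuming d_model splits evenly across heads.
        return self.d_model // self.n_heads

cfg = NeedleConfig()
print(cfg.head_dim)                    # 64
print(cfg.n_heads // cfg.n_kv_heads)   # 2 query heads share each KV head
```

With these shapes, each of the 8 query heads is 64-dimensional, and GQA groups every 2 query heads onto one shared key-value head.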

The model's internal structure is divided into a 12-layer encoder and an 8-layer decoder. Each encoder layer processes the text query through ZCRMSNorm, self-attention with Grouped Query Attention (GQA) and Rotary Positional Embeddings (RoPE), and gated residual connections. Notably, the architecture omits the standard Feed-Forward Network (FFN) in certain sections to maintain its slim profile. The encoder's output feeds the decoder via cross-attention, and each decoder layer combines masked self-attention, RoPE, and gated residuals. This encoder-decoder flow is specifically tuned to transform a text query into a precise tool call, with tied linear layers and shared embeddings keeping the parameter count at a lean 26 million.
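A back-of-envelope tally shows how the stated shapes land near 26M parameters. The accounting below is my own estimate, not the authors': it assumes attention-only layers (no FFN anywhere), a cross-attention block in every decoder layer, a single tied embedding matrix, and ignores norm and gating parameters as negligible.

```python
# Back-of-envelope parameter count for the stated Needle shapes.
# Assumptions (mine, not the authors'): every layer is attention-only,
# each decoder layer carries both self- and cross-attention, the input
# embedding is tied to the output head, and norms/gates are negligible.
d, n_heads, n_kv_heads, vocab = 512, 8, 4, 8192
enc_layers, dec_layers = 12, 8

head_dim = d // n_heads            # 64
kv_dim = n_kv_heads * head_dim     # 256 with GQA

# One attention block: Q and O projections are d x d; K and V are d x kv_dim.
attn = 2 * d * d + 2 * d * kv_dim  # 786,432 params per block

encoder = enc_layers * attn              # self-attention only
decoder = dec_layers * (attn + attn)     # self- plus cross-attention
embedding = vocab * d                    # tied, so counted once

total = encoder + decoder + embedding
print(f"{total:,}")  # 26,214,400
```

Under those assumptions the count comes out to roughly 26.2M, right at the quoted figure, which suggests the FFN-free, tied-embedding design is indeed what keeps the model this small.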

Training Methodology and Benchmarking Results

The development of Needle involved a two-stage training process that emphasizes data quality over sheer volume. The model was pre-trained on 16 TPU v6e units for 27 hours, consuming 200 billion tokens. Following this foundational phase, it underwent a focused post-training period of just 45 minutes, during which it was exposed to 2 billion tokens of a specialized single-shot function-call dataset. This distillation from Gemini 3.1 lets the tiny model inherit sophisticated tool-calling logic that would otherwise only emerge spontaneously in much larger architectures.
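The quoted figures imply a rough aggregate training throughput. This is simple arithmetic on the stated numbers, not a throughput the authors report:

```python
# Implied pre-training throughput from the stated figures
# (200B tokens, 16 TPU v6e chips, 27 hours). Derived, not reported.
tokens = 200e9
chips = 16
hours = 27

seconds = hours * 3600
tokens_per_sec = tokens / seconds     # aggregate across all chips, ~2.06M
per_chip = tokens_per_sec / chips     # ~129k per chip

print(f"{tokens_per_sec:,.0f} tokens/sec total")
print(f"{per_chip:,.0f} tokens/sec per chip")
```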

When benchmarked against other "tiny" models, Needle shows exceptional specialization. It beats FunctionGemma-270m, Qwen-0.6B, Granite-350m, and LFM2.5-350m specifically in the domain of single-shot function calls for personal AI. While the developers acknowledge that larger models (in the 270M to 600M range) possess more conversational scope and capacity, Needle’s focus on the specific task of tool calling makes it a superior choice for targeted applications where speed and local execution are paramount. In production environments on Cactus, the model demonstrates blistering speeds, reaching 6000 tokens/sec for prefill and 1200 tokens/sec for decoding, making it nearly instantaneous for the end-user.
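Those speeds translate to near-imperceptible end-to-end latency for a single tool call. As an illustration (the 300-token prompt and 40-token output sizes below are my assumptions, not measured workloads):

```python
# Illustrative end-to-end latency at the quoted speeds; the prompt and
# output lengths are hypothetical, not measurements from Cactus.
prefill_tps, decode_tps = 6000, 1200
prompt_tokens, output_tokens = 300, 40   # assumed tool-call sizes

latency_ms = 1000 * (prompt_tokens / prefill_tps + output_tokens / decode_tps)
print(f"{latency_ms:.1f} ms")  # 83.3 ms
```

Even with a generous prompt, a complete tool call would finish in well under a tenth of a second, which is what makes the model feel instantaneous on-device.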

Redefining AI for Consumer Edge Devices

The primary objective of the Needle project is to redefine what is possible for AI on consumer devices such as smartphones, smartwatches, and augmented reality glasses. By keeping the parameter count at 26M, the model can reside in the limited memory of wearable tech while providing the "intelligence" needed to interface with device APIs via tool calls.

To facilitate developer adoption, Cactus Compute has included a "Quickstart" path and a local playground. By running a simple set of commands, developers can launch a web UI at http://127.0.0.1:7860. This interface allows for immediate testing of custom tools and provides a "click of a button" finetuning experience. This democratizes the ability to create specialized AI agents, as the finetuning can be performed locally on a standard Mac or PC, bypassing the need for expensive cloud GPU clusters. This local-first approach ensures that developers can iterate quickly on specific tool-calling schemas tailored to their unique hardware or application requirements.
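The summary does not specify the schema the finetuning UI accepts for custom tools. As a hypothetical example only, a tool definition in the widely used JSON-schema style might look like the following; the names and structure are assumptions, and the Needle repository's actual format may differ.

```python
import json

# Hypothetical tool definition in the common JSON-schema style;
# the actual format expected by the Needle finetuning UI may differ.
set_timer_tool = {
    "name": "set_timer",
    "description": "Start a countdown timer on the watch.",
    "parameters": {
        "type": "object",
        "properties": {
            "duration_seconds": {
                "type": "integer",
                "description": "Timer length in seconds.",
            },
            "label": {
                "type": "string",
                "description": "Optional timer name.",
            },
        },
        "required": ["duration_seconds"],
    },
}

print(json.dumps(set_timer_tool, indent=2))
```

A wearable-focused model like Needle would be finetuned on pairs mapping natural queries ("set a five minute tea timer") to structured calls against schemas like this one.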

Industry Impact

The release of Needle signals a growing trend in the AI industry toward "extreme distillation" and task-specific miniaturization. As the industry moves beyond the "bigger is better" phase, Needle demonstrates that high-quality distillation from frontier models like Gemini 3.1 can produce highly capable, specialized agents that are small enough to run on a watch. This has profound implications for privacy and latency, as tool-calling logic can now remain entirely on-device. Furthermore, by open-sourcing the weights and the dataset generation process, Cactus Compute is providing a blueprint for other developers to create hyper-efficient models, potentially accelerating the deployment of sophisticated AI features in the Internet of Things (IoT) and wearable sectors.

Frequently Asked Questions

Question: How does Needle achieve such high speeds compared to other models?

Needle's speed (6,000 tokens/sec prefill) stems from its extremely small parameter count (26M) and its "Simple Attention Network" architecture, which drops heavy components such as standard FFNs in favor of a streamlined encoder-decoder structure optimized for Cactus infrastructure.

Question: Can Needle handle general conversational tasks like ChatGPT?

While Needle is highly effective at single-shot function calling, the developers note that it is an experimental run geared toward tiny AI. Models with more parameters, such as Qwen-0.6B or FunctionGemma-270m, remain stronger in general conversational settings thanks to their larger capacity and scope.

Question: What hardware is required to finetune Needle?

One of Needle's key features is its accessibility; it is designed to be finetuned locally on standard consumer hardware, including Macs and PCs. The provided web UI simplifies this process, allowing developers to adapt the model to their own tools without specialized AI servers.

Related News

PlayCanvas Releases SuperSplat: A Specialized 3D Gaussian Splatting Editor on GitHub
Open Source


PlayCanvas has officially released SuperSplat, an innovative open-source editor dedicated to 3D Gaussian Splatting. Emerging as a trending project on GitHub, SuperSplat provides a specialized environment for manipulating and refining 3D Gaussian Splat data. Developed by the team at PlayCanvas, this tool addresses the growing need for accessible editing suites in the rapidly evolving field of neural radiance fields and point-cloud-based reconstructions. By offering a dedicated interface for 'splat' editing, SuperSplat aims to streamline the workflow for developers and 3D artists working with high-fidelity 3D captures. The project's availability on GitHub marks a significant contribution to the open-source graphics community, providing a foundation for further innovation in web-based and real-time 3D visualization.

9router: An Open-Source Solution for Unlimited Free AI Programming with Multi-Provider Integration and Token Optimization
Open Source


9router, a new open-source project hosted on GitHub by developer decolua, offers a comprehensive solution for developers seeking unlimited free AI programming capabilities. The tool acts as a bridge, connecting popular AI coding assistants—including Claude Code, Codex, Cursor, Cline, Copilot, and Antigravity—to a network of over 40 providers offering free access to Claude, GPT, and Gemini models. By implementing automatic fallback mechanisms and utilizing RTK technology to achieve a 40% reduction in token consumption, 9router ensures that users can maintain continuous workflows without hitting usage limits. This project represents a significant shift in the accessibility of high-performance Large Language Models (LLMs) for the global developer community, focusing on cost-efficiency and reliability through intelligent routing and data optimization.

Bytedance Releases UI-TARS-desktop: An Open-Source Multimodal AI Agent Technology Stack for Desktop Infrastructure
Open Source


Bytedance has introduced UI-TARS-desktop, a new open-source multimodal AI agent technology stack that has recently gained traction on GitHub Trending. The project is designed to serve as a critical bridge between frontier AI models and the infrastructure required to support intelligent agents. By focusing on multimodal capabilities, UI-TARS-desktop aims to provide a framework for developing agents that can operate within desktop environments. This release highlights Bytedance's commitment to open-source AI development and addresses the industry's need for standardized tools to connect advanced models with practical, agentic applications. The project emphasizes the integration of cutting-edge AI with the foundational systems necessary for real-world deployment.