Needle: Distilling Gemini 3.1 into a 26M Parameter Simple Attention Network for High-Speed Edge Device Tool Calling
Open Source, TinyML, LLM Distillation, Tool Calling

Cactus Compute has released Needle, a 26-million-parameter model distilled from Gemini 3.1 to bring tool calling to edge devices. Built on what the developers call a "Simple Attention Network" architecture, Needle reaches 6000 tokens/sec prefill and 1200 tokens/sec decode on Cactus infrastructure. Despite its size, it outperforms larger models such as FunctionGemma-270m and Qwen-0.6B on single-shot function calling tasks. The project is fully open source, with weights and dataset generation tools published. Needle targets consumer hardware such as smartwatches and glasses, and it ships with a local finetuning UI that lets developers adapt the model to their own tools on a standard PC or Mac without large compute resources.

Hacker News

Key Takeaways

  • Extreme Efficiency: Needle is a 26-million parameter model distilled from Gemini 3.1, optimized for high-speed tool calling on consumer devices.
  • Superior Performance: In single-shot function calling, Needle outperforms larger models including FunctionGemma-270m, Qwen-0.6B, Granite-350m, and LFM2.5-350m.
  • High-Speed Inference: The model achieves 6000 tokens/sec prefill and 1200 tokens/sec decode speeds when running on Cactus infrastructure.
  • Open Source Accessibility: Weights and dataset generation scripts are fully open-sourced via the Cactus-Compute/needle GitHub repository.
  • Local Customization: Includes a built-in web UI for local testing and finetuning on personal computers (Mac/PC), specifically targeting edge hardware like phones and wearables.

In-Depth Analysis

The Architecture of a 26M Parameter Simple Attention Network

Needle moves away from the large parameter counts of conventional LLMs toward a highly specialized "Simple Attention Network." The architecture is structured for efficiency while retaining the logic required for tool calling. It uses a model width of d=512 with 8 attention heads (8H) and 4 key-value heads (4KV), and a Byte Pair Encoding (BPE) vocabulary of 8192 tokens.
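To make those numbers concrete, here is a minimal sketch of the configuration in Python. The class name `NeedleConfig`, its field names, and the rule that consecutive query heads share a key-value head are illustrative assumptions, not taken from the Needle repository; only the four numeric values come from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeedleConfig:
    """Hypothetical container for the hyperparameters quoted in the article."""
    d_model: int = 512      # model width (d=512)
    n_heads: int = 8        # query heads (8H)
    n_kv_heads: int = 4     # key-value heads (4KV)
    vocab_size: int = 8192  # BPE vocabulary size

    @property
    def head_dim(self) -> int:
        # Assumption: the width is split evenly across query heads.
        return self.d_model // self.n_heads

    @property
    def group_size(self) -> int:
        # Number of query heads that share one key-value head under GQA.
        return self.n_heads // self.n_kv_heads

    def kv_head_for(self, q_head: int) -> int:
        # Assumption: consecutive query heads are grouped together.
        return q_head // self.group_size


cfg = NeedleConfig()
mapping = [cfg.kv_head_for(h) for h in range(cfg.n_heads)]
print(cfg.head_dim, cfg.group_size, mapping)
```

Under these assumptions each head is 64-dimensional and every pair of query heads reads from one key-value head, which halves the key and value projections relative to standard multi-head attention.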

The model is split into a 12-layer encoder and an 8-layer decoder. Each encoder layer processes the text query through ZCRMSNorm, Self-Attention with Grouped Query Attention (GQA) and Rotary Positional Embeddings (RoPE), and Gated Residual connections. Notably, the architecture omits the standard Feed-Forward Network (FFN) in certain sections to stay small. The encoder's output feeds the decoder via Cross-Attention, and the decoder layers use Masked Self-Attention, RoPE, and Gated Residuals. This encoder-decoder flow is tuned to transform a text query into a precise tool call, with tied linear layers and shared embeddings keeping the parameter count at 26 million.
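A rough parameter count can be sketched from the figures above. The estimate below rests on several assumptions that the article does not confirm: no layer contains an FFN, projections have no biases, cross-attention uses the same GQA shapes as self-attention, the output layer is tied to the shared embedding, and norm and gate parameters are small enough to ignore.

```python
# Figures quoted in the article.
D_MODEL, N_HEADS, N_KV_HEADS, VOCAB = 512, 8, 4, 8192
ENC_LAYERS, DEC_LAYERS = 12, 8

head_dim = D_MODEL // N_HEADS   # 64, assuming an even split
kv_dim = N_KV_HEADS * head_dim  # 256, width of the K and V projections


def attention_params() -> int:
    """Weights in one GQA attention block (assumed bias-free)."""
    q_proj = D_MODEL * D_MODEL
    k_proj = D_MODEL * kv_dim
    v_proj = D_MODEL * kv_dim
    o_proj = D_MODEL * D_MODEL
    return q_proj + k_proj + v_proj + o_proj


# Assumption: one self-attention block per encoder layer, and one
# self-attention plus one cross-attention block per decoder layer.
encoder = ENC_LAYERS * attention_params()
decoder = DEC_LAYERS * 2 * attention_params()
embedding = VOCAB * D_MODEL  # shared with the tied output layer

total = encoder + decoder + embedding
print(f"encoder={encoder:,} decoder={decoder:,} embedding={embedding:,}")
print(f"total={total:,}")
```

The estimate comes to 26,214,400 weights, close to the stated 26 million. That is consistent with an attention-only design but does not confirm it, since other layouts could reach the same total.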

Training Methodology and Benchmarking Results

Needle was trained in two stages. The model was first pre-trained on 16 TPU v6e units for 27 hours, consuming 200 billion tokens. It then underwent a focused post-training stage of just 45 minutes on 2 billion tokens from a specialized single-shot function call dataset. Distilling from Gemini 3.1 in this way lets the tiny model inherit tool-calling behavior that would typically emerge only in a much larger model.
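The training figures imply some throughput numbers that are easy to work out. The sketch below is plain arithmetic on the quoted values; the per-chip figure assumes the load was spread evenly across the 16 units, and the article does not say what hardware the post-training stage used.

```python
# Figures quoted in the article.
PRETRAIN_TOKENS = 200e9
PRETRAIN_HOURS = 27
TPU_UNITS = 16
POSTTRAIN_TOKENS = 2e9
POSTTRAIN_MINUTES = 45
PARAMS = 26e6

# Aggregate and per-unit pre-training throughput in tokens per second.
pretrain_tps = PRETRAIN_TOKENS / (PRETRAIN_HOURS * 3600)
per_unit_tps = pretrain_tps / TPU_UNITS  # assumes an even split

# Post-training throughput in tokens per second (hardware not stated).
posttrain_tps = POSTTRAIN_TOKENS / (POSTTRAIN_MINUTES * 60)

# Pre-training tokens seen per model parameter.
tokens_per_param = PRETRAIN_TOKENS / PARAMS

print(f"pre-training:  {pretrain_tps:,.0f} tokens/sec total")
print(f"per TPU unit:  {per_unit_tps:,.0f} tokens/sec")
print(f"post-training: {posttrain_tps:,.0f} tokens/sec")
print(f"tokens per parameter: {tokens_per_param:,.0f}")
```

This works out to roughly 2.06 million tokens/sec in aggregate during pre-training, and about 7,700 pre-training tokens per parameter.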

When benchmarked against other "tiny" models, Needle shows strong specialization. It beats FunctionGemma-270m, Qwen-0.6B, Granite-350m, and LFM2.5-350m specifically in single-shot function calls for personal AI. The developers acknowledge that these larger models (in the 270M to 600M range) have more conversational scope and capacity, but Needle's narrow focus on tool calling makes it a strong choice for targeted applications where speed and local execution matter most. Running on Cactus, the model reaches 6000 tokens/sec for prefill and 1200 tokens/sec for decoding, fast enough that a short tool call completes in a fraction of a second.
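The quoted speeds translate into a simple latency estimate. The sketch below assumes prefill and decode run back to back with no other overhead. The prompt and output sizes are hypothetical examples, not measurements from the project.

```python
# Throughput figures quoted in the article (on Cactus infrastructure).
PREFILL_TOKENS_PER_SEC = 6000
DECODE_TOKENS_PER_SEC = 1200


def estimate_latency_ms(prompt_tokens: int, output_tokens: int) -> float:
    """Estimated end-to-end latency, ignoring tokenization and I/O overhead."""
    prefill_s = prompt_tokens / PREFILL_TOKENS_PER_SEC
    decode_s = output_tokens / DECODE_TOKENS_PER_SEC
    return 1000 * (prefill_s + decode_s)


# Hypothetical request: 300 tokens of tool schemas plus query, 40-token call.
latency = estimate_latency_ms(prompt_tokens=300, output_tokens=40)
print(f"{latency:.1f} ms")
```

For that hypothetical request the estimate is about 83 ms, of which 50 ms is prefill. Real latency depends on the device, which may be far slower than the setup behind the quoted figures.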

Redefining AI for Consumer Edge Devices

The primary objective of the Needle project is to redefine what is possible for AI on consumer devices such as smartphones, smartwatches, and augmented reality glasses. By keeping the parameter count at 26M, the model can reside in the limited memory of wearable tech while providing the "intelligence" needed to interface with device APIs via tool calls.
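A back-of-the-envelope calculation shows why 26M parameters fits in wearable memory. The article does not state which numeric precision the released weights use, so the precisions listed below are assumptions for illustration. The figures cover weights only and exclude activations, the KV cache, and runtime overhead.

```python
PARAMS = 26_000_000  # parameter count quoted in the article

# Assumed storage formats; the shipped precision is not stated in the article.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}


def weight_megabytes(params: int, bytes_per_param: float) -> float:
    """Size of the weights alone, in megabytes (1 MB = 1e6 bytes)."""
    return params * bytes_per_param / 1e6


sizes = {name: weight_megabytes(PARAMS, b) for name, b in BYTES_PER_PARAM.items()}
for name, mb in sizes.items():
    print(f"{name:>8}: {mb:6.1f} MB")
```

Even at full 32-bit precision the weights come to about 104 MB, and 8-bit storage would bring that down to about 26 MB.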

To facilitate developer adoption, Cactus Compute has included a "Quickstart" path and a local playground. By running a simple set of commands, developers can launch a web UI at http://127.0.0.1:7860. This interface allows for immediate testing of custom tools and provides a "click of a button" finetuning experience. This democratizes the ability to create specialized AI agents, as the finetuning can be performed locally on a standard Mac or PC, bypassing the need for expensive cloud GPU clusters. This local-first approach ensures that developers can iterate quickly on specific tool-calling schemas tailored to their unique hardware or application requirements.
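When iterating on custom tools, it can help to check each model output against the tool definitions before executing anything. The sketch below is a generic validator, not Needle's API: the `TOOLS` registry, the tool names, and the `{"name": ..., "arguments": ...}` JSON shape are all hypothetical, and the actual output format should be taken from the repository.

```python
import json

# Hypothetical tool registry: tool name -> {argument name: expected type}.
TOOLS = {
    "set_timer": {"minutes": int, "label": str},
    "send_message": {"recipient": str, "body": str},
}


def validate_tool_call(raw: str) -> dict:
    """Parse a JSON tool call and check it against the registry."""
    call = json.loads(raw)
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("name")
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    expected = TOOLS[name]
    missing = expected.keys() - args.keys()
    extra = args.keys() - expected.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for key, expected_type in expected.items():
        if not isinstance(args[key], expected_type):
            raise ValueError(f"argument {key!r} must be {expected_type.__name__}")
    return call


# Hypothetical model output for the query "set a 10 minute tea timer".
output = '{"name": "set_timer", "arguments": {"minutes": 10, "label": "tea"}}'
call = validate_tool_call(output)
print(call["name"], call["arguments"])
```

Rejecting malformed calls at this boundary keeps a bad generation from reaching device APIs, and it gives a simple pass/fail signal when comparing finetuned checkpoints.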

Industry Impact

The release of Needle signals a growing trend in the AI industry toward "extreme distillation" and task-specific miniaturization. As the industry moves beyond the "bigger is better" phase, Needle demonstrates that high-quality distillation from frontier models like Gemini 3.1 can produce highly capable, specialized agents that are small enough to run on a watch. This has profound implications for privacy and latency, as tool-calling logic can now remain entirely on-device. Furthermore, by open-sourcing the weights and the dataset generation process, Cactus Compute is providing a blueprint for other developers to create hyper-efficient models, potentially accelerating the deployment of sophisticated AI features in the Internet of Things (IoT) and wearable sectors.

Frequently Asked Questions

Question: How does Needle achieve such high speeds compared to other models?

Needle's speed (6000 tokens/sec prefill, 1200 tokens/sec decode) is a result of its extremely small parameter count (26M) and its "Simple Attention Network" architecture, which eliminates certain heavy components like standard FFNs in favor of a more streamlined encoder-decoder structure optimized for Cactus infrastructure.

Question: Can Needle handle general conversational tasks like ChatGPT?

Not well. While Needle is highly effective at single-shot function calling, the developers note that it is an experimental run geared toward tiny AI. Models with more parameters, such as Qwen-0.6B or FunctionGemma-270m, remain better suited to general conversation because of their larger capacity and scope.

Question: What hardware is required to finetune Needle?

One of Needle's key features is its accessibility; it is designed to be finetuned locally on standard consumer hardware, including Macs and PCs. The provided web UI simplifies this process, allowing developers to adapt the model to their own tools without specialized AI servers.
