Ollama v0.19

Ollama Powered by MLX: High-Performance AI Inference for Apple Silicon Macs

Introduction:

Discover the new Ollama powered by MLX, Apple's machine learning framework. This update brings unprecedented speeds to Apple Silicon, leveraging unified memory and the GPU Neural Accelerators on M5 chips. Featuring NVFP4 support for production-grade accuracy and an upgraded intelligent caching system, Ollama 0.19 optimizes coding agents like Claude Code and personal assistants like OpenClaw. Experience faster time to first token (TTFT) and higher token-generation speeds for demanding local AI workloads on macOS.

Added On:

2026-04-03

Ollama v0.19 Product Information

Ollama Powered by MLX: Accelerating AI on Apple Silicon

In a landmark update for local AI enthusiasts and developers, Ollama is now powered by MLX on Apple Silicon in a new preview release. This integration marks the fastest way to run Ollama on macOS, leveraging Apple’s dedicated machine learning framework to push the boundaries of performance. By utilizing the unique strengths of Apple’s hardware, Ollama provides an optimized environment for running complex language models with unprecedented speed and efficiency.

What's Ollama Powered by MLX?

Ollama powered by MLX is a specialized version of the Ollama inference engine designed specifically for Apple Silicon. MLX is Apple’s native machine learning framework, and by building Ollama on top of it, the platform can now take full advantage of the unified memory architecture found in Mac devices.

This update is particularly transformative for users of Apple’s latest hardware. On M5, M5 Pro, and M5 Max chips, Ollama leverages new GPU Neural Accelerators. These hardware improvements significantly accelerate both the time to first token (TTFT) and the overall generation speed (tokens per second). Whether you are running a personal assistant or a heavy-duty coding agent, Ollama powered by MLX ensures your local models respond with production-level agility.

Key Features of Ollama 0.19

MLX Acceleration and Performance

The core of this release is the integration with MLX. By moving to this framework, Ollama achieves a substantial speedup across all Apple Silicon devices. Testing conducted on March 29, 2026, showed that the Qwen3.5-35B-A3B model can reach a prefill speed of 1,851 tokens/s and a decode speed of 134 tokens/s when running with int4 quantization in Ollama 0.19.
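As a rough illustration of how prefill and decode rates like these are computed, the sketch below derives tokens per second from the timing fields that Ollama's HTTP API reports in its final response (prompt_eval_count, prompt_eval_duration, eval_count, eval_duration, with durations in nanoseconds). The response fragment here is reconstructed from the quoted rates, not measured data:

```python
# Sketch: deriving prefill/decode throughput from Ollama API timing fields.
# The field names match Ollama's /api/generate final response; durations
# are in nanoseconds. The sample values below are illustrative only.

def throughput(resp: dict) -> tuple[float, float]:
    """Return (prefill_tok_s, decode_tok_s) from an Ollama response dict."""
    ns = 1e9
    prefill = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / ns)
    decode = resp["eval_count"] / (resp["eval_duration"] / ns)
    return prefill, decode

# Illustrative response fragment (durations back-computed from the
# quoted ~1,851 tok/s prefill and ~134 tok/s decode figures):
resp = {
    "prompt_eval_count": 4096,
    "prompt_eval_duration": int(4096 / 1851 * 1e9),
    "eval_count": 512,
    "eval_duration": int(512 / 134 * 1e9),
}
prefill_tps, decode_tps = throughput(resp)
print(f"prefill: {prefill_tps:.0f} tok/s, decode: {decode_tps:.0f} tok/s")
```

The same arithmetic applies to any response from the local API, so it is an easy way to reproduce throughput comparisons on your own hardware.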

NVFP4 Support

Ollama now supports NVIDIA’s NVFP4, a 4-bit floating-point quantization format. This inclusion allows Ollama to:

  • Maintain high model accuracy while reducing memory bandwidth.
  • Lower storage requirements for intensive inference workloads.
  • Achieve production parity, allowing users to see the same results locally as they would in a scaled production environment.
  • Run models specifically optimized by NVIDIA’s model optimizer.
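To give a sense of the memory-bandwidth and storage savings, the back-of-the-envelope sketch below compares an FP16 footprint with NVFP4, assuming NVFP4's published layout of 4-bit values with one FP8 scale per 16-element block (this block size and scale format are assumptions about the format, and the parameter count is illustrative):

```python
# Back-of-the-envelope comparison of FP16 vs. NVFP4 weight storage.
# Assumes NVFP4 stores 4-bit values plus one 8-bit scale per 16-element
# block; parameter count is illustrative (~35B, as in Qwen3.5-35B-A3B).

def weight_bytes(n_params: int, bits_per_value: float,
                 block: int = 0, scale_bits: int = 0) -> float:
    """Bytes needed for n_params weights, with optional per-block scales."""
    bits = n_params * bits_per_value
    if block:
        bits += (n_params / block) * scale_bits  # scale overhead
    return bits / 8

n = 35_000_000_000
fp16 = weight_bytes(n, 16)
nvfp4 = weight_bytes(n, 4, block=16, scale_bits=8)

print(f"FP16:  {fp16 / 2**30:.1f} GiB")
print(f"NVFP4: {nvfp4 / 2**30:.1f} GiB ({fp16 / nvfp4:.2f}x smaller)")
```

The roughly 3.5x reduction is what makes a 35B-class model practical within the unified memory of a consumer Mac.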

Improved Caching System

Efficiency is at the heart of the new Ollama update. The upgraded cache makes agentic tasks much smoother through:

  • Lower Memory Utilization: Cache is now reused across different conversations, leading to more cache hits when using shared system prompts.
  • Intelligent Checkpoints: Ollama stores snapshots of the cache at strategic locations in the prompt, reducing processing time.
  • Smarter Eviction: Shared prefixes are retained longer in memory, even when older branches are dropped, ensuring faster responses for branching dialogues.
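The three ideas above can be sketched in a few lines. This is an illustrative toy, not Ollama's actual implementation: checkpoints are keyed by prompt prefix, a lookup reuses the longest matching prefix across conversations, and eviction drops the least-shared entry first so a common system prompt survives longest:

```python
# Toy sketch of the caching ideas described above (not Ollama's real code):
# prefix-keyed checkpoints, cross-conversation reuse, share-aware eviction.

class PrefixCache:
    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.checkpoints: dict[str, int] = {}  # prefix -> number of users

    def lookup(self, prompt: str) -> str:
        """Return the longest checkpointed prefix of `prompt` (may be '')."""
        best = ""
        for prefix in self.checkpoints:
            if prompt.startswith(prefix) and len(prefix) > len(best):
                best = prefix
        return best

    def checkpoint(self, prefix: str) -> None:
        """Snapshot a prefix; when full, evict the least-shared entry."""
        if prefix in self.checkpoints:
            self.checkpoints[prefix] += 1
            return
        if len(self.checkpoints) >= self.capacity:
            # Widely shared prefixes (e.g. system prompts) survive longest.
            victim = min(self.checkpoints, key=self.checkpoints.get)
            del self.checkpoints[victim]
        self.checkpoints[prefix] = 1

cache = PrefixCache()
system = "You are a coding assistant. "
cache.checkpoint(system)   # first conversation stores the system prompt
cache.checkpoint(system)   # second conversation reuses the same prefix
hit = cache.lookup(system + "Fix this bug: ...")
print(f"reusable prefix length: {len(hit)}")  # only the suffix is re-processed
```

In a real engine the checkpoints would hold KV-cache tensors rather than strings, but the reuse and eviction logic follows the same shape.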

Use Case Scenarios

Coding Agents

With the release of Ollama 0.19, coding agents like Claude Code, OpenCode, Codex, and Pi see a massive boost. The improved caching and MLX acceleration allow these tools to process large codebases and respond to complex queries almost instantaneously.

Personal Assistants

Personal assistants like OpenClaw benefit from the reduced latency. The faster TTFT ensures that interacting with an AI assistant feels natural and fluid, making it a viable tool for daily productivity on macOS.

Production Model Testing

Because Ollama now supports NVFP4, developers can test models locally with the confidence that the performance and accuracy will match the scale of NVIDIA-optimized production environments.

How to Use Ollama 0.19

To get started with the preview release of Ollama powered by MLX, ensure you are using a Mac with at least 32GB of unified memory.

  1. Download Ollama 0.19 from the official source.
  2. To launch specific agents or models, use the following commands in your terminal:
  • For Claude Code: ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4
  • For OpenClaw: ollama launch openclaw --model qwen3.5:35b-a3b-coding-nvfp4
  • To chat directly with the model: ollama run qwen3.5:35b-a3b-coding-nvfp4

Note: This version is specifically tuned for the Qwen3.5-35B-A3B model with sampling parameters optimized for coding tasks.
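Besides the CLI, a running Ollama instance exposes a local HTTP API, which is how coding agents typically talk to it. The sketch below builds a request body for the standard /api/chat endpoint using the model tag from the commands above; it only constructs the payload, which you would POST to http://localhost:11434/api/chat with any HTTP client:

```python
import json

# Sketch: building a chat request for Ollama's local HTTP API.
# The /api/chat endpoint and payload shape are standard Ollama API;
# the model tag is the one used in the launch commands above.

payload = {
    "model": "qwen3.5:35b-a3b-coding-nvfp4",
    "messages": [
        {"role": "user", "content": "Write a function that reverses a string."}
    ],
    "stream": False,  # request one complete JSON response, not a stream
}
body = json.dumps(payload)
print(body)
```

With the requests library, for example, requests.post("http://localhost:11434/api/chat", data=body) would send it to the local server.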

FAQ

What hardware is required for Ollama powered by MLX?

You need an Apple Silicon Mac (M1, M2, M3, M4, or M5). For the preview release of the Qwen3.5-35B model, it is recommended to have more than 32GB of unified memory.

What is the benefit of NVFP4 support in Ollama?

NVFP4 reduces memory and storage requirements without sacrificing model quality. It allows Ollama users to run models that are optimized by NVIDIA's tools and ensures that local results match production environment outputs.

Will more models be supported in the future?

Yes. While this release focuses on the Qwen3.5 architecture, the team is actively working to expand supported architectures and will introduce easier ways to import custom fine-tuned models into Ollama.

How does the new caching system work?

Ollama now reuses its cache across conversations and stores "intelligent checkpoints" in the prompt. This means if you use a shared system prompt (common in tools like Claude Code), Ollama doesn't have to re-process the entire prompt every time, leading to faster responses.
