Which is faster, MLX or llama.cpp? How large is the performance gap?

MLX is 3%–12% faster in most benchmark scenarios, primarily due to kernel fusion and a shorter Metal kernel dispatch path (480 vs 160 dispatches for a 7B model). However, at batch=1 single-stream inference, when memory bandwidth is saturated, or during swap, the gap approaches zero. The larger the memory tier and model size, the more pronounced MLX's advantage.

What is the difference between GGUF and MLX quantization formats? Can they be converted?

The two formats are completely incompatible. llama.cpp uses GGUF (Q2_K through Q8_0); MLX uses safetensors + MLX quantization metadata. The same base model must be converted separately for each framework. The mlx-community on Hugging Face maintains pre-converted MLX versions of popular models.

Why do most Mac local LLM setups use Ollama?

Ollama wraps llama.cpp with a zero-config HTTP inference server (:11434, OpenAI-compatible), model lifecycle management, and multi-client sharing. Claude Code, Cursor, and other Agent tools connect to it directly without any custom gateway. MLX has slightly higher throughput but no built-in HTTP server, making it unsuitable as an Agent runtime.

Which framework holds up better during swap, MLX or llama.cpp?

Both degrade severely and the gap disappears. Memory bandwidth drops from 120 GB/s to 3–5 GB/s (NVMe ceiling), and no framework can optimize around that. The only fix is upgrading to a larger unified memory tier (24 GB or 48 GB) — switching frameworks does nothing.

MLX vs llama.cpp: Which Inference Framework Is Closer to the Metal on Apple Silicon? 2026

Q: What is the fundamental difference between llama.cpp's Metal backend and MLX?

llama.cpp abstracts multiple platforms through the ggml tensor library; Metal is just one of many backends, with pre-compiled kernels and per-op independent dispatch. MLX was designed from day one exclusively for Apple Silicon: lazy evaluation fuses ops at runtime, reducing dispatch count, with a shorter quantization path and deeper unified memory alignment.

Q: Why doesn't Apple Silicon LLM inference use CUDA?

Apple Silicon has no NVIDIA GPU and does not support CUDA. Apple's GPU is accessed via the Metal API, and the unified memory architecture allows the CPU and GPU to share the same physical memory pool. Both MLX and llama.cpp are built on Metal, not CUDA.

Q: Why does Claude Code choose Ollama instead of MLX?

Ollama provides a zero-config HTTP inference server (:11434) with an OpenAI-compatible API — no custom gateway needed. In an Agent tool loop, framework-level tok/s differences account for less than 5% of total latency; tool execution time is the actual bottleneck. MLX has no built-in HTTP server, so integrating it with Claude Code requires building a full inference service stack.

Q: Which is better for local deployment, MLX or llama.cpp?

Use case determines the answer: Agent workloads (Claude Code/Cursor) → llama.cpp + Ollama; benchmarking/research/LoRA fine-tuning → MLX; multi-user shared server → Ollama. Neither is universally superior — the deciding factor is HTTP runtime requirements and engineering complexity.

TL;DR (3-line version)

Performance MLX is faster — but only in Apple Silicon single-machine benchmark scenarios (3%–12% ahead)
Agent Production llama.cpp + Ollama is better suited for Agent / production inference (HTTP runtime is the deciding variable)
Real Bottleneck In most scenarios the bottleneck is memory bandwidth, not the framework — both collapse during swap

Core verdict

Hardware proximity: MLX wins (3× fewer dispatches, shorter quantization path). Engineering proximity: llama.cpp / Ollama wins (out-of-the-box HTTP runtime, zero-config Claude Code integration). These two dimensions are not interchangeable.

Unified Theoretical Framework (First Principles)

The fundamental difference between MLX and llama.cpp is not "slightly faster vs slightly slower" — it is two entirely different execution philosophies:

graph-fusion execution (MLX): lazily collect the op graph → fuse at runtime → batch dispatch
per-op dispatch execution (llama.cpp): each op submitted independently → cross-platform compatible → predictable overhead

And two fundamentally different engineering orientations:

hardware-native specialization (MLX): Apple-only, squeeze every cycle from one hardware target
engineered portability (llama.cpp): runs across CUDA / Metal / Vulkan / CPU, portability first

Every performance difference traces back to three things:
① dispatch count ② kernel fusion capability ③ whether memory bandwidth becomes the bottleneck

System Boundary: Scope and limits of this article's conclusions

A conclusion with defined boundaries is more trustworthy than one without. All conclusions in this article about MLX vs llama.cpp performance and stack selection hold under the following explicit conditions:

✅ Applicable scenarios

Apple Silicon (M1–M4 / M5 family)
batch = 1 / small-batch inference (decode phase)
Decoder-only transformer (7B–14B mainstream models)
Metal backend (macOS native GPU path)
Single-machine local inference / small team shared node
Claude Code / Cursor / Open WebUI and similar Agent tools

❌ Not applicable

CUDA / NVIDIA multi-GPU serving
Large batch training
Distributed inference (multi-node)
Speculative decoding pipeline
MoE mixture-of-experts models (architecture differs significantly)
Cloud GPU instances (non-Apple Silicon)

Conclusions in this article do not apply to NVIDIA / AMD GPU scenarios — that is the domain of CUDA vs ROCm, unrelated to Apple Silicon's unified memory architecture.

Key Term Definitions: Metal dispatch · Kernel fusion · Unified memory · GGUF vs MLX format

Before diving into the technical content, let's align on a few core concepts. These are also the most commonly confused terms in the Apple Silicon LLM inference space.

Metal dispatch (Metal compute dispatch): The minimum unit by which the CPU submits a compute kernel to the Apple Silicon GPU. Each dispatch requires a CPU-GPU synchronization handshake, producing roughly 1–3 μs of scheduling overhead. The more dispatches, the greater the cumulative CPU wait time.
Kernel fusion: Combining multiple independent ops (e.g., matmul → bias add → activation) into a single GPU kernel execution, reducing dispatch count and intermediate global memory reads/writes. MLX performs kernel fusion automatically via lazy evaluation; llama.cpp relies on hand-written fused kernels (e.g., flash attention) with limited coverage.
Unified memory (Apple Silicon): The CPU and GPU on Apple Silicon share a single physical memory pool — there is no discrete VRAM. After model weights are loaded, the GPU reads them directly without any CPU↔GPU copy. This is the foundational hardware advantage of LLM inference on Mac. Both MLX and llama.cpp's Metal backend automatically benefit from unified memory, but MLX's memory allocator is more tightly aligned to this model.
GGUF (llama.cpp quantization format): The model quantization format used by llama.cpp, supporting multiple precisions from Q2_K to Q8_0. Model files use the .gguf extension and can be downloaded directly from Hugging Face. Ollama uses GGUF for model management, and it is incompatible with the MLX format.
MLX format (mlx-community quantization format): The model format used by MLX, based on safetensors + MLX quantization metadata. Obtained via mlx_lm.convert or from mlx-community on Hugging Face. Quantization kernels are directly bound to Metal shaders, and the format is completely incompatible with GGUF — the same base model must be converted separately for each framework.

MLX vs llama.cpp Architecture Overview: Complete Apple Silicon inference framework comparison

Before going deeper, a single table to establish the overall picture. The fundamental divergence between mlx vs llama.cpp comes down to design goals: one optimized exclusively for Apple Silicon, the other for cross-platform portability.

Dimension	MLX	llama.cpp
Design goal	Apple Silicon exclusive	Cross-platform (CUDA / Metal / Vulkan / CPU)
Dispatch model	Lazy graph fusion (batch dispatch after graph analysis)	Per-op dispatch (each op submitted independently)
Metal approach	Runtime JIT kernel (compiled and specialized per tensor shape)	Precompiled kernels (bundled with binary)
Memory model	Native unified memory allocation	ggml allocator (general-purpose, not Apple-specific)
Quantization format	safetensors + MLX metadata (mlx-community)	GGUF (Q2_K – Q8_0)
HTTP runtime	None built-in (requires custom FastAPI gateway)	Ollama wrapper, zero-config (`:11434`)
Works with Claude Code / Cursor out of the box?	❌ Requires custom gateway	✅ Ollama zero-config
Best use case	Benchmarking, research, LoRA fine-tuning	Agent runtime, production inference, team sharing

The last two rows determine the engineering decision; tok/s differences are covered in § Real sources of the performance gap and do not factor into Agent runtime selection.

Apple Silicon LLM Inference Stack Model (2026)

Converting the comparison table into a four-layer model is the simplest framework for understanding all conclusions in this article:

┌─────────────────────────────────────────────────┐
│            Application Layer                    │
│   Claude Code  ·  Cursor  ·  Open WebUI         │
├─────────────────────────────────────────────────┤
│            Runtime Layer                        │
│   Ollama (llama.cpp)  │  FastAPI (MLX wrapper)  │
│   ✅ Recommended Agent path  │  ⚠️ DIY, higher eng. cost  │
├─────────────────────────────────────────────────┤
│            Compute Layer                        │
│   ggml-metal (per-op) │  MLX (graph fusion)     │
│   Cross-platform, precompiled  │  Apple-only, JIT-specialized  │
├─────────────────────────────────────────────────┤
│            Hardware Layer                       │
│       Apple Silicon Unified Memory              │
│   CPU + GPU share physical memory · zero-copy · high bandwidth  │
└─────────────────────────────────────────────────┘

Fig · Apple Silicon LLM inference four-layer stack — stack selection happens at the Runtime layer, performance differences originate at the Compute layer, the ceiling is at the Hardware layer

Key takeaway: stack selection only happens at the Runtime layer. Performance differences between the Compute layer (MLX vs ggml-metal) do not affect Runtime layer selection — they operate at different levels and do not substitute for each other.

Core difference: Metal dispatch count (480 vs 160) — the source of Apple Silicon inference performance

This is the single most important number in this article. Understanding the dispatch count difference is understanding the root cause of the MLX vs llama.cpp performance gap.

For a 7B transformer model, generating one token requires one forward pass through all 32 layers. Each layer contains ops including QKV projection, attention score, softmax, output projection, FFN gate, FFN up/down, RMS Norm, RoPE, and others:

llama.cpp (per-op dispatch):  32 layers × ~15 ops/layer ≈ 480 Metal dispatches
MLX (after lazy graph fusion): 32 layers × ~4–5 fusion blocks/layer ≈ 128–160 Metal dispatches

💡 Core insight (quotable)

MLX's advantage is not stronger compute — it's reducing CPU↔GPU dispatch count by ~3×.
At batch=1, each [commandBuffer commit] incurs roughly 1–3 μs of CPU scheduling overhead. 480 vs 160 dispatches saves approximately 0.3–1 ms per token — this is the primary mechanism behind MLX's benchmark lead, not superior kernel algorithms.

Every Metal dispatch is a CPU-GPU synchronization point: the CPU must wait for the GPU to acknowledge the command buffer enqueue before initiating the next op's kernel call. When GPU compute time is short (small-batch inference), dispatch overhead becomes the bottleneck. MLX's lazy evaluation fuses multiple ops into a single dispatch, bypassing this synchronization cost entirely.

Token Latency Decomposition Model (Dispatch Cost Model)

Breaking down a single token's generation latency into three quantifiable components:

Token latency ≈
  dispatch_count × cpu_sync_cost     ← dispatch overhead (source of framework difference)
  + gpu_compute_time                 ← actual GPU compute (bandwidth / compute bound)
  + memory_bandwidth_time            ← weight transfer time (physical ceiling)

Typical distribution at batch=1, 7B model, M4 Mac Mini 16GB:

Component	llama.cpp	MLX	Notes
`dispatch_count × cpu_sync_cost`	~480 × 2 μs ≈ 0.96 ms	~160 × 2 μs ≈ 0.32 ms	Source of framework difference, roughly 10%–30% of total latency
`gpu_compute_time`	~1–3 ms (similar for both)		Determined by compute and quantization format
`memory_bandwidth_time`	~4 GB ÷ 120 GB/s ≈ 33 ms (dominant term)		Physical ceiling, cannot be optimized by any framework

Approximate estimates (batch=1, 7B Q4_K_M, M4 16GB). memory_bandwidth_time is the decisive dominant term — this explains why MLX's advantage approaches zero when bandwidth is saturated.

Dimension	llama.cpp (ggml-metal)	MLX
Kernel lifecycle	Precompiled, bundled with binary (`.metal` → `.metallib`)	Compiled on demand at runtime; first run has a compilation delay, then cached
Kernel fusion	Limited: some hand-written fused kernels (e.g., flash attention)	Framework-level automatic fusion; lazy graph analysis merges adjacent ops
Dispatches per forward pass (7B)	~400–600 (per-op)	~80–160 (after fusion)
CPU dispatch path	ggml scheduler thread → Metal command buffer	MLX scheduler writes Metal command buffer directly
Multi-stream parallelism	Single Metal command queue (typical configuration)	Supports multiple streams; adjacent layers can compute in parallel

Why MLX exists: The design rationale behind an Apple Silicon-exclusive inference framework

In late 2023, Apple's machine learning research team open-sourced MLX. Its design motivation is distinctive — not cross-platform compatibility, but squeezing everything out of one specific piece of hardware.

Before MLX, LLM inference on Apple Silicon depended on:

Core ML / MPS: Apple's official inference stack; closed model format, unfriendly for researchers
llama.cpp + Metal: The community default, but ggml was designed for the CUDA world and carries cross-platform abstraction overhead
PyTorch + MPS backend: Incomplete MPS op coverage; inference speed does not fully leverage Apple Silicon

The Apple engineers' conclusion: to truly exploit the combination of unified memory + high-bandwidth bus + Metal GPU, a compute framework designed from scratch for Apple Silicon was needed.

MLX's four core design principles

CPU and GPU share the same memory space — zero-copy data transfer
Lazy evaluation: operations are not executed immediately; they are batched and scheduled when results are needed, with graph optimization in between
Dynamic graph + JIT; API style close to NumPy / JAX, researcher-friendly
Metal compute shaders compiled on demand at runtime — no pre-bundled fixed kernels

llama.cpp architecture: ggml multi-platform abstraction + Metal backend + GGUF quantization

llama.cpp was released by Georgi Gerganov in March 2023, with the goal of running LLaMA on consumer hardware in pure C/C++ — portability is the first priority.

ggml: the general tensor backend

The core of llama.cpp is ggml — a lightweight C tensor library that supports multiple hardware targets via backend plugins:

Backend	Hardware	Activation condition
`GGML_METAL`	Apple Silicon GPU	macOS, `-DLLAMA_METAL=ON`
`GGML_CUDA`	NVIDIA GPU	Linux/Windows + CUDA
`GGML_VULKAN`	AMD / Intel GPU	Vulkan driver
`GGML_CPU`	All CPUs	Default fallback

The benefit is one codebase running everywhere; the cost is that each backend is a general-purpose adapter, unable to deep-optimize for specific hardware, and the abstraction layer introduces scheduling overhead.

Metal backend (ggml-metal) and precompiled kernels

On Apple Silicon, llama.cpp enables ggml-metal.metal — a precompiled Metal shader file that includes:

Matrix multiplication: dedicated kernels implemented for each GGUF quantization format such as Q4_K and Q8_0
Softmax, RoPE, RMS Norm: independent per-op kernels
KV cache operations: memory view reuse to avoid copies

Key limitation: kernels are precompiled and fixed — they are not re-optimized at runtime based on tensor shape. When sequence length or batch size changes, kernel parameters are passed in via Metal buffers, but the kernel itself is unchanged — this is precisely where MLX's runtime specialization gains an edge.

GGUF quantization format

llama.cpp uses GGUF, with quantization precision ranging from Q2_K (extreme compression) to Q8_0 (close to fp16). Each quantization format has a corresponding Metal kernel implementation, which is why llama.cpp performs well on Apple Silicon — but this also means maintaining a separate set of kernels for every new quantization format added.

MLX architecture: lazy evaluation + runtime JIT kernel + Apple Silicon unified memory

The fundamental difference between MLX and llama.cpp lies in where the abstraction layer sits: llama.cpp's Metal backend is a "plugin backend", while MLX's entire compute graph lives within Metal's semantics from the ground up.

Lazy evaluation + compute graph fusion

When Python code calls an MLX op, it does not execute immediately — it only records a compute graph node
When mx.eval() is called, the framework analyzes the entire graph
Adjacent fusible ops (matmul + bias + activation, etc.) are merged into a single Metal dispatch
The generated Metal compute shader is compiled at runtime and cached for reuse

Deep utilization of unified memory

All tensors are allocated in the unified memory pool; CPU and GPU share the same physical address space
There is no "VRAM" concept — after model weights are loaded, the GPU reads them directly; no cudaMemcpy equivalent needed
Metal buffers and MLX arrays share the same underlying pointer — zero-copy parameter passing

llama.cpp's Metal backend also benefits from unified memory (it's a hardware feature of Apple Silicon available to any Metal program), but ggml's memory allocator was not designed specifically for this model and carries extra alignment and allocation logic.

GGUF vs MLX quantization format: two incompatible ecosystems

MLX uses safetensors + MLX quantization metadata, which is completely incompatible with GGUF. The same base model must be converted separately:

For llama.cpp: download from HuggingFace → convert to GGUF with convert.py, then quantize
For MLX: convert with mlx_lm.convert, or pull pre-converted versions directly from mlx-community

Supported quantization bit-widths include 4-bit, 8-bit, and mixed precision. Op implementations are directly and deeply bound to Metal shaders — this is the fundamental reason MLX's quantization dequant path is shorter than GGUF's.

Real sources of the MLX vs llama.cpp performance gap: dispatch count vs memory bandwidth

This is the most misunderstood part of mlx benchmark results. MLX's speed advantage does not come from a better matrix multiplication algorithm — it comes from three quantifiable engineering factors:

1. Dispatch fusion: 3× fewer CPU-GPU synchronization points

As described in § Core difference, MLX reduces Metal dispatches per token from ~480 to ~160. This yields the greatest benefit at batch=1, short sequences — where GPU compute time is short and CPU scheduling overhead makes up a larger share.

2. Runtime kernel specialization: optimized for concrete tensor shapes

MLX compiles Metal shaders the first time a particular compute shape is encountered, allowing it to tune tile size and workgroup size for the specific matrix dimensions. llama.cpp's precompiled kernels use conservative general-purpose parameters that adapt poorly to non-standard sequence lengths.

3. Shorter quantization dequant path

To support multiple platforms, llama.cpp's dequantization must account for CPU fallback paths; MLX's quantization kernels target Metal exclusively, allowing dequant + matmul to complete within a single shader, reducing global memory reads and writes.

Measured results reference

Data from Macstripe Lab, Llama-3.1-8B / Q4_K_M, batch=1, measured June 2026. For trend reference only — not to be used as a purchasing or selection benchmark.

Device	llama.cpp (Ollama) tok/s	MLX tok/s	Gap
M4 Mac Mini 16GB	27–31	28–33	+3%–7%
M4 Mac Mini 24GB	34–39	37–43	+5%–10%
M4 Pro 48GB	72–80	80–90	+8%–12%

Pattern: the larger the memory tier and model, the more pronounced MLX's advantage. Larger models have longer GPU compute time per forward pass, so the absolute gain from dispatch fusion is larger. At 16GB running 7B, bandwidth is already the bottleneck and the fusion benefit is diluted.

Performance Convergence Model

Visualizing the trend from the table above: how MLX's performance advantage grows with memory tier, and converges to zero at the bandwidth saturation point:

  MLX advantage
  (tok/s delta)
  │
  +12% │                                        ● 48GB (14B)
       │                                   ·
   +8% │                              ·
       │                         ·
   +5% │                    ● 24GB (7B–14B)
       │               ·
   +3% │          ● 16GB (7B)
       │
    0% │━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  ← memory bandwidth ceiling
       │   (swap)  (swap)
  -∞% │     ↓ both frameworks collapse when swap triggers
       │
       └────────────────────────────────────────────
              16GB       24GB       48GB       Memory tier

Fig · MLX vs llama.cpp performance convergence model — MLX has an advantage below the memory bandwidth ceiling; both converge to zero in the swap zone

The key point of the chart: memory tier determines where you sit on the curve. The gain from upgrading 16GB → 24GB (which no framework switch can replicate) far exceeds the 3%–12% MLX vs llama.cpp gap.

Why mlx benchmark differences are ≈0 in most scenarios: memory bandwidth is the real ceiling

Not all scenarios surface MLX's advantage. This is the most important counter-intuitive point in this article:

Memory bandwidth saturation: physical limits compress framework differences to zero

The performance ceiling for batch=1 inference is weight transfer bandwidth, not GPU compute. Generating each token requires reading all layer weights once (7B Q4 ≈ 4 GB of data movement). M4's memory bandwidth is approximately 120 GB/s (16GB version) — this physical limit determines the tok/s upper bound. Neither MLX nor llama.cpp can break through it. At bandwidth saturation, both converge.

Both frameworks collapse during memory swap

When unified memory is insufficient, the system pages weights to SSD, dropping bandwidth from 120 GB/s to 3–5 GB/s (NVMe sequential read ceiling). Both frameworks' tok/s fall to single digits and the framework difference disappears entirely. The only fix is upgrading the memory tier — switching frameworks does nothing.

Long-context prefill: GPU compute becomes the bottleneck, dispatch overhead share approaches zero

Prefill (processing a long prompt) is compute-intensive, with GPU utilization near 100%. The difference in prefill time for a 2048-token prompt is typically <5% between the two frameworks.

Small models (≤3B): absolute compute time is too short

A 3B Q4 model weighs only ~2 GB; one forward pass GPU time is roughly 5–10 ms. The 0.3–1 ms saved by dispatch fusion represents a higher share, but the absolute value is too small for users to perceive.

High-precision quantization (Q8_0 / fp16): bandwidth saturates earlier

Higher precision means larger weights and earlier bandwidth saturation. Q8 models on 16GB machines almost certainly swap, at which point discussing framework differences is irrelevant.

Ollama vs MLX engineering decision: why Agent runtime doesn't care about tok/s

If MLX is closer to the hardware, why is Claude Code's recommended stack Ollama (llama.cpp)? Because these are two different dimensions entirely.

HTTP server is the deciding variable

Ollama provides a complete inference service engineering stack on top of llama.cpp:

Zero-config HTTP inference server (:11434, OpenAI-compatible API) — Claude Code connects directly, no configuration needed
Model lifecycle management (ollama pull / rm, automatic unloading of idle models)
Multi-request concurrent scheduling (multiple Claude Code windows share a single model instance without reloading)
Team LAN sharing (ollama serve --host 0.0.0.0, entire team connects to the same :11434)

MLX has none of this. Connecting MLX to Claude Code requires: build a FastAPI gateway → implement model load/unload logic → handle concurrent request queues → maintain an OpenAI API compatibility layer. That is a complete inference service engineering project, not a one-command setup.

Framework-level tok/s difference is <5% of Agent tool loop latency

In an Agent tool loop, each Claude Code round-trip includes: prompt assembly → HTTP request → wait for response → parse tool calls → execute tools → next round. Latency breakdown:

Factor	Share (typical Agent task)	Can framework optimize?
Tool execution time (file I/O, shell commands)	50%–70%	No
HTTP round-trip + model TTFT	20%–35%	Mainly memory tier
Framework-level tok/s gap (MLX vs llama.cpp)	<5%	Yes, but negligible impact

The deciding variable for Agent experience is memory tier and model size. Framework differences are completely swamped by the HTTP layer and tool execution time.

Engineering boundaries of Ollama vs MLX

Clarifying the engineering boundary of each makes the decision obvious:

Ollama (llama.cpp): HTTP runtime layer — solves the "how to plug it in" problem
MLX: compute layer tool — solves the "how to run fast in Python" problem
They are not mutually exclusive: the same Mac can run MLX for benchmarking while running Ollama to serve Claude Code

→ Full Ollama vs MLX decision logic: Ollama vs MLX: Which local model should Claude Code use?
→ MLX deployment on M4 Pro in practice: Apple Silicon M4 Pro Local LLM Guide: Performance Measurements and MLX Deployment

MLX or llama.cpp — which should I choose? (30-second decision guide)

Choose based on your actual use case. There is no universally "better" answer:

Building an Agent (Claude Code / Cursor / tool loop) llama.cpp + Ollama HTTP runtime is the deciding variable; tok/s difference <5%
Benchmarking / research / CI regression testing MLX Closer to the hardware, 3× fewer dispatches, more meaningful results
LoRA fine-tuning / dynamically modifying model architecture MLX (mlx-lm) Dynamic graph in Python; the go-to choice for fine-tuning on Apple Silicon
Multi-user server / shared team inference node Ollama serve (llama.cpp) Built-in concurrent scheduling; 24GB / 48GB nodes provisioned in 5 minutes
Insufficient memory (frequent swap) Upgrade memory tier, not framework 16GB→24GB gain far exceeds the 3%–12% MLX vs llama.cpp difference

Apple Silicon LLM Inference Summary Framework: Three Quotable Rules (mac m4 llm inference speed)

Compressing all conclusions in this article into a reusable, quotable decision framework:

Summary framework (quotable directly)

Performance ceiling = memory bandwidth
The tok/s ceiling for batch=1 inference on Apple Silicon is determined by unified memory bandwidth (~120 GB/s on M4). No framework can break through a physical limit.
Framework difference = dispatch overhead
The performance gap between MLX and llama.cpp (3%–12%) comes from Metal dispatch count differences (480 vs 160), not superior op algorithms.
Stack selection = runtime vs research
Agent runtime → Ollama (llama.cpp); HTTP server is the deciding variable. Research / benchmarking / fine-tuning → MLX; closest to the hardware at the compute layer.

Framework optimization yields 5%–15%. Memory tier determines 85%.
Every MLX vs llama.cpp discussion ultimately comes back to this.

FAQ

Which is faster, MLX or llama.cpp? How large is the gap?

MLX is 3%–12% faster in most benchmark scenarios, primarily from kernel fusion (dispatch count ~480 vs 160) and runtime Metal kernel specialization. However, at batch=1 bandwidth saturation, during memory swap, or with long prefill, the gap approaches zero. Larger memory tiers (48GB nodes) and larger models (14B+) amplify MLX's advantage; at 16GB running 7B the gap is only 3%–7%.

What is the fundamental difference between llama.cpp's Metal backend and MLX?

llama.cpp abstracts multiple platforms through ggml; Metal is one of many backends, with precompiled kernels and per-op independent dispatch. MLX was designed from day one exclusively for Apple Silicon: lazy evaluation fuses ops at runtime, dispatch count is 3× lower, the quantization path is shorter, and unified memory alignment is deeper.

Can GGUF and MLX quantization formats be converted to each other?

No direct conversion; the two formats are completely independent. GGUF is generated from original weights using llama.cpp's convert.py; MLX format is converted with mlx_lm.convert or pulled directly from mlx-community. The same base model must be converted separately for each.

Why does Claude Code choose Ollama instead of MLX?

Ollama provides a zero-config HTTP service (:11434, OpenAI-compatible) without needing a custom gateway. In an Agent tool loop, framework-level tok/s differences account for less than 5% of total latency; tool execution time and the HTTP layer are the actual bottlenecks. MLX has no built-in HTTP server — integrating it with Claude Code requires a complete custom inference service stack, consuming all speed gains in engineering overhead.

Which framework holds up better during swap?

Both degrade severely and the gap disappears. Memory bandwidth drops from 120 GB/s to 3–5 GB/s (NVMe ceiling), and no framework can optimize around an SSD bottleneck. Upgrading to a 24GB or 48GB node is the only fix.

Can MLX and Ollama be installed on the same machine?

Yes, they don't conflict at all. Recommended setup: run Ollama as a persistent background service for Claude Code, and use MLX on demand in a Python environment for benchmarking or fine-tuning scripts. Both share the same unified memory pool, but use different model formats and must be managed separately.