TL;DR (3-line version)

  • Performance MLX is faster — but only in Apple Silicon single-machine benchmark scenarios (3%–12% ahead)
  • Agent Production llama.cpp + Ollama is better suited for Agent / production inference (HTTP runtime is the deciding variable)
  • Real Bottleneck In most scenarios the bottleneck is memory bandwidth, not the framework — both collapse during swap

Core verdict

Hardware proximity: MLX wins (3× fewer dispatches, shorter quantization path). Engineering proximity: llama.cpp / Ollama wins (out-of-the-box HTTP runtime, zero-config Claude Code integration). These two dimensions are not interchangeable.

Unified Theoretical Framework (First Principles)

The fundamental difference between MLX and llama.cpp is not "slightly faster vs slightly slower" — it is two entirely different execution philosophies:

  • graph-fusion execution (MLX): lazily collect the op graph → fuse at runtime → batch dispatch
  • per-op dispatch execution (llama.cpp): each op submitted independently → cross-platform compatible → predictable overhead

And two fundamentally different engineering orientations:

  • hardware-native specialization (MLX): Apple-only, squeeze every cycle from one hardware target
  • engineered portability (llama.cpp): runs across CUDA / Metal / Vulkan / CPU, portability first

Every performance difference traces back to three things:
① dispatch count  ② kernel fusion capability  ③ whether memory bandwidth becomes the bottleneck

System Boundary: Scope and limits of this article's conclusions

A conclusion with defined boundaries is more trustworthy than one without. All conclusions in this article about MLX vs llama.cpp performance and stack selection hold under the following explicit conditions:

✅ Applicable scenarios

  • Apple Silicon (M1–M4 / M5 family)
  • batch = 1 / small-batch inference (decode phase)
  • Decoder-only transformer (7B–14B mainstream models)
  • Metal backend (macOS native GPU path)
  • Single-machine local inference / small team shared node
  • Claude Code / Cursor / Open WebUI and similar Agent tools

❌ Not applicable

  • CUDA / NVIDIA multi-GPU serving
  • Large batch training
  • Distributed inference (multi-node)
  • Speculative decoding pipeline
  • MoE mixture-of-experts models (architecture differs significantly)
  • Cloud GPU instances (non-Apple Silicon)

Conclusions in this article do not apply to NVIDIA / AMD GPU scenarios — that is the domain of CUDA vs ROCm, unrelated to Apple Silicon's unified memory architecture.

Key Term Definitions: Metal dispatch · Kernel fusion · Unified memory · GGUF vs MLX format

Before diving into the technical content, let's align on a few core concepts. These are also the most commonly confused terms in the Apple Silicon LLM inference space.

Metal dispatch (Metal compute dispatch)
The minimum unit by which the CPU submits a compute kernel to the Apple Silicon GPU. Each dispatch requires a CPU-GPU synchronization handshake, producing roughly 1–3 μs of scheduling overhead. The more dispatches, the greater the cumulative CPU wait time.
Kernel fusion
Combining multiple independent ops (e.g., matmul → bias add → activation) into a single GPU kernel execution, reducing dispatch count and intermediate global memory reads/writes. MLX performs kernel fusion automatically via lazy evaluation; llama.cpp relies on hand-written fused kernels (e.g., flash attention) with limited coverage.
Unified memory (Apple Silicon)
The CPU and GPU on Apple Silicon share a single physical memory pool — there is no discrete VRAM. After model weights are loaded, the GPU reads them directly without any CPU↔GPU copy. This is the foundational hardware advantage of LLM inference on Mac. Both MLX and llama.cpp's Metal backend automatically benefit from unified memory, but MLX's memory allocator is more tightly aligned to this model.
GGUF (llama.cpp quantization format)
The model quantization format used by llama.cpp, supporting multiple precisions from Q2_K to Q8_0. Model files use the .gguf extension and can be downloaded directly from Hugging Face. Ollama uses GGUF for model management, and it is incompatible with the MLX format.
MLX format (mlx-community quantization format)
The model format used by MLX, based on safetensors + MLX quantization metadata. Obtained via mlx_lm.convert or from mlx-community on Hugging Face. Quantization kernels are directly bound to Metal shaders, and the format is completely incompatible with GGUF — the same base model must be converted separately for each framework.

MLX vs llama.cpp Architecture Overview: Complete Apple Silicon inference framework comparison

Before going deeper, a single table to establish the overall picture. The fundamental divergence between mlx vs llama.cpp comes down to design goals: one optimized exclusively for Apple Silicon, the other for cross-platform portability.

DimensionMLXllama.cpp
Design goal Apple Silicon exclusive Cross-platform (CUDA / Metal / Vulkan / CPU)
Dispatch model Lazy graph fusion (batch dispatch after graph analysis) Per-op dispatch (each op submitted independently)
Metal approach Runtime JIT kernel (compiled and specialized per tensor shape) Precompiled kernels (bundled with binary)
Memory model Native unified memory allocation ggml allocator (general-purpose, not Apple-specific)
Quantization format safetensors + MLX metadata (mlx-community) GGUF (Q2_K – Q8_0)
HTTP runtime None built-in (requires custom FastAPI gateway) Ollama wrapper, zero-config (:11434)
Works with Claude Code / Cursor out of the box? ❌ Requires custom gateway ✅ Ollama zero-config
Best use case Benchmarking, research, LoRA fine-tuning Agent runtime, production inference, team sharing

The last two rows determine the engineering decision; tok/s differences are covered in § Real sources of the performance gap and do not factor into Agent runtime selection.

Apple Silicon LLM Inference Stack Model (2026)

Converting the comparison table into a four-layer model is the simplest framework for understanding all conclusions in this article:

┌─────────────────────────────────────────────────┐
│            Application Layer                    │
│   Claude Code  ·  Cursor  ·  Open WebUI         │
├─────────────────────────────────────────────────┤
│            Runtime Layer                        │
│   Ollama (llama.cpp)  │  FastAPI (MLX wrapper)  │
│   ✅ Recommended Agent path  │  ⚠️ DIY, higher eng. cost  │
├─────────────────────────────────────────────────┤
│            Compute Layer                        │
│   ggml-metal (per-op) │  MLX (graph fusion)     │
│   Cross-platform, precompiled  │  Apple-only, JIT-specialized  │
├─────────────────────────────────────────────────┤
│            Hardware Layer                       │
│       Apple Silicon Unified Memory              │
│   CPU + GPU share physical memory · zero-copy · high bandwidth  │
└─────────────────────────────────────────────────┘
Fig · Apple Silicon LLM inference four-layer stack — stack selection happens at the Runtime layer, performance differences originate at the Compute layer, the ceiling is at the Hardware layer

Key takeaway: stack selection only happens at the Runtime layer. Performance differences between the Compute layer (MLX vs ggml-metal) do not affect Runtime layer selection — they operate at different levels and do not substitute for each other.

Core difference: Metal dispatch count (480 vs 160) — the source of Apple Silicon inference performance

This is the single most important number in this article. Understanding the dispatch count difference is understanding the root cause of the MLX vs llama.cpp performance gap.

For a 7B transformer model, generating one token requires one forward pass through all 32 layers. Each layer contains ops including QKV projection, attention score, softmax, output projection, FFN gate, FFN up/down, RMS Norm, RoPE, and others:

llama.cpp (per-op dispatch):  32 layers × ~15 ops/layer ≈ 480 Metal dispatches
MLX (after lazy graph fusion): 32 layers × ~4–5 fusion blocks/layer ≈ 128–160 Metal dispatches

💡 Core insight (quotable)

MLX's advantage is not stronger compute — it's reducing CPU↔GPU dispatch count by ~3×.
At batch=1, each [commandBuffer commit] incurs roughly 1–3 μs of CPU scheduling overhead. 480 vs 160 dispatches saves approximately 0.3–1 ms per token — this is the primary mechanism behind MLX's benchmark lead, not superior kernel algorithms.

Every Metal dispatch is a CPU-GPU synchronization point: the CPU must wait for the GPU to acknowledge the command buffer enqueue before initiating the next op's kernel call. When GPU compute time is short (small-batch inference), dispatch overhead becomes the bottleneck. MLX's lazy evaluation fuses multiple ops into a single dispatch, bypassing this synchronization cost entirely.

Token Latency Decomposition Model (Dispatch Cost Model)

Breaking down a single token's generation latency into three quantifiable components:

Token latency ≈
  dispatch_count × cpu_sync_cost     ← dispatch overhead (source of framework difference)
  + gpu_compute_time                 ← actual GPU compute (bandwidth / compute bound)
  + memory_bandwidth_time            ← weight transfer time (physical ceiling)

Typical distribution at batch=1, 7B model, M4 Mac Mini 16GB:

Componentllama.cppMLXNotes
dispatch_count × cpu_sync_cost ~480 × 2 μs ≈ 0.96 ms ~160 × 2 μs ≈ 0.32 ms Source of framework difference, roughly 10%–30% of total latency
gpu_compute_time ~1–3 ms (similar for both) Determined by compute and quantization format
memory_bandwidth_time ~4 GB ÷ 120 GB/s ≈ 33 ms (dominant term) Physical ceiling, cannot be optimized by any framework

Approximate estimates (batch=1, 7B Q4_K_M, M4 16GB). memory_bandwidth_time is the decisive dominant term — this explains why MLX's advantage approaches zero when bandwidth is saturated.

Dimensionllama.cpp (ggml-metal)MLX
Kernel lifecycle Precompiled, bundled with binary (.metal.metallib) Compiled on demand at runtime; first run has a compilation delay, then cached
Kernel fusion Limited: some hand-written fused kernels (e.g., flash attention) Framework-level automatic fusion; lazy graph analysis merges adjacent ops
Dispatches per forward pass (7B) ~400–600 (per-op) ~80–160 (after fusion)
CPU dispatch path ggml scheduler thread → Metal command buffer MLX scheduler writes Metal command buffer directly
Multi-stream parallelism Single Metal command queue (typical configuration) Supports multiple streams; adjacent layers can compute in parallel

Why MLX exists: The design rationale behind an Apple Silicon-exclusive inference framework

In late 2023, Apple's machine learning research team open-sourced MLX. Its design motivation is distinctive — not cross-platform compatibility, but squeezing everything out of one specific piece of hardware.

Before MLX, LLM inference on Apple Silicon depended on:

  • Core ML / MPS: Apple's official inference stack; closed model format, unfriendly for researchers
  • llama.cpp + Metal: The community default, but ggml was designed for the CUDA world and carries cross-platform abstraction overhead
  • PyTorch + MPS backend: Incomplete MPS op coverage; inference speed does not fully leverage Apple Silicon

The Apple engineers' conclusion: to truly exploit the combination of unified memory + high-bandwidth bus + Metal GPU, a compute framework designed from scratch for Apple Silicon was needed.

MLX's four core design principles

  • CPU and GPU share the same memory space — zero-copy data transfer
  • Lazy evaluation: operations are not executed immediately; they are batched and scheduled when results are needed, with graph optimization in between
  • Dynamic graph + JIT; API style close to NumPy / JAX, researcher-friendly
  • Metal compute shaders compiled on demand at runtime — no pre-bundled fixed kernels

llama.cpp architecture: ggml multi-platform abstraction + Metal backend + GGUF quantization

llama.cpp was released by Georgi Gerganov in March 2023, with the goal of running LLaMA on consumer hardware in pure C/C++portability is the first priority.

ggml: the general tensor backend

The core of llama.cpp is ggml — a lightweight C tensor library that supports multiple hardware targets via backend plugins:

BackendHardwareActivation condition
GGML_METALApple Silicon GPUmacOS, -DLLAMA_METAL=ON
GGML_CUDANVIDIA GPULinux/Windows + CUDA
GGML_VULKANAMD / Intel GPUVulkan driver
GGML_CPUAll CPUsDefault fallback

The benefit is one codebase running everywhere; the cost is that each backend is a general-purpose adapter, unable to deep-optimize for specific hardware, and the abstraction layer introduces scheduling overhead.

Metal backend (ggml-metal) and precompiled kernels

On Apple Silicon, llama.cpp enables ggml-metal.metal — a precompiled Metal shader file that includes:

  • Matrix multiplication: dedicated kernels implemented for each GGUF quantization format such as Q4_K and Q8_0
  • Softmax, RoPE, RMS Norm: independent per-op kernels
  • KV cache operations: memory view reuse to avoid copies

Key limitation: kernels are precompiled and fixed — they are not re-optimized at runtime based on tensor shape. When sequence length or batch size changes, kernel parameters are passed in via Metal buffers, but the kernel itself is unchanged — this is precisely where MLX's runtime specialization gains an edge.

GGUF quantization format

llama.cpp uses GGUF, with quantization precision ranging from Q2_K (extreme compression) to Q8_0 (close to fp16). Each quantization format has a corresponding Metal kernel implementation, which is why llama.cpp performs well on Apple Silicon — but this also means maintaining a separate set of kernels for every new quantization format added.

MLX architecture: lazy evaluation + runtime JIT kernel + Apple Silicon unified memory

The fundamental difference between MLX and llama.cpp lies in where the abstraction layer sits: llama.cpp's Metal backend is a "plugin backend", while MLX's entire compute graph lives within Metal's semantics from the ground up.

Lazy evaluation + compute graph fusion

  1. When Python code calls an MLX op, it does not execute immediately — it only records a compute graph node
  2. When mx.eval() is called, the framework analyzes the entire graph
  3. Adjacent fusible ops (matmul + bias + activation, etc.) are merged into a single Metal dispatch
  4. The generated Metal compute shader is compiled at runtime and cached for reuse

Deep utilization of unified memory

  • All tensors are allocated in the unified memory pool; CPU and GPU share the same physical address space
  • There is no "VRAM" concept — after model weights are loaded, the GPU reads them directly; no cudaMemcpy equivalent needed
  • Metal buffers and MLX arrays share the same underlying pointer — zero-copy parameter passing

llama.cpp's Metal backend also benefits from unified memory (it's a hardware feature of Apple Silicon available to any Metal program), but ggml's memory allocator was not designed specifically for this model and carries extra alignment and allocation logic.

GGUF vs MLX quantization format: two incompatible ecosystems

MLX uses safetensors + MLX quantization metadata, which is completely incompatible with GGUF. The same base model must be converted separately:

  • For llama.cpp: download from HuggingFace → convert to GGUF with convert.py, then quantize
  • For MLX: convert with mlx_lm.convert, or pull pre-converted versions directly from mlx-community

Supported quantization bit-widths include 4-bit, 8-bit, and mixed precision. Op implementations are directly and deeply bound to Metal shaders — this is the fundamental reason MLX's quantization dequant path is shorter than GGUF's.

Real sources of the MLX vs llama.cpp performance gap: dispatch count vs memory bandwidth

This is the most misunderstood part of mlx benchmark results. MLX's speed advantage does not come from a better matrix multiplication algorithm — it comes from three quantifiable engineering factors:

1. Dispatch fusion: 3× fewer CPU-GPU synchronization points

As described in § Core difference, MLX reduces Metal dispatches per token from ~480 to ~160. This yields the greatest benefit at batch=1, short sequences — where GPU compute time is short and CPU scheduling overhead makes up a larger share.

2. Runtime kernel specialization: optimized for concrete tensor shapes

MLX compiles Metal shaders the first time a particular compute shape is encountered, allowing it to tune tile size and workgroup size for the specific matrix dimensions. llama.cpp's precompiled kernels use conservative general-purpose parameters that adapt poorly to non-standard sequence lengths.

3. Shorter quantization dequant path

To support multiple platforms, llama.cpp's dequantization must account for CPU fallback paths; MLX's quantization kernels target Metal exclusively, allowing dequant + matmul to complete within a single shader, reducing global memory reads and writes.

Measured results reference

Data from Macstripe Lab, Llama-3.1-8B / Q4_K_M, batch=1, measured June 2026. For trend reference only — not to be used as a purchasing or selection benchmark.

Devicellama.cpp (Ollama) tok/sMLX tok/sGap
M4 Mac Mini 16GB27–3128–33+3%–7%
M4 Mac Mini 24GB34–3937–43+5%–10%
M4 Pro 48GB72–8080–90+8%–12%

Pattern: the larger the memory tier and model, the more pronounced MLX's advantage. Larger models have longer GPU compute time per forward pass, so the absolute gain from dispatch fusion is larger. At 16GB running 7B, bandwidth is already the bottleneck and the fusion benefit is diluted.

Performance Convergence Model

Visualizing the trend from the table above: how MLX's performance advantage grows with memory tier, and converges to zero at the bandwidth saturation point:

  MLX advantage
  (tok/s delta)

  +12% │                                        ● 48GB (14B)
       │                                   ·
   +8% │                              ·
       │                         ·
   +5% │                    ● 24GB (7B–14B)
       │               ·
   +3% │          ● 16GB (7B)
       
    0% │━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  ← memory bandwidth ceiling
       │   (swap)  (swap)
  -∞% │     ↓ both frameworks collapse when swap triggers
       
       └────────────────────────────────────────────
              16GB       24GB       48GB       Memory tier
Fig · MLX vs llama.cpp performance convergence model — MLX has an advantage below the memory bandwidth ceiling; both converge to zero in the swap zone

The key point of the chart: memory tier determines where you sit on the curve. The gain from upgrading 16GB → 24GB (which no framework switch can replicate) far exceeds the 3%–12% MLX vs llama.cpp gap.

Why mlx benchmark differences are ≈0 in most scenarios: memory bandwidth is the real ceiling

Not all scenarios surface MLX's advantage. This is the most important counter-intuitive point in this article:

Memory bandwidth saturation: physical limits compress framework differences to zero

The performance ceiling for batch=1 inference is weight transfer bandwidth, not GPU compute. Generating each token requires reading all layer weights once (7B Q4 ≈ 4 GB of data movement). M4's memory bandwidth is approximately 120 GB/s (16GB version) — this physical limit determines the tok/s upper bound. Neither MLX nor llama.cpp can break through it. At bandwidth saturation, both converge.

Both frameworks collapse during memory swap

When unified memory is insufficient, the system pages weights to SSD, dropping bandwidth from 120 GB/s to 3–5 GB/s (NVMe sequential read ceiling). Both frameworks' tok/s fall to single digits and the framework difference disappears entirely. The only fix is upgrading the memory tier — switching frameworks does nothing.

Long-context prefill: GPU compute becomes the bottleneck, dispatch overhead share approaches zero

Prefill (processing a long prompt) is compute-intensive, with GPU utilization near 100%. The difference in prefill time for a 2048-token prompt is typically <5% between the two frameworks.

Small models (≤3B): absolute compute time is too short

A 3B Q4 model weighs only ~2 GB; one forward pass GPU time is roughly 5–10 ms. The 0.3–1 ms saved by dispatch fusion represents a higher share, but the absolute value is too small for users to perceive.

High-precision quantization (Q8_0 / fp16): bandwidth saturates earlier

Higher precision means larger weights and earlier bandwidth saturation. Q8 models on 16GB machines almost certainly swap, at which point discussing framework differences is irrelevant.

Ollama vs MLX engineering decision: why Agent runtime doesn't care about tok/s

If MLX is closer to the hardware, why is Claude Code's recommended stack Ollama (llama.cpp)? Because these are two different dimensions entirely.

HTTP server is the deciding variable

Ollama provides a complete inference service engineering stack on top of llama.cpp:

  • Zero-config HTTP inference server (:11434, OpenAI-compatible API) — Claude Code connects directly, no configuration needed
  • Model lifecycle management (ollama pull / rm, automatic unloading of idle models)
  • Multi-request concurrent scheduling (multiple Claude Code windows share a single model instance without reloading)
  • Team LAN sharing (ollama serve --host 0.0.0.0, entire team connects to the same :11434)

MLX has none of this. Connecting MLX to Claude Code requires: build a FastAPI gateway → implement model load/unload logic → handle concurrent request queues → maintain an OpenAI API compatibility layer. That is a complete inference service engineering project, not a one-command setup.

Framework-level tok/s difference is <5% of Agent tool loop latency

In an Agent tool loop, each Claude Code round-trip includes: prompt assembly → HTTP request → wait for response → parse tool calls → execute tools → next round. Latency breakdown:

FactorShare (typical Agent task)Can framework optimize?
Tool execution time (file I/O, shell commands)50%–70%No
HTTP round-trip + model TTFT20%–35%Mainly memory tier
Framework-level tok/s gap (MLX vs llama.cpp)<5%Yes, but negligible impact

The deciding variable for Agent experience is memory tier and model size. Framework differences are completely swamped by the HTTP layer and tool execution time.

Engineering boundaries of Ollama vs MLX

Clarifying the engineering boundary of each makes the decision obvious:

  • Ollama (llama.cpp): HTTP runtime layer — solves the "how to plug it in" problem
  • MLX: compute layer tool — solves the "how to run fast in Python" problem
  • They are not mutually exclusive: the same Mac can run MLX for benchmarking while running Ollama to serve Claude Code

→ Full Ollama vs MLX decision logic: Ollama vs MLX: Which local model should Claude Code use?
→ MLX deployment on M4 Pro in practice: Apple Silicon M4 Pro Local LLM Guide: Performance Measurements and MLX Deployment

MLX or llama.cpp — which should I choose? (30-second decision guide)

Choose based on your actual use case. There is no universally "better" answer:

Apple Silicon LLM Inference Summary Framework: Three Quotable Rules (mac m4 llm inference speed)

Compressing all conclusions in this article into a reusable, quotable decision framework:

Summary framework (quotable directly)

  1. Performance ceiling = memory bandwidth
    The tok/s ceiling for batch=1 inference on Apple Silicon is determined by unified memory bandwidth (~120 GB/s on M4). No framework can break through a physical limit.
  2. Framework difference = dispatch overhead
    The performance gap between MLX and llama.cpp (3%–12%) comes from Metal dispatch count differences (480 vs 160), not superior op algorithms.
  3. Stack selection = runtime vs research
    Agent runtime → Ollama (llama.cpp); HTTP server is the deciding variable. Research / benchmarking / fine-tuning → MLX; closest to the hardware at the compute layer.

Framework optimization yields 5%–15%. Memory tier determines 85%.
Every MLX vs llama.cpp discussion ultimately comes back to this.

Related questions: common searches on Apple Silicon inference stack selection

Which is better for local deployment, MLX or llama.cpp?

It depends on what "deployment" means. If you're deploying an inference service (for Claude Code / Cursor / a custom API), llama.cpp + Ollama is the better choice — zero-config HTTP server with solid model management. If you're deploying a research environment (testing new models, benchmarking, writing Python scripts), MLX is better — closer to the hardware and a more flexible Python interface.

Why doesn't Apple Silicon LLM inference use CUDA?

Apple Silicon has no NVIDIA GPU and does not support CUDA. Apple's GPU is accessed via the Metal API, and the unified memory architecture means CPU and GPU share the same physical memory pool — fundamentally different from NVIDIA's discrete VRAM model. Both MLX and llama.cpp are built on Metal, leveraging unified memory rather than VRAM.

What is the difference between GGUF and MLX quantization formats?

The two are incompatible — separate quantization ecosystems. GGUF (llama.cpp) supports Q2_K through Q8_0 with dedicated Metal kernel implementations for each precision; MLX uses safetensors + MLX quantization metadata, with quantization kernels directly bound to Metal shaders. The same base model must be converted separately to use it in either framework.

Why do most Mac LLM setups use Ollama?

Ollama solves the "how to actually use an LLM" engineering problem: start an HTTP server with one command, OpenAI-compatible API, and direct integration with Claude Code / Cursor / Open WebUI and other tools. The underlying engine is llama.cpp, Metal-accelerated on Apple Silicon, with sufficient performance and no Python environment or custom service required.

FAQ

Which is faster, MLX or llama.cpp? How large is the gap?

MLX is 3%–12% faster in most benchmark scenarios, primarily from kernel fusion (dispatch count ~480 vs 160) and runtime Metal kernel specialization. However, at batch=1 bandwidth saturation, during memory swap, or with long prefill, the gap approaches zero. Larger memory tiers (48GB nodes) and larger models (14B+) amplify MLX's advantage; at 16GB running 7B the gap is only 3%–7%.

What is the fundamental difference between llama.cpp's Metal backend and MLX?

llama.cpp abstracts multiple platforms through ggml; Metal is one of many backends, with precompiled kernels and per-op independent dispatch. MLX was designed from day one exclusively for Apple Silicon: lazy evaluation fuses ops at runtime, dispatch count is 3× lower, the quantization path is shorter, and unified memory alignment is deeper.

Can GGUF and MLX quantization formats be converted to each other?

No direct conversion; the two formats are completely independent. GGUF is generated from original weights using llama.cpp's convert.py; MLX format is converted with mlx_lm.convert or pulled directly from mlx-community. The same base model must be converted separately for each.

Why does Claude Code choose Ollama instead of MLX?

Ollama provides a zero-config HTTP service (:11434, OpenAI-compatible) without needing a custom gateway. In an Agent tool loop, framework-level tok/s differences account for less than 5% of total latency; tool execution time and the HTTP layer are the actual bottlenecks. MLX has no built-in HTTP server — integrating it with Claude Code requires a complete custom inference service stack, consuming all speed gains in engineering overhead.

Which framework holds up better during swap?

Both degrade severely and the gap disappears. Memory bandwidth drops from 120 GB/s to 3–5 GB/s (NVMe ceiling), and no framework can optimize around an SSD bottleneck. Upgrading to a 24GB or 48GB node is the only fix.

Can MLX and Ollama be installed on the same machine?

Yes, they don't conflict at all. Recommended setup: run Ollama as a persistent background service for Claude Code, and use MLX on demand in a Python environment for benchmarking or fine-tuning scripts. Both share the same unified memory pool, but use different model formats and must be managed separately.

Further reading