Ollama vs MLX: Which Local Model Stack for Claude Code? (M4 Mac Mini 2026)

Choosing Ollama vs MLX for local LLM inference on Apple Silicon

10-second verdict (runbook)

Claude Code always uses Ollama (:11434). MLX is for offline benchmarks and model validation only.

Under this article’s scope (Claude Code / Cursor local models), the choice happens at the runtime layer — default and recommended: Ollama. There is no third “official path.”

When people search Ollama vs MLX, they are usually asking about the runtime layer; for Claude Code local model setup, the answer is below.

Non-negotiable conclusion

Claude Code has one default stack: Ollama. MLX does not belong in Agent runtime selection in this article — only in benchmark, CI, and research.

Runtime (Claude Code / local API) → Ollama only
Offline (benchmark / CI / research) → MLX
M4 Mac Mini local LLM · 16GB → start with 7B tier, then ollama serve

At-a-glance (single comparison table)

For M4 Mac Mini local LLM and Claude Code local models (Ollama vs MLX), start with the first two rows, which decide the stack. MLX is not an “alternate runtime” and must not sit on the Agent path.

Dimension	Ollama	MLX
Agent runtime (Claude Code / Cursor / tool loop)	✅	❌
Zero glue for Claude Code	✅ `:11434`	❌ custom gateway required
Built-in HTTP inference	✅	❌ (custom gateway not recommended here)
tok/s (8B, no swap)	baseline	~+3% to +12% (offline reference only)
Team `ollama serve`	✅ standard	❌ not on Agent path

The first two rows decide the stack; tok/s is in the section below and does not drive Claude Code runtime choice.

Core misconception: most people ask the wrong question

Searching Ollama vs MLX and asking “which is faster?” is the wrong frame for M4 Mac Mini local LLM with agents.

For Claude Code local models, the real question is:

Can this model run as production runtime behind an Agent?

If you have not yet settled model size (7B vs 14B) or RAM tier, do that first; the other articles in this series cover those steps.

Most common MLX mistake (conversion killer)

The mistake: seeing MLX edge out Ollama in an Apple Silicon inference benchmark, then wiring MLX behind Claude Code.

Claude Code bottlenecks are usually not raw tok/s:

Stable HTTP serve on :11434
Latency and timeouts under multi-turn tool loops
Context / model tags and team sharing

For Claude Code / Cursor, MLX should not be the runtime backend; wrapping MLX in FastAPI is glue you own — usually costlier than Ollama.

Claude Code runtime layers (choice is only in the middle)

Claude Code local model setup is not “pick Ollama or MLX once”; it is three layers:

Layer	What it is	Choose here?
Application	Claude Code, Cursor, Agent tool loop	No — you work here
Runtime	Ollama (only recommendation in this article) · HTTP `:11434`	Yes — locked to Ollama
Compute	MLX · offline benchmark / CI / research	No — not on Claude Code main path

Selection happens at runtime, not compute. For Claude Code readers, faster MLX does not change the runtime answer.

Fig. 1 · App (Claude Code) → runtime (Ollama) → hardware (Apple Silicon); MLX stays on the offline branch, off the Agent main path.

How Claude Code hooks a local model (hands-on)

Default stack for Claude Code local models: ANTHROPIC_BASE_URL → local Ollama :11434. MLX is not in this chain.

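# Install Ollama
brew install ollama
# Start the server; it stays in the foreground and listens on :11434
ollama serve

# In a second terminal: fetch the 7B coder model (pull talks to the running server)
ollama pull qwen2.5-coder:7b

# Point Claude Code at the local server
export ANTHROPIC_BASE_URL=http://127.0.0.1:11434
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=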

Claude Code then talks to your local model. Or use ollama launch claude --model qwen2.5-coder:7b. Billing and team setup → Claude Code + Ollama field test.
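Before launching Claude Code, it can help to confirm that something is actually listening on :11434 and that the tag you pulled is visible. The sketch below is one way to do that in Python against Ollama's `GET /api/tags` endpoint, which lists local models. The helper names (`model_names`, `has_model`, `fetch_tags`) and the 3-second timeout are illustrative assumptions, not part of any official tooling.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # same address as ANTHROPIC_BASE_URL above


def model_names(tags_payload: dict) -> list[str]:
    """Extract model tags from the JSON body returned by GET /api/tags."""
    return [m["name"] for m in tags_payload.get("models", [])]


def has_model(tags_payload: dict, tag: str) -> bool:
    """True if the exact tag (e.g. 'qwen2.5-coder:7b') is present locally."""
    return tag in model_names(tags_payload)


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 3.0) -> dict:
    """Query the running Ollama server; raises URLError if nothing listens."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)
```

With `ollama serve` running, `has_model(fetch_tags(), "qwen2.5-coder:7b")` should return `True` once the pull has finished. A connection error here usually means the server is not up, which is worth ruling out before debugging Claude Code itself.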

16GB M4 Mac Mini: strongly recommended

Use a 7B tag (e.g. qwen2.5-coder:7b)
Do not keep 14B + Chrome + IDE pinned at the RAM ceiling
Switching to MLX will not “fix swap” for agents

Why MLX can be faster (does not change runtime choice)

Numbers below explain Ollama vs MLX in Apple Silicon inference benchmarks only — they do not change “Claude Code → Ollama only.”

MLX: direct Metal kernels + array ops
Ollama: llama.cpp + HTTP + service overhead

MLX skips a service shell — but the gap is usually small:

16GB: ~0%–5%
24GB: ~5%–8%
48GB: ~8%–12%

Note: Ranges are Macstripe Lab trends (Llama-3.1-8B 4-bit), not for stack selection. 16GB Ollama ~27–31 vs MLX ~28–32 tok/s; 48GB Ollama ~72–78 vs MLX ~80–88 tok/s. Methodology in the hub lab article.
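To see how the tok/s figures in the note map onto the percentage ranges, here is a small worked calculation. It takes the midpoint of each quoted range, which is a simplifying assumption of this sketch and not necessarily how the lab computed its trends.

```python
def midpoint(lo: float, hi: float) -> float:
    """Middle of a quoted tok/s range."""
    return (lo + hi) / 2


def relative_gap_pct(ollama_tps: float, mlx_tps: float) -> float:
    """MLX speed-up over Ollama, as a percentage of Ollama throughput."""
    return (mlx_tps - ollama_tps) / ollama_tps * 100


# tok/s ranges quoted in the note above (Llama-3.1-8B 4-bit).
RANGES = {
    "16GB": {"ollama": (27, 31), "mlx": (28, 32)},
    "48GB": {"ollama": (72, 78), "mlx": (80, 88)},
}

for tier, r in RANGES.items():
    gap = relative_gap_pct(midpoint(*r["ollama"]), midpoint(*r["mlx"]))
    print(f"{tier}: MLX ahead by {gap:.1f}% at range midpoints")
```

At the midpoints this works out to roughly 3.4% on 16GB and 12.0% on 48GB. On 16GB that is about one token per second, which is the scale of difference the runtime decision is being asked to ignore.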

Why Claude Code has only one runtime path

In daily M4 Mac Mini local LLM work, Claude Code cares about more than generation speed:

Metric	Importance for agents
tok/s	low
API stability	high
tool loop latency	critical
ops (pull / serve / share)	critical

How the Agent plugs in matters far more than a few percent of inference speed.
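One way to make "tool loop latency" concrete is to look at where a single turn spends its time. Ollama's native API responses report `total_duration`, `load_duration`, `prompt_eval_duration`, `eval_duration` (all in nanoseconds) and `eval_count`. The sketch below splits those into seconds; the function name and the sample numbers are made up for illustration.

```python
NS_PER_S = 1_000_000_000  # Ollama reports durations in nanoseconds


def turn_breakdown(resp: dict) -> dict:
    """Split one Ollama response into load / prompt / generation seconds."""
    load_s = resp.get("load_duration", 0) / NS_PER_S
    prompt_s = resp.get("prompt_eval_duration", 0) / NS_PER_S
    gen_s = resp.get("eval_duration", 0) / NS_PER_S
    total_s = resp.get("total_duration", 0) / NS_PER_S
    # Generation throughput only; it ignores model load and prompt processing.
    tok_s = resp.get("eval_count", 0) / gen_s if gen_s else 0.0
    return {
        "load_s": load_s,
        "prompt_s": prompt_s,
        "gen_s": gen_s,
        "total_s": total_s,
        "tok_s": tok_s,
    }


# Hypothetical turn: 0.5 s load, 1.5 s prompt processing, 4 s generating 120 tokens.
example = {
    "total_duration": 6_000_000_000,
    "load_duration": 500_000_000,
    "prompt_eval_duration": 1_500_000_000,
    "eval_count": 120,
    "eval_duration": 4_000_000_000,
}
print(turn_breakdown(example))
```

In this invented example a third of the wall-clock time is not generation at all, so a few percent more tok/s would shave only a fraction of a second off the turn. Real proportions depend on context length and whether the model is already loaded.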

A real failure on 16GB (M4 Mac Mini local LLM)

Typical Claude Code local model load: Claude Code · Ollama qwen2.5-coder:14b · Chrome (~15 tabs) · VS Code (m4-16gb-lab-01, 2026-05-28).

Memory pressure: red
swap: 8GB+
tok/s: ~28–31 → single digits
Claude Code: timeouts

Takeaway: not MLX vs Ollama — wrong RAM tier and model size. → 7B vs 14B
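The swap figure in a failure like this can be checked from a script instead of Activity Monitor. On macOS, `sysctl vm.swapusage` prints a line with `total`, `used` and `free` values; the sketch below parses the `used` field. The 1024 MB warning threshold is an arbitrary assumption for illustration, not a measured limit.

```python
import re
import subprocess


def swap_used_mb(sysctl_line: str) -> float:
    """Parse the 'used = ...' field of `sysctl vm.swapusage` output into MB."""
    match = re.search(r"used\s*=\s*([\d.]+)([KMG])", sysctl_line)
    if match is None:
        raise ValueError(f"unrecognised vm.swapusage output: {sysctl_line!r}")
    value, unit = float(match.group(1)), match.group(2)
    return value * {"K": 1 / 1024, "M": 1, "G": 1024}[unit]


def swap_is_heavy(sysctl_line: str, limit_mb: float = 1024.0) -> bool:
    """True when swap use exceeds the (hypothetical) warning threshold."""
    return swap_used_mb(sysctl_line) > limit_mb


def read_swap_line() -> str:
    """macOS only: return the raw `sysctl vm.swapusage` line."""
    result = subprocess.run(
        ["sysctl", "vm.swapusage"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On the Mac itself, `swap_is_heavy(read_swap_line())` returning `True` while a 14B tag is loaded is a hint to drop to 7B or move to a larger RAM tier, in line with the takeaway above.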

2026 model combos (practical)

Scenario	Pick
Claude Code	`qwen2.5-coder:7b`
General agent	Qwen3 8B (`ollama pull qwen3:8b`)
Reasoning	DeepSeek-R1 distill
Benchmark baseline	Llama 3.1 8B

Team M4 Mac Mini local LLM: one host runs ollama serve, everyone points Claude Code at the same :11434; MLX only for overnight benchmarks, not Agents.
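For the shared-host setup, each teammate needs the same three variables as in the single-machine case, with the host swapped in. The sketch below generates them; `mini-01.local` is a hypothetical hostname and the helper names are illustrative. Note that Ollama binds to loopback by default, so the host machine would also need its bind address widened (the `OLLAMA_HOST` environment variable controls this; check the Ollama documentation for your version).

```python
import shlex


def claude_code_env(host: str = "127.0.0.1", port: int = 11434) -> dict[str, str]:
    """Environment variables that point Claude Code at an Ollama server."""
    return {
        "ANTHROPIC_BASE_URL": f"http://{host}:{port}",
        "ANTHROPIC_AUTH_TOKEN": "ollama",
        "ANTHROPIC_API_KEY": "",
    }


def as_exports(env: dict[str, str]) -> str:
    """Render the variables as shell `export` lines for a team README."""
    return "\n".join(f"export {key}={shlex.quote(value)}" for key, value in env.items())


# Hypothetical team host name; replace with your own machine.
print(as_exports(claude_code_env("mini-01.local")))
```

Keeping the port at 11434 for everyone means the only thing that differs between solo and team use is the hostname.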

Final runtime rule (this article’s spec)

For Claude Code / Cursor / standard Agent tool loops, local models should use an HTTP inference runtime — in this article, Ollama. MLX is compute-layer tooling without a turnkey Agent runtime; MLX + custom HTTP is not listed as recommended Claude Code architecture here.

Engineering carve-out: Rare custom Agent platforms (not Claude Code / Cursor, own gateway and SLA) may use MLX + custom HTTP — feasible, out of scope, and does not change the Claude Code conclusion above.

FAQ

Can I wrap MLX in HTTP for Claude Code?

Not recommended for Claude Code. You can build a gateway, but you own compatibility, model management, and uptime — usually worse than Ollama. Custom Agent platforms (not Claude Code) are a separate case; see the engineering carve-out.

Can MLX replace Ollama?

Not under this article’s rules. Claude Code / Cursor runtime is Ollama; MLX is for offline benchmarks and does not replace Ollama on the Agent path.

Is Ollama slower?

With no swap, same model, Ollama may trail by a few percent in benchmarks. In daily coding and agents you rarely feel it — bottlenecks are wiring and memory, not the framework name.

24GB vs 48GB?

24GB: 7B / 8B, solo or light agents. 48GB: 14B, shared team, longer num_ctx. Upgrading RAM usually beats flipping Ollama ↔ MLX.

When is MLX mandatory?

Only for benchmarks, CI regression, research scripts (nightly tok/s, new quants, mlx-community checks). Not in Claude Code runtime; you can install MLX on the same Mac, but Agents still hit Ollama only.
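A nightly tok/s regression check of the kind mentioned here can be very small. The sketch below compares the median of a night's runs against a stored baseline; the 5% tolerance, the function names and the sample numbers are all assumptions for illustration, and how the tok/s values are measured (MLX script or otherwise) is left out.

```python
from statistics import median


def regressed(baseline_tps: float, current_tps: float, tolerance_pct: float = 5.0) -> bool:
    """True if current throughput falls below baseline minus the tolerance."""
    floor = baseline_tps * (1 - tolerance_pct / 100)
    return current_tps < floor


def nightly_verdict(baseline_tps: float, runs_tps: list[float], tolerance_pct: float = 5.0) -> str:
    """Use the median of several runs so one noisy run does not fail the job."""
    current = median(runs_tps)
    return "FAIL" if regressed(baseline_tps, current, tolerance_pct) else "PASS"


# Made-up numbers: baseline 84 tok/s, three runs from tonight's job.
print(nightly_verdict(84.0, [83.1, 85.0, 82.4]))
```

A CI job could exit non-zero on `"FAIL"`. None of this touches the Agent path: Claude Code keeps talking to Ollama while the benchmark runs offline.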

Decision path (wrap-up)

See final runtime rule. Setup → Claude Code + Ollama (Step ④).

Memory tier (logical follow-on, not opinion)

Premise (locked): runtime = Ollama (Claude Code local model spec)

Load: Claude Code (Ollama runtime) + 14B + multi-user tool loops + ollama serve

Only bottleneck: unified memory — weights + KV + context + IDE/browser in one pool

Conclusion: you need a 24GB / 48GB dedicated M4 inference node. The fix is not “pick MLX instead”; the problem is too little RAM.

16GB fits solo 7B + local Ollama; at 14B with a shared team, the variable collapses from “Ollama vs MLX” to “do we have enough unified memory?” → Macstripe 24GB/48GB + ollama serve cluster.
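The "one pool" argument can be checked with back-of-envelope arithmetic: quantized weights plus KV cache plus whatever the IDE and browser hold must fit in unified memory. The sketch below uses the standard KV-cache size formula (two tensors per layer, one each for keys and values). The layer count, KV-head count, head dimension, bits per weight and the 6 GB allowance for other apps are illustrative assumptions, not confirmed specifications of any particular model, and macOS does not make all unified memory available to the GPU, so real headroom is smaller.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                num_ctx: int, bytes_per_value: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) per layer, per token of context."""
    return 2 * layers * kv_heads * head_dim * num_ctx * bytes_per_value / 1e9


def fits(total_ram_gb: float, model_gb: float, kv_gb: float, other_apps_gb: float) -> bool:
    """True if weights + KV cache + other apps fit in unified memory."""
    return model_gb + kv_gb + other_apps_gb <= total_ram_gb


# Hypothetical 14B-class model: 4.5 bits/weight, 48 layers, 8 KV heads,
# head_dim 128, 32k context, fp16 cache, 6 GB for IDE + browser.
model = weights_gb(14, 4.5)
kv = kv_cache_gb(layers=48, kv_heads=8, head_dim=128, num_ctx=32768)
for ram in (16, 24, 48):
    print(f"{ram}GB: fits={fits(ram, model, kv, other_apps_gb=6.0)}")
```

Under these assumptions the total is a little over 20 GB, which does not fit in 16GB regardless of whether Ollama or MLX does the inference. That is the sense in which the variable collapses to memory.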

Get 24GB / 48GB dedicated M4 (Ollama serve cluster) → · cluster topology

Bottom line

On Apple Silicon, Ollama vs MLX is not a symmetric choice: Agents use Ollama, benchmarks use MLX; what actually blocks you on M4 Mac Mini local LLM is memory tier and model size.
