10-second verdict (runbook)
Claude Code always uses Ollama (:11434). MLX is for offline benchmarks and model validation only.
Under this article’s scope (Claude Code / Cursor local models), the choice happens at the runtime layer — default and recommended: Ollama. There is no third “official path.”
When people search Ollama vs MLX, they are usually asking about the runtime layer; for Claude Code local model setup, the answer is below.
Non-negotiable conclusion
Claude Code has one default stack: Ollama. MLX does not belong in Agent runtime selection in this article — only in benchmark, CI, and research.
- Runtime (Claude Code / local API) → Ollama only
- Offline (benchmark / CI / research) → MLX
- M4 Mac Mini local LLM · 16GB → start with the 7B tier, then run ollama serve
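A quick way to confirm the runtime is actually listening before pointing an Agent at it. This is a minimal sketch, assuming curl is installed; the function name ollama_ready is hypothetical, and /api/version is used here as a lightweight Ollama endpoint.

```shell
# Readiness probe for a local Ollama server (sketch, not part of the original runbook).
# Assumes curl is available; the default URL matches the :11434 port used in this article.
ollama_ready() {
  local base_url="${1:-http://127.0.0.1:11434}"
  # --fail turns HTTP errors into a non-zero exit; --max-time bounds the wait to 2 seconds.
  curl --silent --fail --max-time 2 "${base_url}/api/version" >/dev/null 2>&1
}

# Usage (not run here):
#   ollama_ready && echo "Ollama is up" || echo "start it with: ollama serve"
```

The probe only answers "is the HTTP server reachable"; it says nothing about which model tags are pulled or whether the machine is swapping.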
At-a-glance (single comparison table)
For M4 Mac Mini local LLM and Claude Code local models (Ollama vs MLX), start with the two fatal rows — MLX is not an “alternate runtime”; it must not sit on the Agent path.
| Dimension | Ollama | MLX |
|---|---|---|
| Agent runtime (Claude Code / Cursor / tool loop) | ✅ | ❌ |
| Zero glue for Claude Code | ✅ :11434 | ❌ custom gateway required |
| Built-in HTTP inference | ✅ | ❌ (custom gateway not recommended here) |
| tok/s (8B, no swap) | baseline | ~+3% to +12% (offline reference only) |
| Team ollama serve | ✅ standard | ❌ not on Agent path |
The first two rows decide the stack; tok/s is in the section below and does not drive Claude Code runtime choice.
Core misconception: most people ask the wrong question
Searching Ollama vs MLX and asking “which is faster?” is the wrong frame for M4 Mac Mini local LLM with agents.
For Claude Code local models, the real question is:
Can this model run as production runtime behind an Agent?
Wrong step in the funnel (7B/14B, RAM tier first)? Use the series path at the top.
Most common MLX mistake (conversion killer)
Seeing MLX edge out Ollama in an Apple Silicon inference benchmark and then wiring it behind Claude Code is a mistake.
Claude Code bottlenecks are usually not raw tok/s:
- Stable HTTP serve on :11434
- Latency and timeouts under multi-turn tool loops
- Context / model tags and team sharing
For Claude Code / Cursor, MLX should not be the runtime backend; wrapping MLX in FastAPI is glue you own — usually costlier than Ollama.
Claude Code runtime layers (choice is only in the middle)
Claude Code local model setup is not “pick Ollama or MLX once”; it is three layers:
| Layer | What it is | Choose here? |
|---|---|---|
| Application | Claude Code, Cursor, Agent tool loop | No — you work here |
| Runtime | Ollama (only recommendation in this article) · HTTP :11434 | Yes — locked to Ollama |
| Compute | MLX · offline benchmark / CI / research | No — not on Claude Code main path |
Selection happens at runtime, not compute. For Claude Code readers, faster MLX does not change the runtime answer.
How Claude Code hooks a local model (hands-on)
Default stack for Claude Code local models: ANTHROPIC_BASE_URL → local Ollama :11434. MLX is not in this chain.
brew install ollama
ollama pull qwen2.5-coder:7b
ollama serve
export ANTHROPIC_BASE_URL=http://127.0.0.1:11434
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=
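ollama serve stays in the foreground, so run the exports in a second terminal; Claude Code started from that terminal then talks to your local model. Alternatively, use ollama launch claude --model qwen2.5-coder:7b. Billing and team setup → Claude Code + Ollama field test.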
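Before launching, it can help to confirm the tag you plan to use is already pulled. The sketch below parses the output of ollama list from stdin; has_model is a hypothetical helper name, and it assumes the listing has one header row followed by rows whose first column is the model name with its tag.

```shell
# Check whether a model tag appears in `ollama list` output (sketch).
# Reads the listing on stdin so the parsing can be tried without a running server.
has_model() {
  local tag="$1"
  # Skip the header row; compare the first column (name:tag) exactly.
  awk -v tag="$tag" 'NR > 1 && $1 == tag { found = 1 } END { exit(found ? 0 : 1) }'
}

# Usage (not run here):
#   ollama list | has_model qwen2.5-coder:7b || ollama pull qwen2.5-coder:7b
```

The exact match is deliberate: qwen2.5-coder:7b and qwen2.5-coder:14b are different tags, and a prefix match would confuse them.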
16GB M4 Mac Mini: strongly recommended
- Use a 7B tag (e.g. qwen2.5-coder:7b)
- Do not keep 14B + Chrome + IDE pinned at the RAM ceiling
- Switching to MLX will not “fix swap” for agents
Why MLX can be faster (does not change runtime choice)
Numbers below explain Ollama vs MLX in Apple Silicon inference benchmarks only — they do not change “Claude Code → Ollama only.”
- MLX: direct Metal kernels + array ops
- Ollama: llama.cpp + HTTP + service overhead
MLX skips a service shell — but the gap is usually small:
- 16GB: ~0%–5%
- 24GB: ~5%–8%
- 48GB: ~8%–12%
Note: Ranges are Macstripe Lab trends (Llama-3.1-8B 4-bit), not for stack selection. 16GB Ollama ~27–31 vs MLX ~28–32 tok/s; 48GB Ollama ~72–78 vs MLX ~80–88 tok/s. Methodology in the hub lab article.
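To see how the tok/s figures map onto the percentage bands, the gap can be worked out from the midpoints of the quoted ranges. This is a sketch of the arithmetic only; gap_pct is a hypothetical helper, and taking midpoints is an assumption made here for illustration, not the lab's stated method.

```shell
# Relative throughput gap in percent: (candidate - baseline) / baseline * 100.
gap_pct() {
  awk -v base="$1" -v cand="$2" 'BEGIN { printf "%.1f\n", (cand - base) / base * 100 }'
}

# Midpoints of the ranges quoted above (tok/s):
gap_pct 29 30   # 16GB: Ollama ~27-31 vs MLX ~28-32
gap_pct 75 84   # 48GB: Ollama ~72-78 vs MLX ~80-88
```

The two calls print 3.4 and 12.0, which sit inside the ~0%–5% and ~8%–12% bands listed for 16GB and 48GB.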
Why Claude Code has only one runtime path
In daily M4 Mac Mini local LLM work, Claude Code cares about more than generation speed:
| Metric | Importance for agents |
|---|---|
| tok/s | low |
| API stability | high |
| tool loop latency | critical |
| ops (pull / serve / share) | critical |
How the Agent plugs in matters far more than a few percent of inference speed.
A real failure on 16GB (M4 Mac Mini local LLM)
Typical Claude Code local model load: Claude Code · Ollama qwen2.5-coder:14b · Chrome (~15 tabs) · VS Code (m4-16gb-lab-01, 2026-05-28).
- Memory pressure: red
- swap: 8GB+
- tok/s: ~28–31 → single digits
- Claude Code: timeouts
Takeaway: not MLX vs Ollama — wrong RAM tier and model size. → 7B vs 14B
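A failure like this is easier to catch if swap is checked before blaming the framework. The sketch below parses a line in the format printed by sysctl vm.swapusage on macOS; the helper names and the 1024 MB default threshold are hypothetical choices for illustration.

```shell
# Extract used swap in MB from a `sysctl vm.swapusage` line read on stdin (sketch).
swap_used_mb() {
  # Find the "used" field and print the value two fields later, minus the trailing "M".
  awk '{ for (i = 1; i <= NF; i++) if ($i == "used") { v = $(i + 2); sub(/M$/, "", v); print v } }'
}

# Succeeds when used swap exceeds a threshold in MB (default 1024 is an arbitrary cutoff).
swap_over() {
  local limit_mb="${1:-1024}"
  swap_used_mb | awk -v limit="$limit_mb" '{ over = ($1 > limit) } END { exit(over ? 0 : 1) }'
}

# Usage on macOS (not run here):
#   sysctl vm.swapusage | swap_over 1024 && echo "swapping: drop to a 7B tag or close apps"
```

If the check trips with a 14B tag loaded on 16GB, the fix is the model size or the RAM tier, as the takeaway above says.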
2026 model combos (practical)
| Scenario | Pick |
|---|---|
| Claude Code | qwen2.5-coder:7b |
| General agent | Qwen3 8B (ollama pull qwen3:8b) |
| Reasoning | DeepSeek-R1 distill |
| Benchmark baseline | Llama 3.1 8B |
Team M4 Mac Mini local LLM: one host runs ollama serve, everyone points Claude Code at the same :11434; MLX only for overnight benchmarks, not Agents.
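The shared-host pattern can be sketched as follows. OLLAMA_HOST is the Ollama environment variable that controls the bind address; the host name mini-01.local and the helper team_base_url are placeholders that do not come from this article.

```shell
# Build the base URL teammates export (sketch; host name is a placeholder).
team_base_url() {
  local host="$1" port="${2:-11434}"
  printf 'http://%s:%s\n' "$host" "$port"
}

# On the shared host (not run here): bind beyond loopback so other machines can connect.
#   OLLAMA_HOST=0.0.0.0:11434 ollama serve
# On each teammate's machine (not run here):
#   export ANTHROPIC_BASE_URL="$(team_base_url mini-01.local)"
#   export ANTHROPIC_AUTH_TOKEN=ollama
#   export ANTHROPIC_API_KEY=
```

By default the server listens on loopback only, so without the bind override teammates would get connection errors even though ollama serve is running.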
Final runtime rule (this article’s spec)
For Claude Code / Cursor / standard Agent tool loops, local models should use an HTTP inference runtime — in this article, Ollama. MLX is compute-layer tooling without a turnkey Agent runtime; MLX + custom HTTP is not listed as recommended Claude Code architecture here.
Engineering carve-out: Rare custom Agent platforms (not Claude Code / Cursor, own gateway and SLA) may use MLX + custom HTTP — feasible, out of scope, and does not change the Claude Code conclusion above.
FAQ
Can I wrap MLX in HTTP for Claude Code?
Not recommended for Claude Code. You can build a gateway, but you own compatibility, model management, and uptime — usually worse than Ollama. Custom Agent platforms (not Claude Code) are a separate case; see the engineering carve-out.
Can MLX replace Ollama?
Not under this article’s rules. Claude Code / Cursor runtime is Ollama; MLX is for offline benchmarks and does not replace Ollama on the Agent path.
Is Ollama slower?
With no swap, same model, Ollama may trail by a few percent in benchmarks. In daily coding and agents you rarely feel it — bottlenecks are wiring and memory, not the framework name.
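To measure the Ollama side yourself, the generate API response reports eval_count (tokens produced) and eval_duration (nanoseconds), and tok/s follows from those two numbers. A minimal sketch of that arithmetic, with a hypothetical helper name and made-up input values:

```shell
# Generation speed from Ollama response metrics:
# eval_count tokens divided by eval_duration (nanoseconds), scaled to seconds.
toks_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / ns * 1e9 }'
}

# Illustrative values, not measurements: 290 tokens generated in 10 seconds.
toks_per_sec 290 10000000000
```

The example prints 29.0. Compare runs only with the same model, the same quantization, and no swap, or the number mostly reflects memory pressure.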
24GB vs 48GB?
24GB: 7B / 8B, solo or light agents. 48GB: 14B, shared team, longer num_ctx. Upgrading RAM usually beats flipping Ollama ↔ MLX.
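On the larger tier, a longer num_ctx can be set through a Modelfile. The sketch below writes one; write_modelfile, the 16384 value, and the derived model name are illustrative assumptions, not settings recommended by this article.

```shell
# Write a Modelfile that derives a longer-context variant of a base tag (sketch).
write_modelfile() {
  local base_tag="$1" num_ctx="$2" out="$3"
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$base_tag" "$num_ctx" > "$out"
}

# Usage (not run here); the context size and the "-16k" name are illustrative:
#   write_modelfile qwen2.5-coder:14b 16384 Modelfile
#   ollama create qwen2.5-coder-14b-16k -f Modelfile
```

A larger context grows the KV cache, which comes out of the same unified memory pool as the weights, so this belongs on the 48GB tier rather than 16GB.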
When is MLX mandatory?
Only for benchmarks, CI regression, research scripts (nightly tok/s, new quants, mlx-community checks). Not in Claude Code runtime; you can install MLX on the same Mac, but Agents still hit Ollama only.
Decision path (wrap-up)
See final runtime rule. Setup → Claude Code + Ollama (Step ④).
Memory tier (logical follow-on, not opinion)
Premise (locked): runtime = Ollama (Claude Code local model spec)
Load: Claude Code (Ollama runtime) + 14B + multi-user tool loops + ollama serve
Only bottleneck: unified memory — weights + KV + context + IDE/browser in one pool
Conclusion: you need a 24GB / 48GB dedicated M4 inference node. The problem is not “pick MLX instead”; it is not enough RAM.
16GB fits solo 7B + local Ollama; at 14B with a shared team, the variable collapses from “Ollama vs MLX” to “do we have enough unified memory?” → Macstripe 24GB/48GB + ollama serve cluster.
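The memory argument can be made concrete with a rough floor for weight size: parameters times bits per weight, divided by 8. This is a back-of-envelope sketch with a hypothetical helper name; real quantized files are somewhat larger, and the figure excludes KV cache, context, and everything else in the pool.

```shell
# Approximate weight size in GB: parameters (billions) * bits per weight / 8.
# Ignores KV cache, context, runtime overhead, and other apps, so treat it as a floor.
weights_gb() {
  awk -v params_b="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params_b * bits / 8 }'
}

weights_gb 7 4    # 7B at 4-bit
weights_gb 14 4   # 14B at 4-bit
```

The calls print 3.5 and 7.0. On a 16GB machine, a 7 GB floor for 14B weights plus KV cache, an IDE, and a browser is what pushes the pool into swap; switching frameworks does not shrink any of those terms.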
Bottom line
On Apple Silicon, Ollama vs MLX is not a symmetric choice: Agents use Ollama, benchmarks use MLX; what actually blocks you on M4 Mac Mini local LLM is memory tier and model size.