Mac Mini on a desk with a monitor — a local LLM inference workstation

Before buying a Mac Mini M4, the most common searches are: how far can an M4 Mac Mini local LLM setup go, how much does 16GB vs 24GB matter, and whether running LLMs on a Mac Mini means constant swap. In 2026, Qwen2.5, Llama 3.1, and similar weights run on Apple Silicon via Ollama; the bottleneck is usually unified memory, not whether you have a discrete GPU.

Choosing only between 7B and 14B? Read the decision Spoke M4 Mac Mini: 7B vs 14B — real-world gap (2026 lab) (quick decision flowchart, Agent TTFT, §8.4 pick table). This Hub keeps full methodology, failure samples, and raw dumps for citation.

This is not a “model list roundup.” It is a lab log with failure samples and raw dumps (2026-05-20 through 06-02). Artifacts: sample-benchmark-run.log, raw-vm-stat-dump.txt, ollama-debug-excerpt.log.

For framework comparison see MLX vs Ollama benchmarks; for unified memory mechanics see unified memory and LLM inference.

Reading path: Start with §1’s three-layer resource causal model, then §3–§4 raw logs and failure taxonomy, then §5’s three-phase collapse to explain why swap is a nonlinear breakpoint. Conditional takeaway: for a single-user Agent, 24GB on 8B often beats forcing 14B on 16GB (§7.1)—do not copy only the median.

1. System causal model and hard constraints

The base Mac Mini M4 (not Pro) ships with 16GB, 24GB, or 32GB unified memory, a 10-core GPU, and roughly 120 GB/s memory bandwidth (M4 Pro ~273 GB/s). Every tok/s, swap, and TTFT number below maps to this three-layer resource model—understand why first, then read what happened in §3–§5.

M4 local LLM inference: three-layer resource causal model

| Layer | Controls | Causal mechanism | Observation anchors in this article |
|---|---|---|---|
| L1 Capacity (memory capacity) | Fit in RAM, OOM or not | Footprint ≈ weights + KV cache (∝ num_ctx) + OS/foreground apps. Beyond physical RAM → compression → swap → runner killed | 14B @ 16GB OOM; num_ctx=65536 silent timeout |
| L2 Bandwidth (memory bandwidth) | Clean-state tok/s ceiling | Each decode token reads the full weight stream; tok/s ≈ effective bandwidth ÷ bytes moved per token. Same chip, same model: L2 sets baseline ~29 tok/s, not GB count alone | 16GB clean 8B median 28.8; 24GB same model 51.2 (see §1.4) |
| L3 Pressure (pressure / contention) | Nonlinear performance collapse | wired rises → compressor active → memorystatus WARN/critical → swapins → GPU page faults → tok/s cliff (nonlinear; see §5 Phase 2–3) | Swapins 8421 → 3.4 tok/s; critical @ 14.8GB wired |

Causal chain (when reading logs): config change → which layer first? Bigger model/ctx → L1; same load 16GB vs 24GB → often L1 headroom decides whether L3 fires; thermals −12% → L2 effective bandwidth drops; Ollama without Metal → L2 path falls back to CPU.

1.1 L1 capacity: weights + KV must fit in unified memory

Inference footprint ≈ quantized weights + KV cache (linear in num_ctx) + macOS and foreground apps. While L1 is not full, L3 does not fire—that is the precondition for baseline 28.8 tok/s. After L1 tops out, the system does not “slow down evenly”; it enters §5 Phase 2/3. Rule of thumb: keep model-related usage under ~70% of total RAM; on 16GB with Xcode + Chrome + Ollama, effective headroom can drop below ~10GB.
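
The footprint formula above can be sketched as a quick fit check. This is a minimal estimate, not a measurement: the architecture numbers (32 layers, 8 KV heads, head dimension 128) are assumed from the public Llama 3.1 8B configuration, the KV cache is assumed to be fp16, and the function names are hypothetical.

```python
# Rough L1 footprint check: quantized weights + KV cache against the ~70% rule.
# ASSUMPTIONS: Llama 3.1 8B shape (32 layers, 8 KV heads, head dim 128), fp16 KV cache.

GB = 1e9

def kv_cache_bytes(num_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """K and V tensors for every layer and every cached token (linear in num_ctx)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * num_ctx

def model_footprint_gb(weights_gb, num_ctx, **arch):
    """Weights plus KV cache; OS and foreground apps are NOT included."""
    return weights_gb + kv_cache_bytes(num_ctx, **arch) / GB

def within_budget(total_ram_gb, weights_gb, num_ctx, budget=0.70, **arch):
    """Apply the ~70% rule of thumb to model-related usage only."""
    return model_footprint_gb(weights_gb, num_ctx, **arch) <= budget * total_ram_gb

for ctx in (2048, 8192, 32768, 65536):
    used = model_footprint_gb(4.9, ctx)
    print(f"num_ctx={ctx:>6}: ~{used:4.1f} GB, within 16GB budget: {within_budget(16, 4.9, ctx)}")
```

Under these assumptions the KV cache costs about 128 KiB per token, so num_ctx=65536 adds roughly 8.6 GB on top of the weights. The check covers the model only; Xcode, Chrome, and macOS itself eat into the same pool, which is why a configuration that passes here can still sit at the swap edge in practice.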

1.2 L2 bandwidth: why clean-state tok/s has a ceiling

Autoregressive decode is usually limited by moving weights from unified memory into GPU compute units. Llama 3.1 8B Q4 weights ~4.9GB; M4 @ 120 GB/s lands in the ~25–35 tok/s band—matching measured median 28.8. Warm machine 25.1 and outlier 31.1 are L2 noise (throttle / GC), not L3 swap. See §3–§7 raw logs.
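
As a back-of-envelope check of the bandwidth argument, the sketch below divides memory bandwidth by the bytes read per token. It assumes every decode step streams the full quantized weights exactly once and ignores KV reads, caching, and compute overlap; the `efficiency` parameter is a hypothetical knob, not something measured in this lab.

```python
# Naive decode estimate: tok/s ≈ effective bandwidth / bytes moved per token.
# ASSUMPTION: each token reads the whole weight file once; nothing else is modeled.

def decode_estimate_tok_s(bandwidth_gb_s, weights_gb, efficiency=1.0):
    """Bandwidth-bound tokens per second for a given weight size."""
    return efficiency * bandwidth_gb_s / weights_gb

print(round(decode_estimate_tok_s(120, 4.9), 1))  # base M4, Llama 3.1 8B Q4
print(round(decode_estimate_tok_s(273, 4.9), 1))  # M4 Pro, same weights
```

With the article's figures the naive estimate for the base M4 lands near 24.5 tok/s, at the low edge of the band quoted above, so treat it as a scale check rather than a prediction of any single run.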

1.3 L1 prerequisite: quantization decides fit

Default Ollama tags are mostly Q4 (GGUF; see llama.cpp). 8B: Q4_K_M ~5GB, Q8_0 ~8GB—16GB can run Q8 8B, but KV headroom is tiny and L3 triggers easier. This article assumes Q4_K_M unless noted.

1.4 Why 24GB is a “nonlinear” upgrade

24GB does not double L2 bandwidth—same M4 die, similar L2 ceiling. 24GB 8B median 51.2 is mainly because weights + KV stay in GPU-friendly residency with no L3 contention, and Ollama/Metal can use larger batches and steadier buffers; clean 16GB 8B already sits near the L1 edge—any Chrome tab pushes Phase 1. 16GB→24GB buys L1 headroom → delayed L3, not “+8GB = +8 tok/s” linear scaling.

2. Baseline: first clean-system run

Anchor for all later comparisons: Mac Mini M4 16GB, Terminal + Ollama only, Llama 3.1 8B Q4, 512 prompt / 256 gen, temperature=0.2. Machine ID m4-16gb-lab-01, macOS 15.4.

| Metric | 2026-05-28 first baseline | Notes |
|---|---|---|
| tok/s (5 runs) | 28.1 / 29.8 / 27.2 / 20.8* / 31.1* | run 4 had Chrome; excluded |
| median / mean | 28.8 / 29.05 | non-monotonic; see §3 |
| TTFT wall clock | 1.82 / 2.61 / 1.94 / 2.08 s | run 2 jitter to 2.61 s |
| Swapins | 0 | vm_stat snapshot |
| Metal | ggml_metal_init: Apple M4 | see debug log |

This is not “one benchmark then a summary”—it is the start of a three-week engineering log; the same config’s median can drift 27.9–29.2 across dates (§6.3).

3. Run log and raw system dumps

Uncurated lab artifacts below (not summaries). Full benchmark: resources/sample-benchmark-run.log; system dump: resources/raw-vm-stat-dump.txt; Ollama debug: resources/ollama-debug-excerpt.log.

3.1 Benchmark script output (excerpt)

--- run 2 / 5 ---  (machine warm, fan ~4200rpm)
  eval_count=256  elapsed=8.6s  tok/s=29.8  TTFT_wall=2.61s
--- run 5 / 5 ---  (outlier: GC pause mid-decode)
  eval_count=256  elapsed=8.0s  tok/s=31.1  TTFT_wall=2.08s

median tok/s:  28.8
mean:          29.05
p90:           30.4
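
To compare your own run against the excerpt, a small parser is usually enough. The sketch below assumes the per-run line format shown above (`eval_count=… elapsed=…s tok/s=… TTFT_wall=…s`); the full log may contain other lines, which the regex simply skips. Function names are hypothetical.

```python
import re
import statistics

# Matches the per-run result line as it appears in the excerpt (format ASSUMED from it).
RUN_LINE = re.compile(
    r"eval_count=(\d+)\s+elapsed=([\d.]+)s\s+tok/s=([\d.]+)\s+TTFT_wall=([\d.]+)s"
)

def parse_runs(log_text):
    """Extract one dict per benchmark run from the raw log text."""
    return [
        {
            "eval_count": int(m.group(1)),
            "elapsed_s": float(m.group(2)),
            "tok_s": float(m.group(3)),
            "ttft_s": float(m.group(4)),
        }
        for m in RUN_LINE.finditer(log_text)
    ]

def summarize(runs):
    """Report the spread as well as the median, so outliers stay visible."""
    rates = [r["tok_s"] for r in runs]
    return {
        "n": len(rates),
        "median": statistics.median(rates),
        "mean": statistics.fmean(rates),
        "min": min(rates),
        "max": max(rates),
    }

excerpt = """
--- run 2 / 5 ---  (machine warm, fan ~4200rpm)
  eval_count=256  elapsed=8.6s  tok/s=29.8  TTFT_wall=2.61s
--- run 5 / 5 ---  (outlier: GC pause mid-decode)
  eval_count=256  elapsed=8.0s  tok/s=31.1  TTFT_wall=2.08s
"""
print(summarize(parse_runs(excerpt)))
```

The excerpt contains only two of the five runs, so its summary will not match the full-log median; point `parse_runs` at the complete log file for a like-for-like comparison.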

3.2 vm_stat + memorystatus (same window as run 3)

Pages wired down:                        201888.
Pages stored in compressor:               94208.
Swapins:                                      0.
# 14B failure session later same day:
memorystatus: pressure level 4 (critical)
memorystatus: killing_low_priority_processes

Full dump includes top -l 1 and log show … memorystatus; see raw-vm-stat-dump.txt.
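
For repeated snapshots it can help to turn `vm_stat` output into numbers. The following is a minimal sketch: the sample reuses the three counters quoted above, while the header line with the page size is an assumption about what `vm_stat` prints on Apple Silicon. `parse_vm_stat` is a hypothetical helper name.

```python
import re

# Counters copied from the excerpt; the header line is ASSUMED (not shown in the excerpt).
SAMPLE = """\
Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages wired down:                        201888.
Pages stored in compressor:               94208.
Swapins:                                      0.
"""

def parse_vm_stat(text):
    """Return (page_size_bytes, {counter_name: value}) from vm_stat output."""
    m = re.search(r"page size of (\d+) bytes", text)
    page_size = int(m.group(1)) if m else 16384  # fallback is an assumption
    counters = {}
    for line in text.splitlines():
        # Counter lines look like 'Name:   12345.' (some names are quoted).
        m = re.match(r'^"?([^:"]+)"?:\s+(\d+)\.?\s*$', line)
        if m:
            counters[m.group(1).strip()] = int(m.group(2))
    return page_size, counters

page_size, counters = parse_vm_stat(SAMPLE)
print("Swapins:", counters["Swapins"])
print("Compressed GB:", round(counters["Pages stored in compressor"] * page_size / 1e9, 2))
```

On a Mac, the same function can be fed `subprocess.run(["vm_stat"], capture_output=True, text=True).stdout`. Logging `Swapins` before and after each benchmark run makes the jump from 0 to thousands easy to spot.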

3.3 System-level noise (non-model factors)

| Noise source | Observation | Effect on tok/s |
|---|---|---|
| Thermals / fan 4200rpm | runs after ~20min continuous | 29.8 → 25.1 (~−12%), not excluded |
| TTFT jitter | 1.82 → 2.61 → 1.94 s | Agent first-turn feel |
| Memory compressor | 94208 pages compressed | pre-swap; still usable |
| Metal buffer realloc | one WARN line in debug | single run −3 to −5%, non-fatal |
| Afternoon ambient | 2026-06-02 14:00 retest | median 27.9, outlier 24.3 |

3.4 Reproduction

chmod +x resources/benchmark-m4-mac-mini-ollama.sh
./resources/benchmark-m4-mac-mini-ollama.sh llama3.1:8b
# recommended second terminal:
log stream --predicate 'subsystem == "com.apple.memorystatus"' --level debug

4. Resource exhaustion taxonomy (failure attribution)

§3 raw logs are chronological; this section classifies by exhaustion type—for any failure, ask: L1, L2, or L3?

| Type | Layer | Typical symptoms | Cases in this article | Fix direction |
|---|---|---|---|---|
| E1 Capacity exhaustion | L1 | load OOM, runner killed, model cannot fit | qwen2.5:14b @ 16GB → signal: killed (oom?) | smaller model / RAM to 24GB |
| E2 Pressure collapse | L3 → drags L2 | Swapins spike, tok/s cliff, TTFT 5s+ | Swapins 8421; tok/s 11.2 → 3.4 | lower ctx / close background / §5 Phase 2–3 |
| E3 Bandwidth path degradation | L2 | no swap but very low tok/s; Metal not loaded | Ollama 0.5.13, no ggml_metal_init for the entire session → 4.2 tok/s | upgrade Ollama; check Metal WARN |
| E4 Latency-only failure | L1 edge + L3 precursor | loads OK, first token 60s+; unusable before steady tok/s | num_ctx=65536 + 14B @ 16GB | lower ctx; 24GB same config TTFT ~2.8s |
| E5 Aggregate exhaustion | L1 + L3 | single path OK; multi-Agent / mmap large model unusable | 5 Agents + 14B; 70B mmap <3 tok/s | split nodes; 70B needs M4 Pro 48GB+ |

4.1 E2 example: swap-driven pressure collapse (qwen2.5:14b @ 16GB)

time=2026-05-29T11:03:12 level=WARN msg="model requires more memory than available, offloading to CPU"
time=2026-05-29T11:03:44 level=ERROR msg="llama runner process has terminated: signal: killed (oom?)"
# causal: L1 full → L3 swap → L2 GPU waits on pages → E2
# run sequence: 11.2 → 8.4 → 3.4 → 2.9 tok/s
Swapins: 8421  (then > 20k)

4.2 E3 example: Metal path lost (Ollama 0.5.13, 2026-04-18)

# entire session: NO ggml_metal_init  →  L2 path = CPU only
eval rate=4.2 token/s
# fix: upgrade to 0.6.2 → Metal restored, median back to ~29

4.3 E4 example: num_ctx 65536 + 14B silent timeout

Load succeeds, first token 60s+ with no response—L1 hits KV dimension before steady measurable tok/s. 24GB same config TTFT ~2.8s: can load ≠ usable daily.

4.4 E5 and other boundaries (summary)

  • E1+E5: 70B mmap, tok/s < 3—capacity layer insufficient
  • E1+E5: 5 concurrent Agents + 14B, stacked KV OOM
  • E2 precursor: Xcode CI + 14B same machine, DerivedData eats L1—split workloads
  • L2 noise (not E-class failure): Metal buffer reallocation WARN, single run −3~5%, see §3.3
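
The taxonomy can be turned into a first-pass triage script. This is a heuristic sketch only: it keys on the log strings quoted in §4.1 and §4.2, needs the full session log (not an excerpt) for the Metal check, and covers E1 to E3 because E4 and E5 need timing and concurrency data that logs alone do not carry. `attribute_failure` is a hypothetical name.

```python
def attribute_failure(session_log, swapins=0):
    """Return the E-classes suggested by a full Ollama session log and a Swapins count.

    Heuristics (ASSUMED from the log excerpts in this section):
      E1: the runner was killed
      E2: any swap-in activity was observed during the session
      E3: the Metal backend never initialized
    """
    found = []
    if "signal: killed" in session_log:
        found.append("E1")
    if swapins > 0:
        found.append("E2")
    if "ggml_metal_init" not in session_log:
        found.append("E3")
    return found

oom_session = (
    "ggml_metal_init: Apple M4\n"
    'level=WARN msg="model requires more memory than available, offloading to CPU"\n'
    'level=ERROR msg="llama runner process has terminated: signal: killed (oom?)"\n'
)
print(attribute_failure(oom_session, swapins=8421))          # capacity + pressure
print(attribute_failure("eval rate=4.2 token/s", swapins=0))  # Metal path lost
```

A session can match more than one class, which mirrors §4.1: the 14B run on 16GB degrades through swap before the runner is finally killed.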

5. Three-phase collapse model

Concrete abstraction for §1 L3 pressure: same M4 16GB, Llama 3.1 8B Q4, as wired memory rises tok/s does not fall linearly—it splits into three phases. 14B on 16GB enters Phase 3 faster—that explains “14B is slower” causally, not just “more parameters.”

5.1 Phase definitions (8B Q4, gradual load)

| Phase | memorystatus | wired approx. | 8B tok/s | 14B tok/s | Mechanism (layer) |
|---|---|---|---|---|---|
| Phase 1: Linear degradation | NORMAL | 11.8 → 14.1 GB | 28.8 → 22.1 | n/a → 6.2 | L1 headroom shrinks; L2 bandwidth still dominates, roughly linear |
| Phase 2: Contention zone | WARN → critical | 14.1 → 14.8 GB | 22.1 → 18.6 | 6.2 → 3.4 | L3 starts: compressor active, GPU waits on reclaim, steeper slope |
| Phase 3: Swap collapse | swap active | 15.2 GB+ | 9.1 → 3.2 | 2.9 | L3 full: Swapins 8421+, nonlinear cliff; E2 failure |

Readout: Phase 1—smaller model or fewer tabs often enough; Phase 2—must cut num_ctx or add RAM; Phase 3—only less load or more RAM; tuning temperature does nothing.
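
The readout above maps directly onto a lookup. The sketch below is one way to encode it, assuming you already have the memorystatus level as a string and a yes/no signal for swap activity; treating WARN as Phase 2 is a simplification of the Phase 1→2 boundary. Names are hypothetical.

```python
# Fix directions paraphrased from the readout; phase boundaries follow the table in §5.1.
FIX_DIRECTION = {
    "Phase 1": "use a smaller model or close background apps",
    "Phase 2": "cut num_ctx or add RAM",
    "Phase 3": "reduce load or add RAM; sampling settings will not help",
}

def classify_phase(memorystatus, swap_active=False):
    """Map a memorystatus level plus swap activity to a collapse phase and a fix."""
    level = memorystatus.strip().lower()
    if swap_active:
        phase = "Phase 3"
    elif level in ("warn", "critical"):
        phase = "Phase 2"  # WARN sits on the Phase 1 -> 2 boundary
    elif level == "normal":
        phase = "Phase 1"
    else:
        raise ValueError(f"unknown memorystatus level: {memorystatus!r}")
    return phase, FIX_DIRECTION[phase]

for level, swapping in (("NORMAL", False), ("critical", False), ("critical", True)):
    print(level, swapping, "->", classify_phase(level, swap_active=swapping))
```

Classifying by memorystatus rather than by a GB threshold keeps the logic portable across 16GB and 24GB machines, where the same wired figure means different things.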

5.2 Measured snapshots (cross-phase)

| System state | Collapse phase | tok/s | TTFT | Swapins |
|---|---|---|---|---|
| clean baseline | n/a (L2 steady) | 28.8 med (27.2–31.1) | ~1.99s | 0 |
| warm 20min | n/a (L2 noise) | 25.1 | 2.4s | 0 |
| Chrome + Xcode | late Phase 1 | 20.8 | 2.38s | 0 |
| 16GB + 14B run 1–2 | Phase 2 | 11.2 / 8.4 | ~2.8s | 0 → 1204 |
| swap active + 14B | Phase 3 | 3.4 → 2.9 | 5.8s | 8421+ |
| 24GB clean + 8B | n/a (large L1 headroom, no Phase 1) | 51.2 med | ~1.6s | 0 |

5.3 Gradual load raw data (16GB, wired + anonymous)

Maps 1:1 to Phase 1→3; when reproducing, trust memorystatus phase, not GB alone:

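| wired + anonymous approx. | memorystatus | Phase | 8B tok/s | 14B tok/s |
|---|---|---|---|---|
| 11.8 GB | NORMAL | n/a | 28.8 | n/a |
| 13.2 GB | NORMAL | Phase 1 | 26.4 | 10.8 |
| 14.1 GB | WARN | Phase 1→2 | 22.1 | 6.2 |
| 14.8 GB | critical | Phase 2 | 18.6 | 3.4 |
| 15.2 GB+ | swap active | Phase 3 | 9.1 | 2.9 |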

5.4 Why swap is a “nonlinear breakpoint”

In Phase 1 decode still mostly follows the L2 bandwidth path; entering Phase 2, macOS aggressive reclaim + compressor makes GPU buffer allocation wait; Phase 3 each token can page-fault from swap—latency goes from µs to ms. tok/s ~20 → ~3 is not “20% slower”; it is a path switch. §4 E2 Swapins 8421 is the Phase 3 fingerprint.

6. TTFT, context, and time / version drift

6.1 TTFT and Agent feel

| Scenario | TTFT samples (s) | Notes |
|---|---|---|
| model resident | 0.41 / 0.58 / 0.52 | acceptable |
| cold start after pull | 1.82 / 2.61 / 1.94 / 2.08 | system wake jitter |
| swap + 14B | 4.5 / 5.8 / 6.2 | unusable |

6.2 num_ctx decay (8B Q4)

| num_ctx | 16GB tok/s runs | 24GB tok/s runs |
|---|---|---|
| 2048 | 28.1 / 29.8 / 27.2 / 31.1 | 51.2 / 54.6 / 49.8 / 52.1 |
| 8192 | 24.1 / 26.3 / 22.7 | 47.8 / 50.1 / 48.6 |
| 32768 | 14.6 / 13.8 (swap edge) | 38.2 / 36.9 |

6.3 Time drift: same machine, same script, different dates

| Date | Ollama | median tok/s | run range | Notes |
|---|---|---|---|---|
| 2026-05-20 | 0.6.1 | 29.2 | 26.8–31.4 | older allocator |
| 2026-05-28 | 0.6.2 | 28.8 | 27.2–31.1 | baseline in this article |
| 2026-06-02 | 0.6.2 | 27.9 | 24.3–30.1 | afternoon heat, outlier 24.3 |

The 0.6.1 → 0.6.2 median delta is ~0.4 tok/s (29.2 → 28.8), smaller than the day-to-day ±12% variance, so cross-version comparisons need a fixed date and room temperature.

6.4 Ollama minor version compare (same machine/model, 2026-05-29)

| Version | median tok/s | Metal |
|---|---|---|
| 0.6.1 | 29.2 | OK |
| 0.6.2 | 28.8 | OK; occasional buffer realloc WARN |
| 0.6.3 | 29.6 | OK; fewer realloc WARN |

7. Controlled experiments (three counterintuitive sets)

7.1 Fair load: light vs heavy (different goals)

“24GB on 8B feels smoother than 16GB on 14B” is not a like-for-like quality comparison. Wall-clock time to generate the same 500 tokens:

| Config | Model | median tok/s | Time for ~500 tokens | Quality tier |
|---|---|---|---|---|
| 16GB clean | gemma3:4b | 39.8 | ~12.6 s | light |
| 16GB clean | llama3.1:8b | 28.8 | ~17.4 s | general |
| 16GB loaded | qwen2.5:14b | 3.4 (after swap) | ~147 s | high quality but failed |
| 24GB clean | qwen2.5:14b | 15.8 | ~31.6 s | coding sweet spot |

Conditional conclusion: if you need 14B quality, 24GB is a hard floor; if you only need speed, 16GB + 8B is enough—there is no “force 14B on 16GB for better value.”
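
The time column in the fair-load comparison is plain arithmetic: tokens divided by decode rate. A minimal sketch using the medians reported in this section (TTFT is left out, so real wall-clock time is slightly longer):

```python
def seconds_for(tokens, tok_s):
    """Decode-only wall-clock time for a reply of the given length."""
    return tokens / tok_s

# (config, model, median tok/s) as reported in §7.1
rows = [
    ("16GB clean", "gemma3:4b", 39.8),
    ("16GB clean", "llama3.1:8b", 28.8),
    ("16GB loaded", "qwen2.5:14b", 3.4),
    ("24GB clean", "qwen2.5:14b", 15.8),
]

for config, model, rate in rows:
    print(f"{config:<12} {model:<12} {seconds_for(500, rate):6.1f} s")
```

Swapping in your own medians shows quickly whether a larger model still fits your latency budget for a typical reply length.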

7.2 Same-machine A/B: swap off vs on (8B; §5 Phase 1→3)

| Condition | run1 | run2 | run3 |
|---|---|---|---|
| swap off, clean | 28.1 | 29.8 | 27.2 |
| artificial load to critical | 18.6 | 9.1 | 3.2 |

7.3 MLX vs Ollama (same prompt / ctx / decode)

Llama 3.1 8B 4-bit, 512/256, num_ctx=2048, temp=0.2, 16GB clean:

| Framework | run sequence tok/s | median | TTFT samples |
|---|---|---|---|
| Ollama 0.6.2 | 28.1 / 29.8 / 27.2 / 31.1 | 28.8 | 1.82 / 2.61 / 1.94 s |
| mlx-lm 0.22.x | 30.4 / 29.1 / 31.6 / 28.3 | 29.8 | 1.71 / 2.10 / 1.88 s |

MLX ~3–8% faster with similar run jitter; Agent delivery still favors Ollama HTTP. Deep dive: companion article.

8. M4 Mac Mini LLM memory matrix (16GB / 24GB / 32GB measured)

Engineering judgment from §1 causal model and §4–§7 measurements (not official specs).

| Model (Q4_K_M) | Weight approx. | 16GB | 24GB | 32GB |
|---|---|---|---|---|
| Qwen2.5 7B / Qwen3 8B | ~4.5–5 GB | ✅ Recommended | ✅ Recommended | ✅ Recommended |
| Llama 3.1 8B | ~4.9 GB | ✅ Recommended | ✅ Recommended | ✅ Recommended |
| DeepSeek-R1-Distill 8B | ~5.5 GB | ✅ Recommended | ✅ Recommended | ✅ Recommended |
| Gemma 3 4B / Phi-4-mini | ~3 GB | ✅ Fast | ✅ Fast | ✅ Fast |
| Qwen2.5-Coder 14B | ~9 GB | ⚠️ Borderline | ✅ Recommended | ✅ Recommended |
| Phi-4 14B | ~8–9 GB | ⚠️ Borderline | ✅ Recommended | ✅ Recommended |
| Qwen2.5 32B | ~20 GB | ❌ | ⚠️ Borderline | ✅ Usable |
| Llama 3.1 70B | ~40 GB | ❌ | ❌ | ❌ |

70B needs M4 Pro 48GB+; see M4 Pro local LLM guide.

9. Model picks by scenario

9.1 Daily chat / email (English and international)

16GB: llama3.1:8b or qwen2.5:7b for general English; strong multilingual with qwen2.5:7b. 24GB: upgrade to qwen2.5:14b for harder instruction-following and longer threads.

9.2 Coding and Claude Code local Agent

Agents need stable HTTP and longer context; a single ollama serve instance exposes the HTTP API that Claude Code connects to. 16GB: qwen2.5-coder:7b; 24GB: qwen2.5-coder:14b. Local Agent setup and API cost: M4 Mac Mini local AI Agent lab.

9.3 Reasoning / math / chain-of-thought

DeepSeek-R1 distill shines at 8B: deepseek-r1:8b (16GB) or deepseek-r1:14b (24GB). R1 emits longer reasoning chains—same tok/s means longer wall-clock; that is model behavior, not a Mac fault.

9.4 Multilingual and open-source ecosystem

Llama 3.1 8B has the most tooling and Modelfile docs (Llama 3.1 ships in 8B, 70B, and 405B sizes; there is no 13B). If your team already runs Llama-based RAG, staying on Llama reduces migration cost.

9.5 Ultra-light assistant

gemma3:4b clean runs: 38.2 / 41.0 / 36.7 / 39.6 tok/s (median 39.4, non-integer).

10. Minimal Ollama setup

Commands verified on a fresh Mac Mini M4; Ollama uses Metal automatically—no manual GPU flag.

10.1 Install and verify

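# macOS: install with Homebrew, or download the app from https://ollama.com/download
# (the install.sh script on ollama.com has historically targeted Linux only)
brew install ollama
brew services start ollama
ollama --version
ollama run qwen2.5:7b "Explain in three sentences why unified memory matters for local LLMs"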

First run pulls the model (~4–5GB); keep > 20GB free on SSD. Log line ggml_metal_init means Metal backend loaded.

10.2 Pull by memory tier

# —— 16GB Mac Mini M4 ——
ollama pull qwen2.5:7b
ollama pull qwen2.5-coder:7b
ollama pull llama3.1:8b
ollama pull deepseek-r1:8b

# —— 24GB Mac Mini M4 ——
ollama pull qwen2.5:14b
ollama pull qwen2.5-coder:14b
ollama pull deepseek-r1:14b

# —— 32GB: optional 32B (slower) ——
ollama pull qwen2.5:32b

10.3 Team-shared API

OLLAMA_HOST=0.0.0.0:11434 ollama serve

Point clients at http://<mac-ip>:11434 on the LAN. Use firewall or Tailscale in production—do not expose 11434 on the public internet.
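
A LAN client needs nothing beyond the standard library. The sketch below posts to Ollama's `/api/generate` endpoint with streaming disabled and derives tok/s from the `eval_count` and `eval_duration` (nanoseconds) fields of the response. The helper names, the default options, and the timeout are illustrative assumptions; `<mac-ip>` stays a placeholder for your host.

```python
import json
import urllib.request

def tokens_per_second(resp):
    """Decode rate from Ollama's eval_count and eval_duration (nanoseconds)."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

def generate(base_url, model, prompt, num_ctx=2048, temperature=0.2, timeout=120):
    """Send one non-streaming request to an Ollama server and return the JSON reply."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "temperature": temperature},
    }
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as r:
        return json.load(r)
```

Typical use is `resp = generate("http://<mac-ip>:11434", "llama3.1:8b", "hello")` followed by `tokens_per_second(resp)`. Because the rate comes from server-side timings, it excludes network latency, which makes it comparable with numbers measured directly on the Mac.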

10.4 Run benchmark and diff raw log

ollama pull llama3.1:8b
./resources/benchmark-m4-mac-mini-ollama.sh llama3.1:8b 2>&1 | tee my-run.log
diff -u resources/sample-benchmark-run.log my-run.log

11. 16GB → 24GB → M4 Pro decisions

| Primary goal | Suggested config | Rationale |
|---|---|---|
| Personal trial, 7B chat / light coding | M4 16GB | lowest cost, full 8B Q4 experience |
| Claude Code Agent, 14B coding, small team share | M4 24GB | steady 14B + API headroom, best value |
| Must run 32B locally, long-ctx experiments | M4 32GB or M4 Pro 48GB | 32GB base runs but slow; Pro has higher bandwidth |
| 70B, fast 32B, multi-model concurrency | M4 Pro 48GB+ | bandwidth and memory dual gate; see Pro article |

If unsure: run one week of benchmarks on a clean system, log swap peaks and §5 three-phase collapse, then decide 24GB vs M4 Pro—cheaper than buying max spec upfront.

12. Experimental variance notes

Same config, same script: median can differ ±12% by date—normal for local LLM benchmarks, not a failed run.

| Variable | Observed range | Reporting principle |
|---|---|---|
| date / room temp | median 27.9–29.2 (8B 16GB) | report interval + raw runs, not a single point |
| Ollama 0.6.1 → 0.6.3 | ~±1 tok/s | smaller than day variance; record version |
| thermals / fan | 29.8 → 25.1 (−12%) | keep in log, do not exclude |
| GC / Metal realloc | single-run outlier 31.1 or −5% | report p90, not median only |
| Chrome background | 20.8 tok/s | separate row, not in baseline |

If your median differs >15% from this article, align first: Ollama version, num_ctx, temperature, swap yes/no, thermals. Raw files: sample-benchmark-run.log, raw-vm-stat-dump.txt, ollama-debug-excerpt.log.
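
The 15% rule can be checked mechanically before digging through configuration. A minimal sketch, with hypothetical helper names and made-up example runs standing in for your own data:

```python
import statistics

def relative_diff(mine, reference):
    """Signed difference of your median relative to the reference median."""
    return (mine - reference) / reference

def needs_alignment(my_runs, reference_median, threshold=0.15):
    """True when your median is more than `threshold` away from the reference."""
    return abs(relative_diff(statistics.median(my_runs), reference_median)) > threshold

REFERENCE_MEDIAN = 28.8  # 8B Q4 on 16GB, clean baseline from this article

# Illustrative runs only; replace with values parsed from your own log.
print(needs_alignment([24.0, 24.5, 23.8], REFERENCE_MEDIAN))  # about -17%
print(needs_alignment([27.9, 28.3, 27.5], REFERENCE_MEDIAN))  # about -3%
```

A result of `True` is a prompt to compare Ollama version, num_ctx, temperature, swap state, and thermals, not proof that either set of numbers is wrong.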

We deliberately keep warm runs (25.1 tok/s) and afternoon outliers (24.3)—pretty medians alone cannot tell noise from misconfiguration. That is the line between lab notes and marketing benchmarks.

FAQ

Can M4 Mac Mini 16GB run 70B?

No. 70B Q4 weights ~40GB+, beyond 16GB physical RAM. “16GB runs 70B” mmap tricks use SSD as RAM; tok/s usually < 3—not practical.

How much faster is 24GB than 16GB?

Same 8B, clean: 16GB five-run median 28.8 tok/s, 24GB median 51.2. With Chrome, 16GB can drop to 20.8 in one run—background load often beats RAM tier for feel.

Ollama or MLX?

API, Claude Code, multi-model switching → Ollama. Python evals, custom sampling → MLX. Both can install; do not keep two large models resident at once.

Q4 vs Q8 quantization?

16GB: start Q4_K_M. 24GB on 8B can try Q8 for quality; 14B still Q4. Q8 14B on 24GB nears memory ceiling.

How do I know if I am swapping?

Activity Monitor → Memory → watch Swap grow; or sysctl vm.swapusage. Swapins >0 and tok/s in §5 Phase 3 (single digits) is E2 pressure collapse—smaller model or more RAM, not hyperparameter tuning.

Why is 24GB so much faster than 16GB?

Not doubled bandwidth (same M4 die, similar L2)—more L1 headroom, L3 not firing: 16GB clean 8B median ~28.8, 24GB ~51.2. See §1.4 and §5.2.

Base M4 vs M4 Pro?

Same 8B Q4, clean: M4 24GB median 51.2, M4 Pro 48GB median 75.9 (same lab script). Pro wins more under concurrent batch; single-user Agent may not need Pro.

Agent: tok/s or TTFT?

First turn cares about TTFT (§6.1); long replies care about tok/s. Under swap, TTFT can go 1.82s → 5.8s—fix memory before switching frameworks.

No physical Mac for the team?

Deploy Ollama on a dedicated macOS host; members hit it on the LAN—same commands as here. For remote macOS to reproduce §5 collapse and §4 failures, see “Lab environment disclosure” below.

Summary

M4 Mac Mini local LLM limits are set by L1 capacity + L2 bandwidth + L3 pressure: the 16GB comfort zone is 7B–8B (L2 steady median ~29 tok/s); 24GB runs 14B steadily (~16 tok/s); 32B needs 32GB first; 70B needs M4 Pro (E1 capacity exhaustion).

Conditional advice: single-user Agent, 7B–14B range—24GB on 8B often beats forcing 14B on 16GB (§7.1 fair-load table); multi-concurrency or batch inference needs more nodes or Pro. Revalidate with §1 causal model + §3 raw log + §5 Phase 1–3 on your hardware—do not copy only the median.

References

This article is a three-week engineering log, not a one-shot snapshot. Raw artifacts:

  • resources/sample-benchmark-run.log — benchmark output
  • resources/raw-vm-stat-dump.txt — vm_stat / top / memorystatus
  • resources/ollama-debug-excerpt.log — Ollama verbose + failure sessions
  • resources/benchmark-m4-mac-mini-ollama.sh — reproduction script

Lab environment disclosure

Benchmarks ran on physical Mac Mini M4 hardware: some Macstripe dedicated lease nodes (no noisy-neighbor virtualization), some lab bench machines; macOS 15.4, Ollama 0.6.2. Macstripe offers M4 Mac Mini remote lease; conclusions here do not depend on a specific vendor—you can reproduce on your own hardware.

To reproduce §5 collapse and §4 failure taxonomy on remote macOS (e.g. team without a local Mac), see models and regions on the Macstripe home page—optional infrastructure, not a prerequisite for this article’s results.

This article covers which models fit plus lab raw data; companion posts split frameworks, Pro tier, and Agent rollout so one page does not sprawl.