Before buying a Mac Mini M4, the most common searches are: how far can an M4 Mac Mini local LLM setup go, how much does 16GB vs 24GB matter, and whether running LLMs on a Mac Mini means constant swap. In 2026, Qwen2.5, Llama 3.1, and similar weights run on Apple Silicon via Ollama; the bottleneck is usually unified memory, not whether you have a discrete GPU.
Choosing only between 7B and 14B? Read the decision Spoke M4 Mac Mini: 7B vs 14B — real-world gap (2026 lab) (quick decision flowchart, Agent TTFT, §8.4 pick table). This Hub keeps full methodology, failure samples, and raw dumps for citation.
This is not a “model list roundup.” It is a lab log with failure samples and raw dumps (2026-05-20 through 06-02). Artifacts: sample-benchmark-run.log, raw-vm-stat-dump.txt, ollama-debug-excerpt.log.
For framework comparison see MLX vs Ollama benchmarks; for unified memory mechanics see unified memory and LLM inference.
1. System causal model and hard constraints
The base Mac Mini M4 (not Pro) ships with 16GB, 24GB, or 32GB unified memory, a 10-core GPU, and roughly 120 GB/s memory bandwidth (M4 Pro ~273 GB/s). Every tok/s, swap, and TTFT number below maps to this three-layer resource model—understand why first, then read what happened in §3–§5.
M4 local LLM inference: three-layer resource causal model
| Layer | Controls | Causal mechanism | Observation anchors in this article |
|---|---|---|---|
| L1 Capacity (memory capacity) | Fit in RAM, OOM or not | Footprint ≈ weights + KV cache (∝ num_ctx) + OS/foreground apps. Beyond physical RAM → compression → swap → runner killed | 14B @ 16GB OOM; num_ctx=65536 silent timeout |
| L2 Bandwidth (memory bandwidth) | Clean-state tok/s ceiling | Each decode token reads the full weight stream; tok/s ≈ effective bandwidth ÷ bytes moved per token. Same chip, same model: L2 sets baseline ~29 tok/s, not GB count alone | 16GB clean 8B median 28.8; 24GB same model 51.2 (see §1.4) |
| L3 Pressure (pressure / contention) | Nonlinear performance collapse | wired rises → compressor active → memorystatus WARN/critical → swapins → GPU page faults → tok/s cliff (nonlinear; see §5 Phase 2–3) | Swapins 8421 → 3.4 tok/s; critical @ 14.8GB wired |
Causal chain (when reading logs): config change → which layer first? Bigger model/ctx → L1; same load 16GB vs 24GB → often L1 headroom decides whether L3 fires; thermals −12% → L2 effective bandwidth drops; Ollama without Metal → L2 path falls back to CPU.
1.1 L1 capacity: weights + KV must fit in unified memory
Inference footprint ≈ quantized weights + KV cache (linear in num_ctx) + macOS and foreground apps. While L1 is not full, L3 does not fire—that is the precondition for baseline 28.8 tok/s. After L1 tops out, the system does not “slow down evenly”; it enters §5 Phase 2/3. Rule of thumb: keep model-related usage under ~70% of total RAM; on 16GB with Xcode + Chrome + Ollama, effective headroom can drop below ~10GB.
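As a rough illustration of the footprint rule above, the sketch below adds a KV-cache estimate to the quantized weight size and checks the sum against the ~70% budget. The architecture constants (32 layers, 8 KV heads, head dimension 128, as for Llama 3.1 8B) and the 2-byte (f16) KV element size are assumptions that this article does not state, and the function names are illustrative. Real runtimes add compute buffers on top, so treat the result as a lower bound.

```python
def kv_cache_bytes(num_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Approximate KV cache size: keys and values for every layer and KV head."""
    # Factor 2 = one K tensor plus one V tensor per layer (assumed f16 elements).
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * num_ctx


def model_budget(weights_gb, num_ctx, total_ram_gb, budget_fraction=0.70):
    """Return (estimated model footprint in GB, whether it fits the budget)."""
    kv_gb = kv_cache_bytes(num_ctx) / 1e9
    footprint_gb = weights_gb + kv_gb
    return footprint_gb, footprint_gb <= budget_fraction * total_ram_gb


if __name__ == "__main__":
    # 4.9 GB is the Llama 3.1 8B Q4 weight size quoted in this article.
    for ctx in (2048, 8192, 32768, 65536):
        gb, ok = model_budget(weights_gb=4.9, num_ctx=ctx, total_ram_gb=16)
        print(f"num_ctx={ctx:>6}: ~{gb:.1f} GB -> {'fits' if ok else 'over'} 70% budget")
```

Under these assumptions the KV cache grows by about 128 KiB per token of context, which is why a large num_ctx can exhaust L1 even when the weights alone fit comfortably.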
1.2 L2 bandwidth: why clean-state tok/s has a ceiling
Autoregressive decode is usually limited by moving weights from unified memory into GPU compute units. Llama 3.1 8B Q4 weights ~4.9GB; M4 @ 120 GB/s lands in the ~25–35 tok/s band—matching measured median 28.8. Warm machine 25.1 and outlier 31.1 are L2 noise (throttle / GC), not L3 swap. See §3–§7 raw logs.
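The bandwidth argument can be checked with a back-of-envelope calculation: divide memory bandwidth by the bytes read per token (approximated here by the weight size). This is a first-order sketch only; the `efficiency` parameter is a hypothetical knob for derating the nominal bandwidth, not a measured value.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, weights_gb, efficiency=1.0):
    """First-order decode rate: one full pass over the weights per token."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s * efficiency / weights_gb


if __name__ == "__main__":
    # Bandwidth and weight figures are the approximate values quoted in this article.
    print(f"M4     (~120 GB/s): {decode_ceiling_tok_s(120, 4.9):.1f} tok/s")
    print(f"M4 Pro (~273 GB/s): {decode_ceiling_tok_s(273, 4.9):.1f} tok/s")
```

With the article's figures this gives roughly 24.5 tok/s for the base M4, at the low edge of the ~25–35 band, so it works as a sanity check on order of magnitude rather than as a prediction of any single run.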
1.3 L1 prerequisite: quantization decides fit
Default Ollama tags are mostly Q4 (GGUF; see llama.cpp). 8B: Q4_K_M ~5GB, Q8_0 ~8GB—16GB can run Q8 8B, but KV headroom is tiny and L3 triggers easier. This article assumes Q4_K_M unless noted.
1.4 Why 24GB is a “nonlinear” upgrade
24GB does not double L2 bandwidth—same M4 die, similar L2 ceiling. 24GB 8B median 51.2 is mainly because weights + KV stay in GPU-friendly residency with no L3 contention, and Ollama/Metal can use larger batches and steadier buffers; clean 16GB 8B already sits near the L1 edge—any Chrome tab pushes Phase 1. 16GB→24GB buys L1 headroom → delayed L3, not “+8GB = +8 tok/s” linear scaling.
2. Baseline: first clean-system run
Anchor for all later comparisons: Mac Mini M4 16GB, Terminal + Ollama only, Llama 3.1 8B Q4, 512 prompt / 256 gen, temperature=0.2. Machine ID m4-16gb-lab-01, macOS 15.4.
| Metric | 2026-05-28 first baseline | Notes |
|---|---|---|
| tok/s (5 runs) | 28.1 / 29.8 / 27.2 / 20.8* / 31.1 | *run4 had Chrome; excluded |
| median / mean | 28.8 / 29.05 | non-monotonic; see §3 |
| TTFT wall clock | 1.82 / 2.61 / 1.94 / 2.08 s | run2 jitter to 2.61s |
| Swapins | 0 | vm_stat snapshot |
| Metal | ggml_metal_init: Apple M4 | see debug log |
This is not “one benchmark then a summary”—it is the start of a three-week engineering log; the same config’s median can drift 27.9–29.2 across dates (§6.3).
3. Run log and raw system dumps
Uncurated lab artifacts below (not summaries). Full benchmark: resources/sample-benchmark-run.log; system dump: resources/raw-vm-stat-dump.txt; Ollama debug: resources/ollama-debug-excerpt.log.
3.1 Benchmark script output (excerpt)
--- run 2 / 5 --- (machine warm, fan ~4200rpm)
eval_count=256 elapsed=8.6s tok/s=29.8 TTFT_wall=2.61s
--- run 5 / 5 --- (outlier: GC pause mid-decode)
eval_count=256 elapsed=8.0s tok/s=31.1 TTFT_wall=2.08s
median tok/s: 28.8
mean: 29.05
p90: 30.4
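To summarize your own log the same way, a small parser is enough. The sketch below assumes every run line follows the key=value layout shown in the excerpt above; the function names are illustrative and are not part of the article's benchmark script.

```python
import re
import statistics

# Matches lines such as:
# eval_count=256 elapsed=8.6s tok/s=29.8 TTFT_wall=2.61s
RUN_LINE = re.compile(
    r"eval_count=(?P<count>\d+)\s+elapsed=(?P<elapsed>[\d.]+)s\s+"
    r"tok/s=(?P<tps>[\d.]+)\s+TTFT_wall=(?P<ttft>[\d.]+)s"
)


def parse_runs(log_text):
    """Extract per-run counters from benchmark output."""
    runs = []
    for match in RUN_LINE.finditer(log_text):
        runs.append({
            "eval_count": int(match["count"]),
            "elapsed_s": float(match["elapsed"]),
            "tok_s": float(match["tps"]),
            "ttft_s": float(match["ttft"]),
        })
    return runs


def summarize(runs):
    """Median and mean tok/s plus the raw list, so outliers stay visible."""
    rates = [r["tok_s"] for r in runs]
    return {
        "runs": rates,
        "median": statistics.median(rates),
        "mean": statistics.fmean(rates),
    }
```

Keeping the raw run list next to the median follows the reporting principle used throughout this log: a single number hides warm runs and outliers.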
3.2 vm_stat + memorystatus (same window as run 3)
Pages wired down: 201888.
Pages stored in compressor: 94208.
Swapins: 0.
# 14B failure session later same day:
memorystatus: pressure level 4 (critical)
memorystatus: killing_low_priority_processes
Full dump includes top -l 1 and log show … memorystatus; see raw-vm-stat-dump.txt.
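vm_stat reports pages, not bytes, so the counters above need a page size before they can be compared with the GB figures elsewhere in this log. The sketch below assumes the 16384-byte page size used on Apple Silicon when the vm_stat header line is missing; the helper names are illustrative.

```python
import re

DEFAULT_PAGE_SIZE = 16384  # bytes; assumed Apple Silicon default if no header is present


def parse_vm_stat(text):
    """Return (page_size, counters) from vm_stat output; counters are raw page counts."""
    page_size = DEFAULT_PAGE_SIZE
    header = re.search(r"page size of (\d+) bytes", text)
    if header:
        page_size = int(header.group(1))
    counters = {}
    for line in text.splitlines():
        # Counter lines look like: Pages wired down:    201888.
        match = re.match(r'^"?([^":]+)"?:\s+(\d+)\.?\s*$', line.strip())
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return page_size, counters


def pages_to_gb(pages, page_size):
    """Convert a page count to decimal gigabytes."""
    return pages * page_size / 1e9
```

For the snapshot above, 201888 wired pages correspond to about 3.31 GB and 94208 compressor pages to about 1.54 GB at a 16384-byte page size.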
3.3 System-level noise (non-model factors)
| Noise source | Observation | Effect on tok/s |
|---|---|---|
| Thermals / fan 4200rpm | runs after ~20min continuous | 29.8 → 25.1 (~−16%; ~−13% vs the 28.8 median), not excluded |
| TTFT jitter | 1.82 → 2.61 → 1.94 s | Agent first-turn feel |
| memory compressor | 94208 pages compressed | pre-swap; still usable |
| Metal buffer realloc | one WARN line in debug | single run −3% to −5%, non-fatal |
| Afternoon ambient | 2026-06-02 14:00 retest | median 27.9, outlier 24.3 |
3.4 Reproduction
chmod +x resources/benchmark-m4-mac-mini-ollama.sh
./resources/benchmark-m4-mac-mini-ollama.sh llama3.1:8b
# recommended second terminal:
log stream --predicate 'subsystem == "com.apple.memorystatus"' --level debug
4. Resource exhaustion taxonomy (failure attribution)
§3 raw logs are chronological; this section classifies by exhaustion type—for any failure, ask: L1, L2, or L3?
| Type | Layer | Typical symptoms | Cases in this article | Fix direction |
|---|---|---|---|---|
| E1 Capacity exhaustion | L1 | load OOM, runner killed, model cannot fit | qwen2.5:14b @ 16GB → signal: killed (oom?) | smaller model / RAM to 24GB |
| E2 Pressure collapse | L3 → drags L2 | Swapins spike, tok/s cliff, TTFT 5s+ | Swapins 8421; 11.2 → 3.4 tok/s | lower ctx / close background / §5 Phase 2–3 |
| E3 Bandwidth path degradation | L2 | no swap but very low tok/s; Metal not loaded | Ollama 0.5.13 no ggml_metal_init entire session → 4.2 tok/s | upgrade Ollama; check Metal WARN |
| E4 Latency-only failure | L1 edge + L3 precursor | loads OK, first token 60s+; unusable before steady tok/s | num_ctx=65536 + 14B @ 16GB | lower ctx; 24GB same config TTFT ~2.8s |
| E5 Aggregate exhaustion | L1 + L3 | single path OK; multi-Agent / mmap large model unusable | 5 Agents + 14B; 70B mmap <3 tok/s | split nodes; 70B needs M4 Pro 48GB+ |
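The taxonomy can be turned into a first-pass triage helper. The sketch below is a heuristic, not the author's tooling: the TTFT cutoffs (5 s and 60 s) are read off the "5s+" and "60s+" symptoms in the table, while the field names and the multi-agent rule for E5 are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    runner_killed: bool = False   # e.g. "signal: killed (oom?)" in the Ollama log
    metal_loaded: bool = True     # ggml_metal_init seen during the session
    swapins: int = 0              # Swapins counter from vm_stat
    ttft_s: float = 0.0           # time to first token, in seconds
    concurrent_agents: int = 1    # number of clients sharing the model


def classify(obs):
    """Map one observation to the most likely exhaustion type (heuristic)."""
    if obs.runner_killed:
        return "E1"  # capacity exhaustion: the runner did not survive
    if obs.concurrent_agents > 1 and (obs.swapins > 0 or obs.ttft_s >= 5):
        return "E5"  # aggregate exhaustion: only fails under stacked load
    if obs.ttft_s >= 60:
        return "E4"  # latency-only failure: loads, but no usable first token
    if obs.swapins > 0 and obs.ttft_s >= 5:
        return "E2"  # pressure collapse: swap plus slow first token
    if not obs.metal_loaded:
        return "E3"  # bandwidth path degradation: CPU-only decode
    return "none"


if __name__ == "__main__":
    print(classify(Observation(swapins=8421, ttft_s=5.8)))  # the E2 case from the table
```

The checks run from the most destructive failure to the least, so an observation that matches several rows is attributed to the earliest layer that failed.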
4.1 E2 example: swap-driven pressure collapse (qwen2.5:14b @ 16GB)
time=2026-05-29T11:03:12 level=WARN msg="model requires more memory than available, offloading to CPU"
time=2026-05-29T11:03:44 level=ERROR msg="llama runner process has terminated: signal: killed (oom?)"
# causal: L1 full → L3 swap → L2 GPU waits on pages → E2
# run sequence: 11.2 → 8.4 → 3.4 → 2.9 tok/s
Swapins: 8421 (then > 20k)
4.2 E3 example: Metal path lost (Ollama 0.5.13, 2026-04-18)
# entire session: NO ggml_metal_init → L2 path = CPU only
eval rate=4.2 token/s
# fix: upgrade to 0.6.2 → Metal restored, median back to ~29
4.3 E4 example: num_ctx 65536 + 14B silent timeout
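The load succeeds, but the first token takes 60 s or more with no response: the KV cache at this context length exhausts L1 before any steady tok/s can be measured. On 24GB the same config has TTFT ~2.8s. Loading successfully does not mean the setup is usable day to day.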
4.4 E5 and other boundaries (summary)
- E1+E5: 70B mmap, tok/s < 3—capacity layer insufficient
- E1+E5: 5 concurrent Agents + 14B, stacked KV OOM
- E2 precursor: Xcode CI + 14B same machine, DerivedData eats L1—split workloads
- L2 noise (not an E-class failure): Metal buffer reallocation WARN, single run −3% to −5%; see §3.3
5. Three-phase collapse model
This section makes the §1 L3 pressure layer concrete: on the same M4 16GB with Llama 3.1 8B Q4, tok/s does not fall linearly as wired memory rises; it splits into three phases. 14B on 16GB enters Phase 3 sooner, which explains “14B is slower” causally rather than by “more parameters” alone.
5.1 Phase definitions (8B Q4, gradual load)
| Phase | memorystatus | wired approx. | 8B tok/s | 14B tok/s | Mechanism (layer) |
|---|---|---|---|---|---|
| Phase 1: Linear degradation | NORMAL | 11.8 → 14.1 GB | 28.8 → 22.1 | n/a → 6.2 | L1 headroom shrinks; L2 bandwidth still dominates, roughly linear |
| Phase 2: Contention zone | WARN → critical | 14.1 → 14.8 GB | 22.1 → 18.6 | 6.2 → 3.4 | L3 starts: compressor active, GPU waits on reclaim, steeper slope |
| Phase 3: Swap collapse | swap active | 15.2 GB+ | 9.1 → 3.2 | 2.9 | L3 full: Swapins 8421+, nonlinear cliff; E2 failure |
Readout: Phase 1—smaller model or fewer tabs often enough; Phase 2—must cut num_ctx or add RAM; Phase 3—only less load or more RAM; tuning temperature does nothing.
5.2 Measured snapshots (cross-phase)
| System state | Collapse phase | tok/s | TTFT | Swapins |
|---|---|---|---|---|
| clean baseline | — (L2 steady) | 28.8 med (28.1–31.1) | ~1.99s | 0 |
| warm 20min | — (L2 noise) | 25.1 | 2.4s | 0 |
| Chrome + Xcode | late Phase 1 | 20.8 | 2.38s | 0 |
| 16GB + 14B run 1–2 | Phase 2 | 11.2 / 8.4 | ~2.8s | 0→1204 |
| swap active + 14B | Phase 3 | 3.4 → 2.9 | 5.8s | 8421+ |
| 24GB clean + 8B | — (large L1 headroom, no Phase 1) | 51.2 med | ~1.6s | 0 |
5.3 Gradual load raw data (16GB, wired + anonymous)
Maps 1:1 to Phase 1→3; when reproducing, trust memorystatus phase, not GB alone:
| wired + anonymous approx. | memorystatus | Phase | 8B tok/s | 14B tok/s |
|---|---|---|---|---|
| 11.8 GB | NORMAL | — | 28.8 | — |
| 13.2 GB | NORMAL | Phase 1 | 26.4 | 10.8 |
| 14.1 GB | WARN | Phase 1→2 | 22.1 | 6.2 |
| 14.8 GB | critical | Phase 2 | 18.6 | 3.4 |
| 15.2 GB+ | swap active | Phase 3 | 9.1 | 2.9 |
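Because the phase follows memorystatus rather than a GB threshold, a reproduction script can label each sample directly from the pressure state. A minimal sketch, assuming the labels are spelled as in the table above (NORMAL, WARN, critical) and that swap activity is detected separately; the function name is illustrative.

```python
def collapse_phase(memorystatus, swap_active=False):
    """Map a memorystatus label from the table above to a collapse phase."""
    if swap_active:
        return "Phase 3"  # swap dominates regardless of the reported level
    level = memorystatus.strip().lower()
    if level == "critical":
        return "Phase 2"
    if level == "warn":
        return "Phase 1→2"
    if level == "normal":
        # The table shows NORMAL both at the clean baseline and in Phase 1.
        return "Phase 1 or clean baseline"
    raise ValueError(f"unknown memorystatus label: {memorystatus!r}")


if __name__ == "__main__":
    for label, swapping in (("NORMAL", False), ("WARN", False), ("critical", False), ("critical", True)):
        print(f"{label:>8} swap={swapping}: {collapse_phase(label, swapping)}")
```

Logging this label next to each tok/s sample makes it easier to compare a run with the rows above even when the absolute GB figures differ.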
5.4 Why swap is a “nonlinear breakpoint”
In Phase 1 decode still mostly follows the L2 bandwidth path; entering Phase 2, macOS aggressive reclaim + compressor makes GPU buffer allocation wait; Phase 3 each token can page-fault from swap—latency goes from µs to ms. tok/s ~20 → ~3 is not “20% slower”; it is a path switch. §4 E2 Swapins 8421 is the Phase 3 fingerprint.
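Converting tok/s into per-token latency makes the path switch easier to see. The sketch below is plain arithmetic on rates already reported in this section; nothing in it is newly measured.

```python
def ms_per_token(tok_s):
    """Average decode latency per token, in milliseconds."""
    if tok_s <= 0:
        raise ValueError("tok_s must be positive")
    return 1000.0 / tok_s


if __name__ == "__main__":
    # Rates taken from the 8B column of the phase tables above.
    for label, rate in (("clean baseline", 28.8), ("Phase 2", 18.6), ("Phase 3", 3.2)):
        print(f"{label:>14}: {rate:>4} tok/s = {ms_per_token(rate):6.1f} ms/token")
```

At 28.8 tok/s each token costs about 35 ms; at 3.2 tok/s it costs about 312 ms, roughly nine times longer for the same model on the same chip.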
6. TTFT, context, and time / version drift
6.1 TTFT and Agent feel
| Scenario | TTFT samples (s) | Notes |
|---|---|---|
| model resident | 0.41 / 0.58 / 0.52 | acceptable |
| cold start after pull | 1.82 / 2.61 / 1.94 / 2.08 | system wake jitter |
| swap + 14B | 4.5 / 5.8 / 6.2 | unusable |
6.2 num_ctx decay (8B Q4)
| num_ctx | 16GB tok/s runs | 24GB tok/s runs |
|---|---|---|
| 2048 | 28.1 / 29.8 / 27.2 / 31.1 | 51.2 / 54.6 / 49.8 / 52.1 |
| 8192 | 24.1 / 26.3 / 22.7 | 47.8 / 50.1 / 48.6 |
| 32768 | 14.6 / 13.8 (swap edge) | 38.2 / 36.9 |
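To rerun this sweep against a local Ollama server, num_ctx can be set per request through the `options` field of `/api/generate`, and tok/s can be derived from the `eval_count` and `eval_duration` fields of the response. The sketch below is one way to do that and is not necessarily how the article's benchmark script works; the helper names and the 600-second timeout are illustrative choices.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_request(model, prompt, num_ctx, temperature=0.2):
    """Payload for one non-streaming generation with an explicit context size."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "temperature": temperature},
    }


def decode_tok_s(response):
    """tok/s from Ollama's response fields; eval_duration is in nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def run_once(model, prompt, num_ctx, url=OLLAMA_URL):
    """Send one request to a running Ollama server and return decode tok/s."""
    body = json.dumps(build_request(model, prompt, num_ctx)).encode("utf-8")
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=600) as resp:
        return decode_tok_s(json.load(resp))
```

Calling `run_once("llama3.1:8b", prompt, ctx)` for each ctx in 2048, 8192, and 32768 produces one row of the table; repeat each setting several times and keep every value rather than only the median.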
6.3 Time drift: same machine, same script, different dates
| Date | Ollama | median tok/s | run range | Notes |
|---|---|---|---|---|
| 2026-05-20 | 0.6.1 | 29.2 | 26.8–31.4 | older allocator |
| 2026-05-28 | 0.6.2 | 28.8 | 27.2–31.1 | baseline in this article |
| 2026-06-02 | 0.6.2 | 27.9 | 24.3–30.1 | afternoon heat, outlier 24.3 |
The 0.6.1 → 0.6.2 median delta is ~0.4 tok/s (29.2 vs 28.8), smaller than the day-to-day variance of up to ±12%, so cross-version comparisons need a fixed date and room temperature.
6.4 Ollama minor version compare (same machine/model, 2026-05-29)
| Version | median tok/s | Metal |
|---|---|---|
| 0.6.1 | 29.2 | OK |
| 0.6.2 | 28.8 | OK; occasional buffer realloc WARN |
| 0.6.3 | 29.6 | OK; fewer realloc WARN |
7. Controlled experiments (three counterintuitive sets)
7.1 Fair load: light vs heavy (different goals)
“24GB on 8B feels smoother than 16GB on 14B” is not a same-quality comparison. Wall-clock time to generate the same 500 tokens:
| Config | Model | median tok/s | ~500 tokens | Quality tier |
|---|---|---|---|---|
| 16GB clean | gemma3:4b | 39.8 | ~12.6 s | light |
| 16GB clean | llama3.1:8b | 28.8 | ~17.4 s | general |
| 16GB loaded | qwen2.5:14b | 3.4 (after swap) | ~147 s | high quality but failed |
| 24GB clean | qwen2.5:14b | 15.8 | ~31.6 s | coding sweet spot |
Conditional conclusion: if you need 14B quality, 24GB is a hard floor; if you only need speed, 16GB + 8B is enough—there is no “force 14B on 16GB for better value.”
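The "~500 tokens" column in the table above is simply tokens divided by median tok/s. The sketch below reproduces it; the optional `ttft_s` argument is an illustrative addition for including first-token latency and is not part of the table's arithmetic.

```python
def wall_clock_s(tokens, tok_s, ttft_s=0.0):
    """Seconds to produce `tokens` at a steady decode rate, plus optional TTFT."""
    if tok_s <= 0:
        raise ValueError("tok_s must be positive")
    return ttft_s + tokens / tok_s


if __name__ == "__main__":
    # Median tok/s values copied from the fair-load table.
    configs = {
        "16GB clean, gemma3:4b": 39.8,
        "16GB clean, llama3.1:8b": 28.8,
        "16GB loaded, qwen2.5:14b": 3.4,
        "24GB clean, qwen2.5:14b": 15.8,
    }
    for name, rate in configs.items():
        print(f"{name:<26} {wall_clock_s(500, rate):6.1f} s")
```

Seen this way, forcing 14B onto a swapping 16GB machine costs well over two minutes per 500-token reply, compared with about half a minute on 24GB.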
7.2 Same-machine A/B: swap off vs on (8B; §5 Phase 1→3)
| Condition | run1 | run2 | run3 |
|---|---|---|---|
| swap off, clean | 28.1 | 29.8 | 27.2 |
| artificial load to critical | 18.6 | 9.1 | 3.2 |
7.3 MLX vs Ollama (same prompt / ctx / decode)
Llama 3.1 8B 4-bit, 512/256, num_ctx=2048, temp=0.2, 16GB clean:
| Framework | run sequence tok/s | median | TTFT samples |
|---|---|---|---|
| Ollama 0.6.2 | 28.1 / 29.8 / 27.2 / 31.1 | 28.8 | 1.82 / 2.61 / 1.94 s |
| mlx-lm 0.22.x | 30.4 / 29.1 / 31.6 / 28.3 | 29.8 | 1.71 / 2.10 / 1.88 s |
MLX ~3–8% faster with similar run jitter; Agent delivery still favors Ollama HTTP. Deep dive: companion article.
8. M4 Mac Mini LLM memory matrix (16GB / 24GB / 32GB measured)
Engineering judgment from §1 causal model and §4–§7 measurements (not official specs).
| Model (Q4_K_M) | Weight approx. | 16GB | 24GB | 32GB |
|---|---|---|---|---|
| Qwen2.5 7B / Qwen3 8B | ~4.5–5 GB | ✅ Recommended | ✅ Recommended | ✅ Recommended |
| Llama 3.1 8B | ~4.9 GB | ✅ Recommended | ✅ Recommended | ✅ Recommended |
| DeepSeek-R1-Distill 8B | ~5.5 GB | ✅ Recommended | ✅ Recommended | ✅ Recommended |
| Gemma 3 4B / Phi-4-mini | ~3 GB | ✅ Fast | ✅ Fast | ✅ Fast |
| Qwen2.5-Coder 14B | ~9 GB | ⚠️ Borderline | ✅ Recommended | ✅ Recommended |
| Phi-4 14B | ~8–9 GB | ⚠️ Borderline | ✅ Recommended | ✅ Recommended |
| Qwen2.5 32B | ~20 GB | ❌ | ⚠️ Borderline | ✅ Usable |
| Llama 3.1 70B | ~40 GB | ❌ | ❌ | ❌ |
70B needs M4 Pro 48GB+; see M4 Pro local LLM guide.
9. Model picks by scenario
9.1 Daily chat / email (English and international)
16GB: llama3.1:8b or qwen2.5:7b for general English; qwen2.5:7b is the stronger choice for multilingual use. 24GB: upgrade to qwen2.5:14b for harder instruction-following and longer threads.
9.2 Coding and Claude Code local Agent
Agents need a stable HTTP endpoint and longer context; a single ollama serve process provides the API that Claude Code is pointed at. 16GB: qwen2.5-coder:7b; 24GB: qwen2.5-coder:14b. Local Agent setup and API cost: M4 Mac Mini local AI Agent lab.
9.3 Reasoning / math / chain-of-thought
DeepSeek-R1 distill shines at 8B: deepseek-r1:8b (16GB) or deepseek-r1:14b (24GB). R1 emits longer reasoning chains—same tok/s means longer wall-clock; that is model behavior, not a Mac fault.
9.4 Multilingual and open-source ecosystem
Llama 3.1 8B has the most tooling and Modelfile docs. If your team already runs Llama-based RAG, staying on Llama reduces migration cost.
9.5 Ultra-light assistant
gemma3:4b clean runs: 38.2 / 41.0 / 36.7 / 39.6 tok/s (median 38.9 across these four runs).
10. Minimal Ollama setup
Commands verified on a fresh Mac Mini M4; Ollama uses Metal automatically—no manual GPU flag.
10.1 Install and verify
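# Installer script; if it reports a Linux-only error on your macOS setup, use the two commented Homebrew lines instead.
curl -fsSL https://ollama.com/install.sh | sh
# brew install ollama
# brew services start ollama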
ollama --version
ollama run qwen2.5:7b "Explain in three sentences why unified memory matters for local LLMs"
First run pulls the model (~4–5GB); keep > 20GB free on SSD. Log line ggml_metal_init means Metal backend loaded.
10.2 Pull by memory tier
# —— 16GB Mac Mini M4 ——
ollama pull qwen2.5:7b
ollama pull qwen2.5-coder:7b
ollama pull llama3.1:8b
ollama pull deepseek-r1:8b
# —— 24GB Mac Mini M4 ——
ollama pull qwen2.5:14b
ollama pull qwen2.5-coder:14b
ollama pull deepseek-r1:14b
# —— 32GB: optional 32B (slower) ——
ollama pull qwen2.5:32b
10.3 Team-shared API
OLLAMA_HOST=0.0.0.0:11434 ollama serve
Point clients at http://<mac-ip>:11434 on the LAN. Use firewall or Tailscale in production—do not expose 11434 on the public internet.
10.4 Run benchmark and diff raw log
ollama pull llama3.1:8b
./resources/benchmark-m4-mac-mini-ollama.sh llama3.1:8b 2>&1 | tee my-run.log
diff -u resources/sample-benchmark-run.log my-run.log
11. 16GB → 24GB → M4 Pro decisions
| Primary goal | Suggested config | Rationale |
|---|---|---|
| Personal trial, 7B chat / light coding | M4 16GB | lowest cost, full 8B Q4 experience |
| Claude Code Agent, 14B coding, small team share | M4 24GB | steady 14B + API headroom, best value |
| Must run 32B locally, long-ctx experiments | M4 32GB or M4 Pro 48GB | 32GB base runs but slow; Pro has higher bandwidth |
| 70B, fast 32B, multi-model concurrency | M4 Pro 48GB+ | bandwidth and memory dual gate; see Pro article |
If unsure: run one week of benchmarks on a clean system, log swap peaks and §5 three-phase collapse, then decide 24GB vs M4 Pro—cheaper than buying max spec upfront.
12. Experimental variance notes
Same config, same script: median can differ ±12% by date—normal for local LLM benchmarks, not a failed run.
| Variable | Observed range | Reporting principle |
|---|---|---|
| date / room temp | median 27.9–29.2 (8B 16GB) | report interval + raw runs, not a single point |
| Ollama 0.6.1→0.6.3 | ~±1 tok/s | smaller than day variance; record version |
| thermals / fan | 29.8 → 25.1 (~−16%; ~−13% vs the 28.8 median) | keep in log, do not exclude |
| GC / Metal realloc | single-run outlier 31.1 or −5% | report p90, not median only |
| Chrome background | 20.8 tok/s | separate row, not in baseline |
If your median differs >15% from this article, align first: Ollama version, num_ctx, temperature, swap yes/no, thermals. Raw files: sample-benchmark-run.log, raw-vm-stat-dump.txt, ollama-debug-excerpt.log.
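The 15% rule above is easy to automate when comparing your own runs with this article. A minimal sketch, assuming the 28.8 tok/s baseline median as the reference; the function names and the default threshold argument are illustrative.

```python
def relative_diff(mine, reference):
    """Signed difference of `mine` from `reference`, as a fraction of the reference."""
    return (mine - reference) / reference


def needs_alignment(mine, reference=28.8, threshold=0.15):
    """True when a median deviates from the reference by more than the threshold."""
    return abs(relative_diff(mine, reference)) > threshold


if __name__ == "__main__":
    for median in (27.9, 24.0, 33.5):
        diff = relative_diff(median, 28.8)
        flag = "check config" if needs_alignment(median) else "within variance"
        print(f"median {median}: {diff:+.1%} -> {flag}")
```

A flagged result is a prompt to compare Ollama version, num_ctx, temperature, swap state, and thermals before concluding that the hardware behaves differently.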
We deliberately keep warm runs (25.1 tok/s) and afternoon outliers (24.3)—pretty medians alone cannot tell noise from misconfiguration. That is the line between lab notes and marketing benchmarks.
FAQ
Can M4 Mac Mini 16GB run 70B?
No. 70B Q4 weights ~40GB+, beyond 16GB physical RAM. “16GB runs 70B” mmap tricks use SSD as RAM; tok/s usually < 3—not practical.
How much faster is 24GB than 16GB?
Same 8B, clean: 16GB five-run median 28.8 tok/s, 24GB median 51.2. With Chrome, 16GB can drop to 20.8 in one run—background load often beats RAM tier for feel.
Ollama or MLX?
API, Claude Code, multi-model switching → Ollama. Python evals, custom sampling → MLX. Both can install; do not keep two large models resident at once.
Q4 vs Q8 quantization?
16GB: start Q4_K_M. 24GB on 8B can try Q8 for quality; 14B still Q4. Q8 14B on 24GB nears memory ceiling.
How do I know if I am swapping?
Activity Monitor → Memory → watch Swap grow; or sysctl vm.swapusage. Swapins >0 and tok/s in §5 Phase 3 (single digits) is E2 pressure collapse—smaller model or more RAM, not hyperparameter tuning.
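For scripted checks, the output of sysctl vm.swapusage can be parsed directly. The sketch below assumes the usual one-line layout (`total = …M  used = …M  free = …M`); handling of K and G suffixes is a defensive assumption, and the function names are illustrative.

```python
import re
import subprocess

# Multipliers to megabytes for the unit suffix printed after each value.
_UNITS = {"K": 1 / 1024, "M": 1.0, "G": 1024.0}


def parse_swapusage(line):
    """Parse `sysctl vm.swapusage` output into a dict of megabyte values."""
    fields = {}
    for name, value, unit in re.findall(r"(total|used|free) = ([\d.]+)([KMG])", line):
        fields[name] = float(value) * _UNITS[unit]
    return fields


def swap_used_mb():
    """Run sysctl on macOS and return the swap currently in use, in MB."""
    out = subprocess.run(
        ["sysctl", "vm.swapusage"], capture_output=True, text=True, check=True
    ).stdout
    return parse_swapusage(out)["used"]
```

Sampling `swap_used_mb()` before and after a benchmark run shows whether the run itself pushed the machine into swap, which is the distinction between Phase 2 and Phase 3 in §5.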
Why is 24GB so much faster than 16GB?
Not doubled bandwidth (same M4 die, similar L2)—more L1 headroom, L3 not firing: 16GB clean 8B median ~28.8, 24GB ~51.2. See §1.4 and §5.2.
Base M4 vs M4 Pro?
Same 8B Q4, clean: M4 24GB median 51.2, M4 Pro 48GB median 75.9 (same lab script). Pro wins more under concurrent batch; single-user Agent may not need Pro.
Agent: tok/s or TTFT?
First turn cares about TTFT (§6.1); long replies care about tok/s. Under swap, TTFT can go 1.82s → 5.8s—fix memory before switching frameworks.
No physical Mac for the team?
Deploy Ollama on a dedicated macOS host; members hit it on the LAN—same commands as here. For remote macOS to reproduce §5 collapse and §4 failures, see “Lab environment disclosure” below.
Summary
M4 Mac Mini local LLM limits are set by L1 capacity + L2 bandwidth + L3 pressure: 16GB is comfortable for 7B–8B (L2 steady median ~29 tok/s); 24GB runs 14B steadily (~16 tok/s); 32B needs 32GB; 70B needs M4 Pro (E1 capacity exhaustion).
Conditional advice: single-user Agent, 7B–14B range—24GB on 8B often beats forcing 14B on 16GB (§7.1 fair-load table); multi-concurrency or batch inference needs more nodes or Pro. Revalidate with §1 causal model + §3 raw log + §5 Phase 1–3 on your hardware—do not copy only the median.
References
- Ollama documentation — install, API, Metal backend
- Apple Metal documentation — GPU inference basics
- Meta Llama 3.1 — model specs and license
- Qwen2.5 release notes — 7B / 14B capability bounds
- Apple MLX project — framework comparison reference
- llama.cpp — Ollama engine and GGUF quantization
This article is a three-week engineering log, not a one-shot snapshot. Raw artifacts:
- resources/sample-benchmark-run.log: benchmark output
- resources/raw-vm-stat-dump.txt: vm_stat / top / memorystatus
- resources/ollama-debug-excerpt.log: Ollama verbose + failure sessions
- resources/benchmark-m4-mac-mini-ollama.sh: reproduction script
Lab environment disclosure
Benchmarks ran on physical Mac Mini M4 hardware: some Macstripe dedicated lease nodes (no noisy-neighbor virtualization), some lab bench machines; macOS 15.4, Ollama 0.6.2. Macstripe offers M4 Mac Mini remote lease; conclusions here do not depend on a specific vendor—you can reproduce on your own hardware.
To reproduce §5 collapse and §4 failure taxonomy on remote macOS (e.g. team without a local Mac), see models and regions on the Macstripe home page—optional infrastructure, not a prerequisite for this article’s results.
Related reading (topic cluster)
This article covers which models fit plus lab raw data; companion posts split frameworks, Pro tier, and Agent rollout so one page does not sprawl.
- M4 Mac Mini: 7B vs 14B — real-world gap (decision Spoke: TL;DR + TTFT)
- MLX vs Ollama: Apple Silicon inference frameworks (technical)
- M4 Pro local LLM guide (70B / 32B tier)
- M4 Mac Mini local AI Agent setup and API cost (deployment)
- How unified memory affects LLM inference (theory)