Can a 16GB M4 Mac Mini run 70B?

No. 70B Q4 weights are ~40GB+, beyond 16GB physical RAM. mmap-style tricks use SSD as extension; tok/s is usually <3—not practical.

API, Claude Code, and multi-model switching → Ollama. Python evals and custom sampling → MLX. Both can install; avoid two large models resident at once.

Same 8B Q4 clean: M4 24GB median 51.2, M4 Pro 48GB 75.9 (same lab script). Pro helps more with concurrent batch; single-user agents may not need Pro.

For agents, tok/s or TTFT?

First turn: TTFT. Long replies: tok/s. Under swap, TTFT can jump from ~1.8s to ~5.8s—fix memory before changing frameworks.

Which Local LLMs Can an M4 Mac Mini Run? (16GB / 24GB / 32GB Lab Guide)

Q: How much faster is 24GB than 16GB?

Same 8B, clean state: 16GB median 28.8 tok/s, 24GB 51.2. With Chrome open, 16GB can drop to 20.8—background load often matters more than RAM tier for daily feel.

Q: Ollama or MLX?

API, Claude Code, and multi-model switching → Ollama. Python evals and custom sampling → MLX. Both can install; avoid two large models resident at once.

Q: Q4 vs Q8 quantization?

16GB: start with Q4_K_M. On 24GB you can try Q8 on 8B for quality; keep 14B on Q4. Q8 14B on 24GB nears the memory ceiling.

Q: How do I know if I am swapping?

Activity Monitor → Memory → Swap Used, or sysctl vm.swapusage. Swapins >0 with tok/s in single digits (Phase 3) is pressure collapse—use a smaller model or more RAM.

Q: Why is 24GB so much faster?

Not doubled bandwidth on the same M4 die. Larger L1 headroom keeps L3 from firing: ~28.8 vs ~51.2 median on 8B clean runs.

Q: Base M4 vs M4 Pro?

Same 8B Q4 clean: M4 24GB median 51.2, M4 Pro 48GB 75.9 (same lab script). Pro helps more with concurrent batch; single-user agents may not need Pro.

Q: For agents, tok/s or TTFT?

First turn: TTFT. Long replies: tok/s. Under swap, TTFT can jump from ~1.8s to ~5.8s—fix memory before changing frameworks.

Mac Mini on a desk with a monitor — a local LLM inference workstation

Before buying a Mac Mini M4, the most common searches are: how far can an M4 Mac Mini local LLM setup go, how much does 16GB vs 24GB matter, and whether running LLMs on a Mac Mini means constant swap. In 2026, Qwen2.5, Llama 3.1, and similar weights run on Apple Silicon via Ollama; the bottleneck is usually unified memory, not whether you have a discrete GPU.

Choosing only between 7B and 14B? Read the decision Spoke M4 Mac Mini: 7B vs 14B — real-world gap (2026 lab) (quick decision flowchart, Agent TTFT, §8.4 pick table). This Hub keeps full methodology, failure samples, and raw dumps for citation.

This is not a “model list roundup.” It is a lab log with failure samples and raw dumps (2026-05-20 through 06-02). Artifacts: sample-benchmark-run.log, raw-vm-stat-dump.txt, ollama-debug-excerpt.log.

For framework comparison see MLX vs Ollama benchmarks; for unified memory mechanics see unified memory and LLM inference.

Reading path: Start with §1’s three-layer resource causal model, then §3–§4 raw logs and failure taxonomy, then §5’s three-phase collapse to explain why swap is a nonlinear breakpoint. Conditional takeaway: for a single-user Agent, 24GB on 8B often beats forcing 14B on 16GB (§7.1)—do not copy only the median.

1. System causal model and hard constraints

The base Mac Mini M4 (not Pro) ships with 16GB, 24GB, or 32GB unified memory, a 10-core GPU, and roughly 120 GB/s memory bandwidth (M4 Pro ~273 GB/s). Every tok/s, swap, and TTFT number below maps to this three-layer resource model—understand why first, then read what happened in §3–§5.

M4 local LLM inference: three-layer resource causal model

Layer	Controls	Causal mechanism	Observation anchors in this article
L1 Capacity Memory capacity	Fit in RAM, OOM or not	Footprint ≈ weights + KV cache (∝ `num_ctx`) + OS/foreground apps. Beyond physical RAM → compression → swap → runner killed	14B @ 16GB OOM; `num_ctx=65536` silent timeout
L2 Bandwidth Memory bandwidth	Clean-state tok/s ceiling	Each decode token reads the full weight stream; tok/s ≈ effective bandwidth ÷ bytes moved per token. Same chip, same model: L2 sets baseline ~29 tok/s, not GB count alone	16GB clean 8B median 28.8; 24GB same model 51.2 (see §1.4)
L3 Pressure Pressure / contention	Nonlinear performance collapse	wired rises → compressor active → memorystatus WARN/critical → swapins → GPU page faults → tok/s cliff (nonlinear; see §5 Phase 2–3)	Swapins 8421 → 3.4 tok/s; critical @ 14.8GB wired

Causal chain (when reading logs): config change → which layer first? Bigger model/ctx → L1; same load 16GB vs 24GB → often L1 headroom decides whether L3 fires; thermals −12% → L2 effective bandwidth drops; Ollama without Metal → L2 path falls back to CPU.

1.1 L1 capacity: weights + KV must fit in unified memory

Inference footprint ≈ quantized weights + KV cache (linear in num_ctx) + macOS and foreground apps. While L1 is not full, L3 does not fire—that is the precondition for baseline 28.8 tok/s. After L1 tops out, the system does not “slow down evenly”; it enters §5 Phase 2/3. Rule of thumb: keep model-related usage under ~70% of total RAM; on 16GB with Xcode + Chrome + Ollama, effective headroom can drop below ~10GB.

1.2 L2 bandwidth: why clean-state tok/s has a ceiling

Autoregressive decode is usually limited by moving weights from unified memory into GPU compute units. Llama 3.1 8B Q4 weights ~4.9GB; M4 @ 120 GB/s lands in the ~25–35 tok/s band—matching measured median 28.8. Warm machine 25.1 and outlier 31.1 are L2 noise (throttle / GC), not L3 swap. See §3–§7 raw logs.

1.3 L1 prerequisite: quantization decides fit

Default Ollama tags are mostly Q4 (GGUF; see llama.cpp). 8B: Q4_K_M ~5GB, Q8_0 ~8GB—16GB can run Q8 8B, but KV headroom is tiny and L3 triggers easier. This article assumes Q4_K_M unless noted.

1.4 Why 24GB is a “nonlinear” upgrade

24GB does not double L2 bandwidth—same M4 die, similar L2 ceiling. 24GB 8B median 51.2 is mainly because weights + KV stay in GPU-friendly residency with no L3 contention, and Ollama/Metal can use larger batches and steadier buffers; clean 16GB 8B already sits near the L1 edge—any Chrome tab pushes Phase 1. 16GB→24GB buys L1 headroom → delayed L3, not “+8GB = +8 tok/s” linear scaling.

2. Baseline: first clean-system run

Anchor for all later comparisons: Mac Mini M4 16GB, Terminal + Ollama only, Llama 3.1 8B Q4, 512 prompt / 256 gen, temperature=0.2. Machine ID m4-16gb-lab-01, macOS 15.4.

Metric	2026-05-28 first baseline	Notes
tok/s (5 runs)	28.1 / 29.8 / 27.2 / 20.8* / 31.1	*run4 had Chrome; excluded
median / mean	28.8 / 29.05	non-monotonic; see §3
TTFT wall clock	1.82 / 2.61 / 1.94 / 2.08 s	run2 jitter to 2.61s
Swapins	0	vm_stat snapshot
Metal	`ggml_metal_init: Apple M4`	see debug log

This is not “one benchmark then a summary”—it is the start of a three-week engineering log; the same config’s median can drift 27.9–29.2 across dates (§6.3).

3. Run log and raw system dumps

Uncurated lab artifacts below (not summaries). Full benchmark: resources/sample-benchmark-run.log; system dump: resources/raw-vm-stat-dump.txt; Ollama debug: resources/ollama-debug-excerpt.log.

3.1 Benchmark script output (excerpt)

--- run 2 / 5 ---  (machine warm, fan ~4200rpm)
  eval_count=256  elapsed=8.6s  tok/s=29.8  TTFT_wall=2.61s
--- run 5 / 5 ---  (outlier: GC pause mid-decode)
  eval_count=256  elapsed=8.0s  tok/s=31.1  TTFT_wall=2.08s

median tok/s:  28.8
mean:          29.05
p90:           30.4

3.2 vm_stat + memorystatus (same window as run 3)

Pages wired down:                        201888.
Pages stored in compressor:               94208.
Swapins:                                      0.
# 14B failure session later same day:
memorystatus: pressure level 4 (critical)
memorystatus: killing_low_priority_processes

Full dump includes top -l 1 and log show … memorystatus; see raw-vm-stat-dump.txt.

3.3 System-level noise (non-model factors)

Noise source	Observation	Effect on tok/s
Thermals / fan 4200rpm	runs after ~20min continuous	29.8 → 25.1 (~−12%), not excluded
TTFT jitter	1.82 → 2.61 → 1.94 s	Agent first-turn feel
memory compressor	94208 pages compressed	pre-swap; still usable
Metal buffer realloc	one WARN line in debug	single run −3~5%, non-fatal
Afternoon ambient	2026-06-02 14:00 retest	median 27.9, outlier 24.3

3.4 Reproduction

chmod +x resources/benchmark-m4-mac-mini-ollama.sh
./resources/benchmark-m4-mac-mini-ollama.sh llama3.1:8b
# recommended second terminal:
log stream --predicate 'subsystem == "com.apple.memorystatus"' --level debug

4. Resource exhaustion taxonomy (failure attribution)

§3 raw logs are chronological; this section classifies by exhaustion type—for any failure, ask: L1, L2, or L3?

Type	Layer	Typical symptoms	Cases in this article	Fix direction
E1 Capacity exhaustion Capacity exhaustion	L1	load OOM, runner killed, model cannot fit	qwen2.5:14b @ 16GB → `signal: killed (oom?)`	smaller model / RAM to 24GB
E2 Pressure collapse Pressure collapse	L3 → drags L2	Swapins spike, tok/s cliff, TTFT 5s+	Swapins 8421 → 11.2→3.4 tok/s	lower ctx / close background / §5 Phase 2–3
E3 Bandwidth path degradation Bandwidth path degradation	L2	no swap but very low tok/s; Metal not loaded	Ollama 0.5.13 no `ggml_metal_init` entire session → 4.2 tok/s	upgrade Ollama; check Metal WARN
E4 Latency-only failure Latency-only failure	L1 edge + L3 precursor	loads OK, first token 60s+; unusable before steady tok/s	`num_ctx=65536` + 14B @ 16GB	lower ctx; 24GB same config TTFT ~2.8s
E5 Aggregate exhaustion Aggregate exhaustion	L1 + L3	single path OK; multi-Agent / mmap large model unusable	5 Agents + 14B; 70B mmap <3 tok/s	split nodes; 70B needs M4 Pro 48GB+

4.1 E2 example: swap-driven pressure collapse (qwen2.5:14b @ 16GB)

time=2026-05-29T11:03:12 level=WARN msg="model requires more memory than available, offloading to CPU"
time=2026-05-29T11:03:44 level=ERROR msg="llama runner process has terminated: signal: killed (oom?)"
# causal: L1 full → L3 swap → L2 GPU waits on pages → E2
# run sequence: 11.2 → 8.4 → 3.4 → 2.9 tok/s
Swapins: 8421  (then > 20k)

4.2 E3 example: Metal path lost (Ollama 0.5.13, 2026-04-18)

# entire session: NO ggml_metal_init  →  L2 path = CPU only
eval rate=4.2 token/s
# fix: upgrade to 0.6.2 → Metal restored, median back to ~29

4.3 E4 example: num_ctx 65536 + 14B silent timeout

Load succeeds, first token 60s+ with no response—L1 hits KV dimension before steady measurable tok/s. 24GB same config TTFT ~2.8s: can load ≠ usable daily.

4.4 E5 and other boundaries (summary)

E1+E5: 70B mmap, tok/s < 3—capacity layer insufficient
E1+E5: 5 concurrent Agents + 14B, stacked KV OOM
E2 precursor: Xcode CI + 14B same machine, DerivedData eats L1—split workloads
L2 noise (not E-class failure): Metal buffer reallocation WARN, single run −3~5%, see §3.3

5. Three-phase collapse model

Concrete abstraction for §1 L3 pressure: same M4 16GB, Llama 3.1 8B Q4, as wired memory rises tok/s does not fall linearly—it splits into three phases. 14B on 16GB enters Phase 3 faster—that explains “14B is slower” causally, not just “more parameters.”

5.1 Phase definitions (8B Q4, gradual load)

Phase	memorystatus	wired approx.	8B tok/s	14B tok/s	Mechanism (layer)
Phase 1 Linear degradation	NORMAL	11.8 → 14.1 GB	28.8 → 22.1	— → 6.2	L1 headroom shrinks; L2 bandwidth still dominates, roughly linear
Phase 2 Contention zone	WARN → critical	14.1 → 14.8 GB	22.1 → 18.6	6.2 → 3.4	L3 starts: compressor active, GPU waits on reclaim, steeper slope
Phase 3 Swap collapse	swap active	15.2 GB+	9.1 → 3.2	2.9	L3 full: Swapins 8421+, nonlinear cliff; E2 failure

Readout: Phase 1—smaller model or fewer tabs often enough; Phase 2—must cut num_ctx or add RAM; Phase 3—only less load or more RAM; tuning temperature does nothing.

5.2 Measured snapshots (cross-phase)

System state	Collapse phase	tok/s	TTFT	Swapins
clean baseline	— (L2 steady)	28.8 med (28.1–31.1)	~1.99s	0
warm 20min	— (L2 noise)	25.1	2.4s	0
Chrome + Xcode	late Phase 1	20.8	2.38s	0
16GB + 14B run 1–2	Phase 2	11.2 / 8.4	~2.8s	0→1204
swap active + 14B	Phase 3	3.4 → 2.9	5.8s	8421+
24GB clean + 8B	— (large L1 headroom, no Phase 1)	51.2 med	~1.6s	0

5.3 Gradual load raw data (16GB, wired + anonymous)

Maps 1:1 to Phase 1→3; when reproducing, trust memorystatus phase, not GB alone:

wired + anonymous approx.	memorystatus	Phase	8B tok/s	14B tok/s
11.8 GB	NORMAL	—	28.8	—
13.2 GB	NORMAL	Phase 1	26.4	10.8
14.1 GB	WARN	Phase 1→2	22.1	6.2
14.8 GB	critical	Phase 2	18.6	3.4
15.2 GB+	swap active	Phase 3	9.1	2.9

5.4 Why swap is a “nonlinear breakpoint”

In Phase 1 decode still mostly follows the L2 bandwidth path; entering Phase 2, macOS aggressive reclaim + compressor makes GPU buffer allocation wait; Phase 3 each token can page-fault from swap—latency goes from µs to ms. tok/s ~20 → ~3 is not “20% slower”; it is a path switch. §4 E2 Swapins 8421 is the Phase 3 fingerprint.

6. TTFT, context, and time / version drift

6.1 TTFT and Agent feel

Scenario	TTFT samples (s)	Notes
model resident	0.41 / 0.58 / 0.52	acceptable
cold start after pull	1.82 / 2.61 / 1.94 / 2.08	system wake jitter
swap + 14B	4.5 / 5.8 / 6.2	unusable

6.2 num_ctx decay (8B Q4)

num_ctx	16GB tok/s runs	24GB tok/s runs
2048	28.1 / 29.8 / 27.2 / 31.1	51.2 / 54.6 / 49.8 / 52.1
8192	24.1 / 26.3 / 22.7	47.8 / 50.1 / 48.6
32768	14.6 / 13.8 (swap edge)	38.2 / 36.9

6.3 Time drift: same machine, same script, different dates

Date	Ollama	median tok/s	run range	Notes
2026-05-20	0.6.1	29.2	26.8–31.4	older allocator
2026-05-28	0.6.2	28.8	27.2–31.1	baseline in this article
2026-06-02	0.6.2	27.9	24.3–30.1	afternoon heat, outlier 24.3

0.6.1 → 0.6.2 median delta ~1.4 tok/s, smaller than day-to-day ±12% variance—cross-version compares need fixed date and room temperature.

6.4 Ollama minor version compare (same machine/model, 2026-05-29)

Version	median tok/s	Metal
0.6.1	29.2	OK
0.6.2	28.8	OK; occasional buffer realloc WARN
0.6.3	29.6	OK; fewer realloc WARN

7. Controlled experiments (three counterintuitive sets)

7.1 Fair load: light vs heavy (different goals)

“24GB on 8B feels smoother than 16GB on 14B” is not same-quality comparison. Same wall-clock 500 tokens:

Config	Model	median tok/s	~500 tokens	Quality tier
16GB clean	gemma3:4b	39.8	~12.6 s	light
16GB clean	llama3.1:8b	28.8	~17.4 s	general
16GB loaded	qwen2.5:14b	3.4 (after swap)	~147 s	high quality but failed
24GB clean	qwen2.5:14b	15.8	~31.6 s	coding sweet spot

Conditional conclusion: if you need 14B quality, 24GB is a hard floor; if you only need speed, 16GB + 8B is enough—there is no “force 14B on 16GB for better value.”

7.2 Same-machine A/B: swap off vs on (8B; §5 Phase 1→3)

Condition	run1	run2	run3
swap off, clean	28.1	29.8	27.2
artificial load to critical	18.6	9.1	3.2

7.3 MLX vs Ollama (same prompt / ctx / decode)

Llama 3.1 8B 4-bit, 512/256, num_ctx=2048, temp=0.2, 16GB clean:

Framework	run sequence tok/s	median	TTFT samples
Ollama 0.6.2	28.1 / 29.8 / 27.2 / 31.1	28.8	1.82 / 2.61 / 1.94 s
mlx-lm 0.22.x	30.4 / 29.1 / 31.6 / 28.3	29.8	1.71 / 2.10 / 1.88 s

MLX ~3–8% faster with similar run jitter; Agent delivery still favors Ollama HTTP. Deep dive: companion article.

8. M4 Mac Mini LLM memory matrix (16GB / 24GB / 32GB measured)

Engineering judgment from §1 causal model and §4–§7 measurements (not official specs).

Model (Q4_K_M)	Weight approx.	16GB	24GB	32GB
Qwen2.5 / Qwen3 7B	~4.5 GB	✅ Recommended	✅ Recommended	✅ Recommended
Llama 3.1 8B	~4.9 GB	✅ Recommended	✅ Recommended	✅ Recommended
DeepSeek-R1-Distill 8B	~5.5 GB	✅ Recommended	✅ Recommended	✅ Recommended
Gemma 3 4B / Phi-4-mini	~3 GB	✅ Fast	✅ Fast	✅ Fast
Qwen2.5-Coder 14B	~9 GB	⚠️ Borderline	✅ Recommended	✅ Recommended
Llama 3.1 13B / Phi-4 14B	~8–9 GB	⚠️ Borderline	✅ Recommended	✅ Recommended
Qwen2.5 32B	~20 GB	❌	⚠️ Borderline	✅ Usable
Llama 3.1 70B	~40 GB	❌	❌	❌

70B needs M4 Pro 48GB+; see M4 Pro local LLM guide.

9. Model picks by scenario

9.1 Daily chat / email (English and international)

16GB: llama3.1:8b or qwen2.5:7b for general English; strong multilingual with qwen2.5:7b. 24GB: upgrade to qwen2.5:14b for harder instruction-following and longer threads.

9.2 Coding and Claude Code local Agent

Agents need stable HTTP and longer context; Ollama API with one serve wires Claude Code. 16GB: qwen2.5-coder:7b; 24GB: qwen2.5-coder:14b. Local Agent setup and API cost: M4 Mac Mini local AI Agent lab.

9.3 Reasoning / math / chain-of-thought

DeepSeek-R1 distill shines at 8B: deepseek-r1:8b (16GB) or deepseek-r1:14b (24GB). R1 emits longer reasoning chains—same tok/s means longer wall-clock; that is model behavior, not a Mac fault.

9.4 Multilingual and open-source ecosystem

Llama 3.1 8B / 13B has the most tooling and Modelfile docs. If your team already runs Llama-based RAG, staying on Llama reduces migration cost.

9.5 Ultra-light assistant

gemma3:4b clean runs: 38.2 / 41.0 / 36.7 / 39.6 tok/s (median 39.4, non-integer).

10. Minimal Ollama setup

Commands verified on a fresh Mac Mini M4; Ollama uses Metal automatically—no manual GPU flag.

10.1 Install and verify

curl -fsSL https://ollama.com/install.sh | sh
ollama --version
ollama run qwen2.5:7b "Explain in three sentences why unified memory matters for local LLMs"

First run pulls the model (~4–5GB); keep > 20GB free on SSD. Log line ggml_metal_init means Metal backend loaded.

10.2 Pull by memory tier

# —— 16GB Mac Mini M4 ——
ollama pull qwen2.5:7b
ollama pull qwen2.5-coder:7b
ollama pull llama3.1:8b
ollama pull deepseek-r1:8b

# —— 24GB Mac Mini M4 ——
ollama pull qwen2.5:14b
ollama pull qwen2.5-coder:14b
ollama pull llama3.1:13b

# —— 32GB: optional 32B (slower) ——
ollama pull qwen2.5:32b

10.3 Team-shared API

OLLAMA_HOST=0.0.0.0:11434 ollama serve

Point clients at http://<mac-ip>:11434 on the LAN. Use firewall or Tailscale in production—do not expose 11434 on the public internet.

10.4 Run benchmark and diff raw log

ollama pull llama3.1:8b
./resources/benchmark-m4-mac-mini-ollama.sh llama3.1:8b 2>&1 | tee my-run.log
diff -u resources/sample-benchmark-run.log my-run.log

11. 16GB → 24GB → M4 Pro decisions

Primary goal	Suggested config	Rationale
Personal trial, 7B chat / light coding	M4 16GB	lowest cost, full 8B Q4 experience
Claude Code Agent, 14B coding, small team share	M4 24GB	steady 14B + API headroom, best value
Must run 32B locally, long-ctx experiments	M4 32GB or M4 Pro 48GB	32GB base runs but slow; Pro has higher bandwidth
70B, fast 32B, multi-model concurrency	M4 Pro 48GB+	bandwidth and memory dual gate; see Pro article

If unsure: run one week of benchmarks on a clean system, log swap peaks and §5 three-phase collapse, then decide 24GB vs M4 Pro—cheaper than buying max spec upfront.

12. Experimental variance notes

Same config, same script: median can differ ±12% by date—normal for local LLM benchmarks, not a failed run.

Variable	Observed range	Reporting principle
date / room temp	median 27.9–29.2 (8B 16GB)	report interval + raw runs, not a single point
Ollama 0.6.1→0.6.3	~±1 tok/s	smaller than day variance; record version
thermals / fan	29.8 → 25.1 (−12%)	keep in log, do not exclude
GC / Metal realloc	single-run outlier 31.1 or −5%	report p90, not median only
Chrome background	20.8 tok/s	separate row, not in baseline

If your median differs >15% from this article, align first: Ollama version, num_ctx, temperature, swap yes/no, thermals. Raw files: sample-benchmark-run.log, raw-vm-stat-dump.txt, ollama-debug-excerpt.log.

We deliberately keep warm runs (25.1 tok/s) and afternoon outliers (24.3)—pretty medians alone cannot tell noise from misconfiguration. That is the line between lab notes and marketing benchmarks.

FAQ

Can M4 Mac Mini 16GB run 70B?

No. 70B Q4 weights ~40GB+, beyond 16GB physical RAM. “16GB runs 70B” mmap tricks use SSD as RAM; tok/s usually < 3—not practical.

How much faster is 24GB than 16GB?

Same 8B, clean: 16GB five-run median 28.8 tok/s, 24GB median 51.2. With Chrome, 16GB can drop to 20.8 in one run—background load often beats RAM tier for feel.

Ollama or MLX?

API, Claude Code, multi-model switching → Ollama. Python evals, custom sampling → MLX. Both can install; do not keep two large models resident at once.

Q4 vs Q8 quantization?

16GB: start Q4_K_M. 24GB on 8B can try Q8 for quality; 14B still Q4. Q8 14B on 24GB nears memory ceiling.

How do I know if I am swapping?

Activity Monitor → Memory → watch Swap grow; or sysctl vm.swapusage. Swapins >0 and tok/s in §5 Phase 3 (single digits) is E2 pressure collapse—smaller model or more RAM, not hyperparameter tuning.

Why is 24GB so much faster than 16GB?

Not doubled bandwidth (same M4 die, similar L2)—more L1 headroom, L3 not firing: 16GB clean 8B median ~28.8, 24GB ~51.2. See §1.4 and §5.2.

Base M4 vs M4 Pro?

Same 8B Q4, clean: M4 24GB median 51.2, M4 Pro 48GB median 75.9 (same lab script). Pro wins more under concurrent batch; single-user Agent may not need Pro.

Agent: tok/s or TTFT?

First turn cares about TTFT (§6.1); long replies care about tok/s. Under swap, TTFT can go 1.82s → 5.8s—fix memory before switching frameworks.

No physical Mac for the team?

Deploy Ollama on a dedicated macOS host; members hit it on the LAN—same commands as here. For remote macOS to reproduce §5 collapse and §4 failures, see “Lab environment disclosure” below.

Summary

M4 Mac Mini local LLM limits are set by L1 capacity + L2 bandwidth + L3 pressure: 16GB comfort zone 7B–8B (L2 steady median ~29 tok/s); 24GB steady 14B (~16 tok/s); 32GB before 32B; 70B needs M4 Pro (E1 capacity exhaustion).

Conditional advice: single-user Agent, 7B–14B range—24GB on 8B often beats forcing 14B on 16GB (§7.1 fair-load table); multi-concurrency or batch inference needs more nodes or Pro. Revalidate with §1 causal model + §3 raw log + §5 Phase 1–3 on your hardware—do not copy only the median.

References

Ollama documentation — install, API, Metal backend
Apple Metal documentation — GPU inference basics
Meta Llama 3.1 — model specs and license
Qwen2.5 release notes — 7B / 14B capability bounds
Apple MLX project — framework comparison reference
llama.cpp — Ollama engine and GGUF quantization

This article is a three-week engineering log, not a one-shot snapshot. Raw artifacts:

resources/sample-benchmark-run.log — benchmark output
resources/raw-vm-stat-dump.txt — vm_stat / top / memorystatus
resources/ollama-debug-excerpt.log — Ollama verbose + failure sessions
resources/benchmark-m4-mac-mini-ollama.sh — reproduction script

Lab environment disclosure (Infrastructure disclosure)

Benchmarks ran on physical Mac Mini M4 hardware: some Macstripe dedicated lease nodes (no noisy-neighbor virtualization), some lab bench machines; macOS 15.4, Ollama 0.6.2. Macstripe offers M4 Mac Mini remote lease; conclusions here do not depend on a specific vendor—you can reproduce on your own hardware.

To reproduce §5 collapse and §4 failure taxonomy on remote macOS (e.g. team without a local Mac), see models and regions on the Macstripe home page—optional infrastructure, not a prerequisite for this article’s results.

This article covers which models fit plus lab raw data; companion posts split frameworks, Pro tier, and Agent rollout so one page does not sprawl.

M4 Mac Mini: 7B vs 14B — real-world gap (decision Spoke: TL;DR + TTFT)
MLX vs Ollama: Apple Silicon inference frameworks (technical)
M4 Pro local LLM guide (70B / 32B tier)
M4 Mac Mini local AI Agent setup and API cost (deployment)
How unified memory affects LLM inference (theory)