Key finding
On 16GB, the 14B bottleneck is often not “which model is smarter” but whether swap kicks in. Once it does, effective throughput drops roughly 3× (we measured 14B falling from ~11 tok/s to ~3 tok/s), and the runner can be killed outright.
Below: why that happens and §3 benchmark data; after the speed section see the TL;DR table; full picks in §8.4.
Many people pick the wrong model on M4 Mac Mini
They think the question is: which is smarter, 7B or 14B, and which has higher tok/s.
The real question is often: is unified memory enough, and will swap hit first.
Leaderboard shoppers miss this: 14B on 16GB is not “a bit slower”; it enters a memory collapse zone. From run 3 onward the logs show memorystatus: WARN, tok/s falls from 11.2 to 3.4, and Swapins pass 8000.
We paired two Mac Mini M4 units (16GB and 24GB) with the same script on qwen2.5:7b and qwen2.5:14b (2026-05-28 through 06-03). Understand why things break, then use the decision tables at the end; raw logs in §8.3 reproducibility assets. For the full collapse model see M4 Mac Mini local LLM lab benchmarks (Hub).
Three things first (before 7B vs 14B)
Before naming a model tag, align the decision frame. Local UX breaks down into:
- Will it swap (veto power; beats parameter count)
- Is the first turn fast enough (TTFT; Agents often hurt more here than on steady-state tok/s)
- Does the task need higher quality (cross-file coding is where 14B pays latency)
Many people stare only at the third—“do I need 14B?”—and skip the first two. That is where bad picks start. tok/s mostly answers “after generation starts, how fast”; once swap is on, leaderboard numbers stop matching daily feel.
Decision flowchart (swap → Agent → then 7B/14B)
Check RAM/swap first, then whether you are running an Agent, then 7B vs 14B.
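The order of checks can be written down as a small function. This is a minimal sketch, not the Lab's tooling: `pick_model`, its parameters, and the use of 24GB as a hard cutoff are illustrative assumptions; the tags and outcomes mirror the tables in §7.2 and §8.4.

```python
def pick_model(ram_gb: int, runs_agent: bool, swap_active: bool = False) -> str:
    """Return an Ollama tag following swap -> Agent -> 7B/14B (hypothetical helper)."""
    # 1. Swap is a veto: below 24GB, or if swap is already active, stay on 7B.
    if ram_gb < 24 or swap_active:
        return "qwen2.5-coder:7b" if runs_agent else "qwen2.5:7b"
    # 2. With enough RAM, an Agent workload justifies 14B's slower decode.
    if runs_agent:
        return "qwen2.5-coder:14b"
    # 3. Chat and summaries stay on 7B even when 14B would fit.
    return "qwen2.5:7b"
```

For example, `pick_model(16, runs_agent=True)` returns `qwen2.5-coder:7b`, while `pick_model(24, runs_agent=True)` returns `qwen2.5-coder:14b`.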
M4 Mac Mini local LLM decision series
| Article | What it answers |
|---|---|
| Unified memory & LLMs | Why RAM is a veto |
| This article | 7B vs 14B picks |
| M4 local LLM full lab (Hub) | Full methodology, collapse, raw logs |
| Claude Code + Ollama | Agent rollout and API cost |
| MLX vs Ollama | Framework choice |
Lab IDs: m4-16gb-lab-01 · m4-24gb-lab-02 · Ollama 0.6.2 · macOS 15.4.1
1. Double the parameters ≠ double the experience
7B vs 14B is “2× params” on paper, but on Mac Mini three constraints apply at once:
- Weight size: at Q4, 7B is ~4.5GB and 14B ~9GB, so 14B eats nearly twice the memory headroom; with KV growth, 16GB leaves almost no room for “Chrome in the background.”
- Bandwidth ceiling: same M4 die; decode still scans the full weight stream each token—14B is naturally slower than 7B when memory is clean and sufficient (24GB medians ~15 vs ~51 tok/s), not because macOS is lazy.
- Nonlinear pressure: after RAM tops out, swap hits. tok/s does not slide linearly but falls off a cliff, from ~11 to ~3 (see “three-phase collapse” in the full lab); 14B on 16GB enters the last phase more easily.
So the buying question becomes: can your main workload pay 14B’s “memory tax” and slower decode? 14B is not a “worse model”—it is a memory-gated model: stable use depends on unified memory tier, not parameter count alone.
1.1 14B three-state model (memory-gated, no final tags yet)
14B is not “one notch down”—it is gated by RAM tier: the same weights can be collapse zone, sweet spot, or stable high-quality zone.
| Unified memory | 14B state | Typical behavior | Risk |
|---|---|---|---|
| 16GB | Unstable zone | swap collapse: 11.2 → 3.4 tok/s, Swapins 8421+ | OOM likely; do not keep 14B resident |
| 24GB | Sweet spot | median ~15.1 tok/s, no swap; coding blind review clearly beats 7B | decode still slower than 7B—acceptable trade-off |
| 32GB+ | Stable quality zone | 14B + larger num_ctx still has headroom | see full lab / M4 Pro |
For concrete 7b vs 14b tags see the flowchart and §8.4 tables.
2. Test method and fairness
Hardware: base Mac Mini M4, 10-core GPU, ~120 GB/s unified memory bandwidth; two configs, 16GB and 24GB. Software: macOS 15.4.1, Ollama 0.6.2, default Q4_K_M (GGUF).
2.1 Fixed variables
| Item | Setting |
|---|---|
| Model pair | qwen2.5:7b vs qwen2.5:14b (general); coding runs also qwen2.5-coder:7b/14b |
| Prompt / generation | ~512 prompt tokens, 256 generated |
| Sampling | temperature=0.2, num_ctx=2048 |
| Repeats | 5 runs per config; median + run sequence reported |
| Environment | “Clean” = Terminal + Ollama only; “loaded” = Chrome 12 tabs + Music in background |
2.2 Script
chmod +x resources/benchmark-7b-14b-ollama.sh
./resources/benchmark-7b-14b-ollama.sh qwen2.5:7b
./resources/benchmark-7b-14b-ollama.sh qwen2.5:14b
The script comes from the shared Lab benchmark (same lineage as benchmark-m4-mac-mini-ollama.sh in the full lab article) and measures tok/s as eval_count / wall_time via the Ollama HTTP API.
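For readers who want to see the measurement without opening the shell script, here is a rough Python equivalent. It is a sketch under assumptions, not the Lab script itself: the function names are hypothetical, the endpoint is Ollama's default local address, and non-streaming mode is chosen so one wall-clock interval covers the whole response. The option values come from §2.1.

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(model: str, prompt: str) -> dict:
    """Request body using the fixed variables from section 2.1."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON response, so wall time covers the whole generation
        "options": {"temperature": 0.2, "num_ctx": 2048, "num_predict": 256},
    }


def tok_per_s(eval_count: int, wall_s: float) -> float:
    """Generated tokens divided by wall-clock seconds."""
    return eval_count / wall_s


def run_once(model: str, prompt: str) -> float:
    """Send one request to a running Ollama server and return tok/s."""
    body = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    wall_s = time.perf_counter() - start
    return tok_per_s(result["eval_count"], wall_s)
```

Because wall time also includes prompt processing and any model load, this figure reads lower than `eval_count / eval_duration` from the same response; that is the point when the goal is to capture what a user actually waits for.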
2.3 What we do not test
We do not run public “IQ leaderboard” scores—variance across prompts is huge. Quality uses a fixed task set + blind human review (§5); speed reports reproducible numbers and raw run sequences (including discarded outliers).
2.4 Lab environment and reproduction notes
To reproduce on your machine or paste into internal docs, use the environment block below; summary table follows. Full failure taxonomy and collapse: M4 Mac Mini local LLM lab (Hub).
Environment:
- macOS 15.4.1
- Ollama 0.6.2
- Q4_K_M quantization (GGUF)
- Metal backend enabled (ggml_metal_init confirmed in logs)
- Devices: m4-16gb-lab-01 (16GB) / m4-24gb-lab-02 (24GB); cross-device, not same unit
Protocol:
- Models: qwen2.5:7b vs qwen2.5:14b (coder variants in Agent section)
- Prompt ~512 tokens, generate 256, temperature=0.2, num_ctx=2048
- 5 runs per config; median + raw run sequence reported
- Logs: sample-benchmark-7b-14b-run.log (article section 8.3)
Limitations:
- Cross-device comparison (16GB vs 24GB on different machines)
- No thermal normalization across runs
- No background daemon isolation (Spotlight / iCloud may be active)
- run4@16GB+7B discarded (Chrome 12 tabs + Slack)
Confidence:
- tok/s (clean, no swap): High
- TTFT: Medium-High (wall-clock; client-dependent)
- swap / collapse behavior: High (deterministic under memory pressure)
2.5 Credibility summary
| Type | Detail |
|---|---|
| Controlled | Ollama 0.6.2 fixed; Q4_K_M; num_ctx=2048; 512/256 tokens; 5 runs per config; logs show ggml_metal_init (Metal) |
| Known noise (logged) | warm machine ~−12%; Chrome/Slack background (run4 discarded); Spotlight/iCloud not disabled; 16GB and 24GB are two Lab machines (not one unit with RAM swap) |
| Uncertainty | cross-day median can differ ±5% (e.g. 7B@16GB: 29.1 vs retest 28.6); swap onset is nonlinear—do not treat one run as daily life |
| Not claimed | chip bin variance; multi-user concurrency; Q8/70B; MLX at same conditions (see MLX vs Ollama) |
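The ±5% cross-day band in the table can be checked mechanically when you compare your own retest against a baseline. A minimal sketch; the helper names and the default tolerance argument are illustrative, and the two medians are the ones quoted in the Uncertainty row.

```python
def relative_diff(baseline: float, retest: float) -> float:
    """Absolute difference as a fraction of the baseline median."""
    return abs(retest - baseline) / baseline


def within_tolerance(baseline: float, retest: float, tolerance: float = 0.05) -> bool:
    """True when a retest median stays inside the cross-day noise band."""
    return relative_diff(baseline, retest) <= tolerance


# 7B@16GB: first-session median vs the later retest
print(round(relative_diff(29.1, 28.6) * 100, 1))  # percent difference
print(within_tolerance(29.1, 28.6))
```

A retest that lands well outside the band (for instance 11.2 vs 3.4 tok/s) is not noise; treat it as a sign that memory pressure changed between sessions.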
2.6 Lab traces: terminal and machine IDs
Before reproducing, confirm Metal and memory baseline. Terminal excerpt below (full version in repro assets “terminal session excerpt”):
$ ollama ps
NAME ID SIZE PROCESSOR UNTIL
qwen2.5:7b a1b2c3d4e5f6 4.7 GB 100% GPU 4 minutes from now
$ ollama ps # 16GB · after 14B run 2
qwen2.5:14b f6e5d4c3b2a1 9.1 GB 62% GPU/CPU 4 minutes from now
$ vm_stat | grep Swap
Swapins: 8421.
Swapouts: 1204.
$ memory_pressure
System-wide memory pressure: CRITICAL
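Since vm_stat counters are cumulative since boot, what matters for a benchmark is the change across a run rather than the absolute value. The sketch below parses the two swap lines shown above and computes that delta; the function names are hypothetical, and `read_swap_counters` assumes it runs on macOS where `vm_stat` is available.

```python
import re
import subprocess


def parse_swap_counters(vm_stat_output: str) -> dict:
    """Extract Swapins / Swapouts from vm_stat text (values end with a period)."""
    counters = {}
    pattern = r"^(Swapins|Swapouts):\s+(\d+)\."
    for name, value in re.findall(pattern, vm_stat_output, re.MULTILINE):
        counters[name] = int(value)
    return counters


def read_swap_counters() -> dict:
    """Run vm_stat (macOS only) and parse its swap counters."""
    out = subprocess.run(
        ["vm_stat"], capture_output=True, text=True, check=True
    ).stdout
    return parse_swap_counters(out)


def swapins_delta(before: dict, after: dict) -> int:
    """Swap-ins that happened between two snapshots."""
    return after.get("Swapins", 0) - before.get("Swapins", 0)
```

Take one snapshot before a run and one after; a non-zero `swapins_delta` means the tok/s for that run was measured under swap.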
3. Speed: tok/s, TTFT, and time to write 500 tokens
Numbers from Lab 7B/14B paired benchmark log (full file in §8.3 repro assets). We keep both median and all five raw runs—real benches are rarely neat arithmetic progressions.
3.1 Clean system: 16GB · qwen2.5:7b (five runs)
| run | tok/s | notes |
|---|---|---|
| 1 | 28.7 | — |
| 2 | 31.4 | fan ~3900rpm |
| 3 | 26.9 | low outlier, still in median |
| 4 | 22.3 | discarded (Chrome 12 tabs + Slack) |
| 5 | 33.0 | GC jitter high |
| median (runs 1,2,3,5) | 29.1 · mean 29.5 · p90 32.1 | |
TTFT wall clock: 1.78 / 1.91 / 2.03 / 2.14 s (median 1.97s). Swapins = 0.
3.2 Clean system: 16GB · qwen2.5:14b (session did not finish five runs)
| run | tok/s | TTFT | Swapins |
|---|---|---|---|
| 1 | 11.2 | 2.71s | 0 |
| 2 | 8.4 | 2.88s | 1204 |
| 3 | 3.4 | 5.81s | rising |
| 4 | — | — | runner killed (oom?) |
14B on 16GB has no stable median to report: run 3 logged memorystatus: WARN and run 4's process was killed, which matches the memory collapse in the full lab. So 16GB daily use should not keep 14B resident.
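To make the cliff visible in numbers, each run can be expressed as a slowdown relative to run 1. This is an illustrative sketch: the helper names are hypothetical, and the 2× threshold for calling a run "collapsed" is an assumption chosen for the example, not a Lab definition.

```python
def slowdown_factors(tok_s_runs):
    """How many times slower each run is than the first run."""
    first = tok_s_runs[0]
    return [round(first / run, 2) for run in tok_s_runs]


def first_collapse_run(tok_s_runs, threshold=2.0):
    """1-based index of the first run at least `threshold` times slower than run 1."""
    for index, factor in enumerate(slowdown_factors(tok_s_runs), start=1):
        if factor >= threshold:
            return index
    return None  # no run crossed the threshold


runs_16gb_14b = [11.2, 8.4, 3.4]  # runs 1-3 from the table above
print(slowdown_factors(runs_16gb_14b))
print(first_collapse_run(runs_16gb_14b))
```

The step from run 2 to run 3 is much larger than from run 1 to run 2, which is the nonlinear drop described in §1.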
3.3 Clean system: 24GB paired (m4-24gb-lab-02)
| model | 5× tok/s (raw) | median | ~wall for 500 tokens |
|---|---|---|---|
| qwen2.5:7b | 49.2 / 53.8 / 51.1 / 48.6 / 52.4 | 51.1 | ~9.8 s |
| qwen2.5:14b | 14.2 / 16.8 / 15.1 / 17.3 / 14.9 | 15.1 | ~33 s |
On 24GB, 14B’s five runs still vary (14.2–17.3) but show no swap throughout. An afternoon retest on another day gave a 7B@16GB median of 28.6 (including a 24.3 warm outlier; see log footer); cross-day variation of ±5% is normal.
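The median and wall-time columns in the 24GB table follow directly from the raw runs. The short sketch below recomputes them with the standard library; the dictionary and function names are illustrative, and the run values are copied from the table.

```python
from statistics import median

RUNS_24GB = {
    "qwen2.5:7b": [49.2, 53.8, 51.1, 48.6, 52.4],
    "qwen2.5:14b": [14.2, 16.8, 15.1, 17.3, 14.9],
}


def wall_seconds(tokens: int, tok_s: float) -> float:
    """Seconds needed to generate `tokens` at a steady decode rate."""
    return tokens / tok_s


def summarize(runs: dict, tokens: int = 500) -> dict:
    """Map each model to (median tok/s, seconds for `tokens` tokens)."""
    summary = {}
    for model, values in runs.items():
        mid = median(values)
        summary[model] = (mid, round(wall_seconds(tokens, mid), 1))
    return summary


print(summarize(RUNS_24GB))
```

The wall time assumes steady decode at the median rate and ignores TTFT, so real waits are slightly longer.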
3.4 Raw benchmark excerpt
--- m4-16gb-lab-01 · qwen2.5:7b ---
tok/s per run: 28.7 31.4 26.9 33.0 (run4 22.3 discarded)
median: 29.1
--- m4-16gb-lab-01 · qwen2.5:14b ---
run3: tok/s=3.4 TTFT_wall=5.81s
run4: ERROR runner killed (oom?)
--- m4-24gb-lab-02 · qwen2.5:14b ---
tok/s: 14.2 16.8 15.1 17.3 14.9 → median 15.1
3.5 Under load: 7B still usable, 14B breaks first
With 16GB + Chrome 12 tabs, the discarded 7B run4 still managed 22.3 tok/s, while 14B starts offloading to CPU after run2. In Agent loops TTFT hurts more than tok/s; see §7.1.
TL;DR: pick by memory
§3 above has 16GB / 24GB scores and swap evidence. One table to remember:
| RAM | 7B | 14B |
|---|---|---|
| 16GB | recommended | swap collapse |
| 24GB | fast | Agent recommended |
Matches §3.1–3.3 medians and swap logs; edge cases (load, long ctx) in §3.5 and §6.
4. 7B vs 14B cost sheet (quick reference)
“Cost” here means on-device resource bill (RAM, latency, stability), not cloud API pricing. Summary for 24GB clean state and 16GB boundaries—for snippets and team decisions.
| Item | Qwen2.5 7B (Q4) | Qwen2.5 14B (Q4) |
|---|---|---|
| Model size (ollama ps) | ~4.7 GB | ~9.1 GB |
| 16GB median tok/s | 29.1 (daily OK) | no stable median; ~3.4 after swap |
| 24GB median tok/s | 51.1 | 15.1 |
| Cold-start TTFT (typical) | ~1.9 s | ~2.7 s |
| Recommended unified memory | 16 GB | 24 GB |
| Coding / Agent | light drafts, reviewable | cross-file edits, recommended |
| Chat / summarization | recommended | optional (limited quality gain) |
| 16GB long-term resident | ✅ | ❌ swap / OOM risk |
16GB: stay on 7B for smooth daily use; move to 24GB before expecting stable 14B. Match your scenario in §8.4.
5. Quality: when 7B is enough vs when you need 14B
We ran 20 fixed tasks (10 Chinese + 10 English) across four types: summarization, translation, single-file bugfix, small 3-file feature. Each task generated once on 7B and 14B; three engineers blind-rated “adopt as-is / minor edits / rewrite.”
5.1 Blind review summary (adopt-as-is rate)
| Task type | 7B | 14B | Felt gap |
|---|---|---|---|
| Email / meeting notes summary | 85% | 90% | 14B slightly steadier; 7B already fine |
| Zh→En technical translation | 80% | 88% | 14B misses fewer terms |
| Single-file Python/TS bug | 55% | 78% | 7B often “right direction, wrong detail” |
| Small 3-file feature (incl. rename) | 30% | 65% | largest gap; 7B misses call sites |
5.2 Typical 7B failure modes
- Hallucinated APIs: invents props / REST paths that look plausible.
- Missed edits: fixes definition, forgets to grep callers—most cross-file failures.
- Too terse for code: great at summaries; coding answers skip error handling—you add a human pass.
5.3 When 14B is worth the “memory tax” (24GB assumed)
- Local Claude Code / Cursor Agent >2 h/day on medium repos—cross-file adopt rate ~30% (7B) vs ~65% (14B).
- Long system prompts (style guides, architecture rules) must stay followed.
- Complex Chinese reasoning, multi-branch product rules, compliance checklists.
- You accept ~15 tok/s and longer wall time—quality for latency, not a misconfiguration.
5.4 When 7B is enough
- Personal notes Q&A, RSS summaries, simple shell scripts.
- Human-reviewed draft accelerator—not merging straight to main.
- 16GB with IDE + browser open—14B often dies on memory before “IQ.”
6. Memory: 16GB vs 24GB watershed
Footprint ≈ quantized weights + KV (∝ num_ctx) + macOS + foreground apps. 7B/14B Q4 weight gap ~4.5GB, but KV and OS overhead fill 16GB fast.
| Config | 7B | 14B | Advice |
|---|---|---|---|
| 16GB clean | ✅ median 29.1 tok/s | ⚠️ runs 1–2 ~11/8 tok/s, then swap | default 7B; don’t keep 14B resident |
| 16GB daily (IDE+browser) | ✅ can drop to 22.3 tok/s (run4, discarded) | ❌ OOM / killed | code on 7B or close tabs |
| 24GB clean | ✅ median 51.1 tok/s | ✅ median 15.1 tok/s | Agent sweet spot: 14B |
| 24GB + num_ctx=8192 | ✅ ~47 tok/s (separate run) | ✅ ~13.8 tok/s | long context OK |
6.1 num_ctx hits 14B harder
Raising num_ctx from 2048 to 32768: 24GB + 14B tok/s 15.1 → ~12.4 (single run); 16GB + 14B can sit 60s+ with no first token (E4 latency failure). If your Agent defaults to large context, confirm RAM tier first.
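A back-of-the-envelope KV cache estimate shows why context length hits 14B harder. This is a sketch under stated assumptions: the layer and KV-head counts are taken from the published Qwen2.5 model configs and should be verified against the model card, the cache is assumed to be fp16 (2 bytes per value), and the names `ModelShape` and `kv_cache_gib` are hypothetical.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    layers: int
    kv_heads: int
    head_dim: int


# Assumed from the published Qwen2.5 configs; verify before relying on them.
QWEN25_7B = ModelShape(layers=28, kv_heads=4, head_dim=128)
QWEN25_14B = ModelShape(layers=48, kv_heads=8, head_dim=128)


def kv_cache_gib(shape: ModelShape, num_ctx: int, bytes_per_value: int = 2) -> float:
    """Size of the K and V tensors across all layers for a full context window."""
    per_token = 2 * shape.layers * shape.kv_heads * shape.head_dim * bytes_per_value
    return per_token * num_ctx / GIB


for ctx in (2048, 8192, 32768):
    print(ctx, kv_cache_gib(QWEN25_7B, ctx), kv_cache_gib(QWEN25_14B, ctx))
```

Under these assumptions a full 32768-token window costs about 6 GiB for 14B versus under 2 GiB for 7B. Added to roughly 9 GB of 14B weights, that leaves almost nothing for macOS on a 16GB machine, which is consistent with the stall described above.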
7. Agent, TTFT, and Claude Code picks
Agent loop = many rounds of plan → tool → read back → generate. Local pain is often stacked TTFT per round, not peak tok/s—why “benchmark looked great, Agent felt awful.”
7.1 Why TTFT is the “real” metric for Agents
tok/s measures steady generation after start; TTFT is request to first token. For Agents:
- Each tool round waits for the model to speak—you feel TTFT × rounds, not the 256-token tok/s slice.
- Orchestrators often time out after tens of seconds. Under swap, TTFT rising from ~2s to 5.8s+ breaks multi-round loops.
- High tok/s only helps after streaming starts; 6s before first token feels broken.
| Scenario | 7B TTFT | 14B TTFT | For Agents |
|---|---|---|---|
| Model resident, clean | 0.48–0.55 s | 0.62–0.71 s | OK |
| After cold start | 1.78–2.14 s | 2.64–2.91 s | first task of day slower |
| 16GB swap + 14B | — | 5.81 s+ | multi-round loop unusable |
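The "TTFT × rounds" effect is easy to estimate. The sketch below is illustrative only: the function names, the 10-round loop, and the 100 tokens per round are hypothetical inputs, while the TTFT and tok/s figures are taken from the tables in this article.

```python
def stacked_ttft_s(rounds: int, ttft_s: float) -> float:
    """Total time spent waiting for first tokens across an Agent loop."""
    return rounds * ttft_s


def agent_wall_s(rounds: int, ttft_s: float, tokens_per_round: int, tok_s: float) -> float:
    """Rough loop time: each round pays TTFT plus decode time."""
    return rounds * (ttft_s + tokens_per_round / tok_s)


ROUNDS = 10  # hypothetical loop length
TOKENS = 100  # hypothetical output per round

# 14B resident (0.71 s TTFT) at the 24GB median of 15.1 tok/s
print(round(agent_wall_s(ROUNDS, 0.71, TOKENS, 15.1), 1))
# 14B under swap on 16GB: 5.81 s TTFT, 3.4 tok/s
print(round(agent_wall_s(ROUNDS, 5.81, TOKENS, 3.4), 1))
```

The estimate ignores tool execution time and any growth in prompt length between rounds, so it is a lower bound on what an Agent session would feel like.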
How unified memory and swap inflate TTFT: Unified memory & LLM inference.
7.2 Recommended combos (summary—full table §8.4)
| RAM | Model tag | Fit |
|---|---|---|
| 16GB | qwen2.5-coder:7b | personal Agent, light bugfixes |
| 24GB | qwen2.5-coder:14b | daily coding Agent, small-team Ollama |
| 16GB avoid resident | qwen2.5:14b | swap → TTFT spike, toolchain timeouts |
ollama pull, script paths, and §8 commands match exactly; SSH in minutes. Good for a one-week team repro before buying hardware.
7.3 Mixing with cloud APIs
Common split: 7B for retrieval/drafts, 14B or cloud for pre-merge review. If you already use Claude Code, local 14B buys you offline, repeatable runs with no token bill; setup is in Claude Code + Ollama local Agent lab.
7.4 Ollama or MLX?
This series tests Ollama only (HTTP, model management, Claude Code wiring). MLX is ~3–8% faster on same prompts but Agents still ship on Ollama first—see MLX vs Ollama benchmarks.
8. Reproduce commands and decision lists
8.1 Pull models and smoke test
ollama pull qwen2.5:7b
ollama pull qwen2.5:14b
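ollama run qwen2.5:7b "Explain in three sentences the main differences between 7B and 14B on a Mac Mini"
ollama run qwen2.5:14b "Explain in three sentences the main differences between 7B and 14B on a Mac Mini"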
Logs should show ggml_metal_init. If the model runs fully on CPU, upgrade Ollama (Hub E3: 0.5.13 without Metal gave ~4 tok/s). After the runs, check your output line by line against the repro assets.
8.2 Self-check by scenario (then use tables below)
- Agent editing the same medium repo daily?
- 16GB with Xcode + Chrome always open?
- OK with 14B writing 500 tokens in ~33s on 24GB?
- Need num_ctx > 8192?
- Shared inference Mac for a team?
8.3 Repro assets (download to verify)
These are static files in this article’s resources/ folder, not external links; open or save them in a browser to check every run behind §3.
- 7B/14B paired benchmark log: tok/s, TTFT, Swapins per run (source for §3 tables)
- Terminal session excerpt: ollama ps, vm_stat, memory_pressure
- 16GB 14B Ollama debug log: swap / OOM session
- Benchmark reproduce script: same logic as full lab
8.4 Decision tables (full answer here)
After the data above, pick by RAM and scenario. To audit every §3 run, open the paired benchmark log.
By unified memory (pick GB, then model)
| Your RAM | Recommended model | 14B note |
|---|---|---|
| 16GB | qwen2.5:7b (median ~29 tok/s) | 14B loads but swap → ~3 tok/s—not for residency |
| 24GB | chat: 7B (~51 tok/s); coding Agent: qwen2.5-coder:14b | 14B median ~15 tok/s, no swap |
By scenario
- Chat / summary / light scripts (16GB): → qwen2.5:7b
- Cross-file coding / local Agent (24GB recommended): → qwen2.5-coder:14b (quality for latency; see §7)
- Fastest, human review OK: → 7B or gemma3:4b
By persona
| You are… | Pick | Avoid |
|---|---|---|
| Individual 16GB, chat + light scripts | qwen2.5:7b | 14B resident |
| Individual 24GB, local coding Agent | qwen2.5-coder:14b | dropping to 7B for speed on cross-file refactors |
| Team shared inference node | 24GB + 7B or 32GB + 14B | 16GB + concurrent 14B |
| Fastest response only | 7B (or gemma3:4b) | 14B resident on 16GB |
Actionable conclusion: 16GB → 7B; consider 14B only at 24GB. Otherwise swap cuts 14B throughput to roughly a third (11.2 → 3.4 tok/s) and can kill the runner.
FAQ
M4 Mac Mini: 7B or 14B?
Check swap risk first, then model tier. Full picks (16GB→7B, 24GB→14B) in §8.4. The key finding explains why.
Can 16GB run 14B?
It loads; not for daily residency. See §1.1 three states, §3.2, and §8.4.
How much faster is 7B than 14B?
On the same 24GB machine, 7B median is 51.1 tok/s vs 15.1 for 14B (~3.4×). On 16GB, 7B median is 29.1 tok/s, while 14B forced onto 16GB drops to ~3.4 tok/s after swap. Details in §3.
7B or 14B for everyday chat?
Most chat: 7B. Cross-file coding: §5 and §8.4.
Claude Code local model?
16GB → qwen2.5-coder:7b; 24GB → qwen2.5-coder:14b. Agents: prioritize TTFT—§7.1.
Upgrade 16GB → 24GB for 14B?
Worth it if you rely on local Agent and 7B often “gets it but edits wrong”; pure chat often not. See §8.4.
Qwen2.5-Coder vs general 7B/14B?
Coding blind review ~8–12 points higher; general 7B/14B feel more natural in chat.
Summary
16GB → 7B; move to 24GB before expecting stable 14B. Whether 14B works is mostly RAM and swap, not “one tier smarter.” Reproduce via §8.3 logs and scripts and §2.4 environment block.
Related reading
More in this series:
- M4 Mac Mini local LLM (Hub: methodology & raw logs)
- Claude Code + Ollama local Agent
- MLX vs Ollama benchmarks
- Unified memory & LLM inference
Tests on physical Mac Mini M4 (Macstripe Lab and desk units), macOS 15.4.1, Ollama 0.6.2. Downloads in §8.3. No local hardware? Reproduce on Macstripe M4 nodes.