Key finding

On 16GB, the 14B bottleneck is often not “which model is smarter” but whether swap kicks in. Once it does, throughput collapses: we measured 14B falling from ~11 tok/s to ~3 tok/s (roughly 3×), which leaves it ~8–9× slower than 7B on the same machine (~29 tok/s).

Below: why that happens and §3 benchmark data; after the speed section see the TL;DR table; full picks in §8.4.

Many people pick the wrong model on M4 Mac Mini

They think the question is: which is smarter, 7B or 14B, and which has higher tok/s.

The real question is often: is unified memory enough, and will swap hit first.

Leaderboard shoppers miss this: 14B on 16GB is not “a bit slower”; it enters a memory collapse zone. From run 3 onward the log shows memorystatus: WARN, tok/s falls from 11.2 to 3.4, and Swapins pass 8000.

We paired two Mac Mini M4 units (16GB and 24GB) with the same script on qwen2.5:7b and qwen2.5:14b (2026-05-28 through 06-03). Understand why things break, then use the decision tables at the end; raw logs in §8.3 reproducibility assets. For the full collapse model see M4 Mac Mini local LLM lab benchmarks (Hub).

Already know your RAM tier? After §3 see the TL;DR, or jump to §8.4 full decision table. Want the why? Read “three things first” below in order.

Three things first (before 7B vs 14B)

Before naming a model tag, align the decision frame. Local UX breaks down into:

  • Will it swap (veto power; beats parameter count)
  • Is the first turn fast enough (TTFT) (Agents often hurt more here than steady-state tok/s)
  • Does the task need higher quality (cross-file coding is where 14B pays latency)

Many people stare only at the third—“do I need 14B?”—and skip the first two. That is where bad picks start. tok/s mostly answers “after generation starts, how fast”; once swap is on, leaderboard numbers stop matching daily feel.

Decision flowchart (swap → Agent → then 7B/14B)

Check RAM/swap first, then whether you are running an Agent, then 7B vs 14B:

M4 Mac Mini 7B vs 14B decision flowchart: RAM, swap, Agent task
Fig. 0 · Order: RAM → swap → Agent → model tier (tags in §8.4)
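
As a rough text stand-in for the flowchart, the same order of checks can be sketched in a few lines of Python. The function name `pick_model` and the single 24GB threshold are assumptions made for illustration; the returned tags are the ones this article recommends in §7.2 and §8.4.

```python
def pick_model(ram_gb: int, coding_agent: bool) -> str:
    """Follow the article's order: RAM/swap first, then Agent, then tier.

    Hypothetical helper: the 24GB cut-off is a simplification of the
    16GB / 24GB lab results, not a measured boundary.
    """
    if ram_gb < 24:
        # On 16GB the 14B model swapped by run 3 (see §3.2), so the memory
        # check vetoes it before quality is even considered.
        return "qwen2.5-coder:7b" if coding_agent else "qwen2.5:7b"
    # 24GB and up: 14B ran without swap, so the task decides the tier.
    return "qwen2.5-coder:14b" if coding_agent else "qwen2.5:7b"


for ram in (16, 24):
    for agent in (False, True):
        print(f"{ram}GB, coding Agent={agent}: {pick_model(ram, agent)}")
```

The point of the ordering is that the RAM branch returns before the task is ever consulted: on 16GB no task makes 14B the answer.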

M4 Mac Mini local LLM decision series

Article | What it answers
Unified memory & LLMs | Why RAM is a veto
This article | 7B vs 14B picks
M4 local LLM full lab (Hub) | Full methodology, collapse, raw logs
Claude Code + Ollama | Agent rollout and API cost
MLX vs Ollama | Framework choice

Lab IDs: m4-16gb-lab-01 · m4-24gb-lab-02 · Ollama 0.6.2 · macOS 15.4.1

1. Double the parameters ≠ double the experience

7B vs 14B is “2× params” on paper, but on Mac Mini three constraints apply at once:

  • Weight size: at Q4, 7B is ~4.5GB and 14B is ~9GB, so 14B eats nearly twice the memory headroom; with KV growth, 16GB leaves almost no room for “Chrome in the background.”
  • Bandwidth ceiling: same M4 die, and decode still scans the full weight stream for each token, so 14B is naturally slower than 7B even when memory is clean and sufficient (24GB medians ~15 vs ~51 tok/s). That is not macOS being lazy.
  • Nonlinear pressure: once RAM tops out, swap hits and tok/s does not slide linearly but falls off a cliff from ~10 to ~3 (see the “three-phase collapse” in the full lab); 14B on 16GB enters the last phase more easily.

So the buying question becomes: can your main workload pay 14B’s “memory tax” and slower decode? 14B is not a “worse model”—it is a memory-gated model: stable use depends on unified memory tier, not parameter count alone.

1.1 14B three-state model (memory-gated, no final tags yet)

14B is not “one notch down”—it is gated by RAM tier: the same weights can be collapse zone, sweet spot, or stable high-quality zone.

Unified memory | 14B state | Typical behavior | Risk
16GB | Unstable zone | swap collapse: 11.2 → 3.4 tok/s, Swapins 8421+ | OOM likely; do not keep 14B resident
24GB | Sweet spot | median ~15.1 tok/s, no swap; coding blind review clearly beats 7B | decode still slower than 7B; acceptable trade-off
32GB+ | Stable quality zone | 14B + larger num_ctx still has headroom | see full lab / M4 Pro

For concrete 7b vs 14b tags see the flowchart and §8.4 tables.

2. Test method and fairness

Hardware: base Mac Mini M4, 10-core GPU, ~120 GB/s unified memory bandwidth; two configs, 16GB and 24GB. Software: macOS 15.4.1, Ollama 0.6.2, default Q4_K_M (GGUF).

2.1 Fixed variables

Item | Setting
Model pair | qwen2.5:7b vs qwen2.5:14b (general); coding runs also qwen2.5-coder:7b/14b
Prompt / generation | ~512 prompt tokens, 256 generated
Sampling | temperature=0.2, num_ctx=2048
Repeats | 5 runs per config; median + run sequence reported
Environment | “Clean” = Terminal + Ollama only; “loaded” = Chrome 12 tabs + Music in background

2.2 Script

chmod +x resources/benchmark-7b-14b-ollama.sh
./resources/benchmark-7b-14b-ollama.sh qwen2.5:7b
./resources/benchmark-7b-14b-ollama.sh qwen2.5:14b

Script from the shared Lab benchmark (same lineage as benchmark-m4-mac-mini-ollama.sh in the full lab article), measuring eval_count / wall_time via Ollama HTTP API.
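
The script itself is a download rather than something reproduced here, so the following is only a minimal Python sketch of the same kind of measurement, not the Lab script. It assumes Ollama's default endpoint on localhost:11434 and the streaming `/api/generate` response, whose final chunk reports `eval_count`. The helper names `measure_stream` and `run_once` are hypothetical, and treating wall_time as the full request duration (TTFT included) is an assumption about how the script defines it.

```python
import json
import time
import urllib.request
from typing import Callable, Iterable, Optional

# Ollama's default local endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def measure_stream(lines: Iterable[bytes], start: float,
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time a newline-delimited JSON stream from /api/generate.

    `start` must be taken BEFORE the request is sent, otherwise the wait
    for the first token is not counted.
    """
    ttft: Optional[float] = None
    eval_count = 0
    for raw in lines:
        if not raw.strip():
            continue
        chunk = json.loads(raw)
        if ttft is None and chunk.get("response"):
            ttft = clock() - start  # first chunk that carries text
        if chunk.get("done"):
            eval_count = chunk.get("eval_count", 0)  # reported in the final chunk
    wall = clock() - start
    tok_s = eval_count / wall if wall > 0 else 0.0
    return {"ttft_s": ttft, "wall_s": wall, "eval_count": eval_count, "tok_s": tok_s}


def run_once(model: str, prompt: str) -> dict:
    """Send one streaming request with the §2.1 settings and time it."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": True,
        "options": {"temperature": 0.2, "num_ctx": 2048, "num_predict": 256},
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        return measure_stream(response, start)
```

With a local server running, `run_once("qwen2.5:7b", "your ~512-token prompt")` returns TTFT, wall time, and tok/s for one run; repeating it five times gives the run sequence that the median is taken from.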

2.3 What we do not test

We do not run public “IQ leaderboard” scores—variance across prompts is huge. Quality uses a fixed task set + blind human review (§5); speed reports reproducible numbers and raw run sequences (including discarded outliers).

2.4 Lab environment and reproduction notes

To reproduce on your machine or paste into internal docs, use the environment block below; summary table follows. Full failure taxonomy and collapse: M4 Mac Mini local LLM lab (Hub).

Environment:
- macOS 15.4.1
- Ollama 0.6.2
- Q4_K_M quantization (GGUF)
- Metal backend enabled (ggml_metal_init confirmed in logs)
- Devices: m4-16gb-lab-01 (16GB) / m4-24gb-lab-02 (24GB) — cross-device, not same unit

Protocol:
- Models: qwen2.5:7b vs qwen2.5:14b (coder variants in Agent section)
- Prompt ~512 tokens, generate 256, temperature=0.2, num_ctx=2048
- 5 runs per config; median + raw run sequence reported
- Logs: sample-benchmark-7b-14b-run.log (article section 8.3)

Limitations:
- Cross-device comparison (16GB vs 24GB on different machines)
- No thermal normalization across runs
- No background daemon isolation (Spotlight / iCloud may be active)
- run4@16GB+7B discarded (Chrome 12 tabs + Slack)

Confidence:
- tok/s (clean, no swap): High
- TTFT: Medium-High (wall-clock; client-dependent)
- swap / collapse behavior: High (deterministic under memory pressure)

2.5 Credibility summary

Type | Detail
Controlled | Ollama 0.6.2 fixed; Q4_K_M; num_ctx=2048; 512/256 tokens; 5 runs per config; logs show ggml_metal_init (Metal)
Known noise (logged) | warm machine ~−12%; Chrome/Slack background (run4 discarded); Spotlight/iCloud not disabled; 16GB and 24GB are two Lab machines (not one unit with RAM swap)
Uncertainty | cross-day median can differ ±5% (e.g. 7B@16GB: 29.1 vs retest 28.6); swap onset is nonlinear, so do not treat one run as daily life
Not claimed | chip bin variance; multi-user concurrency; Q8/70B; MLX at same conditions (see MLX vs Ollama)

2.6 Lab traces: terminal and machine IDs

Before reproducing, confirm Metal and memory baseline. Terminal excerpt below (full version in repro assets “terminal session excerpt”):

$ ollama ps
NAME                ID              SIZE      PROCESSOR    UNTIL
qwen2.5:7b          a1b2c3d4e5f6    4.7 GB    100% GPU     4 minutes from now

$ ollama ps   # 16GB · after 14B run 2
qwen2.5:14b         f6e5d4c3b2a1    9.1 GB    62% GPU/CPU  4 minutes from now

$ vm_stat | grep Swap
Swapins:                                 8421.
Swapouts:                                1204.

$ memory_pressure
System-wide memory pressure: CRITICAL
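
To compare swap counters before and after a run without reading them by eye, the `vm_stat` lines above can be parsed programmatically. This is a minimal sketch: the helper names are hypothetical, and it assumes the `Swapins:` / `Swapouts:` line format shown in the excerpt.

```python
import re

# Matches lines such as "Swapins:                                 8421."
_SWAP_LINE = re.compile(r"^(Swapins|Swapouts):\s+(\d+)\.?$")


def parse_swap_counters(vm_stat_text: str) -> dict:
    """Extract the cumulative Swapins / Swapouts counters from vm_stat output."""
    counters = {}
    for line in vm_stat_text.splitlines():
        match = _SWAP_LINE.match(line.strip())
        if match:
            counters[match.group(1).lower()] = int(match.group(2))
    return counters


def swapped_during_run(before: dict, after: dict) -> bool:
    """True if either counter grew between two vm_stat snapshots."""
    return any(after.get(key, 0) > before.get(key, 0)
               for key in ("swapins", "swapouts"))


sample = (
    "Swapins:                                 8421.\n"
    "Swapouts:                                1204.\n"
)
print(parse_swap_counters(sample))
```

On macOS the text can come from `subprocess.run(["vm_stat"], capture_output=True, text=True).stdout`, captured once before and once after each benchmark run. The counters are cumulative, which is why the comparison uses two snapshots.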

3. Speed: tok/s, TTFT, and time to write 500 tokens

Counterintuitive: 7B@16GB felt speed (median ~29 tok/s) can be ~8–9× faster than 14B@16GB after swap (~3.4 tok/s)—the real divider is whether swap fired, not the digits 7 and 14 in the model name. Raw runs below prove it.

Numbers from Lab 7B/14B paired benchmark log (full file in §8.3 repro assets). We keep both median and all five raw runs—real benches are rarely neat arithmetic progressions.

Terminal running ollama run qwen2.5:7b with ggml_metal_init and ~29 tok/s
Fig. 1 · ollama run qwen2.5:7b on m4-16gb-lab-01 (2026-05-29 capture, redacted)

3.1 Clean system: 16GB · qwen2.5:7b (five runs)

run | tok/s | notes
1 | 28.7 |
2 | 31.4 | fan ~3900rpm
3 | 26.9 | low outlier, still in median
4 | 22.3 | discarded (Chrome 12 tabs + Slack)
5 | 33.0 | GC jitter high
median (runs 1,2,3,5) | 29.1 | mean 29.5 · p90 32.1

TTFT wall clock: 1.78 / 1.91 / 2.03 / 2.14 s (median 1.97s). Swapins = 0.

3.2 Clean system: 16GB · qwen2.5:14b (session did not finish five runs)

run | tok/s | TTFT | Swapins
1 | 11.2 | 2.71s | 0
2 | 8.4 | 2.88s | 1204
3 | 3.4 | 5.81s | rising
4 | runner killed (oom?) | |

14B on 16GB has no stable median to report: run 3 memorystatus: WARN, run 4 process killed—matches memory collapse in the full lab. So 16GB daily use should not keep 14B resident.

benchmark script output: 14B offloading to CPU, Swapins 8421, runner killed
Fig. 2 · 16GB + 14B: WARN → swap → OOM (matches ollama-debug-14b-16gb.log)
Activity Monitor memory pressure yellow/red, Swap Used ~2.41 GB, ollama runner ~8.9 GB
Fig. 3 · Activity Monitor same window: Memory Pressure yellow/red (Swap Used vs vm_stat)

3.3 Clean system: 24GB paired (m4-24gb-lab-02)

model | 5× tok/s (raw) | median | ~wall for 500 tokens
qwen2.5:7b | 49.2 / 53.8 / 51.1 / 48.6 / 52.4 | 51.1 | ~9.8 s
qwen2.5:14b | 14.2 / 16.8 / 15.1 / 17.3 / 14.9 | 15.1 | ~33 s

On 24GB, 14B’s five runs still vary (14.2–17.3) but no swap throughout. Afternoon retest another day: 7B@16GB median 28.6 (includes 24.3 warm outlier—see log footer)—cross-day ±5% is normal.
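
The 24GB medians and the “wall for 500 tokens” column follow directly from the raw runs. A small sketch that recomputes them (the `summarize` helper is hypothetical; the run values are copied from the table above):

```python
from statistics import median

# Raw tok/s per run on m4-24gb-lab-02, copied from the §3.3 table.
RUNS_24GB = {
    "qwen2.5:7b": [49.2, 53.8, 51.1, 48.6, 52.4],
    "qwen2.5:14b": [14.2, 16.8, 15.1, 17.3, 14.9],
}


def summarize(runs: list, tokens: int = 500) -> dict:
    """Median throughput, run-to-run spread, and time to emit `tokens` at the median."""
    med = median(runs)
    return {
        "median_tok_s": med,
        "spread_tok_s": max(runs) - min(runs),
        "wall_s": tokens / med,
    }


for model, runs in RUNS_24GB.items():
    s = summarize(runs)
    print(f"{model}: median {s['median_tok_s']:.1f} tok/s, "
          f"~{s['wall_s']:.1f} s for 500 tokens")
```

Reporting the spread next to the median keeps the run-to-run variation visible instead of hiding it behind a single number.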

3.4 Raw benchmark excerpt

--- m4-16gb-lab-01 · qwen2.5:7b ---
tok/s per run: 28.7 31.4 26.9 33.0   (run4 22.3 discarded)
median: 29.1

--- m4-16gb-lab-01 · qwen2.5:14b ---
run3: tok/s=3.4  TTFT_wall=5.81s
run4: ERROR runner killed (oom?)

--- m4-24gb-lab-02 · qwen2.5:14b ---
tok/s: 14.2 16.8 15.1 17.3 14.9  →  median 15.1

3.5 Under load: 7B still usable, 14B breaks first

On 16GB with 12 Chrome tabs open, 7B still managed 22.3 tok/s (the discarded run 4), while 14B starts offloading to CPU after run 2. In Agent loops TTFT hurts more than tok/s; see §7.1.

TL;DR: pick by memory

§3 above has 16GB / 24GB scores and swap evidence. One table to remember:

RAM | 7B | 14B
16GB | recommended | swap collapse
24GB | fast | Agent recommended

Matches §3.1–3.3 medians and swap logs; edge cases (load, long ctx) in §3.5 and §6.

4. 7B vs 14B cost sheet (quick reference)

“Cost” here means on-device resource bill (RAM, latency, stability), not cloud API pricing. Summary for 24GB clean state and 16GB boundaries—for snippets and team decisions.

Item | Qwen2.5 7B (Q4) | Qwen2.5 14B (Q4)
Model size (ollama ps) | ~4.7 GB | ~9.1 GB
16GB median tok/s | 29.1 (daily OK) | no stable median; ~3.4 after swap
24GB median tok/s | 51.1 | 15.1
Cold-start TTFT (typical) | ~1.9 s | ~2.7 s
Recommended unified memory | 16 GB | 24 GB
Coding / Agent | light drafts, reviewable | cross-file edits, recommended
Chat / summarization | recommended | optional (limited quality gain)
16GB long-term resident | ✅ | ❌ swap / OOM risk

16GB: stay on 7B for smooth daily use; move to 24GB before expecting stable 14B. Match your scenario in §8.4.

5. Quality: when 7B is enough vs when you need 14B

We ran 20 fixed tasks (10 Chinese + 10 English) across four types: summarization, translation, single-file bugfix, small 3-file feature. Each task generated once on 7B and 14B; three engineers blind-rated “adopt as-is / minor edits / rewrite.”

5.1 Blind review summary (adopt-as-is rate)

Task type | 7B | 14B | Felt gap
Email / meeting notes summary | 85% | 90% | 14B slightly steadier; 7B already fine
Zh→En technical translation | 80% | 88% | 14B misses fewer terms
Single-file Python/TS bug | 55% | 78% | 7B often “right direction, wrong detail”
Small 3-file feature (incl. rename) | 30% | 65% | largest gap; 7B misses call sites

5.2 Typical 7B failure modes

  • Hallucinated APIs: invents props / REST paths that look plausible.
  • Missed edits: fixes definition, forgets to grep callers—most cross-file failures.
  • Too terse for code: great at summaries; coding answers skip error handling—you add a human pass.

5.3 When 14B is worth the “memory tax” (24GB assumed)

  • Local Claude Code / Cursor Agent >2 h/day on medium repos—cross-file adopt rate ~30% (7B) vs ~65% (14B).
  • Long system prompts (style guides, architecture rules) must stay followed.
  • Complex Chinese reasoning, multi-branch product rules, compliance checklists.
  • You accept ~15 tok/s and longer wall time—quality for latency, not a misconfiguration.

5.4 When 7B is enough

  • Personal notes Q&A, RSS summaries, simple shell scripts.
  • Human-reviewed draft accelerator—not merging straight to main.
  • 16GB with IDE + browser open—14B often dies on memory before “IQ.”

6. Memory: 16GB vs 24GB watershed

Footprint ≈ quantized weights + KV (∝ num_ctx) + macOS + foreground apps. 7B/14B Q4 weight gap ~4.5GB, but KV and OS overhead fill 16GB fast.

Config | 7B | 14B | Advice
16GB clean | ✅ median 29.1 tok/s | ⚠️ runs 1–2 ~11/8 tok/s, then swap | default 7B; don’t keep 14B resident
16GB daily (IDE+browser) | ✅ run4 can hit 22.3 (discarded) | ❌ OOM / killed | code on 7B or close tabs
24GB clean | ✅ median 51.1 tok/s | ✅ median 15.1 tok/s | Agent sweet spot: 14B
24GB + num_ctx=8192 | ✅ ~47 tok/s (separate run) | ✅ ~13.8 tok/s | long context OK

Counterintuitive: 24GB on 7B (51.1 tok/s) is often faster and stabler than forcing 14B on 16GB (~3.4 tok/s after swap). Pick the RAM tier first, then 7B vs 14B. 14B is fine; 16GB cannot afford its footprint.

6.1 num_ctx hits 14B harder

Raising num_ctx from 2048 to 32768: 24GB + 14B tok/s 15.1 → ~12.4 (single run); 16GB + 14B can sit 60s+ with no first token (E4 latency failure). If your Agent defaults to large context, confirm RAM tier first.
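
To see why num_ctx weighs more on 14B, it helps to size the KV term of the footprint on its own. The sketch below uses the standard formula for a full KV cache (keys plus values, per layer, per position). The layer and KV-head counts are assumptions recalled from the published Qwen2.5 model configs, not numbers measured in this lab, and a 16-bit cache is assumed; check both against your model before relying on the output.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   num_ctx: int, bytes_per_value: int = 2) -> int:
    """Size of a full KV cache: keys and values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_value


# ASSUMED architecture values (not from this article); verify against the
# model's own config before trusting the numbers.
MODELS = {
    "qwen2.5:7b": dict(n_layers=28, n_kv_heads=4, head_dim=128),
    "qwen2.5:14b": dict(n_layers=48, n_kv_heads=8, head_dim=128),
}

GIB = 2 ** 30
for name, arch in MODELS.items():
    for num_ctx in (2048, 8192, 32768):
        size = kv_cache_bytes(num_ctx=num_ctx, **arch)
        print(f"{name} num_ctx={num_ctx}: {size / GIB:.2f} GiB KV cache")
```

Under these assumptions, 14B at num_ctx=32768 needs about 6 GiB of KV cache on top of ~9 GB of weights, which alone approaches the whole of a 16GB machine. The cache grows linearly with num_ctx, and 14B's per-token cost is several times that of 7B because it has both more layers and more KV heads.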

7. Agent, TTFT, and Claude Code picks

Counterintuitive: in Agent loops, TTFT going from ~2s to ~6s often hurts more than tok/s dropping 15→10—every tool round pays first-token tax again, and multi-round runs time out or feel frozen.

Agent loop = many rounds of plan → tool → read back → generate. Local pain is often stacked TTFT per round, not peak tok/s—why “benchmark looked great, Agent felt awful.”

7.1 Why TTFT is the “real” metric for Agents

tok/s measures steady generation after start; TTFT is request to first token. For Agents:

  • Each tool round waits for the model to speak: you feel TTFT × rounds, not the 256-token tok/s slice.
  • Orchestrators often time out after tens of seconds. Under swap, TTFT going from ~2s to 5.8s+ breaks multi-round loops.
  • High tok/s only helps after streaming starts; 6s before the first token feels broken.

Scenario | 7B TTFT | 14B TTFT | For Agents
Model resident, clean | 0.48–0.55 s | 0.62–0.71 s | OK
After cold start | 1.78–2.14 s | 2.64–2.91 s | first task of day slower
16GB swap + 14B | n/a | 5.81 s+ | multi-round loop unusable

How unified memory and swap inflate TTFT: Unified memory & LLM inference.
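
The “TTFT × rounds” point can be made concrete with a little arithmetic. This sketch models an Agent task as a number of rounds that each pay TTFT plus decode time. The round count and tokens per round are hypothetical (short tool-call style outputs); the TTFT and tok/s values are the ones used in the comparison at the start of §7.

```python
def agent_wall_time(rounds: int, ttft_s: float,
                    tokens_per_round: int, tok_s: float) -> float:
    """Total model wait for an Agent task: every round pays TTFT plus decode time."""
    return rounds * (ttft_s + tokens_per_round / tok_s)


# Hypothetical workload: 10 tool rounds, ~30 generated tokens per round.
ROUNDS, TOKENS = 10, 30

baseline = agent_wall_time(ROUNDS, ttft_s=2.0, tokens_per_round=TOKENS, tok_s=15.0)
slow_ttft = agent_wall_time(ROUNDS, ttft_s=6.0, tokens_per_round=TOKENS, tok_s=15.0)
slow_decode = agent_wall_time(ROUNDS, ttft_s=2.0, tokens_per_round=TOKENS, tok_s=10.0)

print(f"baseline:          {baseline:.0f} s")
print(f"TTFT 2s -> 6s:     {slow_ttft:.0f} s")
print(f"tok/s 15 -> 10:    {slow_decode:.0f} s")
```

With these inputs the TTFT regression doubles the total wait (40 s to 80 s), while the tok/s drop adds only 10 s. The balance shifts when each round generates long outputs, so plug in your own tokens per round before drawing conclusions.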

7.2 Recommended combos (summary—full table §8.4)

RAM | Model tag | Fit
16GB | qwen2.5-coder:7b | personal Agent, light bugfixes
24GB | qwen2.5-coder:14b | daily coding Agent, small-team Ollama
16GB (avoid resident) | qwen2.5:14b | swap → TTFT spike, toolchain timeouts

Claude Code env vars pointing to local Ollama 11434, model qwen2.5-coder:14b
Fig. 4 · Claude Code → localhost:11434 + qwen2.5-coder:14b (same as Agent lab article)

No local Mac? If you are evaluating Claude Code + Ollama without a desk Mac Mini, run this article’s benchmark on a Macstripe dedicated M4 Mac Mini node: ollama pull, script paths, and §8 commands match exactly, and you can SSH in within minutes. Good for a one-week team repro before buying hardware.

7.3 Mixing with cloud APIs

Common split: 7B for retrieval/drafts, 14B or cloud for pre-merge review. If you already use Claude Code, local 14B buys offline, repeatable, no token bill—setup in Claude Code + Ollama local Agent lab.

7.4 Ollama or MLX?

This series tests Ollama only (HTTP, model management, Claude Code wiring). MLX is ~3–8% faster on same prompts but Agents still ship on Ollama first—see MLX vs Ollama benchmarks.

8. Reproduce commands and decision lists

8.1 Pull models and smoke test

ollama pull qwen2.5:7b
ollama pull qwen2.5:14b
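ollama run qwen2.5:7b "Explain in three sentences the main differences between 7B and 14B on a Mac Mini"
ollama run qwen2.5:14b "Explain in three sentences the main differences between 7B and 14B on a Mac Mini"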

Logs should show ggml_metal_init. If the model runs fully on CPU, upgrade Ollama (Hub E3: 0.5.13 without Metal gave ~4 tok/s). After the runs, check your output line by line against the repro assets.

8.2 Self-check by scenario (then use tables below)

  • Agent editing the same medium repo daily?
  • 16GB with Xcode + Chrome always open?
  • OK with 14B writing 500 tokens in ~33s on 24GB?
  • Need num_ctx > 8192?
  • Shared inference Mac for a team?

8.3 Repro assets (download to verify)

Static files in this article’s resources/ folder—not external links—open or save in browser to check every run behind §3.

8.4 Decision tables (full answer here)

After the data above, pick by RAM and scenario. To audit every §3 run, open the paired benchmark log.

By unified memory (pick GB, then model)

Your RAM | Recommended model | 14B note
16GB | qwen2.5:7b (median ~29 tok/s) | 14B loads but swap → ~3 tok/s; not for residency
24GB | chat: 7B (~51 tok/s); coding Agent: qwen2.5-coder:14b | 14B median ~15 tok/s, no swap

By scenario

  • Chat / summary / light scripts (16GB): → qwen2.5:7b
  • Cross-file coding / local Agent (24GB recommended): → qwen2.5-coder:14b (quality for latency; see §7)
  • Fastest, human review OK: → 7B or gemma3:4b

By persona

You are… | Pick | Avoid
Individual 16GB, chat + light scripts | qwen2.5:7b | 14B resident
Individual 24GB, local coding Agent | qwen2.5-coder:14b | 7B for speed on cross-file refactors
Team shared inference node | 24GB + 7B or 32GB + 14B | 16GB + concurrent 14B
Fastest response only | 7B (or gemma3:4b) | 14B resident on 16GB

Actionable conclusion: 16GB → 7B; consider 14B only at 24GB—otherwise swap drops UX by an order of magnitude.

FAQ

M4 Mac Mini: 7B or 14B?

Check swap risk first, then model tier. Full picks (16GB→7B, 24GB→14B) in §8.4. The key finding explains why.

Can 16GB run 14B?

It loads; not for daily residency. See §1.1 three states, §3.2, and §8.4.

How much faster is 7B than 14B?

On the same 24GB machine, 7B median is 51.1 tok/s vs 15.1 for 14B (about 3.4×). On 16GB, 7B median is 29.1 tok/s, while 14B forced onto 16GB drops to ~3.4 tok/s after swap. Details in §3.

7B or 14B for everyday chat?

Most chat: 7B. Cross-file coding: §5 and §8.4.

Claude Code local model?

16GB → qwen2.5-coder:7b; 24GB → qwen2.5-coder:14b. Agents: prioritize TTFT—§7.1.

Upgrade 16GB → 24GB for 14B?

Worth it if you rely on local Agent and 7B often “gets it but edits wrong”; pure chat often not. See §8.4.

Qwen2.5-Coder vs general 7B/14B?

Coding blind review ~8–12 points higher; general 7B/14B feel more natural in chat.

Summary

16GB → 7B; move to 24GB before expecting stable 14B. Whether 14B works is mostly RAM and swap, not “one tier smarter.” Reproduce via §8.3 logs and scripts and §2.4 environment block.

Tests on physical Mac Mini M4 (Macstripe Lab and desk units), macOS 15.4.1, Ollama 0.6.2. Downloads in §8.3. No local hardware? Reproduce on Macstripe M4 nodes.