Many developers search for Claude Code and Ollama separately. The question that actually matters for production—and for Macstripe customers—is: how do you run a local AI agent on an M4 Mac Mini? In 2026, a practical answer is Claude Code for agent orchestration (read the repo, run commands, edit files) plus Ollama on Apple Silicon (pay for hardware and power, not per-token cloud bills).
This article leads with cost and speed results (the “is it worth it?” question), then architecture and setup. If you own team infrastructure, we also cover a cloud Mac inference node and a planned “Claude Code + Apple Silicon” series. For framework choice, see MLX vs Ollama on Apple Silicon.
1. Real results: how much you save and whether speed is enough
The numbers below come from Macstripe’s benchmarks on a dedicated M4 Mac Mini (24GB unified memory) running Ollama, plus a billing review from an 8-person backend pilot that moved to “Claude Code + on-prem Ollama” (April–May 2026, hybrid setup). Your mileage will vary, but the order of magnitude is useful for decisions.
1.1 After about one month: API bill change (illustrative)
| Item | Before (cloud API only) | After (local-first) | Change |
|---|---|---|---|
| Claude / similar API usage | ~$300/month | ~$50/month (architecture review, etc.) | ~−83% |
| Inference compute | Bundled in API | 1× M4 Mac Mini cloud lease + power | Fixed, predictable cost |
| Data egress | Default off-network | Daily agent work stays on LAN | Compliance-friendly |
Most savings come from high-frequency, repetitive agent calls: test fixes, batch refactors, doc summaries. If everyone runs multi-round “whole-repo architecture” agents daily, keep a cloud budget for strong models; otherwise retries on a weaker local model can increase total time.
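The percentage in the table can be checked with a few lines of arithmetic. This is a minimal sketch using only the approximate pilot figures quoted above; the function name `percent_change` is illustrative.

```python
def percent_change(before: float, after: float) -> float:
    """Return the relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before * 100.0


# Approximate figures from the pilot table above (USD per month).
api_before = 300.0
api_after = 50.0

change = percent_change(api_before, api_after)
monthly_saving = api_before - api_after

print(f"API spend change: {change:.0f}%")
print(f"API spend saved:  ${monthly_saving:.0f}/month")
```

Note that this covers API spend only. The fixed cost of the inference machine (lease and power) has to be subtracted from the saving to get the net effect.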
1.2 Inference speed on M4 Mac Mini (Ollama, 4-bit quantization)
| Model | Generation speed (approx.) | Time to first token | Day-to-day agent feel |
|---|---|---|---|
| Qwen2.5-Coder 7B | ~25 tokens/s | ~200 ms | Fine for single-module edits and tests |
| Qwen2.5-Coder 14B | ~15 tokens/s | ~280 ms | Better quality for slightly harder tasks |
| glm-4.7-flash (~9GB class) | ~30 tokens/s | ~170 ms | Speed-biased; good for short Q&A |
Test conditions: M4 Mac Mini 24GB, macOS 15.x, Ollama 0.14+, ~2k-token prompt continuation. On 16GB machines, 14B often triggers swap—team inference boxes should start at 24GB. On the same hardware, MLX is typically ~10%–15% faster; see our comparison article.
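To turn the table into a feel for waiting time, a rough estimate is time to first token plus reply length divided by generation speed. The sketch below uses the approximate figures from the table; the 300-token reply length is a hypothetical value chosen for illustration, and longer prompts than the ~2k-token test condition would raise the time to first token.

```python
def estimated_response_seconds(output_tokens: int,
                               tokens_per_second: float,
                               first_token_ms: float) -> float:
    """Time to first token plus steady-state generation time."""
    return first_token_ms / 1000.0 + output_tokens / tokens_per_second


# Approximate figures from the table above: (tokens/s, time to first token in ms).
MODELS = {
    "Qwen2.5-Coder 7B": (25.0, 200.0),
    "Qwen2.5-Coder 14B": (15.0, 280.0),
    "glm-4.7-flash": (30.0, 170.0),
}

REPLY_TOKENS = 300  # hypothetical reply length, e.g. a small patch

for name, (speed, ttft_ms) in MODELS.items():
    seconds = estimated_response_seconds(REPLY_TOKENS, speed, ttft_ms)
    print(f"{name}: ~{seconds:.1f} s for {REPLY_TOKENS} tokens")
```

Under these assumptions, generation time dominates and time to first token is a small fraction of the total, so tokens per second is the number to watch for agent work.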
1.3 Concurrency and stability (one shared inference machine)
- 24GB + 7B model: 2–3 people doing light agent work (small read scopes) is acceptable; latency climbs noticeably from the 4th user.
- 24GB + 14B model: Prefer only one heavy agent at a time; queue others or fall back to 7B.
- One-month observation: pilot team agent success rate (tests pass on first try) rose from ~55% to ~68%—mostly from 64K context reducing “half the files dropped” retries, not from the model getting smarter.
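The concurrency guidance above could be encoded as a simple admission rule in front of a shared inference host. This is a hypothetical sketch, not something the pilot ran: `HostState`, `decide`, and the two limits are illustrative names and values derived from the bullets above.

```python
from dataclasses import dataclass

MAX_SESSIONS_7B = 3    # "2-3 people doing light agent work is acceptable"
MAX_SESSIONS_14B = 1   # "only one heavy agent at a time"


@dataclass
class HostState:
    """Number of agent sessions currently running on the shared 24GB host."""
    sessions_7b: int = 0
    sessions_14b: int = 0


def decide(state: HostState, wants_14b: bool) -> str:
    """Return 'run-14b', 'run-7b', or 'queue' for a new agent session."""
    if wants_14b and state.sessions_14b < MAX_SESSIONS_14B:
        return "run-14b"
    if state.sessions_7b < MAX_SESSIONS_7B:
        # Also covers the fallback when the single 14B slot is taken.
        return "run-7b"
    return "queue"


print(decide(HostState(sessions_7b=0, sessions_14b=1), wants_14b=True))
print(decide(HostState(sessions_7b=3, sessions_14b=1), wants_14b=True))
```

The first call falls back to 7B because the 14B slot is busy; the second queues because both limits are reached. The limits should be tuned against observed latency on your own host.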
2. Why more developers route agents through Ollama instead of APIs
Claude Code is Anthropic’s terminal agent: search the tree, edit files, run bash, open PRs. By default it hits the cloud Claude API; heavy agent use in a week can burn through multiples of a subscription. Point the endpoint at Ollama and the same agent capabilities run on a local or LAN model—fixed cost (machine + power) instead of per-token pricing.
| Approach | Typical monthly cost feel | Data leaves network? | Best for |
|---|---|---|---|
| Claude Code (cloud only) | Subscription + overage API | Yes (unless enterprise private deploy) | Hard reasoning, long architecture chains |
| Claude Code + Ollama (local) | Hardware / cloud Mac rent | Can stay fully on LAN | Daily edits, batch refactors, sensitive repos |
| Hybrid: local-first + cloud fallback | Below cloud-only Max tier | As needed | Most engineering teams (recommended) |
3. Workflow architecture (diagrams)
[Diagram placeholder: claude (Claude Code)]
This setup stacks well with Agent Skills: Skills enforce “align before you code”; Claude Code executes; Ollama supplies “compute per call.”
4. Get running on an M4 Mac Mini in about 10 minutes
These steps are the same on a local or cloud M4 Mac Mini. We follow Ollama’s official Claude Code integration; on Apple Silicon, Homebrew install is recommended.
4.1 Install Ollama and pull a model
brew install ollama
ollama pull qwen2.5-coder:7b
# or: ollama pull glm-4.7-flash (size/speed tradeoff—check ollama.com for current tags)
4.2 Extend context to 64K+ (strongly recommended)
Claude Code as an agent repeatedly stuffs repo fragments into context. Too small a window causes truncation and retry loops—slower and more expensive in practice. If the default context is small, write a Modelfile:
cat > Modelfile <<'EOF'
FROM qwen2.5-coder:7b
PARAMETER num_ctx 65536
EOF
ollama create qwen2.5-coder-agent -f Modelfile
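A larger context window is not free: the key/value cache grows linearly with num_ctx. The sketch below estimates that cost. The architecture numbers (28 layers, 4 KV heads, head dimension 128) are assumptions for a Qwen2.5 7B-class model and are not from this article, so check the model card; the estimate also assumes a 16-bit cache.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the key and value caches for one sequence.

    Two tensors (K and V) per layer, each of shape
    context_tokens x num_kv_heads x head_dim.
    """
    return 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_value


# Assumed architecture for a Qwen2.5 7B-class model; verify against the model card.
size = kv_cache_bytes(num_layers=28, num_kv_heads=4, head_dim=128,
                      context_tokens=65536)
print(f"KV cache at 64K context: {size / 2**30:.1f} GiB")
```

Under these assumptions a full 64K window adds a few GiB on top of the model weights, which is one more reason to budget memory carefully on shared hosts.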
4.3 Connect Claude Code (two ways)
Option A (recommended): Ollama 0.14.5+ one-liner
ollama launch claude --model qwen2.5-coder-agent
Option B: Manual env vars (good for ~/.zshrc or project .claude/settings.json)
export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""
claude --model qwen2.5-coder-agent
For repo-only local routing, put those variables under the env key of .claude/settings.json in the project root, so other projects stay on cloud.
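The manual variables above can also be generated as a settings fragment. This is a minimal sketch, assuming Claude Code reads environment variables from an env object in its settings file; the helper name `local_routing_settings` is illustrative.

```python
import json


def local_routing_settings(base_url: str = "http://localhost:11434") -> dict:
    """Build a settings fragment that points Claude Code at an Ollama endpoint."""
    return {
        "env": {
            "ANTHROPIC_BASE_URL": base_url,
            "ANTHROPIC_AUTH_TOKEN": "ollama",
            "ANTHROPIC_API_KEY": "",
        }
    }


settings = local_routing_settings()
print(json.dumps(settings, indent=2))
```

Merge the printed object into the project's existing settings rather than overwriting the file, and pass a different base_url when the model runs on a team inference host.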
4.4 Acceptance checklist
- ollama ps shows the model loaded.
- Claude Code can read README and answer from the repo.
- Ask it to run npm test / pytest and confirm bash tools work.
- Watch memory: 16GB Macs with Xcode + 7B often swap; split inference from builds when possible.
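Before walking the checklist, it can help to confirm that the custom model actually exists on the host. The sketch below queries Ollama's /api/tags endpoint, which lists installed models; it checks installation, not whether the model is currently loaded. The helper names are illustrative, and the sample payload is trimmed to the single field used here.

```python
import json
import urllib.request


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of installed models from Ollama's /api/tags endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.load(response)


def has_model(tags: dict, wanted: str) -> bool:
    """True if `wanted` matches a model name, with or without an explicit tag."""
    names = [m.get("name", "") for m in tags.get("models", [])]
    if ":" in wanted:
        return wanted in names
    return any(name.split(":", 1)[0] == wanted for name in names)


# Sample payload in the shape /api/tags returns (trimmed for illustration).
sample = {"models": [{"name": "qwen2.5-coder-agent:latest"},
                     {"name": "qwen2.5-coder:7b"}]}
print(has_model(sample, "qwen2.5-coder-agent"))
print(has_model(sample, "qwen2.5-coder:14b"))
```

Against a live host, call `has_model(fetch_tags(), "qwen2.5-coder-agent")`; a False result usually means the ollama create step was skipped or run on a different machine.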
5. Task routing: what stays local vs what goes to the cloud
| Task type | Suggested engine | Why |
|---|---|---|
| Single-file completion, small refactors | Local Ollama | High frequency; occasional mistakes OK |
| Batch test generation, type-error fixes | Local Ollama | Repetitive; cloud API is poor value |
| Cross 10+ module architecture changes | Cloud Claude or larger local model | Needs stronger reasoning and long context |
| Security audit, compliance-sensitive code | Local Ollama | Data never leaves the network |
| CI unattended agent | Ollama on remote Mac | Always on, auditable |
Anti-pattern: don’t let a 7B local model own the full pipeline
If a weak model runs a long “requirements to production” agent alone, failed retries explode—total time often exceeds one strong cloud call. Hybrid strategy: local for drafts and mechanical work; cloud or a larger local model for decisions.
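The routing table and the hybrid strategy can be expressed as a small lookup with two overrides. This is a sketch only: the task labels, the `Engine` enum, and the default for unknown tasks are hypothetical choices, while the 10-module threshold and the "sensitive stays local" rule come from the table above.

```python
from enum import Enum


class Engine(Enum):
    LOCAL = "local Ollama"
    CLOUD = "cloud Claude or larger local model"
    REMOTE_MAC = "Ollama on remote Mac"


# Hypothetical task labels mirroring the rows of the routing table.
ROUTES = {
    "single_file_edit": Engine.LOCAL,
    "batch_tests": Engine.LOCAL,
    "architecture_change": Engine.CLOUD,
    "sensitive_code": Engine.LOCAL,
    "ci_unattended": Engine.REMOTE_MAC,
}


def route(task_type: str, modules_touched: int = 1, sensitive: bool = False) -> Engine:
    """Pick an engine; sensitive code stays local, wide changes escalate."""
    if sensitive:
        return Engine.LOCAL   # data must not leave the network
    if modules_touched >= 10:
        return Engine.CLOUD   # the "cross 10+ module" row
    return ROUTES.get(task_type, Engine.LOCAL)  # unknown tasks default to local


print(route("batch_tests").value)
print(route("single_file_edit", modules_touched=12).value)
```

Checking sensitivity first is deliberate: a compliance-sensitive change that spans many modules still stays on the local network, even though a weaker model may need more human review.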
6. Team setup: cloud Mac / dedicated M4 inference node
A personal MacBook is fine for experiments; once several people share an agent, you want an always-on, SSH-ready, high-memory macOS inference host. That is the sweet spot for the M4 Mac Mini: quiet, efficient, with unified memory that suits Ollama, and in the same ecosystem as iOS/macOS CI.
6.1 Recommended topology
- Inference box (1× M4 Mac Mini, 24GB+ recommended): ollama serve on 0.0.0.0:11434, which requires OLLAMA_HOST=0.0.0.0:11434 because Ollama binds to localhost by default (restrict via firewall/VLAN).
- Developer laptops: export ANTHROPIC_BASE_URL=http://<inference-host-LAN-IP>:11434, then run claude as usual.
- Optional CI Mac (second machine): run xcodebuild separately from inference to avoid memory contention; see enterprise Mac CI runners.
6.2 When Macstripe cloud Mac beats self-hosted hardware
If you lack a datacenter, or need APAC / US-West nodes, a stable public IP, or capacity you can lease by the day, run Ollama on a Macstripe dedicated physical M4 Mac Mini: SSH in, run the same brew install ollama, and expose 11434 to the team via Tailscale or VPN. Compared with buying hardware:
- No procurement, shipping, rack, or disposal.
- Short leases validate “whole team on local models” before a long buy.
- Aligns with private inference thinking: code and prompts stay inside your boundary.
Models, regions, and terms are on the Macstripe home page and pricing page. Macstripe does not host Ollama for you—it delivers macOS hardware and network to run it 24/7.
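# On a cloud Mac (example)
brew install ollama
# Bind to all interfaces so teammates can connect; Ollama listens on localhost only by default
OLLAMA_HOST=0.0.0.0:11434 ollama serve &
ollama pull qwen2.5-coder:14b
# On member laptops: ANTHROPIC_BASE_URL=http://<cloud Mac LAN or Tailscale IP>:11434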
7. Series plan: local AI agent topic cluster
“Claude Code + Ollama + Apple Silicon” works better as a series than a one-off—better topical authority for search and easier navigation. Planned Macstripe Developer Blog coverage (rolling out):
- Claude Code + MLX — peak tok/s and Python pipeline integration
- Claude Code + OpenRouter — multi-model routing and cost comparison
- Claude Code + Qwen3 / DeepSeek — Chinese and code-oriented model picks
- M4 Mac Mini inference ops — monitoring, queuing, Tailscale access
Already live: MLX vs Ollama, Agent Skills engineering discipline.
8. Anti-patterns and troubleshooting
- Forgetting to clear ANTHROPIC_API_KEY: Claude Code may still hit the cloud; local config looks “broken.”
- Context stuck at 8K: agent drops file chunks → endless retries; use Modelfile to reach 64K+.
- Model names with /: some backends choke; use Ollama short names like qwen2.5-coder-agent.
- Running everything on Windows locally: Claude Code + Ollama is more mature on macOS/Linux; use WSL2 or a remote Mac on Windows.
- Treating the agent as unsupervised production change: keep CI, code review, and human merge policy—see cross-week collaboration and memory.
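Several of the configuration mistakes above can be caught mechanically before launching the agent. The following is a hypothetical lint helper, not part of Claude Code or Ollama; it only inspects a dictionary of environment variables and a model name for the issues listed in this section.

```python
def lint_agent_env(env: dict, model: str) -> list:
    """Return warnings for the misconfigurations listed above."""
    warnings = []
    if env.get("ANTHROPIC_API_KEY"):
        warnings.append("ANTHROPIC_API_KEY is set; requests may still go to the cloud")
    if not env.get("ANTHROPIC_BASE_URL"):
        warnings.append("ANTHROPIC_BASE_URL is unset; Claude Code will use the cloud API")
    if "/" in model:
        warnings.append("model name contains '/'; prefer a short Ollama name")
    return warnings


good = {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": "",
}
bad = {"ANTHROPIC_API_KEY": "placeholder-key"}

print(lint_agent_env(good, "qwen2.5-coder-agent"))
for warning in lint_agent_env(bad, "org/model"):
    print("warning:", warning)
```

In a real shell session, pass `dict(os.environ)` as the first argument. Context size and Windows-specific issues are outside what this check can see.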
FAQ
How much can I save on API bills with a local AI agent on an M4 Mac Mini?
It depends how much work you keep local. In our 8-person pilot (“local-first + cloud fallback”), cloud API spend fell from ~$300/month to ~$50/month (~83%) after about a month. Solo usage swings more, but high-frequency mechanical agent work usually drops sharply.
Is Ollama on an M4 Mac Mini fast enough for daily agent work?
On 24GB, Qwen2.5-Coder 7B runs at ~25 tokens/s and 14B at ~15 tokens/s, which is fine for tests and single-module refactors. Full-repo architecture work still belongs on a strong cloud model.
Can Claude Code use Ollama directly?
Yes. Set ANTHROPIC_BASE_URL=http://localhost:11434 (or your team inference host), ANTHROPIC_AUTH_TOKEN=ollama, ANTHROPIC_API_KEY="", or use ollama launch claude --model <name>.
How large a context window does Claude Code need?
≥64K recommended. Safest path: PARAMETER num_ctx 65536 in a Modelfile, then ollama create a custom model.
Do I still need a Claude subscription?
Not for purely local use, which makes no cloud API calls. Keep a cloud plan for hard tasks; a hybrid setup is usually cheaper than Claude Max alone.
Is 16GB on an M4 Mac Mini enough?
Enough for 7B-class daily agents; 14B+ or 2+ concurrent users → start at 24GB.
How does a team share one Ollama instance?
Expose 11434 on the LAN or Tailscale and point everyone’s BASE_URL at it—or use a Macstripe cloud Mac / dedicated M4 as a 24/7 inference node.
How is this different from Cursor?
Claude Code is a terminal agent (SSH remote Mac, scripting); Cursor is an IDE. Both can coexist; this series will also compare MLX, OpenRouter, and other backends.
Conclusion
If you remember one thing: judge local AI agents on outcomes before config. On an M4 Mac Mini, Claude Code + Ollama keeps most daily agent work on your network; our pilot cut cloud API spend to roughly one-sixth (~$300 to ~$50 per month), and 7B speed is enough for routine edits. Ship with 64K context, task routing, and inference split from CI; hardware-wise, prefer 24GB unified memory on an M4 Mac Mini or a Macstripe always-on cloud node.
- Start with numbers: cost, speed, concurrency
- Validate locally: ollama launch claude --model …
- Scale the team: ollama serve on a dedicated M4 + LAN BASE_URL → Macstripe models and regions
- Follow the series: MLX / OpenRouter / Qwen3 combos (section 7)
Related reading
- M4 Mac Mini: 7B vs 14B — real-world gap
- MLX vs Ollama on Apple Silicon: Local AI Inference Compared
- mattpocock/skills and Claude Code Engineering Discipline
- Private Inference and Compute Sovereignty
- Cursor Keeps Forgetting — Why Long Context Won't Fix Cross-Week Work
- Enterprise Mac CI Resource Pool (2026)