OpenClaw Gateway with Ollama and vLLM on intranet, timeouts and doctor status acceptance

Teams routing agents through OpenClaw often park models on private LANs instead of public APIs. Ollama speaks OpenAI-compatible paths next to your desk; vLLM fronts pooled GPUs with batched scheduling and strict queue semantics. The Gateway still presents /v1/models and /v1/chat/completions to clients, but acceptance must prove three layers line up: normalized upstream URLs, end-to-end idle timeouts that survive slow first tokens, and concurrency slices that match GPU RAM and KV cache pressure rather than laptop optimism. This tutorial documents a reproducible ladder you can paste into runbooks: map hosts and TLS once, reproduce stall signatures with controlled prompts, cap parallel streams, then sign off with openclaw doctor and openclaw gateway status --require-rpc. For the client-facing HTTP smoke matrix and Bearer alignment, extend the companion walkthrough on OpenClaw Gateway OpenAI-compatible HTTP API, chat completions and models versus CLI channels. When the Gateway runs as an always-on daemon on macOS, cross-check ports and duplicate LaunchAgents using OpenClaw Gateway launchd stability: doctor, status, and logs cross-checklist so recurring restarts do not masquerade as model faults.

1. Endpoint mapping: Ollama versus vLLM behind one Gateway profile

Pick a single canonical upstream base per environment, such as http://ollama.internal:11434 or https://vllm-prod.lan, and mirror it exactly inside openclaw.json (scheme, port, trailing-slash rules). Ollama typically exposes chat under the same OpenAI shim your IDE already understands; vLLM often terminates TLS on an ingress while workers stay unroutable, so the Gateway must trust the same CA bundle your curl tests use. Run GET /v1/models through the Gateway first: verify names match what agents request and that LAN DNS resolves identically from the Gateway host and from a bastion hop. If the models listing succeeds only on loopback but fails through the advertised hostname, fix bind addresses before touching GPU drivers.
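Normalization disagreements hide in exactly those details: scheme case, implicit default ports, trailing slashes. A minimal sketch of the comparison, reusing the hostnames above; the function canonical_base is illustrative, not an OpenClaw API:

```python
from urllib.parse import urlsplit

def canonical_base(url: str) -> str:
    """Normalize an upstream base URL: lowercase scheme and host,
    make the default port explicit, strip the trailing slash."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port or (443 if scheme == "https" else 80)
    return f"{scheme}://{parts.hostname}:{port}{parts.path.rstrip('/')}"

# The Gateway config and your curl ladder should agree after normalization:
assert canonical_base("http://ollama.internal:11434/") == \
       canonical_base("HTTP://Ollama.Internal:11434")
```

Running the same comparison against the value actually stored in openclaw.json turns "scheme, port, trailing slash rules" from a review note into a mechanical check.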

Rule: capture one redacted curl transcript for models and one for streaming chat per upstream; attachments beat screenshots.

2. Timeout ladder: clients, reverse proxies, and slow first-byte stalls

Local inference frequently violates SaaS latency assumptions: prefill can pause for seconds while KV caches warm. Instrument four timers in parallel: SDK read timeout, Gateway upstream deadline, proxy idle timeout, and OS TCP keepalive. Reproduce with a deterministic prompt that forces a measurable gap before token one, logging wall-clock at connection accepted, first upstream byte, and first chunk forwarded. If direct Gateway curls succeed while IDE clients stall, the fault sits in the client layer, so raise SDK read timeouts before blaming the network; if curls stall only through nginx or Envoy, raise proxy read timeouts and disable buffering on streaming locations. When timeouts correlate with swap on the Gateway host, memory contention is masquerading as network failure: move heavy jobs before tuning TCP.
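To reason about which layer fires first, compare the observed pre-first-token gap against each timer on the ladder. A small illustrative helper; the layer names and second values here are hypothetical, not OpenClaw or proxy configuration keys:

```python
def tripped_layers(first_token_gap_s: float, timeouts_s: dict) -> list:
    """Return the names of layers whose idle timeout is shorter than the
    observed gap before the first streamed byte; those fire first."""
    return sorted(name for name, limit in timeouts_s.items()
                  if limit < first_token_gap_s)

# Hypothetical ladder: a 45 s prefill stall against typical defaults.
ladder = {"sdk_read": 30.0, "gateway_upstream": 120.0,
          "proxy_idle": 60.0, "tcp_keepalive": 75.0}
print(tripped_layers(45.0, ladder))  # → ['sdk_read']
```

The point of the exercise: every layer whose limit sits below your measured worst-case gap will produce the same client-visible stall, so fix them from the shortest timeout upward.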

3. Concurrency slicing for streaming sessions

Treat concurrent stream: true sessions like a leaky bucket: each holds buffers in the Gateway process plus KV RAM in Ollama or vLLM. Declare a fleet-wide maximum for parallel streams per Gateway identity and enforce it at the agent scheduler, not only inside the model server. For vLLM, align Gateway concurrency with server-side max_num_seqs and tensor-parallel width so you never oversubscribe waiting queues that shed requests silently. For Ollama on Apple Silicon or discrete GPUs, pair concurrency caps with quantized model variants when operators insist on interactive latency during batch peaks.

  • Measure p95 time-to-first-token under your capped concurrency, not best-case idle runs.
  • Track RSS for Gateway and upstream on the same graph during soak tests.
  • Document a rollback path: toggle routing to a smaller model without editing secrets.

4. doctor and openclaw gateway status --require-rpc acceptance

Run openclaw doctor after any Node, plugin, or config edit โ€” capture stdout to your change ticket. Follow with openclaw gateway status --require-rpc so RPC dependencies used by HTTP handlers prove healthy, not merely local listeners. Compare reported upstream hosts against the values your curl ladder hits; a mismatched management plane hostname is a classic source of green dashboards with failing agents. Gate releases on both commands returning clean status plus the streaming curl transcript from section two.
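The two-command gate can be wrapped so CI fails closed. This sketch assumes only that openclaw exits nonzero on failure; the injectable run parameter exists purely to make the wrapper testable without a live Gateway:

```python
import subprocess

def release_gate(run=subprocess.run) -> bool:
    """Release only when both acceptance commands exit cleanly.
    Assumes openclaw returns a nonzero exit code on any failure."""
    checks = (
        ["openclaw", "doctor"],
        ["openclaw", "gateway", "status", "--require-rpc"],
    )
    for cmd in checks:
        if run(cmd, capture_output=True).returncode != 0:
            return False
    return True
```

Attach the captured stdout from both commands, plus the streaming curl transcript from section two, to the change ticket that this gate protects.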

5. Remote high-memory Mac spillover for long-context inference

When agents ingest large repositories or transcripts, context windows spike KV footprint beyond what a shared GPU shelf tolerates mid-sprint. Park those sessions on a dedicated remote Mac with high unified memory, still on your tailnet, and keep the Gateway routing table explicit: short-context traffic may stay on the LAN GPU cluster while overflow jobs target the Mac upstream profile. Snapshot doctor before and after routing changes, and extend timeouts on that lane only; mixing spillover rules into default pools recreates flaky stalls. This pattern mirrors segregating compile-heavy CI from interactive shells: same tooling, different capacity contract.
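An explicit routing table keeps the spillover rule out of the default pool. In this sketch the upstream URLs, the tailnet hostname, the token threshold, and the timeout values are all placeholders for your deployment:

```python
# Illustrative upstreams; substitute your real LAN and tailnet hosts.
LAN_GPU = "https://vllm-prod.lan"
SPILLOVER_MAC = "http://mac-spill.tailnet:11434"

def route(context_tokens: int, short_context_max: int = 32_000) -> str:
    """Send long-context jobs to the high-memory Mac lane; everything
    else stays on the LAN GPU cluster. The rule is explicit, not inferred."""
    return LAN_GPU if context_tokens <= short_context_max else SPILLOVER_MAC

def lane_timeout_s(upstream: str) -> int:
    """Extend timeouts only on the spillover lane (values illustrative)."""
    return 600 if upstream == SPILLOVER_MAC else 120
```

Because the timeout extension hangs off the lane rather than the pool, short-context traffic keeps its strict deadline and a misrouted job fails fast instead of stalling quietly.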

Why Apple Silicon Mac mini fits intranet Gateway plus spillover lanes

Unified memory on Apple Silicon lets a quietly cooled desktop-class Mac absorb bursty agent contexts without the PCIe shuffle common to discrete GPU workstations, while macOS supplies predictable power curves for always-on listeners managed through launchd. Gatekeeper, SIP, and FileVault stack neatly with zero-trust tailnets already wrapping most OpenClaw deployments, and idle draw stays low enough that a spillover node can remain online without rack noise budgets. When you need a high-memory anchor reachable over SSH or Tailscale without babysitting drivers, that combination routinely beats ad-hoc Linux towers on operational drag.

If you want this intranet Gateway layout (capped streams, honest timeouts, and a clean doctor trail) running on hardware that tolerates 24/7 duty cycles, review the Macstripe home page for dedicated cloud Mac capacity that matches your regions and RAM envelope; Mac mini M4 remains an approachable baseline before you scale spillover tiers.