Mac mini M4 Pro: 309 tok/s on 2B Models, and the Limits of Unified Memory
The Mac mini M4 Pro sits in an unusual position in the local LLM market: it can be configured with up to 64 GB of unified memory, more than any consumer GPU offers, at a price (~$1,400 for the base M4 Pro with 24 GB, more for 64 GB) that's accessible to serious homelabbers. And it runs silently.
We collected 6,854 benchmark rows on the Mac mini M4 Pro. Here's an honest accounting of where it excels and where it struggles.
The Peak Numbers
On small models, the Mac mini is genuinely fast:
| Model | Quant | Users | Peak tok/s | Avg TTFT |
|---|---|---|---|---|
| Qwen3.5-2B | UD-IQ2_XXS | 32 | 309.0 | 597 ms |
| Qwen3.5-2B | UD-IQ2_XXS | 32 | 306.8 | 780 ms |
| Qwen3.5-0.8B | Q8_0 | 2 | 124.4 | 63 ms |
| Qwen3.5-0.8B | Q8_0 | 1 | 93.9 | 48 ms |
309 tok/s for a 2B model at 32 concurrent users is strong — comparable to a mid-range NVIDIA GPU. But notice the quantization: UD-IQ2_XXS is an extreme 2-bit quant. At 2 bits per weight, you're sacrificing significant quality for throughput. The Mac mini reaches those numbers by running a heavily quantized model that fits trivially in its memory.
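To see why such a model "fits trivially", note that weight memory scales roughly as parameter count times bits per weight. The sketch below is a back-of-the-envelope approximation, not a measurement: it ignores tensors kept at higher precision and runtime buffers, the function name `weight_footprint_gb` is hypothetical, and the parameter count is the nominal 2B from the model name. The ~2 bits figure comes from the paragraph above; 8.5 bits per weight for Q8_0 reflects its blocks of 32 int8 weights plus one 16-bit scale.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal parameter count taken from the model name (an assumption;
# real checkpoints rarely have exactly 2.0e9 parameters).
N_PARAMS = 2e9

two_bit = weight_footprint_gb(N_PARAMS, 2.0)    # ~2 bits per weight, as described above
eight_bit = weight_footprint_gb(N_PARAMS, 8.5)  # Q8_0: 32 int8 weights + one fp16 scale

print(f"2B at ~2 bits/weight: {two_bit:.2f} GB")
print(f"2B at Q8_0:           {eight_bit:.2f} GB")
```

Either way the weights occupy only a small fraction of a 64 GB pool. The 2-bit quant takes up about a quarter of the space of Q8_0, so far fewer bytes have to be read per generated token.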
At quality quantizations (Q4_K_M or better), the Mac mini's throughput is more modest:
| Model | Quant | Users | tok/s |
|---|---|---|---|
| Qwen3.5-0.8B | Q4_K_M | 4 | ~150 |
| Qwen3.5-9B | Q4_K_M | 4 | ~45 |
| Qwen3.5-27B | Q4_K_M | 2 | ~28 |
Why Shared Bandwidth Limits Inference Speed
The M4 Pro has a unified memory architecture: the CPU and GPU share the same physical memory pool (up to 64 GB) and the same memory bandwidth (273 GB/s of LPDDR5X, per Apple's published specification).
For LLM inference, the GPU needs high bandwidth for two things simultaneously:
- Loading model weights for each token generation
- Accessing the KV-cache for attention
On a discrete GPU (RTX 5090: ~1.8 TB/s), all of the memory bandwidth is dedicated to the GPU. On the M4 Pro, the far smaller memory bandwidth is also shared with the CPU (which is running macOS and background work), leaving effective GPU bandwidth at perhaps 150–180 GB/s in practice.
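The result: because token generation is largely limited by how fast weights can be read from memory, the Mac mini generates tokens at roughly 10–15% of what an RTX 5090 achieves for the same model and quantization, despite having twice the memory (64 GB versus 32 GB).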
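The bandwidth argument can be turned into a rough ceiling: if every generated token requires reading all the weights once, single-stream decode speed cannot exceed bandwidth divided by model size. The following is a minimal sketch under that assumption. It ignores KV-cache reads, compute limits, and batching, and `decode_ceiling_tok_s` is a hypothetical helper name. The bandwidth figures are the effective 150–180 GB/s estimate and the 1.8 TB/s RTX 5090 figure quoted above; the ~28 GB model size is the article's figure for Qwen3.5-27B at Q8_0.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream tok/s if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb


WEIGHTS_GB = 28.0  # ~28 GB: the article's figure for Qwen3.5-27B at Q8_0

mac_low = decode_ceiling_tok_s(150.0, WEIGHTS_GB)   # pessimistic effective bandwidth
mac_high = decode_ceiling_tok_s(180.0, WEIGHTS_GB)  # optimistic effective bandwidth
rtx_5090 = decode_ceiling_tok_s(1800.0, WEIGHTS_GB)

# The model size cancels out, so the ratio is just the bandwidth ratio.
ratio_low = mac_low / rtx_5090
ratio_high = mac_high / rtx_5090

print(f"Mac mini ceiling: {mac_low:.1f}-{mac_high:.1f} tok/s")
print(f"RTX 5090 ceiling: {rtx_5090:.1f} tok/s")
print(f"Mac / RTX ratio:  {ratio_low:.0%}-{ratio_high:.0%}")
```

Under these assumptions the ratio comes out at or just below the low end of the 10–15% range. Measured ratios will differ because real workloads also depend on compute, batching, and KV-cache traffic.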
Where the Mac mini Wins Decisively
Large models at quality quants (single user): Qwen3.5-27B-Q8_0 requires ~28 GB. The Mac mini (64 GB) runs it comfortably. An RTX 5090 (32 GB) runs it with minimal KV-cache headroom, and an RTX 5060 Ti (16 GB) can't run it at all.
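The fit comparison above is simple subtraction, sketched here for clarity. The capacities and the ~28 GB weight size are the figures quoted in the text; `headroom_gb` is a hypothetical helper. The sketch also assumes the full memory pool is available to the model, which overstates the Mac mini's headroom somewhat since macOS and other applications share the same 64 GB.

```python
WEIGHTS_GB = 28.0  # ~28 GB: the article's figure for Qwen3.5-27B at Q8_0

# Memory capacities as quoted in the text.
devices_gb = {
    "Mac mini M4 Pro": 64.0,
    "RTX 5090": 32.0,
    "RTX 5060 Ti": 16.0,
}


def headroom_gb(capacity_gb: float, weights_gb: float = WEIGHTS_GB) -> float:
    """Memory left for KV-cache and buffers after loading weights (negative = no fit)."""
    return capacity_gb - weights_gb


headroom = {name: headroom_gb(cap) for name, cap in devices_gb.items()}

for name, free in headroom.items():
    status = "fits" if free > 0 else "does not fit"
    print(f"{name}: {free:+.0f} GB headroom ({status})")
```

Whatever remains after the weights is what the KV-cache has to live in, which is why a few gigabytes of headroom translates into short contexts or few concurrent sessions.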
Silence and power: The Mac mini draws ~25–35 W at inference load. An RTX 5090 system draws 450 W or more. If you're running inference 24/7, that is more than a tenfold difference in energy use, and the electricity cost adds up.
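As a rough illustration of the 24/7 cost gap, the arithmetic can be sketched as follows. The wattages are the article's figures (30 W as the midpoint of the Mac mini range, 450 W as the lower bound for the RTX 5090 system). The electricity price of $0.15/kWh is purely a placeholder assumption, and the helper names are hypothetical; substitute your own tariff.

```python
HOURS_PER_YEAR = 24 * 365

# Placeholder electricity price in $/kWh (an assumption, not from the article).
PRICE_PER_KWH = 0.15


def annual_kwh(watts: float) -> float:
    """Energy used in a year of continuous operation at a constant draw."""
    return watts * HOURS_PER_YEAR / 1000


def annual_cost(watts: float, price_per_kwh: float = PRICE_PER_KWH) -> float:
    """Yearly electricity cost in dollars at the given price."""
    return annual_kwh(watts) * price_per_kwh


mac_kwh = annual_kwh(30)    # midpoint of the ~25-35 W range
gpu_kwh = annual_kwh(450)   # lower bound for the RTX 5090 system

print(f"Mac mini: {mac_kwh:.0f} kWh/yr, ${annual_cost(30):.2f}")
print(f"RTX 5090: {gpu_kwh:.0f} kWh/yr, ${annual_cost(450):.2f}")
print(f"Difference: ${annual_cost(450) - annual_cost(30):.2f} per year")
```

This assumes both machines sit at inference load around the clock. Real deployments idle part of the time, which narrows the gap in absolute terms.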
macOS compatibility: Everything works out of the box via Metal. No CUDA driver management, no VRAM fragmentation issues, no Windows/Linux decisions.
The Honest Verdict
The Mac mini M4 Pro is the right choice if:
- You want a silent, low-power inference machine
- Your primary use case is single-user interaction with 27B+ models at quality quants
- macOS is your environment of choice
- The price premium over a dedicated GPU machine doesn't bother you
It's the wrong choice if:
- You need to serve multiple concurrent users at reasonable latency
- Throughput per dollar is the primary metric
- You're running models ≤9B (a $350 RTX 5060 Ti is 3–5× faster)
The 64 GB memory pool is genuinely useful for large models. The bandwidth limitation is real and not solvable by software. Accept both and the Mac mini M4 Pro is an excellent local AI machine for a specific use profile.