Mac mini M4 Pro: 309 tok/s on 2B Models, and the Limits of Unified Memory

The Mac mini M4 Pro sits in an unusual position in the local LLM market: it can be configured with up to 64 GB of unified memory, more than any consumer GPU offers, at a price (~$1,400 for the base M4 Pro with 24 GB, more for 64 GB) that's accessible to serious homelabbers. And it runs silently.

We collected 6,854 benchmark rows on the Mac mini M4 Pro. Here's an honest accounting of where it excels and where it struggles.

The Peak Numbers

On small models, the Mac mini is genuinely fast:

Model	Quant	Users	Peak tok/s	Avg TTFT
Qwen3.5-2B	UD-IQ2_XXS	32	309.0	597 ms
Qwen3.5-2B	UD-IQ2_XXS	32	306.8	780 ms
Qwen3.5-0.8B	Q8_0	2	124.4	63 ms
Qwen3.5-0.8B	Q8_0	1	93.9	48 ms

309 tok/s for a 2B model at 32 concurrent users is strong — comparable to a mid-range NVIDIA GPU. But notice the quantization: UD-IQ2_XXS is an extreme 2-bit quant. At 2 bits per weight, you're sacrificing significant quality for throughput. The Mac mini reaches those numbers by running a heavily quantized model that fits trivially in its memory.
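To put "fits trivially" in numbers, here is a rough sketch of how the weight footprint scales with bits per weight. The function name and the nominal bits-per-weight values are illustrative assumptions; real quantized files mix tensor types and carry metadata, so actual sizes run somewhat higher.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Nominal bits-per-weight values, assumed for illustration only.
for label, bpw in [("2-bit", 2.0), ("4-bit", 4.0), ("8-bit", 8.0)]:
    print(f"2B model at {label}: ~{weight_footprint_gb(2, bpw):.1f} GB")
```

Under these assumptions a 2B model at 2 bits per weight needs about 0.5 GB for its weights, a small fraction of even the 24 GB base configuration.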

At quality quantizations (Q4_K_M or better), the Mac mini's throughput is more modest:

Model	Quant	Users	tok/s
Qwen3.5-0.8B	Q4_K_M	4	~150
Qwen3.5-9B	Q4_K_M	4	~45
Qwen3.5-27B	Q4_K_M	2	~28

Why Shared Bandwidth Limits Inference Speed

The M4 Pro has a unified memory architecture: the CPU and GPU share the same physical memory pool (up to 64 GB) and the same memory bandwidth (273 GB/s of LPDDR5X, per Apple's specifications).

For LLM inference, the GPU needs high bandwidth for two things simultaneously:

Streaming the model weights from memory for every generated token
Reading the KV-cache for attention

On a discrete GPU (RTX 5090: 1.8 TB/s), all of that bandwidth is dedicated to the GPU. On the M4 Pro, the GPU shares its memory bandwidth with the CPU, which is running macOS and background work at the same time, leaving effective GPU bandwidth at perhaps 150–180 GB/s in practice.

The result: the Mac mini generates tokens at roughly 10–15% of the rate an RTX 5090 achieves for the same model and quantization, despite having twice the memory.
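The bandwidth argument can be sketched with a simple model: if decoding is memory-bound and every generated token streams all weights once, the speed ceiling is bandwidth divided by model size. This is a simplification that ignores KV-cache reads, compute limits, and batching, and the function names and the sweep values are assumptions chosen for illustration.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound on single-stream decode speed, assuming every
    generated token streams all weights from memory exactly once."""
    return bandwidth_gb_s / weights_gb


def relative_ceiling(bandwidth_a: float, bandwidth_b: float, weights_gb: float = 1.0) -> float:
    """Ratio of two decode ceilings for the same model; the model size cancels."""
    return decode_ceiling_tok_s(bandwidth_a, weights_gb) / decode_ceiling_tok_s(bandwidth_b, weights_gb)


RTX_5090_GB_S = 1800.0  # 1.8 TB/s, the figure quoted in the text

# Hypothetical sweep of effective bandwidth; 150 and 180 are the
# estimates from the text, the higher values are illustrative.
for bw in (150.0, 180.0, 225.0, 270.0):
    ratio = relative_ceiling(bw, RTX_5090_GB_S)
    print(f"{bw:.0f} GB/s -> {ratio:.1%} of the RTX 5090 ceiling")
```

In this model, a 10–15% ratio corresponds to roughly 180–270 GB/s of effective bandwidth against 1.8 TB/s, so the gap is set by the hardware rather than by the model being run.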

Where the Mac mini Wins Decisively

Large models at quality quants (single user): Qwen3.5-27B-Q8_0 requires ~28 GB. The Mac mini (64 GB) runs it comfortably. An RTX 5090 (32 GB) runs it with minimal KV-cache headroom, and an RTX 5060 Ti (16 GB) can't run it at all.
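A quick way to check the fit claims above is to subtract the weight footprint from each device's memory capacity. The sketch below uses only the figures quoted in the text; the helper name is hypothetical, and the remaining headroom still has to cover the KV-cache, activations, and (on the Mac) macOS itself.

```python
WEIGHTS_GB = 28.0  # Qwen3.5-27B at Q8_0, per the text

# Memory capacities in GB, as quoted in the text.
devices = {"Mac mini M4 Pro": 64.0, "RTX 5090": 32.0, "RTX 5060 Ti": 16.0}


def headroom_gb(capacity_gb: float, weights_gb: float) -> float:
    """Memory left over after loading the weights; negative means no fit."""
    return capacity_gb - weights_gb


for name, capacity in devices.items():
    spare = headroom_gb(capacity, WEIGHTS_GB)
    status = "does not fit" if spare < 0 else f"{spare:.0f} GB of headroom"
    print(f"{name}: {status}")
```

With these numbers the Mac mini keeps 36 GB free, the RTX 5090 keeps 4 GB, and the RTX 5060 Ti falls 12 GB short.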

Silence and power: The Mac mini draws ~25–35W at inference load. An RTX 5090 system draws 450W+. If you're running inference 24/7, the electricity cost difference is meaningful.
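To estimate what the power gap costs over a year of 24/7 operation, a minimal sketch follows. The electricity price is a hypothetical placeholder, and the calculation assumes a constant draw at the wattages quoted above, which overstates usage for machines that sit idle part of the time.

```python
HOURS_PER_YEAR = 24 * 365


def annual_kwh(watts: float) -> float:
    """Energy used in a year of continuous operation at a constant draw."""
    return watts * HOURS_PER_YEAR / 1000


def annual_cost(watts: float, price_per_kwh: float) -> float:
    """Yearly electricity cost at a flat price per kWh."""
    return annual_kwh(watts) * price_per_kwh


PRICE_PER_KWH = 0.15  # hypothetical electricity price; substitute your own

for name, watts in [("Mac mini M4 Pro", 35.0), ("RTX 5090 system", 450.0)]:
    kwh = annual_kwh(watts)
    cost = annual_cost(watts, PRICE_PER_KWH)
    print(f"{name}: {kwh:.0f} kWh/yr, about {cost:.0f} per year at the assumed rate")
```

At a constant 35 W the Mac mini uses about 307 kWh per year, against about 3,942 kWh for a system drawing 450 W.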

macOS compatibility: Everything works out of the box via Metal. No CUDA driver management, no VRAM fragmentation issues, no Windows/Linux decisions.

The Honest Verdict

The Mac mini M4 Pro is the right choice if:

You want a silent, low-power inference machine
Your primary use case is single-user interaction with 27B+ models at quality quants
macOS is your environment of choice
The price premium over a dedicated GPU machine doesn't bother you

It's the wrong choice if:

You need to serve multiple concurrent users at reasonable latency
Throughput per dollar is the primary metric
You're running models ≤9B (a $350 RTX 5060 Ti is 3–5× faster)

The 64 GB memory pool is genuinely useful for large models. The bandwidth limitation is real and not solvable by software. Accept both and the Mac mini M4 Pro is an excellent local AI machine for a specific use profile.