Qwen3.6-35B-A3B Across Five Machines: The MoE Architecture Test

Qwen3.6-35B-A3B is a Mixture-of-Experts (MoE) model: 35B total parameters, of which only ~3B are active for any given token. On paper, that gives you the capacity of a 35B model at roughly the per-token compute cost of a 3B model. In practice, all 35B parameters still have to sit in memory, so MoE inference has a different performance profile from dense models, and that profile shifts with your hardware.
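To put rough numbers on the "3B of compute" claim, here is a minimal sketch. It assumes the common rule of thumb of about 2 FLOPs (one multiply, one add) per active weight per token for a forward pass, and it ignores attention over the context. The function name and constants are ours, for illustration only.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per token: ~2 FLOPs per active weight."""
    return 2.0 * active_params


TOTAL_PARAMS = 35e9   # "35B" from the model name
ACTIVE_PARAMS = 3e9   # "~3B active" (approximate)

moe_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)  # a dense model of the same total size
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"active fraction per token: {active_fraction:.1%}")
print(f"MoE forward pass:   ~{moe_flops / 1e9:.0f} GFLOPs/token")
print(f"dense forward pass: ~{dense_flops / 1e9:.0f} GFLOPs/token")
```

The compute saving applies per token only; the memory footprint is still set by the total parameter count.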

We ran it on all five machines in our benchmark fleet. Here's what we found.

Hardware Setup

Machine	GPU	VRAM	Special Characteristic
GB10	Grace Blackwell Superchip	120 GB (unified)	LPDDR5X bandwidth, shared CPU/GPU pool
Mac mini M4 Pro	Apple M4 Pro	64 GB (unified)	Shared CPU/GPU bandwidth
RTX 5060 Ti	GeForce RTX 5060 Ti	16 GB	GDDR7 bandwidth
RTX 4060 Ti ×2	2× GeForce RTX 4060 Ti	32 GB	Dual-GPU tensor split
RTX 5090	GeForce RTX 5090	32 GB	GDDR7 bandwidth

Single-User Throughput

At 1 concurrent user with a 2,048-token context, the GB10 posts the highest throughput:

Machine	Quant	tok/s (1 user)	TTFT
GB10	MXFP4_MOE	61.9	88 ms
GB10	Q8_0	53.6	103 ms
RTX 5090	Q4_K_M	~48–55*	~120 ms
RTX 4060 Ti ×2	Q4_K_M	~35–45*	~200 ms
Mac mini M4 Pro	Q4_K_M	~28–35*	~350 ms

*Estimated from fleet averages rather than measured in this run. These machines were benchmarked on different dates with slightly different configurations, so treat cross-machine comparisons as approximate.

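The GB10 leads on single-user throughput, and the 4-bit MXFP4 quantization helps rather than hurts. Token generation is limited by memory traffic: the active experts' weights are fetched for every token. With only ~3B active parameters, the GB10's unified LPDDR5X memory has far less data to move per token than it would for a dense 35B model, and MXFP4 moves fewer bytes per weight than Q8_0 (61.9 vs. 53.6 tok/s). The figures for the other machines are estimates, so treat the ranking as provisional.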
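A back-of-the-envelope way to reason about this is a bandwidth ceiling: if every generated token has to read the active weights once, throughput cannot exceed bandwidth divided by bytes read per token. The sketch below assumes ~3B active parameters and about 4.25 bits per weight for an MXFP4-style format (4-bit values plus a shared scale per block). The bandwidth values are placeholders, not measurements from our fleet, so substitute your own machine's spec.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params: float,
                         bits_per_weight: float) -> float:
    """Upper bound on single-user decode speed if weight reads were the only cost."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


ACTIVE_PARAMS = 3e9      # "~3B active" (approximate)
BITS_PER_WEIGHT = 4.25   # assumed effective size of a 4-bit block format

# Placeholder bandwidths in GB/s; replace with your hardware's published figure.
for bandwidth in (300, 1000):
    ceiling = decode_ceiling_tok_s(bandwidth, ACTIVE_PARAMS, BITS_PER_WEIGHT)
    print(f"{bandwidth:>5} GB/s -> at most ~{ceiling:.0f} tok/s")
```

Real throughput lands well below this ceiling, because the estimate ignores KV cache reads, shared (non-expert) layers, kernel launch overhead, and imperfect bandwidth utilization.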

Multi-User Scaling

At 32 concurrent users, the picture changes:

Machine	Quant	tok/s (32 users)	Avg TTFT (32u)
GB10	MXFP4_MOE	172.7	1,148 ms
GB10	Q8_0	135.4	1,278 ms
RTX 5090	Q4_K_M	~120*	~900 ms

The GB10 still leads, but the TTFT degradation is real. With 32 concurrent users each sending 2,048-token prompts, average first-token latency climbs past one second (1,148 ms at MXFP4_MOE, up from 88 ms for a single user). That is uncomfortable for interactive use but acceptable for batch processing.
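The two GB10 rows can be turned into scaling figures with a few lines of arithmetic. This sketch assumes the 32-user tok/s values are aggregate throughput across all users, which the tables do not state explicitly.

```python
# Measured GB10 figures from the tables above (tok/s).
single_user = {"MXFP4_MOE": 61.9, "Q8_0": 53.6}
at_32_users = {"MXFP4_MOE": 172.7, "Q8_0": 135.4}
USERS = 32

results = {}
for quant, one_user_rate in single_user.items():
    aggregate = at_32_users[quant]
    results[quant] = {
        # How much total throughput grows when going from 1 to 32 users.
        "aggregate_gain": aggregate / one_user_rate,
        # What each user sees if the aggregate is shared evenly.
        "per_user": aggregate / USERS,
    }

for quant, r in results.items():
    print(f"{quant}: {r['aggregate_gain']:.2f}x aggregate, "
          f"{r['per_user']:.1f} tok/s per user")
```

Under that assumption, total throughput grows by less than 3x while each user's share drops to roughly 4 to 5 tok/s, which matches the "fine for batch, slow for chat" reading.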

The MoE Memory Paradox

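Here's the counterintuitive finding: Qwen3.6-35B-A3B runs on the 16 GB RTX 5060 Ti at Q4_K_M quantization with only ~8 GB loaded on the GPU, and the 5060 Ti delivers competitive single-user throughput. A full Q4_K_M copy of 35B parameters is roughly 20 GB, so that footprint presumably reflects only part of the model being resident in VRAM, with the remaining expert weights held in system RAM. Because only ~3B parameters are touched per token, MoE models tolerate this kind of split far better than dense models do.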

But when you push to 8 or more concurrent users, the KV (attention) cache fills the remaining VRAM quickly, and you start seeing OOM errors or context truncation. The GB10's 120 GB means you can run 32 concurrent users with a 130,064-token context and still have room for the OS.
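To see why concurrency eats memory, here is a minimal KV cache estimate. It assumes standard attention in every layer, a 16-bit cache, and a separate cache per user. The layer count, KV head count, and head dimension are hypothetical placeholders, not Qwen3.6-35B-A3B's actual configuration, so read them from the model's config before relying on the numbers.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, users: int,
                   bytes_per_element: int = 2) -> int:
    """KV cache size: one K and one V vector per layer, per token, per user."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens * users


# Hypothetical model shape, for illustration only.
HYPOTHETICAL_CONFIG = dict(n_layers=40, n_kv_heads=4, head_dim=128)

for users in (1, 8, 32):
    size = kv_cache_bytes(context_tokens=2048, users=users, **HYPOTHETICAL_CONFIG)
    print(f"{users:>2} users x 2,048 tokens: {size / 2**30:.2f} GiB")
```

The cache grows linearly in both users and context length, so doubling either one doubles the memory needed on top of the weights. Models that use attention in only some layers, or servers that quantize the cache, will come in lower than this estimate.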

128K Context: Only One Machine Completes It

One test that the other machines couldn't complete: Qwen3.6-35B-A3B-MXFP4_MOE at 130,064-token context with 32 concurrent users.

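Only the GB10 managed this. The result was 168 tok/s at that context length, essentially flat compared to 8K context and within 3% of the 172.7 tok/s measured at 2,048 tokens. The deciding factor is the GB10's 120 GB unified memory pool, which leaves room for a KV cache that the smaller cards cannot hold.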

The Verdict

MoE models like Qwen3.6-35B-A3B reward machines with high memory bandwidth and large memory pools more than dense models do. The GB10's 120 GB of unified LPDDR5X memory makes it well suited to this workload, especially at high concurrency and long context. But for single-user homelab use, even the RTX 5060 Ti's 16 GB is sufficient if you keep context windows modest.

If you're choosing hardware specifically for MoE inference, prioritize memory bandwidth and pool size over raw CUDA core count.