Qwen3.6-35B-A3B Across Five Machines: The MoE Architecture Test
Qwen3.6-35B-A3B is a Mixture-of-Experts model: 35B total parameters, but only ~3B are active on any given token. On paper this means you get 35B-equivalent quality while paying the compute cost of a 3B model. In practice, it means MoE inference has a different performance profile than dense models — and it behaves differently depending on your hardware.
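To make "only ~3B are active on any given token" concrete, here is a minimal sketch of top-k expert routing, the usual mechanism behind MoE layers. The function name `route_token`, the 8-expert layer, and `top_k=2` are illustrative assumptions, not Qwen3.6-35B-A3B's actual configuration.

```python
import math

def route_token(router_logits, top_k):
    """Return (expert_index, weight) pairs for the experts chosen for one token."""
    # Keep only the top_k experts with the highest router scores.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i],
                    reverse=True)[:top_k]
    # Softmax over the chosen experts only, so their weights sum to 1.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Hypothetical layer: 8 experts, 2 active per token (not the real model's numbers).
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.9]
active = route_token(logits, top_k=2)
active_experts = [index for index, _ in active]
```

Only the chosen experts' weights are read and multiplied for that token; the remaining experts stay in memory but are not touched, which is why total and active parameter counts differ so widely.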
We ran it on all five machines in our benchmark fleet. Here's what we found.
Hardware Setup
| Machine | GPU | VRAM | Special Characteristic |
|---|---|---|---|
| GB10 | Grace Blackwell Superchip | 120 GB (unified) | LPDDR5X unified memory |
| Mac mini M4 Pro | Apple M4 Pro | 64 GB (unified) | Shared CPU/GPU bandwidth |
| RTX 5060 Ti | GeForce RTX 5060 Ti | 16 GB | GDDR7 bandwidth |
| RTX 4060 Ti ×2 | 2× GeForce RTX 4060 Ti | 32 GB (2× 16 GB) | Dual-GPU tensor split |
| RTX 5090 | GeForce RTX 5090 | 32 GB | GDDR7 bandwidth |
Single-User Throughput
At 1 concurrent user with a 2,048-token context, the GB10 posts the highest throughput and the lowest time to first token (TTFT):
| Machine | Quant | tok/s (1 user) | TTFT |
|---|---|---|---|
| GB10 | MXFP4_MOE | 61.9 | 88 ms |
| GB10 | Q8_0 | 53.6 | 103 ms |
| RTX 5090 | Q4_K_M | ~48–55* | ~120 ms |
| RTX 4060 Ti ×2 | Q4_K_M | ~35–45* | ~200 ms |
| Mac mini M4 Pro | Q4_K_M | ~28–35* | ~350 ms |
*Estimated from fleet averages — these machines were benchmarked on different dates with slightly different configurations.
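The GB10 leads on single-user throughput despite using a "mere" MXFP4 quantization. The likely reason is memory behavior: the active experts' weights are fetched on every token, so decode speed depends on how quickly memory can deliver roughly 3B parameters' worth of weights rather than on raw compute. The GB10's unified LPDDR5X pool holds the whole model and its cache in one address space, with no multi-GPU split in the path. Because the other rows are estimates rather than same-day measurements, treat the margin as indicative.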
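The bandwidth argument can be sketched as a back-of-envelope ceiling: if every generated token requires reading the active weights once, decode speed cannot exceed memory bandwidth divided by the bytes read per token. The function name `decode_ceiling_tok_s`, the 250 GB/s bandwidth, and the flat 4 bits per weight are assumptions chosen for illustration. They are not measurements from this fleet, and the bound ignores KV cache reads, shared layers, and compute.

```python
def decode_ceiling_tok_s(active_params_b, bits_per_weight, bandwidth_gb_s):
    """Upper bound on single-stream decode speed when weight reads are the only cost."""
    # Bytes that must be read from memory to produce one token.
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Hypothetical device with 250 GB/s of memory bandwidth (illustrative value only).
moe_ceiling = decode_ceiling_tok_s(active_params_b=3.0, bits_per_weight=4.0,
                                   bandwidth_gb_s=250.0)
dense_ceiling = decode_ceiling_tok_s(active_params_b=35.0, bits_per_weight=4.0,
                                     bandwidth_gb_s=250.0)

print(f"MoE (3B active): {moe_ceiling:.1f} tok/s ceiling")
print(f"Dense 35B:       {dense_ceiling:.1f} tok/s ceiling")
```

Under these assumptions the MoE ceiling is 35/3, or about 11.7 times, the dense ceiling on the same device. Measured throughput will sit below the ceiling, but the ratio shows why active parameter count matters more than total size for decode speed.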
Multi-User Scaling
At 32 concurrent users, the picture changes:
| Machine | Quant | tok/s (32 users) | Avg TTFT (32u) |
|---|---|---|---|
| GB10 | MXFP4_MOE | 172.7 | 1,148 ms |
| GB10 | Q8_0 | 135.4 | 1,278 ms |
| RTX 5090 | Q4_K_M | ~120* | ~900 ms |
The GB10 still leads on throughput, but the TTFT degradation is real. With 32 concurrent users submitting 2,048-token prompts, its first-token latency climbs from 88 ms to 1,148 ms with MXFP4_MOE, roughly a 13× increase. That is uncomfortable for interactive use but acceptable for batch processing.
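The scaling behavior can be worked out directly from the two GB10 MXFP4_MOE rows. This sketch assumes the tok/s figures are aggregate throughput across all users, which the tables do not state explicitly; the variable names are ours.

```python
# Measured GB10 MXFP4_MOE figures from the tables above.
tok_s_1_user = 61.9
tok_s_32_users = 172.7
ttft_ms_1_user = 88
ttft_ms_32_users = 1148
users = 32

# Assumption: tok/s is the aggregate across all concurrent users.
speedup = tok_s_32_users / tok_s_1_user          # total throughput gain
per_user_tok_s = tok_s_32_users / users          # share each user sees
scaling_efficiency = speedup / users             # 1.0 would be perfect scaling
ttft_ratio = ttft_ms_32_users / ttft_ms_1_user   # first-token latency growth

print(f"Aggregate speedup:   {speedup:.2f}x")
print(f"Per-user throughput: {per_user_tok_s:.1f} tok/s")
print(f"Scaling efficiency:  {scaling_efficiency:.1%}")
print(f"TTFT growth:         {ttft_ratio:.1f}x")
```

If the aggregate reading is right, going from 1 to 32 users raises total throughput about 2.8× while each user's share drops to roughly 5.4 tok/s, which matches the "fine for batch, uncomfortable for chat" conclusion.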
The MoE Memory Paradox
Here's the counterintuitive finding: Qwen3.6-35B-A3B fits easily in 16 GB on the RTX 5060 Ti at Q4_K_M quantization (~8 GB loaded), and the 5060 Ti delivers competitive single-user throughput. MoE models are VRAM-efficient in ways that dense models aren't.
But when you push to 8+ concurrent users, the KV (attention) cache fills the remaining VRAM quickly, and you start seeing OOM errors or context truncation. The GB10's 120 GB pool, by contrast, can hold 32 concurrent users at a ~130,000-token context and still leave room for the OS.
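The cache pressure described above follows from the standard KV cache size formula: two tensors (keys and values) per layer, per KV head, per token, per user. The layer count, KV head count, and head dimension below are hypothetical placeholders, not Qwen3.6-35B-A3B's published configuration; substitute the values from the model's config file before relying on the result.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, users,
                 bytes_per_value=2):
    """Approximate KV cache size in GiB for a standard attention stack."""
    # Keys and values (factor of 2) stored for every layer and KV head.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens * users / 2**30

# Hypothetical architecture values, with a 16-bit cache (2 bytes per value).
layers, kv_heads, head_dim = 48, 4, 128

cache_8_users = kv_cache_gib(layers, kv_heads, head_dim,
                             context_tokens=2048, users=8)
cache_32_users = kv_cache_gib(layers, kv_heads, head_dim,
                              context_tokens=2048, users=32)

print(f"8 users  x 2,048 tokens: {cache_8_users:.1f} GiB")
print(f"32 users x 2,048 tokens: {cache_32_users:.1f} GiB")
```

The cache grows linearly with both users and context length, so on a 16 GB card the headroom left after loading weights is consumed quickly as either one increases.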
128K Context: Only One Machine Can Run It
One test we ran that most machines couldn't complete: Qwen3.6-35B-A3B-MXFP4_MOE at 130,064-token context with 32 concurrent users.
Only the GB10 managed this. The result: 168 tok/s at that context length, essentially flat compared to 8K context and within about 3% of the 172.7 tok/s measured at 32 users with 2,048-token prompts. That's the benefit of the GB10's massive unified memory pool.
The Verdict
MoE models like Qwen3.6-35B-A3B reward machines with high memory bandwidth and large memory pools more than dense models do. The GB10's 120 GB of unified LPDDR5X memory makes it uniquely suited for this workload in our fleet. But for single-user homelab use, even the RTX 5060 Ti's 16 GB is sufficient if you keep context windows modest.
If you're choosing hardware specifically for MoE inference, prioritize memory bandwidth and pool size over raw CUDA core count.