Three GPUs, One Model: Qwen3.5-0.8B Across RTX 5090, GB10, and M4 Pro

Tags: analysis, qwen, rtx-5090, gb10, m4-pro, architecture-comparison

The 0.8B parameter model is the lightest in the Qwen3.5 family — small enough to fit on anything, fast enough to stress-test memory bandwidth. We ran it across three very different platforms to see how architecture shapes performance when the model itself isn't the bottleneck.

The Hardware

| Platform | Memory | Architecture | Power Envelope |
|---|---|---|---|
| NVIDIA GeForce RTX 5090 | 31.8 GB dedicated VRAM | Blackwell, CUDA 13.1 | ~450W |
| NVIDIA GB10 (Project DIGITS) | 119.6 GB shared | Grace Blackwell, aarch64 | ~100W |
| Apple M4 Pro | 64 GB unified | Metal 4, arm64 | ~30W |

Three different memory architectures, three different power budgets, three different design philosophies. Let's see what the numbers say.

Single-User Performance

For the interactive chat use case — one person, one conversation — here's how each platform handles 0.8B Q8_0:

| Metric | RTX 5090 | GB10 | M4 Pro |
|---|---|---|---|
| Throughput | 395 tok/s | 172 tok/s | 92 tok/s |
| Avg TTFT | 20 ms | 27 ms | 40 ms |
| Avg ITL | 2.5 ms | 5.7 ms | 10.7 ms |
| p99 TTFT | 31 ms | 50 ms | 127 ms |
| p99 ITL | 3.2 ms | 7.0 ms | 11.2 ms |

At single user, all three platforms are "fast enough" — a 10.7 ms ITL on the M4 Pro means tokens arrive faster than you can read them. But the RTX 5090 is running at 4.3× the throughput of the M4 Pro and 2.3× the GB10.

That 2.5 ms ITL on the RTX 5090 is essentially instantaneous. You'd need specialized equipment to perceive the difference between 2.5 ms and 5.7 ms.
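The metrics in the table are related: TTFT (time to first token) is the delay before the first token arrives, ITL (inter-token latency) is the gap between consecutive tokens, and steady-state decode throughput is roughly 1000 divided by the mean ITL in milliseconds. The sketch below shows one way to derive these from per-token arrival timestamps. It is an illustration only: the function name `latency_metrics` and the nearest-rank percentile are assumptions, and the article does not state how the benchmark itself computes its figures.

```python
import math


def latency_metrics(request_start, token_times):
    """Summarize one streamed response.

    request_start: time the request was sent, in seconds.
    token_times:   arrival time of each generated token, in seconds.
    Returns TTFT, mean ITL and p99 ITL in milliseconds, plus the decode
    rate in tok/s implied by the mean ITL.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")

    ttft_ms = (token_times[0] - request_start) * 1000.0

    # Gaps between consecutive tokens are the inter-token latencies.
    gaps_ms = [(b - a) * 1000.0 for a, b in zip(token_times, token_times[1:])]
    if not gaps_ms:
        return {"ttft_ms": ttft_ms, "avg_itl_ms": None,
                "p99_itl_ms": None, "decode_tok_s": None}

    avg_itl_ms = sum(gaps_ms) / len(gaps_ms)

    # Nearest-rank p99: smallest gap with at least 99% of gaps at or below it.
    ordered = sorted(gaps_ms)
    rank = math.ceil(0.99 * len(ordered))
    p99_itl_ms = ordered[rank - 1]

    return {"ttft_ms": ttft_ms, "avg_itl_ms": avg_itl_ms,
            "p99_itl_ms": p99_itl_ms, "decode_tok_s": 1000.0 / avg_itl_ms}


# Synthetic timestamps: first token after 20 ms, then one token every 2.5 ms.
times = [0.020 + 0.0025 * i for i in range(200)]
m = latency_metrics(0.0, times)
print(m)
```

Applied to the averages in the table, the same relation gives about 400 tok/s for a 2.5 ms ITL, 175 tok/s for 5.7 ms, and 93 tok/s for 10.7 ms, which lines up with the measured 395, 172, and 92 tok/s.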

Peak Throughput (Any Configuration)

When we remove the single-user constraint and let each platform find its best concurrency, context, and batch settings:

| Quant | RTX 5090 | GB10 | M4 Pro |
|---|---|---|---|
| BF16 | 730.0 tok/s | 269.9 tok/s | 124.0 tok/s |
| Q8_0 | 777.3 tok/s | 335.4 tok/s | 151.5 tok/s |
| Q4_1 | 808.5 tok/s | 393.0 tok/s | 144.8 tok/s |
| Q4_0 | 788.6 tok/s | 395.6 tok/s | 144.0 tok/s |
| IQ4_NL | 786.2 tok/s | 386.2 tok/s | 146.0 tok/s |
| Q3_K_S | 766.9 tok/s | 385.0 tok/s | 136.0 tok/s |
| Q6_K | 775.4 tok/s | 341.1 tok/s | 134.2 tok/s |

The RTX 5090 peaks at 808.5 tok/s with Q4_1. The GB10 reaches 395.6 tok/s with Q4_0. The M4 Pro maxes out at 151.5 tok/s with Q8_0.

A few things jump out:

  • The RTX 5090 is about 2× the GB10 and 5.3× the M4 Pro on peak throughput (808.5 vs 395.6 vs 151.5 tok/s). Dedicated VRAM bandwidth is king.
  • The GB10 benefits more from quantization than the other platforms. Its throughput jumps from 270 tok/s (BF16) to 396 tok/s (Q4_0) — a 47% improvement. The smaller quants reduce memory traffic on the shared bus.
  • The M4 Pro gains little from going below 8 bits. BF16 at 124 tok/s vs Q8_0 at 152 tok/s is a 22% spread, and the 4-bit quants (144–146 tok/s) are no faster than Q8_0. The unified memory bandwidth appears to be the fixed ceiling.
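The percentages in these bullets follow directly from the peak-throughput table. A minimal sketch of that arithmetic, with the table values copied in by hand (the names `PEAK_TOK_S`, `best_quant`, and `gain_over_bf16_pct` are illustrative, not part of the benchmark tooling):

```python
# Peak throughput in tok/s, copied from the table above.
PEAK_TOK_S = {
    "RTX 5090": {"BF16": 730.0, "Q8_0": 777.3, "Q4_1": 808.5, "Q4_0": 788.6,
                 "IQ4_NL": 786.2, "Q3_K_S": 766.9, "Q6_K": 775.4},
    "GB10": {"BF16": 269.9, "Q8_0": 335.4, "Q4_1": 393.0, "Q4_0": 395.6,
             "IQ4_NL": 386.2, "Q3_K_S": 385.0, "Q6_K": 341.1},
    "M4 Pro": {"BF16": 124.0, "Q8_0": 151.5, "Q4_1": 144.8, "Q4_0": 144.0,
               "IQ4_NL": 146.0, "Q3_K_S": 136.0, "Q6_K": 134.2},
}


def best_quant(results):
    """Return the quant name with the highest throughput."""
    return max(results, key=results.get)


def gain_over_bf16_pct(results):
    """Percentage gain of the fastest quant over the BF16 baseline."""
    return (max(results.values()) / results["BF16"] - 1.0) * 100.0


for platform, results in PEAK_TOK_S.items():
    quant = best_quant(results)
    print(f"{platform}: best={quant} ({results[quant]} tok/s), "
          f"+{gain_over_bf16_pct(results):.1f}% over BF16")
```

Run against the table, this gives a gain of roughly 11% for the RTX 5090, 47% for the GB10, and 22% for the M4 Pro.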

Concurrency Scaling

This is where the architectures truly diverge. Here's how throughput and TTFT scale as we add concurrent users (0.8B, best quant, 8k context):

Throughput Under Load

| Concurrent Users | RTX 5090 (tok/s) | GB10 (tok/s) | M4 Pro (tok/s) |
|---|---|---|---|
| 1 | ~395 | ~172 | ~93 |
| 4 | ~750+ | ~386+ | ~146 |
| 8 | ~780+ | ~360 | ~144 |
| 32 | ~700 | ~270 | ~152 |

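The RTX 5090 scales well from 1 to 4 users, nearly doubling throughput (~395 to ~750 tok/s), and holds that level through 8 users. The GB10 more than doubles over the same range (~172 to ~386 tok/s) but declines beyond 4 users, falling to ~270 tok/s at 32. The M4 Pro reaches its ceiling of roughly 145–150 tok/s by 4 users and stays flat.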
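Aggregate throughput hides what each individual user sees. As a rough sketch, the table's approximate values (with the "~" and "+" markers dropped, so treat the results as ballpark figures) can be turned into a speedup over the single-user case and an average per-user rate. The names `LOAD_TOK_S` and `scaling_table` are illustrative.

```python
# Approximate aggregate throughput (tok/s) by concurrent users, from the table above.
LOAD_TOK_S = {
    "RTX 5090": {1: 395, 4: 750, 8: 780, 32: 700},
    "GB10": {1: 172, 4: 386, 8: 360, 32: 270},
    "M4 Pro": {1: 93, 4: 146, 8: 144, 32: 152},
}


def scaling_table(by_users):
    """For each concurrency level, report speedup vs. 1 user and per-user tok/s."""
    base = by_users[1]
    rows = []
    for users in sorted(by_users):
        total = by_users[users]
        rows.append({
            "users": users,
            "speedup": total / base,          # aggregate gain over one user
            "per_user_tok_s": total / users,  # average share each user receives
        })
    return rows


for platform, by_users in LOAD_TOK_S.items():
    for row in scaling_table(by_users):
        print(f"{platform:9s} users={row['users']:2d} "
              f"speedup={row['speedup']:.2f}x "
              f"per-user={row['per_user_tok_s']:.1f} tok/s")
```

At 32 users the average share per user works out to roughly 22 tok/s on the RTX 5090, 8 tok/s on the GB10, and under 5 tok/s on the M4 Pro.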

TTFT Under Load

Here's the real differentiator — how long users wait for the first token:

| Concurrent Users | RTX 5090 TTFT | GB10 TTFT | M4 Pro TTFT |
|---|---|---|---|
| 1 | 20 ms | 27 ms | 40 ms |
| 4 | ~115 ms | ~125 ms | ~117 ms |
| 8 | ~1,300 ms | ~2,600 ms | ~6,900 ms |
| 32 | ~5,200 ms | ~24,000 ms | ~42,500 ms |

Up to 4 users the three platforms are within about 10 ms of each other; the gap opens at 8 users and widens from there. At 32 users, the M4 Pro's TTFT is about 42 seconds. That's not a typo. The GB10 isn't great either at 24 seconds. The RTX 5090 keeps it at roughly 5 seconds, which is still noticeable but at least in the range where a loading spinner feels reasonable.
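One practical way to read the TTFT table is to pick a latency budget and ask how many concurrent users each platform can carry before exceeding it. The sketch below does that for the tested concurrency levels only (1, 4, 8, 32); it does not interpolate between them. The helper name `max_users_within` and the example budgets are assumptions for illustration.

```python
# Approximate TTFT in ms by concurrent users, from the table above.
TTFT_MS = {
    "RTX 5090": {1: 20, 4: 115, 8: 1300, 32: 5200},
    "GB10": {1: 27, 4: 125, 8: 2600, 32: 24000},
    "M4 Pro": {1: 40, 4: 117, 8: 6900, 32: 42500},
}


def max_users_within(ttft_by_users, budget_ms):
    """Largest tested concurrency whose TTFT stays within budget_ms.

    Walks the tested levels in ascending order and stops at the first one
    that exceeds the budget. Returns 0 if even one user is over budget.
    """
    supported = 0
    for users in sorted(ttft_by_users):
        if ttft_by_users[users] > budget_ms:
            break
        supported = users
    return supported


for budget_ms in (200, 2000):
    summary = {p: max_users_within(t, budget_ms) for p, t in TTFT_MS.items()}
    print(f"TTFT budget {budget_ms} ms -> {summary}")
```

With a 200 ms budget all three platforms top out at 4 users; relaxing the budget to 2 seconds lets only the RTX 5090 reach 8.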

Context Length Scaling

For a model this small, context length barely matters on any of the three platforms.

At single user with 0.8B BF16, throughput is essentially flat from 8k to 130k context on the RTX 5090 (393–399 tok/s). Same story on the GB10 (~171 tok/s) and M4 Pro (~92 tok/s). The KV cache for 0.8B is small enough that even 130k tokens fit comfortably in all three memory pools.

This changes with larger models, but for 0.8B, you can set n_ctx=131072 on any platform without meaningful throughput penalty.
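To get a feel for why a long context fits easily, the KV cache of a standard transformer with grouped-query attention can be estimated as 2 (keys and values) × layers × KV heads × head dimension × bytes per element × context length. The sketch below applies that textbook formula. The layer count, head count, and head dimension are placeholders chosen for illustration, not the actual Qwen3.5-0.8B configuration, and models that use non-standard attention layers or a quantized KV cache will differ.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache size for a standard GQA transformer.

    The leading factor 2 accounts for storing one key and one value vector
    per layer, per KV head, per token. bytes_per_elem=2 assumes a 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx


# Hypothetical small-model shape (NOT the real Qwen3.5-0.8B values).
example = kv_cache_bytes(n_ctx=131072, n_layers=24, n_kv_heads=2, head_dim=128)
print(f"{example / 2**30:.2f} GiB at n_ctx=131072")
```

With these placeholder values the cache comes to a few GiB, well inside even the smallest of the three memory pools. Substituting the real architecture parameters from the model's config would give the actual figure.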

The Efficiency Angle

Raw throughput doesn't account for the fact that these platforms draw very different amounts of power:

| Platform | Peak 0.8B tok/s | Approx. Power Draw | Rough tok/s per Watt |
|---|---|---|---|
| RTX 5090 | 808 | ~450W (system) | ~1.8 |
| GB10 | 396 | ~100W (system) | ~4.0 |
| M4 Pro | 152 | ~30W (package) | ~5.1 |

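The M4 Pro delivers about 5.1 tok/s per watt, nearly 3× the RTX 5090's efficiency. If you're running inference 24/7 on a small model and paying for electricity, that math matters. The GB10 sits in the middle at 4.0 tok/s per watt, which is impressive given its much larger memory pool. Treat the comparison as rough: the M4 Pro figure is package power while the other two are system power, so the M4 Pro's advantage at the wall is likely smaller.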
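The efficiency column is simply peak throughput divided by the approximate power draw, and the same inputs give energy per token. A minimal sketch using the table's figures, which are nominal power envelopes rather than measured draw (the names `PLATFORMS` and `efficiency` are illustrative):

```python
# Peak tok/s and approximate power draw (W), from the table above.
PLATFORMS = {
    "RTX 5090": {"tok_s": 808, "watts": 450},
    "GB10": {"tok_s": 396, "watts": 100},
    "M4 Pro": {"tok_s": 152, "watts": 30},
}


def efficiency(tok_s, watts):
    """Return (tok/s per watt, joules per token, kWh per million tokens)."""
    tok_per_watt = tok_s / watts      # equivalent to tokens per joule
    joules_per_token = watts / tok_s
    # 1 kWh = 3.6 million joules.
    kwh_per_million = joules_per_token * 1_000_000 / 3_600_000
    return tok_per_watt, joules_per_token, kwh_per_million


for name, p in PLATFORMS.items():
    tpw, jpt, kwh = efficiency(p["tok_s"], p["watts"])
    print(f"{name:9s} {tpw:.1f} tok/s/W  {jpt:.2f} J/token  "
          f"{kwh:.3f} kWh per 1M tokens")
```

Multiplying the kWh figure by a local electricity price gives an approximate energy cost per million generated tokens at peak load.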

Recommendations

Go with the RTX 5090 if: You need to serve multiple concurrent users, you prioritize TTFT under load, or you plan to scale up to larger models later. The 5090 has throughput headroom for 9B+ models that the M4 Pro can't match.

Go with the GB10 if: You want to run 27B models (or larger) without quantization headaches. The 119.6 GB of shared memory is a unique advantage. The throughput is moderate (396 tok/s peak for 0.8B), but it handles models that would not fit in the RTX 5090's 31.8 GB of VRAM.

Go with the M4 Pro if: You're running single-user inference on a laptop or desktop and want a silent, energy-efficient experience. 92 tok/s is plenty fast for one person, and you're not paying for a power-hungry GPU when you step away from the keyboard.

For 0.8B specifically, all three platforms are overkill. The model is so small that the real question is whether you need concurrency — and if you do, the RTX 5090 is the only serious option.


All data from Poor Paul's Benchmark running llama-server. Explore the full dataset on the Leaderboard or chart your own comparisons on the Explore page.