First Benchmark Roundup: Qwen3.5-0.8B on RTX 5090 vs M4 Pro

Our initial dataset includes 104 benchmark runs of Qwen3.5-0.8B-Q8_0 using llama-server across two very different systems:

Hardware	VRAM / Memory	Runs
NVIDIA GeForce RTX 5090	31.8 GB	56
Apple M4 Pro	64 GB (unified)	48

Key Findings

Throughput

The RTX 5090 dominates on raw throughput, hitting 768.85 tok/s at its peak configuration (130k context, 8 concurrent users). The M4 Pro tops out at 151.49 tok/s (8k context, 32 concurrent users).

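That's roughly a 5× throughput advantage for the RTX 5090 (768.85 / 151.49 ≈ 5.1), though the two peaks come from different context and concurrency settings, so this is not a like-for-like comparison. The gap is not surprising given the card's CUDA cores and dedicated VRAM bandwidth, but it is impressive nonetheless for a consumer GPU.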
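The ratio can be reproduced from the two peak figures quoted above. This is only a minimal sketch of the arithmetic; the variable and function names are illustrative and are not part of the benchmark tooling.

```python
# Peak aggregate throughput quoted in this post, in tokens per second.
# The two peaks were measured at different settings, so the ratio is indicative only.
rtx_5090_peak_tps = 768.85  # 130k context, 8 concurrent users
m4_pro_peak_tps = 151.49    # 8k context, 32 concurrent users


def speedup(fast: float, slow: float) -> float:
    """Return how many times larger `fast` is than `slow`."""
    return fast / slow


ratio = speedup(rtx_5090_peak_tps, m4_pro_peak_tps)
print(f"RTX 5090 vs M4 Pro peak throughput: {ratio:.1f}x")
```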

Context Length

One of the most interesting dimensions in the dataset is how throughput scales with context size. The RTX 5090 handles contexts up to 130k tokens and, in this dataset, actually posts its peak throughput (768.85 tok/s) at that size. The M4 Pro was tested at smaller context sizes and peaked at 8k, which is typical for unified memory architectures that need to share bandwidth between CPU and GPU workloads.

Latency

Time-to-first-token (TTFT) tells a different story. At high concurrency, the M4 Pro's average TTFT climbs to 42 seconds, a reminder that throughput and latency can diverge dramatically under load: a configuration can maximize aggregate tokens per second while each individual request waits a long time for its first token. The RTX 5090 keeps TTFT around 1.2 seconds at 8 concurrent users and 130k context.

Inter-token latency (ITL) is excellent on both platforms, with the RTX 5090 averaging 4.98 ms and the M4 Pro at 26.2 ms.
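For readers less familiar with these two metrics, here is a rough sketch of how TTFT and ITL can be derived from per-token arrival times, and how an ITL figure translates into a per-stream decode rate. The function names and sample timestamps are hypothetical, and the benchmark tool may aggregate its averages differently.

```python
from statistics import mean


def ttft_seconds(request_sent: float, token_times: list[float]) -> float:
    """Time from sending the request to receiving the first token."""
    return token_times[0] - request_sent


def mean_itl_ms(token_times: list[float]) -> float:
    """Average gap between consecutive tokens, in milliseconds."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure ITL")
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return mean(gaps) * 1000.0


def per_stream_tps(itl_ms: float) -> float:
    """Decode rate of a single stream implied by a given ITL."""
    return 1000.0 / itl_ms


# Hypothetical arrival times (seconds) for one streamed response.
request_sent = 0.0
token_times = [1.200, 1.205, 1.210, 1.215]
print(f"TTFT: {ttft_seconds(request_sent, token_times):.2f} s")
print(f"ITL:  {mean_itl_ms(token_times):.2f} ms")

# Average ITL figures quoted in this post.
for name, itl in (("RTX 5090", 4.98), ("M4 Pro", 26.2)):
    print(f"{name}: {per_stream_tps(itl):.1f} tok/s per stream")
```

Because the ITL values are averages across runs, the per-stream rates they imply are indicative only and should not be multiplied by a user count to predict aggregate throughput.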

Looking Ahead

This is just one model at one quantization level. As more community members contribute benchmarks for different models, quant levels, and hardware, we'll publish deeper analysis. Head to the Leaderboard to see the latest results, or check out the Explore page to chart the data yourself.

Want to add your hardware? Run Poor Paul's Benchmark and your results will automatically appear in the dataset.