RTX 5090 Quantization Guide: Every Qwen3.5 Variant Benchmarked

Quantization is one of those topics where conventional wisdom ("just use Q4_K_M") meets reality and gets complicated. We ran every Unsloth Qwen3.5 GGUF variant on an RTX 5090 — 22 quantization formats across 5 model sizes — to map the actual performance landscape.

The question isn't just "which quant is fastest?" It's "which quant gives me the best balance of speed and quality at my model size, my context length, and my concurrency level?"

The Full Picture: 0.8B

The 0.8B model is so small that it fits entirely in the RTX 5090's 31.8 GB VRAM at any quantization, including BF16. This makes it the perfect test of "does quant format even matter when memory isn't a constraint?"

Standard Quants

Quant	Peak tok/s	Avg ITL (ms)	Best Config
Q4_1	808.5	4.5	ctx=32768, cu=4
Q4_0	788.6	4.7	ctx=130064, cu=4
IQ4_NL	786.2	4.8	ctx=8192, cu=8
Q8_0	777.3	4.8	ctx=16384, cu=8
Q6_K	775.4	4.6	ctx=130064, cu=4
IQ4_XS	774.5	4.8	ctx=16384, cu=8
Q3_K_S	766.9	4.8	ctx=8192, cu=4
Q3_K_M	752.6	4.9	ctx=8192, cu=4
BF16	730.0	5.0	ctx=130064, cu=4
Q5_K_M	732.2	5.1	ctx=16384, cu=4
Q4_K_S	723.5	5.0	ctx=8192, cu=4
Q4_K_M	712.4	5.4	ctx=8192, cu=8
Q5_K_S	708.9	5.4	ctx=16384, cu=8

Unsloth Dynamic (UD) Quants

Quant	Peak tok/s	Avg ITL (ms)	Best Config
Q2_K_XL	775.1	4.7	ctx=130064, cu=8
IQ2_XXS	773.7	4.8	ctx=16384, cu=8
IQ2_M	746.2	5.0	ctx=130064, cu=8
Q8_K_XL	741.1	5.0	ctx=8192, cu=4
Q6_K_XL	739.8	5.1	ctx=130064, cu=8
Q5_K_XL	737.3	5.0	ctx=130064, cu=4
IQ3_XXS	713.5	5.2	ctx=8192, cu=4
Q3_K_XL	713.4	5.2	ctx=16384, cu=8
Q4_K_XL	694.2	5.4	ctx=130064, cu=8

The 0.8B Verdict

The spread from fastest (808.5 tok/s) to slowest (694.2 tok/s) is only 14%. At 0.8B, the model is so small that quantization format barely matters for throughput. BF16 is actually not the fastest — Q4_1 beats it by 10.7% because smaller weights mean less memory traffic even on a GPU with abundant bandwidth.

Recommendation for 0.8B: Use Q8_0 or Q6_K if you want near-lossless quality with no meaningful speed penalty. Pick Q4_1 if you want to squeeze every last tok/s out of the hardware. Don't use BF16 — it's the slowest option for no quality benefit over Q8_0 in llama.cpp serving.

The Inflection Point: 2B

At 2B, the quant landscape starts to differentiate more.

Top Performers

Quant	Peak tok/s	Best Config
IQ4_NL	759.5	ctx=65551, cu=4
Q4_0	744.2	ctx=32768, cu=8
Q4_1	724.2	ctx=130064, cu=16
Q2_K_XL-UD	714.0	ctx=65551, cu=4
IQ4_XS	690.1	ctx=65551, cu=8
Q6_K	690.9	ctx=16384, cu=4
Q3_K_S	686.7	ctx=65551, cu=4
IQ2_XXS-UD	679.8	ctx=16384, cu=8
Q8_0	678.0	ctx=130064, cu=8
Q3_K_M	657.6	ctx=32768, cu=4
IQ2_M-UD	664.4	ctx=65551, cu=8
BF16	518.8	ctx=16384, cu=32

The BF16 Tax

BF16 drops to 518.8 tok/s — a 32% penalty vs IQ4_NL. At 2B, the model size is large enough that full-precision weights consume meaningfully more bandwidth. This is the first model size where the "just use BF16 for quality" approach has a real cost.

Recommendation for 2B: IQ4_NL is the sweet spot — top throughput with the imatrix-optimized quantization preserving quality better than raw Q4 formats. Q8_0 at 678 tok/s is still fast if you want higher quality. Avoid BF16 unless you specifically need full precision.

The Middle Ground: 4B

4B doubles the 2B parameter count, and on the RTX 5090, the quant format starts to have larger throughput effects.

Standard Quants

Quant	Peak tok/s	Best Config
Q4_0	446.6	ctx=32768, cu=8
Q4_1	437.7	ctx=8192, cu=8
IQ4_NL	434.5	ctx=8192, cu=8
Q3_K_S	417.4	ctx=8192, cu=4
Q6_K	403.6	ctx=130064, cu=8
Q8_0	393.8	ctx=32768, cu=16
Q3_K_M	386.1	ctx=32768, cu=32
Q4_K_M	360.2	ctx=32768, cu=8
Q5_K_M	358.4	ctx=16384, cu=4
Q4_K_S	354.3	ctx=32768, cu=16
Q5_K_S	347.9	ctx=8192, cu=32
BF16	313.2	ctx=130064, cu=16

UD Quants

Quant	Peak tok/s	Best Config
IQ2_XXS	428.7	ctx=8192, cu=8
Q2_K_XL	427.3	ctx=16384, cu=8
IQ2_M	407.6	ctx=130064, cu=8
IQ3_XXS	400.4	ctx=32768, cu=8
Q3_K_XL	384.3	ctx=16384, cu=16
Q6_K_XL	373.0	ctx=32768, cu=32
Q5_K_XL	363.1	ctx=8192, cu=16
Q4_K_XL	356.3	ctx=8192, cu=8
Q8_K_XL	349.1	ctx=32768, cu=16

The K-Quant Overhead

An unexpected pattern at 4B: the K-quant variants (Q4_K_M, Q4_K_S, Q5_K_M, Q5_K_S) are consistently slower than their simpler counterparts (Q4_0, Q4_1). Q4_0 at 446.6 tok/s beats Q4_K_M at 360.2 tok/s by 24%.

This happens because K-quants use per-block scaling factors that add computational overhead. On the RTX 5090, which has massive compute throughput, this overhead matters more than the slightly better memory efficiency of K-quants. The simple Q4_0/Q4_1 formats let the GPU dequantize faster.

Recommendation for 4B: Q4_0 for maximum throughput, IQ4_NL if you want imatrix quality optimization, or Q6_K/Q8_0 if you're willing to trade 10–15% throughput for less quantization noise. The K-quant overhead is real at this model size.

The Challenge: 9B

At 9B, the model is large enough that quantization choices have dramatic throughput implications, and concurrency becomes a major factor.

Peak Throughput (32 Concurrent Users)

Quant	Peak tok/s	Avg TTFT	Avg ITL
IQ2_XXS-UD	557.7	424 ms	54.1 ms
Q3_K_XL-UD	555.9	260 ms	54.7 ms
Q4_K_XL-UD	555.2	196 ms	27.4 ms†
IQ3_XXS-UD	553.6	303 ms	54.7 ms
IQ4_XS	553.5	267 ms	55.4 ms
Q4_0	552.5	251 ms	55.8 ms
Q4_1	551.2	286 ms	55.4 ms
IQ2_M-UD	548.3	665 ms	54.6 ms
Q3_K_M	546.8	263 ms	55.9 ms
Q5_K_XL-UD	552.1	332 ms	54.6 ms
Q4_K_M	536.3	265 ms	56.9 ms
Q5_K_M	532.5	356 ms	57.7 ms
Q8_0	529.6	256 ms	58.0 ms
BF16	482.0	377 ms	62.8 ms

†At 16 concurrent users, not 32.

The Quality-Speed Frontier

At 9B, the throughput range from best (557.7 tok/s) to BF16 (482.0 tok/s) is a 16% spread. That's narrower than you might expect — the RTX 5090's VRAM bandwidth is generous enough that even BF16 at 9B doesn't create a severe bottleneck.

But look at the ITL column: all the quants cluster between 54–58 ms at 32 concurrent users, while BF16 is at 62.8 ms. Per-user latency is where the full-precision penalty shows up.

Single-User Performance vs Multi-User

The 9B data reveals a fascinating concurrency effect. At single user:

BF16: ~130 tok/s
Q4_0: ~150 tok/s
Best quant: ~170 tok/s

At 32 users, throughput jumps to 480–558 tok/s because the GPU can process multiple requests in parallel, batching attention computation. But this comes at the cost of ITL (2.4 ms single-user → 55 ms at 32 users) and TTFT.

For interactive single-user chat, quant choice barely matters at 9B. The throughput difference between BF16 and the best quant is < 2× at single user, and all options produce acceptable ITL.

For serving multiple users, quant choice matters a lot. The delta between BF16 and IQ2_XXS-UD is 76 tok/s of aggregate throughput at 32 users.

The Wall: 27B

27B is where the RTX 5090's 31.8 GB VRAM becomes the defining constraint.

What Fits — And What Doesn't

Quant	Fits in VRAM?	Peak tok/s	Notes
IQ2_XXS-UD	Yes	149.0	Aggressive, quality concerns
IQ2_M-UD	Yes	139.3
IQ3_XXS-UD	Yes	216.6
Q2_K_XL-UD	Yes	211.4
Q3_K_S	Yes	211.8
Q3_K_M	Yes	210.8
Q3_K_XL-UD	Yes	212.4
IQ4_XS	Yes	218.0	Sweet spot
IQ4_NL	Yes	217.4
Q4_0	Yes	214.5
Q4_1	Yes	212.5
Q4_K_M	Yes	208.6
Q4_K_XL-UD	Yes	217.2
Q4_K_S	Yes	205.8
Q5_K_M	Yes (barely)	197.7
Q5_K_S	Yes (barely)	196.1
Q5_K_XL-UD	Yes (barely)	212.7
Q6_K	Borderline	186.1	Edge of VRAM
Q6_K_XL-UD	Spills	55.3	75% throughput loss
Q8_0	Spills	45.9	79% throughput loss
Q8_K_XL-UD	Spills	22.9	90% throughput loss

The VRAM Cliff

The transition from "fits in VRAM" to "spills to system RAM" is catastrophic:

Q6_K (186.1 tok/s) → Q6_K_XL (55.3 tok/s): 70% throughput loss
Q8_0 at 45.9 tok/s and Q8_K_XL at 22.9 tok/s are worse than running 27B on the GB10

This is the most important finding for 27B on the RTX 5090: never run a quantization level that spills VRAM. The performance penalty is so severe that you're better off using a smaller model at higher quality or switching to a platform with more memory.

The Sweet Spot

IQ4_XS at 218.0 tok/s is the optimal 27B quant on the RTX 5090. It's the highest throughput that comfortably fits in VRAM, uses imatrix quantization for better quality than naive Q4, and handles up to 32k context without issues.

For applications where Q3-level quality is acceptable, Q3_K_XL-UD (212.4 tok/s) and IQ3_XXS-UD (216.6 tok/s) offer similar throughput with slightly smaller memory footprints, freeing context window headroom.

Recommendation for 27B on RTX 5090: IQ4_XS or IQ4_NL for the best speed-quality trade-off. Never use Q6_K+ unless you've verified it fits in your specific VRAM configuration (other processes may reduce available VRAM). If you need Q8 quality, switch to the GB10 or a multi-GPU setup.

Summary: The Quantization Cheat Sheet

Model Size	Best Throughput Quant	Best Quality/Speed Balance	Avoid
0.8B	Q4_1 (808 tok/s)	Q8_0 (777 tok/s)	BF16 (slowest)
2B	IQ4_NL (759 tok/s)	Q8_0 (678 tok/s)	BF16 (32% penalty)
4B	Q4_0 (447 tok/s)	Q6_K (404 tok/s)	K-quants (overhead)
9B	IQ2_XXS-UD (558 tok/s)	Q4_0 (553 tok/s)	BF16 for multi-user
27B	IQ4_XS (218 tok/s)	IQ4_NL (217 tok/s)	Q6_K+ (VRAM spill)

The recurring theme: on the RTX 5090, BF16 is never the fastest option, and for models ≥4B, the throughput penalty for full precision is significant. Quantization isn't just about saving VRAM — it's a direct performance optimization.

All benchmarks from Poor Paul's Benchmark. Run your own with the CLI and your results will appear in the dataset automatically.