RTX 5090 Quantization Guide: Every Qwen3.5 Variant Benchmarked
Quantization is one of those topics where conventional wisdom ("just use Q4_K_M") meets reality and gets complicated. We ran every Unsloth Qwen3.5 GGUF variant on an RTX 5090 — 22 quantization formats across 5 model sizes — to map the actual performance landscape.
The question isn't just "which quant is fastest?" It's "which quant gives me the best balance of speed and quality at my model size, my context length, and my concurrency level?"
The Full Picture: 0.8B
The 0.8B model is so small that it fits entirely in the RTX 5090's 31.8 GB VRAM at any quantization, including BF16. This makes it the perfect test of "does quant format even matter when memory isn't a constraint?"
Standard Quants
| Quant | Peak tok/s | Avg ITL (ms) | Best Config |
|---|---|---|---|
| Q4_1 | 808.5 | 4.5 | ctx=32768, cu=4 |
| Q4_0 | 788.6 | 4.7 | ctx=130064, cu=4 |
| IQ4_NL | 786.2 | 4.8 | ctx=8192, cu=8 |
| Q8_0 | 777.3 | 4.8 | ctx=16384, cu=8 |
| Q6_K | 775.4 | 4.6 | ctx=130064, cu=4 |
| IQ4_XS | 774.5 | 4.8 | ctx=16384, cu=8 |
| Q3_K_S | 766.9 | 4.8 | ctx=8192, cu=4 |
| Q3_K_M | 752.6 | 4.9 | ctx=8192, cu=4 |
| Q5_K_M | 732.2 | 5.1 | ctx=16384, cu=4 |
| BF16 | 730.0 | 5.0 | ctx=130064, cu=4 |
| Q4_K_S | 723.5 | 5.0 | ctx=8192, cu=4 |
| Q4_K_M | 712.4 | 5.4 | ctx=8192, cu=8 |
| Q5_K_S | 708.9 | 5.4 | ctx=16384, cu=8 |
Unsloth Dynamic (UD) Quants
| Quant | Peak tok/s | Avg ITL (ms) | Best Config |
|---|---|---|---|
| Q2_K_XL | 775.1 | 4.7 | ctx=130064, cu=8 |
| IQ2_XXS | 773.7 | 4.8 | ctx=16384, cu=8 |
| IQ2_M | 746.2 | 5.0 | ctx=130064, cu=8 |
| Q8_K_XL | 741.1 | 5.0 | ctx=8192, cu=4 |
| Q6_K_XL | 739.8 | 5.1 | ctx=130064, cu=8 |
| Q5_K_XL | 737.3 | 5.0 | ctx=130064, cu=4 |
| IQ3_XXS | 713.5 | 5.2 | ctx=8192, cu=4 |
| Q3_K_XL | 713.4 | 5.2 | ctx=16384, cu=8 |
| Q4_K_XL | 694.2 | 5.4 | ctx=130064, cu=8 |
The 0.8B Verdict
The spread from fastest (808.5 tok/s) to slowest (694.2 tok/s) is only 14% of the fastest result. At 0.8B, the model is so small that quantization format barely matters for throughput. Even so, BF16 is not the fastest: Q4_1 beats it by about 10.8%, because smaller weights mean less memory traffic even on a GPU with abundant bandwidth.
Recommendation for 0.8B: Use Q8_0 or Q6_K if you want near-lossless quality with no meaningful speed penalty. Pick Q4_1 if you want to squeeze every last tok/s out of the hardware. Don't use BF16: it is slower than both Q8_0 and Q6_K here and offers no quality benefit over Q8_0 in llama.cpp serving.
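The percentages in this section depend on which number is used as the baseline, so it can help to compute them explicitly. The snippet below is a minimal sketch using four peak values copied from the 0.8B tables; the helper names `pct_slower` and `pct_faster` are made up for illustration.

```python
# Peak tok/s values copied from the 0.8B tables above.
PEAK_TOK_S = {
    "Q4_1": 808.5,        # fastest standard quant
    "Q8_0": 777.3,
    "BF16": 730.0,
    "Q4_K_XL-UD": 694.2,  # slowest result overall
}


def pct_slower(fast: float, slow: float) -> float:
    """Throughput lost going from `fast` to `slow`, as a percent of `fast`."""
    return (fast - slow) / fast * 100.0


def pct_faster(fast: float, slow: float) -> float:
    """Throughput gained going from `slow` to `fast`, as a percent of `slow`."""
    return (fast - slow) / slow * 100.0


# Spread across all formats, measured against the fastest result.
spread = pct_slower(PEAK_TOK_S["Q4_1"], PEAK_TOK_S["Q4_K_XL-UD"])
# How much faster Q4_1 is than BF16, measured against BF16.
q4_1_vs_bf16 = pct_faster(PEAK_TOK_S["Q4_1"], PEAK_TOK_S["BF16"])

print(f"fastest-to-slowest spread: {spread:.1f}%")
print(f"Q4_1 over BF16: {q4_1_vs_bf16:.1f}%")
```

Stating the baseline alongside each percentage avoids confusion when comparing penalties across model sizes.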
The Inflection Point: 2B
At 2B, the quant landscape starts to differentiate more.
Top Performers
| Quant | Peak tok/s | Best Config |
|---|---|---|
| IQ4_NL | 759.5 | ctx=65551, cu=4 |
| Q4_0 | 744.2 | ctx=32768, cu=8 |
| Q4_1 | 724.2 | ctx=130064, cu=16 |
| Q2_K_XL-UD | 714.0 | ctx=65551, cu=4 |
| Q6_K | 690.9 | ctx=16384, cu=4 |
| IQ4_XS | 690.1 | ctx=65551, cu=8 |
| Q3_K_S | 686.7 | ctx=65551, cu=4 |
| IQ2_XXS-UD | 679.8 | ctx=16384, cu=8 |
| Q8_0 | 678.0 | ctx=130064, cu=8 |
| IQ2_M-UD | 664.4 | ctx=65551, cu=8 |
| Q3_K_M | 657.6 | ctx=32768, cu=4 |
| BF16 | 518.8 | ctx=16384, cu=32 |
The BF16 Tax
BF16 drops to 518.8 tok/s, a 32% penalty vs IQ4_NL. At 2B, the model is large enough that unquantized 16-bit weights consume meaningfully more memory bandwidth per token. This is the first model size where the "just use BF16 for quality" approach has a real cost.
Recommendation for 2B: IQ4_NL is the sweet spot — top throughput with the imatrix-optimized quantization preserving quality better than raw Q4 formats. Q8_0 at 678 tok/s is still fast if you want higher quality. Avoid BF16 unless you specifically need full precision.
The Middle Ground: 4B
4B doubles the 2B parameter count, and on the RTX 5090, the quant format starts to have larger throughput effects.
Standard Quants
| Quant | Peak tok/s | Best Config |
|---|---|---|
| Q4_0 | 446.6 | ctx=32768, cu=8 |
| Q4_1 | 437.7 | ctx=8192, cu=8 |
| IQ4_NL | 434.5 | ctx=8192, cu=8 |
| Q3_K_S | 417.4 | ctx=8192, cu=4 |
| Q6_K | 403.6 | ctx=130064, cu=8 |
| Q8_0 | 393.8 | ctx=32768, cu=16 |
| Q3_K_M | 386.1 | ctx=32768, cu=32 |
| Q4_K_M | 360.2 | ctx=32768, cu=8 |
| Q5_K_M | 358.4 | ctx=16384, cu=4 |
| Q4_K_S | 354.3 | ctx=32768, cu=16 |
| Q5_K_S | 347.9 | ctx=8192, cu=32 |
| BF16 | 313.2 | ctx=130064, cu=16 |
UD Quants
| Quant | Peak tok/s | Best Config |
|---|---|---|
| IQ2_XXS | 428.7 | ctx=8192, cu=8 |
| Q2_K_XL | 427.3 | ctx=16384, cu=8 |
| IQ2_M | 407.6 | ctx=130064, cu=8 |
| IQ3_XXS | 400.4 | ctx=32768, cu=8 |
| Q3_K_XL | 384.3 | ctx=16384, cu=16 |
| Q6_K_XL | 373.0 | ctx=32768, cu=32 |
| Q5_K_XL | 363.1 | ctx=8192, cu=16 |
| Q4_K_XL | 356.3 | ctx=8192, cu=8 |
| Q8_K_XL | 349.1 | ctx=32768, cu=16 |
The K-Quant Overhead
An unexpected pattern at 4B: the K-quant variants (Q4_K_M, Q4_K_S, Q5_K_M, Q5_K_S) are consistently slower than their simpler counterparts (Q4_0, Q4_1). Q4_0 at 446.6 tok/s beats Q4_K_M at 360.2 tok/s by 24%.
A plausible explanation is dequantization cost. Q4_0 and Q4_1 store one scale (plus one minimum for Q4_1) per 32-weight block, while K-quants group blocks into 256-weight super-blocks whose per-block scales and minimums are themselves quantized, so each weight takes an extra step to reconstruct. In these runs that extra work outweighed whatever K-quants gain elsewhere, and the simpler Q4_0/Q4_1 formats came out ahead.
Recommendation for 4B: Q4_0 for maximum throughput, IQ4_NL if you want imatrix quality optimization, or Q6_K/Q8_0 if you're willing to trade roughly 10–12% throughput for less quantization noise. The Q4_K/Q5_K overhead is real at this model size.
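To make the structural difference between the simple and K-quant formats concrete, here is a simplified sketch of the two dequantization formulas. It follows the general shape of llama.cpp's Q4_0 and Q4_K formats (one scale per block versus a super-block with second-level scales and minimums) but ignores bit packing entirely. The function names and the tiny block sizes in the example are illustrative assumptions, not the library's code.

```python
def dequant_q4_0_style(d: float, quants: list[int]) -> list[float]:
    """One scale per block: w = d * (q - 8), with q in 0..15."""
    return [d * (q - 8) for q in quants]


def dequant_q4_k_style(
    d: float,
    dmin: float,
    sub_scales: list[int],
    sub_mins: list[int],
    quants: list[int],
) -> list[float]:
    """Super-block of sub-blocks: w = (d * scale_j) * q - (dmin * min_j)."""
    out: list[float] = []
    sub_len = len(quants) // len(sub_scales)
    for j, (sc, mn) in enumerate(zip(sub_scales, sub_mins)):
        scale = d * sc      # extra step: rebuild this sub-block's scale
        offset = dmin * mn  # extra step: rebuild this sub-block's minimum
        block = quants[j * sub_len:(j + 1) * sub_len]
        out.extend(scale * q - offset for q in block)
    return out


# Toy inputs: real blocks hold 32 weights and real super-blocks hold 256.
simple = dequant_q4_0_style(0.5, [0, 8, 15])
kquant = dequant_q4_k_style(0.5, 0.25, [2, 4], [1, 2], [0, 15, 0, 15])
print(simple)  # [-4.0, 0.0, 3.5]
print(kquant)  # [-0.25, 14.75, -0.5, 29.5]
```

The second function does more arithmetic and more metadata lookups per weight, which is the kind of overhead the 4B results point to. Whether that fully explains the measured gap is not something this sketch can confirm.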
The Challenge: 9B
At 9B, the model is large enough that quantization choices have dramatic throughput implications, and concurrency becomes a major factor.
Peak Throughput (32 Concurrent Users)
| Quant | Peak tok/s | Avg TTFT | Avg ITL |
|---|---|---|---|
| IQ2_XXS-UD | 557.7 | 424 ms | 54.1 ms |
| Q3_K_XL-UD | 555.9 | 260 ms | 54.7 ms |
| Q4_K_XL-UD | 555.2 | 196 ms | 27.4 ms† |
| IQ3_XXS-UD | 553.6 | 303 ms | 54.7 ms |
| IQ4_XS | 553.5 | 267 ms | 55.4 ms |
| Q4_0 | 552.5 | 251 ms | 55.8 ms |
| Q5_K_XL-UD | 552.1 | 332 ms | 54.6 ms |
| Q4_1 | 551.2 | 286 ms | 55.4 ms |
| IQ2_M-UD | 548.3 | 665 ms | 54.6 ms |
| Q3_K_M | 546.8 | 263 ms | 55.9 ms |
| Q4_K_M | 536.3 | 265 ms | 56.9 ms |
| Q5_K_M | 532.5 | 356 ms | 57.7 ms |
| Q8_0 | 529.6 | 256 ms | 58.0 ms |
| BF16 | 482.0 | 377 ms | 62.8 ms |
†At 16 concurrent users, not 32.
The Quality-Speed Frontier
At 9B, the best quant (557.7 tok/s) is about 16% faster than BF16 (482.0 tok/s). That's narrower than you might expect: the RTX 5090's VRAM bandwidth is generous enough that even BF16 at 9B doesn't create a severe bottleneck.
But look at the ITL column: the quants measured at 32 concurrent users cluster between 54–58 ms, while BF16 is at 62.8 ms. Per-user latency is where the full-precision penalty shows up.
Single-User Performance vs Multi-User
The 9B data reveals a fascinating concurrency effect. At single user:
- BF16: ~130 tok/s
- Q4_0: ~150 tok/s
- Best quant: ~170 tok/s
At 32 users, aggregate throughput jumps to 480–558 tok/s because the server batches requests, so each pass over the weights produces a token for every active sequence. The cost is per-user latency: ITL rises from single-digit milliseconds at one user to about 55 ms at 32 users, and TTFT grows as well.
For interactive single-user chat, quant choice has little practical effect at 9B. The best quant is only about 1.3× faster than BF16 (~170 vs ~130 tok/s), and all options produce acceptable ITL.
For serving multiple users, quant choice matters more. The delta between BF16 and IQ2_XXS-UD is 76 tok/s of aggregate throughput at 32 users, and BF16 adds roughly 5 to 9 ms to every token each user waits for.
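The link between ITL and what each user actually sees can be made explicit. The sketch below converts the average ITL values from the 32-user table into a per-user decode rate, and into a rough steady-state aggregate that assumes every user is decoding continuously. That assumption ignores TTFT, prefill, and queueing, so the estimate should land somewhat above the measured peaks. The function names are illustrative.

```python
# Average ITL in milliseconds at 32 concurrent users, from the 9B table.
AVG_ITL_MS = {
    "IQ2_XXS-UD": 54.1,
    "Q8_0": 58.0,
    "BF16": 62.8,
}

USERS = 32


def per_user_tok_s(itl_ms: float) -> float:
    """Decode rate one user sees: one token every `itl_ms` milliseconds."""
    return 1000.0 / itl_ms


def steady_state_aggregate(users: int, itl_ms: float) -> float:
    """Upper-bound aggregate if all users decode nonstop (no prefill gaps)."""
    return users * per_user_tok_s(itl_ms)


for name, itl in AVG_ITL_MS.items():
    rate = per_user_tok_s(itl)
    total = steady_state_aggregate(USERS, itl)
    print(f"{name}: {rate:.1f} tok/s per user, ~{total:.0f} tok/s aggregate ceiling")
```

At these ITLs each user sees roughly 16 to 18 tok/s, which is why a few milliseconds of extra ITL is noticeable in a streaming chat UI even when the aggregate numbers look close.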
The Wall: 27B
27B is where the RTX 5090's 31.8 GB VRAM becomes the defining constraint.
What Fits — And What Doesn't
| Quant | Fits in VRAM? | Peak tok/s | Notes |
|---|---|---|---|
| IQ2_XXS-UD | Yes | 149.0 | Aggressive, quality concerns |
| IQ2_M-UD | Yes | 139.3 | |
| IQ3_XXS-UD | Yes | 216.6 | |
| Q2_K_XL-UD | Yes | 211.4 | |
| Q3_K_S | Yes | 211.8 | |
| Q3_K_M | Yes | 210.8 | |
| Q3_K_XL-UD | Yes | 212.4 | |
| IQ4_XS | Yes | 218.0 | Sweet spot |
| IQ4_NL | Yes | 217.4 | |
| Q4_0 | Yes | 214.5 | |
| Q4_1 | Yes | 212.5 | |
| Q4_K_M | Yes | 208.6 | |
| Q4_K_XL-UD | Yes | 217.2 | |
| Q4_K_S | Yes | 205.8 | |
| Q5_K_M | Yes (barely) | 197.7 | |
| Q5_K_S | Yes (barely) | 196.1 | |
| Q5_K_XL-UD | Yes (barely) | 212.7 | |
| Q6_K | Borderline | 186.1 | Edge of VRAM |
| Q6_K_XL-UD | Spills | 55.3 | ~75% below IQ4_XS |
| Q8_0 | Spills | 45.9 | ~79% below IQ4_XS |
| Q8_K_XL-UD | Spills | 22.9 | ~90% below IQ4_XS |
The VRAM Cliff
The transition from "fits in VRAM" to "spills to system RAM" is catastrophic:
- Q6_K (186.1 tok/s) → Q6_K_XL (55.3 tok/s): 70% throughput loss
- Q8_0 at 45.9 tok/s and Q8_K_XL at 22.9 tok/s are worse than running 27B on the GB10
This is the most important finding for 27B on the RTX 5090: never run a quantization level that spills VRAM. The performance penalty is so severe that you're better off using a smaller model at higher quality or switching to a platform with more memory.
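A quick pre-flight check can catch a likely spill before you load a model. The sketch below compares a GGUF file's size on disk against a VRAM budget, after setting aside a reserve for the KV cache and runtime buffers. Treat it as a rough heuristic under stated assumptions: the 31.8 figure is taken from this article and assumed to be GiB, the 4 GiB default reserve is a placeholder guess that grows with context length and concurrency, and the function names are hypothetical.

```python
import os

GIB = 1024 ** 3


def fits_in_vram(
    weights_bytes: int,
    vram_gib: float = 31.8,    # budget from this article, assumed to be GiB
    reserve_gib: float = 4.0,  # placeholder for KV cache and buffers
) -> bool:
    """True if weights plus the assumed reserve stay within the VRAM budget."""
    return weights_bytes + reserve_gib * GIB <= vram_gib * GIB


def check_gguf(path: str, **kwargs: float) -> bool:
    """Use the GGUF file size on disk as a proxy for the weight footprint."""
    return fits_in_vram(os.path.getsize(path), **kwargs)


# Synthetic sizes to show the behavior; substitute a real file via check_gguf().
print(fits_in_vram(20 * GIB))  # True: 20 + 4 GiB is under the 31.8 GiB budget
print(fits_in_vram(30 * GIB))  # False: 30 + 4 GiB exceeds the budget
```

Because the reserve depends heavily on context length and the number of concurrent slots, a result near the boundary should be confirmed by loading the model and watching actual VRAM usage.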
The Sweet Spot
IQ4_XS at 218.0 tok/s is the optimal 27B quant on the RTX 5090. It's the highest throughput that comfortably fits in VRAM, uses imatrix quantization for better quality than naive Q4, and handles up to 32k context without issues.
For applications where Q3-level quality is acceptable, Q3_K_XL-UD (212.4 tok/s) and IQ3_XXS-UD (216.6 tok/s) offer similar throughput with slightly smaller memory footprints, freeing context window headroom.
Recommendation for 27B on RTX 5090: IQ4_XS or IQ4_NL for the best speed-quality trade-off. Avoid Q6_K and larger unless you've verified that it fits in your specific VRAM configuration (other processes may reduce available VRAM). If you need Q8 quality, switch to the GB10 or a multi-GPU setup.
Summary: The Quantization Cheat Sheet
| Model Size | Best Throughput Quant | Best Quality/Speed Balance | Avoid |
|---|---|---|---|
| 0.8B | Q4_1 (808 tok/s) | Q8_0 (777 tok/s) | BF16 (slower than Q8_0) |
| 2B | IQ4_NL (759 tok/s) | Q8_0 (678 tok/s) | BF16 (32% penalty) |
| 4B | Q4_0 (447 tok/s) | Q6_K (404 tok/s) | Q4_K/Q5_K variants (overhead) |
| 9B | IQ2_XXS-UD (558 tok/s) | Q4_0 (553 tok/s) | BF16 for multi-user |
| 27B | IQ4_XS (218 tok/s) | IQ4_NL (217 tok/s) | Q6_K+ (VRAM spill) |
The recurring theme: on the RTX 5090, BF16 is never the fastest option, and from 2B upward the throughput penalty for unquantized weights is significant (roughly 14% to 32% in these runs). Quantization isn't just about saving VRAM; it's a direct performance optimization.
All benchmarks from Poor Paul's Benchmark. Run your own with the CLI and your results will appear in the dataset automatically.