
RTX 5090 Quantization Guide: Every Qwen3.5 Variant Benchmarked

Tags: analysis, qwen, rtx-5090, quantization, guide

Quantization is one of those topics where conventional wisdom ("just use Q4_K_M") meets reality and gets complicated. We ran every Unsloth Qwen3.5 GGUF variant on an RTX 5090 — 22 quantization formats across 5 model sizes — to map the actual performance landscape.

The question isn't just "which quant is fastest?" It's "which quant gives me the best balance of speed and quality at my model size, my context length, and my concurrency level?"

The Full Picture: 0.8B

The 0.8B model is so small that it fits entirely in the RTX 5090's 31.8 GB VRAM at any quantization, including BF16. This makes it the perfect test of "does quant format even matter when memory isn't a constraint?"

Standard Quants

| Quant | Peak tok/s | Avg ITL (ms) | Best Config |
|---|---|---|---|
| Q4_1 | 808.5 | 4.5 | ctx=32768, cu=4 |
| Q4_0 | 788.6 | 4.7 | ctx=130064, cu=4 |
| IQ4_NL | 786.2 | 4.8 | ctx=8192, cu=8 |
| Q8_0 | 777.3 | 4.8 | ctx=16384, cu=8 |
| Q6_K | 775.4 | 4.6 | ctx=130064, cu=4 |
| IQ4_XS | 774.5 | 4.8 | ctx=16384, cu=8 |
| Q3_K_S | 766.9 | 4.8 | ctx=8192, cu=4 |
| Q3_K_M | 752.6 | 4.9 | ctx=8192, cu=4 |
| Q5_K_M | 732.2 | 5.1 | ctx=16384, cu=4 |
| BF16 | 730.0 | 5.0 | ctx=130064, cu=4 |
| Q4_K_S | 723.5 | 5.0 | ctx=8192, cu=4 |
| Q4_K_M | 712.4 | 5.4 | ctx=8192, cu=8 |
| Q5_K_S | 708.9 | 5.4 | ctx=16384, cu=8 |

Unsloth Dynamic (UD) Quants

| Quant | Peak tok/s | Avg ITL (ms) | Best Config |
|---|---|---|---|
| Q2_K_XL | 775.1 | 4.7 | ctx=130064, cu=8 |
| IQ2_XXS | 773.7 | 4.8 | ctx=16384, cu=8 |
| IQ2_M | 746.2 | 5.0 | ctx=130064, cu=8 |
| Q8_K_XL | 741.1 | 5.0 | ctx=8192, cu=4 |
| Q6_K_XL | 739.8 | 5.1 | ctx=130064, cu=8 |
| Q5_K_XL | 737.3 | 5.0 | ctx=130064, cu=4 |
| IQ3_XXS | 713.5 | 5.2 | ctx=8192, cu=4 |
| Q3_K_XL | 713.4 | 5.2 | ctx=16384, cu=8 |
| Q4_K_XL | 694.2 | 5.4 | ctx=130064, cu=8 |

The 0.8B Verdict

The spread from fastest (808.5 tok/s) to slowest (694.2 tok/s) is only 14%. At 0.8B, the model is so small that quantization format barely matters for throughput. Even so, BF16 is not the fastest: Q4_1 beats it by about 11% (808.5 vs 730.0 tok/s), because smaller weights mean less memory traffic even on a GPU with abundant bandwidth.
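The percentages in this section follow directly from the table values. A minimal sketch of the arithmetic, using peak tok/s figures from the 0.8B tables (the helper names are illustrative and not part of any benchmark tooling):

```python
def spread_pct(fastest: float, slowest: float) -> float:
    """Gap between fastest and slowest, as a percentage of the fastest."""
    return (fastest - slowest) / fastest * 100


def speedup_pct(a: float, b: float) -> float:
    """How much faster a is than b, as a percentage of b."""
    return (a / b - 1) * 100


# Peak tok/s values copied from the 0.8B tables.
q4_1, bf16, q4_k_xl_ud = 808.5, 730.0, 694.2

print(f"fastest-to-slowest spread: {spread_pct(q4_1, q4_k_xl_ud):.1f}%")
print(f"Q4_1 over BF16:            {speedup_pct(q4_1, bf16):.1f}%")
```

Note that the two helpers use different baselines (the faster value vs. the slower one), so it is worth stating which one a quoted percentage refers to.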

Recommendation for 0.8B: Use Q8_0 or Q6_K if you want near-lossless quality with no meaningful speed penalty. Pick Q4_1 if you want to squeeze every last tok/s out of the hardware. Skip BF16: at 730.0 tok/s it sits in the lower half of the table, about 6% behind Q8_0, with no practical quality benefit over Q8_0 in llama.cpp serving.


The Inflection Point: 2B

At 2B, the quant landscape starts to differentiate more.

Top Performers

| Quant | Peak tok/s | Best Config |
|---|---|---|
| IQ4_NL | 759.5 | ctx=65551, cu=4 |
| Q4_0 | 744.2 | ctx=32768, cu=8 |
| Q4_1 | 724.2 | ctx=130064, cu=16 |
| Q2_K_XL-UD | 714.0 | ctx=65551, cu=4 |
| Q6_K | 690.9 | ctx=16384, cu=4 |
| IQ4_XS | 690.1 | ctx=65551, cu=8 |
| Q3_K_S | 686.7 | ctx=65551, cu=4 |
| IQ2_XXS-UD | 679.8 | ctx=16384, cu=8 |
| Q8_0 | 678.0 | ctx=130064, cu=8 |
| IQ2_M-UD | 664.4 | ctx=65551, cu=8 |
| Q3_K_M | 657.6 | ctx=32768, cu=4 |
| BF16 | 518.8 | ctx=16384, cu=32 |

The BF16 Tax

BF16 drops to 518.8 tok/s — a 32% penalty vs IQ4_NL. At 2B, the model size is large enough that full-precision weights consume meaningfully more bandwidth. This is the first model size where the "just use BF16 for quality" approach has a real cost.
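One way to reason about the BF16 tax is a rough bandwidth-bound estimate: during decoding, each generated token has to stream the model weights from VRAM at least once, so the weight size puts a ceiling on single-stream tok/s. The sketch below is back-of-the-envelope only. The bandwidth figure (about 1,792 GB/s for the RTX 5090) and the nominal bits-per-weight values are assumptions, not numbers from the benchmark data, and the estimate ignores KV-cache traffic, kernel overhead, and batching.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def decode_ceiling_tok_s(params_billion: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream tok/s if every token reads all weights once."""
    return bandwidth_gb_s / weight_gb(params_billion, bits_per_weight)


BANDWIDTH_GB_S = 1792.0  # assumed RTX 5090 memory bandwidth
# Assumed nominal bits per weight; real GGUF files mix tensor types.
BPW = {"BF16": 16.0, "Q8_0": 8.5, "Q4_0": 4.5}

for name, bpw in BPW.items():
    size = weight_gb(2.0, bpw)
    ceiling = decode_ceiling_tok_s(2.0, bpw, BANDWIDTH_GB_S)
    print(f"{name:5s} ~{size:.2f} GB of weights, ceiling ~{ceiling:.0f} tok/s")
```

The measured peaks come from batched runs with several concurrent users, so they will not match a single-stream ceiling. The sketch only shows the direction of the effect: more bytes per weight means more memory traffic per token.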

Recommendation for 2B: IQ4_NL is the sweet spot — top throughput with the imatrix-optimized quantization preserving quality better than raw Q4 formats. Q8_0 at 678 tok/s is still fast if you want higher quality. Avoid BF16 unless you specifically need full precision.


The Middle Ground: 4B

4B doubles the 2B parameter count, and on the RTX 5090, the quant format starts to have larger throughput effects.

Standard Quants

| Quant | Peak tok/s | Best Config |
|---|---|---|
| Q4_0 | 446.6 | ctx=32768, cu=8 |
| Q4_1 | 437.7 | ctx=8192, cu=8 |
| IQ4_NL | 434.5 | ctx=8192, cu=8 |
| Q3_K_S | 417.4 | ctx=8192, cu=4 |
| Q6_K | 403.6 | ctx=130064, cu=8 |
| Q8_0 | 393.8 | ctx=32768, cu=16 |
| Q3_K_M | 386.1 | ctx=32768, cu=32 |
| Q4_K_M | 360.2 | ctx=32768, cu=8 |
| Q5_K_M | 358.4 | ctx=16384, cu=4 |
| Q4_K_S | 354.3 | ctx=32768, cu=16 |
| Q5_K_S | 347.9 | ctx=8192, cu=32 |
| BF16 | 313.2 | ctx=130064, cu=16 |

UD Quants

| Quant | Peak tok/s | Best Config |
|---|---|---|
| IQ2_XXS | 428.7 | ctx=8192, cu=8 |
| Q2_K_XL | 427.3 | ctx=16384, cu=8 |
| IQ2_M | 407.6 | ctx=130064, cu=8 |
| IQ3_XXS | 400.4 | ctx=32768, cu=8 |
| Q3_K_XL | 384.3 | ctx=16384, cu=16 |
| Q6_K_XL | 373.0 | ctx=32768, cu=32 |
| Q5_K_XL | 363.1 | ctx=8192, cu=16 |
| Q4_K_XL | 356.3 | ctx=8192, cu=8 |
| Q8_K_XL | 349.1 | ctx=32768, cu=16 |

The K-Quant Overhead

An unexpected pattern at 4B: the K-quant variants (Q4_K_M, Q4_K_S, Q5_K_M, Q5_K_S) are consistently slower than their simpler counterparts (Q4_0, Q4_1). Q4_0 at 446.6 tok/s beats Q4_K_M at 360.2 tok/s by 24%.

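A likely explanation is dequantization cost. K-quants pack weights into 256-weight super-blocks whose sub-block scales and minimums are themselves quantized, so decoding takes more work per weight than Q4_0/Q4_1, which store a single scale (plus a minimum for Q4_1) per 32-weight block. On the RTX 5090 in these runs, that extra work appears to outweigh the slightly better memory efficiency of K-quants. The simple Q4_0/Q4_1 formats let the GPU dequantize faster.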
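To make "simple format" concrete, here is a minimal Python sketch of dequantizing one Q4_0 block as laid out in ggml: one fp16 scale followed by 32 four-bit values packed into 16 bytes. It illustrates the format only and is not llama.cpp's CUDA kernel. The Q4_K super-block size in the last line (144 bytes per 256 weights) is quoted from memory of ggml's `block_q4_K` and should be treated as an assumption.

```python
import struct

QK4_0 = 32  # weights per Q4_0 block


def dequantize_q4_0_block(block: bytes) -> list[float]:
    """Decode one 18-byte Q4_0 block: fp16 scale + 16 bytes of packed nibbles."""
    if len(block) != 2 + QK4_0 // 2:
        raise ValueError("a Q4_0 block is 18 bytes")
    (d,) = struct.unpack("<e", block[:2])   # little-endian fp16 scale
    qs = block[2:]
    lo = [((b & 0x0F) - 8) * d for b in qs]  # low nibbles: weights 0..15
    hi = [((b >> 4) - 8) * d for b in qs]    # high nibbles: weights 16..31
    return lo + hi


def bits_per_weight(block_bytes: int, weights: int) -> float:
    """Storage cost of a block format in bits per weight."""
    return block_bytes * 8 / weights


# Example block: scale 0.5, every byte 0x80 (low nibble 0, high nibble 8).
block = struct.pack("<e", 0.5) + bytes([0x80] * 16)
values = dequantize_q4_0_block(block)

q4_0_bpw = bits_per_weight(18, 32)    # Q4_0: 18 bytes per 32 weights
q4_k_bpw = bits_per_weight(144, 256)  # Q4_K: assumed 144 bytes per 256 weights
```

Decoding a Q4_0 weight is one mask or shift, one subtraction, and one multiply. A K-quant decoder additionally has to unpack the per-sub-block scales and minimums before it can do the same.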

Recommendation for 4B: Q4_0 for maximum throughput, IQ4_NL if you want imatrix quality optimization, or Q6_K/Q8_0 if you're willing to trade roughly 10–12% throughput for less quantization noise. The Q4_K/Q5_K overhead is real at this model size.


The Challenge: 9B

At 9B, the model is large enough that quantization choices have dramatic throughput implications, and concurrency becomes a major factor.

Peak Throughput (32 Concurrent Users)

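| Quant | Peak tok/s | Avg TTFT | Avg ITL |
|---|---|---|---|
| IQ2_XXS-UD | 557.7 | 424 ms | 54.1 ms |
| Q3_K_XL-UD | 555.9 | 260 ms | 54.7 ms |
| Q4_K_XL-UD | 555.2 | 196 ms | 27.4 ms† |
| IQ3_XXS-UD | 553.6 | 303 ms | 54.7 ms |
| IQ4_XS | 553.5 | 267 ms | 55.4 ms |
| Q4_0 | 552.5 | 251 ms | 55.8 ms |
| Q5_K_XL-UD | 552.1 | 332 ms | 54.6 ms |
| Q4_1 | 551.2 | 286 ms | 55.4 ms |
| IQ2_M-UD | 548.3 | 665 ms | 54.6 ms |
| Q3_K_M | 546.8 | 263 ms | 55.9 ms |
| Q4_K_M | 536.3 | 265 ms | 56.9 ms |
| Q5_K_M | 532.5 | 356 ms | 57.7 ms |
| Q8_0 | 529.6 | 256 ms | 58.0 ms |
| BF16 | 482.0 | 377 ms | 62.8 ms |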

†At 16 concurrent users, not 32.

The Quality-Speed Frontier

At 9B, the throughput range from best (557.7 tok/s) to BF16 (482.0 tok/s) is a 16% spread. That's narrower than you might expect — the RTX 5090's VRAM bandwidth is generous enough that even BF16 at 9B doesn't create a severe bottleneck.

But look at the ITL column: all the quants cluster between 54–58 ms at 32 concurrent users, while BF16 is at 62.8 ms. Per-user latency is where the full-precision penalty shows up.

Single-User Performance vs Multi-User

The 9B data reveals a fascinating concurrency effect. At single user:

  • BF16: ~130 tok/s
  • Q4_0: ~150 tok/s
  • Best quant: ~170 tok/s

At 32 users, throughput jumps to 480–558 tok/s because the GPU can process multiple requests in parallel, batching attention computation. But this comes at the cost of ITL (2.4 ms single-user → 55 ms at 32 users) and TTFT.
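The trade-off between per-user latency and aggregate throughput can be sketched with simple arithmetic: if every active user receives one token per inter-token interval, aggregate decode throughput is roughly users divided by ITL. The helpers below are illustrative, and the estimate ignores prefill time (TTFT), so measured peaks should come in somewhat lower.

```python
def per_user_tok_s(itl_ms: float) -> float:
    """Tokens per second seen by a single user at a given inter-token latency."""
    return 1000.0 / itl_ms


def aggregate_tok_s(users: int, itl_ms: float) -> float:
    """Steady-state aggregate decode rate if each user gets one token per ITL."""
    return users * per_user_tok_s(itl_ms)


# Q4_0 at 9B with 32 concurrent users, ITL taken from the table above.
users, itl_ms, measured_peak = 32, 55.8, 552.5

estimate = aggregate_tok_s(users, itl_ms)
print(f"per-user rate:      {per_user_tok_s(itl_ms):.1f} tok/s")
print(f"estimated aggregate: {estimate:.1f} tok/s (measured peak: {measured_peak})")
```

Each of the 32 users sees well under 20 tok/s, yet the GPU as a whole delivers several times its single-user throughput.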

For interactive single-user chat, quant choice matters less at 9B. The best quant is only about 1.3× faster than BF16 (~170 vs ~130 tok/s), and all options produce acceptable ITL.

For serving multiple users, quant choice matters a lot. The delta between BF16 and IQ2_XXS-UD is 76 tok/s of aggregate throughput at 32 users.


The Wall: 27B

27B is where the RTX 5090's 31.8 GB VRAM becomes the defining constraint.

What Fits — And What Doesn't

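| Quant | Fits in VRAM? | Peak tok/s | Notes |
|---|---|---|---|
| IQ2_XXS-UD | Yes | 149.0 | Aggressive, quality concerns |
| IQ2_M-UD | Yes | 139.3 | |
| IQ3_XXS-UD | Yes | 216.6 | |
| Q2_K_XL-UD | Yes | 211.4 | |
| Q3_K_S | Yes | 211.8 | |
| Q3_K_M | Yes | 210.8 | |
| Q3_K_XL-UD | Yes | 212.4 | |
| IQ4_XS | Yes | 218.0 | Sweet spot |
| IQ4_NL | Yes | 217.4 | |
| Q4_0 | Yes | 214.5 | |
| Q4_1 | Yes | 212.5 | |
| Q4_K_M | Yes | 208.6 | |
| Q4_K_XL-UD | Yes | 217.2 | |
| Q4_K_S | Yes | 205.8 | |
| Q5_K_M | Yes (barely) | 197.7 | |
| Q5_K_S | Yes (barely) | 196.1 | |
| Q5_K_XL-UD | Yes (barely) | 212.7 | |
| Q6_K | Borderline | 186.1 | Edge of VRAM |
| Q6_K_XL-UD | Spills | 55.3 | 75% throughput loss |
| Q8_0 | Spills | 45.9 | 79% throughput loss |
| Q8_K_XL-UD | Spills | 22.9 | 90% throughput loss |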

The VRAM Cliff

The transition from "fits in VRAM" to "spills to system RAM" is catastrophic:

  • Q6_K (186.1 tok/s) → Q6_K_XL (55.3 tok/s): 70% throughput loss relative to Q6_K, or 75% relative to the best-fitting quant (IQ4_XS at 218.0 tok/s)
  • Q8_0 at 45.9 tok/s and Q8_K_XL at 22.9 tok/s are worse than running 27B on the GB10

This is the most important finding for 27B on the RTX 5090: never run a quantization level that spills VRAM. The performance penalty is so severe that you're better off using a smaller model at higher quality or switching to a platform with more memory.
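Before downloading a 27B quant, a quick estimate can indicate whether it is likely to spill. The sketch below multiplies parameter count by nominal bits per weight and adds a fixed allowance for everything else. The bits-per-weight values and the 4 GB `overhead_gb` default (KV cache, compute buffers, other processes) are hypothetical placeholders rather than measured numbers. Real GGUF files mix tensor types, and KV-cache size grows with context length, so treat the result as a first-pass check only.

```python
def estimated_vram_gb(params_billion: float, bits_per_weight: float,
                      overhead_gb: float) -> float:
    """Weights (params * bits / 8) plus a flat allowance for everything else."""
    return params_billion * bits_per_weight / 8 + overhead_gb


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float = 31.8, overhead_gb: float = 4.0) -> bool:
    """True if the rough estimate stays within the available VRAM."""
    return estimated_vram_gb(params_billion, bits_per_weight, overhead_gb) <= vram_gb


# Assumed nominal bits per weight for two formats.
for name, bpw in {"Q4_0": 4.5, "Q8_0": 8.5}.items():
    need = estimated_vram_gb(27.0, bpw, overhead_gb=4.0)
    verdict = "fits" if fits(27.0, bpw) else "likely spills"
    print(f"27B {name}: ~{need:.1f} GB needed, {verdict} in 31.8 GB")
```

Lowering `vram_gb` to whatever your system actually reports as free, and raising `overhead_gb` for long contexts, gives a more conservative answer.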

The Sweet Spot

IQ4_XS at 218.0 tok/s is the optimal 27B quant on the RTX 5090. It's the highest throughput that comfortably fits in VRAM, uses imatrix quantization for better quality than naive Q4, and handles up to 32k context without issues.

For applications where Q3-level quality is acceptable, Q3_K_XL-UD (212.4 tok/s) and IQ3_XXS-UD (216.6 tok/s) offer similar throughput with slightly smaller memory footprints, freeing context window headroom.

Recommendation for 27B on RTX 5090: IQ4_XS or IQ4_NL for the best speed-quality trade-off. Never use Q6_K+ unless you've verified it fits in your specific VRAM configuration (other processes may reduce available VRAM). If you need Q8 quality, switch to the GB10 or a multi-GPU setup.


Summary: The Quantization Cheat Sheet

| Model Size | Best Throughput Quant | Best Quality/Speed Balance | Avoid |
|---|---|---|---|
| 0.8B | Q4_1 (808 tok/s) | Q8_0 (777 tok/s) | BF16 (slower than Q8_0, no quality gain) |
| 2B | IQ4_NL (759 tok/s) | Q8_0 (678 tok/s) | BF16 (32% penalty) |
| 4B | Q4_0 (447 tok/s) | Q6_K (404 tok/s) | Q4_K/Q5_K variants (overhead) |
| 9B | IQ2_XXS-UD (558 tok/s) | Q4_0 (553 tok/s) | BF16 for multi-user |
| 27B | IQ4_XS (218 tok/s) | IQ4_NL (217 tok/s) | Q6_K+ (VRAM spill) |

The recurring theme: on the RTX 5090, BF16 is never the fastest option, and for models ≥2B, the throughput penalty for unquantized BF16 weights is significant (roughly 14–32% below the best quant at peak). Quantization isn't just about saving VRAM; it's a direct performance optimization.


All benchmarks from Poor Paul's Benchmark. Run your own with the CLI and your results will appear in the dataset automatically.