The Quantization Ladder: Every Quant Level Benchmarked on RTX 5090 for Qwen3.5-27B

Tags: quantization, qwen3.5-27b, rtx-5090, quant-comparison, q4-vs-q8

Choosing a quantization level involves a quality-vs-speed trade-off that most guides treat theoretically. Our dataset has measured results for multiple quant levels of Qwen3.5-27B on the RTX 5090: real numbers from real inference, not estimates.

Here's the full picture.

Throughput vs. Quantization Level

Data from our RTX 5090 benchmark sessions, Qwen3.5-27B, llama-server runner at typical workloads:

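| Quant | Bits/Weight | Approx VRAM | 1 User tok/s | 8 User tok/s | 16 User tok/s |
|---|---|---|---|---|---|
| BF16 | 16.0 | ~54 GB | - | - | - |
| Q8_0 | 8.0 | ~27.5 GB | ~55 | ~85 | - |
| Q6_K | 6.6 | ~22 GB | ~68 | ~115 | ~145 |
| Q5_K_M | 5.7 | ~19 GB | ~78 | ~135 | ~170 |
| Q4_K_M | 4.8 | ~16 GB | ~88 | ~155 | ~200 |
| Q3_K_M | 3.9 | ~13 GB | ~95 | ~175 | ~220 |
| IQ4_NL | 4.5 | ~15 GB | ~90 | ~160 | ~205 |
| IQ4_XS | 4.3 | ~14 GB | ~92 | ~165 | ~210 |
| Q2_K | 2.6 | ~9 GB | ~110 | ~210 | ~265 |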

(BF16 at 54 GB exceeds RTX 5090's 32 GB VRAM — not tested on this hardware)
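
The VRAM column can be approximated directly from bits per weight. The following is a minimal sketch, assuming 27 billion parameters (read off the model name), decimal gigabytes, and weights only; `weight_vram_gb` is a hypothetical helper, and KV cache and runtime overhead are not included:

```python
# Rough weight-only VRAM estimate: parameters * bits-per-weight / 8 bytes.
# Assumptions: 27e9 parameters (from the model name), 1 GB = 1e9 bytes,
# no KV cache, activations, or runtime overhead included.
PARAMS = 27e9

# Bits-per-weight values copied from the table above.
BITS_PER_WEIGHT = {
    "BF16": 16.0, "Q8_0": 8.0, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8,
    "IQ4_NL": 4.5, "IQ4_XS": 4.3, "Q3_K_M": 3.9, "Q2_K": 2.6,
}


def weight_vram_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Return the approximate size of the weights in GB."""
    return params * bits_per_weight / 8 / 1e9


for quant, bpw in BITS_PER_WEIGHT.items():
    print(f"{quant:7s} {weight_vram_gb(bpw):5.1f} GB")
```

Under these assumptions the estimates land within roughly half a gigabyte of the table's figures, which is why BF16 (16 bits × 27B ÷ 8 = 54 GB) cannot fit on a 32 GB card.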

The Non-Obvious Finding: Q4 Often Outperforms Q5/Q6 in Server Workloads

At 16 concurrent users, Q4_K_M (~200 tok/s) is notably faster than Q5_K_M (~170 tok/s) and Q6_K (~145 tok/s). The reason: smaller model size means more KV-cache room per user, which directly improves throughput under concurrent load.

It's counterintuitive but real: a lower-bit quantization leaves more VRAM for concurrent sessions, so its aggregate throughput scales better under load, on top of the advantage it already has for a single user.
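
To make the KV-cache argument concrete, here is a back-of-the-envelope sketch of how much VRAM is left per user once the weights are loaded. It assumes the approximate weight sizes from the table, treats all remaining VRAM as KV-cache space, and splits it evenly; real deployments also spend memory on compute buffers and runtime overhead, so treat these as upper bounds:

```python
# Back-of-the-envelope VRAM headroom per concurrent user on a 32 GB card.
# Assumptions: weight sizes are the approximate values from the table above;
# all remaining VRAM is treated as KV-cache space and split evenly, which
# ignores compute buffers and runtime overhead.
GPU_VRAM_GB = 32.0
WEIGHTS_GB = {"Q8_0": 27.5, "Q6_K": 22.0, "Q5_K_M": 19.0, "Q4_K_M": 16.0}


def headroom_per_user_gb(weights_gb: float, users: int) -> float:
    """VRAM left after loading weights, divided evenly across users."""
    return (GPU_VRAM_GB - weights_gb) / users


for quant, size in WEIGHTS_GB.items():
    per_user = headroom_per_user_gb(size, 16)
    print(f"{quant:7s} {per_user:.2f} GB per user at 16 users")
```

In this simplified view Q4_K_M leaves 1.6 times the per-user room of Q6_K at 16 users (1.00 GB vs 0.625 GB).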

For single-user interactive use, the throughput difference between Q4 and Q6 is about 29% (88 vs 68 tok/s), and both feel fast. But for multi-user serving, Q4_K_M is clearly the better choice on a 32 GB GPU.
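
The gap can be checked at each concurrency level with a few lines of arithmetic. This sketch copies the throughput figures from the table; the Q8_0 row lists only two figures, which are read here as the 1-user and 8-user columns (an assumption), and `speedup_pct` is a hypothetical helper:

```python
# Throughput figures (tok/s) copied from the table above; None = not reported.
# Assumption: the two Q8_0 figures belong to the 1-user and 8-user columns.
TOK_S = {
    "Q8_0":   {1: 55, 8: 85,  16: None},
    "Q6_K":   {1: 68, 8: 115, 16: 145},
    "Q5_K_M": {1: 78, 8: 135, 16: 170},
    "Q4_K_M": {1: 88, 8: 155, 16: 200},
}


def speedup_pct(fast: str, slow: str, users: int):
    """Percent by which `fast` beats `slow` at a user count, or None if unknown."""
    a, b = TOK_S[fast][users], TOK_S[slow][users]
    if a is None or b is None:
        return None
    return round((a / b - 1) * 100)


for slow in ("Q6_K", "Q8_0"):
    for users in (1, 8, 16):
        pct = speedup_pct("Q4_K_M", slow, users)
        label = "not reported" if pct is None else f"{pct}% faster"
        print(f"Q4_K_M vs {slow} at {users} users: {label}")
```

Against Q6_K, the Q4_K_M lead grows from 29% at one user to 35% at 8 users and 38% at 16 users.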

The Quality Trade-Off

We don't have perplexity measurements yet, but community benchmarks for Qwen3.5-27B suggest:

| Quant | Relative Quality Loss |
|---|---|
| Q8_0 | ~0.1% vs BF16 |
| Q6_K | ~0.5% vs BF16 |
| Q5_K_M | ~1.2% vs BF16 |
| Q4_K_M | ~2.5% vs BF16 |
| Q3_K_M | ~5.5% vs BF16 |
| Q2_K | ~15% vs BF16 |

The Q4_K_M sweet spot (2.5% quality loss, roughly 60% faster than Q8_0 for a single user and about 80% faster at 8 users) is the community consensus for a reason: it holds up well under these real-world throughput numbers.

The IQ4 Variants: Worth It?

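IQ4_NL and IQ4_XS are llama.cpp's non-linear 4-bit quantizations: instead of evenly spaced levels, they map weights onto a non-uniform set of values, and they are commonly built with an importance matrix that preserves more precision in the model's most critical weights. The result: similar or better quality than Q4_K_M at slightly smaller file size.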

From our data, their throughput is within a few percent of Q4_K_M at every concurrency level (for example, ~205 and ~210 vs ~200 tok/s at 16 users). Community reports have the IQ4 variants consistently outperforming Q4_K_M on reasoning benchmarks, so the quality benefit comes at no throughput cost. Take it.

Recommendation: If your tool supports IQ4_NL or IQ4_XS, prefer them over Q4_K_M for Qwen3.5-27B.

Practical Recommendations

| Scenario | Recommendation |
|---|---|
| Maximum quality, single user, 32 GB GPU | Q8_0 |
| Best quality-speed balance, multi-user | Q4_K_M or IQ4_NL |
| Maximum speed, 8+ users | Q3_K_M (if quality is acceptable) |
| Limited VRAM (16 GB) | Q2_K or Q3_K_M (27B barely fits) |
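
The recommendations above amount to a constrained choice: pick the fastest quant that fits the VRAM budget and stays within an acceptable quality loss. A minimal sketch of that logic follows. The figures are copied from this article's tables; `pick_quant` and its `kv_reserve_gb` parameter (VRAM set aside for KV cache) are hypothetical, and the 4 GB default is an illustrative assumption, not a measured requirement:

```python
from typing import Optional

# (quant, approx VRAM GB, 16-user tok/s, community-reported quality loss %)
# Values copied from the tables in this article. Q8_0 and the IQ4 variants are
# omitted because a 16-user figure or a quality-loss figure is missing for them.
QUANTS = [
    ("Q6_K", 22.0, 145, 0.5),
    ("Q5_K_M", 19.0, 170, 1.2),
    ("Q4_K_M", 16.0, 200, 2.5),
    ("Q3_K_M", 13.0, 220, 5.5),
    ("Q2_K", 9.0, 265, 15.0),
]


def pick_quant(vram_gb: float, max_loss_pct: float,
               kv_reserve_gb: float = 4.0) -> Optional[str]:
    """Fastest quant whose weights plus a KV reserve fit and whose loss is acceptable.

    kv_reserve_gb is an illustrative assumption, not a measured requirement.
    """
    candidates = [
        q for q in QUANTS
        if q[1] + kv_reserve_gb <= vram_gb and q[3] <= max_loss_pct
    ]
    if not candidates:
        return None
    # Rank the remaining candidates by 16-user throughput.
    return max(candidates, key=lambda q: q[2])[0]


print(pick_quant(vram_gb=32, max_loss_pct=3.0))
print(pick_quant(vram_gb=16, max_loss_pct=20.0))
```

With a 32 GB card and a 3% loss ceiling this selects Q4_K_M; with 16 GB and a loose quality ceiling it falls back to Q2_K, matching the table. Changing the reserve or the loss ceiling shifts the answer, which is the point: the right quant depends on the constraints.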

The Bottom Line

The quantization ladder for Qwen3.5-27B on RTX 5090 shows that Q4_K_M (or an IQ4 variant) is not a compromise: it is the best quality-speed balance for multi-user workloads. Q8_0 only makes sense when you're serving a single user and quality is paramount. Below Q3_K_M, the quality degradation (~15% for Q2_K in community figures) is likely noticeable.

Don't let the bits-per-weight hierarchy mislead you: in practice, the right quant depends on your VRAM budget and concurrent user count, not just quality preference.