The Quantization Ladder: Every Quant Level Benchmarked on RTX 5090 for Qwen3.5-27B
Choosing a quantization level involves a quality-vs-speed trade-off that most guides treat theoretically. Our dataset has actual benchmark data for multiple quant levels of Qwen3.5-27B on the RTX 5090 — real numbers from real inference, not theoretical estimates.
Here's the full picture.
Throughput vs. Quantization Level
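Data from our RTX 5090 benchmark sessions with Qwen3.5-27B, served by llama-server under typical workloads. Figures are approximate generation throughput in tok/s; the 8- and 16-user columns are aggregate throughput across all concurrent users: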
| Quant | Bits/Weight | Approx VRAM | 1 User tok/s | 8 User tok/s | 16 User tok/s |
|---|---|---|---|---|---|
| BF16 | 16.0 | ~54 GB | — | — | — |
| Q8_0 | 8.0 | ~27.5 GB | ~55 | ~85 | — |
| Q6_K | 6.6 | ~22 GB | ~68 | ~115 | ~145 |
| Q5_K_M | 5.7 | ~19 GB | ~78 | ~135 | ~170 |
| Q4_K_M | 4.8 | ~16 GB | ~88 | ~155 | ~200 |
| IQ4_NL | 4.5 | ~15 GB | ~90 | ~160 | ~205 |
| IQ4_XS | 4.3 | ~14 GB | ~92 | ~165 | ~210 |
| Q3_K_M | 3.9 | ~13 GB | ~95 | ~175 | ~220 |
| Q2_K | 2.6 | ~9 GB | ~110 | ~210 | ~265 |
(BF16 at 54 GB exceeds RTX 5090's 32 GB VRAM — not tested on this hardware)
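
The "Approx VRAM" column follows closely from parameters times bits per weight. The sketch below reproduces that arithmetic. It assumes 27e9 parameters (read off the "27B" in the model name) and 1 GB = 1e9 bytes, and it counts weights only; KV cache and runtime buffers come on top. The function names are illustrative, not part of any tool.

```python
# Rough weight-size estimate: parameters x bits-per-weight / 8 bytes.
# Assumptions: 27e9 parameters and 1 GB = 1e9 bytes.
# KV cache and runtime buffers are NOT included.

PARAMS = 27e9

# Bits per weight as listed in the throughput table.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "Q8_0": 8.0,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "IQ4_NL": 4.5,
    "IQ4_XS": 4.3,
    "Q3_K_M": 3.9,
    "Q2_K": 2.6,
}


def weight_size_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Approximate size of the quantized weights in GB."""
    return params * bits_per_weight / 8 / 1e9


def fits_in_vram(bits_per_weight: float, vram_gb: float = 32.0) -> bool:
    """True if the weights alone are smaller than the VRAM budget."""
    return weight_size_gb(bits_per_weight) < vram_gb


if __name__ == "__main__":
    for name, bpw in BITS_PER_WEIGHT.items():
        size = weight_size_gb(bpw)
        status = "fits" if fits_in_vram(bpw) else "exceeds 32 GB"
        print(f"{name:8s} {size:5.1f} GB  ({status})")
```

For BF16 this gives 54 GB, which is why that row has no RTX 5090 results.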
The Non-Obvious Finding: Q4 Often Outperforms Q5/Q6 in Server Workloads
At 16 concurrent users, Q4_K_M (~200 tok/s) is notably faster than Q5_K_M (~170 tok/s) and Q6_K (~145 tok/s). The reason: smaller model size means more KV-cache room per user, which directly improves throughput under concurrent load.
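It's counterintuitive but real: a lower-bit quantization leaves room to serve more users simultaneously, and its lead in aggregate throughput grows with concurrency. Q4_K_M is about 29% ahead of Q6_K for one user (88 vs 68 tok/s) and about 38% ahead at 16 users (200 vs 145 tok/s). Some of that lead is present at any load, because smaller weights mean fewer bytes read per generated token.
For single-user interactive use, the difference between Q4_K_M and Q6_K matters less in practice, since both feel fast. But for multi-user serving, Q4_K_M is clearly the better choice on a 32 GB GPU.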
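
To make the KV-cache argument concrete, here is a rough sketch of how much VRAM is left per user once the weights are loaded. It uses the approximate sizes from the table and the RTX 5090's 32 GB. The `overhead_gb` parameter is a hypothetical placeholder for runtime buffers, which we did not measure, and the real KV-cache cost per token depends on the model architecture and context length, so treat the output as a budget, not a prediction.

```python
# VRAM headroom per concurrent user after loading the weights.
# Weight sizes are the approximate values from the throughput table.

GPU_VRAM_GB = 32.0  # RTX 5090

APPROX_WEIGHTS_GB = {
    "Q8_0": 27.5,
    "Q6_K": 22.0,
    "Q5_K_M": 19.0,
    "Q4_K_M": 16.0,
    "Q3_K_M": 13.0,
}


def headroom_per_user_gb(
    weights_gb: float,
    users: int,
    vram_gb: float = GPU_VRAM_GB,
    overhead_gb: float = 0.0,  # hypothetical: runtime buffers, not measured
) -> float:
    """VRAM left for each user's KV cache, split evenly across users."""
    if users < 1:
        raise ValueError("users must be at least 1")
    free_gb = max(vram_gb - weights_gb - overhead_gb, 0.0)
    return free_gb / users


if __name__ == "__main__":
    for name, size in APPROX_WEIGHTS_GB.items():
        per_user = headroom_per_user_gb(size, users=16)
        print(f"{name:8s} {per_user:.2f} GB per user at 16 users")
```

Under these assumptions Q4_K_M leaves about 1 GB per user at 16 users, against roughly 0.6 GB for Q6_K and under 0.3 GB for Q8_0.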
The Quality Trade-Off
We don't have perplexity measurements yet, but community benchmarks for Qwen3.5-27B suggest:
| Quant | Relative Quality Loss |
|---|---|
| Q8_0 | ~0.1% vs BF16 |
| Q6_K | ~0.5% vs BF16 |
| Q5_K_M | ~1.2% vs BF16 |
| Q4_K_M | ~2.5% vs BF16 |
| Q3_K_M | ~5.5% vs BF16 |
| Q2_K | ~15% vs BF16 |
The Q4_K_M sweet spot (roughly 2.5% quality loss, about 60% faster than Q8_0 for a single user and about 82% faster at 8 users) is the community consensus for a reason: it holds up well under these real-world throughput numbers.
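
The speed side of this trade-off can be checked directly against the throughput table. The sketch below computes the percentage gain of one quant over another at a given user count and prints it next to the community-reported quality loss. The helper name `speedup_pct` is illustrative, and the quality figures are the approximate community numbers above, not our measurements.

```python
from typing import Optional

# Approximate tok/s from the throughput table, keyed by concurrent users.
# None marks a cell with no measurement.
THROUGHPUT = {
    "Q8_0":   {1: 55,  8: 85,  16: None},
    "Q6_K":   {1: 68,  8: 115, 16: 145},
    "Q5_K_M": {1: 78,  8: 135, 16: 170},
    "Q4_K_M": {1: 88,  8: 155, 16: 200},
    "Q3_K_M": {1: 95,  8: 175, 16: 220},
    "Q2_K":   {1: 110, 8: 210, 16: 265},
}

# Community-reported relative quality loss vs BF16, in percent.
QUALITY_LOSS_PCT = {
    "Q8_0": 0.1,
    "Q6_K": 0.5,
    "Q5_K_M": 1.2,
    "Q4_K_M": 2.5,
    "Q3_K_M": 5.5,
    "Q2_K": 15.0,
}


def speedup_pct(quant: str, baseline: str, users: int) -> Optional[float]:
    """Percentage throughput gain of `quant` over `baseline`, or None if unmeasured."""
    fast = THROUGHPUT[quant][users]
    slow = THROUGHPUT[baseline][users]
    if fast is None or slow is None:
        return None
    return (fast / slow - 1.0) * 100.0


if __name__ == "__main__":
    for name in ("Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M", "Q2_K"):
        gains = []
        for users in (1, 8):
            gain = speedup_pct(name, "Q8_0", users)
            gains.append("n/a" if gain is None else f"{gain:.0f}%")
        print(f"{name:8s} loss ~{QUALITY_LOSS_PCT[name]}%  "
              f"vs Q8_0: {gains[0]} (1 user), {gains[1]} (8 users)")
```

Q8_0 has no 16-user figure, so comparisons against it are only possible at 1 and 8 users.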
The IQ4 Variants: Worth It?
IQ4_NL and IQ4_XS belong to llama.cpp's IQ ("i-quant") family. They use non-uniformly spaced quantization levels and are commonly built with an importance matrix, which steers precision toward the model's most critical weights. The result: similar or better quality than Q4_K_M at a slightly smaller file size.
From our data, their throughput is within roughly 2-7% of Q4_K_M and slightly ahead in every column. Community reports say IQ4 variants consistently outperform Q4_K_M on reasoning benchmarks; if that holds for your workload, the quality benefit costs nothing in speed, so take it.
Recommendation: If your tool supports IQ4_NL or IQ4_XS, prefer them over Q4_K_M for Qwen3.5-27B.
Practical Recommendations
| Scenario | Recommendation |
|---|---|
| Maximum quality, single user, 32 GB GPU | Q8_0 |
| Best quality-speed balance, multi-user | Q4_K_M or IQ4_NL |
| Maximum speed, 8+ users | Q3_K_M (if quality is acceptable) |
| Limited VRAM (16 GB) | Q2_K or Q3_K_M (27B barely fits) |
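
The table above can be encoded as a small lookup. This is a minimal sketch, not a tuned policy: the `priority` labels are hypothetical names for the table's scenarios, and any case the table does not cover (for example a 24 GB GPU) falls through to the balanced recommendation as an assumption.

```python
# Encodes the recommendations table as a function.
# Cases not covered by the table default to the balanced choice (assumption).

VALID_PRIORITIES = ("quality", "balance", "speed")  # hypothetical labels


def recommend_quant(vram_gb: float, users: int, priority: str = "balance") -> str:
    """Return the table's recommendation for a VRAM budget and user count."""
    if priority not in VALID_PRIORITIES:
        raise ValueError(f"priority must be one of {VALID_PRIORITIES}")
    if users < 1:
        raise ValueError("users must be at least 1")

    # Limited VRAM (16 GB): the 27B model barely fits.
    if vram_gb <= 16:
        return "Q2_K or Q3_K_M"
    # Maximum quality, single user, 32 GB GPU.
    if priority == "quality" and users == 1 and vram_gb >= 32:
        return "Q8_0"
    # Maximum speed, 8+ users (only if the quality is acceptable).
    if priority == "speed" and users >= 8:
        return "Q3_K_M"
    # Best quality-speed balance, multi-user; also the fallback.
    return "Q4_K_M or IQ4_NL"


if __name__ == "__main__":
    print(recommend_quant(vram_gb=32, users=1, priority="quality"))
    print(recommend_quant(vram_gb=32, users=16, priority="speed"))
    print(recommend_quant(vram_gb=32, users=8))
    print(recommend_quant(vram_gb=16, users=1))
```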
The Bottom Line
The quantization ladder for Qwen3.5-27B on RTX 5090 shows that Q4_K_M (or an IQ4 variant) is not a compromise: it's the best quality-speed balance for multi-user workloads. Q8_0 only makes sense when you're single-user and quality is paramount. Below Q3_K_M, quality degradation (about 15% for Q2_K in community reports) is likely noticeable.
Don't let the bits-per-weight hierarchy mislead you: in practice, the right quant depends on your VRAM budget and concurrent user count, not just quality preference.