Unsloth Dynamic vs Standard GGUF: When Mixed-Precision Quantization Pays Off
Unsloth Dynamic (UD) quantization is a mixed-precision technique that uses higher precision for sensitive layers and lower precision for less critical ones. The promise: better output quality than uniform quantization at the same average bit width.
But what about performance? Every UD GGUF in the Qwen3.5 family carries the -UD suffix and uses _XL format variants (Q8_K_XL, Q6_K_XL, etc.) along with ultra-low-bit imatrix quants (IQ2_XXS, IQ2_M, IQ3_XXS). We ran every combination on the RTX 5090 to answer: does mixed-precision quantization have a throughput cost?
0.8B: Standard vs UD, a Small Gap
| Standard Quant | tok/s | UD Equivalent | tok/s | Δ |
|---|---|---|---|---|
| Q8_0 | 777.3 | Q8_K_XL | 741.1 | -4.7% |
| Q6_K | 775.4 | Q6_K_XL | 739.8 | -4.6% |
| Q5_K_M | 732.2 | Q5_K_XL | 737.3 | +0.7% |
| Q4_K_M | 712.4 | Q4_K_XL | 694.2 | -2.6% |
| Q3_K_M | 752.6 | Q3_K_XL | 713.4 | -5.2% |
| — | — | Q2_K_XL | 775.1 | — |
| — | — | IQ2_XXS | 773.7 | — |
| — | — | IQ2_M | 746.2 | — |
| — | — | IQ3_XXS | 713.5 | — |
At 0.8B, UD quants range from 0.7% faster (Q5_K_XL) to 5.2% slower (Q3_K_XL) than the standard quant at the same nominal bit width. The mixed-precision overhead of the XL variants is small but measurable. The ultra-low quants hold up well: IQ2_XXS (773.7 tok/s) and Q2_K_XL (775.1 tok/s) match Q6_K (775.4 tok/s) while using a fraction of the memory.
Verdict: At 0.8B, the performance difference tops out at about 5%. Pick UD if you want quality headroom at low bit widths; pick standard for the very last drops of throughput.
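The Δ column in these tables is the relative change of the UD quant against its standard counterpart. As a minimal sketch, it can be recomputed from the published tok/s values; the function and variable names here are illustrative and not part of the benchmark tooling.

```python
def throughput_delta_pct(standard_tps: float, ud_tps: float) -> float:
    """Relative change of the UD quant versus the standard quant, in percent."""
    return (ud_tps - standard_tps) / standard_tps * 100.0


# (standard quant, standard tok/s, UD quant, UD tok/s), copied from the 0.8B table
pairs_0_8b = [
    ("Q8_0", 777.3, "Q8_K_XL", 741.1),
    ("Q6_K", 775.4, "Q6_K_XL", 739.8),
    ("Q5_K_M", 732.2, "Q5_K_XL", 737.3),
    ("Q4_K_M", 712.4, "Q4_K_XL", 694.2),
    ("Q3_K_M", 752.6, "Q3_K_XL", 713.4),
]

# Map each UD quant to its delta, rounded to one decimal like the table
deltas = {
    ud_name: round(throughput_delta_pct(std_tps, ud_tps), 1)
    for _, std_tps, ud_name, ud_tps in pairs_0_8b
}

for ud_name, delta in deltas.items():
    print(f"{ud_name}: {delta:+.1f}%")
```

A negative value means the UD quant is slower than the standard quant in the same row.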
2B: UD Starts to Shine
| Standard Quant | tok/s | UD Equivalent | tok/s | Δ |
|---|---|---|---|---|
| Q8_0 | 678.0 | Q8_K_XL | 597.3 | -11.9% |
| Q6_K | 690.9 | Q6_K_XL | 643.3 | -6.9% |
| Q4_0 | 744.2 | Q2_K_XL | 714.0 | -4.1% |
| Q3_K_M | 657.6 | Q3_K_XL | 629.7 | -4.2% |
| — | — | IQ2_XXS | 679.8 | — |
| — | — | IQ2_M | 664.4 | — |
At Q8, the UD variant shows an 11.9% throughput penalty (678.0 vs 597.3 tok/s), and the gap narrows to 4–7% in the other rows. Note that the third row pairs Q4_0 with Q2_K_XL, which is not a like-for-like bit width. A plausible explanation for the larger penalty at high bit widths is that mixing tensor precisions adds dequantization work, though these throughput numbers alone do not isolate the cause.
But the ultra-low UD quants tell a different story: IQ2_XXS at 679.8 tok/s matches standard Q8_0's 678.0 tok/s while using roughly a quarter of the memory at nominal bit widths. This is where UD earns its keep: Q8-comparable speed at a fraction of the footprint.
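The memory comparison can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes the nominal bits per weight of the uniform llama.cpp formats (8.5 for Q8_0, 2.0625 for IQ2_XXS) and a round 2e9 parameter count. Both are assumptions: a UD file keeps some tensors at higher precision, so its real size will be larger than the nominal IQ2_XXS figure and the real ratio somewhat smaller.

```python
# Nominal bits per weight for the uniform formats (assumption: UD files
# mix tensor types, so their effective bits per weight will differ).
NOMINAL_BPW = {"Q8_0": 8.5, "IQ2_XXS": 2.0625}


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


n_params = 2e9  # assumption: "2B" taken as exactly 2 billion weights

q8_bytes = weight_bytes(n_params, NOMINAL_BPW["Q8_0"])
iq2_bytes = weight_bytes(n_params, NOMINAL_BPW["IQ2_XXS"])
ratio = q8_bytes / iq2_bytes

print(f"Q8_0    ~ {q8_bytes / 1e9:.2f} GB")
print(f"IQ2_XXS ~ {iq2_bytes / 1e9:.2f} GB")
print(f"ratio   ~ {ratio:.2f}x")
```

The estimate covers weights only; KV cache and runtime buffers come on top and do not shrink with the weight quantization.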
4B: Low-Bit Quants Pull Ahead
| Standard Quant | tok/s | UD Equivalent | tok/s | Δ |
|---|---|---|---|---|
| Q8_0 | 393.8 | Q8_K_XL | 349.1 | -11.3% |
| Q6_K | 403.6 | Q6_K_XL | 373.0 | -7.6% |
| Q4_K_M | 360.2 | Q4_K_XL | 356.3 | -1.1% |
| Q3_K_M | 386.1 | Q3_K_XL | 384.3 | -0.5% |
| — | — | IQ2_XXS | 428.7 | — |
| — | — | IQ2_M | 407.6 | — |
| — | — | Q2_K_XL | 427.3 | — |
The pattern solidifies at 4B:
- High-bit UD quants (Q6, Q8) cost roughly 8–11% throughput vs standard equivalents, while Q3 and Q4 are within about 1%
- Low-bit UD quants are actually faster than standard high-bit quants (IQ2_XXS at 428.7 tok/s > Q8_0 at 393.8 tok/s)
Why? At 4B, memory bandwidth starts to matter. The ultra-low quants shrink the weights so much that, even with mixed-precision decoding overhead, net memory traffic drops, and the RTX 5090 has compute to spare for the dequantization.
9B: UD Dominates the Leaderboard
| Rank | Quant | tok/s | Type |
|---|---|---|---|
| 1 | IQ2_XXS-UD | 557.7 | UD |
| 2 | Q3_K_XL-UD | 555.9 | UD |
| 3 | Q4_K_XL-UD | 555.2 | UD |
| 4 | IQ3_XXS-UD | 553.6 | UD |
| 5 | IQ4_XS | 553.5 | Standard |
| 6 | Q4_0 | 552.5 | Standard |
| 7 | Q5_K_XL-UD | 552.1 | UD |
| 8 | Q4_1 | 551.2 | Standard |
| 9 | IQ2_M-UD | 548.3 | UD |
| 10 | Q2_K_XL-UD | 543.7 | UD |
At 9B, 7 of the top 10 performers are UD quants. The mixed-precision approach pays off handsomely at this model size because:
- The model is large enough that memory bandwidth is the primary bottleneck
- Ultra-low quants (IQ2_XXS, IQ2_M) shrink the model dramatically, freeing bandwidth
- The UD format's selective high-precision layers preserve quality where it matters
The throughput range is narrow (543–558 tok/s across the top 10), but the quality difference between IQ2_XXS and Q4_0 can be significant depending on your use case. UD quants give you the speed of 2-bit with (theoretically) better quality than uniform 2-bit.
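The two headline numbers for this leaderboard, the UD share and the spread across the top 10, can be recomputed directly from the table. This is a small illustrative sketch; the variable names are ours.

```python
# (quant, tok/s, type), copied from the 9B top-10 table
top10_9b = [
    ("IQ2_XXS-UD", 557.7, "UD"),
    ("Q3_K_XL-UD", 555.9, "UD"),
    ("Q4_K_XL-UD", 555.2, "UD"),
    ("IQ3_XXS-UD", 553.6, "UD"),
    ("IQ4_XS", 553.5, "Standard"),
    ("Q4_0", 552.5, "Standard"),
    ("Q5_K_XL-UD", 552.1, "UD"),
    ("Q4_1", 551.2, "Standard"),
    ("IQ2_M-UD", 548.3, "UD"),
    ("Q2_K_XL-UD", 543.7, "UD"),
]

# How many of the top 10 are UD quants
ud_count = sum(1 for _, _, kind in top10_9b if kind == "UD")

# Relative gap between the fastest and slowest entry, in percent
fastest = max(tps for _, tps, _ in top10_9b)
slowest = min(tps for _, tps, _ in top10_9b)
spread_pct = (fastest - slowest) / fastest * 100.0

print(f"UD entries in top 10: {ud_count}")
print(f"spread: {spread_pct:.1f}% ({slowest} to {fastest} tok/s)")
```

With a spread this small, the ranking within the top 10 says less than the quality difference between the quants.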
27B: UD Is Survival, Not Optimization
At 27B, the question changes: the choice of quant determines whether the model fits in VRAM at all.
| Quant | Fits in VRAM? | tok/s | Type |
|---|---|---|---|
| IQ4_XS | Yes | 218.0 | Standard |
| Q4_K_XL | Yes | 217.2 | UD |
| IQ4_NL | Yes | 217.4 | Standard |
| IQ3_XXS | Yes | 216.6 | UD |
| Q3_K_XL | Yes | 212.4 | UD |
| Q2_K_XL | Yes | 211.4 | UD |
| Q5_K_XL | Yes (barely) | 212.7 | UD |
| Q6_K_XL | Spills | 55.3 | UD |
| Q8_K_XL | Spills | 22.9 | UD |
The UD variants at Q2–Q5 levels run at 211–217 tok/s, within about 3% of the standard quants (217–218 tok/s), while the mixed-precision layout potentially preserves more quality at the same bit width.
But Q6_K_XL and Q8_K_XL both spill to system RAM, and throughput collapses to 55.3 and 22.9 tok/s. The table has no standard Q6_K or Q8_0 row at 27B, so it cannot show how much of this is specific to the XL variants. Still, the extra bytes from layers kept at higher precision can tip a borderline model over the VRAM edge: if your 27B model barely fits at standard Q6_K, Q6_K_XL may not fit.
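A spill shows up in the numbers as a throughput cliff rather than a gradual slowdown. One rough way to flag it, sketched below, is to compare each quant against the median of the table. The 50% threshold is an arbitrary assumption chosen for illustration, not a rule used by the benchmark.

```python
from statistics import median

# (quant, tok/s), copied from the 27B table
rows_27b = [
    ("IQ4_XS", 218.0),
    ("Q4_K_XL", 217.2),
    ("IQ4_NL", 217.4),
    ("IQ3_XXS", 216.6),
    ("Q3_K_XL", 212.4),
    ("Q2_K_XL", 211.4),
    ("Q5_K_XL", 212.7),
    ("Q6_K_XL", 55.3),
    ("Q8_K_XL", 22.9),
]


def likely_spilled(rows, cliff_ratio=0.5):
    """Return quants whose tok/s falls below cliff_ratio times the median.

    cliff_ratio is a hypothetical threshold; tune it for your own data.
    """
    baseline = median(tps for _, tps in rows)
    return [name for name, tps in rows if tps < cliff_ratio * baseline]


spilled = likely_spilled(rows_27b)
print(spilled)
```

On this table the heuristic flags the same two quants that the "Fits in VRAM?" column marks as spilling.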
When Should You Use UD Quants?
Use UD when:
- You're running ≥4B models and want the fastest possible throughput (ultra-low UD quants win)
- You need low-bit quantization but care about output quality (mixed-precision helps)
- You're at 9B where UD quants dominate the throughput leaderboard
Stick with standard when:
- You're running 0.8B–2B models where the overhead isn't worth it
- You want Q8+ quality (the XL format overhead is noticeable at high bit widths)
- You're near the VRAM boundary at 27B (the higher-precision layers add to model size)
The general rule: UD quants are a speed optimization at low bit widths and, in principle, a quality optimization at all bit widths, but they come with a throughput penalty of roughly 5–12% at high bit widths (Q6 and above). The sweet spot is UD at Q3–Q4 levels on models ≥4B.
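The guidance above can be condensed into a small decision helper. This is only a sketch of the rules of thumb in this section: the function name, its parameters, the Q6 cutoff, and the order in which the rules are checked are all our assumptions, and it is no substitute for benchmarking on your own hardware.

```python
def recommend_quant_family(
    model_size_b: float,
    target_bits: int,
    near_vram_limit: bool,
    prioritize_quality: bool = False,
) -> str:
    """Return "UD" or "standard" following the rules of thumb above.

    All parameters are hypothetical inputs for illustration:
    model_size_b is the parameter count in billions, target_bits the
    nominal quant level (2 for Q2/IQ2, 8 for Q8, and so on).
    """
    # Near the VRAM boundary, the extra size of UD files is a risk.
    if near_vram_limit:
        return "standard"
    # At Q6 and above, the UD variants showed a throughput penalty.
    if target_bits >= 6:
        return "standard"
    # Small models: UD only pays off for quality headroom at low bits.
    if model_size_b < 4:
        return "UD" if prioritize_quality and target_bits <= 3 else "standard"
    # Models of 4B and up at Q5 or below: UD is the suggested default.
    return "UD"


print(recommend_quant_family(9, 4, near_vram_limit=False))
print(recommend_quant_family(0.8, 8, near_vram_limit=False))
print(recommend_quant_family(27, 6, near_vram_limit=True))
```

Checking the VRAM rule first reflects the 27B results, where a spill costs far more throughput than any format overhead.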
Benchmarks from Poor Paul's Benchmark using llama-server on an NVIDIA GeForce RTX 5090 (31.8 GB VRAM). Explore all results on the Leaderboard.