← Back to articles

Unsloth Dynamic vs Standard GGUF: When Mixed-Precision Quantization Pays Off

analysisqwenrtx-5090quantizationunsloth-dynamic

Unsloth Dynamic (UD) quantization is a mixed-precision technique that uses higher precision for sensitive layers and lower precision for less critical ones. The promise: better output quality than uniform quantization at the same average bit width.

But what about performance? Every UD GGUF in the Qwen3.5 family carries the -UD suffix and uses _XL format variants (Q8_K_XL, Q6_K_XL, etc.) along with ultra-low-bit imatrix quants (IQ2_XXS, IQ2_M, IQ3_XXS). We ran every combination on the RTX 5090 to answer: does mixed-precision quantization have a throughput cost?

0.8B: Standard vs UD — No Difference

Standard Quanttok/sUD Equivalenttok/sΔ
Q8_0777.3Q8_K_XL741.1-4.7%
Q6_K775.4Q6_K_XL739.8-4.6%
Q5_K_M732.2Q5_K_XL737.3+0.7%
Q4_K_M712.4Q4_K_XL694.2-2.6%
Q3_K_M752.6Q3_K_XL713.4-5.2%
Q2_K_XL775.1
IQ2_XXS773.7
IQ2_M746.2
IQ3_XXS713.5

At 0.8B, UD quants run 0–5% slower than their standard counterparts at equivalent quality levels. The XL format's mixed-precision overhead is minimal but measurable. The ultra-low quants (IQ2_XXS at 773.7 tok/s, Q2_K_XL at 775.1 tok/s) perform excellently — matching Q6_K speed while using a fraction of the memory.

Verdict: At 0.8B, no meaningful performance difference. Pick UD if you want quality headroom at low bit widths; pick standard for the very last drops of throughput.

2B: UD Starts to Shine

Standard Quanttok/sUD Equivalenttok/sΔ
Q8_0678.0Q8_K_XL597.3-11.9%
Q6_K690.9Q6_K_XL643.3-6.9%
Q4_0744.2Q2_K_XL714.0-4.1%
Q3_K_M657.6Q3_K_XL629.7-4.2%
IQ2_XXS679.8
IQ2_M664.4

At Q8 quality levels, the UD format shows an 11.9% throughput penalty (678 vs 597 tok/s). The mixed-precision overhead is more pronounced at higher bit widths because the XL format's variable block sizes add dequantization complexity.

But the ultra-low UD quants tell a different story: IQ2_XXS at 679.8 tok/s nearly matches standard Q8_0's 678.0 tok/s while using roughly 4× less memory. This is where UD earns its keep — you get Q8-comparable speed at 2-bit precision.

4B: The Overhead Grows

Standard Quanttok/sUD Equivalenttok/sΔ
Q8_0393.8Q8_K_XL349.1-11.3%
Q6_K403.6Q6_K_XL373.0-7.6%
Q4_K_M360.2Q4_K_XL356.3-1.1%
Q3_K_M386.1Q3_K_XL384.3-0.5%
IQ2_XXS428.7
IQ2_M407.6
Q2_K_XL427.3

The pattern solidifies at 4B:

  • High-bit UD quants cost 7–11% throughput vs standard equivalents
  • Low-bit UD quants are actually faster than standard high-bit quants (IQ2_XXS at 428.7 tok/s > Q8_0 at 393.8 tok/s)

Why? Because at 4B, memory bandwidth matters. The ultra-low quants reduce weight size so dramatically that even with mixed-precision decoding overhead, the net memory traffic is lower — and the RTX 5090's compute cores chew through the dequantization without blinking.

9B: UD Dominates the Leaderboard

RankQuanttok/sType
1IQ2_XXS-UD557.7UD
2Q3_K_XL-UD555.9UD
3Q4_K_XL-UD555.2UD
4IQ3_XXS-UD553.6UD
5IQ4_XS553.5Standard
6Q4_0552.5Standard
7Q5_K_XL-UD552.1UD
8Q4_1551.2Standard
9IQ2_M-UD548.3UD
10Q2_K_XL-UD543.7UD

At 9B, 7 of the top 10 performers are UD quants. The mixed-precision approach pays off handsomely at this model size because:

  1. The model is large enough that memory bandwidth is the primary bottleneck
  2. Ultra-low quants (IQ2_XXS, IQ2_M) shrink the model dramatically, freeing bandwidth
  3. The UD format's selective high-precision layers preserve quality where it matters

The throughput range is narrow (543–558 tok/s across the top 10), but the quality difference between IQ2_XXS and Q4_0 can be significant depending on your use case. UD quants give you the speed of 2-bit with (theoretically) better quality than uniform 2-bit.

27B: UD is Survival, Not Optimization

At 27B, UD quants serve a different purpose — they determine whether the model fits in VRAM at all.

QuantFits in VRAM?tok/sType
IQ4_XSYes218.0Standard
Q4_K_XLYes217.2UD
IQ4_NLYes217.4Standard
IQ3_XXSYes216.6UD
Q3_K_XLYes212.4UD
Q2_K_XLYes211.4UD
Q5_K_XLYes (barely)212.7UD
Q6_K_XLSpills55.3UD
Q8_K_XLSpills22.9UD

The UD variants at Q3–Q5 levels offer comparable throughput to standard quants (211–217 tok/s) while the UD format's mixed-precision potentially preserves more quality at the same bit width.

But Q6_K_XL and Q8_K_XL-UD both spill to system RAM, demonstrating that the XL format's per-layer scaling metadata adds memory overhead that can tip a borderline model over the VRAM edge. If your 27B model barely fits at Q6 standard, Q6_K_XL may not fit.

When Should You Use UD Quants?

Use UD when:

  • You're running ≥4B models and want the fastest possible throughput (ultra-low UD quants win)
  • You need low-bit quantization but care about output quality (mixed-precision helps)
  • You're at 9B where UD quants dominate the throughput leaderboard

Stick with standard when:

  • You're running 0.8B–2B models where the overhead isn't worth it
  • You want Q8+ quality (the XL format overhead is noticeable at high bit widths)
  • You're near the VRAM boundary at 27B (the XL metadata adds to model size)

The general rule: UD quants are a speed optimization at low bit widths and a quality optimization at all bit widths, but come with a small throughput penalty at high bit widths (Q6+, Q8+). The sweet spot is UD at Q3–Q4 levels on models ≥4B.


Benchmarks from Poor Paul's Benchmark using llama-server on an NVIDIA GeForce RTX 5090 (31.8 GB VRAM). Explore all results on the Leaderboard.