Unsloth Dynamic vs Standard GGUF: When Mixed-Precision Quantization Pays Off

Unsloth Dynamic (UD) quantization is a mixed-precision technique that uses higher precision for sensitive layers and lower precision for less critical ones. The promise: better output quality than uniform quantization at the same average bit width.

But what about performance? Every UD GGUF in the Qwen3.5 family carries the -UD suffix and uses _XL format variants (Q8_K_XL, Q6_K_XL, etc.) along with ultra-low-bit imatrix quants (IQ2_XXS, IQ2_M, IQ3_XXS). We ran every combination on the RTX 5090 to answer: does mixed-precision quantization have a throughput cost?

0.8B: Standard vs UD — No Difference

Standard Quant	tok/s	UD Equivalent	tok/s	Δ
Q8_0	777.3	Q8_K_XL	741.1	-4.7%
Q6_K	775.4	Q6_K_XL	739.8	-4.6%
Q5_K_M	732.2	Q5_K_XL	737.3	+0.7%
Q4_K_M	712.4	Q4_K_XL	694.2	-2.6%
Q3_K_M	752.6	Q3_K_XL	713.4	-5.2%
—	—	Q2_K_XL	775.1	—
—	—	IQ2_XXS	773.7	—
—	—	IQ2_M	746.2	—
—	—	IQ3_XXS	713.5	—

At 0.8B, UD quants run 0–5% slower than their standard counterparts at equivalent quality levels. The XL format's mixed-precision overhead is minimal but measurable. The ultra-low quants (IQ2_XXS at 773.7 tok/s, Q2_K_XL at 775.1 tok/s) perform excellently — matching Q6_K speed while using a fraction of the memory.

Verdict: At 0.8B, no meaningful performance difference. Pick UD if you want quality headroom at low bit widths; pick standard for the very last drops of throughput.

2B: UD Starts to Shine

Standard Quant	tok/s	UD Equivalent	tok/s	Δ
Q8_0	678.0	Q8_K_XL	597.3	-11.9%
Q6_K	690.9	Q6_K_XL	643.3	-6.9%
Q4_0	744.2	Q2_K_XL	714.0	-4.1%
Q3_K_M	657.6	Q3_K_XL	629.7	-4.2%
—	—	IQ2_XXS	679.8	—
—	—	IQ2_M	664.4	—

At Q8 quality levels, the UD format shows an 11.9% throughput penalty (678 vs 597 tok/s). The mixed-precision overhead is more pronounced at higher bit widths because the XL format's variable block sizes add dequantization complexity.

But the ultra-low UD quants tell a different story: IQ2_XXS at 679.8 tok/s nearly matches standard Q8_0's 678.0 tok/s while using roughly 4× less memory. This is where UD earns its keep — you get Q8-comparable speed at 2-bit precision.

4B: The Overhead Grows

Standard Quant	tok/s	UD Equivalent	tok/s	Δ
Q8_0	393.8	Q8_K_XL	349.1	-11.3%
Q6_K	403.6	Q6_K_XL	373.0	-7.6%
Q4_K_M	360.2	Q4_K_XL	356.3	-1.1%
Q3_K_M	386.1	Q3_K_XL	384.3	-0.5%
—	—	IQ2_XXS	428.7	—
—	—	IQ2_M	407.6	—
—	—	Q2_K_XL	427.3	—

The pattern solidifies at 4B:

High-bit UD quants cost 7–11% throughput vs standard equivalents
Low-bit UD quants are actually faster than standard high-bit quants (IQ2_XXS at 428.7 tok/s > Q8_0 at 393.8 tok/s)

Why? Because at 4B, memory bandwidth matters. The ultra-low quants reduce weight size so dramatically that even with mixed-precision decoding overhead, the net memory traffic is lower — and the RTX 5090's compute cores chew through the dequantization without blinking.

9B: UD Dominates the Leaderboard

Rank	Quant	tok/s	Type
1	IQ2_XXS-UD	557.7	UD
2	Q3_K_XL-UD	555.9	UD
3	Q4_K_XL-UD	555.2	UD
4	IQ3_XXS-UD	553.6	UD
5	IQ4_XS	553.5	Standard
6	Q4_0	552.5	Standard
7	Q5_K_XL-UD	552.1	UD
8	Q4_1	551.2	Standard
9	IQ2_M-UD	548.3	UD
10	Q2_K_XL-UD	543.7	UD

At 9B, 7 of the top 10 performers are UD quants. The mixed-precision approach pays off handsomely at this model size because:

The model is large enough that memory bandwidth is the primary bottleneck
Ultra-low quants (IQ2_XXS, IQ2_M) shrink the model dramatically, freeing bandwidth
The UD format's selective high-precision layers preserve quality where it matters

The throughput range is narrow (543–558 tok/s across the top 10), but the quality difference between IQ2_XXS and Q4_0 can be significant depending on your use case. UD quants give you the speed of 2-bit with (theoretically) better quality than uniform 2-bit.

27B: UD is Survival, Not Optimization

At 27B, UD quants serve a different purpose — they determine whether the model fits in VRAM at all.

Quant	Fits in VRAM?	tok/s	Type
IQ4_XS	Yes	218.0	Standard
Q4_K_XL	Yes	217.2	UD
IQ4_NL	Yes	217.4	Standard
IQ3_XXS	Yes	216.6	UD
Q3_K_XL	Yes	212.4	UD
Q2_K_XL	Yes	211.4	UD
Q5_K_XL	Yes (barely)	212.7	UD
Q6_K_XL	Spills	55.3	UD
Q8_K_XL	Spills	22.9	UD

The UD variants at Q3–Q5 levels offer comparable throughput to standard quants (211–217 tok/s) while the UD format's mixed-precision potentially preserves more quality at the same bit width.

But Q6_K_XL and Q8_K_XL-UD both spill to system RAM, demonstrating that the XL format's per-layer scaling metadata adds memory overhead that can tip a borderline model over the VRAM edge. If your 27B model barely fits at Q6 standard, Q6_K_XL may not fit.

When Should You Use UD Quants?

Use UD when:

You're running ≥4B models and want the fastest possible throughput (ultra-low UD quants win)
You need low-bit quantization but care about output quality (mixed-precision helps)
You're at 9B where UD quants dominate the throughput leaderboard

Stick with standard when:

You're running 0.8B–2B models where the overhead isn't worth it
You want Q8+ quality (the XL format overhead is noticeable at high bit widths)
You're near the VRAM boundary at 27B (the XL metadata adds to model size)

The general rule: UD quants are a speed optimization at low bit widths and a quality optimization at all bit widths, but come with a small throughput penalty at high bit widths (Q6+, Q8+). The sweet spot is UD at Q3–Q4 levels on models ≥4B.

Benchmarks from Poor Paul's Benchmark using llama-server on an NVIDIA GeForce RTX 5090 (31.8 GB VRAM). Explore all results on the Leaderboard.