DeepSeek-R1-Distill at 469 tok/s: Reasoning Models Are Finally Fast

There's a productivity tax with reasoning models: the chain-of-thought tokens generated before the actual answer. If your model runs at 20 tok/s, 500 thinking tokens take 25 seconds. At 469 tok/s, those same tokens take just over a second of decode time.
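The arithmetic behind that comparison, as a minimal sketch. The function name `thinking_latency_s` is ours, and the calculation assumes a single stream receives the full decode rate, which is an idealization.

```python
def thinking_latency_s(thinking_tokens: int, tokens_per_second: float) -> float:
    """Seconds spent generating chain-of-thought tokens before the answer starts."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return thinking_tokens / tokens_per_second


slow = thinking_latency_s(500, 20.0)   # 500 thinking tokens at 20 tok/s
fast = thinking_latency_s(500, 469.0)  # the same tokens at 469 tok/s
print(f"{slow:.1f} s vs {fast:.2f} s")  # prints "25.0 s vs 1.07 s"
```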

We benchmarked DeepSeek-R1-Distill-Qwen-32B on the RTX 5090, and the throughput numbers change how you think about the reasoning model penalty.
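For readers who want to reproduce this kind of measurement, here is one way the two metrics could be computed from per-token arrival timestamps. This is a sketch under our own assumptions, not a description of the harness used for this post: `StreamTrace` and the helper names are hypothetical, and the throughput here is an average over the whole window rather than a peak.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamTrace:
    """Timing record for one request; all times in seconds on a shared clock."""
    request_start: float
    token_times: List[float]  # arrival time of each generated token


def ttft_ms(trace: StreamTrace) -> float:
    """Time to first token, in milliseconds."""
    return (trace.token_times[0] - trace.request_start) * 1000.0


def avg_ttft_ms(traces: List[StreamTrace]) -> float:
    """Mean time to first token across all concurrent streams."""
    return sum(ttft_ms(t) for t in traces) / len(traces)


def aggregate_tok_per_s(traces: List[StreamTrace]) -> float:
    """Total tokens across all streams divided by the wall-clock window."""
    start = min(t.request_start for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)


# Two synthetic concurrent streams (made-up timestamps, for illustration only).
traces = [
    StreamTrace(0.0, [0.5, 1.0, 1.5, 2.0]),
    StreamTrace(0.0, [0.25, 0.75, 1.25, 2.0]),
]
print(avg_ttft_ms(traces), aggregate_tok_per_s(traces))
```

The point of the sketch is that throughput is summed over all streams, while TTFT is measured per request and then averaged.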

The Numbers

Quant	Concurrent Users	Peak tok/s	Avg TTFT
Q2_K	16	469.3	467 ms
Q2_K	8	153.5	315 ms
Q3_K_M	8	~140	~350 ms
Q4_K_M	4	~95	~280 ms
Q8_0	2	~55	~240 ms

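The Q2_K result at 16 users (469 tok/s) is striking. For context, this is a 32B-parameter model, in the same size class as Qwen3.5-27B, running faster than most 9B models at quality quantizations. Keep in mind that 469 tok/s is aggregate throughput across 16 concurrent streams, which averages out to roughly 29 tok/s per user.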
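Because the table reports throughput summed over all concurrent users, it can help to convert it to a per-user figure. A minimal sketch, assuming tokens are shared evenly across streams (real schedulers may not split them evenly); `per_user_tok_per_s` is our own helper name, and only the two exactly reported rows are used.

```python
def per_user_tok_per_s(aggregate_tok_per_s: float, concurrent_users: int) -> float:
    """Average decode rate seen by one user, assuming an even split."""
    if concurrent_users < 1:
        raise ValueError("concurrent_users must be at least 1")
    return aggregate_tok_per_s / concurrent_users


# (quant, concurrent users, aggregate tok/s) taken from the table above.
rows = [("Q2_K", 16, 469.3), ("Q2_K", 8, 153.5)]
for quant, users, aggregate in rows:
    rate = per_user_tok_per_s(aggregate, users)
    print(f"{quant} @ {users} users: {rate:.1f} tok/s per user")
# prints:
# Q2_K @ 16 users: 29.3 tok/s per user
# Q2_K @ 8 users: 19.2 tok/s per user
```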

The Q2_K Trade-off

Q2_K on a 32B model is aggressive: the format nominally stores about 2.6 bits per weight. You will see quality degradation. The distilled reasoning capability may help compensate: the distillation process specifically targets the chain-of-thought reasoning patterns from the R1 teacher model, and in our experience that distilled knowledge holds up under quantization better than raw instruction-following.
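To get a feel for what bits per weight means in memory terms, the weight footprint can be estimated as parameters × bits per weight ÷ 8. The sketch below uses round, illustrative bits-per-weight values rather than exact figures for any named quant, and it ignores the KV cache, activations, and any tensors kept at higher precision.

```python
def weight_size_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GiB."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 2**30


# Illustrative bits-per-weight values (assumptions, not measured file sizes).
for bpw in (2.5, 4.5, 8.5):
    print(f"32B params at {bpw} bpw: about {weight_size_gib(32, bpw):.1f} GiB")
```

Real model files are usually somewhat larger than this estimate, because mixed-precision schemes keep some tensors at a higher bit width.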

In practice: Q2_K for DeepSeek-R1-Distill feels meaningfully better than Q2_K for a non-distilled model of the same size. Whether it crosses your personal quality threshold for reasoning tasks depends on the task.

Throughput vs. Qwen3.5-27B (Similar-Size Comparison)

Model	Quant	Users	tok/s
DeepSeek-R1-Distill-Qwen-32B	Q2_K	16	469
Qwen3.5-27B	Q4_K_M	16	~110
Qwen3.5-27B	Q8_0	8	~75

The R1-Distill at Q2_K delivers roughly 4× the throughput of Qwen3.5-27B at Q4_K_M (469 vs. ~110 tok/s, both at 16 users), although the Qwen run uses a much higher-quality quant. If you're okay with the reasoning verbosity, R1-Distill at Q2_K is a legitimate choice for applications where speed matters more than response brevity.

When to Use R1-Distill Locally

Good use cases:

Tasks where correctness matters more than conciseness (math, coding, logical reasoning)
Single-user interactive sessions where you want maximum reasoning quality
Batch processing where high throughput matters alongside reasoning quality

Less ideal:

Short Q&A where thinking tokens add latency without helping
Systems with strict token budgets
Applications where Q4_K_M+ quality matters (use a smaller model at higher quant instead)

The Practical Takeaway

At 469 tok/s, the reasoning penalty of DeepSeek-R1-Distill-Qwen-32B drops from "annoying" to "barely noticeable" for multi-user workloads. If you have an RTX 5090 and use reasoning-heavy workflows, this model at Q2_K offers a compelling throughput-quality trade-off that wasn't practical at prior speed levels.

The threshold question: does the R1-Distill reasoning advantage justify the quant downgrade compared to running Qwen3.5-27B at Q4_K_M? On throughput alone, yes: responses arrive faster, at the cost of some raw text quality. Whether the reasoning is also stronger at Q2_K is something throughput numbers cannot show, so check it against your own tasks.
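One way to frame the speed side of that question is a break-even count: how many thinking tokens the faster model can spend before its total response time matches the slower model's. The sketch below is a simplified model under our own assumptions: both models produce the same number of answer tokens, both run at the same concurrency (so the per-user split cancels out), and prefill and TTFT are ignored. The 200-token answer length is a made-up example, and ~110 tok/s is the approximate figure from the comparison table.

```python
def break_even_thinking_tokens(
    answer_tokens: int, fast_tok_s: float, slow_tok_s: float
) -> float:
    """Thinking tokens at which (thinking + answer) / fast equals answer / slow."""
    if fast_tok_s <= 0 or slow_tok_s <= 0:
        raise ValueError("throughput values must be positive")
    return answer_tokens * (fast_tok_s / slow_tok_s - 1.0)


# R1-Distill Q2_K (469 tok/s) vs. Qwen3.5-27B Q4_K_M (~110 tok/s), 16 users each.
budget = break_even_thinking_tokens(answer_tokens=200, fast_tok_s=469.0, slow_tok_s=110.0)
print(f"Break-even at about {budget:.0f} thinking tokens")  # about 653
```

Under these assumptions, a 200-token answer preceded by fewer than about 650 thinking tokens still finishes sooner on the faster configuration; longer chains of thought erase the advantage.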