
DeepSeek-R1-Distill at 469 tok/s: Reasoning Models Are Finally Fast

Tags: deepseek, reasoning-models, chain-of-thought, rtx-5090, quantization

There's a productivity tax with reasoning models: the chain-of-thought tokens before the actual answer. If your model runs at 20 tok/s, 500 thinking tokens take 25 seconds. At 469 tok/s, those same tokens take just over a second.
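The arithmetic behind those two figures is just tokens divided by rate. A minimal sketch (the function name `thinking_delay_s` is ours, introduced only for illustration):

```python
def thinking_delay_s(thinking_tokens: int, tok_per_s: float) -> float:
    """Seconds spent generating chain-of-thought tokens before the answer starts."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return thinking_tokens / tok_per_s


# The two rates quoted in the paragraph above, for 500 thinking tokens.
for rate in (20.0, 469.0):
    delay = thinking_delay_s(500, rate)
    print(f"{rate:>6.0f} tok/s -> {delay:.2f} s of thinking")
```

This treats the quoted rate as the speed a single response sees; it ignores prompt processing and time to first token.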

We benchmarked DeepSeek-R1-Distill-Qwen-32B on the RTX 5090, and the throughput numbers change how you think about the reasoning model penalty.

The Numbers

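| Quant  | Concurrent users | Peak tok/s | Avg TTFT |
|--------|------------------|------------|----------|
| Q2_K   | 16               | 469.3      | 467 ms   |
| Q2_K   | 8                | 153.5      | 315 ms   |
| Q3_K_M | 8                | ~140       | ~350 ms  |
| Q4_K_M | 4                | ~95        | ~280 ms  |
| Q8_0   | 2                | ~55        | ~240 ms  |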

The Q2_K result at 16 users (469 tok/s) is striking. For context, this is a 32B-parameter model, in the same size class as Qwen3.5-27B, running faster than most 9B models at quality quantizations.
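The table does not state whether "Peak tok/s" is summed over all concurrent users or measured per stream. If it is an aggregate (an assumption on our part, not something the benchmark confirms), the rate each user sees and the wait before an answer begins can be estimated like this. The helper names are hypothetical:

```python
# (quant, concurrent users, peak tok/s, avg TTFT in ms), copied from the table above.
# Entries marked "~" in the table are used at their stated value.
RESULTS = [
    ("Q2_K", 16, 469.3, 467),
    ("Q2_K", 8, 153.5, 315),
    ("Q3_K_M", 8, 140.0, 350),
    ("Q4_K_M", 4, 95.0, 280),
    ("Q8_0", 2, 55.0, 240),
]


def per_stream_rate(total_tok_per_s: float, users: int) -> float:
    # ASSUMPTION: the reported figure is the sum over all concurrent streams,
    # shared evenly between users.
    return total_tok_per_s / users


def time_to_answer_s(thinking_tokens: int, total_tok_per_s: float,
                     users: int, ttft_ms: float) -> float:
    # Time to first token plus the time to emit the thinking tokens
    # at the per-stream rate.
    return ttft_ms / 1000 + thinking_tokens / per_stream_rate(total_tok_per_s, users)


for quant, users, rate, ttft in RESULTS:
    stream = per_stream_rate(rate, users)
    wait = time_to_answer_s(500, rate, users, ttft)
    print(f"{quant:<7} {users:>2} users: {stream:6.1f} tok/s per stream, "
          f"{wait:5.1f} s before a 500-token think finishes")
```

If the figures are per stream instead, skip the division and use the table values directly.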

The Q2_K Trade-off

Q2_K on a 32B model is aggressive: nominally about 2.6 bits per weight (2.5625 in llama.cpp's k-quant layout). You will see quality degradation. The distilled reasoning capability helps compensate: the distillation process specifically targets the chain-of-thought reasoning patterns from the R1 teacher model, and that distilled knowledge seems to hold up better under quantization than raw instruction-following.
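To see what a bits-per-weight figure means for memory, a rough estimate is parameters × bits ÷ 8. The sketch below uses round, illustrative bit widths rather than the exact figure for any named quant, takes "32B" at face value, and ignores KV cache and runtime overhead:

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes); excludes KV cache and overhead."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 32e9  # "32B" taken literally; the real parameter count differs slightly

# Illustrative bit widths only; substitute the effective figure for your quant.
for bpw in (2.0, 3.0, 4.0, 8.0):
    print(f"{bpw:.1f} bits/weight -> ~{weight_size_gb(N_PARAMS, bpw):.0f} GB of weights")
```

The fewer bits per weight, the more VRAM is left over for the KV cache, which is what concurrent users consume.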

In practice: Q2_K for DeepSeek-R1-Distill feels meaningfully better than Q2_K for a non-distilled model of the same size. Whether it crosses your personal quality threshold for reasoning tasks depends on the task.

Throughput vs. Qwen3.5-27B (Same-Size Comparison)

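| Model                        | Quant  | Users | tok/s |
|------------------------------|--------|-------|-------|
| DeepSeek-R1-Distill-Qwen-32B | Q2_K   | 16    | 469   |
| Qwen3.5-27B                  | Q4_K_M | 16    | ~110  |
| Qwen3.5-27B                  | Q8_0   | 8     | ~75   |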

R1-Distill at Q2_K is roughly 4× faster than Qwen3.5-27B at Q4_K_M (469 vs. ~110 tok/s, both at 16 users); keep in mind that Qwen is running a much higher-quality quant in that comparison. If you're okay with the reasoning verbosity, R1-Distill at Q2_K is a legitimate choice for applications where speed matters more than response brevity.
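The ratio can be checked directly from the comparison table. A small sketch, with the approximate ("~") entries used at face value:

```python
R1_Q2K = 469.0      # DeepSeek-R1-Distill-Qwen-32B, Q2_K, 16 users
QWEN_Q4KM = 110.0   # Qwen3.5-27B, Q4_K_M, 16 users (approximate in the table)
QWEN_Q8 = 75.0      # Qwen3.5-27B, Q8_0, 8 users (approximate in the table)


def speedup(fast_tok_per_s: float, slow_tok_per_s: float) -> float:
    """How many times faster the first rate is than the second."""
    return fast_tok_per_s / slow_tok_per_s


print(f"vs Q4_K_M at 16 users: {speedup(R1_Q2K, QWEN_Q4KM):.2f}x")
# The Q8_0 row was measured at 8 users, so this ratio is not like-for-like.
print(f"vs Q8_0 at 8 users:    {speedup(R1_Q2K, QWEN_Q8):.2f}x")
```

Only the Q4_K_M row shares the 16-user setting with the R1-Distill result, so that is the ratio the "4×" claim rests on.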

When to Use R1-Distill Locally

Good use cases:

  • Tasks where correctness matters more than conciseness (math, coding, logical reasoning)
  • Single-user interactive sessions where you want maximum reasoning quality
  • Batch processing where you need high throughput without giving up reasoning quality

Less ideal:

  • Short Q&A where thinking tokens add latency without helping
  • Systems with strict token budgets
  • Applications where Q4_K_M+ quality matters (use a smaller model at higher quant instead)
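To judge whether thinking tokens are worth it for a given workload, it helps to measure how much of each response is reasoning. R1-style models wrap their chain of thought in `<think>` tags; the sketch below assumes both the opening and closing tag appear in the returned text, and counts whitespace-separated words as a crude stand-in for tokens. The function names are hypothetical.

```python
import re

# Non-greedy match so multiple <think> blocks are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) for a response containing <think>...</think> blocks."""
    reasoning = "\n".join(block.strip() for block in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


def reasoning_share(text: str) -> float:
    """Fraction of words spent on reasoning (word count approximates token count)."""
    reasoning, answer = split_reasoning(text)
    r, a = len(reasoning.split()), len(answer.split())
    return r / (r + a) if (r + a) else 0.0


sample = "<think>2 + 2 is 4. Double-check: yes.</think>The answer is 4."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:   ", answer)
print(f"share of words spent thinking: {reasoning_share(sample):.0%}")
```

For short Q&A, a high reasoning share with no accuracy gain is the signal that a non-reasoning model is the better fit. For real token budgets, replace the word count with your model's tokenizer.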

The Practical Takeaway

At 469 tok/s, the reasoning penalty of DeepSeek-R1-Distill-Qwen-32B drops from "annoying" to "barely noticeable" for multi-user workloads. If you have an RTX 5090 and run reasoning-heavy workflows, this model at Q2_K offers a compelling throughput-quality trade-off that wasn't practical at lower speeds.

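The threshold question: does the R1-Distill reasoning advantage justify the quant downgrade compared to running Qwen3.5-27B at Q4_K_M? Based on the throughput data alone, yes: responses arrive roughly 4× faster (469 vs. ~110 tok/s at 16 users), and you keep R1-style reasoning at the cost of some raw text quality. Whether that quality cost is acceptable is not something the throughput numbers can settle; as noted earlier, it depends on the task.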