DeepSeek-R1-Distill at 469 tok/s: Reasoning Models Are Finally Fast
There's a productivity tax with reasoning models: the chain-of-thought tokens before the actual answer. If your model runs at 20 tok/s, 500 thinking tokens take 25 seconds. At 469 tok/s, those same tokens take just over a second.
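The arithmetic behind that tax is simple enough to sketch. The helper below is purely illustrative (the name `thinking_seconds` is ours, not from any library) and assumes the quoted rate is the rate at which a single response's tokens arrive:

```python
def thinking_seconds(thinking_tokens: int, tok_per_s: float) -> float:
    """Seconds spent generating chain-of-thought tokens before the answer starts."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return thinking_tokens / tok_per_s


if __name__ == "__main__":
    # The two rates quoted in the paragraph above.
    for rate in (20.0, 469.0):
        secs = thinking_seconds(500, rate)
        print(f"{rate:6.1f} tok/s -> {secs:5.2f} s for 500 thinking tokens")
```

Swap in your own thinking-token counts; reasoning models often emit far more than 500 tokens on hard prompts, and the wait scales linearly.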
We benchmarked DeepSeek-R1-Distill-Qwen-32B on the RTX 5090, and the throughput numbers change how you think about the reasoning model penalty.
The Numbers
| Quant | Concurrent Users | Peak tok/s | Avg TTFT |
|---|---|---|---|
| Q2_K | 16 | 469.3 | 467 ms |
| Q2_K | 8 | 153.5 | 315 ms |
| Q3_K_M | 8 | ~140 | ~350 ms |
| Q4_K_M | 4 | ~95 | ~280 ms |
| Q8_0 | 2 | ~55 | ~240 ms |
The Q2_K result at 16 users (469 tok/s) is striking. For context, this is a 32B-parameter model, in the same size class as Qwen3.5-27B, running faster than most 9B models do at higher-quality quantizations.
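One way to read the table is per stream rather than in total. The sketch below assumes the "Peak tok/s" column is aggregate throughput summed over all concurrent users and split evenly between them; the table does not state this, so treat it as an assumption. The approximate ("~") rows are entered as if exact, and the `Row` type and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    quant: str
    users: int
    peak_tok_s: float  # assumed to be the aggregate over all users
    ttft_ms: float     # average time to first token


# Values copied from the table above; "~" figures are treated as exact.
ROWS = [
    Row("Q2_K", 16, 469.3, 467),
    Row("Q2_K", 8, 153.5, 315),
    Row("Q3_K_M", 8, 140.0, 350),
    Row("Q4_K_M", 4, 95.0, 280),
    Row("Q8_0", 2, 55.0, 240),
]


def per_stream_tok_s(row: Row) -> float:
    """Throughput seen by one user, assuming an even split of the aggregate."""
    return row.peak_tok_s / row.users


def seconds_until_answer(row: Row, thinking_tokens: int = 500) -> float:
    """TTFT plus the time one stream needs to emit its thinking tokens."""
    return row.ttft_ms / 1000 + thinking_tokens / per_stream_tok_s(row)


if __name__ == "__main__":
    for row in ROWS:
        print(f"{row.quant:7s} users={row.users:2d} "
              f"per-stream={per_stream_tok_s(row):6.2f} tok/s "
              f"500 thinking tokens in {seconds_until_answer(row):6.2f} s")
```

If the benchmark instead reports per-stream rates, skip the division and use the column as is.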
The Q2_K Trade-off
Q2_K on a 32B model is aggressive: roughly 2.6 bits per weight in llama.cpp's Q2_K block layout. You will see quality degradation. The distilled reasoning capability helps compensate: the distillation process specifically targets the chain-of-thought reasoning patterns from the R1 teacher model, and distilled knowledge may be more robust to quantization than raw instruction-following.
In practice: Q2_K for DeepSeek-R1-Distill feels meaningfully better than Q2_K for a non-distilled model of the same size. Whether it crosses your personal quality threshold for reasoning tasks depends on the task.
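To see why such a low quant is tempting on a single GPU, here is a rough weight-memory estimate. It is a back-of-the-envelope sketch, not a file-size predictor: the bits-per-weight values are the nominal block-format figures from llama.cpp's quantization types (an assumption on our part), real GGUF files mix tensor types and so average higher, and the KV cache and activations are ignored. The parameter count is taken as a flat 32 billion.

```python
# Nominal bits per weight for llama.cpp block formats (assumed values;
# mixed presets such as Q4_K_M keep some tensors at higher precision).
NOMINAL_BPW = {
    "Q2_K": 2.5625,
    "Q3_K": 3.4375,
    "Q4_K": 4.5,
    "Q8_0": 8.5,
}


def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 32e9  # nominal 32B parameters
    for name, bpw in NOMINAL_BPW.items():
        print(f"{name:5s} ~{weight_gb(n_params, bpw):5.1f} GB of weights")
```

Whatever memory the weights leave free goes to the KV cache, which is what makes higher concurrency possible at lower quants.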
Throughput vs. Qwen3.5-27B (Similar-Size Comparison)
| Model | Quant | Users | tok/s |
|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-32B | Q2_K | 16 | 469 |
| Qwen3.5-27B | Q4_K_M | 16 | ~110 |
| Qwen3.5-27B | Q8_0 | 8 | ~75 |
The R1-Distill at Q2_K is roughly 4× faster than Qwen3.5-27B at Q4_K_M (469 vs. ~110 tok/s, both at 16 users), with the caveat that Qwen is running a much higher-quality quant. If you're okay with the reasoning verbosity, R1-Distill at Q2_K is a legitimate choice for applications where speed matters more than response brevity.
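The ratios can be checked directly from the table. This is a small illustrative script with hypothetical names; the approximate ("~") figures are treated as exact, and note that the Q8_0 row was measured at 8 users rather than 16, so that ratio is not like-for-like.

```python
# (model, quant, users) -> tok/s, copied from the comparison table.
RUNS = {
    ("DeepSeek-R1-Distill-Qwen-32B", "Q2_K", 16): 469.0,
    ("Qwen3.5-27B", "Q4_K_M", 16): 110.0,
    ("Qwen3.5-27B", "Q8_0", 8): 75.0,
}


def speedup(fast_tok_s: float, slow_tok_s: float) -> float:
    """How many times more tokens per second the faster run produces."""
    return fast_tok_s / slow_tok_s


if __name__ == "__main__":
    baseline = RUNS[("DeepSeek-R1-Distill-Qwen-32B", "Q2_K", 16)]
    for (model, quant, users), rate in RUNS.items():
        if model == "Qwen3.5-27B":
            print(f"vs {model} {quant} ({users} users): "
                  f"{speedup(baseline, rate):.2f}x")
```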
When to Use R1-Distill Locally
Good use cases:
- Tasks where correctness matters more than conciseness (math, coding, logical reasoning)
- Single-user interactive sessions where you want maximum reasoning quality
- Batch processing where high throughput at acceptable quality is important
Less ideal:
- Short Q&A where thinking tokens add latency without helping
- Systems with strict token budgets
- Applications where Q4_K_M+ quality matters (use a smaller model at higher quant instead)
The Practical Takeaway
At 469 tok/s, DeepSeek-R1-Distill-Qwen-32B's reasoning penalty is reduced from "annoying" to "barely noticeable" for multi-user workloads. If you have an RTX 5090 and use reasoning-heavy workflows, this model at Q2_K offers a compelling throughput-quality trade-off that wasn't practical at earlier speeds.
The threshold question: does the R1-Distill reasoning advantage justify the quant downgrade compared to running Qwen3.5-27B at Q4_K_M? Based on the throughput data alone, yes: responses come faster, and if the distilled reasoning holds up at Q2_K on your tasks, you also get stronger reasoning at the cost of some raw text quality.