gpt-oss-20b at 1,491 tok/s: What OpenAI's Open Weights Can Do Locally
The number in the headline is real: 1,491 tok/s on an RTX 5090 with gpt-oss-20b-Q4_1.gguf at 32 concurrent users. It's the highest throughput we've recorded for any model above 7B parameters in our entire dataset.
The Data
From 17,110 RTX 5090 rows, the top performers for gpt-oss-20b:
| Quant | Concurrent Users | Peak tok/s | Avg TTFT |
|---|---|---|---|
| Q4_1 | 32 | 1,491.1 | 200 ms |
| Q4_1 | 32 | 1,479.9 | 207 ms |
| Q4_1 | 32 | 1,451.8 | 212 ms |
| Q4_0 | 32 | ~1,380 | 225 ms |
| Q8_0 | 16 | ~890 | 180 ms |
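For reference, Qwen3.5-27B on the same hardware peaks at around 150–200 tok/s at 32 users. gpt-oss-20b is roughly 7–10× faster, and it has 7B fewer parameters. Part of the explanation is architectural: gpt-oss-20b is a mixture-of-experts model, so only a fraction of its weights (about 3.6B parameters) are active for each token.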
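The speedup range can be reproduced from the figures quoted above. This is a minimal sketch; the constant and function names are illustrative and not part of any benchmark tooling.

```python
# Peak aggregate throughput for gpt-oss-20b Q4_1 at 32 users (from the table).
GPT_OSS_PEAK_TOK_S = 1491.1
# Quoted range for Qwen3.5-27B on the same hardware at 32 users.
QWEN_RANGE_TOK_S = (150.0, 200.0)


def speedup_range(peak, baseline_range):
    """Return (lowest, highest) speedup of `peak` over a baseline range."""
    low_baseline, high_baseline = baseline_range
    # Dividing by the larger baseline gives the smaller speedup, and vice versa.
    return peak / high_baseline, peak / low_baseline


low, high = speedup_range(GPT_OSS_PEAK_TOK_S, QWEN_RANGE_TOK_S)
print(f"{low:.1f}x to {high:.1f}x")  # prints "7.5x to 9.9x"
```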
Why 20B Is a Sweet Spot for RTX 5090
The RTX 5090 has 32 GB of GDDR7 memory with roughly 1.8 TB/s of bandwidth. At Q4_1 quantization, gpt-oss-20b weighs approximately 11 GB, so it fits entirely in VRAM with substantial room left over for KV-cache. That means:
- No model offloading to system RAM
- Large KV-cache available for all 32 concurrent sessions
- Full GDDR7 bandwidth available for every token generation
The combination keeps latency low even under load: an average TTFT of 200 ms at 32 concurrent users is fast enough for interactive production serving.
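To put a rough number on the headroom, the sketch below splits the VRAM left after loading the weights evenly across sessions. The 32 GB and 11 GB figures come from the text; the runtime overhead value is an assumption for illustration (compute buffers and CUDA context vary by runtime and settings), and real KV-cache usage depends on context length and cache configuration.

```python
VRAM_GB = 32.0             # RTX 5090 memory, from the text
WEIGHTS_GB = 11.0          # approximate Q4_1 weight size, from the text
RUNTIME_OVERHEAD_GB = 1.5  # ASSUMPTION: buffers and context, not a measured value
SESSIONS = 32              # concurrent users in the benchmark


def kv_budget_per_session_gb(vram_gb, weights_gb, overhead_gb, sessions):
    """Split the VRAM left after weights and overhead evenly across sessions."""
    free_gb = vram_gb - weights_gb - overhead_gb
    if free_gb <= 0:
        raise ValueError("model does not fit in VRAM")
    return free_gb / sessions


per_session = kv_budget_per_session_gb(
    VRAM_GB, WEIGHTS_GB, RUNTIME_OVERHEAD_GB, SESSIONS
)
print(f"{per_session:.2f} GB of KV-cache per session")  # prints "0.61 GB ..."
```

Under these assumptions each session gets about 0.6 GB for its KV-cache. Changing `RUNTIME_OVERHEAD_GB` or `SESSIONS` shows how quickly that budget shrinks with a heavier model or more users.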
The Quality Question
Raw throughput benchmarks don't measure quality. gpt-oss-20b is OpenAI's release, which carries expectations of frontier-model quality. Whether a 20B parameter model at Q4_1 quantization lives up to that is something our qualitative benchmark suite will address as we collect more data.
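What we can say: at Q4_1, you're losing approximately 2–3% of quality relative to Q8_0, as estimated from typical perplexity degradation curves. For serving workloads, the roughly 40% throughput gain from Q4_1 over Q8_0 is almost always worth that trade-off. Note that the Q8_0 row in the table above was measured at 16 users, so it is not a like-for-like comparison with the 32-user Q4_1 runs.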
Practical Deployment Scenario
If you're running a small team's AI assistant on an RTX 5090:
- gpt-oss-20b-Q4_1 at 32 concurrent users: 1,491 tok/s aggregate, 200 ms TTFT
- That's approximately 46 tok/s per user at full saturation
- For typical chat responses (200–500 tokens), that's 4–11 seconds per response under full load
- Most real-world usage sees 5–8 simultaneous users, where we estimate roughly 130–280 tok/s per user, fast enough for real-time conversation
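The full-load figures above can be checked with a few lines of arithmetic. This sketch assumes aggregate throughput is shared evenly among users and that TTFT adds a fixed delay before generation starts; the function names are illustrative. It only covers the 32-user case, since aggregate throughput at lower concurrency is not given here.

```python
AGGREGATE_TOK_S = 1491.0  # peak aggregate throughput at 32 users, from the text
USERS = 32
TTFT_S = 0.2              # average time to first token, from the text


def per_user_tok_s(aggregate_tok_s, users):
    """Per-user decode rate, assuming throughput is shared evenly."""
    return aggregate_tok_s / users


def response_seconds(tokens, tok_s, ttft_s=0.0):
    """Wall-clock time for one response: TTFT plus generation time."""
    return ttft_s + tokens / tok_s


rate = per_user_tok_s(AGGREGATE_TOK_S, USERS)
print(f"{rate:.1f} tok/s per user")  # prints "46.6 tok/s per user"
for tokens in (200, 500):
    seconds = response_seconds(tokens, rate, ttft_s=TTFT_S)
    print(f"{tokens} tokens: {seconds:.1f} s")  # 4.5 s and 10.9 s
```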
The Implication for "Frontier Local"
For years, "frontier quality locally" meant accepting 20–40 tok/s. gpt-oss-20b changes that calculus: 1,491 tok/s in aggregate, or about 46 tok/s for each of 32 simultaneous users. You can serve frontier-adjacent quality at speeds that make local inference feel competitive with API calls.
This is the moment local LLM infrastructure has been building toward.