gpt-oss-20b at 1,491 tok/s: What OpenAI's Open Weights Can Do Locally

The number in the headline is real: 1,491 tok/s of aggregate throughput on an RTX 5090 with gpt-oss-20b-Q4_1.gguf at 32 concurrent users. It's the highest throughput we've recorded for any model above 7B parameters in our entire dataset.

The Data

From 17,110 RTX 5090 rows, the top performers for gpt-oss-20b:

Quant	Concurrent Users	Peak tok/s	Avg TTFT
Q4_1	32	1,491.1	200 ms
Q4_1	32	1,479.9	207 ms
Q4_1	32	1,451.8	212 ms
Q4_0	32	~1,380	225 ms
Q8_0	16	~890	180 ms

For reference: Qwen3.5-27B on the same hardware peaks around 150–200 tok/s at 32 users. gpt-oss-20b is 7–10× faster — and it has 7B fewer parameters.
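
The 7–10× figure can be checked directly from the numbers quoted above. A minimal sketch, where the constant and function names are illustrative and not part of our benchmark tooling:

```python
# Peak aggregate throughput for gpt-oss-20b Q4_1 at 32 users (from the table above).
GPT_OSS_PEAK_TOK_S = 1491.1
# Quoted peak range for Qwen3.5-27B on the same GPU at 32 users.
QWEN_PEAK_RANGE_TOK_S = (150.0, 200.0)


def speedup_range(fast_tok_s, baseline_range):
    """Return (min, max) speedup of fast_tok_s over a (low, high) baseline range."""
    low, high = baseline_range
    # The smallest speedup is against the fastest baseline, and vice versa.
    return fast_tok_s / high, fast_tok_s / low


min_speedup, max_speedup = speedup_range(GPT_OSS_PEAK_TOK_S, QWEN_PEAK_RANGE_TOK_S)
print(f"{min_speedup:.1f}x to {max_speedup:.1f}x")  # 7.5x to 9.9x
```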

Why 20B Is a Sweet Spot for RTX 5090

The RTX 5090 has 32 GB of GDDR7 memory with roughly 1.8 TB/s of bandwidth. At Q4_1 quantization, gpt-oss-20b weighs approximately 11 GB, so it fits entirely in VRAM with substantial room left over for KV-cache. That means:

No model offloading to system RAM
Large KV-cache available for all 32 concurrent sessions
Full GDDR7 bandwidth available for every token generation
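
To put a rough number on the room left over, here is a back-of-the-envelope sketch using the figures above. The even split across sessions and the optional `overhead_gb` allowance are assumptions for illustration. Whether the per-session share is enough depends on context length and the model's attention configuration, which this sketch does not estimate.

```python
# Figures from the text above; the even per-session split is an assumption.
VRAM_GB = 32.0   # RTX 5090 total VRAM
MODEL_GB = 11.0  # approximate size of gpt-oss-20b at Q4_1
SESSIONS = 32    # concurrent users in the benchmark


def kv_budget_per_session(vram_gb, model_gb, sessions, overhead_gb=0.0):
    """Return (free_gb, per_session_gb) once the weights are loaded.

    overhead_gb is a hypothetical allowance for activations, runtime
    buffers, and similar; it defaults to zero here.
    """
    free_gb = vram_gb - model_gb - overhead_gb
    if free_gb <= 0:
        raise ValueError("model does not fit in VRAM")
    return free_gb, free_gb / sessions


free_gb, per_session_gb = kv_budget_per_session(VRAM_GB, MODEL_GB, SESSIONS)
print(f"{free_gb:.0f} GB free, about {per_session_gb * 1000:.0f} MB per session")
# 21 GB free, about 656 MB per session
```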

The combination produces high throughput with low latency: a 200 ms TTFT at 32 users is production-grade.

The Quality Question

Raw throughput benchmarks don't measure quality. gpt-oss-20b is OpenAI's release, which carries expectations of frontier-model quality. Whether a 20B parameter model at Q4_1 quantization lives up to that is something our qualitative benchmark suite will address as we collect more data.

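What we can say: at Q4_1, you should expect roughly a 2–3% quality loss relative to Q8_0. That is an estimate based on typical perplexity degradation curves, not a measurement on this model. The roughly 40% throughput gain from Q4_1 over Q8_0 is almost always worth that trade-off for serving workloads. Note that the Q8_0 row in the table peaks at 16 users, so it is not a like-for-like comparison with the 32-user Q4_1 rows.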

Practical Deployment Scenario

If you're running a small team's AI assistant on an RTX 5090:

gpt-oss-20b-Q4_1 at 32 concurrent users: 1,491 tok/s aggregate, 200 ms TTFT
That's approximately 47 tok/s per user at full saturation
For typical chat responses (200–500 tokens), that's 4–11 seconds per response under full load
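Most real-world usage sees 5–8 simultaneous users, where we project 130–280 tok/s per user, fast enough for real-time conversation. Treat this as a projection: aggregate throughput at 5–8 users is typically lower than the 32-user peak, so it cannot be derived by simply dividing 1,491 tok/s by the user count.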
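
The full-load figures above can be reproduced with a few lines of arithmetic. This sketch assumes throughput is shared evenly across users and adds the average TTFT to each response; the function and constant names are illustrative.

```python
AGGREGATE_TOK_S = 1491.1  # peak aggregate throughput at 32 users
USERS = 32
TTFT_S = 0.200            # average time to first token, in seconds


def response_time_s(tokens, per_user_tok_s, ttft_s=0.0):
    """Seconds to deliver `tokens` at a steady per-user rate, plus TTFT."""
    return ttft_s + tokens / per_user_tok_s


# Assumption: every user gets an equal share of the aggregate rate.
per_user = AGGREGATE_TOK_S / USERS
print(f"{per_user:.1f} tok/s per user")  # 46.6 tok/s per user

for tokens in (200, 500):
    seconds = response_time_s(tokens, per_user, TTFT_S)
    print(f"{tokens} tokens: {seconds:.1f} s")
# 200 tokens: 4.5 s
# 500 tokens: 10.9 s
```

Real sessions rarely share the GPU this evenly, so these numbers are best read as averages under saturation and not as per-request guarantees.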

The Implication for "Frontier Local"

For years, "frontier quality locally" meant accepting 20–40 tok/s. gpt-oss-20b at 1,491 tok/s aggregate across 32 users changes that calculus entirely. You can serve frontier-adjacent quality at speeds that make local inference feel competitive with API calls.

This is the moment local LLM infrastructure has been building toward.