Context Length Scaling: How Throughput Holds Up from 2K to 130K Tokens

Most benchmark results you see online use 2,048-token context windows. That's fine for testing, but real-world workloads often need 8K, 32K, or even 128K contexts for code analysis, document summarization, or long conversations.

We collected throughput data at five context lengths — 2K, 8K, 32K, 65K, and 130K — on the machines that could handle the full range. Here's what we found.

The Data: GB10 with Qwen3.6-35B-A3B-MXFP4_MOE

The GB10 is the only machine in our fleet that ran all five context lengths. Using Qwen3.6-35B-A3B-MXFP4_MOE as our reference model (a dash marks a combination with no reported result):

Context	1 User tok/s	8 Users tok/s	32 Users tok/s
2,048	61.9	—	—
8,192	61.9	57.1	—
32,768	62.0	52.9	168.2
65,536	61.9	57.1	163.6
130,064	61.7	56.8	168.1

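The single-user throughput is almost completely flat from 2K to 130K contexts, staying between 61.7 and 62.0 tok/s. That's remarkable, and raw memory bandwidth doesn't explain it: the GB10 uses LPDDR5X unified memory (~273 GB/s), not HBM3e. The attention mechanism's KV-cache grows with context length, but in these runs it never became the bottleneck. A likely factor is capacity: the GB10's 128 GB unified memory pool holds both the model and the cache at every context length we tested.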

At 32 concurrent users, throughput varies by less than 3% between 32K and 130K context (168.2, 163.6, and 168.1 tok/s), which is effectively noise.
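A quick way to check how flat these numbers are is to compute the percent change against the shortest context in each column. The sketch below does that for the GB10 table values above; the helper names `pct_change` and `report` are illustrative choices, not part of any benchmark tool.

```python
# Throughput (tok/s) copied from the GB10 table above, keyed by context length.
single_user = {2_048: 61.9, 8_192: 61.9, 32_768: 62.0, 65_536: 61.9, 130_064: 61.7}
thirty_two_users = {32_768: 168.2, 65_536: 163.6, 130_064: 168.1}


def pct_change(baseline: float, value: float) -> float:
    """Signed percent change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


def report(label: str, column: dict[int, float]) -> None:
    """Print each row's change relative to the shortest context in the column."""
    base_ctx = min(column)
    base = column[base_ctx]
    print(f"{label} (baseline: {base_ctx} tokens)")
    for ctx in sorted(column):
        change = pct_change(base, column[ctx])
        print(f"  {ctx:>7} tokens: {column[ctx]:6.1f} tok/s ({change:+.1f}%)")


report("1 user", single_user)
report("32 users", thirty_two_users)
```

In the 32-user column the largest deviation is the 65K row, at roughly -2.7% against the 32K baseline, while the 130K row lands within 0.1% of it.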

RTX 5090: Context Penalty Is Real

The RTX 5090 tells a different story. Using Qwen3.5-27B-Q4_K_M as the reference:

Context	1 User tok/s	8 Users tok/s	16 Users tok/s
2,048	~85	~180	~220
8,192	~80	~160	~195
32,768	~70	~120	~145
65,536	~55	~85	~95

(Estimated from llama-bench prompt-processing (PP) data in our dataset; all values are approximate.)

At 65K context, single-user throughput drops ~35% vs 2K (from ~85 to ~55 tok/s). The RTX 5090's GDDR7 bandwidth is excellent (~1.8 TB/s), but its 32 GB of VRAM is a hard ceiling: a 27B model takes up a large share of it, leaving less room for the KV-cache. Once the cache no longer fits, entries get evicted and performance degrades.
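To get a feel for how quickly the KV-cache eats into VRAM as context grows, here is a rough sketch using the standard sizing formula for full attention: keys plus values, for every layer, for every token. The layer count, KV head count, and head dimension are hypothetical placeholders, not the actual Qwen3.5-27B configuration, and the 2-byte element size assumes an fp16 cache.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size for one sequence: keys and values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical model shape, chosen only for illustration (NOT a real model config).
N_LAYERS = 64
N_KV_HEADS = 8
HEAD_DIM = 128

GIB = 1024 ** 3


def kv_cache_gib(context_len: int) -> float:
    """KV-cache size in GiB for the hypothetical shape above with an fp16 cache."""
    return kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, context_len) / GIB


for ctx in (2_048, 8_192, 32_768, 65_536):
    print(f"{ctx:>6} tokens: {kv_cache_gib(ctx):5.2f} GiB per sequence")
```

With these placeholder values the cache grows linearly from 0.5 GiB at 2K to 16 GiB at 65K for a single sequence, on top of the model weights. Models that quantize the cache or use full attention in only some layers will need less, so treat this as an upper-end estimate for your own configuration.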

Mac mini M4 Pro: The Unified Memory Advantage

The Mac mini M4 Pro shows behavior closer to the GB10 than the RTX 5090 for context scaling, because its 64 GB of LPDDR5X unified memory means the GPU and CPU share the same physical memory pool. KV-cache can grow into system memory without a PCIe copy overhead.

The throughput penalty for long contexts is real (the M4 Pro's LPDDR5X bandwidth is ~273 GB/s, far below the RTX 5090's ~1.8 TB/s), but the ability to host 64K+ token contexts at all without OOM errors is a significant advantage over discrete GPUs with limited VRAM.
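Single-stream token generation is usually limited by how fast weights and KV-cache can be streamed from memory, so a rough ceiling is memory bandwidth divided by the bytes read per generated token. The sketch below applies that rule of thumb. The per-token read sizes are made-up placeholders and the bandwidth values are vendor peak figures, so the results are upper bounds under stated assumptions, not measurements from our dataset.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         weights_gb_per_token: float,
                         kv_gb_per_token: float = 0.0) -> float:
    """Upper bound on single-stream decode speed for a bandwidth-bound model."""
    return bandwidth_gb_s / (weights_gb_per_token + kv_gb_per_token)


# Peak memory bandwidth in GB/s (assumed vendor peaks; sustained rates are lower).
BANDWIDTH_GB_S = {"RTX 5090": 1800.0, "Mac mini M4 Pro": 273.0}

# Hypothetical workload: GB of weights and KV-cache read per generated token.
WEIGHTS_GB = 16.0
KV_GB_BY_CONTEXT = {"short context": 0.5, "long context": 8.0}

for machine, bandwidth in BANDWIDTH_GB_S.items():
    for label, kv_gb in KV_GB_BY_CONTEXT.items():
        ceiling = decode_ceiling_tok_s(bandwidth, WEIGHTS_GB, kv_gb)
        print(f"{machine:<16} {label:<14} ceiling ~{ceiling:6.1f} tok/s")
```

Under these assumptions the ceiling drops by about 31% on both machines as the cache read grows from 0.5 GB to 8 GB, because the relative penalty depends only on the bytes read per token. Bandwidth sets how high the ceiling starts; memory capacity decides whether the long-context run fits at all.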

Practical Recommendations by Use Case

Use Case	Recommended Hardware
Chat (2–8K context)	Any machine — RTX 5060 Ti for speed
Code analysis (8–32K)	RTX 5090 or Mac mini M4 Pro
Document summarization (32–65K)	Mac mini M4 Pro or GB10
Full-document RAG (65K–130K)	GB10 only for multi-user; Mac mini for single user

The Takeaway

Context length scaling exposes a fundamental divide between unified-memory and discrete-GPU architectures. If your use case requires long contexts at scale, unified memory architectures (GB10, Apple Silicon) have a structural advantage — not because they're faster per token, but because they don't hit a hard wall at their VRAM ceiling.

For typical homelab use (chat, coding assistant, single user), any machine handles 8K–32K contexts fine. The differences only become consequential at 65K+ tokens or when serving many concurrent long-context sessions.