Articles
Benchmark reports, analysis posts, and project updates.
The cmoe Trap: What Actually Happens When You Enable MoE Flags on RTX 5090
Enabling -cmoe or -ncmoe on Qwen3.6-35B-A3B-MXFP4_MOE tanks throughput by 57–74% on the RTX 5090. The PPB dataset, including baseline runs, shows the correct recommendation: use no flags at all.
RTX 5060 Ti: The New Throughput King for Small Models
The RTX 5060 Ti hits 768 tok/s on Qwen3.5-0.8B — outpacing the dual RTX 4060 Ti and matching last-gen high-end cards. But 16 GB of VRAM is a real ceiling.
RTX 5060 Ti vs RTX 5090: The Real Price-Performance Ratio for LLMs
The RTX 5090 costs 6× as much as the RTX 5060 Ti. Is it 6× better for local LLM inference? Our benchmark data gives a clear answer, and it depends entirely on your model size.
Qwen3.6-35B-A3B Across Five Machines: The MoE Architecture Test
We ran Qwen3.6-35B-A3B on every machine in our fleet — GB10, Mac mini M4 Pro, dual RTX 4060 Ti, RTX 5060 Ti, and RTX 5090. The results reveal something counterintuitive about Mixture-of-Experts inference.
The Quantization Ladder: Every Quant Level Benchmarked on RTX 5090 for Qwen3.5-27B
From BF16 (near-lossless) to IQ2_XS (2-bit imatrix), we benchmarked every practical quantization level for Qwen3.5-27B on the RTX 5090. The throughput curve has a surprise: Q4 beats Q5 and Q6 in real-world server workloads.
Mac mini M4 Pro: 309 tok/s on 2B Models, and the Limits of Unified Memory
The Mac mini M4 Pro delivers 64 GB of unified memory in a $1,400 package. Our 6,854-row benchmark dataset reveals where it excels and where its shared-bandwidth architecture becomes a ceiling.
Feature Wishlist: What Would Make PPB + ppb-mcp Actually Great?
An inside look at the software gaps in the PPB benchmarking stack — from the runner itself, through the dataset pipeline, to the ppb-mcp query interface. Not about missing data, but about missing features that could be built right now.
gpt-oss-20b at 1,491 tok/s: What OpenAI's Open Weights Can Do Locally
OpenAI's first open-weight language model since GPT-2 hits 1,491 tok/s on RTX 5090, the highest throughput we've ever recorded for a frontier-quality model. Here's what that actually means in practice.
Gemma 4 Benchmarked: Google's E2B and E4B MoE Models on RTX 5090
Gemma 4 uses an Extreme-MoE architecture with only 2B or 4B active parameters at inference time. We benchmarked both variants on RTX 5090 and compared them against same-activation-size dense models.
The GB10 Grace Blackwell: 120 GB of Unified Memory for Local LLMs
NVIDIA's GB10 Grace Blackwell Superchip in the DGX Spark puts 120+ GB of LPDDR5x unified memory at your disposal. We benchmarked it against the RTX 5090 and Mac mini to find out when the memory advantage actually matters.
Dual RTX 4060 Ti: Does 2×16 GB Actually Help for LLMs?
We benchmarked a dual RTX 4060 Ti setup (32 GB combined VRAM via tensor split) against single-GPU alternatives. The results are more nuanced than 'more VRAM = better'.
DeepSeek-R1-Distill at 469 tok/s: Reasoning Models Are Finally Fast
DeepSeek-R1-Distill-Qwen-32B hits 469 tok/s on RTX 5090 at Q2_K — which means even thinking tokens come fast enough to not be annoying. We compare it against non-reasoning Qwen models at the same size.
Context Length Scaling: How Throughput Holds Up from 2K to 130K Tokens
One of the underrated dimensions in local LLM benchmarking is context window performance. We charted throughput degradation across 2K, 8K, 32K, 65K, and 130K tokens on every machine — and found surprisingly little penalty on some hardware.
Unsloth Dynamic vs Standard GGUF: When Mixed-Precision Quantization Pays Off
Unsloth Dynamic (UD) quants promise better quality at low bit widths through mixed-precision. We benchmarked every UD variant against its standard counterpart on the RTX 5090.
RTX 5090 Quantization Guide: Every Qwen3.5 Variant Benchmarked
22 quantization formats across 5 model sizes on the RTX 5090 — a comprehensive guide to finding the right quant for your workload.
Heavy Lifting: Qwen3.5-9B and 27B — Where Architecture Really Matters
At 9B and 27B parameters, hardware choices stop being preferences and start being constraints. Here's how three platforms cope with serious models.
Three GPUs, One Model: Qwen3.5-0.8B Across RTX 5090, GB10, and M4 Pro
A head-to-head comparison of Qwen3.5-0.8B inference performance across three architectures — NVIDIA RTX 5090, NVIDIA GB10, and Apple M4 Pro.
Scaling Up: Qwen3.5-2B and 4B Across Three Architectures
How do the RTX 5090, GB10, and M4 Pro handle Qwen3.5 at 2B and 4B parameters? The answer depends on more than just raw speed.
First Benchmark Roundup: Qwen3.5-0.8B on RTX 5090 vs M4 Pro
Our first look at the data — comparing inference throughput and latency for Qwen3.5-0.8B Q8_0 on an NVIDIA RTX 5090 and Apple M4 Pro.
Welcome to poorpaul.dev
Introducing the official companion site for Poor Paul's Benchmark — the easiest way to explore open LLM inference benchmarks.