Gemma 4 Benchmarked: Google's E2B and E4B MoE Models on RTX 5090
Google's Gemma 4 lineup takes the mixture-of-experts (MoE) concept to an extreme: the E2B variant activates only 2B parameters per token despite having many more in total, and the E4B doubles that to 4B active. For inference, this means you get the per-token compute cost of a tiny model with the trained capacity of a much larger one, although the full set of weights still has to sit in memory.
We ran both Gemma-4-E2B-it and Gemma-4-E4B-it on the RTX 5090. Here's the data.
Throughput Results
| Model | Quant | Users | Peak tok/s | Avg TTFT |
|---|---|---|---|---|
| Gemma-4-E2B-it | Q4_K_M | 32 | ~420 | 180 ms |
| Gemma-4-E2B-it | Q8_0 | 16 | ~310 | 140 ms |
| Gemma-4-E4B-it | Q4_K_M | 32 | ~280 | 210 ms |
| Gemma-4-E4B-it | Q8_0 | 16 | ~195 | 165 ms |
| Qwen3.5-2B | Q4_K_M | 32 | ~650* | 150 ms |
| Qwen3.5-4B | Q4_K_M | 32 | ~480* | 175 ms |
*For comparison: these are dense models at the same active parameter count.
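To read the MoE and dense rows side by side, the ratios can be computed directly from the table. The following is a minimal sketch in Python: the figures are copied from the Q4_K_M, 32-user rows above, the function names are our own, and treating peak tok/s as the aggregate across all concurrent users is an assumption about how the number was measured.

```python
# Peak throughput copied from the table above (tok/s, Q4_K_M, 32 users).
PEAK_TOK_S = {
    "Gemma-4-E2B-it": 420,
    "Gemma-4-E4B-it": 280,
    "Qwen3.5-2B": 650,
    "Qwen3.5-4B": 480,
}
USERS = 32


def relative_throughput(moe: str, dense: str) -> float:
    """Fraction of the dense model's peak throughput that the MoE model reaches."""
    return PEAK_TOK_S[moe] / PEAK_TOK_S[dense]


def per_user_tok_s(model: str, users: int = USERS) -> float:
    """ASSUMPTION: peak tok/s is the aggregate across all concurrent users."""
    return PEAK_TOK_S[model] / users


pairs = [("Gemma-4-E2B-it", "Qwen3.5-2B"), ("Gemma-4-E4B-it", "Qwen3.5-4B")]
for moe, dense in pairs:
    print(
        f"{moe} vs {dense}: {relative_throughput(moe, dense):.0%} of dense throughput, "
        f"{per_user_tok_s(moe):.1f} tok/s per user"
    )
```

On these numbers the E2B reaches roughly 65% of the dense 2B model's peak throughput and the E4B roughly 58% of the dense 4B model's.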
The Active-vs-Total Parameter Distinction
Gemma-4-E2B is not a 2B parameter model. It's a much larger model that activates 2B parameters per token. This distinction matters because:
- VRAM usage: Gemma-4-E2B uses more VRAM than a true 2B model, because all the MoE expert weights must be loaded into memory even if only a few are active per token
- Throughput: Lower than a true 2B model, because far more weights are resident in memory even though only about 2B of them are used per token
- Quality: Higher than a true 2B model because the larger pool of experts means broader knowledge
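In practice: Gemma-4-E2B does roughly dense-2B compute per token, but uses dense-10B+ VRAM and delivers dense-10B+ quality. Its measured throughput still lands below a true dense 2B model (~420 vs ~650 tok/s at Q4_K_M in the table above). This is the MoE value proposition.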
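The VRAM side of this distinction can be sketched with back-of-the-envelope arithmetic: weight memory is roughly total parameters times bits per weight. In the Python sketch below, the 13B total parameter count is hypothetical (chosen only to illustrate a "10B+" model, not a published figure), the Q4_K_M bits-per-weight value is an approximate average that varies by model, and KV cache and runtime overhead are ignored.

```python
# Approximate bits per weight for two GGUF quantization types.
# Q8_0 stores 32 int8 weights plus one fp16 scale per block: 34 bytes / 32 weights = 8.5 bits.
# Q4_K_M mixes block types; ~4.85 is an approximate average, not an exact figure.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5}


def weight_gb(total_params_billions: float, quant: str) -> float:
    """Estimate weight memory in GB (10^9 bytes); excludes KV cache and runtime overhead."""
    return total_params_billions * BITS_PER_WEIGHT[quant] / 8


# HYPOTHETICAL parameter counts, for illustration only.
DENSE_2B = 2.0    # a true dense 2B model
MOE_TOTAL = 13.0  # assumed total parameters for a model with 2B active

for quant in BITS_PER_WEIGHT:
    print(
        f"{quant}: dense 2B ~{weight_gb(DENSE_2B, quant):.1f} GB, "
        f"MoE with 13B total (assumed) ~{weight_gb(MOE_TOTAL, quant):.1f} GB"
    )
```

The point of the sketch is that memory scales with the total parameter count, so the active parameter count tells you almost nothing about which GPU a model will fit on.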
VRAM Footprint
| Model | Quant | Approx VRAM | Fits in 16 GB? |
|---|---|---|---|
| Gemma-4-E2B-it | Q4_K_M | ~8 GB | ✅ Yes |
| Gemma-4-E2B-it | Q8_0 | ~15 GB | ✅ Marginal |
| Gemma-4-E4B-it | Q4_K_M | ~14 GB | ✅ Marginal |
| Gemma-4-E4B-it | Q8_0 | ~27 GB | ❌ No |
The E2B at Q4_K_M fits on an RTX 5060 Ti (16 GB) with room to spare and delivers quality well above what you'd expect from a model with only 2B active parameters.
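The fit column can be reproduced with a simple headroom check. The sketch below uses the approximate VRAM figures from the table above; the 0.85 "comfortable" cutoff is an assumed rule of thumb that happens to match the table's Yes/Marginal/No labels, not something measured in the benchmark, and it is unclear from the figures whether they include KV cache.

```python
# Approximate VRAM figures copied from the table above (GB).
APPROX_VRAM_GB = {
    ("Gemma-4-E2B-it", "Q4_K_M"): 8,
    ("Gemma-4-E2B-it", "Q8_0"): 15,
    ("Gemma-4-E4B-it", "Q4_K_M"): 14,
    ("Gemma-4-E4B-it", "Q8_0"): 27,
}


def fit_status(need_gb: float, gpu_gb: float, comfortable_fraction: float = 0.85) -> str:
    """Classify fit as 'yes', 'marginal', or 'no'.

    ASSUMPTION: the 0.85 cutoff is a rule of thumb for leaving headroom
    (context, activations, other processes), not a figure from the benchmark.
    """
    if need_gb > gpu_gb:
        return "no"
    if need_gb > comfortable_fraction * gpu_gb:
        return "marginal"
    return "yes"


for gpu_gb in (16, 32):
    for (model, quant), need in APPROX_VRAM_GB.items():
        print(f"{gpu_gb} GB GPU | {model} {quant}: {fit_status(need, gpu_gb)}")
```

Changing `gpu_gb` makes it easy to repeat the check for other cards; long contexts or many concurrent users will need more headroom than the cutoff assumes.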
The Multi-GPU Question
Unlike Qwen3.6-35B-A3B, Gemma 4's MoE routing is optimized for single-GPU inference. Running it across two GPUs via tensor split incurs the same PCIe penalty as any other split model, and the routing overhead is amplified because expert selection happens at every layer.
Recommendation: Run Gemma 4 on a single GPU with enough VRAM to hold the full model.
Practical Recommendation
If you have an RTX 5060 Ti (16 GB) and want more quality than a true 2B model:
- Gemma-4-E2B-Q4_K_M fits easily and provides noticeably better reasoning than Qwen3.5-2B
If you have an RTX 5090 (32 GB) and want the best quality-per-throughput:
- Gemma-4-E4B-Q8_0 delivers near-lossless quality at reasonable throughput
The extreme-MoE architecture genuinely delivers on its promise: quality that exceeds what the active parameter count would suggest. The trade-off is VRAM usage that also exceeds what the active parameter count would suggest.