Gemma 4 Benchmarked: Google's E2B and E4B MoE Models on RTX 5090

Google's Gemma 4 lineup takes the MoE concept to an extreme: the E2B variant activates only 2B parameters per token despite having many more in total. The E4B doubles that to 4B active. For inference, this means you get the per-token compute of a tiny model with the trained capacity of a much larger one, at the cost of the larger model's memory footprint.

We ran both Gemma-4-E2B-it and Gemma-4-E4B-it on the RTX 5090. Here's the data.

Throughput Results

Model	Quant	Users	Peak tok/s	Avg TTFT
Gemma-4-E2B-it	Q4_K_M	32	~420	180 ms
Gemma-4-E2B-it	Q8_0	16	~310	140 ms
Gemma-4-E4B-it	Q4_K_M	32	~280	210 ms
Gemma-4-E4B-it	Q8_0	16	~195	165 ms
Qwen3.5-2B	Q4_K_M	32	~650*	150 ms
Qwen3.5-4B	Q4_K_M	32	~480*	175 ms

*For comparison: these are dense models at the same active parameter count.
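
Two numbers make the table easier to read: throughput per user at peak, and each MoE model's peak throughput relative to the dense model with the same active parameter count. The Python sketch below derives both from the table's figures. The function names are illustrative, and splitting the aggregate evenly across users is a simplifying assumption, since peak tok/s is measured across all users together.

```python
# Peak aggregate throughput (tok/s) and concurrent users, copied from the table above.
RESULTS = {
    ("Gemma-4-E2B-it", "Q4_K_M"): (420, 32),
    ("Gemma-4-E2B-it", "Q8_0"): (310, 16),
    ("Gemma-4-E4B-it", "Q4_K_M"): (280, 32),
    ("Gemma-4-E4B-it", "Q8_0"): (195, 16),
    ("Qwen3.5-2B", "Q4_K_M"): (650, 32),
    ("Qwen3.5-4B", "Q4_K_M"): (480, 32),
}


def per_user_tok_s(model, quant):
    """Peak aggregate throughput split evenly across users (an assumption)."""
    peak, users = RESULTS[(model, quant)]
    return peak / users


def relative_throughput(moe_model, dense_model, quant="Q4_K_M"):
    """Ratio of the MoE model's peak throughput to the dense model's."""
    peak_moe, _ = RESULTS[(moe_model, quant)]
    peak_dense, _ = RESULTS[(dense_model, quant)]
    return peak_moe / peak_dense


if __name__ == "__main__":
    for model, quant in RESULTS:
        print(f"{model} {quant}: {per_user_tok_s(model, quant):.1f} tok/s per user")
    print(f"E2B vs dense 2B: {relative_throughput('Gemma-4-E2B-it', 'Qwen3.5-2B'):.0%}")
    print(f"E4B vs dense 4B: {relative_throughput('Gemma-4-E4B-it', 'Qwen3.5-4B'):.0%}")
```

On these figures, the E2B reaches roughly 65% of the dense 2B model's peak throughput at Q4_K_M, and the E4B roughly 58% of the dense 4B model's.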

The Active-vs-Total Parameter Distinction

Gemma-4-E2B is not a 2B parameter model. It's a much larger model that activates 2B parameters per token. This distinction matters because:

VRAM usage: Gemma-4-E2B uses more VRAM than a true 2B model, because all the MoE expert weights must be loaded into memory even if only a few are active per token
Throughput: Lower than a true 2B model (~420 vs ~650 tok/s at Q4_K_M in the table above), because more weights are resident in memory even though only 2B are used per token
Quality: Higher than a true 2B model because the larger pool of experts means broader knowledge

In practice: Gemma-4-E2B does roughly dense-2B compute per token, but uses dense-10B+ VRAM and delivers dense-10B+ quality. This is the MoE value proposition.
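
The VRAM point is simple arithmetic: weight memory scales with total parameters, not active ones. The sketch below is a minimal illustration, not a measurement. It assumes about 4.85 bits per weight for Q4_K_M (an approximate figure that varies by model) and 8.5 for Q8_0. The 10B total is a hypothetical stand-in taken from the "dense-10B+" comparison above, not a published parameter count. KV cache and runtime buffers are ignored.

```python
# Approximate bits per weight for two GGUF quantization types.
# Q8_0 stores 32 8-bit weights plus a 16-bit scale per block (8.5 bits/weight);
# the Q4_K_M value is an approximation and varies by model.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5}


def weight_gb(total_params_b, quant):
    """Estimate weight memory in GB (10^9 bytes) for a model with
    total_params_b billion parameters. Ignores KV cache and overhead."""
    return total_params_b * BITS_PER_WEIGHT[quant] / 8


if __name__ == "__main__":
    # Hypothetical totals: a true dense 2B model vs. an assumed 10B-total MoE.
    for label, total_b in [("dense 2B", 2.0), ("MoE, 10B total (assumed)", 10.0)]:
        for quant in BITS_PER_WEIGHT:
            print(f"{label:26s} {quant:7s} ~{weight_gb(total_b, quant):.1f} GB")
```

Real footprints will be higher once KV cache and runtime buffers are included, but the ratio between the two rows is the point: the MoE model pays for every expert in memory regardless of how few are active per token.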

VRAM Footprint

Model	Quant	Approx VRAM	Fits in 16 GB?
Gemma-4-E2B-it	Q4_K_M	~8 GB	✅ Yes
Gemma-4-E2B-it	Q8_0	~15 GB	✅ Marginal
Gemma-4-E4B-it	Q4_K_M	~14 GB	✅ Marginal
Gemma-4-E4B-it	Q8_0	~27 GB	❌ No

The E2B at Q4_K_M fits on an RTX 5060 Ti (16 GB) and delivers quality well above what its 2B active parameter count would suggest.
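
The "Fits in 16 GB?" column can be reproduced with a simple headroom check. The sketch below is one way to express it. The function name and the 80% comfort threshold are assumptions chosen to match the table's Yes/Marginal/No labels, not a rule taken from any tool.

```python
# Approximate VRAM in GB, copied from the table above.
VRAM_GB = {
    ("Gemma-4-E2B-it", "Q4_K_M"): 8,
    ("Gemma-4-E2B-it", "Q8_0"): 15,
    ("Gemma-4-E4B-it", "Q4_K_M"): 14,
    ("Gemma-4-E4B-it", "Q8_0"): 27,
}


def fit_status(vram_gb, capacity_gb=16.0, comfort=0.8):
    """Classify a model's footprint against a GPU's VRAM capacity.

    'comfort' is the fraction of capacity treated as a comfortable fit;
    0.8 is an assumed value, not a measured one.
    """
    if vram_gb > capacity_gb:
        return "no"
    if vram_gb > comfort * capacity_gb:
        return "marginal"
    return "yes"


if __name__ == "__main__":
    for (model, quant), vram in VRAM_GB.items():
        print(f"{model} {quant}: ~{vram} GB -> {fit_status(vram)}")
```

Passing capacity_gb=32.0 applies the same check to an RTX 5090.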

The Multi-GPU Question

Unlike Qwen3.6-35B-A3B, Gemma 4's MoE routing is optimized for single-GPU inference. Running it across two GPUs via tensor split incurs the same PCIe penalty as any other split model, and the routing overhead is amplified because expert selection happens at every layer.

Recommendation: Run Gemma 4 on a single GPU with enough VRAM to hold the full model.

Practical Recommendation

If you have an RTX 5060 Ti (16 GB) and want more quality than a true 2B model:

Gemma-4-E2B-it at Q4_K_M fits easily (~8 GB) and provides noticeably better reasoning than Qwen3.5-2B

If you have an RTX 5090 (32 GB) and want the best balance of quality and throughput:

Gemma-4-E4B-it at Q8_0 (~27 GB) delivers near-lossless quality at reasonable throughput (~195 tok/s)

The Extreme-MoE architecture genuinely delivers on its promise: quality beyond the compute class implied by the active parameter count. The trade-off is VRAM usage that also exceeds what the active parameter count would suggest.