Google DeepMind released Gemma 4 on April 2, 2026. This is the latest generation of Google’s open AI model family, built on the same technology as Gemini 3. It comes in 4 versions ranging from 2B to 31B parameters, runs on everything from phones to servers, and for the first time uses the Apache 2.0 license.

With over 400 million total downloads and more than 100,000 community model variants, Gemma is currently Google’s most popular open model family. Gemma 4 continues that momentum with performance that significantly surpasses Gemma 3, competing directly with Qwen 3.5 and Llama 4 in the same segment.

Gemma 4 overview

Gemma 4 comes in four model sizes, serving everything from edge/mobile to workstation use cases. All four support multimodal input (text, image); the smaller models also handle audio (E2B, E4B) and video (E4B).

The biggest change from Gemma 3: the license has switched to Apache 2.0. Previous Gemma versions used a custom license with some commercial restrictions. Now you can use it for any purpose, including commercial products, without worrying about licensing issues.

Technology-wise, Gemma 4 is built on the same architecture as Gemini 3. Sundar Pichai described it as “packing incredible amount of intelligence per parameter,” while Demis Hassabis called these “best open models in the world for their respective sizes.”

Four Gemma 4 variants

The table below summarizes the key specs of all 4 models:

Model           | Parameters             | Active params | Context | Modalities
E2B             | 5.1B (effective 2.3B)  | 2.3B          | 128K    | Text, Image, Audio
E4B             | 8B (effective 4.5B)    | 4.5B          | 128K    | Text, Image, Audio, Video
26B MoE (A4B)   | 25.2B                  | 3.8B          | 256K    | Text, Image
31B Dense       | 30.7B                  | 30.7B         | 256K    | Text, Image

E2B and E4B: compact, runs offline

E2B (Effective 2B) has 5.1B total parameters but only 2.3B effective. It runs on phones, Raspberry Pi, and Jetson Nano, and supports text, image, and audio input via a USM-style conformer encoder. With a 128K context window, this model is ideal for offline applications and embedded AI.

E4B (Effective 4B) is slightly larger: 8B total, 4.5B effective. Also runs on small devices. The notable addition is video processing (up to 60 seconds at 1fps), on top of text, image, and audio.

26B MoE (A4B): high performance, low resources

This is a Mixture-of-Experts model with 25.2B total parameters, but each inference only uses 3.8B active parameters. The architecture consists of 128 experts, selecting 8 experts plus 1 shared expert per forward pass.

Context window is 256K tokens. In simple terms: you get a 25B model with inference speed close to a 4B model. On a MacBook M4 with 38GB RAM, this model runs at about 42-43 tokens/second. On M2 Ultra with llama.cpp, it reaches up to 300 t/s.
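The speed advantage follows directly from the routing: per token, only a handful of experts run. Here is a minimal sketch of top-k routing as the article describes it (128 experts, top-8 selected plus 1 always-on shared expert); the router logits and the softmax renormalization are illustrative, not Gemma 4's actual implementation.

```python
import math

def route_token(logits, top_k=8):
    """Pick the top_k experts from router logits and renormalize
    their weights with a softmax over just the chosen experts."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    chosen = ranked[:top_k]
    exps = [math.exp(logits[i]) for i in chosen]
    total = sum(exps)
    return {i: e / total for i, e in zip(chosen, exps)}

NUM_EXPERTS = 128
# Fake router logits for one token (hypothetical values for illustration)
logits = [((i * 37) % 113) / 113.0 for i in range(NUM_EXPERTS)]

weights = route_token(logits, top_k=8)
weights[-1] = 1.0  # shared expert (index -1 here) always contributes

# Only 9 experts execute per token instead of all 128
print(len(weights))
```

This is why active parameters (3.8B) rather than total parameters (25.2B) determine inference cost.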

31B Dense: the most powerful Gemma

A 30.7B dense model with no MoE, so all parameters are active. Currently ranked #3 on Arena AI for open models (as of April 2, 2026). Context window is 256K tokens, supporting text and image.

Running 31B unquantized requires 1x 80GB H100. But with quantization, it runs on regular consumer GPUs.

Key strengths and capabilities

Reasoning and thinking mode

Gemma 4 supports chain-of-thought reasoning via the <|think|> token. When thinking mode is enabled, the model “thinks” before responding, significantly improving results on complex problems. For example, on AIME 2026, Gemma 4 31B scores 89.2% while Gemma 3 27B only managed 20.8%.
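As a rough sketch, enabling thinking amounts to having the model open its turn with the thinking token. The `<|think|>` token is from the article; the surrounding `<start_of_turn>` template markers below are illustrative assumptions, not the confirmed Gemma 4 chat template.

```python
def build_prompt(user_msg: str, thinking: bool = True) -> str:
    """Assemble a single chat turn; template markers are assumed, not official."""
    prompt = (
        f"<start_of_turn>user\n{user_msg}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )
    if thinking:
        # Model emits its chain-of-thought after this token, then the answer
        prompt += "<|think|>"
    return prompt

print(build_prompt("Prove that 17 is prime."))
```

In practice, inference frameworks usually expose this as a flag (e.g. a "thinking" toggle) rather than requiring you to splice tokens by hand.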

Agentic workflows

Gemma 4 has native function-calling, structured JSON output, and system instructions. This means you can use the model as an AI agent: calling APIs, extracting structured data, handling multi-step tasks without complex external frameworks.
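A typical function-calling loop looks like this: you describe a tool, the model replies with structured JSON, and your code dispatches the call. The tool schema and the sample model reply below are illustrative assumptions; the article confirms structured JSON output but not this exact wire format.

```python
import json

# Hypothetical tool definition passed to the model (e.g. in the system prompt)
get_weather = {
    "name": "get_weather",
    "description": "Look up current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# A model reply in the structured-JSON style the article describes
# (exact schema is an assumption for illustration)
model_reply = '{"tool": "get_weather", "arguments": {"city": "Hanoi"}}'

call = json.loads(model_reply)
if call["tool"] == get_weather["name"]:
    # Dispatch to your real implementation here
    print(call["arguments"]["city"])  # Hanoi
```

Because the output is plain JSON, no agent framework is strictly required: `json.loads` plus a dispatch table is enough for simple agents.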

Multimodal: vision, audio, video

All 4 models process images. Vision capabilities include OCR, chart understanding, object detection with JSON bounding boxes, and variable image token budgets (70 to 1120 tokens depending on image complexity).
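Detection output arrives as JSON, so post-processing is straightforward. The schema below (normalized 0-1000 coordinates in `[y_min, x_min, y_max, x_max]` order) is an illustrative assumption; the article confirms JSON bounding boxes but not this exact layout.

```python
import json

# Hypothetical detection output from the model
raw = '[{"label": "cat", "box_2d": [100, 200, 600, 800]}]'

def to_pixels(box, width, height, scale=1000):
    """Map normalized [y0, x0, y1, x1] coords to pixel (x0, y0, x1, y1)."""
    y0, x0, y1, x1 = box
    return (x0 * width // scale, y0 * height // scale,
            x1 * width // scale, y1 * height // scale)

for det in json.loads(raw):
    print(det["label"], to_pixels(det["box_2d"], width=1920, height=1080))
```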

E2B and E4B also include an audio encoder (USM-style conformer), supporting speech recognition (ASR), up to 30 seconds per clip. E4B can also process video, up to 60 seconds at 1fps.

Code generation

Gemma 4 31B achieves 2150 Codeforces ELO and 80% on LiveCodeBench. These numbers show the model writes code quite well and can serve as an offline code assistant. Even the smallest E2B model scores 633 ELO on Codeforces, sufficient for basic coding tasks.

140+ languages

Gemma 4’s vocabulary reaches 262K tokens, much larger than the typical 32K-150K range in other models. This helps the model handle more than 140 languages natively and more efficiently, including Vietnamese.

Benchmarks: how strong is Gemma 4?

Here’s a benchmark comparison between Gemma 4 variants and Gemma 3 27B:

Model            | MMLU-Pro | GPQA Diamond | LiveCodeBench | Codeforces ELO | MMMLU
Gemma 4 31B      | 85.2%    | 84.3%        | 80.0%         | 2150           | 88.4%
Gemma 4 26B A4B  | 82.6%    | 82.3%        | 77.1%         | 1718           | 86.3%
Gemma 4 E4B      | 69.4%    | 58.6%        | 52.0%         | 940            | 76.6%
Gemma 4 E2B      | 60.0%    | 43.4%        | 44.0%         | 633            | 67.4%
Gemma 3 27B      | 67.6%    | 42.4%        | 29.1%         | 110            | 70.7%

Looking at the table, Gemma 4 31B outperforms Gemma 3 27B across every metric. The largest gaps are in code (Codeforces ELO 2150 vs 110) and reasoning (GPQA Diamond 84.3% vs 42.4%). This jump is partly thanks to the new thinking mode.

Notably, the 26B MoE (A4B) is also very strong despite using only 3.8B active params per inference. GPQA Diamond reaches 82.3%, close to the 31B Dense version. This is a great choice if you want to balance performance and resources.

Comparison with competitors

Vs Qwen 3.5 27B: Qwen has a slight edge in reasoning (GPQA Diamond 85.5% vs 84.3%). Qwen also has a larger context window (over 1M tokens). But Gemma 4 wins on multimodal (built-in audio + vision) and the more permissive Apache 2.0 license.

Vs DeepSeek: DeepSeek 671B is still stronger, but requires much more hardware. Gemma 4 31B delivers “genuinely close” results on many tasks with just a fraction of the resources.

Vs Llama 4 Scout: Llama has a context window up to 10M tokens, but Gemma 4’s 256K is sufficient for most real-world use cases. Gemma 4 currently dominates the budget/edge segment on LMArena.

Benchmark sources: main table from Google Gemma 4 Model Card. Cross-model comparisons from Artificial Analysis (independent testing). Arena AI rankings from arena.ai as of April 2, 2026. Detailed Gemma 4 information from the official Google blog.

Technical architecture

Gemma 4 introduces several notable architectural changes compared to standard transformers. Here are the key points, explained concisely.

Hybrid Attention

Instead of using all global attention (memory-heavy, slow with long contexts), Gemma 4 alternates between sliding-window attention (looking at only the nearest 512 or 1024 tokens) and global full-context attention. The last layer is always global so the model can still “see” the entire context when needed.

This approach saves significant memory while maintaining the ability to handle long contexts.
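The masking pattern can be sketched in a few lines. Window size and sequence length here are toy values for illustration; the article mentions windows of 512 or 1024 tokens.

```python
# Sliding-window layers see only the previous `window` tokens;
# global layers see everything before the current position (causal).
def attention_mask(seq_len, window=None):
    """mask[i][j] is True when token i may attend to token j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i                      # causal constraint
            if window is not None:
                visible = visible and (i - j < window)  # local constraint
            row.append(visible)
        mask.append(row)
    return mask

local = attention_mask(8, window=4)  # sliding-window layer
glob = attention_mask(8)             # global layer

# At position 7: the local layer sees 4 tokens, the global layer sees all 8
print(sum(local[7]), sum(glob[7]))
```

The memory saving comes from the KV cache: local layers only keep `window` entries instead of the full context.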

Per-Layer Embeddings (PLE)

In standard models, all layers share a single input embedding vector. Gemma 4 gives each layer its own conditioning vector. Essentially, each layer receives an additional “signal” to know where it sits in the network and what to process. This technique helps the model learn more efficiently with the same number of parameters.
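The idea can be sketched as follows; the dimensions and values are toy numbers, and real PLE vectors are learned during training, not fixed offsets.

```python
# Toy sketch of per-layer embeddings: each layer adds its own learned
# conditioning vector on top of the shared input embedding.
NUM_LAYERS, DIM = 4, 8

shared_embed = [0.1] * DIM                                  # shared across layers
per_layer = [[0.01 * l] * DIM for l in range(NUM_LAYERS)]   # one vector per layer

def layer_input(layer_idx):
    """The signal a given layer receives: shared part + its own part."""
    return [s + p for s, p in zip(shared_embed, per_layer[layer_idx])]

# Each layer sees a slightly different input, even for the same token
print(layer_input(0)[0] != layer_input(3)[0])
```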

Shared KV Cache

The model’s later layers reuse Key/Value cache from previous layers instead of computing their own. Put simply: instead of each layer consuming separate memory for KV cache, some layers “borrow” cache from earlier layers. The result: reduced memory and compute with nearly identical output quality.
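A minimal sketch of the sharing scheme: a map from each layer to the layer whose KV cache it uses. The specific split (layers 8-11 borrowing from layer 7) is an illustrative assumption, not Gemma 4's actual layout.

```python
NUM_LAYERS = 12

# Hypothetical sharing map: layers 0-7 compute their own KV cache,
# layers 8-11 reuse layer 7's cache instead of storing their own.
kv_source = {l: (l if l < 8 else 7) for l in range(NUM_LAYERS)}

caches = {}

def get_kv(layer, compute):
    """Return the KV cache for `layer`, computing it only once per source."""
    src = kv_source[layer]
    if src not in caches:
        caches[src] = compute(src)
    return caches[src]

for l in range(NUM_LAYERS):
    get_kv(l, compute=lambda s: f"kv@{s}")

# Only 8 caches are stored for 12 layers
print(len(caches))
```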

Dual RoPE and GQA

Gemma 4 uses two different base frequencies for RoPE (Rotary Position Embedding): sliding layers use base 10K, global layers use base 1M. The larger base helps global layers encode positions more accurately for long contexts.

Grouped Query Attention (GQA) is also configured differently between local and global layers: local uses 2 queries per 1 KV head, global uses 8 queries per 1 KV head. This compresses memory efficiently, especially at global layers that must process the entire context.
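The effect of the two RoPE bases is easy to see numerically. The formula below is the standard RoPE inverse-frequency computation; the head dimension of 128 is an assumption for illustration.

```python
# RoPE inverse frequencies for the two bases the article mentions:
# base 10K on sliding-window layers, base 1M on global layers.
def rope_inv_freq(base, head_dim=128):
    """Standard RoPE: inv_freq_i = base^(-2i / head_dim)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

local_freqs = rope_inv_freq(10_000)
global_freqs = rope_inv_freq(1_000_000)

# A larger base makes the slowest dimensions rotate more slowly, so
# positions far apart in a 256K context remain distinguishable.
print(local_freqs[-1] > global_freqs[-1])  # True
```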

How to run Gemma 4 on your machine

Gemma 4 has day-0 support on most popular tools. Here are the quickest ways to get started.

Ollama

The simplest approach. Just one command:

ollama pull gemma4

By default, this downloads the 26B MoE (A4B) version. To download the other variants:

# Pull the 31B Dense version
ollama pull gemma4:31b
# Pull the smaller versions
ollama pull gemma4:e2b
ollama pull gemma4:e4b

llama.cpp

Day-0 support with excellent performance. Download the GGUF model from Hugging Face and run:

./llama-server -m gemma-4-26b-a4b-Q4_K_M.gguf -c 8192 -ngl 99

On M2 Ultra, the 26B MoE hits about 300 tokens/second. On M4 with 38GB RAM, it runs at about 42-43 t/s.

MLX (Apple Silicon)

If you’re on a Mac with Apple Silicon, MLX is a solid choice with TurboQuant support:

pip install mlx-lm
mlx_lm.generate --model google/gemma-4-26b-a4b-mlx --prompt "Hello"

Other platforms

Gemma 4 is also available on Hugging Face, Kaggle, Google AI Studio, and NVIDIA NIM. You can use it directly via API or download it to run locally depending on your needs.

💡 If your machine has less than 16GB RAM, start with E2B or E4B. With 32GB or more, the 26B MoE runs quite smoothly.

Gemma 4 on mobile and edge devices

Google designed E2B and E4B specifically for edge/mobile. These two models run on phones, Raspberry Pi, and Jetson Nano, completely offline. Compared to previous generations, Google reports 4x faster inference and 60% lower battery consumption.

Gemma 4 is also the foundation for the upcoming Gemini Nano 4 on Google Pixel devices. Google is partnering with Qualcomm and MediaTek to optimize for mobile chips. Android developers can access it via the AICore Developer Preview.

ℹ️ E2B and E4B can run completely offline, no internet connection needed. Suitable for applications requiring data privacy or areas without network access.

When should you choose Gemma 4?

Gemma 4 fits well in several specific scenarios:

  • Need to run AI locally/offline: E2B, E4B run on small devices, 26B MoE runs on laptops/workstations with good speed.
  • License matters: Apache 2.0 allows unrestricted commercial use, unlike some other models.
  • Need multimodal: Vision, audio, video built-in. No need to bolt on additional models.
  • Agentic AI: Function-calling, JSON output, system instructions are natively supported.
  • Edge/mobile deployment: Optimized for phones, IoT, Raspberry Pi.
  • Multilingual: 140+ languages, 262K token vocabulary.

If you need an extremely long context window (over 256K), Qwen 3.5 or Llama 4 Scout may be more suitable. If you need absolute top performance regardless of hardware, DeepSeek 671B remains the top choice.

Frequently asked questions

How is Gemma 4 different from Gemini?

Gemma 4 is Google’s open-weight model family, using the same technology as Gemini 3 but in smaller sizes. Gemini is a closed-source model running on the cloud, while Gemma can be downloaded and run locally.

What are the minimum hardware requirements to run Gemma 4?

E2B runs on Raspberry Pi and phones. E4B is similar. 26B MoE needs about 16-32GB RAM (quantized). 31B Dense unquantized requires an 80GB GPU, but quantized versions run on consumer GPUs with 24GB.

Is Gemma 4 free?

Yes. Apache 2.0 license, free for both personal and commercial use. Download from Hugging Face, Kaggle, or Ollama.

Should I choose 26B MoE or 31B Dense?

If you prioritize speed and resource efficiency, choose 26B MoE. With only 3.8B active params, it’s very fast. If you need the highest possible performance and have strong enough hardware, choose 31B Dense.

Does Gemma 4 support Vietnamese?

Yes. Gemma 4 supports over 140 languages, including Vietnamese. The 262K token vocabulary also helps tokenize Vietnamese more efficiently compared to models with smaller vocabularies.

How much has Gemma 4 improved over Gemma 3?

Significantly. For example: AIME 2026 went from 20.8% to 89.2%, GPQA Diamond from 42.4% to 84.3%, Codeforces ELO from 110 to 2150. This is a major leap, thanks to the new architecture and thinking mode.

Can Gemma 4 be used as a code assistant?

Yes. Gemma 4 31B scores 80% on LiveCodeBench and 2150 Codeforces ELO. Even the 26B MoE reaches 77.1% on LiveCodeBench. You can use it as an offline code assistant via Ollama or llama.cpp, paired with editors like VS Code through compatible extensions.

This article has been reviewed by AZDIGI Team

About the author

Trần Thắng


Expert at AZDIGI with years of experience in web hosting and system administration.
