What is Mixture of Experts (MoE) in LLMs?

Mixture of Experts (MoE) is a neural network architecture where only a subset of specialized sub-networks (experts) activate for each token, rather than the entire model. A router decides which experts handle each input. DeepSeek V3 has 671B total parameters but only ~37B activate per token, making it compute-efficient but memory-intensive.

How do I configure vLLM for MoE models like DeepSeek?

MoE models require more conservative vLLM settings than dense models of equivalent active parameter count. Key settings: set gpu_memory_utilization to 0.75–0.80 (lower than the 0.90 default), start max_num_seqs at 4–8 rather than the higher values used for dense models, set max_model_len to your actual p99 request length rather than the model maximum, and shard the model across GPUs with --tensor-parallel-size, adding --enable-expert-parallel to distribute experts across those GPUs.

Why does DeepSeek V3 require so much GPU memory if it only activates 37B parameters?

MoE models must load all experts into GPU VRAM even though only a fraction activate per token, because the router can select any expert for any token. DeepSeek V3 has 671B total parameters across 256 experts, and all of them must be resident for inference even though each token uses only ~37B parameters' worth of computation. This is why the GPU memory requirement is determined by total parameters (671B) while compute cost is closer to that of a 37B dense model.

What is the difference between total and active parameters in MoE models?

Total parameters is the full size of the model including all experts; for DeepSeek V3 this is 671B. Active parameters is how many parameters actually take part in computing each token; for DeepSeek V3 this is ~37B. GPU memory requirement is set by total parameters. Compute cost (FLOPs per token, latency) is set by active parameters. This distinction is critical for GPU sizing: you need hardware that fits 671B in memory even though the compute workload resembles a 37B model.

What GPU tier do I need for DeepSeek V3?

DeepSeek V3 at 671B parameters in BF16 requires approximately 1.3TB of GPU memory for the weights alone (671B × 2 bytes ≈ 1.34TB). That works out to 16 or more 80GB GPUs just to load the weights. A100-80GB and H100-80GB have the same capacity, so moving to H100 does not reduce the GPU count, and 8× H100-80GB (640GB total) cannot hold the BF16 weights. Additional VRAM is needed for KV cache and activations. Quantized versions (INT4/INT8) reduce this significantly: a 4-bit quantized DeepSeek V3 needs roughly 336GB for weights and can fit on 8× A100-80GB. GPU tier selection should be based on total parameters and quantization level, not active parameters.


Why MoE Models Break Your vLLM Configuration Rules

By Sam Hosseini · June 3, 2026 · 7 min read

The configuration rules that work for dense models fall apart with Mixture of Experts. A DeepSeek-scale MoE model needs the memory of a 671B model but the compute of a 37B one — and most teams configure it wrong.

If you've been running LLM inference with dense models such as Llama, Mistral, or Qwen, you've built up intuitions about how to configure vLLM. Match GPU tier to parameter count. Set max_num_seqs based on available VRAM after model weights load. Size KV cache to your typical context length.

Those rules break when you move to Mixture of Experts (MoE) models like DeepSeek V3.

Here's why — and what to do instead.

---

What MoE Actually Means for Inference

A standard dense model activates every parameter on every token. A 70B dense model does 70B parameters worth of computation per forward pass.

A MoE model works differently. Instead of one monolithic network, it has a collection of specialized sub-networks called experts, plus a lightweight router that decides which experts handle each token. Only a small subset of experts activate for any given input.

DeepSeek V3 has 671B total parameters but only ~37B activate per token. The router picks 8 experts out of 256 for each token position.
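To make the routing step concrete, here is a minimal sketch of top-k expert selection for a single token. It assumes random numbers in place of the learned router's scores, and the function name `route_token` is hypothetical; a real router is a trained layer inside the model, not a standalone function like this.

```python
import random

NUM_EXPERTS = 256  # routed experts, per the figure quoted above
TOP_K = 8          # experts activated per token, per the figure quoted above

def route_token(router_scores, top_k=TOP_K):
    """Return the indices of the top_k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i],
                    reverse=True)
    return ranked[:top_k]

# Assumption: random scores stand in for the learned router's output.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
scores = [rng.random() for _ in range(NUM_EXPERTS)]

chosen = route_token(scores)
print("experts chosen for this token:", sorted(chosen))
print(f"share of experts that run: {len(chosen) / NUM_EXPERTS:.1%}")
```

The share of active parameters (~37B of 671B) is larger than the share of active experts because attention and other non-expert layers run for every token.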

This creates an unusual and counterintuitive profile:

| | Dense 70B | DeepSeek V3 (MoE) |
|---|---|---|
| Total parameters | 70B | 671B |
| Active per token | 70B | ~37B |
| Compute per token | High | Lower than size suggests |
| Memory required | ~140GB (BF16) | ~1.3TB (BF16) |
| GPU bottleneck | Compute | Memory bandwidth |
| Minimum GPUs (BF16) | 2× A100-80GB | 16+ A100-80GB |

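The key insight: MoE models are bound by memory, not compute. All 671B parameters must be loaded into VRAM even though only a fraction activate per token, so memory capacity sets the hardware floor while per-token compute stays closer to that of a 37B model.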
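The memory figures above follow from simple arithmetic. The sketch below estimates weight memory and a lower bound on GPU count from total parameters and bytes per parameter. The helper names are hypothetical, and the estimate deliberately ignores KV cache, activations, and any requirement that the GPU count divide evenly for parallelism, so treat the result as a floor rather than a recommendation.

```python
import math

def weight_memory_gb(total_params_billion, bytes_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # 1e9 parameters * bytes_per_param / 1e9 bytes per GB
    return total_params_billion * bytes_per_param

def min_gpus_for_weights(total_params_billion, bytes_per_param, gpu_gb=80):
    """Lower bound on GPU count: weights only, no KV cache or activations."""
    return math.ceil(weight_memory_gb(total_params_billion, bytes_per_param) / gpu_gb)

# Bytes per parameter for a few common precisions.
PRECISIONS = {"BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

for model, params_b in [("Dense 70B", 70), ("DeepSeek V3 (MoE)", 671)]:
    for name, bpp in PRECISIONS.items():
        gb = weight_memory_gb(params_b, bpp)
        gpus = min_gpus_for_weights(params_b, bpp)
        print(f"{model:18s} {name}: {gb:7.1f} GB of weights, at least {gpus} x 80GB GPUs")
```

Note that the sizing input is always total parameters. The ~37B active figure never appears in this calculation.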

---

Why Your Standard vLLM Config Rules Break

Rule 1: Match GPU tier to parameter count

For dense models, parameter count is a reliable proxy for GPU requirements. A 7B model fits on an A10G. A 70B model needs multiple A100s.

For MoE, this rule fails entirely. DeepSeek V3 at 671B parameters needs 16+ A100-80GB GPUs just to load the weights — but its per-token compute is closer to a 37B model. You can't use a smaller GPU tier just because compute demand is lower. The memory requirement is non-negotiable.

Rule 2: Set max_num_seqs based on available VRAM after weights load

For dense models, this is straightforward: VRAM minus model weights equals KV cache budget, which determines how many concurrent sequences you can run.

For MoE, the model weights consume almost all available VRAM. The KV cache budget per sequence is extremely tight. Setting max_num_seqs too high can OOM immediately because the weights leave almost no headroom.

A practical starting point for DeepSeek V3 on a 16-GPU H100 setup: max_num_seqs of 4–8, not the 32–64 you might use for a dense 7B model.
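The budget arithmetic behind this rule can be sketched in a few lines. The helper names and every number in the example are assumptions chosen for illustration (1 byte per parameter for the weights, 40GB of KV cache per long sequence); substitute measured values from your own deployment.

```python
def kv_budget_gb(num_gpus, gpu_gb, gpu_memory_utilization, weights_gb, overhead_gb=0.0):
    """VRAM left for KV cache once weights and other overhead are resident."""
    usable = num_gpus * gpu_gb * gpu_memory_utilization
    return max(usable - weights_gb - overhead_gb, 0.0)

def max_concurrent_seqs(budget_gb, kv_gb_per_seq):
    """How many sequences fit if each one needs kv_gb_per_seq of KV cache."""
    return int(budget_gb // kv_gb_per_seq)

# Illustrative assumptions, not measurements:
budget = kv_budget_gb(
    num_gpus=16,
    gpu_gb=80,
    gpu_memory_utilization=0.80,
    weights_gb=671.0,   # assumes 1 byte per parameter for 671B parameters
)
seqs = max_concurrent_seqs(budget, kv_gb_per_seq=40.0)  # assumed long-context sequence
print(f"KV cache budget: {budget:.0f} GB, room for about {seqs} such sequences")
```

A budget of zero means the weights do not fit at that utilization setting at all, which is worth checking before tuning anything else.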

Rule 3: KV cache scales with context length

This is still true — but the problem is dramatically worse for MoE. With almost no VRAM headroom after model weights, every token of context competes directly with other sequences for the scraps of remaining memory.

At 1M context length (relevant for long-context agent workloads), a single sequence can consume the entire remaining KV cache budget on a multi-GPU setup. There is no room for concurrent requests.

---

The 1M Context Agent Problem

MoE models like DeepSeek V3 are increasingly being used for long-context agentic workflows — tasks where the model needs to reason over very long documents, codebases, or conversation histories.

At 1M token context, the KV cache for a single sequence is enormous:

KV cache per token ≈ 2 × num_layers × num_kv_heads × head_dim × bytes_per_element
KV cache per sequence ≈ KV cache per token × context length
At 1M tokens: this becomes tens of gigabytes per sequence
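As a rough illustration of the formula above, the sketch below computes KV cache size for one sequence. It uses `num_kv_heads`, the number of key/value heads, which equals the attention head count only when the model does not share key/value heads across query heads. The layer and head values in the example are hypothetical and are not DeepSeek V3's actual configuration; architectures that compress or share the cache will come in below this estimate.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """KV cache bytes stored for one token across all layers."""
    # Factor of 2: one key vector and one value vector per KV head per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

def kv_gb_per_sequence(context_tokens, num_layers, num_kv_heads, head_dim,
                       bytes_per_element=2):
    """KV cache for a single sequence of context_tokens tokens, in GB."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element)
    return context_tokens * per_token / 1e9

# Hypothetical model shape, chosen only to show the scaling with context length.
shape = dict(num_layers=32, num_kv_heads=4, head_dim=128, bytes_per_element=2)

for context in (32_768, 131_072, 1_000_000):
    print(f"{context:>9,} tokens -> {kv_gb_per_sequence(context, **shape):6.1f} GB")
```

The cost is linear in context length, so a 1M-token sequence needs about 30 times the cache of a 32K one.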

Combined with the already-constrained VRAM budget after MoE weights load, this means:

- One active 1M context sequence can block all other requests: it holds the entire remaining KV cache
- Preemption cascades are more likely: when KV cache fills, vLLM evicts sequences and recomputes them, but recomputation on a MoE model is expensive
- Throughput collapses at high concurrency: the straggler effect is magnified because each long-context sequence holds proportionally more of a tighter resource

This is the inference challenge that makes serving MoE at scale genuinely hard — and why platforms like Together AI have invested heavily in custom inference infrastructure for models like DeepSeek.

---

How to Configure vLLM for MoE

Step 1: Accept the memory constraint first

Before touching any other parameter, accept that the GPU requirement is determined by total parameters, not active parameters. There is no configuration trick that lets a 671B MoE model run on hardware sized for a 37B dense model.

Step 2: Set gpu_memory_utilization conservatively

For dense models, 0.85–0.90 is typical. For MoE, start lower — 0.80 or even 0.75 — because the model weights leave less predictable headroom and spikes are harder to recover from.

Step 3: Set max_num_seqs aggressively low

Start at 4–8 for large MoE models and benchmark before increasing. The cost of an OOM on a 16-GPU setup is much higher than the cost of slightly lower throughput from conservative concurrency.

Step 4: Match max_model_len to your actual workload

If you're not running 1M context requests, don't set max_model_len to 1M. Set it to your actual p99 request length. The VRAM savings are significant given how tight the budget already is.
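One way to derive that number from request logs is a nearest-rank percentile, sketched below. The function names, the 10% headroom, and the rounding to a multiple of 1024 are all assumptions made for illustration. The input should be total tokens per request (prompt plus generated), since max_model_len covers both.

```python
import math

def p99_length(token_counts):
    """Nearest-rank 99th percentile of per-request token counts."""
    if not token_counts:
        raise ValueError("need at least one observed request length")
    ordered = sorted(token_counts)
    # Integer form of ceil(0.99 * n), avoiding floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]

def suggested_max_model_len(token_counts, headroom=1.1, multiple=1024):
    """p99 length plus headroom, rounded up to a multiple (both are assumptions)."""
    target = p99_length(token_counts) * headroom
    return math.ceil(target / multiple) * multiple

# Stand-in data: replace with prompt + output token counts from your own logs.
observed = list(range(1, 1001))
print("p99 request length:", p99_length(observed))
print("suggested max_model_len:", suggested_max_model_len(observed))
```

Requests longer than the chosen limit will be rejected, so decide up front how the remaining 1% should be handled.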

Step 5: Enable expert parallelism

For large MoE models, expert parallelism distributes different experts across different GPUs, reducing the memory requirement per device and improving throughput. In vLLM, this is turned on with the --enable-expert-parallel flag, used together with --tensor-parallel-size to set how many GPUs the model spans.

vllm serve deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 16 \
  --enable-expert-parallel \
  --gpu-memory-utilization 0.80 \
  --max-num-seqs 8 \
  --max-model-len 32768

Step 6: Isolate long-context requests

If you're running both standard and long-context workloads, route them to separate serving instances. A single 1M context request on a shared cluster degrades everyone else. This is not a configuration fix — it's an architectural decision.
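A minimal sketch of that routing decision is shown below, assuming two separately deployed serving pools. The pool names, URLs, and context limits are hypothetical placeholders, and a production router would also need health checks, queueing, and a tokenizer to count prompt tokens.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pool:
    name: str
    base_url: str       # hypothetical endpoint, replace with your own
    max_model_len: int  # context limit the pool was launched with

# Assumed deployment: one pool sized for typical traffic, one for long context.
STANDARD = Pool("standard", "http://vllm-standard.internal:8000", 32_768)
LONG_CONTEXT = Pool("long-context", "http://vllm-long.internal:8000", 1_000_000)

def pick_pool(prompt_tokens, max_new_tokens):
    """Choose a serving pool from the request's total token budget."""
    total = prompt_tokens + max_new_tokens
    if total <= STANDARD.max_model_len:
        return STANDARD
    if total <= LONG_CONTEXT.max_model_len:
        return LONG_CONTEXT
    raise ValueError(f"request needs {total} tokens, more than any pool supports")

print(pick_pool(prompt_tokens=2_000, max_new_tokens=512).name)
print(pick_pool(prompt_tokens=200_000, max_new_tokens=1_024).name)
```

Keeping the threshold tied to each pool's max_model_len means the router and the serving configuration cannot drift apart silently.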

---

MoE Support in the vLLM Calculator

Our vLLM Configuration Calculator now supports MoE models natively. Select DeepSeek V3, Mixtral 8×7B, or Mixtral 8×22B from the model dropdown and the calculator will:

- Size GPU memory requirements against total parameters, not active parameters
- Apply a conservative gpu_memory_utilization of 0.75 instead of the dense-model default of 0.90
- Cap the starting max_num_seqs recommendation at 8 to account for expert routing pressure
- Disable speculative decoding recommendations for MoE models
- Show a per-model breakdown of total vs. active parameters

The rule of thumb holds: size for total parameters when planning hardware, size for active parameters when estimating compute cost.

---

The Bottom Line

MoE models are some of the most capable models available. They're also the easiest to misconfigure because the intuitions built on dense models don't transfer.

The core insight is simple: MoE models are defined by the gap between what they load (total parameters) and what they use (active parameters). Every configuration decision flows from understanding that gap.

Get the memory sizing right first. Everything else is tunable from there.

Paralleliq tracks GPU tier fit, KV cache pressure, and configuration drift across your inference fleet in real time. [Try piqc →](https://github.com/paralleliq/piqc)

