Does the same vLLM configuration work for different models?

No. vLLM configuration is model-specific. Key parameters — max_num_seqs, gpu_memory_utilization, tensor_parallel_size, and max_model_len — depend directly on the model's parameter count, layer structure, KV head configuration, and whether it uses a dense or Mixture of Experts (MoE) architecture. A configuration tuned for Llama 3 8B will produce OOM errors or severe underutilization if applied to DeepSeek V3 or Mixtral 8x22B.

How does model architecture affect vLLM configuration?

Model architecture determines three critical vLLM inputs: how much VRAM the model weights consume (setting the floor for GPU requirements), how much VRAM remains for KV cache (setting the ceiling for concurrency), and what throughput the hardware can sustain (setting the bound on requests per second). MoE models require special handling — they must load all expert weights into VRAM even though only a fraction activate per token, making the VRAM requirement far larger than the active compute profile would suggest.

What vLLM settings need to change when switching from a dense model to a MoE model?

Three settings must change when moving from a dense model to a MoE model in vLLM: (1) gpu_memory_utilization should drop from 0.90 to 0.75 — expert routing creates less predictable memory pressure; (2) max_num_seqs should start at 4–8 rather than higher values — the tight KV budget makes OOM recovery harder; (3) tensor_parallel_size must be sized for total parameters, not active parameters. A MoE model with 37B active parameters still requires the VRAM of its full 671B parameter count.

How do I manage vLLM configuration across multiple models in production?

Managing vLLM configuration across multiple models requires treating each model as a distinct infrastructure profile. Key practices: maintain per-model configuration baselines (not a shared default), monitor KV cache utilization and OOM rates per model separately, and automate configuration drift detection — a config that was correct at deployment can become wrong when traffic patterns shift. Tools like piqc scan your running cluster and surface model-specific configuration gaps without requiring write access.


15 Foundation Models, 15 Different vLLM Configs

By Sam Hosseini · June 4, 2026 · 7 min read

The open-weight model zoo now has 15+ production-grade options. Each one has a different architecture, memory profile, and vLLM configuration requirement. That's not a model selection problem — it's an ops problem.

A year ago, most teams running self-hosted LLM inference had one model. Maybe two.

Today the realistic answer to "which model are you running?" is often "it depends on the use case" — and the fleet reflects that. Llama 3.3 70B for general chat. DeepSeek V3 for coding. Qwen 2.5 for multilingual. Phi-3.5 for latency-sensitive edge cases.

The open-weight model zoo now has 15+ production-grade options across Meta, Mistral, DeepSeek, Alibaba, Microsoft, Google, NVIDIA, IBM, and others. New families drop every quarter. Most teams are evaluating at least two or three simultaneously.

This is not primarily a model selection problem. It's an ops problem.

---

Why Configuration Doesn't Transfer

The intuition that vLLM configuration is mostly portable — tune it once, adjust for scale — breaks immediately when you look at what configuration actually depends on.

Every key vLLM parameter is a function of model architecture:

max_num_seqs — the concurrency ceiling — depends on how much VRAM is left after model weights load. That depends on the model's size, quantization, and layer count.

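gpu_memory_utilization (the fraction of each GPU's VRAM that vLLM may claim for weights, activations, and KV cache combined) needs to be conservative for MoE architectures (0.75) versus dense models (0.90) because expert routing creates less predictable memory pressure. Whatever remains inside that fraction after the weights load is the KV cache budget.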

tensor_parallel_size — the number of GPUs the model shards across — must be sized for total parameters, not active ones. A MoE model with 37B active parameters still needs the VRAM of its full parameter count.

max_model_len — the context window limit — directly multiplies KV cache per request. At 128K context, a single sequence can consume the entire remaining VRAM budget on a multi-GPU setup.
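To make the max_model_len point concrete, here is a minimal sketch of the usual per-token KV cache estimate for a model with grouped-query attention: a K and a V tensor for every layer, each of size num_kv_heads × head_dim. The architecture numbers are the published Llama 3 8B and 70B values. The function names and the BF16 cache assumption are illustrative, and the estimate ignores vLLM's block allocation overhead. It does not apply to compressed-KV schemes such as DeepSeek's multi-head latent attention.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies (K and V for every layer)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_gib_per_sequence(per_token_bytes: int, max_model_len: int) -> float:
    """GiB of KV cache a single sequence needs if it fills the whole context."""
    return per_token_bytes * max_model_len / 2**30


# Published architecture values: (layers, KV heads, head dim).
LLAMA3_8B = (32, 8, 128)
LLAMA3_70B = (80, 8, 128)

for name, arch in [("Llama 3 8B", LLAMA3_8B), ("Llama 3 70B", LLAMA3_70B)]:
    per_token = kv_bytes_per_token(*arch)
    for ctx in (8_192, 131_072):
        gib = kv_gib_per_sequence(per_token, ctx)
        print(f"{name}: {per_token // 1024} KiB/token, "
              f"{gib:.1f} GiB per sequence at max_model_len={ctx}")
```

Under these assumptions, one full 128K-token sequence on Llama 3 70B needs about 40 GiB of KV cache, which is why max_model_len and max_num_seqs have to be chosen together rather than independently.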

None of these transfer between model families. Here's what that looks like concretely:

Model	Type	VRAM (BF16)	Min GPUs (H100)	gpu_memory_utilization	Starting max_num_seqs
Llama 3 8B	Dense	16 GB	1	0.90	32–64
Llama 3 70B	Dense	140 GB	2	0.90	16–32
Mixtral 8×7B	MoE	94 GB	2	0.75	4–8
Mixtral 8×22B	MoE	282 GB	4 (INT8)	0.75	4–8
DeepSeek V3	MoE	1,342 GB	17 (BF16) / 9 (INT8)	0.75	4–8
Qwen 2.5 72B	Dense	144 GB	2	0.90	16–32
Phi-3.5 Mini	Dense	8 GB	1	0.90	32–64
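The memory columns for several rows can be reproduced with simple arithmetic: weight footprint is parameter count times bytes per parameter, and the GPU count is that figure divided by per-GPU memory, rounded up. The sketch below assumes 80 GB H100s and nominal parameter counts; the helper names are illustrative. It counts weights only, so it gives a floor rather than a deployable size: KV cache, activations, and the gpu_memory_utilization cap all push the real requirement higher.

```python
import math

H100_GB = 80  # assumption: the 80 GB H100 variant


def weight_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Weight footprint in GB: 2 bytes/param for BF16, 1 for INT8."""
    return params_billion * bytes_per_param


def min_gpus_for_weights(weights_gb: float, gpu_gb: float = H100_GB) -> int:
    """Smallest GPU count whose combined memory holds the weights alone."""
    return math.ceil(weights_gb / gpu_gb)


# Nominal total parameter counts, in billions.
MODELS = {
    "Llama 3 8B": 8,
    "Llama 3 70B": 70,
    "Qwen 2.5 72B": 72,
    "DeepSeek V3": 671,  # total parameters, not the ~37B active per token
}

for name, params in MODELS.items():
    bf16 = weight_gb(params, 2)
    int8 = weight_gb(params, 1)
    print(f"{name}: {bf16:.0f} GB BF16 -> {min_gpus_for_weights(bf16)} GPUs, "
          f"{int8:.0f} GB INT8 -> {min_gpus_for_weights(int8)} GPUs")
```

Note that the MoE row is sized from total parameters: feeding the ~37B active count into the same function would suggest a single GPU pair, which is the sizing mistake the table is meant to prevent.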

Applying a Llama 3 8B configuration to DeepSeek V3 doesn't produce slightly wrong results. It produces immediate OOM errors before the first request is served.

---

The MoE Configuration Gap

The Mixture of Experts models deserve special attention because the misconfiguration risk is highest and the intuition failure is most complete.

The instinct when looking at DeepSeek V3 is to treat it like a very large dense model — configure it the way you'd configure a 671B parameter model. That's partially right on memory but completely wrong on concurrency and throughput.

DeepSeek V3 has 671B total parameters but only ~37B activate per token. The router selects 8 experts out of 256 for each token position. This creates a profile that looks nothing like a dense model of equivalent size:

Memory requirement: determined by total parameters (671B → 1,342 GB in BF16)
Compute cost: determined by active parameters (~37B → similar throughput to a 37B dense model per GPU)
KV cache headroom: extremely tight — almost all VRAM is consumed by model weights before KV cache even starts

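A correct DeepSeek V3 configuration on 16× H100 with INT8 weights looks like this (the value --quantization accepts depends on your vLLM version and checkpoint format; check vllm serve --help):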

vllm serve deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 16 \
  --gpu-memory-utilization 0.75 \
  --max-num-seqs 8 \
  --max-model-len 32768 \
  --quantization int8

A naively ported dense-model configuration on the same hardware, with gpu_memory_utilization at 0.90 and max_num_seqs at 32, will OOM within the first few concurrent requests.
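The memory arithmetic behind a configuration like the one above can be sketched as follows. gpu_memory_utilization caps the share of each GPU that vLLM may claim, and whatever is left inside that cap after the weights load is what the KV cache gets. The function names are illustrative. The sketch assumes 80 GB H100s and 1 byte per parameter for INT8, and it ignores activation memory, CUDA graphs, and fragmentation, so real headroom is smaller than what it reports.

```python
def kv_headroom_gb(num_gpus: int, gpu_gb: float,
                   gpu_memory_utilization: float, weights_gb: float) -> float:
    """GB left for KV cache inside vLLM's memory cap after weights load.

    A negative result means the weights alone exceed the cap.
    """
    budget = num_gpus * gpu_gb * gpu_memory_utilization
    return budget - weights_gb


def headroom_share(num_gpus: int, gpu_gb: float,
                   gpu_memory_utilization: float, weights_gb: float) -> float:
    """Headroom as a fraction of total VRAM across all GPUs."""
    total = num_gpus * gpu_gb
    return kv_headroom_gb(num_gpus, gpu_gb, gpu_memory_utilization, weights_gb) / total


# DeepSeek V3, INT8 weights (~671 GB), 16x H100 80 GB at 0.75.
deepseek = kv_headroom_gb(16, 80, 0.75, 671)
# Llama 3 8B, BF16 weights (~16 GB), 1x H100 80 GB at 0.90.
llama = kv_headroom_gb(1, 80, 0.90, 16)

print(f"DeepSeek V3: {deepseek:.0f} GB headroom "
      f"({headroom_share(16, 80, 0.75, 671):.0%} of total VRAM)")
print(f"Llama 3 8B:  {llama:.0f} GB headroom "
      f"({headroom_share(1, 80, 0.90, 16):.0%} of total VRAM)")
```

Per GPU, that works out to roughly 18 GB of headroom for DeepSeek V3 against 56 GB for Llama 3 8B, before activations are subtracted. The same function returns a negative number when the weights do not fit under the cap, which is a quick way to reject a configuration before deploying it.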

---

The Org Problem

The configuration complexity compounds when you have more than one model in production simultaneously, which is increasingly the default state.

Who owns the vLLM configuration for each model? In most teams, the answer is "whoever deployed it" — which means configurations are set once at deploy time and rarely revisited. Traffic patterns shift. Models get upgraded. New quantization options become available. The configuration drifts from optimal.

At one model, this is manageable. At five, it becomes a background tax on every team that touches inference infrastructure. The models are evolving faster than the operations practice around them.

---

What Model-Aware Infrastructure Looks Like

The response to model proliferation isn't to become an expert in every model architecture. It's to build or adopt infrastructure that is.

Concretely, that means:

Per-model configuration baselines — not a shared vLLM default applied to every deployment, but model-specific starting points that account for architecture, quantization, and GPU tier.

Model-aware observability: KV cache utilization, OOM rates, and TPOT (time per output token) tracked per model, not just per cluster. A spike in KV cache pressure on your DeepSeek V3 instance needs a different response than the same spike on Llama 3 8B.

Configuration drift detection — automated alerting when observed fleet behavior diverges from the expected profile for a given model. A config that was correct at deploy time becomes wrong as traffic patterns change.

Upgrade impact analysis — before switching from Mixtral 8×22B to DeepSeek V3, know what happens to your GPU requirement, KV cache budget, and p95 latency. Not estimated from first principles each time — derived from the model's known architecture profile.
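One way to picture per-model baselines and drift detection together is a small registry keyed by architecture type, plus a check that compares observed metrics against each baseline's expected range. Everything here is a hypothetical sketch: the Baseline fields, the threshold values, and the observed-metrics dictionary are illustrative assumptions, not the schema of vLLM, piqc, or any monitoring tool. The starting values are drawn from the ranges in the table earlier in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    """Hypothetical per-model starting configuration and alert thresholds."""
    gpu_memory_utilization: float
    max_num_seqs: int
    kv_cache_util_max: float  # assumed alert threshold (fraction of KV cache in use)
    oom_per_hour_max: int = 0

    def serve_args(self, model: str, tensor_parallel_size: int) -> list[str]:
        """Build a `vllm serve` argument list from this baseline."""
        return [
            "vllm", "serve", model,
            "--tensor-parallel-size", str(tensor_parallel_size),
            "--gpu-memory-utilization", str(self.gpu_memory_utilization),
            "--max-num-seqs", str(self.max_num_seqs),
        ]


# Starting points from the table; alert thresholds are illustrative.
BASELINES = {
    "dense": Baseline(0.90, 32, kv_cache_util_max=0.90),
    "moe": Baseline(0.75, 8, kv_cache_util_max=0.80),
}


def drift_findings(baseline: Baseline, observed: dict[str, float]) -> list[str]:
    """List the ways observed metrics fall outside the baseline's range."""
    findings = []
    if observed.get("kv_cache_util", 0.0) > baseline.kv_cache_util_max:
        findings.append("KV cache utilization above baseline threshold")
    if observed.get("oom_per_hour", 0) > baseline.oom_per_hour_max:
        findings.append("OOM rate above baseline threshold")
    return findings


print(" ".join(BASELINES["moe"].serve_args("deepseek-ai/DeepSeek-V3", 16)))
print(drift_findings(BASELINES["moe"], {"kv_cache_util": 0.93, "oom_per_hour": 2}))
```

Keeping thresholds on the baseline rather than in a cluster-wide alert rule is the design choice that matters: the same observed KV cache utilization can be healthy for a dense model and a warning sign for an MoE deployment.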

---

The Trend Line

The model zoo is not going to shrink. The teams building foundation models are well-funded, technically differentiated, and releasing on quarterly cadences. Each release cycle adds new architecture variations — new quantization approaches, new context length capabilities, new MoE configurations — that require fresh configuration work.

The teams that treat this as a one-time configuration problem will spend increasing engineering time re-solving it. The teams that build model-aware infrastructure around it will absorb new model releases without the ops tax.

The calculator we've built is a starting point — select your model and GPU tier and get a baseline configuration that accounts for architecture. For production fleets running multiple models simultaneously, piqc scans your running cluster and surfaces model-specific configuration gaps in real time.

The model proliferation problem is only getting harder. The infrastructure layer needs to keep up.
