ParallelIQ
AI Infrastructure

15 Foundation Models, 15 Different vLLM Configs

By Sam Hosseini · June 4, 2026 · 7 min read
The open-weight model zoo now has 15+ production-grade options. Each one has a different architecture, memory profile, and vLLM configuration requirement. That's not a model selection problem — it's an ops problem.

A year ago, most teams running self-hosted LLM inference had one model. Maybe two.

Today the realistic answer to "which model are you running?" is often "it depends on the use case" — and the fleet reflects that. Llama 3.3 70B for general chat. DeepSeek V3 for coding. Qwen 2.5 for multilingual. Phi-3.5 for latency-sensitive edge cases.

The open-weight model zoo now has 15+ production-grade options across Meta, Mistral, DeepSeek, Alibaba, Microsoft, Google, NVIDIA, IBM, and others. New families drop every quarter. Most teams are evaluating at least two or three simultaneously.

This is not primarily a model selection problem. It's an ops problem.

---

Why Configuration Doesn't Transfer

The intuition that vLLM configuration is mostly portable — tune it once, adjust for scale — breaks immediately when you look at what configuration actually depends on.

Every key vLLM parameter is a function of model architecture:

max_num_seqs (the concurrency ceiling) depends on how much VRAM is left after the model weights load. That in turn depends on the model's size, quantization, and layer count.

gpu_memory_utilization (the fraction of each GPU's memory that vLLM may claim for weights, activations, and KV cache) needs to be more conservative for MoE architectures (0.75) than for dense models (0.90), because expert routing creates less predictable memory pressure.

tensor_parallel_size (the number of GPUs the model shards across) must be sized for total parameters, not active ones. An MoE model with 37B active parameters still needs the VRAM of its full parameter count.

max_model_len (the context window limit) sets the worst-case KV cache size per request, which grows linearly with context length. At 128K context, a single sequence can consume the entire remaining VRAM budget on a multi-GPU setup.
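To make the max_model_len point concrete, here is a rough sketch of per-sequence KV cache size for a standard attention layout. The layer count, KV head count, and head dimension are assumed values for a Llama-3-70B-class model with grouped-query attention, and the function names are illustrative rather than part of vLLM. Architectures that compress the KV cache differently will not follow this formula.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: each layer stores one key vector and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib_per_sequence(context_len, **shape):
    """Worst-case KV cache (GiB) for a single sequence filling the context window."""
    return kv_cache_bytes_per_token(**shape) * context_len / 2**30


# Assumed shape for a Llama-3-70B-class model (BF16 cache, grouped-query attention).
llama_70b = dict(num_layers=80, num_kv_heads=8, head_dim=128)

for context_len in (8_192, 32_768, 131_072):
    gib = kv_cache_gib_per_sequence(context_len, **llama_70b)
    print(f"max_model_len={context_len:>7}: {gib:.1f} GiB per sequence")
```

Under these assumptions a single full-length 128K sequence needs about 40 GiB of KV cache, which is why max_model_len and max_num_seqs have to be chosen together rather than independently.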

None of these transfer between model families. Here's what that looks like concretely:

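| Model | Type | VRAM (BF16) | Min GPUs (H100) | gpu_memory_utilization | Starting max_num_seqs |
|---|---|---|---|---|---|
| Llama 3 8B | Dense | 16 GB | 1 | 0.90 | 32–64 |
| Llama 3 70B | Dense | 140 GB | 2 | 0.90 | 16–32 |
| Mixtral 8×7B | MoE | 94 GB | 2 | 0.75 | 4–8 |
| Mixtral 8×22B | MoE | 282 GB | 4 (INT8) | 0.75 | 4–8 |
| DeepSeek V3 | MoE | 1,342 GB | 17 (BF16) / 9 (INT8) | 0.75 | 4–8 |
| Qwen 2.5 72B | Dense | 144 GB | 2 | 0.90 | 16–32 |
| Phi-3.5 Mini | Dense | 8 GB | 1 | 0.90 | 32–64 |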

Applying a Llama 8B configuration to DeepSeek V3 doesn't produce slightly wrong results. It produces immediate OOM errors before the first request is served.
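The VRAM and minimum-GPU figures for several rows can be approximated with simple arithmetic. The sketch below assumes 80 GB per H100 and counts only weight memory, so it gives a lower bound: it ignores KV cache, activations, and the practical requirement that the tensor-parallel size divide the attention head count evenly. The helper names are hypothetical.

```python
import math

H100_GB = 80  # assumed per-GPU memory for an H100


def weight_gb(total_params_billion, bytes_per_param):
    """Weight footprint in GB: billions of parameters times bytes per parameter."""
    return total_params_billion * bytes_per_param


def min_gpus(total_params_billion, bytes_per_param, gpu_gb=H100_GB):
    """Smallest GPU count whose combined memory holds the weights alone."""
    return math.ceil(weight_gb(total_params_billion, bytes_per_param) / gpu_gb)


# Total (not active) parameter counts in billions.
models = {"Llama 3 8B": 8, "Llama 3 70B": 70, "Qwen 2.5 72B": 72, "DeepSeek V3": 671}

for name, params_b in models.items():
    print(f"{name}: {weight_gb(params_b, 2)} GB in BF16, "
          f"at least {min_gpus(params_b, 2)} GPU(s)")
```

Because this is only a floor on GPU count, a real deployment typically rounds up to a tensor-parallel size the model supports and leaves room for the KV cache.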

---

The MoE Configuration Gap

The Mixture of Experts models deserve special attention because the misconfiguration risk is highest and the intuition failure is most complete.

The instinct when looking at DeepSeek V3 is to treat it like a very large dense model and configure it the way you'd configure any 671B-parameter model. That's partially right on memory but completely wrong on concurrency and throughput.

DeepSeek V3 has 671B total parameters but only ~37B activate per token. The router selects 8 experts out of 256 for each token position. This creates a profile that looks nothing like a dense model of equivalent size:

  • Memory requirement: determined by total parameters (671B → 1,342 GB in BF16)
  • Compute cost: determined by active parameters (~37B → similar throughput to a 37B dense model per GPU)
  • KV cache headroom: extremely tight — almost all VRAM is consumed by model weights before KV cache even starts
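The split between memory and compute in the list above can be sketched numerically. This example uses the article's parameter counts and the common rule of thumb of roughly 2 FLOPs per active parameter per generated token. That rule is an approximation, and the class and the dense comparison model are hypothetical illustrations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    total_params_b: float   # all parameters, in billions (drives memory)
    active_params_b: float  # parameters used per token, in billions (drives compute)

    def weight_gb(self, bytes_per_param=2):
        """Weight memory in GB; every expert must be resident, so use the total."""
        return self.total_params_b * bytes_per_param

    def gflops_per_token(self):
        """Approximate forward-pass GFLOPs per token (about 2 per active parameter)."""
        return 2 * self.active_params_b


deepseek_v3 = ModelProfile("DeepSeek V3", total_params_b=671, active_params_b=37)
# Hypothetical dense model of the same total size, for comparison only.
dense_671b = ModelProfile("Dense 671B", total_params_b=671, active_params_b=671)

for m in (deepseek_v3, dense_671b):
    print(f"{m.name}: {m.weight_gb()} GB of weights, "
          f"~{m.gflops_per_token()} GFLOPs per token")
```

Both profiles need the same weight memory, but the MoE model does a small fraction of the per-token compute, which is why sizing it like a dense model gets memory roughly right and throughput wrong.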

A correct DeepSeek V3 configuration on 16× H100 INT8 looks like:

vllm serve deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 16 \
  --gpu-memory-utilization 0.75 \
  --max-num-seqs 8 \
  --max-model-len 32768 \
  --quantization int8

A naively ported dense-model configuration on the same hardware, with gpu_memory_utilization at 0.90 and max_num_seqs at 32, will OOM within the first few concurrent requests.

---

The Org Problem

The configuration complexity compounds when you have more than one model in production simultaneously, which is increasingly the default state.

Who owns the vLLM configuration for each model? In most teams, the answer is "whoever deployed it" — which means configurations are set once at deploy time and rarely revisited. Traffic patterns shift. Models get upgraded. New quantization options become available. The configuration drifts from optimal.

At one model, this is manageable. At five, it becomes a background tax on every team that touches inference infrastructure. The models are evolving faster than the operations practice around them.

---

What Model-Aware Infrastructure Looks Like

The response to model proliferation isn't to become an expert in every model architecture. It's to build or adopt infrastructure that is.

Concretely, that means:

Per-model configuration baselines: not a shared vLLM default applied to every deployment, but model-specific starting points that account for architecture, quantization, and GPU tier.

Model-aware observability: KV cache utilization, OOM rates, and time per output token (TPOT) tracked per model, not just per cluster. A spike in KV cache pressure on your DeepSeek V3 instance needs a different response than the same spike on Llama 8B.

Configuration drift detection: automated alerting when observed fleet behavior diverges from the expected profile for a given model. A config that was correct at deploy time becomes wrong as traffic patterns change.

Upgrade impact analysis: before switching from Mixtral 8×22B to DeepSeek V3, know what happens to your GPU requirement, KV cache budget, and p95 latency. These numbers should be derived from the model's known architecture profile, not re-estimated from first principles each time.
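As a minimal sketch of the first and third items, a per-model baseline can be kept as data and compared against what is actually running. The registry keys, the Llama model ID, and its max_model_len are assumptions for illustration. The DeepSeek V3 values mirror the command shown earlier, without its quantization flag. Comparing settings is a simplified stand-in for drift detection on observed metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    model: str
    tensor_parallel_size: int
    gpu_memory_utilization: float
    max_num_seqs: int
    max_model_len: int

    def serve_args(self):
        """Render the baseline as a `vllm serve` argument list."""
        return [
            "vllm", "serve", self.model,
            "--tensor-parallel-size", str(self.tensor_parallel_size),
            "--gpu-memory-utilization", str(self.gpu_memory_utilization),
            "--max-num-seqs", str(self.max_num_seqs),
            "--max-model-len", str(self.max_model_len),
        ]


# Hypothetical registry; values are starting points, not tuned settings.
BASELINES = {
    "llama-3-8b": Baseline("meta-llama/Meta-Llama-3-8B-Instruct", 1, 0.90, 32, 8192),
    "deepseek-v3": Baseline("deepseek-ai/DeepSeek-V3", 16, 0.75, 8, 32768),
}


def drifted_fields(baseline, running):
    """Return {setting: (expected, actual)} for running settings that differ."""
    return {
        key: (getattr(baseline, key), actual)
        for key, actual in running.items()
        if getattr(baseline, key) != actual
    }


# A dense-model config ported onto the MoE deployment shows up as drift.
running = {"gpu_memory_utilization": 0.90, "max_num_seqs": 32, "max_model_len": 32768}
print(" ".join(BASELINES["deepseek-v3"].serve_args()))
print(drifted_fields(BASELINES["deepseek-v3"], running))
```

Keeping baselines as data rather than as shell scripts makes it possible to diff two model profiles before an upgrade, using the same comparison.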

---

The Trend Line

The model zoo is not going to shrink. The teams building foundation models are well-funded, technically differentiated, and releasing on quarterly cadences. Each release cycle adds new architecture variations — new quantization approaches, new context length capabilities, new MoE configurations — that require fresh configuration work.

The teams that treat this as a one-time configuration problem will spend increasing engineering time re-solving it. The teams that build model-aware infrastructure around it will absorb new model releases without the ops tax.

The calculator we've built is a starting point — select your model and GPU tier and get a baseline configuration that accounts for architecture. For production fleets running multiple models simultaneously, piqc scans your running cluster and surfaces model-specific configuration gaps in real time.

The model proliferation problem is only getting harder. The infrastructure layer needs to keep up.
