GPU Right-Sizing: Matching Tier to Workload

By Sam Hosseini · May 16, 2026 · 6 min read

Running a 7B model on an H100 is as wasteful as running a 70B model on an A10G. Right-sizing GPU tiers is one of the highest-leverage cost optimizations in inference — and most teams get it wrong.

The Two Directions of Mismatch

GPU tier mismatches run in both directions — and both are expensive.

Over-tiered: A small model on a high-end GPU. The model fits easily, runs fast, but consumes a fraction of the available VRAM and compute. You're paying for an H100 and getting A10G-level workload density.

Under-tiered: A large model crammed onto a GPU with insufficient VRAM. The model barely fits, KV cache is constrained, batch sizes are tiny, and the system runs at the edge of OOM. Latency suffers and stability is fragile.

Most teams discover mismatches reactively — after a cost audit or an OOM incident. The goal is to catch them proactively.

---

GPU Tier Reference for LLM Inference

| GPU | VRAM | Best Fit |
|---|---|---|
| A10G | 24 GB | 7B–13B models, moderate concurrency |
| L40S | 48 GB | 13B–34B models, higher concurrency |
| A100 40GB | 40 GB | 13B–34B models, training and inference |
| A100 80GB | 80 GB | 34B–70B models, high concurrency |
| H100 80GB | 80 GB | 70B models, maximum throughput |
| H100 NVL | 94 GB | 70B+ models, long context |

These are starting points. Actual fit depends on quantization, batch size, context length, and concurrency targets.

---

How to Right-Size a Workload

Step 1 — Measure actual VRAM consumption

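Don't estimate; measure. Deploy the model with realistic traffic and record peak VRAM usage. A single reading is only a snapshot, so sample repeatedly (here once per second with -l 1) and take the maximum. With nounits, the values are reported in MiB:

nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -l 1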

Add 15–20% headroom for KV cache growth under peak load.
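One way to capture a peak rather than a snapshot is to poll the same query from a small script. The sketch below is a minimal example, not a monitoring tool: the function names (`parse_memory_used`, `read_memory_used`, `track_peak`) are made up for illustration, and it assumes `nvidia-smi` is on the PATH and prints one MiB value per GPU.

```python
import subprocess
import time
from typing import Callable, List

# Same query as above; with "nounits" each line is a bare MiB value.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used",
    "--format=csv,noheader,nounits",
]


def parse_memory_used(output: str) -> List[int]:
    """Parse the query output: one integer (MiB) per line, one line per GPU."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def read_memory_used() -> List[int]:
    """Run the query once and return used memory in MiB for each GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_used(result.stdout)


def track_peak(
    sampler: Callable[[], List[int]], samples: int, interval_s: float = 1.0
) -> List[int]:
    """Call `sampler` `samples` times and return the per-GPU maximum in MiB."""
    peak: List[int] = []
    for i in range(samples):
        reading = sampler()
        if not peak:
            peak = list(reading)
        else:
            peak = [max(p, r) for p, r in zip(peak, reading)]
        if i < samples - 1:
            time.sleep(interval_s)
    return peak
```

During a load test, something like `track_peak(read_memory_used, samples=600)` would cover ten minutes at one-second resolution. The sampler is passed in as an argument so the logic can be exercised without a GPU.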

Step 2 — Calculate effective VRAM requirement

Required VRAM = Model weights + KV cache (peak) + Activation memory + 15–20% headroom

For a 70B model in FP16: ~140GB weights alone (2 bytes per parameter). That exceeds any single GPU in the table above, so it requires tensor parallelism across at least 2x H100 80GB, or quantization to fit on a single GPU.

For a 7B model in INT8: ~7GB weights. An A10G has substantial headroom for concurrent requests.
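The arithmetic behind these figures can be sketched in a few lines. This is a rough estimator under stated assumptions, not a substitute for measurement: the function names are illustrative, headroom is applied as a percentage of the summed components (15% by default, the low end of the guidance above), and the KV cache term uses the standard per-token size of 2 (K and V) x layers x KV heads x head dimension x bytes per value.

```python
def weights_gb(params_billions: float, bits_per_param: int) -> float:
    """Weight memory in GB: parameter count times bytes per parameter."""
    return params_billions * bits_per_param / 8


def kv_cache_gb(
    layers: int,
    kv_heads: int,
    head_dim: int,
    tokens_in_flight: int,
    bytes_per_value: int = 2,  # FP16 cache entries
) -> float:
    """KV cache in GB for all tokens held across concurrent sequences."""
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return bytes_per_token * tokens_in_flight / 1e9


def required_vram_gb(
    weights: float,
    kv_cache_peak: float,
    activations: float,
    headroom: float = 0.15,  # assumption: applied to the sum of the parts
) -> float:
    """Sum the components and add fractional headroom on top."""
    return (weights + kv_cache_peak + activations) * (1 + headroom)


# The two examples from the text:
fp16_70b = weights_gb(70, 16)  # 140.0 GB
int8_7b = weights_gb(7, 8)     # 7.0 GB
```

The model shape passed to `kv_cache_gb` (layers, KV heads, head dimension) comes from the model's own config; the estimate is only as good as the peak token count you feed it.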

Step 3 — Check SM utilization on the current tier

If SM utilization is consistently below 40% on an A100 or H100, the workload doesn't justify the tier. Move down.

If SM utilization is above 90% and latency is suffering, the workload has outgrown the tier. Move up or scale horizontally.
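The two rules above can be written as a small decision function. This is a sketch: the thresholds (40% and 90%) come from the text, but treating "consistently" as "at least 90% of samples" is an assumption of this example, as are the function name and the return labels.

```python
from typing import Sequence


def tier_recommendation(
    sm_util_pct: Sequence[float],
    latency_degraded: bool,
    low: float = 40.0,
    high: float = 90.0,
    consistency: float = 0.9,  # assumption: share of samples that must agree
) -> str:
    """Return 'move_down', 'move_up_or_scale_out', or 'keep'."""
    if not sm_util_pct:
        raise ValueError("need at least one utilization sample")
    n = len(sm_util_pct)
    frac_low = sum(u < low for u in sm_util_pct) / n
    frac_high = sum(u > high for u in sm_util_pct) / n
    if frac_low >= consistency:
        return "move_down"
    # High utilization alone is healthy; it only signals a problem
    # when latency is suffering as well.
    if frac_high >= consistency and latency_degraded:
        return "move_up_or_scale_out"
    return "keep"
```

Feed it samples collected over a representative window (at least a full traffic cycle), since a quiet hour on a busy service would otherwise look over-tiered.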

Step 4 — Factor in concurrency

A single 7B model on an A10G might run at 30% SM utilization. But with 8 concurrent requests, that same GPU might hit 85% — making the tier correct at scale even if it looks oversized at low traffic.

Right-sizing is a function of concurrent load, not just model size.
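To make that concrete, utilization can be bucketed by concurrency level and judged at the busiest level rather than on the overall average. The sketch below assumes you already log (concurrent requests, SM utilization) pairs; the function names and the sample values are illustrative.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def util_by_concurrency(samples: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Mean SM utilization (%) for each observed concurrency level."""
    buckets = defaultdict(list)
    for concurrent_requests, sm_util_pct in samples:
        buckets[concurrent_requests].append(sm_util_pct)
    return {level: mean(values) for level, values in sorted(buckets.items())}


def util_at_peak_load(samples: Iterable[Tuple[int, float]]) -> float:
    """Mean SM utilization at the highest concurrency level seen."""
    by_level = util_by_concurrency(samples)
    if not by_level:
        raise ValueError("need at least one sample")
    return by_level[max(by_level)]


# Illustrative samples mirroring the example in the text.
samples = [(1, 30.0), (8, 85.0), (1, 32.0), (8, 83.0)]
```

Here the GPU looks oversized at one request and well matched at eight, so the tier decision should be made against the second number.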

---

Common Mismatches and Their Cost

| Scenario | Symptom | Annual Waste (est.) |
|---|---|---|
| 7B model on H100 (low concurrency) | SM util < 20% | $40K–$80K per GPU |
| 70B model on A100 40GB | Constant OOM, tiny batches | Latency + reliability cost |
| 13B model on A10G at high concurrency | KV cache pressure, slow | Throughput ceiling hit |
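For the over-tiered case, a back-of-the-envelope estimate is the hourly price gap between the current tier and the right-sized tier, multiplied by the hours in a year. The sketch below shows only that arithmetic. The hourly rates are placeholders rather than quotes from any provider, and it does not attempt to reproduce the figures in the table, which depend on your own pricing.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760, assuming the GPU runs all year


def annual_waste_usd(
    current_hourly_usd: float,
    right_sized_hourly_usd: float,
    gpus: int = 1,
) -> float:
    """Yearly overspend from running on a pricier tier than needed."""
    delta = current_hourly_usd - right_sized_hourly_usd
    # A cheaper current tier is not "waste"; clamp at zero.
    return max(delta, 0.0) * HOURS_PER_YEAR * gpus


# Placeholder rates: $5.00/hr current tier vs $1.00/hr right-sized tier.
example = annual_waste_usd(5.0, 1.0)
```

Substitute your actual on-demand or committed rates; the gap varies widely by provider and contract.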

---

Quantization as a Right-Sizing Tool

Quantization reduces model weight size, usually with little accuracy loss, enabling a larger model to fit on a smaller (cheaper) GPU tier. The reductions below are relative to FP16 weights:

  • INT8 (bitsandbytes, LLM.int8()): ~50% VRAM reduction, minimal quality loss
  • AWQ / GPTQ (INT4): ~75% VRAM reduction, small quality trade-off
  • FP8 (H100-native): ~50% VRAM reduction, near-zero quality loss on supported hardware

Quantizing a 70B model to INT4 brings it from ~140GB to ~35GB — fitting comfortably on a single A100 80GB instead of requiring a multi-GPU setup.
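The same calculation can drive a first-pass tier choice. The sketch below uses the VRAM figures from the tier reference table earlier in this article; the function names are illustrative, and the 20 GB allowance for KV cache and activations in the usage note is a made-up figure, not a recommendation.

```python
from typing import Optional

# (name, VRAM in GB), taken from the tier reference table above.
TIERS = [
    ("A10G", 24),
    ("L40S", 48),
    ("A100 40GB", 40),
    ("A100 80GB", 80),
    ("H100 80GB", 80),
    ("H100 NVL", 94),
]


def quantized_weights_gb(params_billions: float, bits_per_param: int) -> float:
    """Weight size in GB at a given precision (16 = FP16, 8 = INT8/FP8, 4 = INT4)."""
    return params_billions * bits_per_param / 8


def smallest_tier(required_gb: float) -> Optional[str]:
    """Smallest single GPU whose VRAM covers the requirement, else None."""
    # sorted() is stable, so ties keep the table's order.
    for name, vram_gb in sorted(TIERS, key=lambda tier: tier[1]):
        if vram_gb >= required_gb:
            return name
    return None
```

For example, `quantized_weights_gb(70, 4)` gives 35.0 GB; adding a hypothetical 20 GB for KV cache and activations, `smallest_tier(55)` returns "A100 80GB". Passing the bare 35 GB would return "A100 40GB", which is exactly the barely-fits trap described earlier, so size against the full requirement and not the weights alone.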

---

Right-Sizing at Scale

Manual right-sizing works for a handful of models. At fleet scale — dozens of models, multiple clusters, mixed providers — it becomes untenable. Models get deployed and forgotten. Traffic patterns shift. New model versions change memory profiles.

Continuous right-sizing requires automated monitoring of VRAM headroom, SM utilization, and concurrency patterns — with alerts when a workload drifts outside its optimal tier range.
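As a rough illustration of what such a check might look like, the sketch below scans a list of per-workload snapshots and flags drift. It is a toy example under stated assumptions: the `WorkloadSnapshot` fields, the workload names, and the alert wording are all invented here, and the thresholds simply reuse the numbers from earlier in this article (15% VRAM headroom, 40% and 90% SM utilization).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class WorkloadSnapshot:
    name: str
    vram_peak_gb: float   # measured peak, not an estimate
    vram_total_gb: float  # capacity of the GPU it runs on
    sm_util_pct: float    # representative SM utilization under load


def drift_alerts(
    fleet: Iterable[WorkloadSnapshot],
    min_headroom: float = 0.15,
    low_util: float = 40.0,
    high_util: float = 90.0,
) -> List[str]:
    """Return one message per workload condition that is out of range."""
    alerts: List[str] = []
    for w in fleet:
        headroom = 1 - w.vram_peak_gb / w.vram_total_gb
        if headroom < min_headroom:
            alerts.append(f"{w.name}: VRAM headroom {headroom:.0%} is below target")
        if w.sm_util_pct < low_util:
            alerts.append(f"{w.name}: SM utilization {w.sm_util_pct:.0f}%, possibly over-tiered")
        elif w.sm_util_pct > high_util:
            alerts.append(f"{w.name}: SM utilization {w.sm_util_pct:.0f}%, possibly under-tiered")
    return alerts


# Hypothetical fleet for illustration.
fleet = [
    WorkloadSnapshot("chat-7b", 9.0, 80.0, 18.0),
    WorkloadSnapshot("summarize-70b", 76.0, 80.0, 93.0),
    WorkloadSnapshot("embed-13b", 30.0, 48.0, 65.0),
]
```

Run on a schedule against fresh metrics, a check like this surfaces the deployed-and-forgotten workloads; a production version would also need the concurrency context from Step 4 before recommending a move.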

---

Next in the GPU Ops Field Guide: [KV Cache Pressure: Symptoms, Causes, and Fixes →](/blog/gpu-ops-kv-cache-pressure)
