How do I right-size GPU tiers for LLM inference?

Measure actual VRAM consumption under realistic traffic, add 15–20% headroom for KV cache growth, and compare the result against available GPU tiers. Then check SM utilization on the current tier: consistently below 40% means the workload is over-tiered, while above 90% with latency problems means it has outgrown the tier. Right-sizing is a function of concurrent load, not just model size. A 7B model can look over-tiered on an A10G at low traffic yet be well sized at 8 concurrent requests.

What GPU tier do I need for a 70B model?

A 70B model in FP16 needs approximately 140GB of VRAM for weights alone, so it requires either tensor parallelism across at least 2x H100 80GB GPUs or quantization. With INT4 quantization (AWQ or GPTQ), the weights shrink to roughly 35GB, which fits on a single A100 80GB with headroom for KV cache. The right tier also depends on your concurrency target and context length requirements.

What is GPU tier misplacement and what does it cost?

GPU tier misplacement means a model is running on a GPU that is either too expensive for its workload (over-tiered) or too constrained (under-tiered). Over-tiering is the more common and costly pattern — a 7B model on an H100 when an A10G would suffice wastes $40K–$80K per GPU annually. Under-tiering causes constant OOM events, tiny batch sizes, and reliability problems that cost through latency and engineering time.

How does quantization change which GPU tier I need?

Quantization reduces model weight size, typically without significant accuracy loss, so larger models can run on smaller GPU tiers. Relative to FP16, INT8 cuts weight memory by ~50%, INT4 (AWQ/GPTQ) by ~75%, and FP8 on H100 by ~50% with near-zero quality loss. Quantizing a 70B model to INT4 brings it from ~140GB to ~35GB, potentially fitting on a single A100 80GB instead of a multi-GPU setup and cutting infrastructure cost by 60–70%.

GPU Ops Field Guide

GPU Right-Sizing: Matching Tier to Workload

By Sam Hosseini · May 16, 2026 · 6 min read

Running a 7B model on an H100 is as wasteful as running a 70B model on an A10G. Right-sizing GPU tiers is one of the highest-leverage cost optimizations in inference — and most teams get it wrong.

The Two Directions of Mismatch

GPU tier mismatches run in both directions — and both are expensive.

Over-tiered: A small model on a high-end GPU. The model fits easily, runs fast, but consumes a fraction of the available VRAM and compute. You're paying for an H100 and getting A10G-level workload density.

Under-tiered: A large model crammed onto a GPU with insufficient VRAM. The model barely fits, KV cache is constrained, batch sizes are tiny, and the system runs at the edge of OOM. Latency suffers and stability is fragile.

Most teams discover mismatches reactively — after a cost audit or an OOM incident. The goal is to catch them proactively.

---

GPU Tier Reference for LLM Inference

GPU	VRAM	Best Fit
A10G	24 GB	7B–13B models, moderate concurrency
L40S	48 GB	13B–34B models, higher concurrency
A100 40GB	40 GB	13B–34B models, training and inference
A100 80GB	80 GB	34B–70B models, high concurrency
H100 80GB	80 GB	70B models, maximum throughput
H100 NVL	94 GB	70B+ models, long context

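These are starting points. Actual fit depends on quantization, batch size, context length, and concurrency targets. The upper end of each range generally assumes quantized weights: a 13B model in FP16 needs about 26 GB for weights alone, which already exceeds the A10G's 24 GB.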
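One way to use the table programmatically is a lookup that returns every tier whose VRAM covers a measured requirement. The sketch below is illustrative only: GPU_TIERS_GB copies the VRAM column above, and tiers_that_fit is a hypothetical helper name. It filters on VRAM alone, so price, compute, and availability still need a separate decision.

```python
# VRAM per GPU in GB, copied from the tier reference table above.
GPU_TIERS_GB = {
    "A10G": 24,
    "L40S": 48,
    "A100 40GB": 40,
    "A100 80GB": 80,
    "H100 80GB": 80,
    "H100 NVL": 94,
}


def tiers_that_fit(required_gb: float) -> list[str]:
    """Return the tiers whose VRAM covers the requirement, smallest first."""
    fitting = [(vram, name) for name, vram in GPU_TIERS_GB.items()
               if vram >= required_gb]
    # Sorting the (vram, name) tuples orders by VRAM, then by name for ties.
    return [name for _, name in sorted(fitting)]


# A workload that needs 30 GB (weights + KV cache + headroom) rules out the A10G.
print(tiers_that_fit(30))
```

An empty list means no single GPU in the table fits, which is the signal to consider quantization or tensor parallelism.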

---

How to Right-Size a Workload

Step 1 — Measure actual VRAM consumption

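Don't estimate; measure. Deploy the model with realistic traffic and sample VRAM usage for the whole test run, then take the highest reading. A single query only shows the current value, so use the -l 1 flag to repeat it every second:

nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -l 1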

Add 15–20% headroom for KV cache growth under peak load.
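The measurement step can be scripted. The sketch below assumes a single-GPU host and that the nvidia-smi readings (MiB, one value per line) were captured to text during a load test. The function names are hypothetical, and the sample values are made up for illustration.

```python
import subprocess


def parse_memory_used_mib(csv_text: str) -> list[int]:
    """Parse csv,noheader,nounits output: one MiB reading per non-empty line."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def sample_memory_used_mib() -> list[int]:
    """Take one live reading per GPU. Requires nvidia-smi on PATH."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_used_mib(result.stdout)


def peak_with_headroom_mib(samples: list[int], headroom_pct: int = 20) -> float:
    """Highest recorded reading plus a percentage of headroom."""
    return max(samples) * (100 + headroom_pct) / 100


# Made-up readings standing in for a captured load-test log.
captured = "15200\n17850\n18100\n"
readings = parse_memory_used_mib(captured)
print(max(readings), peak_with_headroom_mib(readings))
```

The result is in MiB, so convert before comparing it against the GB figures in a tier table.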

Step 2 — Calculate effective VRAM requirement

Required VRAM = Model weights + KV cache (peak) + Activation memory + 15–20% headroom

For a 70B model in FP16: ~140GB weights alone. That requires tensor parallelism across 2x H100 80GB, or quantization to fit on a single GPU.

For a 7B model in INT8: ~7GB weights. An A10G has substantial headroom for concurrent requests.
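The formula above translates directly into a small helper. This is a sketch under one assumption the article does not spell out: headroom is applied as a percentage of the summed components. The KV cache and activation figures in the example are hypothetical placeholders, not measurements.

```python
def required_vram_gb(weights_gb: float,
                     kv_cache_peak_gb: float,
                     activation_gb: float,
                     headroom: float = 0.15) -> float:
    """Sum the three components, then add fractional headroom on top.

    Assumption: headroom scales the whole sum (0.15 means +15%).
    """
    base = weights_gb + kv_cache_peak_gb + activation_gb
    return base * (1 + headroom)


# 7B model in INT8: ~7 GB of weights (from the text above).
# The 8 GB KV cache and 1 GB activation figures are illustrative only.
estimate = required_vram_gb(weights_gb=7, kv_cache_peak_gb=8, activation_gb=1)
print(round(estimate, 1))  # 18.4
```

With those placeholder inputs the estimate stays under the A10G's 24 GB, which matches the headroom claim for the 7B INT8 case.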

Step 3 — Check SM utilization on the current tier

If SM utilization is consistently below 40% on an A100 or H100, the workload doesn't justify the tier. Move down.

If SM utilization is above 90% and latency is suffering, the workload has outgrown the tier. Move up or scale horizontally.
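The two thresholds can be encoded as a simple classifier. In this sketch, "consistently" is interpreted as at least 95% of samples crossing the threshold. That cutoff is an assumption, as are the function name and return strings. The samples are assumed to be SM utilization percentages from whatever metrics pipeline you already run.

```python
def classify_tier_fit(sm_util_pct: list[float],
                      latency_suffering: bool,
                      low: float = 40.0,
                      high: float = 90.0,
                      consistency: float = 0.95) -> str:
    """Apply the 40% / 90% rules of thumb to a window of utilization samples."""
    if not sm_util_pct:
        raise ValueError("need at least one utilization sample")
    n = len(sm_util_pct)
    share_below = sum(u < low for u in sm_util_pct) / n
    share_above = sum(u > high for u in sm_util_pct) / n
    if share_below >= consistency:
        return "over-tiered: move down"
    # High utilization alone is fine; it only matters when latency degrades.
    if share_above >= consistency and latency_suffering:
        return "outgrown: move up or scale horizontally"
    return "ok"


print(classify_tier_fit([22, 31, 28, 35], latency_suffering=False))
print(classify_tier_fit([93, 96, 95, 97], latency_suffering=True))
```

Note that a GPU running above 90% with healthy latency is classified as "ok": high utilization without latency problems is the desired state, not a mismatch.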

Step 4 — Factor in concurrency

A single 7B model on an A10G might run at 30% SM utilization. But with 8 concurrent requests, that same GPU might hit 85% — making the tier correct at scale even if it looks oversized at low traffic.

Right-sizing is a function of concurrent load, not just model size.
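Concurrency also drives memory, because each in-flight request holds its own KV cache. A common back-of-the-envelope estimate is 2 (keys and values) × layers × KV heads × head dimension × bytes per element, per token. The sketch below assumes dimensions typical of a 7B decoder (32 layers, 32 KV heads, head dimension 128, FP16 cache); check your model's config, since these values are not taken from the article. It is also a worst case in which every request fills its full token budget.

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 tokens_per_request: int, concurrent_requests: int,
                 bytes_per_elem: int = 2) -> float:
    """Worst-case KV cache size in GiB for a given concurrency level."""
    # Factor of 2 covers one key vector and one value vector per token.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    total_bytes = bytes_per_token * tokens_per_request * concurrent_requests
    return total_bytes / 2**30


# Assumed 7B-class dimensions, 2048 tokens per request.
for n in (1, 8):
    print(n, kv_cache_gib(32, 32, 128, tokens_per_request=2048,
                          concurrent_requests=n))
# 1 -> 1.0 GiB, 8 -> 8.0 GiB
```

Under these assumptions, going from 1 to 8 concurrent requests adds about 7 GiB of cache, which is why the same GPU can look oversized at low traffic and well matched under load.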

---

Common Mismatches and Their Cost

Scenario	Symptom	Cost (est.)
7B model on H100 (low concurrency)	SM util < 20%	$40K–$80K wasted per GPU annually
70B model (INT4-quantized) on A100 40GB	Constant OOM, tiny batches	Latency + reliability cost
13B model on A10G at high concurrency	KV cache pressure, slow	Throughput ceiling hit
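The over-tiering figure is simple arithmetic: the hourly price gap between the current tier and the right-sized tier, multiplied by the hours in a year. The sketch below shows both directions of that calculation. The hourly rates in the example are hypothetical placeholders, not quotes from any provider.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, assuming the GPU runs all year


def annual_waste_usd(current_hourly_usd: float,
                     right_sized_hourly_usd: float) -> float:
    """Yearly overspend from running on a pricier tier than needed."""
    return (current_hourly_usd - right_sized_hourly_usd) * HOURS_PER_YEAR


def implied_hourly_gap_usd(annual_waste: float) -> float:
    """Hourly price difference that a given annual waste figure implies."""
    return annual_waste / HOURS_PER_YEAR


# Placeholder rates: $6.00/h for the large tier, $1.00/h for the small one.
print(annual_waste_usd(6.00, 1.00))  # 43800.0

# The $40K–$80K range in the table corresponds to this hourly gap:
print(round(implied_hourly_gap_usd(40_000), 2),
      round(implied_hourly_gap_usd(80_000), 2))  # 4.57 9.13
```

Plug in your own contracted rates; reserved or spot pricing can move the result substantially in either direction.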

---

Quantization as a Right-Sizing Tool

Quantization reduces model weight size, typically without significant accuracy loss, enabling a larger model to fit on a smaller (cheaper) GPU tier. The reductions below are in weight memory relative to FP16:

INT8 (bitsandbytes, LLM.int8()): ~50% VRAM reduction, minimal quality loss
AWQ / GPTQ (INT4): ~75% VRAM reduction, small quality trade-off
FP8 (H100-native): ~50% VRAM reduction, near-zero quality loss on supported hardware

Quantizing a 70B model to INT4 brings it from ~140GB to ~35GB — fitting comfortably on a single A100 80GB instead of requiring a multi-GPU setup.
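The percentages follow from bits per weight. As a rough sketch, weight memory in GB is parameters (in billions) × bits ÷ 8. The helper below is illustrative and ignores the per-group scales and zero-points that INT4 formats store alongside the weights, so real checkpoints come out somewhat larger.

```python
# Nominal bits per weight for each precision discussed above.
BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "FP8": 8, "INT4": 4}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB, excluding quantization metadata."""
    return params_billion * BITS_PER_WEIGHT[precision] / 8


for precision in ("FP16", "INT8", "INT4"):
    print(precision, weight_gb(70, precision))
# FP16 140.0, INT8 70.0, INT4 35.0
```

The same helper gives ~7 GB for a 7B model in INT8, matching the figure used in Step 2.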

---

Right-Sizing at Scale

Manual right-sizing works for a handful of models. At fleet scale — dozens of models, multiple clusters, mixed providers — it becomes untenable. Models get deployed and forgotten. Traffic patterns shift. New model versions change memory profiles.

Continuous right-sizing requires automated monitoring of VRAM headroom, SM utilization, and concurrency patterns — with alerts when a workload drifts outside its optimal tier range.
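A drift check of this kind can start very small. The sketch below flags workloads whose VRAM headroom falls outside a band. The WorkloadSample type, workload names, and the 60% upper bound are all hypothetical. The 15% lower bound mirrors the headroom figure used earlier. In practice the samples would come from your metrics store instead of literals.

```python
from dataclasses import dataclass


@dataclass
class WorkloadSample:
    name: str
    vram_used_gb: float   # peak usage over the observation window
    vram_total_gb: float  # capacity of the GPU tier it runs on


def headroom_fraction(sample: WorkloadSample) -> float:
    """Share of the GPU's VRAM left unused at peak."""
    return 1 - sample.vram_used_gb / sample.vram_total_gb


def drift_alerts(samples: list[WorkloadSample],
                 min_headroom: float = 0.15,
                 max_headroom: float = 0.60) -> list[tuple[str, str]]:
    """Return (workload, reason) pairs for samples outside the headroom band."""
    alerts = []
    for s in samples:
        h = headroom_fraction(s)
        if h < min_headroom:
            alerts.append((s.name, "low VRAM headroom"))
        elif h > max_headroom:
            alerts.append((s.name, "mostly idle VRAM"))
    return alerts


fleet = [
    WorkloadSample("chat-7b", 9.0, 80.0),    # hypothetical workloads
    WorkloadSample("rag-70b", 74.0, 80.0),
    WorkloadSample("code-13b", 30.0, 48.0),
]
print(drift_alerts(fleet))
```

A fuller version would combine this with the SM utilization and concurrency signals before recommending a tier change, since idle VRAM alone does not prove a workload is over-tiered.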


---

Next in the GPU Ops Field Guide: [KV Cache Pressure: Symptoms, Causes, and Fixes →](/blog/gpu-ops-kv-cache-pressure)
