ParallelIQ
GPU Ops Field Guide

OOM Root Cause for Inference Workloads

By Sam Hosseini · May 16, 2026 · 7 min read

Out of memory errors in LLM inference are rarely random. They follow predictable patterns — KV cache overflow, batch size misconfiguration, memory fragmentation. Here's how to diagnose which one you're dealing with.

OOM Is Not a Random Event

When a GPU runs out of memory during inference, the instinct is to reach for a blunt fix: upgrade the GPU tier, reduce concurrency, or restart the pod. These fixes work temporarily but miss the root cause.

OOM errors in LLM inference follow predictable patterns. Once you know which pattern you're dealing with, the fix is usually surgical, not expensive.

---

The Four Root Causes

1. KV Cache Overflow

The KV (key-value) cache stores intermediate attention states across tokens. For long-context requests — a 32K or 128K context window — the KV cache alone can consume more memory than the model weights.
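
To make the scale concrete, here is a rough back-of-envelope sketch. It assumes standard attention with one key tensor and one value tensor per layer. The model shape (32 layers, 32 KV heads, head dimension 128, FP16) is an illustrative 7B-class configuration rather than a figure from any particular deployment, and the function name is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size: one K and one V tensor per layer."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

GIB = 1024 ** 3

# Assumed 7B-class shape: 32 layers, 32 KV heads, head_dim 128, FP16 (2 bytes).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
kv_32k = kv_cache_bytes(32, 32, 128, seq_len=32 * 1024)
weights = 7e9 * 2  # roughly 7B parameters at 2 bytes each

print(f"KV cache per token:      {per_token / 1024:.0f} KiB")
print(f"KV cache at 32K context: {kv_32k / GIB:.1f} GiB")
print(f"FP16 weights (approx.):  {weights / GIB:.1f} GiB")
```

Under these assumptions a single 32K-token request needs about 16 GiB of KV cache, more than the roughly 13 GiB of FP16 weights. Models that use grouped-query attention have fewer KV heads and a proportionally smaller cache.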

Signature: OOM happens on long requests while short ones succeed. Memory usage grows linearly with context length.

Fix: Reduce --max-model-len to cap per-request context, let KV cache blocks spill to CPU RAM (vLLM reserves CPU swap space for this via --swap-space), or use sliding window attention where the model supports it.
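
One rough way to pick a --max-model-len value is to invert the KV cache arithmetic: given a memory budget for the cache, how many tokens fit? The sketch below does that. The function name, the model shape, and the 20 GiB budget are all hypothetical, and it ignores the paging and block-granularity details that real inference servers add.

```python
def max_tokens_for_budget(budget_bytes, num_layers, num_kv_heads, head_dim,
                          bytes_per_elem=2):
    """Largest total token count whose K and V tensors fit in budget_bytes."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return budget_bytes // bytes_per_token

GIB = 1024 ** 3

# Assumed: 20 GiB left for KV cache after loading weights, on a
# 32-layer, 32-KV-head, head_dim-128 model in FP16.
total_tokens = max_tokens_for_budget(20 * GIB, 32, 32, 128)
per_request = total_tokens // 8  # context per request with 8 concurrent requests

print(f"Tokens that fit in the budget: {total_tokens}")
print(f"Per-request context at concurrency 8: {per_request}")
```

The total is shared across every request in flight, so the safe per-request limit shrinks as concurrency grows.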

2. Batch Size Misconfiguration

Static batching reserves memory for every slot in the batch upfront: per-request memory multiplied by the maximum batch size. If max_batch_size=32 and each request needs 4 GB, the server reserves 32 × 4 GB = 128 GB before a single token is generated.

Signature: OOM at startup or immediately on first batch, not during long sessions. Memory usage is flat and high from the beginning.

Fix: Switch to continuous batching (vLLM, TGI, SGLang all support it). This allocates memory dynamically per request rather than reserving for the full batch.
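
The difference between the two strategies can be sketched with simple arithmetic. The per-request footprints below are made-up numbers for illustration, and the function names are hypothetical; real servers allocate in blocks and keep some headroom.

```python
def static_reservation_gb(max_batch_size, per_request_gb):
    """Memory reserved upfront when every batch slot is preallocated."""
    return max_batch_size * per_request_gb

def dynamic_usage_gb(active_requests_gb):
    """Memory in use when allocation follows the requests actually running."""
    return sum(active_requests_gb)

# The static case from the text: 32 slots at 4 GB each.
reserved = static_reservation_gb(max_batch_size=32, per_request_gb=4)

# Hypothetical snapshot: 5 requests in flight with varying footprints (GB).
in_use = dynamic_usage_gb([0.5, 1.0, 4.0, 2.0, 0.5])

print(f"Static reservation: {reserved} GB")
print(f"Dynamic usage now:  {in_use} GB")
```

With static batching the worst case is paid at startup whether or not traffic ever reaches it. With dynamic allocation, memory tracks the requests that are actually running.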

3. Memory Fragmentation

PyTorch's CUDA memory allocator can fragment over time. After many allocations and deallocations, there may be enough total free memory but not enough contiguous free memory for a new allocation.
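
A toy model shows why total free memory can be misleading. This is not how the PyTorch allocator is implemented. It is a simplified sketch in which the free list is a set of non-adjacent holes and an allocation needs a single hole at least as large as the request. The block sizes are invented for illustration.

```python
def can_allocate(free_blocks_mib, request_mib):
    """A request needs one contiguous free block at least as large as itself."""
    return any(block >= request_mib for block in free_blocks_mib)

# Hypothetical free list after many alloc/free cycles: several small holes.
free_blocks_mib = [512, 256, 768, 512, 1024, 256, 768]
request_mib = 2048

total_free = sum(free_blocks_mib)
largest_block = max(free_blocks_mib)

print(f"Total free:    {total_free} MiB")
print(f"Largest block: {largest_block} MiB")
print(f"Can allocate {request_mib} MiB: {can_allocate(free_blocks_mib, request_mib)}")
```

In this toy case 4096 MiB is free in total, yet a 2048 MiB request fails because the largest single hole is 1024 MiB.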

Signature: OOM happens after extended uptime, not at startup. Restarting the pod temporarily resolves it. Memory usage climbs gradually before the crash.

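Fix: Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce fragmentation. Calling torch.cuda.empty_cache() periodically returns unused cached blocks to the driver, though it cannot compact memory that live tensors still hold. Consider scheduled pod restarts during low-traffic windows.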

4. Model + Activations Exceed VRAM

The simplest cause: the model weights plus the activation memory required during inference exceed available VRAM. Common when quantization is not applied or when the wrong GPU tier is selected.

Signature: OOM immediately on model load or first inference call, regardless of request length or batch size.

Fix: Apply quantization (AWQ, GPTQ, or bitsandbytes INT8), use tensor parallelism across multiple GPUs, or right-size to a higher-VRAM tier.
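
A quick fit check helps before choosing between these options. The sketch below estimates weight memory from parameter count and precision, then adds a flat overhead for activations, KV cache, and the CUDA context. The 70B model, 80 GiB GPU, and 10 GiB overhead are assumed values for illustration, and the estimate ignores the small extra storage quantized formats need for scales and zero points.

```python
# Bytes per parameter for each precision (quantization metadata ignored).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

GIB = 1024 ** 3

def weight_gib(num_params, precision):
    """Approximate weight memory in GiB for a given precision."""
    return num_params * BYTES_PER_PARAM[precision] / GIB

def fits(num_params, precision, vram_gib, overhead_gib):
    """True if weights plus an assumed flat overhead fit in VRAM."""
    return weight_gib(num_params, precision) + overhead_gib <= vram_gib

# Hypothetical: a 70B-parameter model on one 80 GiB GPU, 10 GiB overhead.
for precision in ("fp16", "int8", "int4"):
    size = weight_gib(70e9, precision)
    ok = fits(70e9, precision, vram_gib=80, overhead_gib=10)
    print(f"{precision}: {size:.1f} GiB weights, fits on one GPU: {ok}")
```

Under these assumptions the FP16 model needs multiple GPUs, while the 8-bit and 4-bit variants fit on one with room for the overhead.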

---

How to Diagnose Which One You Have

Run this sequence before changing any configuration:

Step 1 — Check when the OOM occurs

| Timing | Likely Cause |
| --- | --- |
| At model load | Model too large for VRAM |
| On first batch | Batch size misconfiguration |
| On long requests only | KV cache overflow |
| After hours of uptime | Memory fragmentation |
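
The same triage can be written as a small helper for a runbook or alert handler. This is a sketch of the timing rules above, not a complete diagnostic: the function and parameter names are hypothetical, and the one-hour uptime threshold is an assumption.

```python
def likely_cause(failed_at_load, failed_on_first_batch,
                 only_long_requests, uptime_hours):
    """Map when an OOM occurred to the most likely root cause."""
    if failed_at_load:
        return "model too large for VRAM"
    if failed_on_first_batch:
        return "batch size misconfiguration"
    if only_long_requests:
        return "KV cache overflow"
    if uptime_hours >= 1:  # assumed threshold for "hours of uptime"
        return "memory fragmentation"
    return "unclear, gather more data"

# Example: pod had been up for 9 hours, and short and long requests both failed.
print(likely_cause(failed_at_load=False, failed_on_first_batch=False,
                   only_long_requests=False, uptime_hours=9))
```

The checks run in order from earliest failure point to latest, so the first matching condition wins.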

Step 2 — Read the CUDA error message

torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 20.00 GiB
(GPU 0; 79.20 GiB total capacity;
 71.45 GiB already allocated;
 3.81 GiB free; 73.12 GiB reserved)

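The gap between reserved and already allocated is memory that PyTorch's caching allocator is holding but no tensor is using. If this gap is large, and the failed allocation would have fit inside it, fragmentation is the likely problem.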
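
The numbers can be pulled out of the message with a few regular expressions. The sketch below parses the exact text shown above. The message wording differs between PyTorch versions, so the patterns are an assumption tied to this format, and the function name is hypothetical.

```python
import re

MESSAGE = """torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 20.00 GiB
(GPU 0; 79.20 GiB total capacity;
 71.45 GiB already allocated;
 3.81 GiB free; 73.12 GiB reserved)"""

# One pattern per figure in the message format shown above.
PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) GiB",
    "total": r"([\d.]+) GiB total capacity",
    "allocated": r"([\d.]+) GiB already allocated",
    "free": r"([\d.]+) GiB free",
    "reserved": r"([\d.]+) GiB reserved",
}

def parse_oom(message):
    """Extract the GiB figures that are present in an OOM message."""
    stats = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, message)
        if match:
            stats[name] = float(match.group(1))
    return stats

stats = parse_oom(MESSAGE)
gap = round(stats["reserved"] - stats["allocated"], 2)

print(f"Requested: {stats['requested']} GiB")
print(f"Reserved minus allocated: {gap} GiB")
```

For this message the gap is about 1.67 GiB against a 20 GiB request, so reclaiming the gap would not have saved this particular allocation. The request was simply larger than what the GPU had left.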

Step 3 — Monitor memory over time

watch -n 1 nvidia-smi --query-gpu=memory.used,memory.free --format=csv

A memory usage graph that climbs steadily over hours points to fragmentation or a memory leak. A graph that spikes on specific request types points to KV cache or batch size issues.
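
If the samples are logged rather than eyeballed, a crude heuristic can label the shape of the series. The sketch below takes a list of memory.used readings in MiB. The function name, the thresholds (a 20% jump counts as a spike, 80% rising steps counts as a climb), and the sample data are assumptions to tune against real traffic.

```python
def classify_trend(samples_mib, spike_ratio=0.2, rising_share=0.8):
    """Label a series of memory.used samples with a crude heuristic."""
    if len(samples_mib) < 3:
        return "insufficient data"
    if max(samples_mib) == min(samples_mib):
        return "flat"
    deltas = [b - a for a, b in zip(samples_mib, samples_mib[1:])]
    # Spike: a single step accounts for a large share of peak usage.
    if max(deltas) >= spike_ratio * max(samples_mib):
        return "spike"
    # Steady climb: most steps go up and the series ends higher than it began.
    rising = sum(1 for d in deltas if d > 0)
    if rising >= rising_share * len(deltas) and samples_mib[-1] > samples_mib[0]:
        return "steady climb"
    return "no clear pattern"

# Hypothetical samples (MiB).
climb = [40000, 40500, 41000, 41400, 42000, 42600]
spike = [40000, 40100, 40050, 62000, 40200]

print(classify_trend(climb))
print(classify_trend(spike))
```

A steady climb matches the fragmentation or leak pattern, while a spike is worth correlating with the request log to see which request type triggered it.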

---

Prevention at the Fleet Level

OOM errors that crash pods are expensive — not just because of the downtime, but because they often go undetected until a user reports a timeout. By the time ops is aware, the pod has already restarted and the evidence is gone.

Fleet-level OOM prevention requires catching the risk before the crash: tracking KV cache pressure, memory fragmentation trends, and per-model memory headroom as continuous signals — not post-mortem logs.

See how ParallelIQ surfaces OOM risk before pods crash →

---

Next in the GPU Ops Field Guide: [GPU Right-Sizing: Matching Tier to Workload →](/blog/gpu-ops-right-sizing-gpu-tiers)
