OOM Root Cause for Inference Workloads

Out of memory errors in LLM inference are rarely random. They follow predictable patterns — KV cache overflow, batch size misconfiguration, memory fragmentation. Here's how to diagnose which one you're dealing with.
OOM Is Not a Random Event
When a GPU runs out of memory during inference, the instinct is to reach for a blunt fix: upgrade the GPU tier, reduce concurrency, restart the pod. These fixes work temporarily but miss the root cause.
OOM errors in LLM inference follow predictable patterns. Once you know which pattern you're dealing with, the fix is usually surgical, not expensive.
---
The Four Root Causes
1. KV Cache Overflow
The KV (key-value) cache stores the attention keys and values for every token already processed, so they are not recomputed at each decoding step. For long-context requests, such as a 32K or 128K context window, the KV cache alone can consume more memory than the model weights.
Signature: OOM happens on long requests, not short ones. Memory usage grows linearly with context length. Short requests succeed; long ones fail.
Fix: Offload or swap KV cache blocks to CPU RAM (vLLM has exposed a --swap-space option for CPU swap space; availability and behavior vary by version), reduce --max-model-len, or serve a model whose architecture uses sliding window attention.
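To see why long contexts dominate, a rough per-request estimate can be sketched as below. The formula (one key tensor and one value tensor per layer per token) is the usual approximation for decoder-only transformers. The model dimensions are illustrative assumptions for a 7B-class model with full multi-head attention, not figures from any particular deployment.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size in bytes for a single sequence.

    The factor of 2 covers keys and values; bytes_per_elem=2 assumes fp16/bf16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len


GIB = 1024 ** 3

# Assumed dimensions: 32 layers, 32 KV heads, head_dim 128.
for context in (2_048, 32_768, 131_072):
    size = kv_cache_bytes(context, num_layers=32, num_kv_heads=32, head_dim=128)
    print(f"{context:>7} tokens -> {size / GIB:.1f} GiB")
```

Under these assumptions the cache costs 0.5 MiB per token: 1 GiB at 2K tokens, 16 GiB at 32K, and 64 GiB at 128K. A single 32K request therefore needs more memory than the roughly 13 GiB of fp16 weights for a 7-billion-parameter model. Models that use grouped-query attention have fewer KV heads and a proportionally smaller cache.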
2. Batch Size Misconfiguration
Static batching reserves memory for the full batch upfront: per-request memory multiplied by the maximum batch size. If max_batch_size=32 and each request needs 4 GB, you need 128 GB before a single token is generated.
Signature: OOM at startup or immediately on first batch, not during long sessions. Memory usage is flat and high from the beginning.
Fix: Switch to continuous batching (vLLM, TGI, SGLang all support it). This allocates memory dynamically per request rather than reserving for the full batch.
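The arithmetic behind the two strategies can be sketched as follows. This is a deliberately simplified model: the function names are hypothetical, and real engines allocate KV cache in blocks from a pre-sized pool rather than in whole-request units.

```python
def static_reservation_gb(max_batch_size, per_request_gb):
    """Memory reserved upfront when every batch slot is pre-allocated."""
    return max_batch_size * per_request_gb


def dynamic_usage_gb(active_requests, per_request_gb):
    """Memory in use when allocation follows the requests actually in flight."""
    return active_requests * per_request_gb


print(static_reservation_gb(32, 4))  # 128, the figure from the text
print(dynamic_usage_gb(5, 4))        # 20, with an assumed five requests in flight
```

The static figure is paid at startup regardless of traffic, which is why this failure shows up before any long session has a chance to run.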
3. Memory Fragmentation
PyTorch's CUDA memory allocator can fragment over time. After many allocations and deallocations, there may be enough total free memory but not enough contiguous free memory for a new allocation.
Signature: OOM happens after extended uptime, not at startup. Restarting the pod temporarily resolves it. Memory usage climbs gradually before the crash.
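Fix: Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce fragmentation. Calling torch.cuda.empty_cache() periodically returns unused cached blocks to the driver, though it cannot free memory that live tensors still occupy. Consider scheduled pod restarts during low-traffic windows.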
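A minimal sketch of applying the allocator setting and watching the reserved-versus-allocated gap from inside the serving process is shown below. The helper names are hypothetical; torch.cuda.memory_allocated and torch.cuda.memory_reserved are the PyTorch calls assumed to be available, and the logging function only works on a machine with PyTorch and a CUDA device.

```python
import os

# The allocator reads this setting when CUDA memory is first used, so setting it
# before importing torch is the safest ordering.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")


def cached_but_unused(allocated_bytes, reserved_bytes):
    """Bytes held by the caching allocator that no live tensor occupies."""
    return max(reserved_bytes - allocated_bytes, 0)


def log_allocator_gap(device=0):
    """Print the reserved-vs-allocated gap for one GPU (needs PyTorch + CUDA)."""
    import torch  # imported lazily so the helper above works without a GPU

    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    gap_mib = cached_but_unused(allocated, reserved) / 2**20
    print(
        f"allocated={allocated / 2**20:.0f} MiB "
        f"reserved={reserved / 2**20:.0f} MiB gap={gap_mib:.0f} MiB"
    )
```

Logging this gap at a fixed interval gives a trend line; a gap that widens over hours of uptime is consistent with the fragmentation signature described above.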
4. Model + Activations Exceed VRAM
The simplest cause: the model weights plus the activation memory required during inference exceed available VRAM. Common when quantization is not applied or when the wrong GPU tier is selected.
Signature: OOM immediately on model load or first inference call, regardless of request length or batch size.
Fix: Apply quantization (AWQ, GPTQ, or bitsandbytes INT8), use tensor parallelism across multiple GPUs, or right-size to a higher-VRAM tier.
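A quick way to sanity-check whether a model can fit at all is to estimate weight memory from parameter count and precision. The sketch below covers weights only; activations, KV cache, and quantization metadata such as scales add to the total. The 70-billion-parameter figure is an illustrative assumption.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Illustrative: a 70-billion-parameter model at common precisions.
for label, bits in (("fp16", 16), ("int8", 8), ("4-bit", 4)):
    print(f"{label:>5}: {weight_memory_gb(70e9, bits):.0f} GB")
```

This prints 140 GB, 70 GB, and 35 GB. At fp16 such a model cannot fit on a single 80 GB GPU, while at 4-bit the weights leave room for cache and activations.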
---
How to Diagnose Which One You Have
Run this sequence before changing any configuration:
Step 1 — Check when the OOM occurs
| Timing | Likely Cause |
|---|---|
| At model load | Model too large for VRAM |
| On first batch | Batch size misconfiguration |
| On long requests only | KV cache overflow |
| After hours of uptime | Memory fragmentation |
Step 2 — Read the CUDA error message
torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 20.00 GiB
(GPU 0; 79.20 GiB total capacity;
71.45 GiB already allocated;
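3.81 GiB free; 73.12 GiB reserved)
The gap between already allocated and reserved is memory that PyTorch's caching allocator holds but no tensor is using. If this gap is large relative to the requested allocation, fragmentation is the likely problem. In the sample above the gap is 73.12 - 71.45 = 1.67 GiB, small next to the 20 GiB request, so that particular message points to a GPU that is simply full rather than fragmented.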
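When these errors land in logs, the figures can be pulled out automatically. The sketch below parses the message format shown in this step; newer PyTorch releases word the message differently, so the patterns are an assumption to adjust for your version, and parse_oom is a hypothetical helper name.

```python
import re

MESSAGE = (
    "CUDA out of memory. Tried to allocate 20.00 GiB "
    "(GPU 0; 79.20 GiB total capacity; 71.45 GiB already allocated; "
    "3.81 GiB free; 73.12 GiB reserved)"
)


def parse_oom(message):
    """Extract the GiB figures from an OOM message in the format above."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    stats = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            stats[key] = float(match.group(1))
    return stats


stats = parse_oom(MESSAGE)
gap = stats["reserved"] - stats["allocated"]
print(f"reserved - allocated = {gap:.2f} GiB, requested = {stats['requested']:.2f} GiB")
```

Comparing the gap with the requested size is the useful part: a gap close to or larger than the request suggests the memory exists but is not contiguous.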
Step 3 — Monitor memory over time
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.free --format=csv
A memory usage graph that climbs steadily over hours points to fragmentation or a memory leak. A graph that spikes on specific request types points to KV cache or batch size issues.
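The same distinction can be made programmatically from sampled readings. The sketch below is one possible heuristic, not an established method: the function names and both thresholds are arbitrary assumptions, and read_used_mib needs an NVIDIA driver with nvidia-smi on the PATH.

```python
import subprocess


def read_used_mib(gpu_index=0):
    """Return current used memory in MiB for one GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.strip().splitlines()[0])


def classify_trend(samples_mib, climb_fraction=0.8, spike_ratio=1.5):
    """Label used-memory samples as 'steady_climb', 'spiky', or 'flat'.

    climb_fraction and spike_ratio are assumed thresholds; tune them per workload.
    """
    if len(samples_mib) < 3:
        return "insufficient_data"
    deltas = [b - a for a, b in zip(samples_mib, samples_mib[1:])]
    rising = sum(1 for d in deltas if d > 0) / len(deltas)
    if rising >= climb_fraction:
        return "steady_climb"  # consistent with fragmentation or a leak
    median = sorted(samples_mib)[len(samples_mib) // 2]
    if median > 0 and max(samples_mib) / median >= spike_ratio:
        return "spiky"  # consistent with KV cache or batch-size pressure
    return "flat"
```

Feeding it readings collected at a fixed interval, for example one call to read_used_mib per minute, gives a coarse label that can be attached to an alert alongside the raw graph.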
---
Prevention at the Fleet Level
OOM errors that crash pods are expensive — not just because of the downtime, but because they often go undetected until a user reports a timeout. By the time ops is aware, the pod has already restarted and the evidence is gone.
Fleet-level OOM prevention requires catching the risk before the crash: tracking KV cache pressure, memory fragmentation trends, and per-model memory headroom as continuous signals — not post-mortem logs.
---
Next in the GPU Ops Field Guide: [GPU Right-Sizing: Matching Tier to Workload →](/blog/gpu-ops-right-sizing-gpu-tiers)