Question 1

What is KV cache in LLM inference?

Accepted Answer

KV cache (key-value cache) stores the key and value tensors that the attention layers compute for each token in a request's context. Instead of recomputing them for every new token generated, the model reads them from the cache and computes keys and values only for the newest token. The KV cache lives in GPU VRAM and is the largest variable consumer of GPU memory in inference workloads: it grows linearly with context length and scales with the number of concurrent requests.
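To make the reuse concrete, here is a minimal single-head sketch in plain Python. It is a toy, not how any inference engine implements its cache: `project` is a hypothetical stand-in for the learned key, value, and query projections, and the class and function names are illustrative only.

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def project(token, offset):
    """Hypothetical stand-in for a learned projection (K, V, or Q)."""
    return [math.sin(token + offset + i) for i in range(4)]

class KVCache:
    """Keeps the key and value vector of every token seen so far."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)

def decode_step(cache, token):
    # Only the new token's key and value are computed; older ones are reused.
    cache.append(project(token, 0.0), project(token, 1.0))
    return attend(project(token, 2.0), cache.keys, cache.values)

def decode_without_cache(tokens):
    # Baseline: recompute every key and value for the whole prefix.
    keys = [project(t, 0.0) for t in tokens]
    values = [project(t, 1.0) for t in tokens]
    return attend(project(tokens[-1], 2.0), keys, values)
```

Both paths produce the same attention output at every step. The cached path does one projection pair per new token, while the baseline redoes the whole prefix, and the cache holds one key and one value vector per token, which is where the linear memory growth comes from.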

Question 2

How does context length affect GPU memory in LLM inference?

Accepted Answer

Context length has a direct linear relationship with KV cache memory: doubling the context length doubles the KV cache requirement per request. For a 7B model at FP16, each request requires approximately 0.54 GB of KV cache at 4K context and 1.07 GB at 8K context. Exact figures depend on the model's layer count, number of KV heads, and head dimension. This is why long-context workloads are so GPU-memory-intensive: at context lengths above 32K, the combined KV cache of a handful of concurrent requests can exceed the model weights themselves.
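The linear relationship can be checked with a few lines of arithmetic. The sketch below uses the standard per-token estimate (keys plus values, for every layer and KV head). The model shape is an assumption, not something stated above: 32 layers, 8 KV heads (grouped-query attention), and a head dimension of 128 happen to reproduce the 0.54 GB and 1.07 GB figures, with GB taken as 10^9 bytes.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache size for one request: keys plus values, for every layer."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token

# Assumed 7B-class shape with grouped-query attention (not stated in the text).
ASSUMED_SHAPE = dict(layers=32, kv_heads=8, head_dim=128)

for context in (4096, 8192, 32768):
    gb = kv_cache_bytes(context, **ASSUMED_SHAPE) / 1e9
    print(f"{context:>6} tokens: {gb:.2f} GB per request")
```

A model with full multi-head attention (for example 32 KV heads instead of 8) would need four times as much under the same formula, so it is worth plugging in the actual values from your model's config.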

Question 3

Why does my LLM inference run out of memory with long contexts?

Accepted Answer

OOM errors with long contexts happen when the KV cache required for concurrent requests exceeds the VRAM left over after the model weights are loaded. GPU VRAM is split between fixed model weights and variable KV cache. At long context lengths, the KV cache per request becomes very large, and serving even a small number of concurrent requests exhausts the remaining VRAM. The fixes are to reduce max_model_len to your actual p99 context length, increase GPU count, or switch to a smaller model whose weights leave more VRAM for the cache.
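A rough budget calculation shows how quickly the headroom disappears. This is a worst-case sketch that assumes every request fills its full context; the function name, the 80 GB GPU, the 14 GB weight footprint, and the per-request KV sizes in the usage lines are illustrative assumptions, not measurements.

```python
def max_concurrent_requests(vram_gb, weights_gb, kv_per_request_gb,
                            gpu_memory_utilization=0.90, overhead_gb=0.0):
    """How many full-length requests fit in the VRAM left after weights."""
    budget = vram_gb * gpu_memory_utilization - weights_gb - overhead_gb
    if budget <= 0:
        return 0  # the weights alone do not fit in the allowed fraction
    return int(budget // kv_per_request_gb)

# Hypothetical 80 GB GPU, ~14 GB of FP16 weights for a 7B model.
short_ctx = max_concurrent_requests(80, 14, kv_per_request_gb=0.54)  # 4K context
long_ctx = max_concurrent_requests(80, 14, kv_per_request_gb=4.29)   # 32K context
print(short_ctx, long_ctx)
```

Under these assumptions the same GPU drops from roughly a hundred concurrent 4K requests to about a dozen at 32K, which is the point at which OOM errors or request queuing tend to appear.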

Question 4

How do I reduce KV cache memory usage in vLLM?

Accepted Answer

Three settings control KV cache memory in vLLM: (1) max_model_len: set this to your actual p99 request length rather than the model maximum. This caps the KV cache a single sequence can consume. (2) max_num_seqs: fewer concurrent sequences means less total KV cache consumption. (3) gpu_memory_utilization: the fraction of total VRAM that vLLM may use for weights, activations, and KV cache combined; whatever remains after the weights and activation memory becomes the KV cache pool. The default is 0.90; for MoE models, use 0.75. FP8 KV cache quantization (kv_cache_dtype) can also halve KV memory requirements relative to FP16 with minimal quality impact.
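As a sketch of how these settings fit together, the helper below collects them into one dict. `build_engine_kwargs` is a hypothetical helper, and the 8192 and 32 values are placeholders. The four keys are vLLM engine arguments, but check the names and defaults against the vLLM version you run.

```python
def build_engine_kwargs(p99_tokens, max_num_seqs, is_moe=False, fp8_kv_cache=False):
    """Collect the KV-cache-related settings discussed above into one dict."""
    return {
        "max_model_len": p99_tokens,        # cap per-sequence context at p99
        "max_num_seqs": max_num_seqs,       # cap concurrent sequences
        "gpu_memory_utilization": 0.75 if is_moe else 0.90,
        "kv_cache_dtype": "fp8" if fp8_kv_cache else "auto",
    }

kwargs = build_engine_kwargs(p99_tokens=8192, max_num_seqs=32, fp8_kv_cache=True)
print(kwargs)

# With vLLM installed, the dict can be passed to the offline engine, e.g.:
#   from vllm import LLM
#   llm = LLM(model="<your-model>", **kwargs)
```

The same four settings are available as `vllm serve` command-line flags (`--max-model-len`, `--max-num-seqs`, `--gpu-memory-utilization`, `--kv-cache-dtype`).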

Context	KV cache per request	vs 4K
2K	0.67 GB	0.5x
4K	1.34 GB	baseline
8K	2.68 GB	2x
16K	5.37 GB	4x
32K	10.74 GB	8x
64K	21.47 GB	16x
128K	42.95 GB	32x
256K	85.90 GB	64x
1M	343.60 GB	256x

KV Cache & Context Window Calculator.
