ParallelIQ
GPU Ops Field Guide

KV Cache Pressure: Symptoms, Causes, and Fixes

By Sam Hosseini · May 16, 2026 · 6 min read

KV cache pressure is the hidden performance killer in LLM inference. When the cache fills up, throughput collapses and latency spikes — often without a clear error message. Here's how to detect and fix it.

What Is the KV Cache?

During transformer inference, each token attends to all previous tokens. The key and value matrices from each attention layer are computed once and cached — so they don't need to be recomputed on every forward pass. This is the KV cache.

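Without it, every decode step would recompute keys and values for the entire prefix, so the work to generate a sequence would grow quadratically with its length. With it, each step computes keys and values only for the newest token, and the cost of the cache is memory that grows linearly with context length.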
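To make the difference concrete, here is a toy cost model that counts key/value projections for a single attention layer. It is a sketch for intuition only, not a profiler; the function name and the 1,024-token example are illustrative assumptions.

```python
def kv_projections_computed(num_tokens: int, use_cache: bool) -> int:
    """Count per-token K/V projections needed to generate num_tokens tokens.

    Toy model for one attention layer: it ignores batching, prefill, and
    everything else a real engine does.
    """
    total = 0
    for step in range(1, num_tokens + 1):
        # With a cache, only the newest token's K/V are computed at each step.
        # Without one, K/V for all `step` tokens seen so far are recomputed.
        total += 1 if use_cache else step
    return total


without_cache = kv_projections_computed(1024, use_cache=False)
with_cache = kv_projections_computed(1024, use_cache=True)
print(f"without cache: {without_cache}, with cache: {with_cache}")
```

In this toy model the uncached count grows as n(n+1)/2 while the cached count grows as n, which is where the large gap at long context lengths comes from.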

The problem: the KV cache lives in VRAM. As context windows grow — 8K, 32K, 128K tokens — the cache grows with them. On a busy inference server handling many concurrent requests, KV cache can consume 60–80% of available VRAM, leaving little room for anything else.
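A back-of-the-envelope calculation shows how quickly this adds up. The sketch below assumes a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries); substitute the values from your own model's config.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Hypothetical model shape, chosen only for illustration.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

for context_len in (8192, 32768, 131072):
    gib = per_token * context_len / GIB
    print(f"{context_len:>7} tokens -> {gib:.0f} GiB of KV cache per sequence")
```

Under these assumptions a single 128K-token sequence needs 16 GiB of cache on its own, before model weights or any other concurrent request are counted.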

---

Symptoms of KV Cache Pressure

Latency spikes on long requests

Short requests complete in normal time. Requests with long context windows (or long conversation histories) take significantly longer than expected, not because the model is slower, but because cache evictions are forcing recomputation.

Throughput collapse at high concurrency

As concurrent requests increase, each one competes for KV cache space. When the cache is full, new requests either wait or force eviction of cached tokens from other requests. Throughput drops non-linearly.

Frequent cache evictions in logs

vLLM logs cache hit rate and eviction events. A hit rate below 80% under normal load is a warning sign. Frequent evictions under moderate concurrency are a red flag.

INFO: Avg cache hit rate: 62.3%
WARN: KV cache eviction triggered for request_id=a3f9

GPU memory usage plateaued near capacity

The GPU isn't OOMing, but memory sits at 90–95% utilization constantly. There's no room for cache growth, so the system is in a continuous eviction loop.
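If you want to spot the log-based symptom automatically, a small parser is enough. The sketch below assumes log lines shaped exactly like the two sample lines in this section; real log formats differ between vLLM versions, so treat both regular expressions as placeholders to adapt.

```python
import re

# Patterns match the sample lines in this article; adjust for your log format.
HIT_RATE_RE = re.compile(r"cache hit rate:\s*([\d.]+)%", re.IGNORECASE)
EVICTION_RE = re.compile(r"KV cache eviction triggered", re.IGNORECASE)


def summarize_cache_logs(lines):
    """Return the most recent hit rate seen and the number of eviction events."""
    hit_rates = []
    evictions = 0
    for line in lines:
        match = HIT_RATE_RE.search(line)
        if match:
            hit_rates.append(float(match.group(1)))
        if EVICTION_RE.search(line):
            evictions += 1
    latest = hit_rates[-1] if hit_rates else None
    return {"latest_hit_rate": latest, "evictions": evictions}


sample = [
    "INFO: Avg cache hit rate: 62.3%",
    "WARN: KV cache eviction triggered for request_id=a3f9",
]
summary = summarize_cache_logs(sample)
print(summary)
```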

---

Root Causes

| Cause | Description |
| --- | --- |
| Context window too large | --max-model-len set higher than VRAM can support at target concurrency |
| Too many concurrent requests | Each request holds KV cache; more concurrency = less cache per request |
| No prefix caching | Repeated system prompts recompute cache every request instead of reusing |
| Inefficient block size | KV cache block size mismatched to typical request length, causing fragmentation |
| CPU offloading not enabled | Cache has nowhere to go when VRAM fills; eviction is the only option |

---

Fixes

1. Enable prefix caching

If your requests share a common system prompt or context prefix, vLLM's prefix caching reuses the cached KV blocks across requests rather than recomputing them:

vllm serve <model> --enable-prefix-caching

For workloads with consistent system prompts, this can reduce KV cache consumption by 30–50%.
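To estimate the saving for your own traffic, count KV blocks with and without sharing. The sketch below makes a simplifying assumption that only full blocks of the common prefix are shared; the request count and token counts are hypothetical.

```python
def kv_blocks_needed(num_requests: int, shared_prefix_tokens: int,
                     unique_tokens: int, block_size: int = 16,
                     prefix_caching: bool = False) -> int:
    """Estimate KV blocks held by concurrent requests that share a prefix."""
    def ceil_div(a: int, b: int) -> int:
        return -(-a // b)

    if not prefix_caching:
        # Every request stores its own copy of the prefix.
        per_request = ceil_div(shared_prefix_tokens + unique_tokens, block_size)
        return num_requests * per_request

    # Assumption: only completely filled prefix blocks are reused; a partial
    # trailing block stays private to each request.
    shared_blocks = shared_prefix_tokens // block_size
    leftover = shared_prefix_tokens % block_size
    per_request = ceil_div(leftover + unique_tokens, block_size)
    return shared_blocks + num_requests * per_request


# Hypothetical workload: 32 requests, 1,024-token system prompt, 1,024 unique tokens.
baseline = kv_blocks_needed(32, 1024, 1024)
cached = kv_blocks_needed(32, 1024, 1024, prefix_caching=True)
print(f"{baseline} -> {cached} blocks ({1 - cached / baseline:.0%} saved)")
```

The saving depends on how much of each request is shared prefix: the longer the system prompt relative to the unique part, the larger the reduction.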

2. Enable CPU offloading

When GPU VRAM fills, vLLM can swap KV cache blocks to CPU RAM rather than evicting them:

vllm serve <model> --swap-space 16  # GiB of CPU RAM per GPU for swapped KV cache blocks

Swapping adds latency (~5–10ms per swap), but avoids the full recomputation cost of eviction.
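You can sanity-check the swap cost for your hardware with a simple transfer-time estimate. The sketch below is a rough lower bound: the per-token cache size (128 KiB) and the 25 GB/s effective host-to-GPU bandwidth are assumptions, and real swaps add scheduling overhead on top.

```python
def swap_time_ms(num_tokens: int, bytes_per_token: int,
                 bandwidth_bytes_per_s: float) -> float:
    """Time to move one sequence's KV cache across the host link, in ms."""
    total_bytes = num_tokens * bytes_per_token
    return total_bytes / bandwidth_bytes_per_s * 1000.0


# Assumed values: 128 KiB of KV cache per token, 25 GB/s effective bandwidth.
estimate = swap_time_ms(num_tokens=2048, bytes_per_token=131072,
                        bandwidth_bytes_per_s=25e9)
print(f"~{estimate:.1f} ms to swap a 2,048-token sequence")
```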

3. Reduce max context length

If your workload doesn't actually need 128K context, don't reserve VRAM for it:

vllm serve <model> --max-model-len 8192

Matching --max-model-len to your actual p99 request length (prompt plus generated tokens) caps how much KV cache any single request can claim, which leaves more VRAM for concurrent requests.
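One way to find that p99 is to compute it from logged token counts. The sketch below uses the nearest-rank method on a made-up sample; in practice you would feed it the total token count (prompt plus output) of each real request.

```python
def percentile_nearest_rank(values, pct: int):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 100 requests with 100, 200, ..., 10,000 total tokens.
request_lengths = list(range(100, 10100, 100))
p99 = percentile_nearest_rank(request_lengths, 99)
print(f"p99 request length: {p99} tokens")
```

Requests longer than the chosen limit will be rejected, so leave some headroom above the measured p99 if occasional long requests matter.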

4. Tune KV cache block size

vLLM's default block size is 16 tokens. For workloads with very long or very short requests, tuning this can reduce fragmentation:

vllm serve <model> --block-size 32  # for long-context workloads
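Before changing the block size, it helps to measure how many cache slots each candidate would leave empty. The sketch below counts only the unused slots in each sequence's last block, and the request lengths are hypothetical.

```python
def wasted_slots(request_lengths, block_size: int) -> int:
    """Total unused token slots in the final block of each sequence."""
    # (-length) % block_size is the padding needed to reach the next
    # multiple of block_size; it is 0 when the length already fits exactly.
    return sum((-length) % block_size for length in request_lengths)


# Hypothetical request lengths in tokens.
lengths = [100, 250, 40, 1000]
waste_16 = wasted_slots(lengths, 16)
waste_32 = wasted_slots(lengths, 32)
print(f"block size 16: {waste_16} wasted slots, block size 32: {waste_32}")
```

On this sample the larger block wastes more slots, which is why the choice should follow your real length distribution. Padding waste is only one factor; larger blocks also mean fewer blocks to track per long sequence.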

5. Limit max concurrent requests

Set an explicit concurrency ceiling that matches your VRAM budget:

vllm serve <model> --max-num-seqs 32

Better to queue requests than to thrash the cache.
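A rough ceiling can be derived from the VRAM left over for the cache. The sketch below assumes a 40 GiB KV cache budget and 128 KiB of cache per token; both numbers are placeholders, and it sizes for the worst case where every sequence reaches the full length.

```python
GIB = 1024 ** 3


def max_concurrent_seqs(kv_budget_bytes: int, bytes_per_token: int,
                        tokens_per_seq: int) -> int:
    """How many sequences of tokens_per_seq tokens fit in the KV budget."""
    return kv_budget_bytes // (bytes_per_token * tokens_per_seq)


# Assumed values: 40 GiB free for KV cache, 128 KiB of cache per token.
limit_8k = max_concurrent_seqs(40 * GIB, 131072, 8192)
limit_32k = max_concurrent_seqs(40 * GIB, 131072, 32768)
print(f"8K context: {limit_8k} sequences, 32K context: {limit_32k} sequences")
```

Because most requests finish well short of the maximum length, a ceiling somewhat above the worst-case figure is often workable; the eviction rate tells you when you have pushed it too far.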

---

Monitoring KV Cache Health

Track these metrics continuously:

| Metric | Healthy | Investigate |
| --- | --- | --- |
| Cache hit rate | > 85% | < 70% |
| Cache evictions/min | Near zero | > 10/min |
| GPU memory at steady state | < 85% | > 92% |
| P99 latency vs P50 latency | < 3x | > 5x |

A widening gap between P50 and P99 latency is often the first visible symptom of cache pressure — long-tail requests are the ones getting their cache evicted.
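These thresholds are easy to encode as an automated check. The sketch below is one possible shape for it: the function names are hypothetical, values between the two thresholds are reported as "watch", and "near zero" evictions is interpreted as fewer than 1 per minute, which is an assumption.

```python
def classify(value: float, healthy: float, investigate: float,
             higher_is_better: bool) -> str:
    """Map a metric value onto healthy / watch / investigate."""
    if higher_is_better:
        if value > healthy:
            return "healthy"
        if value < investigate:
            return "investigate"
    else:
        if value < healthy:
            return "healthy"
        if value > investigate:
            return "investigate"
    return "watch"


def kv_cache_health(hit_rate_pct, evictions_per_min, gpu_mem_pct, p50_ms, p99_ms):
    """Apply the thresholds from the monitoring table to one snapshot."""
    return {
        "cache_hit_rate": classify(hit_rate_pct, 85, 70, higher_is_better=True),
        # Assumption: "near zero" means fewer than 1 eviction per minute.
        "evictions_per_min": classify(evictions_per_min, 1, 10, higher_is_better=False),
        "gpu_memory": classify(gpu_mem_pct, 85, 92, higher_is_better=False),
        "p99_over_p50": classify(p99_ms / p50_ms, 3, 5, higher_is_better=False),
    }


pressured = kv_cache_health(62.3, 14, 94, p50_ms=120, p99_ms=780)
relaxed = kv_cache_health(90, 0, 80, p50_ms=100, p99_ms=250)
print(pressured)
print(relaxed)
```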


---

Next in the GPU Ops Field Guide: [CPU vs GPU Bottlenecks in Agentic AI →](/blog/gpu-ops-cpu-gpu-bottlenecks)
