Too Hot, Too Cold: Finding the Goldilocks Zone in AI Serving

Overview
Every AI inference system operates between two extremes. Keeping many workers active gives excellent response times but inflates GPU costs; keeping few or none cuts GPU spend but introduces cold-start delays and timeouts. The optimal approach, the "Goldilocks zone", balances responsive latency with cost efficiency.
The Cost of Cold Starts
Cold starts occur when a serving system must initialize a new worker before processing a request. This involves:
- Container launch
- CUDA context initialization
- Loading model weights into VRAM
- Runtime graph compilation (TensorRT, ONNX Runtime, etc.)
Large models can require 5–10 seconds for startup, which is unacceptable for interactive applications. Cold starts are also a cost allocation problem: startup time is paid for either in extra GPU spend to keep workers warm or in a degraded user experience.
Defining Success: Latency as an SLO
Before optimizing, define what success looks like. A practical target is:
- p95 latency ≤ 800 ms
Requests exceeding this threshold represent SLO debt, the hidden cost of cold-start latency. Tracking this metric shows whether the warm pool is large enough and whether the logic that sizes it is adequate.
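To make the SLO check concrete, here is a minimal sketch that computes p95 latency and the share of requests over the threshold for one window of samples. The nearest-rank percentile method, the function names, and the sample window are assumptions for illustration; only the 800 ms target comes from the text.

```python
import math

SLO_P95_MS = 800.0  # target from the text


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_report(latencies_ms, slo_ms=SLO_P95_MS):
    """Summarize p95 latency and the share of requests over the threshold."""
    p95 = percentile(latencies_ms, 95)
    over = [x for x in latencies_ms if x > slo_ms]
    return {
        "p95_ms": p95,
        "slo_met": p95 <= slo_ms,
        "violations": len(over),
        "violation_rate": len(over) / len(latencies_ms),
    }


# Hypothetical window: 18 requests served warm and 2 that hit a cold start.
window = [120.0] * 18 + [6500.0, 7200.0]
report = slo_report(window)
```

In this hypothetical window, two cold starts out of twenty requests are enough to push p95 past the target, which is why a small cold-start rate can dominate tail latency.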
Tiered Warmth Architecture
Effective serving systems employ tiered warmth levels aligned with traffic patterns:
- Hot tier: always-ready pods handling primary requests
- Warm tier: partially initialized capacity
- Cold tier: dormant resources activated during demand spikes
Transitions between tiers should follow actual usage patterns via rolling QPS averages or recent activity measurements, not static timers.
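One way to drive tier transitions from usage rather than timers is an exponential moving average of QPS per model. The sketch below is illustrative only: the thresholds, the smoothing factor, and the class names are assumptions, not values from the text.

```python
from dataclasses import dataclass


@dataclass
class TierPolicy:
    hot_qps: float = 5.0   # assumed: at or above this, keep the model hot
    warm_qps: float = 0.5  # assumed: at or above this, keep the model warm
    alpha: float = 0.3     # assumed smoothing factor for the moving average


class ModelWarmth:
    """Tracks a smoothed QPS for one model and maps it to a tier."""

    def __init__(self, policy):
        self.policy = policy
        self.avg_qps = 0.0

    def observe(self, qps):
        """Fold one QPS sample into the moving average and return the tier."""
        a = self.policy.alpha
        self.avg_qps = a * qps + (1 - a) * self.avg_qps
        return self.tier()

    def tier(self):
        if self.avg_qps >= self.policy.hot_qps:
            return "hot"
        if self.avg_qps >= self.policy.warm_qps:
            return "warm"
        return "cold"
```

Because the average decays gradually, a model that goes quiet steps down from hot to warm to cold over several samples instead of being dropped at a fixed timeout.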
Smarter Autoscaling Knobs
CPU utilization is a poor scaling signal for GPU-bound inference. Instead, prioritize queue depth, inflight requests, GPU utilization, and queries per second (QPS).
Autoscaling configuration example:
- min_replicas: 1–3 per hot model
- scale_metrics: QPS, inflight_requests, gpu_util, queue_depth
- prewarm_triggers: deployment events, predicted peaks, queue buildup
- cooldown_period: 30 seconds
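The configuration above could translate into a scaling decision roughly as follows. This is a sketch under stated assumptions: the per-replica capacity, the replica ceiling, and the rule that cooldown applies only to scale-down are hypothetical choices, and it uses only the inflight and queue depth signals for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class AutoscaleConfig:
    min_replicas: int = 2                 # within the 1-3 range listed above
    max_replicas: int = 16                # assumed ceiling, not from the text
    target_inflight_per_replica: int = 4  # assumed capacity per replica
    cooldown_s: float = 30.0              # cooldown_period listed above


def desired_replicas(cfg, current, inflight, queue_depth,
                     now_s, last_scale_s, prewarm=False):
    """Return the replica count to run next.

    Demand is inflight plus queued requests. Scale-up applies immediately;
    scale-down waits until the cooldown has elapsed since the last change.
    """
    demand = inflight + queue_depth
    wanted = math.ceil(demand / cfg.target_inflight_per_replica)
    if prewarm:
        # Add one replica ahead of a deployment or predicted peak.
        wanted = max(wanted, current + 1)
    wanted = max(cfg.min_replicas, min(cfg.max_replicas, wanted))
    if wanted < current and now_s - last_scale_s < cfg.cooldown_s:
        return current  # still cooling down: hold rather than flap
    return wanted
```

Applying the cooldown asymmetrically is one possible design choice: it lets the pool grow quickly under load while avoiding rapid shrink-and-regrow cycles that would trigger fresh cold starts.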
Accelerating Spin-Up
When cold starts remain unavoidable, minimize their impact:
- Prebuild inference engines (TensorRT, ORT) and cache locally to eliminate runtime compilation delays
- Cache model artifacts locally to prevent network retrieval during scale-up
- Optimize container images for smaller size and faster deployment
- Enable CUDA Graphs and pinned memory to reduce kernel-launch overhead
- Keep tokenizers and featurizers resident to avoid initialization costs during preprocessing
These optimizations commonly reduce cold-start latency from seconds to hundreds of milliseconds.
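The caching items in the list above share one pattern: pay the slow step once and reuse the result on later starts. A minimal sketch of that pattern, assuming a local cache directory and a caller-supplied `build` function (both hypothetical, and not tied to any particular engine format):

```python
import hashlib
from pathlib import Path


def cached_artifact(cache_dir, key, build):
    """Return the path to a cached artifact, calling build(path) only on a miss.

    key identifies the artifact (for example model name, version, and GPU
    type); build is a caller-supplied function that writes the artifact to
    the path it is given.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    path = cache_dir / f"{name}.bin"
    if not path.exists():
        tmp = path.with_suffix(".tmp")
        build(tmp)         # the slow step: download or engine compilation
        tmp.replace(path)  # rename into place so readers never see a partial file
    return path
```

Including the GPU type in the key is a deliberate choice in this sketch, since compiled engines are generally specific to the hardware they were built for.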
VRAM Residency and Model Eviction
VRAM functions as the modern L3 cache — eviction policy significantly impacts efficiency. Maintain a top-K resident model set based on recent traffic or business priority. Implement least-recently-used eviction when VRAM pressure increases.
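The residency policy described here can be sketched with an ordered map: recently used models move to the back, and eviction takes from the front while skipping models pinned for business priority. The class name, the size accounting in megabytes, and the pinning mechanism are assumptions for illustration.

```python
from collections import OrderedDict


class VramResidency:
    """Tracks which models are resident and evicts least-recently-used ones."""

    def __init__(self, capacity_mb, pinned=()):
        self.capacity_mb = capacity_mb
        self.pinned = set(pinned)      # models kept for business priority
        self.resident = OrderedDict()  # model name -> size in MB, oldest first

    def used_mb(self):
        return sum(self.resident.values())

    def touch(self, name, size_mb):
        """Record a request for a model, loading it if needed.

        Returns the list of model names evicted to make room.
        """
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return []
        evicted = []
        while self.used_mb() + size_mb > self.capacity_mb:
            victim = next(
                (m for m in self.resident if m not in self.pinned), None
            )
            if victim is None:
                raise MemoryError(f"cannot fit {name}: nothing evictable remains")
            del self.resident[victim]
            evicted.append(victim)
        self.resident[name] = size_mb
        return evicted
```

A real system would also unload the evicted weights from the GPU; this sketch only tracks the bookkeeping that decides which model goes.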
Routing and Batching
Latency also depends on how traffic is steered:
- Sticky sessions: route repeat users to their existing warm pod
- Dynamic batching: enable in Triton or TF-Serving with maximum queue delays ≤ 10 ms
- Admission control: temporarily throttle low-priority traffic during queue depth spikes
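The batching and admission rules above can be illustrated with a small offline simulation. This is not how Triton or TF-Serving are configured; it only shows the logic of a queue-delay cap and priority-based shedding. The function names, the batch-closing rule, and the shed threshold are assumptions.

```python
def form_batches(arrivals_ms, max_batch, max_delay_ms=10.0):
    """Group sorted arrival times into batches.

    A batch closes when it reaches max_batch requests or when the next
    request arrives more than max_delay_ms after the batch's first request.
    """
    batches, current = [], []
    for t in arrivals_ms:
        if current and (len(current) == max_batch
                        or t - current[0] > max_delay_ms):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


def admit(priority, queue_depth, shed_threshold=50):
    """Admission control: reject low-priority requests while the queue is deep."""
    return priority == "high" or queue_depth < shed_threshold
```

With a 10 ms cap, no request waits much longer than that for its batch to fill, so batching improves throughput without consuming a large share of an 800 ms latency budget.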
Budgeted Trade-Offs
Warm worker count represents an economic decision rather than an arbitrary choice. Total cost calculation:
Total Cost = C_warm + C_latency_penalty
where C_warm is the steady GPU expense of maintaining warm replicas and C_latency_penalty is the user-facing cost of latency, timeouts, or SLO violations.
Plotting this relationship reveals a U-shaped cost curve. Insufficient warmth increases latency penalties; excessive warmth inflates GPU expenses. The equilibrium point — the Goldilocks zone — minimizes total cost.
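A toy model shows how the minimum of this curve can be located. Every number below is hypothetical, and the penalty term assumes that the share of peak traffic not covered by warm replicas pays a fixed cost per request; a real penalty function would need to come from measured latency and business data.

```python
def total_cost(warm,
               gpu_cost_per_replica_hour=2.5,    # hypothetical GPU price
               peak_qps=40.0,                    # hypothetical peak load
               qps_per_replica=10.0,             # hypothetical replica capacity
               requests_per_hour=20000,          # hypothetical traffic volume
               penalty_per_cold_request=0.001):  # hypothetical cost of a slow request
    """Hourly cost of running `warm` replicas under a toy penalty model."""
    c_warm = warm * gpu_cost_per_replica_hour
    covered = min(1.0, warm * qps_per_replica / peak_qps)
    c_latency_penalty = (1.0 - covered) * requests_per_hour * penalty_per_cold_request
    return c_warm + c_latency_penalty


# Sweep candidate pool sizes and pick the cheapest one.
costs = {w: total_cost(w) for w in range(9)}
best = min(costs, key=costs.get)
```

Under these assumptions the total falls as warm replicas absorb more of the peak, then rises once additional replicas only add GPU spend. This linear toy model produces a V shape rather than a smooth U, but the minimum plays the same role.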
Staying Just Warm Enough
Optimal AI serving avoids chasing either zero latency or zero cost. Instead, operators look for the equilibrium where latency stays within SLO bounds, utilization stays high, and GPU waste is kept to a minimum.