
Too Hot, Too Cold: Finding the Goldilocks Zone in AI Serving

By Sam Hosseini·October 16, 2025·6 min read

Overview

Every AI inference system operates between two extremes: maintaining numerous active workers delivers excellent response times but inflates GPU costs, while keeping few or no workers eliminates expenses but introduces cold-start delays and timeouts. The optimal approach — the "Goldilocks zone" — balances responsive latency with cost efficiency.

The Cost of Cold Starts

Cold starts occur when a serving system must initialize a new worker before processing a request. This involves:

Container launch
CUDA context initialization
Loading model weights into VRAM
Runtime graph compilation (TensorRT, ONNX Runtime, etc.)

Large models can require 5–10 seconds for startup, creating unacceptable delays for interactive applications. Beyond the performance hit, cold starts are a cost allocation problem: someone pays for startup time, either as extra GPU spend to keep workers warm or as a degraded user experience.

Defining Success: Latency as an SLO

Before optimization, establish clear success metrics. A practical target includes:

p95 latency ≤ 800 ms

Requests exceeding this threshold represent SLO debt: the hidden expense of cold-start latency. Tracking this metric shows whether the warm pool is large enough and whether its scaling logic reacts quickly enough.
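As a rough illustration of tracking that target, the sketch below computes a nearest-rank p95 over a window of request latencies and reports how much of the traffic exceeded the 800 ms threshold. The function names and the shape of the report are assumptions made for this example, not part of any particular monitoring stack.

```python
import math

SLO_P95_MS = 800.0  # target taken from the article


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no latency samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def slo_report(latencies_ms, slo_ms=SLO_P95_MS):
    """Summarize one window of request latencies against the p95 target."""
    p95 = percentile(latencies_ms, 95)
    over = sum(1 for x in latencies_ms if x > slo_ms)
    return {
        "p95_ms": p95,
        "slo_met": p95 <= slo_ms,
        "over_slo_fraction": over / len(latencies_ms),
    }
```

A window in which 10% of requests hit a multi-second cold start fails the target even if every other request is fast, which is exactly the kind of debt the metric is meant to surface.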

Tiered Warmth Architecture

Effective serving systems employ tiered warmth levels aligned with traffic patterns:

Hot tier: always-ready pods handling primary requests
Warm tier: partially initialized capacity
Cold tier: dormant resources activated during demand spikes

Transitions between tiers should follow actual usage patterns via rolling QPS averages or recent activity measurements, not static timers.
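One way to drive tier transitions from usage rather than timers is sketched below: a rolling QPS average per model, mapped onto a tier. The class name, the 60-second window, and the two thresholds are hypothetical values chosen for illustration; real thresholds would come from observed traffic.

```python
from collections import deque


class RollingQps:
    """Average requests per second over the last `window_s` seconds.

    Assumes timestamps are recorded in non-decreasing order.
    """

    def __init__(self, window_s=60.0):
        self.window_s = window_s
        self._stamps = deque()

    def record(self, now_s):
        self._stamps.append(now_s)

    def rate(self, now_s):
        # Drop requests that have fallen out of the window.
        cutoff = now_s - self.window_s
        while self._stamps and self._stamps[0] <= cutoff:
            self._stamps.popleft()
        return len(self._stamps) / self.window_s


def pick_tier(qps, hot_qps=1.0, warm_qps=0.05):
    """Map a rolling QPS value to a tier. Thresholds are illustrative assumptions."""
    if qps >= hot_qps:
        return "hot"
    if qps >= warm_qps:
        return "warm"
    return "cold"
```

Because the rate decays as old requests leave the window, a model that stops receiving traffic drifts from hot to warm to cold on its own, without a fixed idle timer.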

Smarter Autoscaling Knobs

Traditional CPU metrics prove ineffective for GPU-bound inference. Instead, prioritize queue depth, inflight requests, GPU utilization, and queries per second (QPS).

Autoscaling configuration example:

min_replicas: 1–3 per hot model
scale_metrics: QPS, inflight_requests, gpu_util, queue_depth
prewarm_triggers: deployment events, predicted peaks, queue buildup
cooldown_period: 30 seconds
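To make those knobs concrete, here is a minimal sketch of a scaling decision that uses inflight requests plus queue depth as the load signal and applies the 30-second cooldown only to scale-down. The `ScalePolicy` fields `max_replicas` and `target_load_per_replica` are assumptions added for the example; they do not appear in the configuration above.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalePolicy:
    min_replicas: int = 2              # within the 1-3 range suggested above
    max_replicas: int = 16             # hypothetical cap
    target_load_per_replica: int = 8   # hypothetical: requests one replica should carry
    cooldown_s: float = 30.0


def desired_replicas(policy, current, inflight, queue_depth, now_s, last_scale_s):
    """Return the replica count to run next, given the current load signals."""
    load = inflight + queue_depth
    wanted = math.ceil(load / policy.target_load_per_replica)
    wanted = max(policy.min_replicas, min(policy.max_replicas, wanted))
    # Scale up immediately; hold scale-down until the cooldown has elapsed.
    if wanted < current and now_s - last_scale_s < policy.cooldown_s:
        return current
    return wanted
```

Exempting scale-up from the cooldown is a design choice in this sketch: queue buildup hurts latency right away, while a premature scale-down only risks another cold start.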

Accelerating Spin-Up

When cold starts remain unavoidable, minimize their impact:

Prebuild inference engines (TensorRT, ORT) and cache locally to eliminate runtime compilation delays
Cache model artifacts locally to prevent network retrieval during scale-up
Optimize container images for smaller size and faster deployment
Enable CUDA Graphs and pinned memory to reduce kernel-launch overhead
Keep tokenizers and featurizers resident to avoid initialization costs during preprocessing

These optimizations commonly reduce cold-start latency from seconds to hundreds of milliseconds.
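The local-caching items in the list above can be sketched as a small helper that only goes to the network on a cache miss. Everything here is an assumption for illustration: `ensure_local`, the flat file-name layout, and the caller-supplied `fetch` callable that returns the artifact bytes.

```python
from pathlib import Path


def ensure_local(name, cache_dir, fetch):
    """Return a local path for artifact `name`, calling fetch(name) only on a cache miss.

    `name` is assumed to be a plain file name with no directory components.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    target = cache / name
    if not target.exists():
        # Write to a temporary file first so a crash never leaves a partial artifact in place.
        tmp = target.with_name(target.name + ".tmp")
        tmp.write_bytes(fetch(name))
        tmp.replace(target)
    return target
```

The same pattern applies to prebuilt inference engines: build once, store the result next to the weights, and let scale-up read from local disk.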

VRAM Residency and Model Eviction

VRAM functions as the modern L3 cache — eviction policy significantly impacts efficiency. Maintain a top-K resident model set based on recent traffic or business priority. Implement least-recently-used eviction when VRAM pressure increases.
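A minimal sketch of that policy, assuming a single GPU with a fixed VRAM budget: models are kept in least-recently-used order, and a `pinned` set stands in for business priority. The class and method names are hypothetical, and actual loading and unloading of weights is left out.

```python
from collections import OrderedDict


class VramResidency:
    """Tracks which models are resident in VRAM and evicts least-recently-used ones."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.pinned = set()             # models that must never be evicted
        self._resident = OrderedDict()  # name -> size_gb, least recently used first

    def used_gb(self):
        return sum(self._resident.values())

    def resident(self):
        return list(self._resident)

    def touch(self, name, size_gb):
        """Record a request for `name`, admitting it if needed. Returns the evicted models."""
        if name in self._resident:
            self._resident.move_to_end(name)  # now the most recently used
            return []
        if size_gb > self.capacity_gb:
            raise ValueError("model does not fit in VRAM at all")
        evicted = []
        while self.used_gb() + size_gb > self.capacity_gb:
            victim = next((m for m in self._resident if m not in self.pinned), None)
            if victim is None:
                raise MemoryError("only pinned models remain; cannot free VRAM")
            del self._resident[victim]
            evicted.append(victim)
        self._resident[name] = size_gb
        return evicted
```

For example, with a 24 GB budget and three 10 GB models, requesting the third model evicts whichever of the first two was used least recently.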

Routing and Batching

Latency reflects traffic steering decisions:

Sticky sessions: route repeat users to their existing warm pod
Dynamic batching: enable in Triton or TF-Serving with maximum queue delays ≤ 10 ms
Admission control: temporarily throttle low-priority traffic during queue depth spikes
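Two of these steering decisions are simple enough to sketch directly. The functions below are illustrative assumptions rather than any router's real API: `sticky_pod` hashes a user ID onto a pod so repeat requests land on the same warm worker, and `admit` sheds low-priority traffic while the queue is deep. The priority labels and the threshold of 100 are hypothetical.

```python
import hashlib


def sticky_pod(user_id, pods):
    """Pick the same pod for a given user on every call, while the pod list is unchanged."""
    if not pods:
        raise ValueError("no pods available")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return pods[int.from_bytes(digest[:8], "big") % len(pods)]


def admit(priority, queue_depth, shed_threshold=100):
    """Admit high-priority requests always; admit others only while the queue is shallow."""
    return priority == "high" or queue_depth < shed_threshold
```

Plain modulo hashing remaps many users whenever the pod list changes; a consistent-hashing ring would reduce that churn at the cost of more code.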

Budgeted Trade-Offs

Warm worker count represents an economic decision rather than an arbitrary choice. Total cost calculation:

Total Cost = C_warm + C_latency_penalty

Where C_warm is the steady GPU expense of maintaining warm replicas, and C_latency_penalty is the user-facing cost of latency, timeouts, or SLO violations.

Plotting this relationship reveals a U-shaped cost curve. Insufficient warmth increases latency penalties; excessive warmth inflates GPU expenses. The equilibrium point — the Goldilocks zone — minimizes total cost.
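The shape of that curve can be reproduced with a toy model. In the sketch below, the warm cost grows linearly with the number of warm replicas and the latency penalty shrinks as `penalty_scale / (n_warm + 1)`. Both the functional form and the default numbers are assumptions picked only to show the U shape; a real penalty curve would be fitted from measured SLO violations.

```python
def total_cost(n_warm, gpu_cost_per_replica=2.0, penalty_scale=50.0):
    """Total cost for n_warm warm replicas under a toy cost model."""
    c_warm = n_warm * gpu_cost_per_replica            # steady GPU spend
    c_latency_penalty = penalty_scale / (n_warm + 1)  # assumed to fall as warmth increases
    return c_warm + c_latency_penalty


def goldilocks(max_replicas=10, **model_params):
    """Return the warm replica count with the lowest total cost."""
    return min(range(max_replicas + 1), key=lambda n: total_cost(n, **model_params))
```

With the defaults shown, the minimum falls at four warm replicas; raising the GPU price moves it left, and raising the penalty moves it right.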

Staying Just Warm Enough

Optimal AI serving chases neither zero latency nor zero cost. Instead, operators look for the equilibrium where latency stays within SLO bounds, utilization stays high, and idle GPU spend is kept to a minimum.


