How do I fix KServe cold start latency issues?

Cold start latency in KServe and serverless GPU inference has five main causes, each with a matching fix: weights stored in remote object storage (move them to local NVMe or a shared network filesystem), large container images (pre-pull images and keep base images lean), no instance pre-warming (keep at least one warm replica during business hours), inefficient weight loading (use frameworks with parallel tensor loading), and a pure scale-to-zero configuration (set a minimum replica count above zero). The fastest fix for most teams is keeping a minimum of one warm replica: the cost of one idle GPU is predictable and usually far lower than the user experience cost of a multi-minute cold start.

How long does a cold start take for a 70B LLM?

A 70B model cold start typically takes 3–8 minutes end-to-end: 10–60 seconds for instance provisioning, 5–30 seconds for container image pull, 70–280 seconds for model weight download (140GB at 0.5–2GB/s), 30–60 seconds for weight loading to VRAM, and 5–15 seconds for server initialization. Weight download is the dominant factor, so storing weights on local NVMe or a shared network filesystem instead of object storage can shorten the cold start considerably.

Should I scale serverless GPU inference to zero replicas?

For most production workloads, pure scale-to-zero is too aggressive. The user experience cost of a 3–8 minute cold start on a 70B model outweighs the GPU cost saved during idle periods. A better pattern is tiered scale-to-zero: keep one warm replica at all times for the first tier of traffic, use predictive scaling to pre-warm before anticipated traffic spikes, and only scale additional replicas on demand. Reserve full scale-to-zero for batch or async workloads that can tolerate the wait.

What SLO should I set for serverless GPU inference cold starts?

A practical SLO for serverless GPU inference is: less than 2% of requests experience cold start latency greater than 10 seconds. This gives a concrete target for pre-warming configuration and minimum replica counts. Track three things: cold start frequency (requests hitting cold replicas) from inference server logs, scale-up-trigger-to-ready time from Kubernetes event logs, and warm replica availability percentage as a custom metric derived from replica count.

GPU Ops Field Guide

Serverless GPU Cold Start Latency: Causes and Solutions

By Sam Hosseini · May 16, 2026 · 6 min read

Serverless GPU inference promises zero idle cost. The hidden trade-off is cold start latency — which for large LLMs can range from 30 seconds to several minutes. Here's what causes it and how to manage it.

The Serverless Promise and Its Cost

Serverless GPU inference is compelling: scale to zero when idle, pay only for active compute, no reserved capacity sitting unused overnight. For bursty or unpredictable workloads, it's a significant cost reduction.

The catch is cold start latency. When a request arrives and no warm replica exists, the system must:

1. Provision a GPU instance
2. Pull the container image
3. Download model weights
4. Load weights into VRAM
5. Initialize the inference server
6. Process the first request

For a 7B model, this sequence typically takes 35–165 seconds. For a 70B model, it can take 3–8 minutes. That first request, and every request that arrives during scale-up, waits.

---

Breaking Down Cold Start Time

| Phase | Typical Duration | Main Variable |
| --- | --- | --- |
| Instance provisioning | 10–60s | Cloud provider, GPU availability |
| Container image pull | 5–30s | Image size, registry proximity |
| Model weight download | 10–300s | Model size, storage location |
| Weight loading to VRAM | 5–60s | Model size, NVMe speed |
| Server initialization | 5–15s | Framework, configuration |
| Total (7B model) | 35–165s | |
| Total (70B model) | 180–465s | |

The dominant factor for large models is weight download and loading. A 70B model in FP16 is ~140GB. Even at 2GB/s storage throughput, that's 70 seconds just for I/O.
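
The per-phase ranges above can be summed to bound the end-to-end time. The following is a minimal sketch that uses the numbers from the breakdown table; the dictionary keys and function names are illustrative, not part of any platform API.

```python
# Phase ranges in seconds, copied from the breakdown table.
PHASES = {
    "instance_provisioning": (10, 60),
    "container_image_pull": (5, 30),
    "model_weight_download": (10, 300),
    "weight_loading_to_vram": (5, 60),
    "server_initialization": (5, 15),
}


def total_range(phases):
    """Return (best_case, worst_case) in seconds by summing per-phase bounds."""
    best = sum(low for low, _ in phases.values())
    worst = sum(high for _, high in phases.values())
    return best, worst


def dominant_phase(phases):
    """Name of the phase with the largest worst-case duration."""
    return max(phases, key=lambda name: phases[name][1])


best, worst = total_range(PHASES)
print(f"best case: {best}s, worst case: {worst}s")
print(f"largest worst-case phase: {dominant_phase(PHASES)}")
```

Replacing the ranges with timings measured on your own platform turns this into a quick budget check for each phase.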

---

Causes of Excessive Cold Start

Weights stored in remote object storage

If model weights are pulled from S3 or GCS on every cold start, cold start time is dominated by network transfer. Object storage bandwidth to a fresh GPU instance is often 200–500MB/s, so pulling the ~140GB of a 70B FP16 model takes roughly 5–12 minutes.

Fix: Pre-load weights to NVMe-attached local storage or use a shared network filesystem (EFS, Filestore) that stays warm between cold starts.
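
The transfer-time arithmetic is easy to reproduce for other model sizes and storage tiers. This sketch assumes FP16 weights (2 bytes per parameter) and a sustained throughput figure; real transfers rarely hold peak bandwidth, so treat the results as lower bounds.

```python
def weights_size_gb(params_billions, bytes_per_param=2):
    """Approximate checkpoint size in GB (FP16 uses 2 bytes per parameter)."""
    return params_billions * bytes_per_param


def transfer_seconds(size_gb, throughput_gb_per_s):
    """Seconds needed to move size_gb at a sustained throughput."""
    return size_gb / throughput_gb_per_s


size = weights_size_gb(70)  # 70B parameters in FP16
tiers = [
    ("object storage, 200 MB/s", 0.2),
    ("object storage, 500 MB/s", 0.5),
    ("fast storage, 2 GB/s", 2.0),
]
for label, gb_per_s in tiers:
    secs = transfer_seconds(size, gb_per_s)
    print(f"{label}: {secs:.0f}s ({secs / 60:.1f} min)")
```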

Large container images

A container image with PyTorch, CUDA libraries, and model dependencies can easily reach 20–30GB. Pulling this on every cold start adds significant time even before weights are considered.

Fix: Use image caching at the node level. Most Kubernetes-based serverless platforms support image pre-pulling on nodes. Keep base images lean and layer model weights separately.
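
One common way to pre-pull on Kubernetes is a DaemonSet whose init container references the serving image, so every node pulls and caches it ahead of time. The sketch below builds such a manifest as a Python dict and prints it as JSON. The DaemonSet name and image reference are placeholders, and it assumes the serving image ships a shell.

```python
import json


def prepull_daemonset(name, image, namespace="default"):
    """Build a DaemonSet manifest that pulls `image` onto every matching node.

    The init container runs the target image once and exits, which forces the
    kubelet to pull and cache it. The small pause container keeps the pod alive.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "initContainers": [{
                        "name": "prepull",
                        "image": image,
                        # Assumption: the image contains `sh`; adjust if not.
                        "command": ["sh", "-c", "true"],
                    }],
                    "containers": [{
                        "name": "pause",
                        "image": "registry.k8s.io/pause:3.9",
                    }],
                },
            },
        },
    }


# Hypothetical image reference, for illustration only.
manifest = prepull_daemonset(
    "llm-image-prepull", "registry.example.com/llm-server:latest"
)
print(json.dumps(manifest, indent=2))
```

In practice you would add a node selector or tolerations so the DaemonSet only lands on GPU nodes.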

No instance pre-warming

Pure scale-to-zero means every cold start starts from scratch. If there's no mechanism to pre-warm instances before traffic arrives, the first users after an idle period always absorb the cold start penalty.

Fix: Maintain a minimum of one warm replica during business hours. Use predictive scaling to pre-warm before anticipated traffic based on historical patterns.

Inefficient weight loading

Loading weights sequentially from disk to VRAM is slower than loading in parallel. Some inference servers also run model validation or compilation steps on startup that add unnecessary time.

Fix: Use frameworks that load weights in parallel across GPUs (tensor-parallel loading). Where available, use compiled or cached model formats (TensorRT-LLM, vLLM's built-in caching) so compilation does not repeat on every start.
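
To illustrate why parallel loading helps, the sketch below reads several checkpoint shards concurrently instead of one after another. It is not how any particular inference server loads weights, and it stops at host memory rather than VRAM. The dummy shard files stand in for real checkpoint shards.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_shard(path):
    """Read one shard fully into memory and return its size in bytes."""
    with open(path, "rb") as f:
        return len(f.read())


def load_shards_parallel(paths, workers=4):
    """Read all shards concurrently and return the total bytes read."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(read_shard, paths))


# Demo with small dummy files standing in for real checkpoint shards.
with tempfile.TemporaryDirectory() as tmp:
    shard_paths = []
    for i in range(8):
        path = os.path.join(tmp, f"shard-{i:02d}.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(1024 * 1024))  # 1 MiB per dummy shard
        shard_paths.append(path)
    total_bytes = load_shards_parallel(shard_paths)
    print(f"read {total_bytes} bytes from {len(shard_paths)} shards")
```

Threads work here because file reads release the GIL. Whether parallel reads actually help depends on the storage device being able to serve concurrent requests.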

---

Strategies to Manage Cold Start

Strategy 1 — Minimum warm replicas

The simplest fix: never scale to zero. Keep at least one replica warm at all times, or during hours when traffic is likely. The cost of one idle replica is predictable and usually much less than the user experience cost of multi-minute cold starts.
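
In KServe, a replica floor is typically expressed through `minReplicas` on the predictor. The sketch below builds such an InferenceService manifest as a Python dict. The service name, storage URI, and model format value are placeholders, and the field layout should be checked against the KServe version you run.

```python
import json


def inference_service(name, storage_uri, model_format,
                      min_replicas=1, max_replicas=4):
    """Build a KServe InferenceService manifest with a warm-replica floor."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                # minReplicas >= 1 means the service never scales to zero.
                "minReplicas": min_replicas,
                "maxReplicas": max_replicas,
                "model": {
                    "modelFormat": {"name": model_format},
                    "storageUri": storage_uri,
                    "resources": {"limits": {"nvidia.com/gpu": "1"}},
                },
            }
        },
    }


# All argument values below are hypothetical placeholders.
service = inference_service(
    name="llm-70b",
    storage_uri="pvc://model-cache/llm-70b",
    model_format="huggingface",
)
print(json.dumps(service, indent=2))
```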

Strategy 2 — Predictive pre-warming

Use historical traffic patterns to pre-warm replicas before demand arrives. If traffic spikes every weekday at 9am, begin scaling up at 8:45am. This eliminates cold starts for the majority of traffic at the cost of roughly 15 minutes of extra warm GPU time before each spike.
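
The lead time can be derived from the expected cold start duration rather than guessed. This sketch assumes a worst-case cold start of 465 seconds (the 70B upper bound from the breakdown table) and pads it with a safety margin to reach the 15-minute lead used in the example; both inputs should come from your own measurements.

```python
from datetime import datetime, timedelta


def prewarm_start(spike_time, cold_start_seconds, safety_margin_seconds=0):
    """Time at which scale-up must begin so replicas are ready by spike_time."""
    lead = timedelta(seconds=cold_start_seconds + safety_margin_seconds)
    return spike_time - lead


spike = datetime(2026, 5, 18, 9, 0)  # a Monday 9:00 traffic spike
start = prewarm_start(spike, cold_start_seconds=465, safety_margin_seconds=435)
print(start.strftime("%H:%M"))
```

The computed time can then drive whatever scheduler raises the minimum replica count, such as a cron job.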

Strategy 3 — Tiered scale-to-zero

Don't scale all replicas to zero simultaneously. Keep one replica warm for the first tier of traffic and only scale additional replicas on demand. New replicas cold-start in the background while the warm replica handles the initial burst.
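
The tiering rule amounts to a floor on the replica count. A minimal sketch, assuming load is measured in requests per second and that per-replica capacity is known; the parameter names and default limits are illustrative.

```python
import math


def desired_replicas(requests_per_second, per_replica_rps,
                     min_warm=1, max_replicas=8):
    """Replicas needed for the load, never below the warm floor."""
    if requests_per_second > 0:
        needed = math.ceil(requests_per_second / per_replica_rps)
    else:
        needed = 0
    return min(max(needed, min_warm), max_replicas)


for load in (0, 3, 12, 100):
    print(load, "req/s ->", desired_replicas(load, per_replica_rps=5))
```

With `min_warm=1`, an idle service keeps one replica, while extra replicas are added and removed with demand.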

Strategy 4 — Weight caching on warm nodes

Pre-pull model weights to nodes that will be used for serverless inference. When a cold start lands on a pre-loaded node, the weight download phase is skipped entirely; the weights still have to be loaded into VRAM and the server initialized, but the largest source of delay is gone.
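
On the node, the cache check can be as simple as a completion marker next to the weights. The sketch below is one possible shape for that logic. `fake_download` is a hypothetical stand-in for a real object-storage download, and the model identifier is a placeholder.

```python
import os
import tempfile


def ensure_weights(cache_dir, model_id, download):
    """Return (path, downloaded), calling download(path) only on a cache miss."""
    target = os.path.join(cache_dir, model_id)
    marker = os.path.join(target, ".complete")
    if os.path.exists(marker):
        return target, False  # cache hit: skip the download phase
    os.makedirs(target, exist_ok=True)
    download(target)
    # Write the marker last so a partial download is never treated as complete.
    with open(marker, "w") as f:
        f.write("ok")
    return target, True


def fake_download(path):
    """Stand-in for a real object-storage download (hypothetical)."""
    with open(os.path.join(path, "weights.bin"), "wb") as f:
        f.write(b"\0" * 1024)


with tempfile.TemporaryDirectory() as cache:
    _, first_downloaded = ensure_weights(cache, "example-70b", fake_download)
    _, second_downloaded = ensure_weights(cache, "example-70b", fake_download)
    print(first_downloaded, second_downloaded)
```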

Strategy 5 — Smaller models for latency-sensitive paths

For endpoints where cold start latency is unacceptable, route to a smaller, always-warm model. Use the larger model for batch or async paths where cold start time is tolerable.
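
At the routing layer this is a small decision per request class. A minimal sketch, where the request kinds and backend names are hypothetical labels rather than anything a specific gateway defines:

```python
# Request classes that cannot tolerate a cold start (illustrative set).
LATENCY_SENSITIVE = {"chat", "autocomplete"}


def choose_model(request_kind):
    """Map a request class to a backend name."""
    if request_kind in LATENCY_SENSITIVE:
        return "small-always-warm"
    return "large-scale-to-zero"


for kind in ("chat", "batch-summarize"):
    print(kind, "->", choose_model(kind))
```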

---

Monitoring Cold Start in Production

Track these metrics to understand your actual cold start exposure:

| Metric | How to Collect |
| --- | --- |
| Time from request arrival to first token (cold) | Trace instrumentation |
| Cold start frequency (requests hitting cold replicas) | Inference server logs |
| Scale-up trigger to ready time | Kubernetes event logs |
| Warm replica availability % | Custom metric from replica count |
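
Warm replica availability can be derived from periodic samples of the ready replica count. A minimal sketch, assuming evenly spaced samples; how you scrape them (for example, from your metrics system) is up to you.

```python
def warm_availability_pct(ready_replica_samples):
    """Percentage of samples in which at least one replica was ready."""
    if not ready_replica_samples:
        raise ValueError("need at least one sample")
    warm = sum(1 for count in ready_replica_samples if count >= 1)
    return 100.0 * warm / len(ready_replica_samples)


# Example: ready replica counts sampled at a fixed interval.
samples = [1, 1, 0, 2]
print(f"{warm_availability_pct(samples):.1f}% warm")
```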

A useful SLO for serverless inference: "Less than 2% of requests experience cold start latency > 10s." This gives a concrete target for pre-warming and minimum replica configuration.
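
Checking that SLO against trace data is a one-line ratio. The sketch below assumes you have, for each request, the cold start delay it experienced in seconds (zero for requests served by a warm replica); the sample data is made up for illustration.

```python
def cold_start_slo(cold_delays_s, threshold_s=10.0, budget=0.02):
    """Return (fraction_over_threshold, slo_met) for a window of requests."""
    if not cold_delays_s:
        raise ValueError("need at least one request")
    slow = sum(1 for delay in cold_delays_s if delay > threshold_s)
    fraction = slow / len(cold_delays_s)
    return fraction, fraction < budget


# Illustrative window: 100 requests, one of which waited 45s on a cold start.
window = [0.0] * 99 + [45.0]
fraction, met = cold_start_slo(window)
print(f"{fraction:.1%} of requests over threshold, SLO met: {met}")
```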
