Serverless GPU Cold Start Latency: Causes and Solutions

Serverless GPU inference promises zero idle cost. The hidden trade-off is cold start latency, which for LLMs can range from about 30 seconds to several minutes. Here's what causes it and how to manage it.
The Serverless Promise and Its Cost
Serverless GPU inference is compelling: scale to zero when idle, pay only for active compute, no reserved capacity sitting unused overnight. For bursty or unpredictable workloads, it's a significant cost reduction.
The catch is cold start latency. When a request arrives and no warm replica exists, the system must:
- Provision a GPU instance
- Pull the container image
- Download model weights
- Load weights into VRAM
- Initialize the inference server
- Process the first request
For a 7B model, this sequence typically takes from about 35 seconds to nearly 3 minutes. For a 70B model, it can take 3 to 8 minutes. That first request, and every request that arrives during scale-up, waits.
---
Breaking Down Cold Start Time
| Phase | Typical Duration | Main Variable |
|---|---|---|
| Instance provisioning | 10–60s | Cloud provider, GPU availability |
| Container image pull | 5–30s | Image size, registry proximity |
| Model weight download | 10–300s | Model size, storage location |
| Weight loading to VRAM | 5–60s | Model size, NVMe speed |
| Server initialization | 5–15s | Framework, configuration |
| Total (7B model) | 35–165s | |
| Total (70B model) | 180–465s | |
The dominant factor for large models is weight download and loading. A 70B model in FP16 is ~140GB (2 bytes per parameter). Even at 2GB/s of sustained storage throughput, that's 70 seconds of pure I/O.
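The arithmetic behind these numbers can be sketched in a few lines. This is a rough model that assumes the phases run strictly one after another and that throughput is sustained; the function names are illustrative, not from any particular platform.

```python
# Phase ranges in seconds, copied from the table above: (min, max).
PHASES = {
    "instance_provisioning": (10, 60),
    "container_image_pull": (5, 30),
    "model_weight_download": (10, 300),
    "weight_loading_to_vram": (5, 60),
    "server_initialization": (5, 15),
}


def cold_start_bounds(phases):
    """Sum best-case and worst-case durations, assuming sequential phases."""
    best = sum(lo for lo, _ in phases.values())
    worst = sum(hi for _, hi in phases.values())
    return best, worst


def fp16_size_gb(params_billion):
    """FP16 stores 2 bytes per parameter, so GB is 2 x billions of parameters."""
    return params_billion * 2


def transfer_seconds(size_gb, throughput_gb_per_s):
    """Time to move size_gb at a sustained throughput, ignoring overhead."""
    return size_gb / throughput_gb_per_s


print(cold_start_bounds(PHASES))                 # (35, 465)
print(transfer_seconds(fp16_size_gb(70), 2.0))   # 70.0
```

The overall bounds of 35 and 465 seconds match the fastest 7B total and the slowest 70B total in the table.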
---
Causes of Excessive Cold Start
Weights stored in remote object storage
If model weights are pulled from S3 or GCS on every cold start, network transfer dominates cold start time. Object storage bandwidth to a fresh GPU instance is often 200–500MB/s, so pulling a 140GB 70B model takes roughly 5–12 minutes.
Fix: Pre-load weights to NVMe-attached local storage or use a shared network filesystem (EFS, Filestore) that stays warm between cold starts.
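One way to apply this fix is a cache-first lookup at startup: check local storage, and only fall back to the remote download on a miss. The sketch below is a minimal illustration under stated assumptions. `resolve_weights` and the `fetch_remote` callable are hypothetical names, `model_id` is assumed to be a plain directory name, and `fetch_remote` stands in for whatever object storage client you use.

```python
import shutil
from pathlib import Path


def resolve_weights(model_id, cache_dir, fetch_remote):
    """Return (local_path, cache_hit), downloading only on a cache miss.

    fetch_remote(model_id, dest_dir) is a hypothetical callable that writes
    the weight files into dest_dir.
    """
    cache_dir = Path(cache_dir)
    target = cache_dir / model_id
    if (target / ".complete").exists():
        return target, True  # weights already on local disk

    # Download into a staging directory and rename at the end, so an
    # interrupted download is never mistaken for a complete one.
    staging = cache_dir / f"{model_id}.partial"
    if staging.exists():
        shutil.rmtree(staging)
    staging.mkdir(parents=True)
    fetch_remote(model_id, staging)
    (staging / ".complete").touch()
    if target.exists():
        shutil.rmtree(target)
    staging.rename(target)
    return target, False
```

The completion marker matters more than it looks: without it, a replica that was killed mid-download leaves a partial directory that the next cold start would try to load.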
Large container images
A container image with PyTorch, CUDA libraries, and model dependencies can easily reach 20–30GB. Pulling this on every cold start adds significant time even before weights are considered.
Fix: Use image caching at the node level. Most Kubernetes-based serverless platforms support image pre-pulling on nodes. Keep base images lean and keep model weights in a separate layer (or outside the image entirely), so that code changes don't invalidate the cached weights.
No instance pre-warming
Pure scale-to-zero means every cold start starts from scratch. If there's no mechanism to pre-warm instances before traffic arrives, the first users after an idle period always absorb the cold start penalty.
Fix: Maintain a minimum of one warm replica during business hours. Use predictive scaling to pre-warm before anticipated traffic based on historical patterns.
Inefficient weight loading
Loading weights sequentially from disk to VRAM is slower than loading in parallel. Some inference servers also run model validation or compilation steps on startup that add unnecessary time.
Fix: Use frameworks that can load weights in parallel across GPUs, as tensor-parallel deployments do. Use compiled or cached model formats where available (TensorRT-LLM, vLLM's built-in caching).
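To illustrate the difference between sequential and parallel loading, here is a small sketch that reads weight shards from disk concurrently. It only covers the disk-to-host-memory step and assumes the weights are split into several shard files; the copy into VRAM is left to the inference server. The function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_shard(path):
    """Read one shard file fully into host memory."""
    return Path(path).read_bytes()


def load_shards_parallel(shard_paths, max_workers=4):
    """Read shards concurrently; results keep the order of shard_paths."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_shard, shard_paths))
```

Threads are enough here because file reads release the GIL. Whether this helps in practice depends on the storage: a single saturated disk gains little, while NVMe or a parallel filesystem usually gains more.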
---
Strategies to Manage Cold Start
Strategy 1 — Minimum warm replicas
The simplest fix: never scale to zero. Keep at least one replica warm at all times, or during hours when traffic is likely. The cost of one idle replica is predictable and usually much less than the user experience cost of multi-minute cold starts.
Strategy 2 — Predictive pre-warming
Use historical traffic patterns to pre-warm replicas before demand arrives. If traffic spikes every weekday at 9am, begin scaling up at 8:45am. This eliminates cold starts for the majority of traffic at the cost of about 15 minutes of idle replica time per spike.
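The 9am example can be expressed as a small scheduling function that returns a replica floor for the current time. This is a sketch under stated assumptions: spike times are already known from historical data, datetimes are naive and in the local timezone, and `prewarm_replicas` is a hypothetical helper whose result you would feed into your autoscaler's minimum replica setting.

```python
from datetime import datetime, time, timedelta


def prewarm_replicas(now, spike_times, lead=timedelta(minutes=15), replicas=1):
    """Return the replica floor to request at `now`.

    Inside the lead window before a known weekday spike, ask for `replicas`.
    Outside it, return 0 and let load-based autoscaling decide.
    """
    if now.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return 0
    for spike in spike_times:
        spike_at = datetime.combine(now.date(), spike)
        if spike_at - lead <= now < spike_at:
            return replicas
    return 0


spikes = [time(9, 0)]
print(prewarm_replicas(datetime(2024, 1, 8, 8, 45), spikes))  # Monday 8:45 -> 1
print(prewarm_replicas(datetime(2024, 1, 8, 8, 30), spikes))  # Monday 8:30 -> 0
```

The lead time should be at least as long as your measured cold start, otherwise the replica is still loading when the spike arrives.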
Strategy 3 — Tiered scale-to-zero
Don't scale all replicas to zero simultaneously. Keep one replica warm for the first tier of traffic and only scale additional replicas on demand. New replicas cold-start in the background while the warm replica handles the initial burst.
Strategy 4 — Weight caching on warm nodes
Pre-pull model weights to nodes that will be used for serverless inference. When a cold start lands on a pre-loaded node, the weight download phase is skipped, leaving only provisioning, loading weights into VRAM, and server initialization.
Strategy 5 — Smaller models for latency-sensitive paths
For endpoints where cold start latency is unacceptable, route to a smaller, always-warm model. Use the larger model for batch or async paths where cold start time is tolerable.
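In its simplest form, this routing is a lookup on the request type. The deployment names and the set of latency-sensitive request kinds below are hypothetical placeholders, not part of any real platform.

```python
# Hypothetical deployment names; substitute your own endpoints.
SMALL_ALWAYS_WARM = "small-model-warm"
LARGE_SCALE_TO_ZERO = "large-model-serverless"

# Assumed request kinds that cannot tolerate a cold start.
LATENCY_SENSITIVE = {"chat", "autocomplete"}


def pick_deployment(request_kind):
    """Send interactive traffic to the warm small model, the rest to the large one."""
    if request_kind in LATENCY_SENSITIVE:
        return SMALL_ALWAYS_WARM
    return LARGE_SCALE_TO_ZERO
```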
---
Monitoring Cold Start in Production
Track these metrics to understand your actual cold start exposure:
| Metric | How to Collect |
|---|---|
| Time from request arrival to first token (cold) | Trace instrumentation |
| Cold start frequency (requests hitting cold replicas) | Inference server logs |
| Scale-up trigger to ready time | Kubernetes event logs |
| Warm replica availability % | Custom metric from replica count |
A useful SLO for serverless inference: "Less than 2% of requests experience cold start latency > 10s." This gives a concrete target for pre-warming and minimum replica configuration.
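Checking that SLO against collected data is straightforward. The sketch below assumes you already have time-to-first-token measurements in seconds for each request in the evaluation window; `cold_start_slo` is an illustrative helper, with defaults taken from the SLO stated above.

```python
def cold_start_slo(latencies_s, threshold_s=10.0, budget=0.02):
    """Return (fraction_of_slow_requests, slo_met).

    A request counts as slow if its time to first token exceeds threshold_s.
    The SLO is met when strictly less than `budget` of requests are slow.
    """
    if not latencies_s:
        return 0.0, True  # no traffic, nothing violated
    slow = sum(1 for latency in latencies_s if latency > threshold_s)
    fraction = slow / len(latencies_s)
    return fraction, fraction < budget


# 99 warm requests and 1 cold start out of 100.
print(cold_start_slo([0.4] * 99 + [42.0]))  # (0.01, True)
```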
---
Next in the GPU Ops Field Guide: [Audit Trails for AI Infrastructure Changes →](/blog/gpu-ops-audit-trails)