How do I monitor a GPU inference fleet at scale?

GPU fleet observability requires four layers: hardware metrics per GPU (SM utilization, memory bandwidth, VRAM, power draw via NVIDIA DCGM), inference server metrics per model (throughput, TTFT, inter-token latency, KV cache hit rate via vLLM or TGI metrics endpoints), workload metrics per deployment (pod restarts, replica count vs. traffic, cold start frequency via Kubernetes), and fleet-level aggregations that answer cross-cluster questions — which models are underutilizing their tier, which clusters have systematic waste, what the fleet-wide KV cache pressure trend is. Most teams have the first three layers but miss the fourth.

What metrics should I track for GPU fleet observability?

The most important metrics for inference fleet health: SM utilization (actual compute usage vs. capacity), memory bandwidth utilization (whether the GPU is memory-bandwidth-bound), VRAM used and free (headroom before OOM or KV cache pressure), request throughput and queue depth (capacity vs. demand), time to first token (prefill efficiency), KV cache hit rate (prefix caching effectiveness), and pod restart count (OOM or crash frequency). At fleet scale, these should be aggregated across clusters and correlated with model identity — not viewed per-node.

How is fleet-level GPU monitoring different from single-node monitoring?

Single-node GPU monitoring tells you whether a specific GPU or pod is healthy. Fleet-level observability answers cross-cluster questions: which models across the whole fleet are underutilizing their GPU tier, which provider is delivering the worst GPU-to-cost ratio this week, what the fleet-wide trend in KV cache pressure is. This requires aggregated recording rules in Prometheus and fleet-level Grafana dashboards — not more per-node dashboards. The most expensive fleet problems (systematic tier misplacement, fleet-wide waste patterns) are invisible without the aggregated view.

What Prometheus alerts should I set for a GPU inference fleet?

Alert on sustained trends rather than momentary spikes. Key alerts: average SM utilization per cluster below 45% for 30 minutes (avg(gpu_sm_utilization) by (cluster) < 0.45, for: 30m), an H100 or A100 running below 35% SM utilization for one hour (possible tier mismatch), GPU memory usage above 88% averaged over 15 minutes, and an allocated GPU with zero inference requests for more than 15 minutes. Cost-signal alerts, not just technical thresholds, are what make fleet observability actionable.

GPU Ops Field Guide

GPU Fleet Observability: What to Monitor and Why

By Sam Hosseini · May 16, 2026 · 7 min read

A single GPU dashboard is not fleet observability. At scale, the metrics that matter are aggregated, correlated, and surfaced as actionable signals — not raw telemetry. Here's what to build.

Why Single-GPU Metrics Aren't Enough

Most GPU monitoring starts with nvidia-smi or a per-node Grafana dashboard. For a single GPU or a small cluster, that's sufficient. For a fleet — multiple clusters, mixed GPU tiers, dozens of models, multiple cloud providers — per-node metrics create more noise than signal.

The questions that matter at fleet scale are different:

Which models are underutilizing their GPU tier across the whole fleet?
Which clusters have systematic waste patterns vs. one-off anomalies?
What is the fleet-wide trend in KV cache pressure over the past 7 days?
Which provider is delivering the worst GPU-to-cost ratio this week?

Answering these requires aggregated, correlated observability, not more per-node dashboards.

---

The Four Layers of GPU Fleet Observability

Layer 1 — Hardware Metrics (per GPU)

The foundation. Collected by NVIDIA DCGM and exposed to Prometheus through the DCGM Exporter running on each node.

| Metric | Why It Matters |
| --- | --- |
| SM Utilization | Actual compute usage vs. capacity |
| Memory Bandwidth Utilization | Whether the GPU is memory-bandwidth-bound |
| VRAM Used / Free | Headroom before OOM or KV cache pressure |
| GPU Temperature | Thermal throttling risk |
| Power Draw | Cost correlation and thermal headroom |
| PCIe Throughput | Data transfer bottlenecks |
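
As a rough sketch of how these hardware metrics might be normalized, the recording rules below map DCGM Exporter fields onto the shorter names used in the alert examples later in this guide. The group name is hypothetical. `DCGM_FI_PROF_SM_ACTIVE` is a profiling field that may need to be enabled in the exporter's counter list, and total VRAM is approximated here as used plus free framebuffer, which is an assumption rather than an exact figure.

```yaml
groups:
  - name: gpu_hardware_normalized   # hypothetical group name
    rules:
      # SM activity as a 0-1 ratio, one series per GPU.
      - record: gpu_sm_utilization
        expr: DCGM_FI_PROF_SM_ACTIVE

      # VRAM usage as a 0-1 ratio. Both inputs are reported in MiB,
      # so the units cancel. Total is approximated as used + free.
      - record: gpu_memory_used_ratio
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)
```

Normalizing once at this layer keeps every downstream rule and dashboard independent of the exporter's field names.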

Layer 2 — Inference Server Metrics (per model)

Collected from vLLM, TGI, SGLang, or Triton metrics endpoints.

| Metric | Why It Matters |
| --- | --- |
| Request throughput (req/s) | Capacity vs. demand |
| Time to first token (TTFT) | Prefill efficiency |
| Inter-token latency | Decode efficiency |
| KV cache hit rate | Prefix caching effectiveness |
| Queue depth | Whether the server is keeping up |
| Batch size distribution | Continuous batching effectiveness |
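
For illustration, two of these signals could be pre-computed per model from a vLLM metrics endpoint roughly as follows. The metric names (`vllm:time_to_first_token_seconds`, `vllm:num_requests_waiting`) and the `model_name` label reflect vLLM's Prometheus output as I understand it and can differ between versions. TGI, SGLang, and Triton use different names, and the recorded series names are hypothetical.

```yaml
groups:
  - name: inference_server_vllm   # hypothetical group name
    rules:
      # p95 time to first token per model over a 5-minute window,
      # computed from the TTFT histogram buckets.
      - record: model:vllm_ttft_seconds:p95
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, model_name) (rate(vllm:time_to_first_token_seconds_bucket[5m]))
          )

      # Queue depth per model: requests waiting to be scheduled.
      - record: model:vllm_requests_waiting:sum
        expr: sum by (model_name) (vllm:num_requests_waiting)
```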

Layer 3 — Workload Metrics (per deployment)

Collected from your orchestration layer (Kubernetes, Ray, custom scheduler).

| Metric | Why It Matters |
| --- | --- |
| Pod restart count | OOM or crash frequency |
| Replica count vs. traffic | Autoscaling efficiency |
| Request error rate | Model or infrastructure health |
| Cold start frequency | Scale-to-zero configuration effectiveness |
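
On Kubernetes, a minimal sketch of the first two rows might use kube-state-metrics series like the ones below. The group and recorded series names are hypothetical, and grouping restarts by namespace and pod (rather than by deployment) is a simplification.

```yaml
groups:
  - name: inference_workloads   # hypothetical group name
    rules:
      # Container restarts per pod over the last hour (OOM or crash signal).
      - record: namespace_pod:container_restarts:increase1h
        expr: sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[1h]))

      # Fraction of desired replicas that are available. Deployments
      # scaled to zero are filtered out to avoid dividing by zero.
      - record: namespace_deployment:replicas_available:ratio
        expr: |
          kube_deployment_status_replicas_available
            / (kube_deployment_spec_replicas > 0)
```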

Layer 4 — Fleet-Level Aggregations

This is the layer most teams are missing: aggregating layers 1–3 across the whole fleet to answer fleet-scale questions.

| Aggregation | Signal |
| --- | --- |
| Fleet-wide GPU utilization distribution | What % of GPUs are under 50% SM util? |
| Tier mismatch rate | How many models are on the wrong GPU tier? |
| Provider cost efficiency | Cost per useful GPU-hour by provider |
| KV cache pressure by model | Which models are cache-constrained? |

---

Instrumentation Stack

A practical fleet observability stack:

NVIDIA DCGM Exporter (per node)
    → Prometheus (metrics aggregation)
    → Grafana (dashboards)
    → Alertmanager (threshold alerts)

vLLM / TGI metrics endpoint (per model)
    → Prometheus

Kubernetes metrics (per pod/deployment)
    → kube-state-metrics → Prometheus

Fleet aggregation layer
    → Recording rules in Prometheus
    → Custom fleet dashboard in Grafana

The key is recording rules — pre-computed aggregations that answer fleet-scale questions without running expensive ad-hoc queries against raw telemetry.
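
A recording rule file for the fleet layer might look roughly like this. It reuses the `gpu_sm_utilization` metric and the `cluster` and `tier` labels from the alert examples in this guide. The group name, recorded series names, and the 1-minute evaluation interval are assumptions for the sketch.

```yaml
groups:
  - name: gpu_fleet_aggregations   # hypothetical group name
    interval: 1m
    rules:
      # Average SM utilization per cluster.
      - record: cluster:gpu_sm_utilization:avg
        expr: avg by (cluster) (gpu_sm_utilization)

      # Share of GPUs in the fleet running under 50% SM utilization.
      # "or vector(0)" keeps the series present when no GPU is below the line.
      - record: fleet:gpu_under_50pct_sm:ratio
        expr: |
          (count(gpu_sm_utilization < 0.5) or vector(0))
            / count(gpu_sm_utilization)

      # Number of H100s whose 1-hour average SM utilization is under 35%,
      # a candidate input for a tier mismatch rate.
      - record: fleet:h100_underutilized:count
        expr: count(avg_over_time(gpu_sm_utilization{tier="h100"}[1h]) < 0.35)
```

Fleet dashboards and alerts can then query these few pre-computed series instead of scanning every per-GPU series on each refresh.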

---

Alert Design Principles

Most GPU alert setups generate too many alerts on transient spikes and miss the slow-burn patterns that actually cost money.

Alert on trends, not spikes:

# Bad: fires on any momentary spike (instantaneous value, no "for:" clause)
- alert: HighGPUMemory
  expr: gpu_memory_used_bytes > 0.9 * gpu_memory_total_bytes

# Better: fires only when the 15-minute average stays high
- alert: SustainedGPUMemoryPressure
  expr: avg_over_time(gpu_memory_used_ratio[15m]) > 0.88

Alert on fleet patterns, not individual nodes:

# Fires when a cluster's average SM utilization stays under 45% for 30 minutes
- alert: FleetWideUnderutilization
  expr: avg by (cluster) (gpu_sm_utilization) < 0.45
  for: 30m

Alert on cost signals, not just technical ones:

# Fires per GPU: an H100 that stays under 35% SM utilization for a full hour
- alert: ExpensiveTierUnderutilized
  expr: gpu_sm_utilization{tier="h100"} < 0.35
  for: 1h
  annotations:
    summary: "H100 running below 35% SM util for 1 hour (possible tier mismatch)"
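
Another cost signal is an allocated GPU that serves no inference requests for more than 15 minutes. A hedged sketch of such a rule, assuming a vLLM server whose `vllm:request_success_total` counter is scraped with a `pod` label attached by the scrape configuration (both are assumptions about your setup, and the group and alert names are hypothetical):

```yaml
groups:
  - name: gpu_cost_signals   # hypothetical group name
    rules:
      - alert: AllocatedGPUIdle
        # No completed requests in any 5-minute window, held for 15 minutes.
        expr: sum by (pod) (rate(vllm:request_success_total[5m])) == 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} holds a GPU but has served no requests for 15 minutes"
```

This only covers pods that are up and exporting metrics. A pod that has stopped reporting produces no series, so it would need a separate absence check.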

---

The Visibility Gap at Scale

The most dangerous fleet observability failure mode is not missing metrics. It is having the metrics while no one looks at them at the fleet level: per-node dashboards exist, but fleet-level patterns go undetected for weeks.

Fleet observability is the discipline of designing the system so that the signals that matter (tier mismatches, systematic waste, KV cache pressure trends) surface automatically as actionable findings instead of staying buried in dashboards that require human interpretation.

---

Next in the GPU Ops Field Guide: [Serverless GPU Cold Start Latency: Causes and Solutions →](/blog/gpu-ops-serverless-cold-start)
