GPU Fleet Observability: What to Monitor and Why

A single GPU dashboard is not fleet observability. At scale, the metrics that matter are aggregated, correlated, and surfaced as actionable signals — not raw telemetry. Here's what to build.
Why Single-GPU Metrics Aren't Enough
Most GPU monitoring starts with nvidia-smi or a per-node Grafana dashboard. For a single GPU or a small cluster, that's sufficient. For a fleet — multiple clusters, mixed GPU tiers, dozens of models, multiple cloud providers — per-node metrics create more noise than signal.
The questions that matter at fleet scale are different:
- Which models are underutilizing their GPU tier across the whole fleet?
- Which clusters have systematic waste patterns vs. one-off anomalies?
- What is the fleet-wide trend in KV cache pressure over the past 7 days?
- Which provider is delivering the worst cost per useful GPU-hour this week?
Answering these requires aggregated, correlated observability — not more dashboards.
---
The Four Layers of GPU Fleet Observability
Layer 1 — Hardware Metrics (per GPU)
The foundation. Collected via NVIDIA DCGM and exposed through Prometheus.
| Metric | Why It Matters |
|---|---|
| SM Utilization | Actual compute usage vs. capacity |
| Memory Bandwidth Utilization | Whether the GPU is memory-bandwidth-bound |
| VRAM Used / Free | Headroom before OOM or KV cache pressure |
| GPU Temperature | Thermal throttling risk |
| Power Draw | Cost correlation and thermal headroom |
| PCIe Throughput | Data transfer bottlenecks |
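
As a rough sketch of how these hardware metrics might be normalized, the Prometheus rules file below maps DCGM exporter fields onto the generic metric names used in the alert examples later in this article. The recorded names on the left are illustrative choices, not names the exporter emits. The `DCGM_FI_PROF_*` profiling fields are assumed to be enabled in the exporter's metric list, and they may not be available on every GPU model.

```yaml
# Sketch only: recorded names are illustrative and chosen to match
# the generic names used in this article's alert examples.
groups:
  - name: gpu_hardware_normalized
    interval: 30s
    rules:
      # SM activity as a 0-1 ratio (DCGM profiling field, assumed enabled).
      - record: gpu_sm_utilization
        expr: DCGM_FI_PROF_SM_ACTIVE
      # Memory bandwidth activity as a 0-1 ratio (DCGM profiling field).
      - record: gpu_memory_bandwidth_utilization
        expr: DCGM_FI_PROF_DRAM_ACTIVE
      # VRAM used as a 0-1 ratio of used plus free framebuffer memory.
      # This ignores reserved memory, so it slightly understates pressure.
      - record: gpu_memory_used_ratio
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)
```

Normalizing once at this layer means alerts and dashboards can be written against stable names even if the exporter's field set changes.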
Layer 2 — Inference Server Metrics (per model)
Collected from vLLM, TGI, SGLang, or Triton metrics endpoints.
| Metric | Why It Matters |
|---|---|
| Request throughput (req/s) | Capacity vs. demand |
| Time to first token (TTFT) | Prefill efficiency |
| Inter-token latency | Decode efficiency |
| KV cache hit rate | Prefix caching effectiveness |
| Queue depth | Whether the server is keeping up |
| Batch size distribution | Continuous batching effectiveness |
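
A minimal sketch of per-model recording rules, assuming vLLM is the inference server. The `vllm:` metric names and the `model_name` label reflect commonly exposed vLLM metrics, but they have changed between versions, so check them against your server's `/metrics` output before relying on them. The recorded names are illustrative.

```yaml
# Sketch only: verify the vllm: metric names against your server version.
groups:
  - name: inference_server_per_model
    rules:
      # p95 time to first token per model over the last 5 minutes.
      - record: model:ttft_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, model_name) (
              rate(vllm:time_to_first_token_seconds_bucket[5m])
            )
          )
      # Requests currently waiting in the queue, per model.
      - record: model:queue_depth:sum
        expr: |
          sum by (model_name) (vllm:num_requests_waiting)
      # Completed requests per second, per model.
      - record: model:requests:rate5m
        expr: |
          sum by (model_name) (rate(vllm:request_success_total[5m]))
```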
Layer 3 — Workload Metrics (per deployment)
Collected from your orchestration layer (Kubernetes, Ray, custom scheduler).
| Metric | Why It Matters |
|---|---|
| Pod restart count | OOM or crash frequency |
| Replica count vs. traffic | Autoscaling efficiency |
| Request error rate | Model or infrastructure health |
| Cold start frequency | Scale-to-zero configuration effectiveness |
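
For Kubernetes deployments, some of these workload signals can be derived from kube-state-metrics. The rules below are a sketch under that assumption; the recorded names are illustrative, and cold start frequency is left out because it depends on how your autoscaler reports scale-from-zero events.

```yaml
# Sketch only: assumes kube-state-metrics is scraped by Prometheus.
groups:
  - name: workload_per_deployment
    rules:
      # Container restarts per pod over the last hour.
      - record: namespace_pod:container_restarts:increase1h
        expr: |
          sum by (namespace, pod) (
            increase(kube_pod_container_status_restarts_total[1h])
          )
      # Containers whose most recent termination was an OOM kill.
      - record: namespace:containers_oomkilled:count
        expr: |
          sum by (namespace) (
            kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
          )
      # Available replicas as a fraction of desired replicas.
      # Yields NaN when a deployment is scaled to zero (0 / 0).
      - record: deployment:replicas_available:ratio
        expr: |
          kube_deployment_status_replicas_available
            / kube_deployment_spec_replicas
```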
Layer 4 — Fleet-Level Aggregations
This is the layer most teams are missing: aggregating layers 1–3 across the whole fleet to answer fleet-scale questions.
| Aggregation | Signal |
|---|---|
| Fleet-wide GPU utilization distribution | What % of GPUs are under 50% SM util? |
| Tier mismatch rate | How many models are on the wrong GPU tier? |
| Provider cost efficiency | Cost per useful GPU-hour by provider |
| KV cache pressure by model | Which models are cache-constrained? |
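
The aggregations in this table could be expressed as Prometheus recording rules along the following lines. This is a sketch under several assumptions: every GPU series carries `cluster`, `tier`, and `provider` labels; `gpu_sm_utilization` is a 0-1 ratio; and `kv_cache_usage_ratio` and `gpu_hourly_cost_usd` are hypothetical metrics you would need to export yourself. The cost rule shows one possible definition of "cost per useful GPU-hour", not a standard one.

```yaml
# Sketch only: kv_cache_usage_ratio and gpu_hourly_cost_usd are
# hypothetical metrics; labels cluster/tier/provider are assumed.
groups:
  - name: fleet_aggregations
    interval: 1m
    rules:
      # Fraction of GPUs per cluster whose 1h average SM utilization
      # is below 50%. "< bool" yields 1 or 0 per GPU; avg gives the share.
      - record: cluster:gpu_underutilized:ratio
        expr: avg by (cluster) (avg_over_time(gpu_sm_utilization[1h]) < bool 0.5)
      # Same share grouped by tier, as a rough tier-mismatch signal.
      - record: tier:gpu_underutilized:ratio
        expr: avg by (tier) (avg_over_time(gpu_sm_utilization[1h]) < bool 0.5)
      # Average KV cache usage per model over the last hour.
      - record: model:kv_cache_usage:avg1h
        expr: avg by (model_name) (avg_over_time(kv_cache_usage_ratio[1h]))
      # Hourly spend divided by utilization-weighted GPU count, per provider.
      - record: provider:cost_per_useful_gpu_hour:ratio
        expr: sum by (provider) (gpu_hourly_cost_usd) / sum by (provider) (gpu_sm_utilization)
```

Graphing the recorded per-model KV cache series over a 7-day window then answers the trend question without rescanning raw per-GPU samples.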
---
Instrumentation Stack
A practical fleet observability stack:
NVIDIA DCGM Exporter (per node)
→ Prometheus (metrics aggregation)
→ Grafana (dashboards)
→ Alertmanager (threshold alerts)
vLLM / TGI metrics endpoint (per model)
→ Prometheus
Kubernetes metrics (per pod/deployment)
→ kube-state-metrics → Prometheus
Fleet aggregation layer
→ Recording rules in Prometheus
→ Custom fleet dashboard in Grafana
The key is recording rules: pre-computed aggregations that answer fleet-scale questions without running expensive ad-hoc queries against raw telemetry.
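
To make the stack concrete, here is a minimal sketch of a `prometheus.yml` that scrapes the three sources and loads recording and alerting rule files. All hostnames, file paths, and label values are placeholders. Static targets are used only to keep the example short; a real fleet would more likely use service discovery. The `cluster`, `provider`, and `tier` labels are attached at scrape time because the fleet-level rules and alerts in this article group by them.

```yaml
# Sketch only: hostnames, paths, and label values are placeholders.
global:
  scrape_interval: 15s
  evaluation_interval: 30s

# Recording rules (fleet aggregations) and alert rules live in separate files.
rule_files:
  - rules/fleet_recording.yml
  - rules/fleet_alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  # Layer 1: DCGM exporter on each GPU node.
  - job_name: dcgm-exporter
    static_configs:
      - targets: ["gpu-node-1:9400"]
        labels:
          cluster: cluster-a
          provider: example-cloud
          tier: h100
  # Layer 2: inference server metrics endpoint per model.
  - job_name: vllm
    metrics_path: /metrics
    static_configs:
      - targets: ["vllm-model-a:8000"]
        labels:
          cluster: cluster-a
  # Layer 3: Kubernetes object state.
  - job_name: kube-state-metrics
    static_configs:
      - targets: ["kube-state-metrics:8080"]
```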
---
Alert Design Principles
Most GPU alert setups generate too many alerts on transient spikes and miss the slow-burn patterns that actually cost money.
Alert on trends, not spikes:
# Bad: alerts on momentary spike
alert: HighGPUMemory
expr: gpu_memory_used_bytes > 0.9 * gpu_memory_total_bytes
# Better: alerts on sustained pressure
alert: SustainedGPUMemoryPressure
expr: avg_over_time(gpu_memory_used_ratio[15m]) > 0.88
Alert on fleet patterns, not individual nodes:
alert: FleetWideUnderutilization
expr: avg(gpu_sm_utilization) by (cluster) < 0.45
for: 30m
Alert on cost signals, not just technical ones:
alert: ExpensiveTierUnderutilized
expr: gpu_sm_utilization{tier="h100"} < 0.35
for: 1h
annotations:
  summary: "H100 running below 35% SM util for 1 hour; possible tier mismatch"
---
The Visibility Gap at Scale
The most dangerous fleet observability failure mode isn't missing metrics — it's having metrics but no one looking at the right level. Per-node dashboards exist but fleet-level patterns go undetected for weeks.
The discipline of fleet observability is designing the system so that the signals that matter (tier mismatches, systematic waste, KV cache pressure trends) surface automatically as actionable findings, rather than staying buried in dashboards that require human interpretation.
---
Next in the GPU Ops Field Guide: [Serverless GPU Cold Start Latency: Causes and Solutions →](/blog/gpu-ops-serverless-cold-start)