Can you change a GPU tier without downtime?

Yes, with the right Kubernetes configuration. The key settings are maxUnavailable: 0 in the Deployment rolling update strategy and a working readiness probe on the vLLM /health endpoint. With these in place, the old pod stays up and serves traffic until the new pod on the target tier has fully loaded the model and passed its health check. For single-replica deployments, a blue/green rollout — spin up the new deployment fully, then switch the Service selector — eliminates the loading gap entirely.

What happens to in-flight requests during a Kubernetes rolling update?

Kubernetes sends SIGTERM to the pod being terminated. vLLM handles graceful shutdown by draining in-flight requests before exiting, so requests that arrived before SIGTERM complete normally, provided they finish within the pod's terminationGracePeriodSeconds (30 seconds by default), after which Kubernetes sends SIGKILL. New requests are routed to the new pod once it passes its readiness probe. The overlap window is typically 30–120 seconds depending on model size.

How does GPU optimization actuation work in a GitOps environment?

Paralleliq opens a pull request against the config repository with the specific change — nodeSelector, resource limits, or vLLM startup arguments. The PR contains only the modified field. A second engineer reviews and merges the PR. ArgoCD or Flux detects the merge and applies the change to the cluster. The Paralleliq audit trail records the approval; the Git history records the PR and merge.

Who reviews a GPU configuration change before it reaches the cluster?

In GitOps environments: two people — the operator who approves the recommendation in Paralleliq, and the engineer who reviews and merges the PR. In Helm or direct kubectl environments: one person — the operator whose approval triggers the actuation. Which pattern is appropriate depends on the criticality of the workload and the team's change management requirements.

Does Paralleliq own the rollout when applying GPU optimization changes?

No. Paralleliq tells you what to change and verifies that it worked. The rollout itself (rolling update strategy, PodDisruptionBudgets, canary weights, blue/green transitions) is the responsibility of the operator's CI/CD pipeline. Paralleliq runs pre-flight checks before dispatching a change (single-replica warning, maxUnavailable check, PDB check) but delegates execution to whatever deployment tooling the cluster already uses.

What is the difference between 'approved' and 'actuated' in a GPU ops audit trail?

Approved means an operator has authorized the change and accepted accountability for it. Actuated means the change was applied to the cluster. In GitOps environments these can be separated by minutes — the time it takes for a PR to be reviewed and merged. In direct kubectl environments they happen in close sequence. The separation matters for compliance: approval is a human decision event; actuation is a system execution event. Both need to be in the record.

How do you verify that a GPU optimization recommendation actually worked?

Post-actuation verification watches the specific metrics that triggered the original recommendation and confirms they moved in the expected direction after the change was applied — OOM kills dropped to zero, GPU utilization normalized, KV cache pressure resolved. If the metrics do not improve, the verified event is not written and the recommendation surfaces again for re-evaluation. This is what closes the loop between recommendation and outcome.

What happens if a GPU configuration change makes things worse?

Rollback. Because Paralleliq captures the pre-change state at actuation time, a rollback triggers a new approval workflow with the inverse change, creates a new audit entry, and restores the previous configuration through the same deployment integration path. With canary rollout enabled, rollback can happen automatically when a metric threshold is breached — before the change ever reaches 100% of the fleet. Without canary, rollback is a human-approved event initiated from the dashboard.
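
A minimal sketch of how an inverse change could be derived from the captured pre-change state. The flat field-to-value layout, the field names, the `nvidia-a100` value, and the function name are all assumptions for illustration; the source does not describe how Paralleliq stores this state.

```python
def build_inverse_change(pre_change_state: dict, applied_change: dict) -> dict:
    """Return a change that restores every field the applied change touched.

    Fields that did not exist before the change map to None, meaning
    "remove this field" in this hypothetical format.
    """
    return {field: pre_change_state.get(field) for field in applied_change}


# State captured at actuation time (illustrative field names and values).
pre_change_state = {
    "nodeSelector.accelerator": "nvidia-a100",
    "replicas": 2,
}
applied_change = {"nodeSelector.accelerator": "nvidia-l4"}

inverse = build_inverse_change(pre_change_state, applied_change)
print(inverse)  # {'nodeSelector.accelerator': 'nvidia-a100'}
```

Only the fields the change touched are reverted, so unrelated settings such as the replica count stay as they are.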

Does Paralleliq support canary rollouts for GPU optimization changes?

Yes. Each rule can define a canaryPolicy that controls the traffic split percentage, observation window, and rollback triggers. If you are already using Argo Rollouts or Flagger, Paralleliq integrates with those tools and monitors the canary outcome. If not, Paralleliq can manage the traffic split directly. Either way, every canary event — started, metric alert, promoted, rolled back — is written to the audit trail with timestamps and actor identity.

GPU Ops Field Guide

From GPU Waste Finding to Production Change: What Actually Happens in Between

By Sam Hosseini · June 9, 2026 · 8 min read

Every GPU optimization tool will tell you what's wrong. Almost none of them tell you what happens next — between the moment an engineer agrees with a recommendation and the moment the fleet actually changes.

The approval is an intent, not a change

When an engineer clicks approve on a GPU optimization recommendation, they have expressed intent. Nothing in the cluster has changed yet. The pod is still on the wrong tier. The KV cache is still saturated. The OOM kills are still happening.

What approval does is authorize actuation — it says "this recommendation is correct and I am accountable for it." That authorization needs to be captured with an immutable record: who approved, when, under what identity, for which workload. Without that record you have a button, not a workflow.

The change itself requires a separate step: getting the new configuration into the cluster through whatever mechanism that cluster is managed by.

---

Three deployment integration patterns

How a configuration change reaches a Kubernetes cluster depends entirely on how that cluster is managed. There is no universal answer. Optimization tools that assume a single path fail in production environments.

GitOps (ArgoCD / Flux)

The most common pattern in mature AI infrastructure teams. The desired state of every deployment lives in a Git repository. ArgoCD or Flux continuously reconciles the cluster against that repository.

In this model, an optimization recommendation becomes a pull request. Paralleliq opens a PR against the config repository with the specific change: a nodeSelector update, a vLLM argument flag, a replica count. A second engineer reviews the diff and merges it. ArgoCD or Flux detects the merge and applies the change to the cluster automatically.

The audit chain has two humans: the operator who approved the recommendation in Paralleliq, and the engineer who reviewed and merged the PR in Git. Both are in the record.

Helm-managed deployments

Helm stores the current release state — values, chart version, computed manifests — as Kubernetes secrets. Changes go through helm upgrade. Paralleliq reads the release metadata to identify which values key controls the setting being changed (nodeSelector, resource limits, vLLM startup args), then generates the precise upgrade command:

helm upgrade vllm-mistral-7b ./charts/vllm \
  --reuse-values \
  --set gpu.nodeSelector.accelerator=nvidia-l4

The --reuse-values flag is non-negotiable: it carries over the values from the previous release and merges in only the --set override, which scopes the change to exactly what was recommended and leaves everything else untouched.

Direct Kubernetes API

Smaller teams or clusters managed without a GitOps layer use direct kubectl or the Kubernetes API. Paralleliq's in-cluster actuator applies a targeted JSON patch to the Deployment spec. The change is immediate once approved. The audit trail lives in Paralleliq rather than in a Git commit.
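
As a rough illustration of what a targeted JSON patch for a tier change could look like, the sketch below builds an RFC 6902 patch that sets one nodeSelector label on the pod template. The `accelerator` key and `nvidia-l4` value are borrowed from the Helm example above; the function name is hypothetical, and how Paralleliq's actuator actually constructs its patches is not documented here.

```python
import json


def build_node_selector_patch(label_key: str, label_value: str) -> list[dict]:
    """Build an RFC 6902 JSON patch that sets one nodeSelector label.

    Hypothetical helper for illustration; not Paralleliq's actual API.
    """
    # JSON Pointer (RFC 6901) escaping: "~" becomes "~0", then "/" becomes "~1".
    escaped_key = label_key.replace("~", "~0").replace("/", "~1")
    return [
        {
            # "add" on an object member sets it whether or not it already exists,
            # but the parent nodeSelector object must already be present.
            "op": "add",
            "path": f"/spec/template/spec/nodeSelector/{escaped_key}",
            "value": label_value,
        }
    ]


patch = build_node_selector_patch("accelerator", "nvidia-l4")
print(json.dumps(patch, indent=2))
```

A patch of this shape can be applied by hand with `kubectl patch deployment <name> --type=json -p '<patch>'`, which is a convenient way to dry-run the change before trusting any automation with it.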

---

Rollout safety is the operator's responsibility, not the tool's

Changing a Deployment's pod template in Kubernetes triggers a rolling update. In a correctly configured rollout, a new pod starts on the target GPU tier, the serving framework loads the model into memory, the pod passes its readiness probe, and only then is the old pod terminated and traffic shifted.

For stateless web services, this takes seconds. For inference workloads, the model loading window changes the calculus entirely. A 7B model at FP16 takes 15–30 seconds to load. A 70B model takes 60–120 seconds. If the old pod is terminated before the new one is ready, capacity is reduced for that entire window.

Two Kubernetes settings control whether this goes smoothly:

- `maxUnavailable: 0`: the old pod is never terminated until the new one is ready. This is not the Kubernetes default, which is 25%.
- A working readiness probe: vLLM exposes a /health endpoint. Until it returns 200, Kubernetes will not send traffic to the new pod and will not count it as available.

Without both settings, a rolling update on an inference workload can briefly serve zero replicas.
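
The two settings can be sketched as a fragment of a Deployment spec, written here as a Python dict and printed as JSON (which `kubectl` also accepts). Port 8000 is vLLM's default; the container name, `maxSurge` value, and probe timings are assumptions to adjust for your own workload.

```python
import json

# Fragment of a Deployment spec covering only the two settings discussed above.
# The container name and probe timings are illustrative assumptions.
deployment_fragment = {
    "spec": {
        "strategy": {
            "type": "RollingUpdate",
            "rollingUpdate": {
                "maxUnavailable": 0,  # never remove an old pod before a new one is ready
                "maxSurge": 1,        # allow one extra pod during the rollout
            },
        },
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "vllm",  # hypothetical container name
                        "readinessProbe": {
                            "httpGet": {"path": "/health", "port": 8000},
                            "periodSeconds": 5,
                            "failureThreshold": 3,
                        },
                    }
                ]
            }
        },
    }
}

print(json.dumps(deployment_fragment, indent=2))
```

Note that `maxSurge: 1` means the cluster needs room for one additional GPU pod on the target tier for the duration of the rollout.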

Paralleliq's role at this boundary is pre-flight checking, not pipeline ownership.

Before dispatching a change, Paralleliq checks: Is this a single-replica deployment? Is maxUnavailable set correctly? Does a PodDisruptionBudget exist? It surfaces these as warnings in the approval flow — not blockers, but signals the operator should see before authorizing the change. The CD pipeline executes the rollout. Paralleliq makes sure the operator has the full picture first.
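
The three checks above could be sketched roughly as follows. This is an illustrative function over a Deployment manifest loaded as a dict, not Paralleliq's implementation; the function name, the `has_pdb` parameter, and the warning wording are assumptions.

```python
def preflight_warnings(deployment: dict, has_pdb: bool) -> list[str]:
    """Return human-readable warnings for a Deployment manifest (as a dict).

    Hypothetical sketch of the three checks described above.
    """
    warnings = []
    spec = deployment.get("spec", {})

    # Kubernetes defaults replicas to 1 when the field is omitted.
    if spec.get("replicas", 1) == 1:
        warnings.append("single-replica deployment")

    # Kubernetes defaults maxUnavailable to 25% when the field is omitted.
    rolling = spec.get("strategy", {}).get("rollingUpdate", {})
    if rolling.get("maxUnavailable", "25%") not in (0, "0", "0%"):
        warnings.append("maxUnavailable is not 0")

    if not has_pdb:
        warnings.append("no PodDisruptionBudget covers this workload")

    return warnings


example = {
    "spec": {
        "replicas": 1,
        "strategy": {"rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1}},
    }
}
for warning in preflight_warnings(example, has_pdb=False):
    print("WARNING:", warning)
```

The function returns warnings rather than raising, which mirrors the point above: these are signals for the approver, not blockers.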

---

Canary rollout: applying recommendations safely

A rolling update applies a change to 100% of pods sequentially. For most configuration changes that is fine. For GPU tier migrations on production inference workloads, it is a bet — you are committing the entire deployment to the new configuration before you know whether it holds under real traffic.

Canary rollout changes that calculus. Instead of applying the recommendation fleet-wide immediately, you route a fraction of traffic — typically 10% — to the new configuration and hold for an observation window before promoting.

In practice this means:

1. The recommendation is approved: intent authorized, audit entry written.
2. The change is dispatched to a canary that receives 10% of traffic via your deployment integration (Argo Rollouts, Flagger, or a weighted Kubernetes Service).
3. Paralleliq watches the metrics that triggered the original finding: latency p95, error rate, OOM kills.
4. If metrics hold for the observation window, the change is promoted to 100% and a verified entry is written.
5. If any metric breaches its threshold, rollback triggers automatically, a canary_alert entry is written, and the fleet is restored to its pre-change state.

The observation window and rollback thresholds are defined per rule:

canaryPolicy:
  trafficSplitPct: 10
  observationWindowMinutes: 30
  rollbackTriggers:
    - metric: latency.p95_ms
      threshold: 2000
      comparison: gt
    - metric: errors.rate
      threshold: 0.05
      comparison: gt
    - metric: stability.oomKills
      threshold: 1
      comparison: ge
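
To make the trigger semantics concrete, here is a minimal sketch of how the policy above could be evaluated against observed canary metrics. The policy values are copied from the example; the evaluation function, the `lt`/`le` codes, and the sample metric readings are assumptions, not Paralleliq's actual logic.

```python
import operator

# "gt" and "ge" appear in the policy example; "lt" and "le" are assumed
# here for completeness.
COMPARISONS = {
    "gt": operator.gt,
    "ge": operator.ge,
    "lt": operator.lt,
    "le": operator.le,
}

CANARY_POLICY = {
    "trafficSplitPct": 10,
    "observationWindowMinutes": 30,
    "rollbackTriggers": [
        {"metric": "latency.p95_ms", "threshold": 2000, "comparison": "gt"},
        {"metric": "errors.rate", "threshold": 0.05, "comparison": "gt"},
        {"metric": "stability.oomKills", "threshold": 1, "comparison": "ge"},
    ],
}


def breached_triggers(policy: dict, observed: dict) -> list[str]:
    """Return the names of metrics whose observed value breaches a trigger.

    Metrics missing from `observed` are skipped rather than treated as breaches.
    """
    breached = []
    for trigger in policy["rollbackTriggers"]:
        value = observed.get(trigger["metric"])
        if value is None:
            continue
        compare = COMPARISONS[trigger["comparison"]]
        if compare(value, trigger["threshold"]):
            breached.append(trigger["metric"])
    return breached


healthy = {"latency.p95_ms": 850, "errors.rate": 0.01, "stability.oomKills": 0}
degraded = {"latency.p95_ms": 2400, "errors.rate": 0.01, "stability.oomKills": 1}

print(breached_triggers(CANARY_POLICY, healthy))   # []
print(breached_triggers(CANARY_POLICY, degraded))  # ['latency.p95_ms', 'stability.oomKills']
```

An empty list for the whole observation window would correspond to promotion; any non-empty result would correspond to a rollback.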

Paralleliq does not replace your CD pipeline. If you are using Argo Rollouts or Flagger, Paralleliq dispatches the change and those tools manage the traffic split, while Paralleliq monitors the outcome and writes the canary events to the audit trail. If you are not, Paralleliq can manage the traffic split directly. Either way, the full sequence (canary started, metrics observed, promoted or rolled back) is captured with timestamps and actor identity.

---

The loop only closes with verification

This is the most commonly missing piece in GPU optimization workflows.

A recommendation is a hypothesis: "if you move this model to an L4, OOM kills will stop." Approving and applying the recommendation tests that hypothesis. But most tools treat the actuation event as the end of the story. The recommendation moves to "approved," the dashboard clears it, and the team moves on.

The problem: recommendations can be correct in analysis and wrong in outcome. A model moved to a larger tier might still OOM if the workload grew while the change was being rolled out. A replica scale-up might not resolve KV cache pressure if the root cause is prompt length, not pod count. Without a verification signal, you have a decision audit trail but not a feedback loop.

Post-actuation verification means watching the specific metrics that triggered the recommendation and confirming they moved in the right direction:

- OOM kills: 0 in the 24 hours following a tier upgrade
- GPU utilization: normalized from 11% to 80%+ after a tier downgrade
- KV cache utilization: below 85% following a replica scale-out
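
A verification check along these lines could be sketched as below. The targets come from the examples above; the `Expectation` structure, the metric names, and the rule that missing data counts as a failure are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Expectation:
    """One post-change condition on a metric (hypothetical structure)."""
    metric: str
    comparison: str  # "le" = at or below target, "ge" = at or above target
    target: float


def verify(expectations: list[Expectation], observed: dict[str, float]) -> bool:
    """Return True only if every expected metric was observed and met its target."""
    for exp in expectations:
        value = observed.get(exp.metric)
        if value is None:
            return False  # no data is not evidence of success
        if exp.comparison == "le":
            ok = value <= exp.target
        elif exp.comparison == "ge":
            ok = value >= exp.target
        else:
            raise ValueError(f"unknown comparison: {exp.comparison}")
        if not ok:
            return False
    return True


# Targets taken from the examples above; metric names are illustrative.
tier_downgrade = [Expectation("gpu.utilization_pct", "ge", 80)]
tier_upgrade = [Expectation("stability.oomKills_24h", "le", 0)]

print(verify(tier_downgrade, {"gpu.utilization_pct": 81}))  # True
print(verify(tier_downgrade, {"gpu.utilization_pct": 11}))  # False
print(verify(tier_upgrade, {}))                             # False
```

Treating missing metrics as a failed verification is a deliberate choice in this sketch: a recommendation should not be marked verified just because nothing was measured.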

That verified signal is what turns an audit trail into a learning system. Each verified recommendation becomes a data point: this model, this cluster configuration, this traffic profile, this change — outcome confirmed.

---

What the full audit chain should look like

A complete actuation record for a GitOps environment has five entries, not two:

| Time | Event | Actor | Detail |
|---|---|---|---|
| 10:04 | Approved | Sam Hassan | intent authorized |
| 10:05 | PR opened | system | PR #47 → github.com/acme/k8s-configs |
| 10:09 | PR merged | Karthik Rajan | diff reviewed |
| 10:10 | Actuated | system | ArgoCD applied |
| 10:14 | Verified | system | GPU utilization 81% (was 11%) · throughput nominal |

This chain answers every compliance question before it is asked: who decided, what changed, who reviewed the diff, when it hit the cluster, and whether it worked. It is also the chain that makes rollback unambiguous — if verification fails, you know exactly what to revert and when the pre-change state was last healthy.
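
As a small illustration, the chain above can be represented as data and checked for completeness. The dict layout, the snake_case event names, and both helper functions are assumptions for this sketch rather than Paralleliq's actual audit schema.

```python
from datetime import datetime

# The entries from the table above; the dict layout is illustrative.
AUDIT_CHAIN = [
    {"time": "10:04", "event": "approved", "actor": "Sam Hassan"},
    {"time": "10:05", "event": "pr_opened", "actor": "system"},
    {"time": "10:09", "event": "pr_merged", "actor": "Karthik Rajan"},
    {"time": "10:10", "event": "actuated", "actor": "system"},
    {"time": "10:14", "event": "verified", "actor": "system"},
]

REQUIRED_GITOPS_EVENTS = ["approved", "pr_opened", "pr_merged", "actuated", "verified"]


def missing_events(chain: list[dict]) -> list[str]:
    """Return the required events that do not appear in the chain."""
    seen = {entry["event"] for entry in chain}
    return [event for event in REQUIRED_GITOPS_EVENTS if event not in seen]


def minutes_between(chain: list[dict], first: str, second: str) -> int:
    """Return whole minutes elapsed between two events on the same day."""
    times = {e["event"]: datetime.strptime(e["time"], "%H:%M") for e in chain}
    return int((times[second] - times[first]).total_seconds() // 60)


print(missing_events(AUDIT_CHAIN))                           # []
print(minutes_between(AUDIT_CHAIN, "approved", "actuated"))  # 6
print(missing_events(AUDIT_CHAIN[:4]))                       # ['verified']
```

The gap between `approved` and `actuated` is the interval during which intent has been recorded but the cluster has not yet changed.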

---

Paralleliq is the model-aware GPU fleet optimization layer for AI infrastructure. It surfaces recommendations, routes them through your deployment integration, and verifies outcomes — without owning your cluster or replacing your control plane.

[Request a demo](https://paralleliq.ai)
