Driver upgrade and self-healing
This page applies only when Driver is enabled on the NPUOperatorCtl instance. If you set spec.driver.enabled=false (pre-installed driver mode), the operator does not touch the host driver and neither flow below runs — manage the driver version and any chip-recovery reboots through whatever process installed the .run package.
Starting with v1.2.4 the NPU Operator manages the driver lifecycle on each NPU node end-to-end. This page covers the two state-driven flows it exposes:
- Driver upgrade — rolling out a new driver version cluster-wide.
- Chip self-healing — automatically recovering a node whose chip has wedged at runtime.
Both flows go through the same node-reboot path. Ascend chips cannot tolerate rmmod of a running driver — attempting an in-place driver swap leaves the chip in an unrecoverable dev_num=0 state. The operator therefore always loads new modules from a fresh boot rather than reloading them inside a running kernel.
TOC
- Driver upgrade
  - Step 1: Prepare before triggering
    - 1.1 Pre-upgrade checklist
    - 1.2 Stop NPU workloads
  - Step 2: Trigger the upgrade
  - Step 3: Approve each node's reboot
  - Step 4: Verify the new driver is in place
- Chip self-healing
  - Auto vs. manual recovery
  - Manual recovery walk-through
  - False-positive suppression
Driver upgrade
Stop all NPU workloads before starting a driver upgrade. The upgrade reboots every NPU node in sequence and is not designed to be transparent to running training jobs or inference services. Scale every NPU-requesting Deployment / StatefulSet / training job to zero, drain in-flight requests, and persist any model checkpoints before patching spec.driver.version. Restore the workloads after the rollout finishes — see Step 4 for how to detect that.
The upgrade walks through four ordered steps: prepare the cluster, trigger the version bump, approve each node's reboot, and verify the new driver. Do not trigger the upgrade until Step 1 is complete; skipping the preparation is the single most common cause of a stuck rollout.
Step 1: Prepare before triggering
Two things must be in place before you touch spec.driver.version. Both are mandatory; skipping either is how upgrades get stuck.
1.1 Pre-upgrade checklist
- New driver image present in the cluster registry for every NPU node profile. Reuse the skopeo copy step from Step 1.1 of installation for each (chip, kernel, OS) combination. If no tag in the source Docker Hub repo matches your node's kernel, contact Customer Support; see the installation note on missing tags.
- ImageWhiteList updated to include the new tag(s).
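For the image-copy item in the checklist above, a minimal sketch of a helper that could be run once per (chip, kernel, OS) combination. It assumes skopeo can reach both registries with credentials already configured. The repository names and tag are caller-supplied placeholders, not values from the installation guide; take the real ones from Step 1.1 of installation.

```shell
# copy_driver_image <source-repo> <dest-repo> <tag>
# Copies one driver image tag from the source registry to the cluster registry.
# All three arguments are placeholders supplied by the caller (assumption:
# nothing in this sketch names a real repository).
copy_driver_image() {
  src_repo=$1
  dst_repo=$2
  tag=$3
  skopeo copy "docker://${src_repo}:${tag}" "docker://${dst_repo}:${tag}"
}

# Usage, once per (chip, kernel, OS) combination:
#   copy_driver_image docker.io/<source-repo> <cluster-registry>/<dest-repo> <driver-tag>
```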
1.2 Stop NPU workloads
Scale every NPU-requesting Deployment / StatefulSet / training job to zero before continuing. Restore them once the rollout finishes — see Step 4.
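One way to do this for Deployments and StatefulSets is sketched below. The helper records the current replica count before scaling to zero so the workload can be restored after Step 4. The namespace and workload names in the usage line are hypothetical. Training jobs managed by other controllers are not covered by kubectl scale and need their own stop procedure.

```shell
# scale_to_zero <namespace> <kind> <name>
# Prints "<namespace> <kind>/<name> <replicas>" on stdout (for later restore),
# then scales the workload to zero. kubectl's own status line goes to stderr
# so that stdout stays machine-readable.
scale_to_zero() {
  ns=$1
  kind=$2
  name=$3
  replicas=$(kubectl get "${kind}/${name}" -n "$ns" -o jsonpath='{.spec.replicas}')
  echo "${ns} ${kind}/${name} ${replicas}"
  kubectl scale "${kind}/${name}" -n "$ns" --replicas=0 >&2
}

# Usage (hypothetical names):
#   scale_to_zero ml-serving deployment llm-inference >> replicas-before-upgrade.txt
```

Keeping the recorded counts in a file makes the restore a matter of replaying each line with kubectl scale and the saved replica value.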
Step 2: Trigger the upgrade
Edit your NPUOperatorCtl instance (created at the end of Step 5.3 of installation) and update two fields in a single save:
- spec.driver.version: the new driver version (e.g. 25.5.0).
- spec.driver.upgradePolicy.autoUpgrade: keep the default false to approve each node reboot manually; set true to let the operator roll the whole cluster automatically. Automatic mode is recommended only for routine version bumps on bare-metal clusters.
Via the platform UI (recommended):
- Administrator > Marketplace > OperatorHub > Installed Operators > Alauda Build of NPU Operator.
- Open the NPUOperatorCtl tab and click the instance (default name npuoperatorctl-sample).
- Edit YAML, change both fields, then Save.
Via kubectl:
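A minimal sketch of a merge patch that sets both fields in one request. It assumes the custom resource can be addressed as npuoperatorctl (the lowercase form of the kind) and uses the default namespace and instance name mentioned on this page; the helper name trigger_upgrade is illustrative only.

```shell
# trigger_upgrade <version> [autoUpgrade] [namespace] [instance]
# Patches spec.driver.version and spec.driver.upgradePolicy.autoUpgrade together.
# Assumption: "npuoperatorctl" resolves as the resource name of the CRD.
trigger_upgrade() {
  version=$1
  auto=${2:-false}
  ns=${3:-npu-operator}
  name=${4:-npuoperatorctl-sample}
  patch=$(printf '{"spec":{"driver":{"version":"%s","upgradePolicy":{"autoUpgrade":%s}}}}' \
    "$version" "$auto")
  kubectl patch npuoperatorctl "$name" -n "$ns" --type merge -p "$patch"
}

# Usage: manual approval mode, default namespace and instance name
#   trigger_upgrade 25.5.0
```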
(Replace npu-operator with your install namespace and npuoperatorctl-sample with your instance name if either differs.)
Step 3: Approve each node's reboot
Skip this step if you set autoUpgrade: true in Step 2 — the operator reboots each node automatically.
In manual mode, the operator waits for an admin to annotate each NPU node with npu.openfuyao.com/approve-reboot=true before it will reboot that node. The annotation can be set at any time — including before Step 2 — and is consumed by the operator once per reboot. The simplest pattern is to approve every NPU node up front in a single command:
Or approve one node at a time, in any order:
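Both approval patterns can be sketched with kubectl annotate. The annotation key comes from this page; the label selector that identifies NPU nodes is not stated here, so it is left as a caller-supplied argument (an assumption to replace with whatever label your cluster uses).

```shell
APPROVE_ANNOTATION='npu.openfuyao.com/approve-reboot=true'

# approve_node <node-name>
# Approves the reboot of a single node.
approve_node() {
  kubectl annotate node "$1" "$APPROVE_ANNOTATION" --overwrite
}

# approve_selected <label-selector>
# Approves every node matching the selector in one command.
# The selector is hypothetical: pass the label that marks NPU nodes in your cluster.
approve_selected() {
  kubectl annotate nodes -l "$1" "$APPROVE_ANNOTATION" --overwrite
}

# Usage:
#   approve_node npu-node-1
#   approve_selected '<your-npu-node-label>=<value>'
```

The --overwrite flag keeps the command from failing if an earlier approval is still present on the node.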
Step 4: Verify the new driver is in place
The rollout is finished when every node's DRIVER-UPGRADE-STATE value is empty. Check the node labels first:
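A minimal sketch of such a check, assuming the upgrade state is exposed as a node label. This page does not name the label key, so it is passed in as an argument. The helper lists only the nodes whose state is still set; empty output means the rollout has finished.

```shell
# pending_nodes <state-label-key>
# Prints "<node> <state>" for every node whose state label is non-empty.
# "kubectl get nodes" prints five default columns (NAME STATUS ROLES AGE VERSION)
# and -L appends the label value as a sixth; a blank value leaves only five fields.
pending_nodes() {
  kubectl get nodes --no-headers -L "$1" | awk 'NF >= 6 { print $1, $NF }'
}

# Usage (label key is a placeholder):
#   pending_nodes '<driver-upgrade-state-label-key>'
```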
Then confirm each driver pod is running the new image:
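One way to see which image each pod runs is a custom-columns listing, sketched below. It assumes the driver pods live in the operator's install namespace and lists every pod there, since this page does not state a label that selects driver pods specifically.

```shell
# pod_images [namespace]
# Lists pod name, node, and container image(s) for every pod in the namespace.
pod_images() {
  ns=${1:-npu-operator}
  kubectl get pods -n "$ns" \
    -o 'custom-columns=POD:.metadata.name,NODE:.spec.nodeName,IMAGE:.spec.containers[*].image'
}

# Usage:
#   pod_images              # default install namespace
#   pod_images my-namespace
```

Each driver pod's IMAGE column should carry the new driver tag once its node has rebooted.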
Finally, run npu-smi info on each NPU node host (or enter the node through your platform's node-debug / SSH workflow) and confirm the host reports the new version.
You can now restore the NPU workloads you stopped in Step 1.2.
Chip self-healing
Experimental feature. Chip self-healing is shipped as a tech preview in v1.2.4 and disabled by default. Validate it in a staging environment before turning it on in production, and keep autoRecover set to false until you have signed off on the behaviour for your hardware profile.
Some Ascend deployment scenarios (most notably a 910B in a VM with PCI passthrough) can occasionally leave the chip in a state where the kernel modules are loaded but the driver itself reports dev_num=0 and refuses DCMI calls. From userspace the host looks healthy (every /dev/davinciN device exists and the modules are loaded), but every npu-smi info call returns an error. The only recovery is a node reboot.
v1.2.4 detects this case automatically:
- The driver pod runs a health-watch loop that probes each NPU with npu-smi info every 60 seconds.
- After three consecutive failures (≈3 minutes), the loop declares a runtime wedge: it drops the driver-ready marker (so the validator pod sees NotReady and the device plugin reports the chip as Unhealthy immediately) and writes a recovery marker for the rebooter.
- A rebooter pod observes the marker on its next 30-second poll cycle.
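The detection rule above can be sketched as a small state function. This is an illustration of the counting logic only, not the operator's actual implementation; the function name and status words are invented for the example.

```shell
FAILURE_THRESHOLD=3   # consecutive failed probes before declaring a wedge

# next_state <consecutive-failures-so-far> <probe-exit-code>
# Prints "<new-failure-count> <status>" where status is healthy, failing, or wedged.
next_state() {
  fails=$1
  rc=$2
  if [ "$rc" -eq 0 ]; then
    # Any successful probe resets the counter.
    echo "0 healthy"
  elif [ $((fails + 1)) -ge "$FAILURE_THRESHOLD" ]; then
    echo "$((fails + 1)) wedged"
  else
    echo "$((fails + 1)) failing"
  fi
}

# Shape of a loop that could drive it (shown as a comment, not executed here):
#   fails=0
#   while :; do
#     npu-smi info >/dev/null 2>&1; rc=$?
#     set -- $(next_state "$fails" "$rc"); fails=$1; status=$2
#     [ "$status" = wedged ] && echo "drop driver-ready marker, write recovery marker"
#     sleep 60
#   done
```

Because a single successful probe resets the counter, a transient npu-smi failure does not accumulate toward the threshold.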
Auto vs. manual recovery
The behaviour at this point is controlled by spec.driver.recoveryPolicy.autoRecover, which is independent of autoUpgrade.
When auto-recovery is on, the chip self-heals without admin intervention: typically about 3 minutes of detection plus a few minutes of reboot. When it is off, the rebooter waits, and the same approve-reboot annotation used for upgrades authorizes the recovery reboot.
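The rebooter's choice can be summarized as a three-input decision. The sketch below restates the rule described on this page; the function name and its output words are hypothetical, and the real rebooter may consider further conditions.

```shell
# reboot_decision <marker-present> <autoRecover> <approved>
# Each argument is "true" or "false".
# Prints: idle   (no recovery marker, nothing to do)
#         reboot (marker present and either autoRecover or an approval annotation)
#         wait   (marker present, manual mode, not yet approved)
reboot_decision() {
  marker=$1
  auto=$2
  approved=$3
  if [ "$marker" != true ]; then
    echo idle
  elif [ "$auto" = true ] || [ "$approved" = true ]; then
    echo reboot
  else
    echo wait
  fi
}

# Usage:
#   reboot_decision true false false   # manual mode, waiting for an admin
```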
A typical recovery configuration:
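A minimal sketch of setting the recovery policy from the command line. It touches only spec.driver.recoveryPolicy.autoRecover, the one field named on this page; any other recovery-related fields are not shown because they are not described here. As with the upgrade patch, the resource name npuoperatorctl is an assumption.

```shell
# set_auto_recover <true|false> [namespace] [instance]
# Patches spec.driver.recoveryPolicy.autoRecover on the NPUOperatorCtl instance.
set_auto_recover() {
  value=$1
  ns=${2:-npu-operator}
  name=${3:-npuoperatorctl-sample}
  case "$value" in
    true|false) ;;
    *) echo "autoRecover must be true or false" >&2; return 1 ;;
  esac
  kubectl patch npuoperatorctl "$name" -n "$ns" --type merge \
    -p "{\"spec\":{\"driver\":{\"recoveryPolicy\":{\"autoRecover\":${value}}}}}"
}

# Usage: keep manual approval while the feature is being validated
#   set_auto_recover false
```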
Manual recovery walk-through
If autoRecover is false and the health-watch loop has detected a wedge:
- Find the affected nodes and read the marker reason. For self-healing, the Event message contains reason=recovery. (Driver upgrades use reason=upgrade.)
- Approve the reboot when convenient by setting the npu.openfuyao.com/approve-reboot=true annotation on the node.
- The rebooter reboots the node; on fresh boot the driver pod loads the modules cleanly and the validator pod re-confirms readiness.
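The first step of the walk-through could look like the sketch below, which filters cluster Events by the reason=recovery text in their message. Which object the Event is attached to is not stated on this page, so the OBJECT column is printed as-is and may be a node or a pod name.

```shell
# recovery_events
# Lists Events whose message contains "reason=recovery".
recovery_events() {
  kubectl get events --all-namespaces \
    -o 'custom-columns=OBJECT:.involvedObject.name,MESSAGE:.message' \
    | grep 'reason=recovery'
}

# Usage:
#   recovery_events
#   # then approve the affected node:
#   kubectl annotate node <node-name> npu.openfuyao.com/approve-reboot=true --overwrite
```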
False-positive suppression
If the chip recovers on its own before the rebooter acts, the health-watch loop clears its own marker so the node is not rebooted unnecessarily.
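As an illustration of that rule, the sketch below clears a marker when a later probe succeeds. Representing the marker as a file is an assumption made for the example; this page does not say how the operator stores it.

```shell
# clear_if_recovered <marker-file> <probe-exit-code>
# Removes the recovery marker when the probe succeeded and a marker exists.
# Prints "cleared" only when a marker was actually removed.
clear_if_recovered() {
  marker=$1
  rc=$2
  if [ "$rc" -eq 0 ] && [ -e "$marker" ]; then
    rm -f "$marker"
    echo cleared
  fi
}

# Usage (hypothetical path):
#   npu-smi info >/dev/null 2>&1; clear_if_recovered /path/to/recovery-marker $?
```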