Driver upgrade and self-healing

INFO

This page applies only when Driver is enabled on the NPUOperatorCtl instance. If you set spec.driver.enabled=false (pre-installed driver mode), the operator does not touch the host driver and neither flow below runs — manage the driver version and any chip-recovery reboots through whatever process installed the .run package.

Starting with v1.2.4 the NPU Operator manages the driver lifecycle on each NPU node end-to-end. This page covers the two state-driven flows it exposes:

  • Driver upgrade — rolling out a new driver version cluster-wide.
  • Chip self-healing — automatically recovering a node whose chip has wedged at runtime.

Both flows go through the same node-reboot path. Ascend chips cannot tolerate rmmod of a running driver — attempting an in-place driver swap leaves the chip in an unrecoverable dev_num=0 state. The operator therefore always loads new modules from a fresh boot rather than reloading them inside a running kernel.

Driver upgrade

WARNING

Stop all NPU workloads before starting a driver upgrade. The upgrade reboots every NPU node in sequence and is not designed to be transparent to running training jobs or inference services. Scale every NPU-requesting Deployment / StatefulSet / training job to zero, drain in-flight requests, and persist any model checkpoints before patching spec.driver.version. Restore the workloads after the rollout finishes — see Step 4 for how to detect that.

The upgrade has four ordered steps: prepare the cluster, trigger the version bump, approve each node's reboot, and verify the new driver. Do not trigger the upgrade until Step 1 is complete; triggering early is the single most common cause of a stuck rollout.

Step 1: Prepare before triggering

Two things must be in place before you touch spec.driver.version: the pre-upgrade checklist (1.1) must be complete, and all NPU workloads must be stopped (1.2). Both are mandatory; skipping either is how upgrades get stuck.

1.1 Pre-upgrade checklist

  1. New driver image present in the cluster registry for every NPU node profile. Reuse the skopeo copy step from Step 1.1 of installation for each (chip, kernel, OS) combination:

    # On a machine with internet access:
    NEW_VERSION=25.5.0
    
    for COMBO in 910b-6.6.0-145.0.4.135-oe2403sp3 310p-6.6.0-145.0.4.135-oe2403sp3; do
      skopeo copy --all \
        docker://docker.io/alaudadockerhub/ascend-driver:${NEW_VERSION}-$COMBO \
        docker://<your-cluster-registry>/mlops/ascend-driver:${NEW_VERSION}-$COMBO
    done

    If no tag in the source Docker Hub repo matches your node's kernel, contact Customer Support — see the installation note on missing tags.

  2. ImageWhiteList updated to include the new tag(s):

    kubectl edit imagewhitelist ascend-driver -n cpaas-system
    # add each new tag under spec.repoList
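
Before moving on, it can help to confirm that every tag actually landed in the cluster registry. The following is a minimal sketch, assuming the same registry path and tag scheme as the copy loop above. missing_driver_images is a hypothetical helper name, and skopeo may need credentials or TLS flags for your registry.

```shell
# Print every driver image reference that skopeo cannot find in the registry.
# Usage: missing_driver_images <registry> <version> <combo>...
missing_driver_images() (
  registry=$1; version=$2; shift 2
  for combo in "$@"; do
    ref="docker://${registry}/mlops/ascend-driver:${version}-${combo}"
    # --raw fetches the manifest as stored, without matching the local platform;
    # skopeo exits non-zero when the manifest cannot be fetched.
    if ! skopeo inspect --raw "$ref" >/dev/null 2>&1; then
      echo "$ref"
    fi
  done
)
```

An empty result means every requested tag could be read. Any printed reference still needs to be copied.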

1.2 Stop NPU workloads

Scale every NPU-requesting Deployment / StatefulSet / training job to zero before continuing. Restore them once the rollout finishes — see Step 4.

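# Examples: scale every NPU-using workload to zero.
# kubectl scale only works on resources that expose the scale subresource;
# for other kinds, set the replica count in the resource spec instead.
kubectl scale -n serving inferenceservice/my-model --replicas=0
kubectl scale -n training statefulset/llama-trainer --replicas=0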
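
To check that nothing NPU-bound is still running, one option is to list the pods that set limits on an NPU resource. This is a sketch only: the default resource prefix huawei.com/Ascend is an assumption that does not come from this page, and list_npu_pods is a hypothetical helper name. Check kubectl describe node for the resource name your device plugin actually advertises.

```shell
# List "<namespace>/<pod>" for pods with a container limit on a matching resource.
# Usage: list_npu_pods [resource-prefix]
list_npu_pods() (
  # ASSUMPTION: the default prefix below is a guess; pass your own if it differs.
  prefix=${1:-huawei.com/Ascend}
  kubectl get pods -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.containers[*].resources.limits}{"\n"}{end}' \
    | grep -F "$prefix" | cut -f1
)
```

An empty result suggests that no pod is still holding an NPU, provided the prefix matches your cluster.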

Step 2: Trigger the upgrade

Edit your NPUOperatorCtl instance (created at the end of Step 5.3 of installation) and update two fields in a single save:

  • spec.driver.version — the new driver version (e.g. 25.5.0).
  • spec.driver.upgradePolicy.autoUpgrade — keep the default false to approve each node reboot manually; set true to let the operator roll the whole cluster automatically. Automatic mode is recommended only for routine version bumps on bare-metal clusters.

Via the platform UI (recommended):

  1. Administrator > Marketplace > OperatorHub > Installed Operators > Alauda Build of NPU Operator.
  2. Open the NPUOperatorCtl tab and click the instance (default name npuoperatorctl-sample).
  3. Edit YAML, change both fields, then Save.

Via kubectl:

NEW_VERSION=25.5.0

kubectl -n npu-operator patch npuoperatorctl npuoperatorctl-sample --type=merge \
  -p "{\"spec\":{\"driver\":{\"version\":\"${NEW_VERSION}\",\"upgradePolicy\":{\"autoUpgrade\":false}}}}"

(Replace npu-operator with your install namespace and npuoperatorctl-sample with your instance name if either differs.)
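
After saving, you may want to read the field back to confirm the change landed. A minimal sketch, assuming the default namespace and instance name used above; driver_version_is is a hypothetical helper name.

```shell
# Succeed when spec.driver.version on the instance equals the expected version.
# Usage: driver_version_is <expected> [namespace] [instance]
driver_version_is() (
  expected=$1
  ns=${2:-npu-operator}
  name=${3:-npuoperatorctl-sample}
  actual=$(kubectl -n "$ns" get npuoperatorctl "$name" \
    -o jsonpath='{.spec.driver.version}')
  echo "spec.driver.version=${actual}"
  [ "$actual" = "$expected" ]
)
```

For example, driver_version_is "$NEW_VERSION" prints the current value and returns non-zero if it differs.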

Step 3: Approve each node's reboot

Skip this step if you set autoUpgrade: true in Step 2 — the operator reboots each node automatically.

In manual mode, the operator waits for an admin to annotate each NPU node with npu.openfuyao.com/approve-reboot=true before it will reboot that node. The annotation can be set at any time — including before Step 2 — and is consumed by the operator once per reboot. The simplest pattern is to approve every NPU node up front in a single command:

kubectl get nodes -l openfuyao.com/npu.present -o name \
  | xargs -I{} kubectl annotate {} npu.openfuyao.com/approve-reboot=true --overwrite

Or approve one node at a time, in any order:

kubectl annotate node ${nodeName} npu.openfuyao.com/approve-reboot=true
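
Because the bulk command approves a reboot on every NPU node, a dry run can be a useful preview. The sketch below wraps the same annotate call with the standard kubectl --dry-run flag; approve_npu_nodes is a hypothetical helper name.

```shell
# Preview (default) or apply the approve-reboot annotation on every NPU node.
# Usage: approve_npu_nodes [client|none]
#   client = dry run, nothing is changed; none = really annotate
approve_npu_nodes() (
  mode=${1:-client}
  kubectl get nodes -l openfuyao.com/npu.present -o name \
    | while read -r node; do
        kubectl annotate "$node" npu.openfuyao.com/approve-reboot=true \
          --overwrite --dry-run="$mode" </dev/null
      done
)
```

Run it without arguments first, review the listed nodes, then run approve_npu_nodes none to apply.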

Step 4: Verify the new driver is in place

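The rollout is finished when every node's DRIVER-UPGRADE-STATE value is empty (kubectl prints <none> for a node that does not carry the label). Check the node labels first. Keep the -o argument in single quotes; otherwise the shell strips the backslashes and the label path no longer resolves:

kubectl get nodes -l openfuyao.com/npu.present \
  -o 'custom-columns=NAME:.metadata.name,DRIVER-UPGRADE-STATE:.metadata.labels.npu\.openfuyao\.com/driver-upgrade-state'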
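
The same check can be polled until the rollout completes. This is a sketch under stated assumptions: pending_upgrade_nodes and wait_for_rollout are hypothetical helper names, the 30-second default interval is arbitrary, and a node whose label has not been set yet is counted as finished, so start polling only after the operator has picked up the version change.

```shell
# Print the NPU nodes whose driver-upgrade-state label is still non-empty.
pending_upgrade_nodes() (
  states=$(kubectl get nodes -l openfuyao.com/npu.present \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.npu\.openfuyao\.com/driver-upgrade-state}{"\n"}{end}') \
    || exit 1
  printf '%s\n' "$states" | awk -F'\t' '$2 != "" { print $1 }'
)

# Poll until no node is pending. Usage: wait_for_rollout [interval-seconds]
wait_for_rollout() (
  interval=${1:-30}
  while :; do
    pending=$(pending_upgrade_nodes) || { echo "kubectl failed" >&2; exit 1; }
    [ -z "$pending" ] && break
    echo "still upgrading:" $pending
    sleep "$interval"
  done
  echo "rollout finished"
)
```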

Then confirm each driver pod is running the new image:

kubectl get pod -n npu-operator -l app=npu-driver-daemonset \
  -o jsonpath='{range .items[*]}{.spec.nodeName}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
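
To avoid comparing tags by eye, the output can be filtered for pods that are not on the new version. A minimal sketch, assuming the <version>-<combo> tag scheme used in Step 1.1; stale_driver_pods is a hypothetical helper name.

```shell
# Print "<node> <image>" for driver pods whose image tag does not start
# with the expected version.
# Usage: stale_driver_pods <version> [namespace]
stale_driver_pods() (
  version=$1
  ns=${2:-npu-operator}
  kubectl get pod -n "$ns" -l app=npu-driver-daemonset \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\t"}{.spec.containers[0].image}{"\n"}{end}' \
    | awk -F'\t' -v v="$version" \
        'NF == 2 && index($2, ":" v "-") == 0 { print $1, $2 }'
)
```

An empty result means every listed driver pod already runs an image tagged with the new version.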

Finally, log in to each NPU node (over SSH or through your platform's node-debug workflow), run npu-smi on the host, and confirm that it reports the new version:

LD_LIBRARY_PATH=/var/lib/Ascend/driver/lib64/driver:/var/lib/Ascend/driver/lib64/common \
  /var/lib/Ascend/driver/tools/npu-smi info | head -2

You can now restore the NPU workloads you stopped in Step 1.2.

Chip self-healing

CAUTION

Experimental feature. Chip self-healing is shipped as a tech preview in v1.2.4 and disabled by default. Validate it in a staging environment before turning it on in production, and keep autoRecover set to false until you have signed off on the behaviour for your hardware profile.

Some Ascend deployment scenarios, most notably 910B in a VM with PCI passthrough, can occasionally leave the chip in a state where the kernel modules are loaded but the driver itself reports dev_num=0 and refuses DCMI calls. From userspace the host looks healthy (every /dev/davinciN exists and the modules are loaded), but every npu-smi info call returns an error. The only recovery is a node reboot.

v1.2.4 detects this case automatically:

  • The driver pod runs a health-watch loop that probes each NPU with npu-smi info every 60 seconds.
  • After three consecutive failures (≈3 minutes), the loop declares a runtime wedge: it drops the driver-ready marker (so the validator pod sees NotReady and the device plugin reports the chip as Unhealthy immediately) and writes a recovery marker for the rebooter.
  • A rebooter pod observes the marker on its next 30-second poll cycle.
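
The three-strike rule above can be pictured as a small counter. The sketch below is illustrative only and is not the operator's actual code. record_probe, FAIL_THRESHOLD, failures and state are hypothetical names, and only the threshold of three consecutive failures comes from this page.

```shell
# Illustrative three-strike counter; NOT the operator's implementation.
FAIL_THRESHOLD=3   # consecutive failed probes before a wedge is declared
failures=0         # current streak of failed probes
state=ok           # "ok" or "wedged"

# Record one probe exit code (0 = success, anything else = failure).
record_probe() {
  if [ "$1" -eq 0 ]; then
    failures=0                   # any success resets the streak
  else
    failures=$((failures + 1))
  fi
  if [ "$failures" -ge "$FAIL_THRESHOLD" ]; then
    state=wedged
  else
    state=ok
  fi
}
```

In a real loop, each iteration would run the npu-smi info probe, pass its exit code to record_probe, and sleep 60 seconds. A single successful probe returns state to ok, which matches the idea that a chip recovering on its own should not trigger a reboot.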

Auto vs. manual recovery

The behaviour at this point is controlled by spec.driver.recoveryPolicy.autoRecover (independent of autoUpgrade):

autoRecover        Behaviour after wedge detection
false (default)    The rebooter emits a RebootRequired Event (with reason=recovery) and waits for an administrator to annotate the node.
true               The rebooter reboots the node after the uptime guard and cluster-wide lock are satisfied.

When auto-recovery is on, the chip self-heals without administrator intervention: typically about 3 minutes of detection plus a few minutes for the reboot. When it is off, the same approve-reboot annotation used for upgrades authorizes the recovery reboot.

A typical recovery configuration:

spec:
  driver:
    recoveryPolicy:
      autoRecover: true   # rebooter reboots automatically on detected wedge
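
The field can also be set from the command line with the same kind of merge patch used in Step 2 of the upgrade. A minimal sketch, assuming the default namespace and instance name; set_auto_recover is a hypothetical helper name.

```shell
# Set spec.driver.recoveryPolicy.autoRecover with a merge patch.
# Usage: set_auto_recover true|false [namespace] [instance]
set_auto_recover() (
  value=$1
  ns=${2:-npu-operator}
  name=${3:-npuoperatorctl-sample}
  case "$value" in
    true|false) ;;   # only JSON booleans are accepted
    *) echo "autoRecover must be true or false" >&2; exit 2 ;;
  esac
  kubectl -n "$ns" patch npuoperatorctl "$name" --type=merge \
    -p "{\"spec\":{\"driver\":{\"recoveryPolicy\":{\"autoRecover\":${value}}}}}"
)
```

Calling set_auto_recover false restores the default manual-approval behaviour.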

Manual recovery walk-through

If autoRecover: false and the health-watch loop has detected a wedge:

  1. Find the affected nodes and read the marker reason:

    kubectl get nodes -l npu.openfuyao.com/reboot-required=true
    kubectl get events -A --field-selector reason=RebootRequired --sort-by=.lastTimestamp | tail

    For self-healing, the Event message contains reason=recovery. (Driver upgrades use reason=upgrade.)

  2. Approve the reboot when convenient:

    kubectl annotate node ${nodeName} npu.openfuyao.com/approve-reboot=true

  3. The rebooter reboots the node; on fresh boot the driver pod loads the modules cleanly and the validator pod re-confirms readiness.
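
When upgrades and recoveries overlap, it can be handy to list only the recovery-related events. The sketch below filters on the reason=recovery text in the Event message described in step 1. recovery_events is a hypothetical helper name, and printing involvedObject.name assumes the Event is recorded against the affected node, which this page does not state.

```shell
# Print "<object>\t<message>" for RebootRequired events that mention reason=recovery.
# Returns non-zero when no such event exists (grep finds nothing).
recovery_events() (
  kubectl get events -A --field-selector reason=RebootRequired \
    -o jsonpath='{range .items[*]}{.involvedObject.name}{"\t"}{.message}{"\n"}{end}' \
    | grep -F 'reason=recovery'
)
```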

False-positive suppression

If the chip recovers on its own before the rebooter acts, the health-watch loop clears its own marker so the node is not rebooted unnecessarily.