Release Notes
v1.2.4
Based on the openFuyao npu-operator 1.2.0 community release. v1.2.4 is the first release built for the KubeOS immutable operating system and reworks how the driver is delivered, how devices reach workloads, and how the operator manages the driver lifecycle.
Breaking changes
- Delivery model changed from cluster plugin to operator. Earlier versions of Alauda Build of NPU Operator shipped as a cluster plugin, installed from Marketplace > Cluster Plugins. Starting with v1.2.4 the project is packaged as an OLM operator bundle and installed from Marketplace > OperatorHub.
WARNING: In-place upgrade from v1.1.3 (or any earlier cluster-plugin release) is not supported. To move to v1.2.4, uninstall the old cluster plugin first and then install the new operator from scratch:
- Uninstall the existing Alauda Build of NPU Operator cluster plugin from Marketplace > Cluster Plugins.
- If the driver is no longer needed on the host, run on each NPU node:
- Install v1.2.4 from Marketplace > OperatorHub following the Installation page.
- Driver is now delivered as a container image, not a .run package. By default the driver, firmware, and runtime tooling ship in mlops/ascend-driver:<HDK>-<chip>-<kernel>-<os-stem> and are loaded into the host kernel from the running pod. The legacy DKMS / kernel-headers / make host prerequisites listed in v1.1.x are no longer needed (and no longer used even if present). For air-gapped clusters, the matching driver image must be pre-staged to the cluster registry before installation; see Step 1: Sync driver images. Clusters whose nodes already carry a host-installed .run driver can opt out by setting spec.driver.enabled=false (see New features below).
- OS support model changed: kernel-keyed, not distro-keyed. v1.2.4 does not restrict support to specific distros. Any arm64 Linux running containerd is supported, as long as a driver image matching the node's kernel exists at docker.io/alaudadockerhub/ascend-driver. The driver image is shipped independently of the operator bundle and is built on demand. If your kernel isn't on the tag list yet, contact Customer Support to have a new tag built and published. Smoke-tested in this release: 310P + 910B on KubeOS 6.6, 910B on openEuler 22.03 LTS SP3 bare-metal. The v1.1.x DKMS host-side flow is gone; the container-image model replaces it.
- Workloads no longer need runtimeClassName: ascend. Device injection in v1.2.4 is driven by the Container Device Interface (CDI). Requesting huawei.com/Ascend910 (or Ascend310P) is enough: containerd binds /dev/davinciN and the driver userspace libraries into the container automatically. Existing pods that still set runtimeClassName: ascend continue to work; the legacy RuntimeClass is retained for backwards compatibility.
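As an illustration of the last point, the following is a minimal sketch of a Pod manifest that requests an NPU through resource limits only, with no runtimeClassName. The pod name, container name, and image are hypothetical placeholders; only the huawei.com/Ascend910 resource name comes from these notes. The manifest is built as a Python dict and printed as JSON, which kubectl apply -f also accepts.

```python
import json


def npu_pod_manifest(name, image, resource="huawei.com/Ascend910", count=1):
    """Build a Pod manifest that asks for NPUs via resource limits only.

    The name and image arguments are illustrative placeholders,
    not values shipped with the operator.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Deliberately no "runtimeClassName": per the notes above, the
            # resource request alone is what triggers CDI device injection.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # Extended resources are requested under limits;
                    # quantities are serialised as strings.
                    "resources": {"limits": {resource: str(count)}},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = npu_pod_manifest(
        "npu-smoke-test",                            # hypothetical pod name
        "registry.example.com/demo/npu-app:latest",  # hypothetical image
    )
    print(json.dumps(manifest, indent=2))
```

For 310P nodes, the same sketch would be called with resource="huawei.com/Ascend310P".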
New features
- KubeOS / immutable-OS support. NPU nodes running KubeOS are now fully supported. The operator stages the driver tree under /var/lib/Ascend, /var/lib/ascend, and /home/bios/driver (all on writable partitions on KubeOS), inserts kernel modules directly from the driver pod, and never writes to /usr or relies on a package manager.
- Pre-installed-driver passthrough (spec.driver.enabled=false). Clusters whose NPU hosts already carry a Huawei .run driver, typically bare-metal sites running an independent driver lifecycle, can now opt the operator out of driver management entirely. With Driver disabled at install time (or set to false on the NPUOperatorCtl later), the operator skips the driver image sync (installation Step 1), does not deploy the driver / rebooter DaemonSets, and points CDI / device-plugin / runtime at the host's existing driver tree at /var/lib/Ascend/driver or /usr/local/Ascend/driver (auto-detected by the toolkit). Driver upgrades and chip-recovery reboots are then the host operator's responsibility.
- Driver upgrade lifecycle. Editing NPUOperatorCtl.spec.driver.version now triggers a per-node rolling upgrade. The operator walks each node through upgrade-required → cordon-required → wait-for-jobs-required → pod-deletion-required → drain-required → node-reboot-required → pod-restart-required → validation-required → uncordon-required → upgrade-done, with the rollout pace gated by:
  - spec.driver.upgradePolicy.autoUpgrade: true for fully automatic rolling upgrades, false (default) to wait for a per-node npu.openfuyao.com/approve-reboot=true annotation before each reboot.
  - spec.driver.upgradePolicy.maxParallelUpgrades and maxUnavailable: concurrency caps.
  - spec.driver.upgradePolicy.drainSpec, podDeletion, waitForCompletion: drain and eviction policy fields, modelled on the upstream k8s-operator-libs schema.
  The node reboot is always required because rmmod of a running Ascend driver leaves the chip in an unrecoverable state. The new rebooter component performs the reboot using sysrq b (panic-reboot) to avoid VFIO/AER deadlocks on PCI-passthrough VMs. See Driver upgrade and self-healing for the full walk-through.
- Chip self-healing. A health-watch loop inside the driver pod runs npu-smi info every 60 seconds. After three consecutive failures (≈3 minutes) the loop detects a runtime chip-wedge condition, drops the driver-ready marker (causing the device plugin to report the chip as Unhealthy), and writes a recovery marker for the rebooter. With spec.driver.recoveryPolicy.autoRecover: true the rebooter automatically reboots the node to recover the chip; with false (default) it waits for the same approve-reboot annotation used for upgrades.
- CDI device injection. Replaces the v1.1.x ascend-docker-hook mechanism. The Ascend Docker Runtime form toggle now provisions npu-container-toolkit generate-cdi --watch as a sidecar that emits the CDI spec at /var/run/cdi/ascend.com-npu.yaml. The CDI kind is huawei.com/ascend; the pod annotation key the device plugin attaches is cdi.k8s.io/ascend-device-plugin.
- Validator DaemonSet. A new npu-validator DaemonSet runs alongside the driver. The upgrade state machine waits for both the driver pod and the validator pod to report PodReady before transitioning past validation-required, so a driver pod that self-reports ready but whose ready file isn't actually visible to other host consumers no longer fools the upgrade.
- Device-plugin drain-aware. The MindCluster device plugin now watches for the npu.openfuyao.com/driver-upgrade-state and reboot-required node labels, and reports every NPU on the node as Unhealthy while either is set. The scheduler therefore stops placing new NPU workloads on a node about to be torn down.
- Per-node npu-smi access on the host. The driver pod stages npu-smi to /var/lib/Ascend/driver/tools/npu-smi. The operator does not add it to the host PATH (KubeOS keeps /usr read-only); call it with the matching LD_LIBRARY_PATH or wrap it in a script under /opt/bin/. See the Installation FAQ.
- MindCluster / Ascend component stack upgraded to v7.3.0. Picks up the v7.3.0 train of ascend-docker-runtime, ascend-operator, ascend-k8sdeviceplugin, noded, vc-controller-manager, vc-scheduler, clusterd, and npu-exporter. The driver and firmware default HDK version is bumped to 25.5.0; 25.3.RC1 is shipped as a co-supported alternative.
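The per-node upgrade walk described under Driver upgrade lifecycle can be pictured with a small sketch. This is an illustrative model, not the operator's code: the state names and the approve-reboot annotation are taken from these notes, while the next_state and walk helpers are hypothetical. It assumes the approval gate is checked when a node sits in node-reboot-required, and it ignores the other conditions the real operator enforces (job completion, drain results, maxParallelUpgrades, maxUnavailable).

```python
# Ordered per-node upgrade states, as listed in these release notes.
UPGRADE_STATES = [
    "upgrade-required",
    "cordon-required",
    "wait-for-jobs-required",
    "pod-deletion-required",
    "drain-required",
    "node-reboot-required",
    "pod-restart-required",
    "validation-required",
    "uncordon-required",
    "upgrade-done",
]

APPROVE_ANNOTATION = "npu.openfuyao.com/approve-reboot"


def next_state(state, auto_upgrade, annotations):
    """Return the state a node moves to next (hypothetical helper).

    Assumption: with autoUpgrade=false the node is held in
    node-reboot-required until the approve-reboot annotation is "true".
    """
    if state not in UPGRADE_STATES:
        raise ValueError(f"unknown upgrade state: {state}")
    if state == "upgrade-done":
        return state  # terminal state
    needs_approval = state == "node-reboot-required" and not auto_upgrade
    if needs_approval and annotations.get(APPROVE_ANNOTATION) != "true":
        return state  # hold here until an administrator approves the reboot
    return UPGRADE_STATES[UPGRADE_STATES.index(state) + 1]


def walk(auto_upgrade, annotations, max_steps=20):
    """Advance from the first state until the node stops moving."""
    state = UPGRADE_STATES[0]
    path = [state]
    for _ in range(max_steps):
        following = next_state(state, auto_upgrade, annotations)
        if following == state:
            break
        state = following
        path.append(state)
    return path


if __name__ == "__main__":
    print(walk(auto_upgrade=False, annotations={})[-1])
    print(walk(auto_upgrade=False, annotations={APPROVE_ANNOTATION: "true"})[-1])
```

In this model a node with autoUpgrade=false and no annotation stops at node-reboot-required, which mirrors the default behaviour described above; setting the annotation lets it continue to upgrade-done.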
Bug fixes
- Fixed npu-exporter ServiceMonitor not taking effect: it was created in the wrong namespace and missed Prometheus selector labels, so the platform's Prometheus never scraped the exporter. The operator now ships a ServiceMonitor in its own namespace with the correct prometheus: kube-prometheus label and honorLabels: true. No manual ServiceMonitor creation is required.
- Fixed the operator never advancing the upgrade state machine past validation-required when only the driver-pod readinessProbe was used as the signal: a separate validator pod now confirms the driver-ready marker is visible to other host consumers.
- Fixed runtime-init.sh fast-skipping in a half-installed state on nodes whose chip had wedged while the driver pod was running: the fast-skip predicate now includes a chip-health probe (davinci_dev_num > 0 via PCI sysfs) so a wedged-but-loaded chip falls through to the recovery path.
- (Community) Fixed installation / detection logic on nodes that do not carry NPU cards.
- (Community) Fixed environment-variable indicators not being deduplicated.
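The runtime-init.sh fix above comes down to one extra condition in the fast-skip decision. The sketch below models that decision in Python for illustration only; the real logic lives in a shell script. The choose_init_path function, its parameters, and the three path labels are hypothetical, and the way davinci_dev_num is read from PCI sysfs is not reproduced here.

```python
def choose_init_path(driver_loaded, davinci_dev_num):
    """Pick the initialisation path for a node (hypothetical model).

    driver_loaded: whether the driver already appears installed and loaded.
    davinci_dev_num: number of healthy devices the chip-health probe sees.
    """
    if not driver_loaded:
        return "install"    # nothing to skip, run the normal install
    if davinci_dev_num > 0:
        return "fast-skip"  # loaded and healthy: safe to skip re-initialisation
    # Loaded but no device visible: the wedged-chip case that previously
    # fast-skipped and now falls through to recovery.
    return "recovery"


if __name__ == "__main__":
    for loaded, devices in [(False, 0), (True, 8), (True, 0)]:
        print(loaded, devices, choose_init_path(loaded, devices))
```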
v1.1.3
Based on the openFuyao npu-operator 1.1.1 community release. Backed by the MindCluster / Ascend v7.2.RC1 component stack and delivered as a cluster plugin.