Introduction
Kubernetes exposes specialized hardware such as Ascend NPUs through the Device Plugin interface, but configuring an NPU node end-to-end is genuinely complex: the driver and firmware have to match the kernel, the container runtime has to know how to inject devices into workload containers, and a device plugin has to report the chips to the scheduler.
The Alauda Build of NPU Operator folds all of that into one operator that reconciles the full software stack (Ascend driver and firmware, container runtime configuration, MindCluster device plugin, and the optional NPU exporter) from a single NPUOperatorCtl custom resource. With it installed, AI training and inference workloads can request huawei.com/Ascend910 (or Ascend310P) the same way they would any other Kubernetes resource.
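As a minimal sketch of what such a request looks like, the following Python snippet builds a Pod manifest that asks for one huawei.com/Ascend910 device through the standard resources.limits field. The pod name, container name, and image are hypothetical placeholders; only the resource name comes from the text above. The output is JSON, which kubectl apply -f accepts as well as YAML.

```python
import json


def npu_pod_manifest(name, image, resource="huawei.com/Ascend910", count=1):
    """Build a Pod manifest that requests `count` NPUs as an extended resource.

    `name` and `image` are caller-supplied placeholders, not values taken
    from the operator documentation.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",  # hypothetical container name
                    "image": image,
                    # Extended resources are set under limits; if requests is
                    # omitted, Kubernetes defaults it to the same value.
                    "resources": {"limits": {resource: str(count)}},
                }
            ],
        },
    }


# Hypothetical pod name and image, for illustration only.
manifest = npu_pod_manifest("npu-demo", "registry.example.com/ascend-app:latest")
print(json.dumps(manifest, indent=2))
```

The manifest deliberately contains nothing NPU-specific beyond the resource name, which is the point of exposing the chips through the Device Plugin interface.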
What's new in v1.2.4
v1.2.4 is the first release built for the KubeOS immutable OS, and is delivered as an OLM operator bundle rather than a cluster plugin. The key user-visible changes are:
- KubeOS / immutable-OS support. NPU worker nodes are no longer restricted to traditional Linux distros. The operator handles KubeOS's read-only /usr and absent package-manager constraints transparently: there is no need to pre-install kernel headers, DKMS, or any host package.
- Pre-compiled driver image. The driver, firmware, and runtime tooling are shipped as a container image keyed on <HDK>-<chip>-<kernel>-<os> and loaded into the host kernel from the running pod. The legacy .run package + DKMS pipeline is gone. See Installation for the image prerequisite.
- Pre-installed-driver passthrough. For hosts that already have a Huawei .run driver installed out-of-band (e.g. bare-metal clusters under independent driver lifecycle management), set spec.driver.enabled=false: the operator skips driver staging and upgrades entirely and configures CDI / device-plugin / runtime against the host's existing /var/lib/Ascend/driver or /usr/local/Ascend/driver tree.
- CDI device injection. Workload pods no longer need runtimeClassName: ascend. Requesting huawei.com/Ascend910 (or Ascend310P) is enough: the operator sets up the Container Device Interface so that containerd binds /dev/davinciN and the driver userspace libraries into the container automatically.
- Driver upgrade lifecycle. Bumping NPUOperatorCtl.spec.driver.version now drives a per-node rolling upgrade: cordon → drain → node reboot → reload new driver → validate → uncordon. The reboot phase is required because Ascend chips cannot tolerate rmmod of a running driver. Auto-roll and manual-approval modes are both supported. See Driver upgrade and self-healing.
- Chip self-healing. A health-watch loop inside the driver pod probes each NPU every minute; if a chip enters a wedged state at runtime it writes a recovery marker so the node can be rebooted to recover. Opt-in via spec.driver.recoveryPolicy.autoRecover.
- MindCluster v7.3.0 stack. Picks up the v7.3.0 train of ascend-k8sdeviceplugin, ascend-operator, ascend-docker-runtime, noded, clusterd, and npu-exporter. Default driver version bumps to 25.5.0.
- Bug fix: NPU exporter ServiceMonitor. The exporter is now scraped automatically by the platform's Prometheus, with no manual ServiceMonitor needed.
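The spec fields named in the list above can be combined into a JSON merge patch for the NPUOperatorCtl resource. The snippet below is a sketch under assumptions: only the field paths spec.driver.enabled, spec.driver.version, and spec.driver.recoveryPolicy.autoRecover come from this page. The boolean type of autoRecover and the helper name driver_patch are assumed for illustration, and the CRD shipped with the operator remains the authoritative schema.

```python
import json


def driver_patch(version=None, enabled=None, auto_recover=None):
    """Return a JSON merge patch that touches only the given spec.driver fields.

    Hypothetical helper; fields left as None are omitted from the patch so
    the values already on the cluster object are preserved.
    """
    driver = {}
    if enabled is not None:
        driver["enabled"] = enabled
    if version is not None:
        driver["version"] = version
    if auto_recover is not None:
        # Assumption: autoRecover is a boolean switch.
        driver["recoveryPolicy"] = {"autoRecover": auto_recover}
    return {"spec": {"driver": driver}}


# Set the driver version (25.5.0 is the default named above) and opt in to
# chip self-healing.
upgrade_patch = driver_patch(version="25.5.0", auto_recover=True)
print(json.dumps(upgrade_patch))

# Passthrough mode for hosts whose driver is managed out-of-band.
passthrough_patch = driver_patch(enabled=False)
print(json.dumps(passthrough_patch))
```

Either printed document could be passed to kubectl patch with --type merge against the cluster's NPUOperatorCtl object; the object name is deployment-specific and is not assumed here.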
See Release notes for the full change list, and Installation to get started.
For more details on the upstream project, refer to openFuyao NPU Operator.