Introduction

Kubernetes exposes specialized hardware such as Ascend NPUs through the Device Plugin interface, but configuring an NPU node end-to-end is genuinely complex: the driver and firmware have to match the kernel, the container runtime has to know how to inject devices into workload containers, and a device plugin has to report the chips to the scheduler.

The Alauda Build of NPU Operator folds all of that into a single operator that reconciles the full software stack (Ascend driver and firmware, container runtime configuration, MindCluster device plugin, and the optional NPU exporter) from one NPUOperatorCtl custom resource. With it installed, AI training and inference workloads can request huawei.com/Ascend910 (or huawei.com/Ascend310P) the same way they would any other Kubernetes resource.
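
As a rough illustration of what such a request looks like, the following Pod asks for one Ascend 910 chip through the standard resources field. This is a minimal sketch rather than a tested manifest: the Pod name, image reference, and command are placeholders chosen for illustration, and the command assumes the image ships a shell.

```yaml
# Minimal sketch of a workload requesting one NPU.
# The name, image, and command are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: ascend-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: main
      image: registry.example.com/ascend-workload:latest  # placeholder image
      # List the NPU device nodes that were injected into the container.
      command: ["sh", "-c", "ls -l /dev/davinci*"]
      resources:
        limits:
          huawei.com/Ascend910: 1  # extended resources take whole-number quantities
```

Because the chip is requested as an extended resource, the scheduler only places the Pod on a node whose device plugin reports a free chip.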

What's new in v1.2.4

KubeOS / immutable-OS support. NPU worker nodes are no longer restricted to traditional Linux distros. The operator transparently works around KubeOS's read-only /usr and its lack of a package manager, so there is no need to pre-install kernel headers, DKMS, or any host package.
Pre-compiled driver image. The driver, firmware, and runtime tooling are shipped as a container image keyed on <HDK>-<chip>-<kernel>-<os> and loaded into the host kernel from the running pod. The legacy .run package + DKMS pipeline is gone. See Installation for the image prerequisite.
Pre-installed-driver passthrough. For hosts that already have a Huawei .run driver installed out-of-band (e.g. bare-metal clusters under independent driver lifecycle management), set spec.driver.enabled=false: the operator skips driver staging and upgrades entirely and configures CDI / device-plugin / runtime against the host's existing /var/lib/Ascend/driver or /usr/local/Ascend/driver tree.
CDI device injection. Workload pods no longer need runtimeClassName: ascend. Requesting huawei.com/Ascend910 (or Ascend310P) is enough — the operator sets up the Container Device Interface so that containerd binds /dev/davinciN and the driver userspace libraries into the container automatically.
Driver upgrade lifecycle. Bumping NPUOperatorCtl.spec.driver.version now drives a per-node rolling upgrade: cordon → drain → node reboot → reload new driver → validate → uncordon. The reboot phase is required because Ascend chips cannot tolerate rmmod of a running driver. Auto-roll and manual-approval modes are both supported. See Driver upgrade and self-healing.
Chip self-healing. A health-watch loop inside the driver pod probes each NPU every minute; if a chip becomes wedged at runtime, the loop writes a recovery marker so that the node can be rebooted to recover. Opt in via spec.driver.recoveryPolicy.autoRecover.
MindCluster v7.3.0 stack. The operator now picks up the v7.3.0 releases of ascend-k8sdeviceplugin, ascend-operator, ascend-docker-runtime, noded, clusterd, and npu-exporter. The default driver version is bumped to 25.5.0.
Bug fix: NPU exporter ServiceMonitor. The exporter is now scraped automatically by the platform's Prometheus — no manual ServiceMonitor needed.
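
The driver-related settings named in the items above could be combined in an NPUOperatorCtl resource roughly as follows. This is a hedged sketch, not the authoritative schema: only the field paths spec.driver.enabled, spec.driver.version, and spec.driver.recoveryPolicy.autoRecover come from this page; the value types are assumed, and apiVersion, metadata, and every other field are left out because this page does not state them.

```yaml
# Fragment of an NPUOperatorCtl resource (apiVersion and metadata omitted).
# Value types are assumptions; see Installation for the actual schema.
spec:
  driver:
    # Assumed value for an operator-managed driver. Setting this to false
    # selects pre-installed-driver passthrough instead.
    enabled: true
    # Default version named on this page; changing it starts the per-node
    # rolling upgrade.
    version: "25.5.0"
    recoveryPolicy:
      # Opt in to chip self-healing.
      autoRecover: true
```

The field that switches between auto-roll and manual-approval upgrades is not named on this page, so it is not shown here.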

See Release notes for the full change list, and Installation to get started.

For more details on the upstream project, refer to openFuyao NPU Operator.
