Alauda Build of NVIDIA DRA Driver for GPUs


Introduction

Dynamic Resource Allocation (DRA) is a Kubernetes feature that provides a more flexible and extensible way to request and allocate hardware resources such as GPUs. Unlike traditional device plugins, which only support simple counting of identical resources, DRA enables fine-grained device selection based on device attributes and capabilities.

Alauda Build of NVIDIA DRA Driver for GPUs is delivered as a cluster plugin that brings the upstream NVIDIA DRA driver to your ACP cluster, allowing workloads to claim GPUs through ResourceClaim and ResourceClaimTemplate objects.

Prerequisites

NVIDIA driver v565+ installed on every GPU node.
Kubernetes v1.32+.
ACP v4.1+.
Cluster administrator access to the target ACP cluster.
CDI enabled in the underlying container runtime (such as containerd).
DRA and the corresponding API groups enabled on the cluster.

The sections below walk through enabling CDI and DRA if they are not yet configured.
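The version floors listed above can be checked with a small comparison helper. This is a minimal sketch: the function name `version_ge` is made up for this example, and it compares only the numeric, dot-separated fields.

```shell
# version_ge A B: succeed if dotted version A is greater than or equal to B.
# A leading "v" is ignored and missing fields count as 0 (so 565 means 565.0.0).
# Hypothetical helper, not part of any NVIDIA or Kubernetes tool.
version_ge() {
  awk -v a="${1#v}" -v b="${2#v}" 'BEGIN {
    na = split(a, x, "."); nb = split(b, y, ".")
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
      xi = x[i] + 0; yi = y[i] + 0   # numeric compare, so "06" equals 6
      if (xi > yi) exit 0
      if (xi < yi) exit 1
    }
    exit 0                           # all fields equal
  }'
}
```

On a GPU node, the driver version could be fed in with `version_ge "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)" 565`, and a kubelet version such as `v1.32.4` can be compared against `1.32` in the same way.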

Installation

Install the NVIDIA driver on GPU nodes

Refer to the NVIDIA CUDA Installation Guide for Linux.

Install the NVIDIA Container Toolkit

Refer to the NVIDIA Container Toolkit installation guide.

Enable CDI in containerd

CDI (Container Device Interface) is a standard mechanism that lets device vendors describe everything a container runtime needs to give a container access to a specific resource, such as a GPU, rather than just a device name.

CDI is enabled by default in containerd 2.0 and later. For earlier versions (from 1.7.0), it must be activated manually.

INFO

The following steps are only required on GPU nodes running containerd v1.7.x.
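The version rule above can be written down as a small helper, which may be handy when nodes run mixed containerd versions. This is a sketch: the function name and the three result words are invented for this example.

```shell
# cdi_setup_for VERSION: print what the rule above implies for a containerd version.
#   default     -> 2.0 and later, CDI is already enabled
#   manual      -> 1.7.x, enable_cdi = true must be set by hand
#   unsupported -> older than 1.7.0, which this guide does not cover
# Hypothetical helper; the body runs in a subshell so its variables do not leak.
cdi_setup_for() (
  v=${1#v}            # drop a leading "v"
  major=${v%%.*}
  rest=${v#*.}
  minor=${rest%%.*}
  if [ "$major" -ge 2 ]; then
    echo default
  elif [ "$major" -eq 1 ] && [ "$minor" -ge 7 ]; then
    echo manual
  else
    echo unsupported
  fi
)
```

Assuming `containerd --version` prints the version as its third field, a node can be classified with `cdi_setup_for "$(containerd --version | awk '{print $3}')"`.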

Edit the containerd configuration file:
vi /etc/containerd/config.toml
Add or modify the following section:
[plugins."io.containerd.grpc.v1.cri"]
  enable_cdi = true
NOTE
Setting enable_cdi = true is sufficient. containerd's default cdi_spec_dirs already include /etc/cdi and /var/run/cdi, which is where the NVIDIA Container Toolkit writes its CDI specs. Only set cdi_spec_dirs explicitly if your toolkit is configured to emit specs to a different location.

Restart containerd and confirm it is running correctly:

systemctl restart containerd
systemctl status containerd

Verify that CDI is enabled:
journalctl -u containerd | grep "EnableCDI:true"
If matching log lines appear, CDI was enabled successfully.
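As a complement to the log check, the configuration file itself can be inspected before containerd is restarted. The following is a rough sketch with a made-up function name. It only looks for an uncommented `enable_cdi = true` line and does not confirm which TOML table the key belongs to.

```shell
# cdi_enabled_in_config FILE: succeed if FILE contains an uncommented
# "enable_cdi = true" line. Text match only; this is not a TOML parser.
# Hypothetical helper for illustration.
cdi_enabled_in_config() {
  grep -Eq '^[[:space:]]*enable_cdi[[:space:]]*=[[:space:]]*true([[:space:]]|#|$)' "$1"
}
```

For example: `cdi_enabled_in_config /etc/containerd/config.toml && echo "enable_cdi is set"`.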

Enable DRA in Kubernetes

DRA is enabled by default in Kubernetes 1.34 and later. For earlier versions (from 1.32), it must be activated manually.

INFO

The following steps apply to Kubernetes 1.32–1.33. Edit the kube-apiserver, kube-controller-manager, and kube-scheduler manifests on every control-plane node, and apply the kubelet change on every node.

Edit the kube-apiserver manifest at /etc/kubernetes/manifests/kube-apiserver.yaml.

For Kubernetes 1.32:

spec:
  containers:
    - command:
        - kube-apiserver
        - --feature-gates=DynamicResourceAllocation=true # required
        - --runtime-config=resource.k8s.io/v1beta1=true # required
      # ... other flags

For Kubernetes 1.33:

spec:
  containers:
    - command:
        - kube-apiserver
        - --feature-gates=DynamicResourceAllocation=true # required
        - --runtime-config=resource.k8s.io/v1beta1=true,resource.k8s.io/v1beta2=true # required
      # ... other flags
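The two kube-apiserver variants above differ only in the `--runtime-config` value. If the manifests are edited from a script, the choice could be captured in a helper like the one below. This is a sketch: the function name is hypothetical and the argument is the Kubernetes minor version.

```shell
# dra_apiserver_flags MINOR: print the extra kube-apiserver flags listed in this
# guide for Kubernetes 1.MINOR, one per line. Prints nothing for 1.34 and later,
# where DRA is on by default. Hypothetical helper for illustration.
dra_apiserver_flags() {
  case "$1" in
    32) printf '%s\n' '--feature-gates=DynamicResourceAllocation=true' \
                      '--runtime-config=resource.k8s.io/v1beta1=true' ;;
    33) printf '%s\n' '--feature-gates=DynamicResourceAllocation=true' \
                      '--runtime-config=resource.k8s.io/v1beta1=true,resource.k8s.io/v1beta2=true' ;;
    *)  if [ "$1" -ge 34 ] 2>/dev/null; then return 0; fi
        echo "Kubernetes 1.$1 is not covered by this guide" >&2
        return 1 ;;
  esac
}
```

For example, `dra_apiserver_flags 33` prints the two flags from the 1.33 manifest shown above.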

Edit the kube-controller-manager manifest at /etc/kubernetes/manifests/kube-controller-manager.yaml:

spec:
  containers:
    - command:
        - kube-controller-manager
        - --feature-gates=DynamicResourceAllocation=true # required
      # ... other flags

Edit the kube-scheduler manifest at /etc/kubernetes/manifests/kube-scheduler.yaml:

spec:
  containers:
    - command:
        - kube-scheduler
        - --feature-gates=DynamicResourceAllocation=true # required
      # ... other flags

Edit the kubelet configuration at /var/lib/kubelet/config.yaml on all nodes:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  DynamicResourceAllocation: true

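The kubelet reloads the static pod manifests under /etc/kubernetes/manifests on its own, so the control-plane components restart once those files are saved. The kubelet configuration change takes effect only after the kubelet is restarted on each node: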

sudo systemctl restart kubelet
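Once the control-plane components are back, one way to confirm that the API server serves the DRA API group is to look for `resource.k8s.io` in the output of `kubectl api-versions`. The helper name below is hypothetical. It reads the list on stdin, so it can also be tried without a cluster.

```shell
# has_dra_api: read `kubectl api-versions` output on stdin and succeed if any
# resource.k8s.io version (for example resource.k8s.io/v1beta1) is listed.
# Hypothetical helper for illustration.
has_dra_api() {
  grep -Eq '^resource\.k8s\.io/v[0-9]+'
}
```

For example: `kubectl api-versions | has_dra_api && echo "DRA API group is served"`.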

Download the cluster plugin

The Alauda Build of NVIDIA DRA Driver for GPUs cluster plugin can be retrieved from the Customer Portal. Contact Customer Support for more information.

Upload the cluster plugin

Upload the downloaded package with the violet command-line tool. For details, see Upload Packages.

Install Alauda Build of NVIDIA DRA Driver for GPUs

Label each GPU node so the nvidia-dra-driver-gpu-kubelet-plugin is scheduled onto it:
kubectl label nodes {node-name} nvidia-device-enable=pgpu-dra
WARNING
On the same node you can set only one of the following labels: gpu=on, nvidia-device-enable=pgpu, or nvidia-device-enable=pgpu-dra.
Navigate to Administrator > Marketplace > Cluster Plugins, switch to the target cluster, and deploy the Alauda Build of NVIDIA DRA Driver for GPUs cluster plugin.
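The label restriction in the warning above can be checked per node. Because `nvidia-device-enable` is a single key and can hold only one value, the conflict that can actually occur is `gpu=on` set alongside either `nvidia-device-enable` value. The sketch below uses an invented function name and counts how many of the three labels appear in a comma-separated label list.

```shell
# count_gpu_mode_labels LABELS: LABELS is a comma-separated key=value list, such
# as the LABELS column of `kubectl get node <name> --show-labels`. Prints how many
# of the three mutually exclusive labels are present; more than 1 is a conflict.
# Hypothetical helper for illustration.
count_gpu_mode_labels() {
  printf '%s\n' "$1" | tr ',' '\n' |
    grep -cE '^(gpu=on|nvidia-device-enable=pgpu|nvidia-device-enable=pgpu-dra)$' || :
}
```

The label list for one node can be obtained with `kubectl get node {node-name} --show-labels --no-headers | awk '{print $NF}'` and passed to the function as its single argument.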

Verify the DRA setup

Check the DRA driver and controller pods:

kubectl get pods -n kube-system | grep "nvidia-dra-driver-gpu"

The output should be similar to:

nvidia-dra-driver-gpu-controller-675644bfb5-c2hq4   1/1     Running   0              18h
nvidia-dra-driver-gpu-kubelet-plugin-65fjt          2/2     Running   0              18h
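For scripted checks, the same listing can be reduced to a pass or fail result. This is a minimal sketch with a made-up function name. It assumes the default `kubectl get pods` column order (NAME, READY, STATUS) and input without a header row.

```shell
# all_pods_ready: read `kubectl get pods --no-headers` lines on stdin.
# Succeed only if there is at least one line and every line is Running with
# all containers ready (READY shows n/n, for example 2/2).
# Hypothetical helper for illustration.
all_pods_ready() {
  awk '
    { seen = 1
      split($2, r, "/")                       # READY column, e.g. "2/2"
      if ($3 != "Running" || r[1] != r[2]) bad = 1 }
    END { exit ((seen && !bad) ? 0 : 1) }
  '
}
```

For example: `kubectl get pods -n kube-system --no-headers | grep "nvidia-dra-driver-gpu" | all_pods_ready && echo "DRA driver pods are ready"`.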

Verify the ResourceSlice objects:

kubectl get resourceslices -o yaml

For a GPU node, the output should be similar to:

apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  generateName: 192.168.140.59-gpu.nvidia.com-
  name: 192.168.140.59-gpu.nvidia.com-gbl46
  ownerReferences:
    - apiVersion: v1
      controller: true
      kind: Node
      name: 192.168.140.59
      uid: 4ab2c24c-fc35-4c75-bcaf-db038356575c
spec:
  devices:
    - basic:
        attributes:
          architecture:
            string: Pascal
          brand:
            string: Tesla
          cudaComputeCapability:
            version: 6.0.0
          cudaDriverVersion:
            version: 12.8.0
          driverVersion:
            version: 570.124.6
          pcieBusID:
            string: 0000:00:0b.0
          productName:
            string: Tesla P100-PCIE-16GB
          resource.kubernetes.io/pcieRoot:
            string: pci0000:00
          type:
            string: gpu
          uuid:
            string: GPU-b87512d7-c8a6-5f4b-8d3f-68183df62d66
        capacity:
          memory:
            value: 16Gi
      name: gpu-0
  driver: gpu.nvidia.com
  nodeName: 192.168.140.59
  pool:
    generation: 1
    name: 192.168.140.59
    resourceSliceCount: 1
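The `productName` attribute is the value most often needed from this output, since device selectors can match on it. The sketch below pulls the distinct product names out of the YAML. The function name is hypothetical, and the code relies on the layout shown above, where the value sits on the `string:` line directly after `productName:`. It is not a general YAML parser.

```shell
# list_product_names: read `kubectl get resourceslices -o yaml` on stdin and
# print each distinct productName value, one per line.
# Hypothetical helper; depends on the two-line "productName:" / "string:" layout.
list_product_names() {
  awk '
    /^[ \t]*productName:[ \t]*$/ { want = 1; next }
    want && /^[ \t]*string:/ {
      sub(/^[ \t]*string:[ \t]*/, "")       # keep only the value
      print
    }
    { want = 0 }
  ' | sort -u
}
```

For example: `kubectl get resourceslices -o yaml | list_product_names`.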

Validate the installation

This section assumes that you have completed the installation steps above and that all relevant GPU components are running and in a Ready state. The following workload confirms that the Alauda Build of NVIDIA DRA Driver for GPUs is working end to end.

Run a validation workload

Create the workload spec. Adjust the selector expression to match a productName reported in your own ResourceSlice output:

cat <<EOF > dra-gpu-test.yaml
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: gpu-template
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        selectors:
        - cel:
            expression: "device.attributes['gpu.nvidia.com'].productName == 'Tesla P100-PCIE-16GB'"
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-workload
spec:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
  restartPolicy: OnFailure
  resourceClaims:
  - name: gpu-claim
    resourceClaimTemplateName: gpu-template
  containers:
  - name: cuda-container
    image: "ubuntu:22.04"
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpu-claim
EOF
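If the spec is generated for several GPU models, the selector expression could be built from the product name instead of being edited by hand. This is a small sketch with an invented function name. It reproduces the expression format used in the template above and inserts the name unchanged.

```shell
# product_selector NAME: print the CEL expression used in the template above,
# with NAME as the product to match. NAME is inserted as-is, so it must not
# contain a single quote. Hypothetical helper for illustration.
product_selector() {
  printf "device.attributes['gpu.nvidia.com'].productName == '%s'\n" "$1"
}
```

For example, `product_selector 'Tesla P100-PCIE-16GB'` prints the expression that appears in the spec above.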

Apply the spec:
kubectl apply -f dra-gpu-test.yaml

Inspect the container logs:

kubectl logs dra-gpu-workload -f

The output should show the GPU UUID from inside the container, for example:

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-b87512d7-c8a6-5f4b-8d3f-68183df62d66)
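In an automated check, the log output can be matched against the line shape shown above rather than read by eye. This is a minimal sketch with a made-up function name. It looks for at least one line of the form `GPU <index>: <name> (UUID: GPU-...)`.

```shell
# logs_show_gpu: read container logs on stdin and succeed if any line has the
# `nvidia-smi -L` shape "GPU <index>: <name> (UUID: GPU-<hex and dashes>)".
# Hypothetical helper for illustration.
logs_show_gpu() {
  grep -Eq '^GPU [0-9]+: .+ \(UUID: GPU-[0-9a-f-]+\)'
}
```

For example: `kubectl logs dra-gpu-workload | logs_show_gpu && echo "GPU visible in container"`.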
