Alauda Build of NVIDIA DRA Driver for GPUs

Introduction

Dynamic Resource Allocation (DRA) is a Kubernetes feature that provides a more flexible and extensible way to request and allocate hardware resources such as GPUs. Unlike traditional device plugins, which only support simple counting of identical resources, DRA enables fine-grained device selection based on device attributes and capabilities.

Alauda Build of NVIDIA DRA Driver for GPUs is delivered as a cluster plugin that brings the upstream NVIDIA DRA driver to your ACP cluster, allowing workloads to claim GPUs through ResourceClaim and ResourceClaimTemplate objects.

Prerequisites

  • NVIDIA driver v565+ installed on every GPU node.
  • Kubernetes v1.32+.
  • ACP v4.1+.
  • Cluster administrator access to the target ACP cluster.
  • CDI enabled in the underlying container runtime (such as containerd).
  • DRA and the corresponding API groups enabled on the cluster.

The sections below walk through enabling CDI and DRA if they are not yet configured.
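The version requirements above can be checked from a shell before you start. The following is a minimal sketch rather than an official tool: `version_ge` and `check_min` are hypothetical helper names, and the comparison assumes that `sort -V` (GNU coreutils version sort) is available on the node.

```shell
#!/usr/bin/env bash
# Sketch only: version_ge and check_min are hypothetical helpers,
# not part of any NVIDIA, Kubernetes, or ACP tooling.

# Succeeds when version $1 is greater than or equal to version $2.
# A leading "v" (as in "v1.32.0") is ignored. Relies on `sort -V`.
version_ge() {
  local have="${1#v}" want="${2#v}"
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Prints a one-line verdict for a named component.
check_min() {
  local name="$1" have="$2" want="$3"
  if version_ge "$have" "$want"; then
    echo "$name $have: ok (>= $want)"
  else
    echo "$name $have: too old (need >= $want)"
  fi
}

# Example invocations (commented out so the helpers can be sourced safely):
# check_min "NVIDIA driver" "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)" 565
# check_min "kubelet" "$(kubectl get node "$NODE" -o jsonpath='{.status.nodeInfo.kubeletVersion}')" 1.32
```

`$NODE` is a placeholder for one of your node names. The helpers only compare version strings; they do not verify CDI, DRA, or ACP itself.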

Installation

Install the NVIDIA driver on GPU nodes

Refer to the NVIDIA CUDA Installation Guide for Linux.

Install the NVIDIA Container Toolkit

Refer to the NVIDIA Container Toolkit installation guide.

Enable CDI in containerd

CDI (Container Device Interface) is a standard mechanism that lets device vendors describe everything a container needs in order to access a specific resource, such as a GPU, beyond a simple device name.

CDI is enabled by default in containerd 2.0 and later. In containerd 1.7.x, it must be enabled manually.

INFO

The following steps are only required on GPU nodes running containerd v1.7.x.

  1. Edit the containerd configuration file:

    vi /etc/containerd/config.toml

    Add or modify the following section:

    [plugins."io.containerd.grpc.v1.cri"]
      enable_cdi = true

    NOTE

    Setting enable_cdi = true is sufficient. containerd's default cdi_spec_dirs already include /etc/cdi and /var/run/cdi, the directories where the NVIDIA Container Toolkit writes its CDI specs. Set cdi_spec_dirs explicitly only if your toolkit is configured to emit specs to a different location.

  2. Restart containerd and confirm it is running correctly:

    systemctl restart containerd
    systemctl status containerd

  3. Verify that CDI is enabled:

    journalctl -u containerd | grep "EnableCDI:true"

    If matching log lines appear, CDI was enabled successfully.
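If the journal has rotated and the grep above returns nothing, the configuration file can be checked directly. The sketch below is illustrative only: `cdi_enabled_in_config` is a hypothetical helper that looks for an uncommented `enable_cdi = true` line and does not verify which TOML table the key belongs to.

```shell
# Sketch only: cdi_enabled_in_config is a hypothetical helper name.
# Succeeds if the given containerd config file contains an uncommented
# "enable_cdi = true" line. It does not parse TOML, so it cannot tell
# which [plugins...] table the key sits under.
cdi_enabled_in_config() {
  grep -Eq '^[[:space:]]*enable_cdi[[:space:]]*=[[:space:]]*true([[:space:]]|#|$)' "$1"
}

# Example invocation on a GPU node (commented out so the helper can be sourced safely):
# cdi_enabled_in_config /etc/containerd/config.toml && echo "enable_cdi = true is set"
```

Because this reads the file rather than the running daemon, it confirms only what containerd will load on its next restart. `containerd config dump` prints the merged configuration if you want to inspect it in full.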

Enable DRA in Kubernetes

DRA is enabled by default in Kubernetes 1.34 and later. In Kubernetes 1.32 and 1.33, it must be enabled manually.

INFO

The following steps apply to Kubernetes 1.32–1.33. Apply the control-plane changes (steps 1 to 3) on all control plane nodes and the kubelet change (step 4) on all nodes.

  1. Edit the kube-apiserver manifest at /etc/kubernetes/manifests/kube-apiserver.yaml.

    For Kubernetes 1.32:

    spec:
      containers:
        - command:
            - kube-apiserver
            - --feature-gates=DynamicResourceAllocation=true # required
            - --runtime-config=resource.k8s.io/v1beta1=true # required
          # ... other flags

    For Kubernetes 1.33:

    spec:
      containers:
        - command:
            - kube-apiserver
            - --feature-gates=DynamicResourceAllocation=true # required
            - --runtime-config=resource.k8s.io/v1beta1=true,resource.k8s.io/v1beta2=true # required
          # ... other flags

  2. Edit the kube-controller-manager manifest at /etc/kubernetes/manifests/kube-controller-manager.yaml:

    spec:
      containers:
        - command:
            - kube-controller-manager
            - --feature-gates=DynamicResourceAllocation=true # required
          # ... other flags

  3. Edit the kube-scheduler manifest at /etc/kubernetes/manifests/kube-scheduler.yaml:

    spec:
      containers:
        - command:
            - kube-scheduler
            - --feature-gates=DynamicResourceAllocation=true # required
          # ... other flags

  4. Edit the kubelet configuration at /var/lib/kubelet/config.yaml on all nodes:

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    featureGates:
      DynamicResourceAllocation: true

    Restart the kubelet:

    sudo systemctl restart kubelet
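Once the control-plane components have restarted, one way to confirm that the API server serves the DRA API group is to filter the output of `kubectl api-versions`. This is a hedged sketch: `dra_api_versions` is a hypothetical helper name, and it checks only which versions are served, not the feature gates on the scheduler, controller manager, or kubelet.

```shell
# Sketch only: dra_api_versions is a hypothetical helper name.
# Reads `kubectl api-versions` output on stdin, prints the served
# resource.k8s.io versions, and fails if there are none.
dra_api_versions() {
  grep -E '^resource\.k8s\.io/'
}

# Example invocation (commented out so the helper can be sourced safely):
# kubectl api-versions | dra_api_versions || echo "resource.k8s.io is not served"
```

With the runtime-config flags shown above, the output should include resource.k8s.io/v1beta1 on Kubernetes 1.32, and additionally resource.k8s.io/v1beta2 on Kubernetes 1.33.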

Download the cluster plugin

The Alauda Build of NVIDIA DRA Driver for GPUs cluster plugin can be retrieved from the Customer Portal. Contact Customer Support for more information.

Upload the cluster plugin

Upload the downloaded package with the violet command-line tool. For details, see Upload Packages.

Install Alauda Build of NVIDIA DRA Driver for GPUs

  1. Label each GPU node so the nvidia-dra-driver-gpu-kubelet-plugin is scheduled onto it:

    kubectl label nodes {node-name} nvidia-device-enable=pgpu-dra

    WARNING

    A node can carry only one of the following labels at a time: gpu=on, nvidia-device-enable=pgpu, or nvidia-device-enable=pgpu-dra.

  2. Navigate to Administrator > Marketplace > Cluster Plugins, switch to the target cluster, and deploy the Alauda Build of NVIDIA DRA Driver for GPUs cluster plugin.
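Before labelling a node, it can help to check whether it already carries one of the mutually exclusive labels from the warning above. The sketch below is an assumption-laden illustration: `dra_label_conflict` is a hypothetical helper that takes the node's current `gpu` and `nvidia-device-enable` label values (empty strings when unset) and reports whether adding `nvidia-device-enable=pgpu-dra` would break the one-label rule.

```shell
# Sketch only: dra_label_conflict is a hypothetical helper name.
# $1 = current value of the node's "gpu" label ("" if unset)
# $2 = current value of the node's "nvidia-device-enable" label ("" if unset)
# Prints "ok" and succeeds when nvidia-device-enable=pgpu-dra can be applied
# without violating the one-label rule; otherwise prints the conflict and fails.
dra_label_conflict() {
  local gpu="$1" nde="$2"
  if [ "$gpu" = "on" ]; then
    echo "conflict: gpu=on is set"
    return 1
  fi
  if [ "$nde" = "pgpu" ]; then
    echo "conflict: nvidia-device-enable=pgpu is set"
    return 1
  fi
  echo "ok"
}
```

The current values of both labels can be listed for every node with `kubectl get nodes -L gpu,nvidia-device-enable`, which adds one column per label to the node table.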

Verify the DRA setup

  1. Check that the DRA controller and kubelet plugin pods are running:

    kubectl get pods -n kube-system | grep "nvidia-dra-driver-gpu"

    The output should be similar to:

    nvidia-dra-driver-gpu-controller-675644bfb5-c2hq4   1/1     Running   0              18h
    nvidia-dra-driver-gpu-kubelet-plugin-65fjt          2/2     Running   0              18h

  2. Verify the ResourceSlice objects:

    kubectl get resourceslices -o yaml

    For a GPU node, the output should be similar to:

    apiVersion: resource.k8s.io/v1beta1
    kind: ResourceSlice
    metadata:
      generateName: 192.168.140.59-gpu.nvidia.com-
      name: 192.168.140.59-gpu.nvidia.com-gbl46
      ownerReferences:
        - apiVersion: v1
          controller: true
          kind: Node
          name: 192.168.140.59
          uid: 4ab2c24c-fc35-4c75-bcaf-db038356575c
    spec:
      devices:
        - basic:
            attributes:
              architecture:
                string: Pascal
              brand:
                string: Tesla
              cudaComputeCapability:
                version: 6.0.0
              cudaDriverVersion:
                version: 12.8.0
              driverVersion:
                version: 570.124.6
              pcieBusID:
                string: 0000:00:0b.0
              productName:
                string: Tesla P100-PCIE-16GB
              resource.kubernetes.io/pcieRoot:
                string: pci0000:00
              type:
                string: gpu
              uuid:
                string: GPU-b87512d7-c8a6-5f4b-8d3f-68183df62d66
            capacity:
              memory:
                value: 16Gi
          name: gpu-0
      driver: gpu.nvidia.com
      nodeName: 192.168.140.59
      pool:
        generation: 1
        name: 192.168.140.59
        resourceSliceCount: 1

Validate the installation

This section assumes that you have completed the installation steps above and that all relevant GPU components are running and in a Ready state. The following workload confirms that the Alauda Build of NVIDIA DRA Driver for GPUs is working end to end.

Run a validation workload

  1. Create the workload spec. Adjust the selector expression to match a productName reported in your own ResourceSlice output:

    cat <<EOF > dra-gpu-test.yaml
    ---
    apiVersion: resource.k8s.io/v1beta1
    kind: ResourceClaimTemplate
    metadata:
      name: gpu-template
    spec:
      spec:
        devices:
          requests:
          - name: gpu
            deviceClassName: gpu.nvidia.com
            selectors:
            - cel:
                expression: "device.attributes['gpu.nvidia.com'].productName == 'Tesla P100-PCIE-16GB'"
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: dra-gpu-workload
    spec:
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      restartPolicy: OnFailure
      resourceClaims:
      - name: gpu-claim
        resourceClaimTemplateName: gpu-template
      containers:
      - name: cuda-container
        image: "ubuntu:22.04"
        command: ["bash", "-c"]
        args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
        resources:
          claims:
          - name: gpu-claim
    EOF

  2. Apply the spec:

    kubectl apply -f dra-gpu-test.yaml

  3. Inspect the container logs:

    kubectl logs dra-gpu-workload -f

    The output should list the GPU allocated to the container, including its UUID, for example:

    GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-b87512d7-c8a6-5f4b-8d3f-68183df62d66)
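The log check can also be scripted, for instance in a smoke test. The sketch below is illustrative: `log_shows_gpu` is a hypothetical helper that succeeds when its input contains at least one line in the `GPU <index>: <name> (UUID: GPU-...)` form shown above.

```shell
# Sketch only: log_shows_gpu is a hypothetical helper name.
# Succeeds if stdin contains a line shaped like the `nvidia-smi -L` output
# shown above, for example:
#   GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-b87512d7-...)
log_shows_gpu() {
  grep -Eq '^GPU [0-9]+: .+ \(UUID: GPU-[0-9a-fA-F-]+\)'
}

# Example invocation (commented out so the helper can be sourced safely):
# kubectl logs dra-gpu-workload | log_shows_gpu && echo "GPU visible in container"
```

When the check has passed, the test objects can be removed with `kubectl delete -f dra-gpu-test.yaml`.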