simplify cuda guide

pull/676/head
iwilltry42 authored 3 years ago · committed by Thorsten Klein
parent a2305bd87a
commit 4b1b14034c
Changed files:

1. docs/usage/guides/cuda.md (61)
2. docs/usage/guides/cuda/Dockerfile (47)
3. docs/usage/guides/cuda/Dockerfile.base (32)
4. docs/usage/guides/cuda/Dockerfile.k3d-gpu (72)
5. docs/usage/guides/cuda/build.sh (39)
6. docs/usage/guides/cuda/device-plugin-daemonset.yaml (0)

docs/usage/guides/cuda.md
@@ -11,22 +11,15 @@ To get the NVIDIA container runtime in the K3s image you need to build your own
The native K3s image is based on Alpine but the NVIDIA container runtime is not supported on Alpine yet.
To get around this we need to build the image with a supported base image.
-### Dockerfiles
+### Dockerfile
-[Dockerfile.base](cuda/Dockerfile.base):
-```Dockerfile
-{% include "cuda/Dockerfile.base" %}
-```
-[Dockerfile.k3d-gpu](cuda/Dockerfile.k3d-gpu):
+[Dockerfile](cuda/Dockerfile):
```Dockerfile
-{% include "cuda/Dockerfile.k3d-gpu" %}
+{% include "cuda/Dockerfile" %}
```
-These Dockerfiles are based on the [K3s Dockerfile](https://github.com/rancher/k3s/blob/master/package/Dockerfile)
+This Dockerfile is based on the [K3s Dockerfile](https://github.com/rancher/k3s/blob/master/package/Dockerfile)
The following changes are applied:
1. Change the base images to nvidia/cuda:11.2.0-base-ubuntu18.04 so the NVIDIA Container Runtime can be installed. The version of `cuda:xx.x.x` must match the one you're planning to use.
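A quick way to pick a matching `cuda:xx.x.x` tag is to ask the host driver which CUDA version it supports; a minimal sketch, assuming the NVIDIA driver and `nvidia-smi` are installed on the host:

```bash
# The banner reports the driver version and the highest CUDA version
# it supports, e.g. "CUDA Version: 11.2".
nvidia-smi

# Driver version only, in machine-readable form:
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```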
@@ -50,7 +43,7 @@ To enable NVIDIA GPU support on Kubernetes you also need to install the [NVIDIA
* Run GPU enabled containers in your Kubernetes cluster.
```yaml
{% include "cuda/gpu.yaml" %}
{% include "cuda/device-plugin-daemonset.yaml" %}
```
### Build the K3s image
@@ -59,20 +52,13 @@ To build the custom image we need to build K3s because we need the generated out
Put the following files in a directory:
-* [Dockerfile.base](cuda/Dockerfile.base)
-* [Dockerfile.k3d-gpu](cuda/Dockerfile.k3d-gpu)
+* [Dockerfile](cuda/Dockerfile)
* [config.toml.tmpl](cuda/config.toml.tmpl)
-* [gpu.yaml](cuda/gpu.yaml)
+* [device-plugin-daemonset.yaml](cuda/device-plugin-daemonset.yaml)
* [build.sh](cuda/build.sh)
* [cuda-vector-add.yaml](cuda/cuda-vector-add.yaml)
-The `build.sh` script is configured using exports & defaults to `v1.21.2+k3s1`. Please set your CI_REGISTRY_IMAGE! The script performs the following steps:
-* pulls K3s
-* builds K3s
-* build the custom K3D Docker image
-The resulting image is tagged as k3s-gpu:<version tag>. The version tag is the git tag but the '+' sign is replaced with a '-'.
+The `build.sh` script is configured via environment variables and defaults to K3s `v1.21.2+k3s1`. Please set at least the `IMAGE_REGISTRY` variable! The script builds the custom K3s image including the NVIDIA drivers; see the usage sketch below the included script.
[build.sh](cuda/build.sh):
@@ -80,36 +66,35 @@ The resulting image is tagged as k3s-gpu:<version tag>. The version tag is
{% include "cuda/build.sh" %}
```
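A usage sketch for the script, assuming it is run from the directory containing the Dockerfile and that you can push to your registry (`registry.example.com` is a placeholder, not part of the guide):

```bash
# Override the defaults via environment variables; IMAGE_REGISTRY is needed
# for the final `docker push` to succeed.
K3S_TAG="v1.21.2-k3s1" \
IMAGE_REGISTRY="registry.example.com" \
./build.sh
```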
-## Run and test the custom image with Docker
+## Run and test the custom image with k3d
-You can run a container based on the new image with Docker:
+You can use the image with k3d:
```bash
-docker run --name k3s-gpu -d --privileged --gpus all $CI_REGISTRY_IMAGE:$IMAGE_TAG
+k3d cluster create gputest --image=$IMAGE --gpus=1
```
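Once the cluster is up, you can check that the node actually advertises the GPU; a sketch using plain kubectl, nothing k3d-specific:

```bash
# The device plugin is auto-deployed from the image's manifests directory;
# once its pod is running, the node reports an allocatable nvidia.com/gpu.
kubectl get pods -n kube-system | grep -i nvidia
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
```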
Deploy a [test pod](cuda/cuda-vector-add.yaml):
```bash
-docker cp cuda-vector-add.yaml k3s-gpu:/cuda-vector-add.yaml
-docker exec k3s-gpu kubectl apply -f /cuda-vector-add.yaml
-docker exec k3s-gpu kubectl logs cuda-vector-add
+kubectl apply -f cuda-vector-add.yaml
+kubectl logs cuda-vector-add
```
-## Run and test the custom image with k3d
-Tou can use the image with k3d:
+This should output something like the following:
```bash
-k3d cluster create local --image=$CI_REGISTRY_IMAGE:$IMAGE_TAG --gpus=1
+$ kubectl logs cuda-vector-add
+[Vector addition of 50000 elements]
+Copy input data from the host memory to the CUDA device
+CUDA kernel launch with 196 blocks of 256 threads
+Copy output data from the CUDA device to the host memory
+Test PASSED
+Done
```
-Deploy a [test pod](cuda/cuda-vector-add.yaml):
-```bash
-kubectl apply -f cuda-vector-add.yaml
-kubectl logs cuda-vector-add
-```
+If the `cuda-vector-add` pod is stuck in `Pending` state, probably the device-plugin daemonset didn't get deployed correctly from the auto-deploy manifests. In that case, you can apply it manually via `#!bash kubectl apply -f device-plugin-daemonset.yaml`.
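To confirm that diagnosis before re-applying the manifest, the pod's scheduling events are usually enough (plain kubectl again):

```bash
# Look for an event like "Insufficient nvidia.com/gpu", which means the
# device plugin never registered the GPU with the kubelet.
kubectl describe pod cuda-vector-add
```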
## Known issues

docs/usage/guides/cuda/Dockerfile (new file)
@@ -0,0 +1,47 @@
+ARG K3S_TAG="v1.21.2-k3s1"
+FROM rancher/k3s:$K3S_TAG as k3s
+
+FROM nvidia/cuda:11.2.0-base-ubuntu18.04
+
+ARG NVIDIA_CONTAINER_RUNTIME_VERSION
+ENV NVIDIA_CONTAINER_RUNTIME_VERSION=$NVIDIA_CONTAINER_RUNTIME_VERSION
+
+RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections
+RUN apt-get update && \
+    apt-get -y install gnupg2 curl
+
+# Install NVIDIA Container Runtime
+RUN curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | apt-key add -
+RUN curl -s -L https://nvidia.github.io/nvidia-container-runtime/ubuntu18.04/nvidia-container-runtime.list | tee /etc/apt/sources.list.d/nvidia-container-runtime.list
+RUN apt-get update && \
+    apt-get -y install nvidia-container-runtime=${NVIDIA_CONTAINER_RUNTIME_VERSION}
+
+COPY --from=k3s / /
+
+RUN mkdir -p /etc && \
+    echo 'hosts: files dns' > /etc/nsswitch.conf
+RUN chmod 1777 /tmp
+
+# Provide custom containerd configuration to configure the nvidia-container-runtime
+RUN mkdir -p /var/lib/rancher/k3s/agent/etc/containerd/
+COPY config.toml.tmpl /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
+
+# Deploy the nvidia driver plugin on startup
+RUN mkdir -p /var/lib/rancher/k3s/server/manifests
+COPY device-plugin-daemonset.yaml /var/lib/rancher/k3s/server/manifests/nvidia-device-plugin-daemonset.yaml
+
+VOLUME /var/lib/kubelet
+VOLUME /var/lib/rancher/k3s
+VOLUME /var/lib/cni
+VOLUME /var/log
+
+ENV PATH="$PATH:/bin/aux"
+
+ENTRYPOINT ["/bin/k3s"]
+CMD ["agent"]
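For reference, a manual build of this Dockerfile would look roughly like this, mirroring what `build.sh` (further down) does; the image tag here is illustrative:

```bash
# BuildKit is disabled because copying symlinks from the K3s image
# reportedly fails with it enabled (see the comment in build.sh).
DOCKER_BUILDKIT=0 docker build \
  --build-arg K3S_TAG="v1.21.2-k3s1" \
  --build-arg NVIDIA_CONTAINER_RUNTIME_VERSION="3.5.0-1" \
  -t rancher/k3s:v1.21.2-k3s1-cuda .
```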

docs/usage/guides/cuda/Dockerfile.base (deleted)
@@ -1,32 +0,0 @@
-FROM nvidia/cuda:11.2.0-base-ubuntu18.04
-
-ENV DEBIAN_FRONTEND noninteractive
-ARG DOCKER_VERSION
-ENV DOCKER_VERSION=$DOCKER_VERSION
-
-RUN set -x && \
-    apt-get update && \
-    apt-get install -y \
-      apt-transport-https \
-      ca-certificates \
-      curl \
-      wget \
-      tar \
-      zstd \
-      gnupg \
-      lsb-release \
-      git \
-      software-properties-common \
-      build-essential && \
-    rm -rf /var/lib/apt/lists/*
-
-RUN set -x && \
-    curl -fsSL https://download.docker.com/linux/$(lsb_release -is | tr '[:upper:]' '[:lower:]')/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg && \
-    echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/$(lsb_release -is | tr '[:upper:]' '[:lower:]') $(lsb_release -cs) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null && \
-    apt-get update && \
-    apt-get install -y \
-      containerd.io \
-      docker-ce=5:$DOCKER_VERSION~3-0~$(lsb_release -is | tr '[:upper:]' '[:lower:]')-$(lsb_release -cs) \
-      docker-ce-cli=5:$DOCKER_VERSION~3-0~$(lsb_release -is | tr '[:upper:]' '[:lower:]')-$(lsb_release -cs) && \
-    rm -rf /var/lib/apt/lists/*

docs/usage/guides/cuda/Dockerfile.k3d-gpu (deleted)
@@ -1,72 +0,0 @@
-FROM nvidia/cuda:11.2.0-base-ubuntu18.04 as base
-
-RUN set -x && \
-    apt-get update && \
-    apt-get install -y ca-certificates zstd
-
-COPY k3s/build/out/data.tar.zst /
-
-RUN set -x && \
-    mkdir -p /image/etc/ssl/certs /image/run /image/var/run /image/tmp /image/lib/modules /image/lib/firmware && \
-    tar -I zstd -xf /data.tar.zst -C /image && \
-    cp /etc/ssl/certs/ca-certificates.crt /image/etc/ssl/certs/ca-certificates.crt
-
-RUN set -x && \
-    cd image/bin && \
-    rm -f k3s && \
-    ln -s k3s-server k3s
-
-FROM nvidia/cuda:11.2.0-base-ubuntu18.04
-
-ARG NVIDIA_CONTAINER_RUNTIME_VERSION
-ENV NVIDIA_CONTAINER_RUNTIME_VERSION=$NVIDIA_CONTAINER_RUNTIME_VERSION
-
-RUN set -x && \
-    echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections
-
-RUN set -x && \
-    apt-get update && \
-    apt-get -y install gnupg2 curl
-
-# Install NVIDIA Container Runtime
-RUN set -x && \
-    curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | apt-key add -
-
-RUN set -x && \
-    curl -s -L https://nvidia.github.io/nvidia-container-runtime/ubuntu18.04/nvidia-container-runtime.list | tee /etc/apt/sources.list.d/nvidia-container-runtime.list
-
-RUN set -x && \
-    apt-get update && \
-    apt-get -y install nvidia-container-runtime=${NVIDIA_CONTAINER_RUNTIME_VERSION}
-
-COPY --from=base /image /
-
-RUN set -x && \
-    mkdir -p /etc && \
-    echo 'hosts: files dns' > /etc/nsswitch.conf
-
-RUN set -x && \
-    chmod 1777 /tmp
-
-# Provide custom containerd configuration to configure the nvidia-container-runtime
-RUN set -x && \
-    mkdir -p /var/lib/rancher/k3s/agent/etc/containerd/
-COPY config.toml.tmpl /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
-
-# Deploy the nvidia driver plugin on startup
-RUN set -x && \
-    mkdir -p /var/lib/rancher/k3s/server/manifests
-COPY gpu.yaml /var/lib/rancher/k3s/server/manifests/gpu.yaml
-
-VOLUME /var/lib/kubelet
-VOLUME /var/lib/rancher/k3s
-VOLUME /var/lib/cni
-VOLUME /var/log
-
-ENV PATH="$PATH:/bin/aux"
-
-ENTRYPOINT ["/bin/k3s"]
-CMD ["agent"]

docs/usage/guides/cuda/build.sh
@@ -1,30 +1,21 @@
#!/bin/bash
-export CI_REGISTRY_IMAGE="YOUR_REGISTRY_IMAGE_URL"
-export VERSION="1.0"
-export K3S_TAG="v1.21.2+k3s1"
-export DOCKER_VERSION="20.10.7"
-export IMAGE_TAG="v1.21.2-k3s1"
-export NVIDIA_CONTAINER_RUNTIME_VERSION="3.5.0-1"
set -euxo pipefail
-docker build -f Dockerfile.base --build-arg DOCKER_VERSION=$DOCKER_VERSION -t $CI_REGISTRY_IMAGE/base:$VERSION . && \
-docker push $CI_REGISTRY_IMAGE/base:$VERSION
+K3S_TAG=${K3S_TAG:="v1.21.2-k3s1"} # replace + with -, if needed
+IMAGE_REGISTRY=${IMAGE_REGISTRY:="MY_REGISTRY"}
+IMAGE_REPOSITORY=${IMAGE_REPOSITORY:="rancher/k3s"}
+IMAGE_TAG="$K3S_TAG-cuda"
+IMAGE=${IMAGE:="$IMAGE_REGISTRY/$IMAGE_REPOSITORY:$IMAGE_TAG"}
-rm -rf ./k3s && \
-git clone --depth 1 https://github.com/rancher/k3s.git -b "$K3S_TAG" && \
-docker run -ti -v ${PWD}/k3s:/k3s -v /var/run/docker.sock:/var/run/docker.sock $CI_REGISTRY_IMAGE/base:1.0 sh -c "cd /k3s && make" && \
-ls -al k3s/build/out/data.tar.zst
+NVIDIA_CONTAINER_RUNTIME_VERSION=${NVIDIA_CONTAINER_RUNTIME_VERSION:="3.5.0-1"}
-if [ -f k3s/build/out/data.tar.zst ]; then
-  echo "File exists! Building!"
-  docker build -f Dockerfile.k3d-gpu \
-    --build-arg NVIDIA_CONTAINER_RUNTIME_VERSION=$NVIDIA_CONTAINER_RUNTIME_VERSION \
-    -t $CI_REGISTRY_IMAGE:$IMAGE_TAG . && \
-  docker push $CI_REGISTRY_IMAGE:$IMAGE_TAG
-  echo "Done!"
-else
-  echo "Error, file does not exist!"
-  exit 1
-fi
+echo "IMAGE=$IMAGE"
-docker build -t $CI_REGISTRY_IMAGE:$IMAGE_TAG .
+# due to some unknown reason, copying symlinks fails with buildkit enabled
+DOCKER_BUILDKIT=0 docker build \
+  --build-arg K3S_TAG=$K3S_TAG \
+  --build-arg NVIDIA_CONTAINER_RUNTIME_VERSION=$NVIDIA_CONTAINER_RUNTIME_VERSION \
+  -t $IMAGE .
+docker push $IMAGE
+echo "Done!"