<h1 id="running-cuda-workloads">Running CUDA workloads<a class="headerlink" href="#running-cuda-workloads" title="Permanent link">&para;</a></h1>
<p>If you want to run CUDA workloads on the K3s container you need to customize the container.<br />
CUDA workloads require the NVIDIA Container Runtime, so containerd needs to be configured to use this runtime.<br />
The K3s container itself also needs to run with this runtime.<br />
If you are using Docker you can install the <a href="">NVIDIA Container Toolkit</a>.</p>
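<p>Before building anything, it can help to confirm that Docker on the host actually knows about the NVIDIA runtime. The following is a minimal sketch, not part of the original guide: the helper name <code>has_nvidia_runtime</code> is hypothetical, and it assumes the toolkit registered its runtime under the default name <code>nvidia</code>.</p>

```shell
#!/bin/sh
# Reads the JSON printed by `docker info --format '{{json .Runtimes}}'`
# on stdin and succeeds if a runtime keyed "nvidia" is present.
# Assumption: the runtime was registered under the default name "nvidia".
has_nvidia_runtime() {
    grep -q '"nvidia"[[:space:]]*:'
}

# Usage (requires a running Docker daemon):
#   docker info --format '{{json .Runtimes}}' | has_nvidia_runtime \
#       && echo "nvidia runtime registered" \
#       || echo "nvidia runtime missing"
```

<p>The helper only inspects text, so it can be tried against saved <code>docker info</code> output as well.</p>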
<h2 id="building-a-customized-k3s-image">Building a customized K3s image<a class="headerlink" href="#building-a-customized-k3s-image" title="Permanent link">&para;</a></h2>
<p>To get the NVIDIA container runtime in the K3s image you need to build your own K3s image.<br />
The native K3s image is based on Alpine but the NVIDIA container runtime is not supported on Alpine yet.<br />
To get around this we need to build the image with a supported base image.</p>
<h3 id="dockerfile">Dockerfile<a class="headerlink" href="#dockerfile" title="Permanent link">&para;</a></h3>
<p><a href="cuda/Dockerfile">Dockerfile</a>: </p>
<pre><code>ARG K3S_TAG="v1.21.2-k3s1"
FROM rancher/k3s:$K3S_TAG as k3s

FROM nvidia/cuda:11.2.0-base-ubuntu18.04

# Version of the nvidia-container-runtime package; must be supplied via --build-arg
ARG NVIDIA_CONTAINER_RUNTIME_VERSION

RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections

RUN apt-get update &amp;&amp; \
    apt-get -y install gnupg2 curl

# Install NVIDIA Container Runtime
RUN curl -s -L | apt-key add -

RUN curl -s -L | tee /etc/apt/sources.list.d/nvidia-container-runtime.list

RUN apt-get update &amp;&amp; \
    apt-get -y install nvidia-container-runtime=${NVIDIA_CONTAINER_RUNTIME_VERSION}

COPY --from=k3s / /

RUN mkdir -p /etc &amp;&amp; \
    echo 'hosts: files dns' &gt; /etc/nsswitch.conf

RUN chmod 1777 /tmp

# Provide custom containerd configuration to configure the nvidia-container-runtime
RUN mkdir -p /var/lib/rancher/k3s/agent/etc/containerd/

COPY config.toml.tmpl /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl

# Deploy the nvidia driver plugin on startup
RUN mkdir -p /var/lib/rancher/k3s/server/manifests

COPY device-plugin-daemonset.yaml /var/lib/rancher/k3s/server/manifests/nvidia-device-plugin-daemonset.yaml

VOLUME /var/lib/kubelet
VOLUME /var/lib/rancher/k3s
VOLUME /var/lib/cni
VOLUME /var/log

ENV PATH="$PATH:/bin/aux"

ENTRYPOINT ["/bin/k3s"]
CMD ["agent"]</code></pre>
<p>This Dockerfile is based on the <a href="">K3s Dockerfile</a>.<br />
The following changes are applied:</p>
<ul>
<li>Change the base image to <code>nvidia/cuda:11.2.0-base-ubuntu18.04</code> so the NVIDIA Container Runtime can be installed. The version in <code>cuda:xx.x.x</code> must match the CUDA version you&rsquo;re planning to use.</li>
<li>Add a custom containerd <code>config.toml</code> template that adds the NVIDIA Container Runtime. This replaces the default <code>runc</code> runtime.</li>
<li>Add a manifest for the NVIDIA driver plugin for Kubernetes.</li>
</ul>
<h3 id="configure-containerd">Configure containerd<a class="headerlink" href="#configure-containerd" title="Permanent link">&para;</a></h3>
<p>We need to configure containerd to use the NVIDIA Container Runtime, which means customizing the <code>config.toml</code> that is used at startup. K3s provides a way to do this using a <a href="config.toml.tmpl">config.toml.tmpl</a> file. More information can be found on the <a href="">K3s site</a>.</p>
<pre><code>path = "{{ .NodeConfig.Containerd.Opt }}"

stream_server_address = &ldquo;;
stream_server_port = "10010"

{{- if .IsRunningInUserNS }}
disable_cgroup = true
disable_apparmor = true
restrict_oom_score_adj = true

{{- if .NodeConfig.AgentConfig.PauseImage }}
sandbox_image = "{{ .NodeConfig.AgentConfig.PauseImage }}"

{{- if not .NodeConfig.NoFlannel }}
bin_dir = "{{ .NodeConfig.AgentConfig.CNIBinDir }}"
conf_dir = "{{ .NodeConfig.AgentConfig.CNIConfDir }}"

# ---- changed from 'io.containerd.runc.v2' for GPU support
runtime_type = "io.containerd.runtime.v1.linux"

# ---- added for GPU support
runtime = "nvidia-container-runtime"

{{ if .PrivateRegistryConfig }}
{{ if .PrivateRegistryConfig.Mirrors }}
{{range $k, $v := .PrivateRegistryConfig.Mirrors }}
endpoint = [{{range $i, $j := $v.Endpoints}}{{if $i}}, {{end}}{{printf "%q" .}}{{end}}]

{{range $k, $v := .PrivateRegistryConfig.Configs }}
{{ if $v.Auth }}
{{ if $v.Auth.Username }}username = "{{ $v.Auth.Username }}"{{end}}
{{ if $v.Auth.Password }}password = "{{ $v.Auth.Password }}"{{end}}
{{ if $v.Auth.Auth }}auth = "{{ $v.Auth.Auth }}"{{end}}
{{ if $v.Auth.IdentityToken }}identitytoken = "{{ $v.Auth.IdentityToken }}"{{end}}
{{ if $v.TLS }}
{{ if $v.TLS.CAFile }}ca_file = "{{ $v.TLS.CAFile }}"{{end}}
{{ if $v.TLS.CertFile }}cert_file = "{{ $v.TLS.CertFile }}"{{end}}
{{ if $v.TLS.KeyFile }}key_file = "{{ $v.TLS.KeyFile }}"{{end}}</code></pre>
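<p>Once a node is running, one way to confirm that the template took effect is to look for the <code>nvidia-container-runtime</code> setting in the rendered containerd config. This is only a sketch: the helper name <code>uses_nvidia_runtime</code> is hypothetical, and the usage comment assumes that K3s writes the rendered file next to the template as <code>config.toml</code> and that k3d named the node container <code>k3d-gputest-server-0</code>.</p>

```shell
#!/bin/sh
# Succeeds if the given containerd config file contains a line of the form
#   runtime = "nvidia-container-runtime"
# Lines such as `runtime_type = ...` do not match.
uses_nvidia_runtime() {
    grep -Eq '^[[:space:]]*runtime[[:space:]]*=[[:space:]]*"nvidia-container-runtime"' "$1"
}

# Usage (container name and rendered path are assumptions, see above):
#   docker exec k3d-gputest-server-0 \
#       cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml > /tmp/config.toml
#   uses_nvidia_runtime /tmp/config.toml && echo "NVIDIA runtime configured"
```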
<h3 id="the-nvidia-device-plugin">The NVIDIA device plugin<a class="headerlink" href="#the-nvidia-device-plugin" title="Permanent link">&para;</a></h3>
<p>To enable NVIDIA GPU support on Kubernetes you also need to install the <a href="">NVIDIA device plugin</a>. The device plugin is a DaemonSet that allows you to automatically:</p>
<ul>
<li>Expose the number of GPUs on each node of your cluster</li>
<li>Keep track of the health of your GPUs</li>
<li>Run GPU-enabled containers in your Kubernetes cluster</li>
</ul>
<pre><code>apiVersion: apps/v1
kind: DaemonSet
name: nvidia-device-plugin-daemonset
namespace: kube-system
name: nvidia-device-plugin-ds
# Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
# reserves resources for critical add-on pods so that they can be rescheduled after
# a failure. This annotation works in tandem with the toleration below.
annotations: ""
name: nvidia-device-plugin-ds
# Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
# This, along with the annotation above marks this pod as a critical add-on.
- key: CriticalAddonsOnly
operator: Exists
- env:
value: xids
image: nvidia/k8s-device-plugin:1.11
name: nvidia-device-plugin-ctr
allowPrivilegeEscalation: true
drop: ["ALL"]
- name: device-plugin
mountPath: /var/lib/kubelet/device-plugins
- name: device-plugin
path: /var/lib/kubelet/device-plugins</code></pre>
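<p>When the device plugin is working, nodes advertise their GPUs as the allocatable resource <code>nvidia.com/gpu</code>. The sketch below adds up that column from a node listing; the helper name <code>total_gpus</code> is hypothetical and the node names in any sample output depend on your cluster.</p>

```shell
#!/bin/sh
# Reads a two-column "NAME GPU" table on stdin (header on the first line)
# and prints the total number of advertised GPUs. Rows whose GPU column is
# not a plain number (for example "<none>") are counted as zero.
total_gpus() {
    awk 'NR > 1 && $2 ~ /^[0-9]+$/ { sum += $2 } END { print sum + 0 }'
}

# Usage (requires kubectl access to the cluster):
#   kubectl get nodes \
#       -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#       | total_gpus
```

<p>A total of <code>0</code> suggests that the DaemonSet is not running or that the container runtime cannot see the GPU.</p>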
<h3 id="build-the-k3s-image">Build the K3s image<a class="headerlink" href="#build-the-k3s-image" title="Permanent link">&para;</a></h3>
<p>To build the custom image we need to build K3s because we need the generated output.</p>
<p>Put the following files in a directory:</p>
<ul>
<li><a href="cuda/Dockerfile">Dockerfile</a></li>
<li><a href="config.toml.tmpl">config.toml.tmpl</a></li>
<li><a href="device-plugin-daemonset.yaml">device-plugin-daemonset.yaml</a></li>
<li><a href=""></a></li>
<li><a href="cuda-vector-add.yaml">cuda-vector-add.yaml</a></li>
</ul>
<p>The <code></code> script is configured using exported environment variables and defaults to K3s <code>v1.21.2+k3s1</code>. Please set at least the <code>IMAGE_REGISTRY</code> variable! The script builds the custom K3s image, including the NVIDIA drivers, and pushes it to the registry.</p>
<p><a href=""></a>:</p>
<pre><code>#!/bin/bash

set -euxo pipefail

K3S_TAG=${K3S_TAG:="v1.21.2-k3s1"} # replace + with -, if needed

echo "IMAGE=$IMAGE"

# due to some unknown reason, copying symlinks fails with buildkit enabled
DOCKER_BUILDKIT=0 docker build \
  --build-arg K3S_TAG=$K3S_TAG \
  -t $IMAGE .
docker push $IMAGE
echo "Done!"</code></pre>
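<p>K3s versions contain a <code>+</code> (for example <code>v1.21.2+k3s1</code>), which is not a valid character in an image tag, so the tag uses <code>-</code> instead. A small sketch of that conversion, using a hypothetical helper name <code>k3s_version_to_tag</code>:</p>

```shell
#!/bin/sh
# Converts a K3s version string into the matching image tag by
# replacing every "+" with "-".
k3s_version_to_tag() {
    printf '%s\n' "$1" | tr '+' '-'
}

# Usage:
#   K3S_TAG=$(k3s_version_to_tag "v1.21.2+k3s1")
```

<p>The resulting value can then be exported as <code>K3S_TAG</code> before running the build.</p>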
<h2 id="run-and-test-the-custom-image-with-k3d">Run and test the custom image with k3d<a class="headerlink" href="#run-and-test-the-custom-image-with-k3d" title="Permanent link">&para;</a></h2>
<p>You can use the image with k3d:</p>
<p><code>k3d cluster create gputest --image=$IMAGE --gpus=1</code></p>
<p>Deploy a <a href="cuda-vector-add.yaml">test pod</a>:</p>
<pre><code>kubectl apply -f cuda-vector-add.yaml
kubectl logs cuda-vector-add</code></pre>
<p>This should output something like the following:</p>
<pre><code>$ kubectl logs cuda-vector-add

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory</code></pre>
<p>If the <code>cuda-vector-add</code> pod is stuck in the <code>Pending</code> state, the device plugin DaemonSet probably didn&rsquo;t get deployed correctly from the auto-deploy manifests. In that case, you can apply it manually via <code>kubectl apply -f device-plugin-daemonset.yaml</code>.</p>
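<p>The troubleshooting step above can be folded into a small helper that maps the pod&rsquo;s phase to a suggested next command. This is only an illustrative sketch: the function name <code>next_step_for_phase</code> and the suggestions for phases other than <code>Pending</code> are assumptions, not part of the original guide.</p>

```shell
#!/bin/sh
# Prints a suggested next command for the given pod phase.
# Pending   -> the device plugin is probably missing, so apply it manually
# Succeeded -> the test ran to completion, so read its logs
# Failed    -> inspect the pod events
# other     -> keep watching the pod (for example while it is Running)
next_step_for_phase() {
    case "$1" in
        Pending)   echo "kubectl apply -f device-plugin-daemonset.yaml" ;;
        Succeeded) echo "kubectl logs cuda-vector-add" ;;
        Failed)    echo "kubectl describe pod cuda-vector-add" ;;
        *)         echo "kubectl get pod cuda-vector-add --watch" ;;
    esac
}

# Usage (requires kubectl access to the cluster):
#   next_step_for_phase "$(kubectl get pod cuda-vector-add -o jsonpath='{.status.phase}')"
```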
<h2 id="known-issues">Known issues<a class="headerlink" href="#known-issues" title="Permanent link">&para;</a></h2>
<ul>
<li>This approach does not work on WSL2 yet. The NVIDIA driver plugin and container runtime rely on the NVIDIA Management Library (NVML), which is not yet supported on WSL2. See the <a href="">CUDA on WSL User Guide</a>.</li>
</ul>
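<p>To spot this situation early, a script can check whether it is running under WSL before creating the cluster. The sketch below relies on the common heuristic that the kernel version string on WSL contains the word <code>microsoft</code>; that heuristic, and the helper name <code>is_wsl_kernel</code>, are assumptions. It does not distinguish WSL1 from WSL2.</p>

```shell
#!/bin/sh
# Succeeds if the given kernel version string looks like a WSL kernel,
# i.e. it contains "microsoft" in any letter case.
is_wsl_kernel() {
    case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
        *microsoft*) return 0 ;;
        *)           return 1 ;;
    esac
}

# Usage:
#   is_wsl_kernel "$(cat /proc/version)" \
#       && echo "WSL detected: GPU support in k3d is not expected to work"
```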
<h2 id="acknowledgements">Acknowledgements<a class="headerlink" href="#acknowledgements" title="Permanent link">&para;</a></h2>
<p>Most of the information in this article was obtained from the following sources:</p>
<ul>
<li><a href="">Add NVIDIA GPU support to k3s with containerd</a></li>
<li><a href="">microk8s</a></li>
<li><a href="">K3s</a></li>
<li><a href="">k3s-gpu</a></li>
</ul>
<h2 id="authors">Authors<a class="headerlink" href="#authors" title="Permanent link">&para;</a></h2>
<ul>
<li><a href="">@markrexwinkel</a></li>
<li><a href="">@vainkop</a></li>
<li><a href="">@iwilltry42</a></li>
</ul>
