# FAQ / Nice to know
## Issues with BTRFS
- As [@jaredallard](https://github.com/jaredallard) [pointed out](https://github.com/rancher/k3d/pull/48), people running `k3d` on a system with **btrfs** may need to mount `/dev/mapper` into the nodes for the setup to work.
- This will do the trick: `#!bash k3d cluster create CLUSTER_NAME -v /dev/mapper:/dev/mapper`
## Issues with ZFS
- k3s currently has [no support for ZFS](https://github.com/rancher/k3s/issues/66) and thus, creating multi-server setups (e.g. `#!bash k3d cluster create multiserver --servers 3`) fails, because the initializing server node (server flag `--cluster-init`) errors out with the following log:
```bash
starting kubernetes: preparing server: start cluster and https: raft_init(): io: create I/O capabilities probe file: posix_allocate: operation not supported on socket
```
- This issue can be worked around by providing Docker with a different filesystem (which is also better for Docker-in-Docker use cases).
- A possible solution can be found here: [https://github.com/rancher/k3s/issues/1688#issuecomment-619570374](https://github.com/rancher/k3s/issues/1688#issuecomment-619570374); a rough sketch of that approach follows.
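A minimal sketch of that workaround, assuming a ZFS pool named `tank` (hypothetical name) and Docker as the container runtime: create a zvol, format it with ext4, and mount it over Docker's data directory so the k3s nodes no longer run on ZFS.

```bash
# Hypothetical pool name "tank" and size 50G; adjust to your system.
sudo zfs create -V 50G tank/docker        # create a block device (zvol)
sudo mkfs.ext4 /dev/zvol/tank/docker      # format it with a filesystem Docker supports
sudo systemctl stop docker
sudo mkdir -p /var/lib/docker
sudo mount /dev/zvol/tank/docker /var/lib/docker
sudo systemctl start docker               # Docker now runs on ext4 instead of ZFS
# Add an /etc/fstab entry to make the mount persistent across reboots.
```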
## Pods evicted due to lack of disk space
- Symptom: Pods go into the `Evicted` state
- Related issues: [#133 - Pods evicted due to `NodeHasDiskPressure`](https://github.com/rancher/k3d/issues/133) (collection of #119 and #130)
- Background: Docker runs out of disk space for the k3d node containers, which triggers a hard eviction in the kubelet
- Possible [fix/workaround by @zer0def](https://github.com/rancher/k3d/issues/133#issuecomment-549065666):
- use a docker storage driver which cleans up properly (e.g. overlay2)
- clean up or expand docker root filesystem
- change the kubelet's eviction thresholds upon cluster creation:
```bash
k3d cluster create \
  --k3s-agent-arg '--kubelet-arg=eviction-hard=imagefs.available<1%,nodefs.available<1%' \
  --k3s-agent-arg '--kubelet-arg=eviction-minimum-reclaim=imagefs.available=1%,nodefs.available=1%'
```
## Restarting a multi-server cluster or the initializing server node fails
- What you do: You create a cluster with more than one server node and later, you either stop `server-0` or stop/start the whole cluster
- What fails: After the restart, you cannot connect to the cluster anymore and `kubectl` will give you a lot of errors
- What causes this issue: it's a [known issue with dqlite in `k3s`](https://github.com/rancher/k3s/issues/1391) which doesn't allow the initializing server node to go down
- What's the solution: Hopefully, this will be solved by the planned [replacement of dqlite with embedded etcd in k3s](https://github.com/rancher/k3s/pull/1770)
- Related issues: [#262](https://github.com/rancher/k3d/issues/262)
## Passing additional arguments/flags to k3s (and on to e.g. the kube-apiserver)
- The Problem: Passing a feature flag to the Kubernetes API Server running inside k3s.
- Example: you want to enable the EphemeralContainers feature flag in Kubernetes
- Solution: `#!bash k3d cluster create --k3s-server-arg '--kube-apiserver-arg=feature-gates=EphemeralContainers=true'`
- **Note**: Be aware of where the flags require dashes (`--`) and where not.
- the k3s flag (`--kube-apiserver-arg`) has the dashes
- the kube-apiserver flag `feature-gates` doesn't have them (k3s adds them internally)
- Second example:
```bash
k3d cluster create k3d-one \
  --k3s-server-arg --cluster-cidr="10.118.0.0/17" \
  --k3s-server-arg --service-cidr="10.118.128.0/17" \
  --k3s-server-arg --disable=servicelb \
  --k3s-server-arg --disable=traefik \
  --verbose
```
- **Note**: There are many ways to use the `"` and `'` quotes; just be aware that shells sometimes try to interpret/interpolate parts of commands, as the sketch below shows.
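For instance, the eviction thresholds shown earlier contain `<` characters, which an unquoted shell word treats as input redirection; quoting passes the argument through verbatim. A small illustration (nothing k3d-specific is assumed here):

```bash
# Unquoted, the shell interprets '<' as input redirection and mangles the argument:
#   echo eviction-hard=imagefs.available<1%   # tries to read from a file named "1%"
# Single quotes preserve the argument exactly as written:
echo 'eviction-hard=imagefs.available<1%'
# Double quotes also prevent redirection, but still allow $-interpolation:
echo "eviction-hard=imagefs.available<1%"
```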
## How to access services (like a database) running on my Docker Host Machine
- As of version v3.1.0, we're injecting the `host.k3d.internal` entry into the k3d containers (k3s nodes) and into the CoreDNS ConfigMap, enabling you to access your host system by referring to it as `host.k3d.internal` (see the example below).
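A quick way to verify this (a sketch; it assumes a service such as PostgreSQL listening on port 5432 of your host, and that your kubeconfig points at the k3d cluster):

```bash
# Spin up a temporary pod and connect to the host machine through the
# injected DNS name; the "postgres" image and port 5432 are assumptions.
kubectl run -it --rm pg-client --image=postgres:13 --restart=Never -- \
  psql -h host.k3d.internal -p 5432 -U postgres
```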
## Running behind a corporate proxy
Running k3d behind a corporate proxy can lead to several issues that have already been reported more than once.
Some can be fixed by passing the `HTTP_PROXY` environment variables to k3d, some have to be fixed in Docker's `daemon.json` file, and some are as easy as adding a volume mount.
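As a starting point, the proxy environment variables can be injected into all node containers at creation time (a sketch; the proxy address is a placeholder, and `--env` is the k3d v4 flag for setting environment variables on nodes):

```bash
# http://proxy.corp.example:3128 is a hypothetical proxy address; adjust to yours.
k3d cluster create behind-proxy \
  --env 'HTTP_PROXY=http://proxy.corp.example:3128' \
  --env 'HTTPS_PROXY=http://proxy.corp.example:3128' \
  --env 'NO_PROXY=localhost,127.0.0.1,.svc,.cluster.local'
```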
## Pods fail to start: `x509: certificate signed by unknown authority`
- Example Error Message:
```bash
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "docker.io/rancher/pause:3.1": failed to pull image "docker.io/rancher/pause:3.1": failed to pull and unpack image "docker.io/rancher/pause:3.1": failed to resolve reference "docker.io/rancher/pause:3.1": failed to do request: Head https://registry-1.docker.io/v2/rancher/pause/manifests/3.1: x509: certificate signed by unknown authority
```
- Problem: inside the container, the certificate of the corporate proxy cannot be validated
- Possible Solution: Mounting the CA Certificate from your host into the node containers at start time via `k3d cluster create --volume /path/to/your/certs.crt:/etc/ssl/certs/yourcert.crt`
- Issue: [rancher/k3d#535](https://github.com/rancher/k3d/discussions/535#discussioncomment-474982)
## Spurious PID entries in `/proc` after deleting `k3d` cluster with shared mounts
- When you create and delete clusters multiple times with the **same cluster name** and **shared volume mounts**, `grep k3d /proc/*/mountinfo` may show many spurious entries
- Problem: as a result, you may at times see `no space left on device: unknown` when a pod is scheduled to the nodes
- If you observe anything like this, you can check for inaccessible file systems and unmount them with the command below (note: remove the `xargs umount -l` part first and review the diff output; see the expanded version after this list)
- `diff <(df -ha | grep pods | awk '{print $NF}') <(df -h | grep pods | awk '{print $NF}') | awk '{print $2}' | xargs umount -l`
- As per the conversation on [rancher/k3d#594](https://github.com/rancher/k3d/issues/594#issuecomment-837900646), this issue wasn't reported/known earlier, so chances are high that it's not universal.
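The same one-liner, split into a dry run followed by the actual unmount:

```bash
# Dry run: list mount points that appear in `df -ha` (all filesystems) but
# not in `df -h` (accessible ones), i.e. candidates for stale mounts.
diff <(df -ha | grep pods | awk '{print $NF}') \
     <(df -h  | grep pods | awk '{print $NF}') | awk '{print $2}'

# Only if the list looks right: lazily unmount the stale entries.
diff <(df -ha | grep pods | awk '{print $NF}') \
     <(df -h  | grep pods | awk '{print $NF}') | awk '{print $2}' | xargs umount -l
```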
## [SOLVED] Nodes fail to start or get stuck in `NotReady` state with log `nf_conntrack_max: permission denied`
### Problem
- When: This happens when creating a new cluster with k3d on a Linux system with a kernel version >= 5.12.2 (and others, like >= 5.11.19, that received the backported change)
- the node(s) stop or get stuck with a log line like this: `<TIMESTAMP> F0516 05:05:31.782902 7 server.go:495] open /proc/sys/net/netfilter/nf_conntrack_max: permission denied`
- Why: The issue was introduced by a change in the Linux kernel ([Changelog 5.12.2](https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.12.2): [Commit](https://github.com/torvalds/linux/commit/671c54ea8c7ff47bd88444f3fffb65bf9799ce43)), that changed the netfilter_conntrack behavior in a way that `kube-proxy` is not able to set the `nf_conntrack_max` value anymore
### Workaround
- Workaround: as a workaround, we can tell `kube-proxy` to not even try to set this value:
```bash
k3d cluster create \
  --k3s-server-arg "--kube-proxy-arg=conntrack-max-per-core=0" \
  --k3s-agent-arg "--kube-proxy-arg=conntrack-max-per-core=0" \
  --image rancher/k3s:v1.20.6-k3s1
```
### Fix
- **Note**: k3d v4.4.5 already uses `rancher/k3s:v1.21.1-k3s1` as the new default k3s image, so no workaround is needed there!
- This was fixed "upstream" in k3s itself in [rancher/k3s#3337](https://github.com/k3s-io/k3s/pull/3337) and backported to k3s versions as low as v1.18.
- **The fix was released and backported in k3s, so you don't need the workaround when using one of the following k3s versions (or later ones)**:
- v1.18.19-k3s1 ([rancher/k3s#3344](https://github.com/k3s-io/k3s/pull/3344))
- v1.19.11-k3s1 ([rancher/k3s#3343](https://github.com/k3s-io/k3s/pull/3343))
- v1.20.7-k3s1 ([rancher/k3s#3342](https://github.com/k3s-io/k3s/pull/3342))
- v1.21.1-k3s1 ([rancher/k3s#3341](https://github.com/k3s-io/k3s/pull/3341))
- Issue Reference: [rancher/k3d#607](https://github.com/rancher/k3d/issues/607)
## DockerHub Pull Rate Limit
### Problem
You're deploying something to the cluster using an image from DockerHub and the image fails to be pulled, with a `429` response code and a message saying `You have reached your pull rate limit. You may increase the limit by authenticating and upgrading`.
### Cause
This is caused by DockerHub's pull rate limit (see <https://docs.docker.com/docker-hub/download-rate-limit/>), which limits unauthenticated/anonymous users to 100 pulls per 6 hours and authenticated, non-paying users to 200 pulls per 6 hours (as of the time of writing).
### Solution
a) use images from a private registry, e.g. configured as a pull-through cache for DockerHub
b) use a different public registry without such limitations, if the same image is stored there
c) authenticate containerd inside k3s/k3d to use your DockerHub user
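For option (a), a pull-through mirror can be configured via the same `registries.yaml` mechanism shown below for option (c) (a sketch; the registry address is a placeholder for your own cache):

```yaml
# registry.mycorp.example:5000 is a hypothetical pull-through cache for DockerHub.
mirrors:
  "docker.io":
    endpoint:
      - "http://registry.mycorp.example:5000"
```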
#### (c) Authenticate Containerd against DockerHub
1. Create a registry configuration file for containerd:
```yaml
# saved as e.g. $HOME/registries.yaml
configs:
  "docker.io":
    auth:
      username: "$USERNAME"
      password: "$PASSWORD"
```
2. Create a k3d cluster using that config:
```bash
k3d cluster create --registry-config $HOME/registries.yaml
```
3. Profit. That's it. In our test, we pulled the same image 120 times in a row (and confirmed that the pull count went up) without being rate limited (as a non-paying, normal user).
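To double-check that the configuration actually landed inside the nodes (a sketch; the container name assumes the default cluster name `k3s-default`):

```bash
# k3d mounts the registry configuration into each node at this path,
# where k3s/containerd picks it up.
docker exec k3d-k3s-default-server-0 cat /etc/rancher/k3s/registries.yaml
```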