July 1, 2019 · Kubernetes GCP

K8s: Stopping kube-system pods from preventing cluster scale-down

I'm now at the point where I run a fair few production services from Kubernetes, the most successful being departureboard.io and the departureboard.io API. To make sure I can meet the fluctuating demand for these services, in a cost-efficient fashion, I take advantage of the Kubernetes Cluster Autoscaler. I won't go into too much detail about how this works, but in short, it enables my cluster to automatically scale up if the total resource requests from my application pods exceed the available capacity in the cluster.

This works excellently, and I've been able to successfully load-test the departureboard.io API to over 3000 requests per second without any manual intervention. However, when setting this up, I ran into a challenge where my cluster would scale up as expected, but would not fully scale down despite low resource utilisation.

After some investigation I found that the cause was the deployments in the kube-system namespace, such as metrics-server and kube-dns.

Whenever my cluster saw a significant increase in load, the Kubernetes Cluster Autoscaler would kick in as expected and add additional nodes to my cluster. Once the new nodes were added and in the "Ready" state, some of the pods in the kube-system namespace would occasionally be rescheduled onto them.

But that causes a slight problem...

Running kube-system pods on auto-scaled nodes isn't a problem in itself. But it does cause problems when it is time for your cluster to scale down. If we take a look at the cluster-autoscaler FAQ, it has an interesting statement that you could almost miss:

By default, kube-system pods prevent CA from removing nodes on which they are running. Users can manually add PDBs for the kube-system pods that can be safely rescheduled elsewhere

This means that if at any point kube-system pods are scheduled on a node that is in an auto-scaling pool, the Kubernetes Cluster Autoscaler will not automatically remove that node (even if it is hardly being utilised). This is because it will not voluntarily disrupt any kube-system pods.

Typically, this is for good reason. The applications running in the kube-system namespace are critical to the operation of the cluster. For example, kube-dns provides DNS resolution for every workload in the cluster, and metrics-server supplies the resource metrics that features such as the Horizontal Pod Autoscaler rely on.

The recommended solution to this problem is to define a Pod Disruption Budget for these kube-system deployments, telling the cluster that it is ok to disrupt the deployment's pods, but only in a controlled fashion.

This is a perfectly valid strategy, and it does mitigate the problem for some of the kube-system services: deployments with multiple replicas, such as kube-dns, can be disrupted with minimal impact, as long as it is done in a controlled fashion.
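
As an illustration, a Pod Disruption Budget along these lines tells the cluster that at most one kube-dns pod may ever be voluntarily evicted at a time. Treat it as a sketch only: the k8s-app: kube-dns label and the policy/v1beta1 API version are my assumptions about a GKE cluster of this era, so check the labels and API versions on your own cluster first.

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: kube-dns-pdb
  namespace: kube-system
spec:
  maxUnavailable: 1        # never voluntarily evict more than one kube-dns pod at a time
  selector:
    matchLabels:
      k8s-app: kube-dns    # assumed label; verify with: kubectl get pods -nkube-system --show-labels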

However, it does not solve the problem for some of the other kube-system deployments, such as metrics-server, which has a single replica and cannot be disrupted without a wider cluster impact.

The solution...

In order for you to understand how the solution I settled on works, you first need to understand the architecture of my cluster.

The cluster I use for the majority of my applications has 3 Node Pools: a Default Pool of 3 always-on nodes spread across 3 availability zones, a Preemptible Instance pool that auto-scales up to a maximum of 6 nodes when load demands it, and a standard node pool* that auto-scales to provide any further capacity on top of that.

As you can see from the above architecture, I only provision additional capacity into the cluster when the load demands it. When I do provision additional capacity, I keep it running for as short a time as possible.

Consequently, I don't want any kube-system pods being scheduled on the auto-scaled nodes, because these nodes are typically going to be short-lived, and I want to avoid disruption to the critical kube-system pods where possible. It isn't the end of the world if these pods occasionally have to be rescheduled (as in a node failure scenario), but I definitely don't want them being regularly moved around.

So my solution was to prevent Kubernetes from scheduling any kube-system pods on nodes belonging to an auto-scaling node pool. This has a couple of benefits: the Cluster Autoscaler is free to remove auto-scaled nodes as soon as they are no longer needed, and the critical kube-system pods are never disrupted as those nodes come and go.

How I achieved it

Enter nodeSelector. nodeSelector is a node selection constraint that enables you to limit which nodes a pod can be scheduled on. It is a pretty basic Kubernetes feature.
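
To illustrate the idea with a minimal (entirely hypothetical) example, a nodeSelector is just a set of key/value pairs in the pod spec, and the pod will only be scheduled onto nodes whose labels match every pair:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod          # hypothetical pod, purely for illustration
spec:
  nodeSelector:
    disktype: ssd            # only schedule onto nodes labelled disktype=ssd
  containers:
  - name: app
    image: nginx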

The 3 default nodes that I have in my Kubernetes cluster are pretty beefy, and have more than enough resources to handle all of the scaling needs of the kube-system services. They are also split across 3 AZs, so redundancy isn't a concern.

Due to my architecture, with its 3 separate node pools, I decided to solve the scale-down problem by limiting the kube-system services to the reliable, always-present default nodes. This ensures that the kube-system pods are only ever scheduled on a node in the Default Pool, while other workloads (which are much more tolerant to disruption) are moved to other nodes if required.

When GKE creates a node, it automatically labels it with the node pool it belongs to. Running kubectl describe node <node-name> returns the labels assigned to any particular node, for example:

kubectl describe node gke-k8s-maynard-io-prod--default-pool-add39cab-0c3p
Name:               gke-k8s-maynard-io-prod--default-pool-add39cab-0c3p
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/fluentd-ds-ready=true
                    beta.kubernetes.io/os=linux
                    cloud.google.com/gke-nodepool=default-pool
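
If you just want to see which pool each node belongs to, rather than reading the full describe output, kubectl's -L flag prints a label as an extra column (a quick sanity check I find handy; it isn't required for anything that follows):

kubectl get nodes -L cloud.google.com/gke-nodepool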

All that then needs to be done to limit a deployment's pods to a specific node pool is to set the nodeSelector property in the deployment's pod template.

Google automatically creates the deployments for the kube-system services like metrics-server, kube-dns, etc., and these deployments are created without any nodeSelector specified. You can use the kubectl patch command to add one, updating the existing deployment in place without having to specify the full deployment spec.

To do this, create a patch.yml file containing the following:

spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: default-pool

You then apply the patch to the deployment in the kube-system namespace. For example, to patch metrics-server:

kubectl patch deployment metrics-server-v0.3.1 --patch "$(cat patch.yml)" -nkube-system

Which returns a response like so:

deployment.extensions/metrics-server-v0.3.1 patched

And that's it: the metrics-server pods will now only ever be scheduled on nodes in the Default Pool. You can of course substitute the cloud.google.com/gke-nodepool: default-pool label for whatever suits your setup. If your architecture is like mine, you should no longer see metrics-server scheduled on auto-scaled node pools, and it will no longer block cluster scale-down.
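
If you want to double-check where the patched pod has landed (a sanity check I'd suggest rather than a required step), the -o wide output includes the node name, which on GKE contains the node pool:

kubectl get pods -nkube-system -o wide | grep metrics-server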

Quickly applying this to all deployments in kube-system

In order not to run into any kube-system-induced scale-down issues, you need to patch all of the kube-system deployments. Manually patching each one is time-consuming and inefficient. It can also be somewhat difficult to script, due to the way that GKE names the deployments of services such as metrics-server:

Benjamins-MacBook-Pro:~ benmaynard$ kubectl get deployments -nkube-system | grep -E 'metrics|NAME'
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
metrics-server-v0.3.1    1/1     1            1           63d

As you can see above, the version of metrics-server is included in the deployment name, which can change over time and make scripting difficult. To get around this I wrote a bash script that is automatically executed on cluster provisioning. It uses kubectl to get a list of all of the deployments in the kube-system namespace, and loops through them, applying the patch.yml manifest to each one.

#!/bin/bash
# Resolve the directory this script lives in, so patch.yml is found regardless of the working directory.
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
# Patch every deployment in the kube-system namespace (skipping the header row of kubectl's output).
kubectl get deployments -nkube-system | awk 'FNR > 1 { print $1 }' | while read -r x; do kubectl patch deployment "$x" --patch "$(cat "$DIR/patch.yml")" -nkube-system; done

This prevents any of the deployments in the kube-system namespace from having pods scheduled on any node pool other than the Default Pool, leaving my cluster free to scale up and down at will without worrying about disrupting the critical kube-system pods.
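
As an aside, kubectl can also emit resource names directly with -o name, which avoids parsing the table output with awk. A rough equivalent of the script above (an alternative I'd consider, not what I actually run) would be:

kubectl get deployments -nkube-system -o name | while read -r d; do kubectl patch "$d" --patch "$(cat patch.yml)" -nkube-system; done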


* The standard node pool will also be used if the Preemptible Instance node pool is already at the maximum scale size of 6 nodes.
