Tidbits | April 11, 2019

Pro-Tip – jetstack/cert-manager on GKE Private Clusters

by Stephen Spencer

Kubernetes Admission Controllers

If, dear reader, you are not familiar with this controller type, I encourage you to hit up your favorite search engine; plenty of far less obtuse descriptions of their use and implementation exist.

This post focuses on their use by Jetstack's cert-manager controller and how to make it happy running on a GKE private cluster.

The Point

The cert-manager webhook process is an example of a validating admission webhook. Generically, VA webhooks are a useful tool to enforce policy decisions; a common example is denying submission of manifests requesting container images from non-private registries.

Do One Thing Well

The webhook is responsible for making sure any manifests that match one of its CRDs are syntactically and structurally valid before being submitted to the actual cert-manager controller. This takes the validation load off the controller and relieves it of the overhead of processing connections that carry invalid manifest data.

The Context

Google Kubernetes Engine (private)

The Reader's Digest version: communication from the master subnet is restricted. Nodes are not granted public addresses. Users are charged for Kubernetes nodes. Master functionality is provided via a shared environment.

The Problem

NOTE: cert-manager already in place


TL;DR

The webhook registration includes connection info for the webhook process. GKE private clusters do not allow connections from the master network to the service network on port 6443.


apiVersion: certmanager.k8s.io/v1alpha1
kind: ClusterIssuer
metadata:
  annotations: {}
  name: something
spec:
  acme:
    ca: {}

... submitting a manifest like the one above graces us with:

Internal error occurred: failed calling admission webhook "clusterissuers.admission.certmanager.k8s.io": the server is currently unable to handle the request

That error message is absolutely correct and just as unhelpful.

The server? Yeah... let me just find... the server. Are we talking about... maybe the API server? Or the cert-manager controller? Maybe we're just talking about the guy that just brought the check to the table...


Thanks to the author of an issue for the old cert-manager helm chart, it is now common(ish?) knowledge that TCP:6443 is the listening port for the cert-manager webhook. The cert-manager-webhook pod runs on a non-master node. In this environment, user workloads aren't deployable to master nodes because... there aren't any.

The Kube API is still a process. It runs.

.

.

.

Somewhere.

It is where the webhook has been registered. The webhook itself is the process that waits patiently for requests to validate relevant manifests! And it will continue to wait. Every time the API receives a properly authz'd connection with a cert-manager-related payload, the aforementioned error will be delivered because the API can't connect to the webhook service.
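To see what the API has been told to call, it can help to list the registered validating webhooks. What follows is a minimal sketch, assuming kubectl is already pointed at the cluster. The function name list_cert_manager_webhooks and the "cert-manager" name filter are illustrative assumptions; registration names vary between chart versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list validating webhook registrations whose name
# mentions cert-manager. The name filter is an assumption; adjust as needed.
list_cert_manager_webhooks() {
  kubectl get validatingwebhookconfigurations \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' \
    | grep -i 'cert-manager' || true   # no match is not treated as an error
}
```

Calling list_cert_manager_webhooks prints one registration name per line. Feeding a name to kubectl get validatingwebhookconfiguration [name] -o yaml then shows the clientConfig the API uses to reach the webhook.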

Because this needs a bonus...

When a namespace is deleted, the relevant controller goes through a house-keeping process, walking every registered CRD and built-in object type and removing all objects of each type from the namespace before actually deleting it. The admission controllers registered with the API fire during the course of this process. If one of these fails, the namespace remains in a Terminating state until the failing webhook is either deregistered or is eventually able to resolve its requests.

In retrospect, this makes sense, though seeing a namespace that was deleted yesterday still present and "terminating" is rather disturbing.
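A quick way to spot namespaces caught in this state is to filter the namespace list by status. This is only a sketch: the function name terminating_namespaces is made up for illustration, and it assumes the default kubectl column layout of NAME, STATUS, AGE.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of namespaces stuck in Terminating.
# Relies on the default `kubectl get namespaces` columns: NAME STATUS AGE.
terminating_namespaces() {
  kubectl get namespaces --no-headers | awk '$2 == "Terminating" { print $1 }'
}
```

If terminating_namespaces still prints a namespace long after it was deleted, a failing admission webhook is one of the things worth ruling out.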

Because the bonus needs icing...

The aforementioned namespace problem also rears its head when cordoning a node for upgrades. The node will never reach a state of readiness (anti-readiness) that indicates the instance is ready for destruction. (First noticed with kops)

The Solution

GCE VPC firewall rules select their source by source tag, IP range or service account. We know the source range for the master network from when the cluster was created (in our case: 172.16.0.0/28). The target can only be selected via target tag or service account.
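If the master range was not written down at creation time, it can be read back from the cluster description. A small sketch, assuming a zonal private cluster (a regional cluster would take --region instead of --zone); the function name master_cidr and its arguments are placeholders for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the master CIDR of a private GKE cluster.
# $1 = cluster name, $2 = zone (both supplied by the caller).
master_cidr() {
  gcloud container clusters describe "$1" --zone "$2" \
    --format='value(privateClusterConfig.masterIpv4CidrBlock)'
}
```

For example, master_cidr [name of cluster] [zone] should print the range that was passed as the master CIDR when the cluster was created.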

obtain the GCE instance name for a cluster node:

gcloud compute instances list

display the GCE tags for that node:

gcloud compute instances describe --format=json [name of instance] | jq .tags.items

[
  "gke-revsys-production-deux-a2f2de43-node"
]

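create the firewall rule (the rule name is yours to choose; add --network [name of network] if the cluster is not on the default network):

gcloud compute firewall-rules create [name of rule] \
  --source-ranges 172.16.0.0/28 \
  --target-tags gke-revsys-production-deux-a2f2de43-node \
  --allow tcp:6443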

That's it. With that command, three poorly logged, unremarked error states are done away with. I hope this short post is helpful.
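For anyone who has to repeat this on several clusters, the tag lookup and rule creation can be wrapped into one function. This is a hedged sketch, not a finished tool: the function name allow_master_to_webhook and its argument order are invented here. It also assumes the node's first network tag is the GKE node tag (as in the jq output earlier) and that the cluster sits on the default network.

```shell
#!/usr/bin/env bash
# Hypothetical helper tying the tag lookup and rule creation together.
# $1 = rule name, $2 = node instance name, $3 = instance zone, $4 = master CIDR
# Assumes the first network tag on the node is the GKE node tag and that
# the cluster uses the default network (add --network otherwise).
allow_master_to_webhook() {
  local rule="$1" instance="$2" zone="$3" cidr="$4" tag
  # Look up the node's first network tag.
  tag="$(gcloud compute instances describe "$instance" --zone "$zone" \
    --format='value(tags.items[0])')" || return 1
  [ -n "$tag" ] || { echo "no network tag found on $instance" >&2; return 1; }
  # Let the master range reach the webhook port on tagged nodes.
  gcloud compute firewall-rules create "$rule" \
    --source-ranges "$cidr" \
    --target-tags "$tag" \
    --allow tcp:6443
}
```

A call would look like allow_master_to_webhook [name of rule] [name of instance] [zone] 172.16.0.0/28.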

Now, on that note, go do something interesting.

how to train your validating admission controller webhook without losing a hand
