How to Scale Using Kubernetes: From Startup to Superstar

Michael Handa | Amazon Web Services, Cloud Technology, Containerization, Kubernetes, Microservices, Microsoft Azure

Kubernetes adoption has skyrocketed over the last few years. No matter which statistics you look at, you’re bound to see high adoption rates. Sysdig’s 2019 Container Usage Report states that 77 percent of containers now run on Kubernetes. That figure doesn’t include Red Hat OpenShift and Rancher, which are also built on Kubernetes; counting them takes the figure up to 89 percent.

So what’s the secret? Why do organizations, big and small, embrace Kubernetes so willingly? Kubernetes comes with many benefits out of the box: agility, cost-efficiency, reliability, ease of deployment, and, most important, scalability. Best of all, you can automate most scaling tasks, and scaling your Kubernetes workloads correctly also yields considerable cost benefits.

Kubernetes provides various options for autoscaling containerized applications at both the application abstraction layer and the infrastructure layer. Let’s get right into it.

Autoscaling at the Application Abstraction Layer

Kubernetes supports two types of autoscaling – horizontal and vertical — to scale the resources available for your containers:

Horizontal Autoscaling

Horizontal autoscaling is performed by the Horizontal Pod Autoscaler (HPA), which is implemented as a Kubernetes API resource and a controller. The HPA automatically scales the number of pods in a replication controller or deployment based on observed CPU utilization. With custom metrics support, it can also scale on application-specific criteria.

The number of pods directly affects your cloud cost because the number of nodes you require for your application depends on it. We’ll look at how scaling affects cloud spending later in this article.

The controller periodically assesses the metrics associated with it and adjusts the number of replicas in the replication controller or deployment to match the target specified by the administrator. The default period is 15 seconds, but it can be changed via the controller manager’s --horizontal-pod-autoscaler-sync-period flag.

The HPA is implemented as a control loop: during each period, the controller manager compares the relevant metrics against the targets defined in the HorizontalPodAutoscaler definition and decides on the resulting action. Metrics are obtained from the resource metrics API for pod resource metrics, like CPU utilization, or from the custom metrics API for everything else.

The HPA usually waits 3 minutes after scaling up to allow metrics to stabilize. It also waits 5 minutes before scaling down, in order to avoid autoscaler thrashing: unnecessarily scaling resources up and down due to frequent fluctuations in metrics.
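To make this concrete, here is a minimal HPA manifest sketch. The Deployment name, replica bounds, and CPU target are all illustrative, and older clusters may expose this resource under the autoscaling/v2beta2 API version instead:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web              # hypothetical Deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60   # add pods when average CPU exceeds 60%
```

With this in place, the control loop described above adjusts the Deployment between 2 and 10 replicas to hold average CPU utilization near 60 percent.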

Vertical Autoscaling

Vertical autoscaling is another way of scaling your applications based on usage and other criteria. The Vertical Pod Autoscaler (VPA) is relatively new; it scales the amount of resources, such as CPU and memory, available to existing pods. This is done periodically, with a default period of 10 seconds.

VPA mainly caters to stateful services and Out of Memory (OOM) events. However, it can also be used with stateless services for specialized tasks like auto-correcting the initial resource allocations of your pods. We’ll also be looking at the pricing impact of not using the right-sized nodes, as it can be considerable with larger applications.

One thing to remember is that VPA requires pods to be restarted for resource scaling to take effect. That said, it isn’t something you need to worry about, because the VPA respects the Pod Disruption Budget (PDB) and ensures that the minimum number of required pods remains available.
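A PDB is a small manifest of its own. This sketch, with a hypothetical app label and threshold, tells Kubernetes never to let voluntary disruptions (such as VPA restarts) take the matching pods below two (clusters before Kubernetes 1.21 use the policy/v1beta1 API version):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb            # hypothetical name
spec:
  minAvailable: 2          # keep at least 2 pods running at all times
  selector:
    matchLabels:
      app: web             # hypothetical label on the protected pods
```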

VPA allows you to define upper and lower limits for resource scaling so that the resource limits of your nodes are never exceeded. The VPA Recommender is probably the most useful of all the features offered by VPA: it analyzes historical resource usage and other metrics, and provides recommendations for optimal resource values.
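Put together, a VPA object might look like the following sketch; the target Deployment and the per-container bounds are assumptions, and it requires the VPA components to be installed in the cluster first:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa            # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web              # hypothetical Deployment to right-size
  updatePolicy:
    updateMode: "Auto"     # "Off" yields Recommender suggestions only
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:          # lower scaling bound
        cpu: 100m
        memory: 128Mi
      maxAllowed:          # upper bound, kept within node capacity
        cpu: "2"
        memory: 2Gi
```

Setting updateMode to "Off" is a low-risk way to start: you get the Recommender’s suggested values without any pod restarts.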

Autoscaling at the Infrastructure Layer

Pod autoscaling can take you a long way toward high performance and cost efficiency. In some instances, however, heavily used apps may require more capacity than the cluster has nodes. You could resolve this by adding clusters or nodes, but since there are many benefits to keeping an application in a single cluster, it usually makes more sense to scale the number of nodes within your cluster.

Cluster Autoscaling

The Cluster Autoscaler (CA) automatically manages the number of nodes within a cluster based on pending pods. Pods remain in the pending state when there is a resource shortage within the cluster. The CA assesses the number of nodes required and, as of Kubernetes version 1.18, automatically interfaces with many cloud providers, including AWS, Azure, and Google Cloud. Some providers, such as AWS, offer comprehensive CA features to optimize cost benefits.

Once new nodes are made available by the provider, the Kubernetes scheduler allocates the pending pods to them. The process is then repeated if there are any more pods in the pending state.

The CA identifies pending pods within 30 seconds. Once more nodes are added, however, it waits 10 minutes after a node is no longer needed before scaling it down. Use cluster autoscaling wisely: enabling it across too many workloads can considerably reduce your flexibility to scale down.
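The timings above are controlled by flags on the cluster-autoscaler Deployment itself. This fragment shows the relevant container arguments for an AWS setup; the node group name and bounds are hypothetical, and the defaults shown here match the behavior described above:

```yaml
# Container spec fragment from a cluster-autoscaler Deployment (AWS example)
containers:
- name: cluster-autoscaler
  image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.18.1
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --nodes=2:10:my-node-group       # min:max:ASG name (hypothetical)
  - --scan-interval=10s              # how often pending pods are checked
  - --scale-down-unneeded-time=10m   # wait before removing an unneeded node
```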

Best Practices for Kubernetes Autoscaling

We looked at three separate options for autoscaling. However, it is important to note that you can’t simply turn all three of them on and expect everything to work perfectly. So let’s look at some of the best practices when it comes to autoscaling.

  • Ensure that you are using the latest versions of Kubernetes, and the latest compatible versions of HPA, VPA, and CA. This will ensure that autoscaling works at optimum levels, without any conflicts.
  • Avoid using HPA and VPA on the same set of pods, as they are currently incompatible. You can combine them only if HPA relies solely on external or custom metrics, so that the two do not overlap in their calculations.
  • CA supports node groups with mixed resource limits. However, it is recommended that homogeneous nodes be used, because the CA assumes that all nodes in a group have the same resource limits.
  • Ensure that resource requests are specified on every container, so that autoscaling calculations are not misguided.
  • Install a metrics server, as HPA and VPA depend on per-pod metrics for autoscaling. Opt for custom metrics, as external metrics APIs can lead to security issues. In addition, Prometheus is recommended for VPA, as it depends on historical data regarding utilization and OOM events.
  • Specify the PodDisruptionBudget so that CA does not reduce the number of nodes below what is required.
  • The CA service levels are guaranteed for up to 1,000 nodes running 30 pods each. Exceeding these limits can lead to cluster sprawl, which means that resources will not be used to their full potential.
  • Over-provision clusters to allow for headroom. Pending pods are identified within 30 seconds, but this time can double for larger clusters, and the delay, combined with the time to provision new nodes, can have a considerable impact on performance. It is therefore recommended to provision low-priority placeholder pods so that spare capacity is already warm when real workloads need it.
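The over-provisioning pattern from the last bullet is commonly implemented with a negative-priority class and a deployment of pause containers; everything here (names, replica count, reservation sizes) is illustrative:

```yaml
# Placeholder pods reserve capacity that real workloads can preempt,
# so the Cluster Autoscaler has already provisioned the nodes.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                   # lower than any real workload's priority
globalDefault: false
description: "Priority class for capacity-reserving placeholder pods"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning     # hypothetical name
spec:
  replicas: 2                # how much headroom to hold
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: reserve
        image: k8s.gcr.io/pause:3.2   # does nothing; only holds resources
        resources:
          requests:
            cpu: "1"         # capacity reserved per placeholder pod
            memory: 1Gi
```

When a real pod arrives and the cluster is full, the scheduler evicts a placeholder to make room immediately, and the evicted placeholder goes pending, which in turn triggers the CA to add a node in the background.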

Additional Cost Benefits of Scaling

In addition to the autoscaling benefits offered natively by Kubernetes, you can also use vendor-specific features to increase savings. AWS is one of the most recommended providers of managed Kubernetes. AWS Elastic Kubernetes Service (EKS) supports all the autoscaling methods discussed in this article.

A case study recently published by AWS explains how to achieve up to 80 percent cost savings through autoscaling, adjusting node resource limits, switching up to 70 percent of nodes to reserved instances, and scheduling off-peak downscaling. A typical large-scale application that requires 100 CPU cores and 400 GB of memory can cost about $45,000 per year (with m4.large EC2 instances billed at 10 cents per hour and cluster management at 20 cents per cluster, per hour), so the ability to save even half of that, let alone 80 percent, is a significant advantage.
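That $45,000 figure is straightforward to reproduce. This sketch assumes the article’s numbers plus an m4.large shape of 2 vCPUs and 8 GB of memory, and sizes the node count by whichever of CPU or memory is the binding constraint:

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760

def annual_eks_cost(cores_needed, mem_gb_needed,
                    node_cores=2, node_mem_gb=8,
                    node_hourly=0.10, mgmt_hourly=0.20):
    """Rough annual cost: enough m4.large-class nodes to cover both CPU
    and memory demand, plus a flat per-cluster management fee."""
    nodes = max(math.ceil(cores_needed / node_cores),
                math.ceil(mem_gb_needed / node_mem_gb))
    return (nodes * node_hourly + mgmt_hourly) * HOURS_PER_YEAR

print(round(annual_eks_cost(100, 400)))  # prints 45552
```

Here 100 cores and 400 GB both resolve to 50 nodes, giving (50 × $0.10 + $0.20) × 8,760 ≈ $45,552 per year, in line with the article’s estimate, and making clear why even a 50 percent reduction in node-hours is worth pursuing.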

Google Kubernetes Engine (GKE) offers slightly lower pricing than AWS, with typical computing instances like n1-standard-2 being offered at 9.5 cents per hour. Also, Google will charge for cluster management, effective June 1, at 20 cents per hour, per zonal cluster. This brings the yearly cost on GKE to about $43,000.

Azure Kubernetes Service (AKS) pricing does not differ much. It saves you the cluster-management fee but charges a higher rate for worker nodes, at 11 cents per hour, for a total of about $48,000 a year.

While the top three managed Kubernetes providers offer quite competitive rates, AWS EKS offers the most benefits through its many scaling-related features. It commands 77 percent of the Kubernetes market, even though it is not the cheapest of the lot, as a result of the benefits it offers via scaling.

We looked at three solutions for autoscaling and the best practices to optimize their use. These autoscaling options can work for you irrespective of the size of your application and Kubernetes infrastructure. But using them correctly will definitely take you from Startup to Superstar!

Ready to take advantage of autoscaling? Contact us today.