Mitigating disruption during Amazon EKS cluster upgrade with blue/green deployment

Mục lục

Introduction
In-Place vs. Blue/Green upgrade strategies
Upgrade cluster process
1. Prerequisite
2. Update manifests
3. Bootstrap new cluster
4. Re-deploy add-ons and third-party tools with compatible version
5. Re-deploy workloads
6. Verify workloads
7. DNS Switchover
Stateful workloads migration
Conclusion

Introduction

Upgrading your Amazon EKS cluster version is necessary for security, performance optimization, new features, and long-term support. Nowadays, Amazon EKS introduces extended support plan for Kubernetes version that will cost you remarkably. The upgrade is never a easy game and can feel like a business continuity nightmare. Some may feel tempted to postpone the inevitable. In this blog, we will walk you through our upgrade process using the Blue/Green deployment strategy.

We'll demonstrate this on an EKS cluster with EC2 instances as worker nodes. This strategy can be also applied the same for Fargate, and we'll leverage the popular AWS Retail Store sample application to demonstrate the steps. For the code, head over to the AWS repository. By the end of this blog, you'll have a clear understanding of what an EKS upgrade entails and how to navigate it with confidence.

In-Place vs. Blue/Green upgrade strategies

Upgrading a cluster can be a balance between cost and risk. There are two common strategies that be widely used: in-place and blue/green upgrades.

In-Place Upgrades: Simpler and more cost-effective. This strategy will modify your existing cluster directly. While this minimizes resource usage, it carries the risk of downtime and limits upgrades to single versions at a time. Additionally, rolling back requires extra steps.
Blue/Green Upgrades: This strategy prioritizes zero downtime by creating a brand new, upgraded cluster (the "green" environment) alongside the existing one (the "blue" environment). Here, you can migrate workloads individually, enabling upgrades across multiple versions. However, blue/green deployment requires managing two clusters simultaneously, which can be costly and strain regional resource capacity. Additionally, API endpoints and authentication methods change, requiring updates to tools like kubectl and CI/CD pipelines.

In-place upgrade method is ideal for cost-sensitive scenarios where downtime is less critical or where the two versions don't have breaking changes. For situations demanding high availability or the ability to jump multiple versions, the blue/green strategy provides a safer solution but is also more resource-intensive and costly. Thoroughly consider your specific needs, resource constraints, and infra cost to determine the best suitable upgrade method for your cluster.

Upgrade cluster process

1. Prerequisite

Explore your cluster: Before diving into your cluster upgrade, system inventory is a mandatory step in order to have insight of what is running in your cluster. Note down your cluster version, add-on versions, and the number of services and applications running. This intel helps you choose the right upgrade strategy, identify potential compatibility issues, and plan a smooth migration for all your workloads. It's like gathering intel before a mission - the more you know, the smoother the upgrade!

The current cluster's version is 1.24 and it running on extended support

Currently 04 adds-on are running.

The cluster is using EC2 instances as worker nodes

Karpenter adds-on for node autoscaling.

Around 12 services found

The application UI

Assess the impact of new version upgrade: Thoroughly review the release notes for the EKS and Kubernetes versions you want to upgrade to in order to fully grasp important information such as breaking changes and deprecated APIs. For instance, if I want to upgrade to EKS 1.29, I will read the following documents:
- Release notes của EKS: https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions-standard.html
- Kubernetes change log: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.29.md
- Kubernetes new version release notes: https://kubernetes.io/blog/2023/12/13/kubernetes-v1-29-release
Backup EKS cluster (Optional)
Review and address deprecated APIs:
- Kubernetes may deprecates some APIs in new version. So, we need to identify and fix any usage of deprecated APIs within our workloads to ensure compatibility in new EKS version.
- It’s worth to read this deprecation policy to understand how Kubernetes deprecate APIs.
- There are several tools that help us find out the API deprecations in our clusters. One of them is “kube-no-trouble” aka kubent. At the time I write this document, the latest ruleset is for 1.29 in kubent. I run kubent with target version of 1.29 and got the below result. As you can see, kubent shows the deprecated APIs.

Deprecated APIs found by kubent

2. Update manifests

When we have deprecated APIs in our hands. For next steps, we need to update those API version by manually or tools such as “kubectl convert” that actually depends on number of deprecated APIs. We recommend you to update the API version manually to avoid any unforeseen error. For example, based on above kubent result, we see that our HPA apiVersion will be removed since version 1.26. This is original HPA manifest in the current EKS cluster v1.24 and updated HPA manifest in new version, respectively:

Old version

New version

3. Bootstrap new cluster

There are some typical options for a new Amazon EKS cluster deployment with your desired Kubernetes version such as AWS Management Console, eksctl tool, or Terraform. In this blog, we have deployed a new cluster, namely "green-eks", using version v1.29 and EC2 worker nodes.

New EKS cluster

EC2 worker nodes

4. Re-deploy add-ons and third-party tools with compatible version

Once the "green-eks" cluster is ready, we've re-deployed required custom add-ons and third-party tools. It's crucial to ensure those adds-on and third-party tools version are compatible with new cluster. For instance, this document shows us the suggested version of the Amazon VPC CNI add-on to use for each cluster version.

EKS adds-on in new cluster

5. Re-deploy workloads

Now that the foundation is laid, we can begin redeploying our workloads to the new "green-eks" cluster.

Application deployment in new cluster

6. Verify workloads

Once our workloads are deployed successfully in the "green-eks" cluster, it's verification time! The specific tests you run will depend on your application development process. You might opt for smoke test, integration test, manual test, or even a simple UI check like we did in this blog for demo purpose only. The key purpose is to ensure everything functions as intended in the new environment.

Application in new cluster

We also would check EKS adds-on operation. For example, Karpenter works well by scaling node as expected.

Karpenter deployment logs

7. DNS Switchover

When application is ready to serve the client requests, the final step is to switch traffic over to the "green-eks" cluster. We achieved this by updating our DNS records in DNS management such as Amazon Route 53 or any other DNS provider. Amazon Route 53 provides weighted routing policy, so we can initially direct a small percentage of users to the new cluster. This allows us perform a staged rollout and verify everything functions smoothly before migrating all traffic.

Stateful workloads migration

During workload deployments to new Kubernetes clusters, specific considerations arise for stateful workloads. These workloads, such as Solr databases or monitoring stacks like Prometheus and Grafana, require data persistence and careful migration strategies. One proven and reliable migration approach for ensuring data integrity is the backup and restore method. We shared our experience in Solr database migration between EKS cluster in previous blog. The blog serves as a reference guide for migrating your stateful workloads.

Conclusion

By leveraging the Blue/Green deployment strategy, we've successfully navigated our EKS upgrade with minimal disruption. This approach offers several benefits:

Reduced Downtime: Since you maintain a fully functional "blue" cluster while deploying the upgrade on "green," user traffic experiences minimal interruption.
Phased Rollout: Weighted routing policy with Amazon Route 53 allows for a staged rollout, letting you test the new cluster with a small percentage of users before fully traffic migration.
Rollback: If any issues arise in the new environment, you can easily switch traffic back to the "blue" cluster with minimum overhead.

This blog provides a high-level guideline for EKS upgrade process using blue/green deployment to mitigate system disruption. Remember to tailor the specific steps to your application and infrastructure. Through a well-prepared planning and execution, blue/green deployment can make your EKS upgrade a breeze!