In today's interconnected world, even a brief outage in a single cloud region can significantly impact business operations and user experience. Recent incidents in AWS regions are a crucial reminder that system availability and disaster preparedness can no longer be overlooked.
This article provides a comprehensive, step-by-step guide to implementing a multi-region failover system for applications hosted on Amazon Elastic Kubernetes Service (EKS) using AWS Route 53. By the end of this workshop, you'll have a clear understanding of how to enhance the resiliency of your critical workloads, ensuring continuous operation even in the face of regional disruptions.
Workshop Prerequisites
To follow along and build this multi-region failover setup, you will need:
- An active AWS Account
- A Public Hosted Zone configured in AWS Route 53
- Access to AWS CloudShell (or a local environment with AWS CLI configured)
- The following command-line interfaces (CLIs) installed:
`aws`, `envsubst`, `eksctl`, `kubectl`, `kubectx`, `kubens`
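If you want to confirm the tooling is in place, a quick sanity check like the one below is enough. Note that `aws` and `kubectl` come preinstalled in CloudShell, while `envsubst`, `eksctl`, `kubectx`, and `kubens` are installed as part of Step 1, so this check is only meaningful after that point (or on a local machine where you manage the tools yourself).

```bash
# Quick sanity check that the required CLIs are installed and on the PATH
aws --version
kubectl version --client
command -v envsubst eksctl kubectx kubens
```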
Once you have these prerequisites ready, let's dive into building a robust, multi-region architecture.
Step 1: Setting Up Your Amazon EKS Clusters Across Multiple Regions
The foundation of our multi-region failover strategy involves deploying identical EKS clusters in two distinct AWS regions. For this workshop, we'll use ap-southeast-7 (Thailand) as our primary region and ap-southeast-1 (Singapore) as our secondary/failover region.
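Note that ap-southeast-7 (Thailand) is a newer, opt-in region, so it must be enabled for your account before you can create resources there. A minimal check using the AWS Account API might look like this (enabling a region can take a few minutes to take effect):

```bash
# Check whether the Thailand region is enabled for this account
# (opt-in regions report DISABLED until you enable them)
aws account get-region-opt-status --region-name ap-southeast-7

# Enable it if necessary
aws account enable-region --region-name ap-southeast-7
```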
- Initialize CloudShell and Install Tools:
Begin by opening AWS CloudShell. For consistency, all commands in this guide will be executed from a single CloudShell instance (e.g., in Singapore, as ap-southeast-7 may not offer CloudShell).
First, set up environment variables and install the necessary tools:

```bash
export ACCOUNT_ID=$(aws sts get-caller-identity --output text --query Account)
export Region1=ap-southeast-7   # Primary Region
export Region2=ap-southeast-1   # Secondary Region
export Region1ClusterName=eks-cluster-th
export Region2ClusterName=eks-cluster-sg
export ALBName=web-alb

# Replace with your actual domain
export apphostname=web-demo.yourdomain.net   # e.g., web-demo.cloudation101.net
export HostedZoneName=yourdomain.net         # e.g., cloudation101.net
export HostedZoneId=$(aws route53 list-hosted-zones-by-name --dns-name $HostedZoneName --query 'HostedZones[0].Id' --output text | cut -d'/' -f3)

# Install envsubst cli
sudo yum install gettext -y

# Install eksctl
ARCH=amd64
PLATFORM=$(uname -s)_$ARCH
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz"
tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp && rm eksctl_$PLATFORM.tar.gz
sudo install -m 0755 /tmp/eksctl /usr/local/bin && rm /tmp/eksctl

# Install kubectx, kubens cli
sudo git clone https://github.com/ahmetb/kubectx /opt/kubectx
sudo ln -s /opt/kubectx/kubectx /usr/local/bin/kubectx
sudo ln -s /opt/kubectx/kubens /usr/local/bin/kubens
```

- Create EKS Clusters:
Now, deploy EKS clusters in both your chosen regions. Each cluster typically takes about 15 minutes to provision.

```bash
# Create EKS cluster in Region 1 (Thailand)
eksctl create cluster --name=$Region1ClusterName --enable-auto-mode --region=$Region1

# Create EKS cluster in Region 2 (Singapore)
eksctl create cluster --name=$Region2ClusterName --enable-auto-mode --region=$Region2

# Validate cluster status (Status: Active)
aws eks describe-cluster --name $Region1ClusterName --region $Region1 --output json --query 'cluster.status'
aws eks describe-cluster --name $Region2ClusterName --region $Region2 --output json --query 'cluster.status'
```

Upon successful creation, you should see two active EKS clusters in your AWS console, one for each region.
- Test EKS Cluster Connectivity:
Verify that `kubectl` can connect to both EKS clusters.

```bash
# Switch to Region 1 cluster and list nodes
kubectx $(kubectx | grep "eks-cluster-th")
kubectl get node

# Switch to Region 2 cluster and list nodes
kubectx $(kubectx | grep "eks-cluster-sg")
kubectl get node
```

You should see the worker nodes listed for both clusters, confirming connectivity and node availability.
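eksctl adds both cluster contexts to your kubeconfig automatically. If you later open a fresh session or the contexts are missing for any reason, you can recreate them with `aws eks update-kubeconfig`; the `--alias` values below are optional and simply keep the `kubectx | grep` commands in this guide working.

```bash
# Re-add kubeconfig contexts for both clusters
aws eks update-kubeconfig --name $Region1ClusterName --region $Region1 --alias eks-cluster-th
aws eks update-kubeconfig --name $Region2ClusterName --region $Region2 --alias eks-cluster-sg

# Confirm both contexts are now visible
kubectx
```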
Step 2: Deploying Your Web Application
With the EKS clusters ready, we'll deploy a sample web application to each. This application will be accessible via an Application Load Balancer (ALB) and include a health check endpoint.
- Create IngressClass for ALB:
First, configure the `IngressClass` for the ALB in both clusters.

```bash
# Configure IngressClass for Region 1 (Thailand)
kubectx $(kubectx | grep "eks-cluster-th")
curl -s https://raw.githubusercontent.com/aws-samples/multi-region-ingress/refs/heads/main/ingressclassconfiguration.yaml | kubectl apply -f -
kubectl get ingressclass,ingressclassparams

# Configure IngressClass for Region 2 (Singapore)
kubectx $(kubectx | grep "eks-cluster-sg")
curl -s https://raw.githubusercontent.com/aws-samples/multi-region-ingress/refs/heads/main/ingressclassconfiguration.yaml | kubectl apply -f -
kubectl get ingressclass,ingressclassparams
```

- Deploy Web Application Components:
Deploy the web application (ServiceAccount, ClusterRole, ClusterRoleBinding, Deployment, and Service) to both EKS clusters. Note the `APP_NAME` and `BG_COLOR` environment variables, which help distinguish between the two regional deployments.

Deploy Web App in Region 1 (Thailand):
```bash
# Switch to Region 1 cluster
kubectx $(kubectx | grep "eks-cluster-th")

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: webapp-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: webapp-clusterrole
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: webapp-clusterrolebinding
subjects:
- kind: ServiceAccount
  name: webapp-sa
  namespace: default
roleRef:
  kind: ClusterRole
  name: webapp-clusterrole
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      serviceAccountName: webapp-sa
      containers:
      - image: dumlutimuralp/demoapp
        name: webapp
        env:
        - name: APP_NAME
          value: "WebApp_Demo_TH"
        - name: BG_COLOR
          value: "green"
        - name: HEALTH
          value: "webapphealth"
        - name: MY_NODENAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
---
apiVersion: v1
kind: Service
metadata:
  name: webappservice
spec:
  selector:
    app: webapp
  ports:
  - protocol: TCP
    port: 80
EOF
```

Deploy Web App in Region 2 (Singapore):
```bash
# Switch to Region 2 cluster
kubectx $(kubectx | grep "eks-cluster-sg")

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: webapp-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: webapp-clusterrole
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: webapp-clusterrolebinding
subjects:
- kind: ServiceAccount
  name: webapp-sa
  namespace: default
roleRef:
  kind: ClusterRole
  name: webapp-clusterrole
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      serviceAccountName: webapp-sa
      containers:
      - image: dumlutimuralp/demoapp
        name: webapp
        env:
        - name: APP_NAME
          value: "WebApp_Demo_SG"
        - name: BG_COLOR
          value: "red"
        - name: HEALTH
          value: "webapphealth"
        - name: MY_NODENAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
---
apiVersion: v1
kind: Service
metadata:
  name: webappservice
spec:
  selector:
    app: webapp
  ports:
  - protocol: TCP
    port: 80
EOF
```

- Create Ingress for Web Application:
Next, create an Ingress resource for each region, which will provision an AWS Application Load Balancer (ALB) to expose our web applications. Both ALBs will respond to the same hostname, `web-demo.yourdomain.net`.

```bash
# Switch to Region 1 cluster (Thailand) and create Ingress
kubectx $(kubectx | grep "eks-cluster-th")

cat <<EOF | envsubst | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webapp-ingress
  annotations:
    alb.ingress.kubernetes.io/load-balancer-name: $ALBName
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
  - host: $apphostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: webappservice
            port:
              number: 80
  - http:
      paths:
      - path: /webapphealth
        pathType: Prefix
        backend:
          service:
            name: webappservice
            port:
              number: 80
EOF

# Switch to Region 2 cluster (Singapore) and create Ingress
kubectx $(kubectx | grep "eks-cluster-sg")

cat <<EOF | envsubst | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webapp-ingress
  annotations:
    alb.ingress.kubernetes.io/load-balancer-name: $ALBName
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
  - host: $apphostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: webappservice
            port:
              number: 80
  - http:
      paths:
      - path: /webapphealth
        pathType: Prefix
        backend:
          service:
            name: webappservice
            port:
              number: 80
EOF
```

- Verify Deployments and Test Access:
Confirm that all Kubernetes resources (Deployment, Service, Ingress) are created in both regions.

```bash
kubectx $(kubectx | grep "eks-cluster-th")
kubectl get all,ingress

kubectx $(kubectx | grep "eks-cluster-sg")
kubectl get all,ingress
```

You can now access each web application directly via its ALB's domain name. You should see "WebApp_Demo_TH" with a green background for the Thailand region and "WebApp_Demo_SG" with a red background for the Singapore region.
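If you prefer to test from the command line instead of a browser, you can hit each ALB directly. Because the `/` rule in the Ingress matches on the `$apphostname` value, include a `Host` header when curling the ALB's own DNS name. This is a rough sketch that assumes the demo app echoes its `APP_NAME` value somewhere in the returned HTML; the `$Region1ALB`/`$Region2ALB` lookups are the same ones used again in Step 3.

```bash
# Look up each ALB's DNS name
export Region1ALB=$(aws elbv2 describe-load-balancers --region $Region1 --query "LoadBalancers[?contains(LoadBalancerName, '$ALBName')].DNSName | [0]" --output text)
export Region2ALB=$(aws elbv2 describe-load-balancers --region $Region2 --query "LoadBalancers[?contains(LoadBalancerName, '$ALBName')].DNSName | [0]" --output text)

# The "/" rule matches on the app hostname, so send it as a Host header
curl -s -H "Host: $apphostname" "http://$Region1ALB/" | grep -o "WebApp_Demo_[A-Z]*"
curl -s -H "Host: $apphostname" "http://$Region2ALB/" | grep -o "WebApp_Demo_[A-Z]*"

# The health endpoint has no host condition, so it can be hit directly (expect 200)
curl -s -o /dev/null -w "%{http_code}\n" "http://$Region1ALB/webapphealth"
curl -s -o /dev/null -w "%{http_code}\n" "http://$Region2ALB/webapphealth"
```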
Step 3: Configuring Route 53 Health Checks and Alias Records
This is the core of our failover mechanism. We'll use Route 53 to monitor the health of our applications in both regions and intelligently route traffic.
- Create Route 53 Health Checks:
Set up HTTP health checks in Route 53 to continuously monitor the `/webapphealth` endpoint of each ALB.

```bash
# Find ALB Domain names
export Region1ALB=$(aws elbv2 describe-load-balancers --region $Region1 --query "LoadBalancers[?contains(LoadBalancerName, '$ALBName')].DNSName | [0]" --output text)
export Region2ALB=$(aws elbv2 describe-load-balancers --region $Region2 --query "LoadBalancers[?contains(LoadBalancerName, '$ALBName')].DNSName | [0]" --output text)

# Create Route 53 health check for Region 1 (Thailand)
aws route53 create-health-check --caller-reference "RegionTH-WebApp-$(date +%s)" \
  --health-check-config "{\"Port\":80,\"Type\":\"HTTP\",\"ResourcePath\":\"/webapphealth\",\"FullyQualifiedDomainName\":\"$Region1ALB\",\"RequestInterval\":10,\"FailureThreshold\":2,\"MeasureLatency\":true,\"Inverted\":false,\"Disabled\":false,\"EnableSNI\":false}"

# Create Route 53 health check for Region 2 (Singapore)
aws route53 create-health-check --caller-reference "RegionSG-WebApp-$(date +%s)" \
  --health-check-config "{\"Port\":80,\"Type\":\"HTTP\",\"ResourcePath\":\"/webapphealth\",\"FullyQualifiedDomainName\":\"$Region2ALB\",\"RequestInterval\":10,\"FailureThreshold\":2,\"MeasureLatency\":true,\"Inverted\":false,\"Disabled\":false,\"EnableSNI\":false}"
```

Verify the health checks are created in the Route 53 console. Initially, both should show as "Healthy." (You can also verify the checks and records from the CLI, as shown after this list.)
- Create Route 53 Alias Records for Failover:
Now, create two Alias records for your application's hostname (`web-demo.yourdomain.net`), pointing to the ALBs in each region. We'll use a Weighted Routing Policy where Region 1 (Thailand) is the primary (weight 100) and Region 2 (Singapore) is the secondary (weight 0). This setup ensures traffic is always directed to Region 1 unless its health check fails.

```bash
# Find Health Check IDs
export HealthCheckRegion1AppId=$(aws route53 list-health-checks \
  --query "HealthChecks[?HealthCheckConfig.FullyQualifiedDomainName=='$Region1ALB' && HealthCheckConfig.ResourcePath=='/webapphealth'] | [0].Id" \
  --output text)
export HealthCheckRegion2AppId=$(aws route53 list-health-checks \
  --query "HealthChecks[?HealthCheckConfig.FullyQualifiedDomainName=='$Region2ALB' && HealthCheckConfig.ResourcePath=='/webapphealth'] | [0].Id" \
  --output text)

# Find Hosted Zone ID of ALBs
export Region1ALBHostedZoneId=$(aws elbv2 describe-load-balancers --region $Region1 --query "LoadBalancers[?DNSName=='$Region1ALB'].CanonicalHostedZoneId" --output text)
export Region2ALBHostedZoneId=$(aws elbv2 describe-load-balancers --region $Region2 --query "LoadBalancers[?DNSName=='$Region2ALB'].CanonicalHostedZoneId" --output text)

# Create alias record for Region 1 (Thailand) with weight 100
aws route53 change-resource-record-sets --hosted-zone-id $HostedZoneId \
  --change-batch "{\"Changes\":[{\"Action\":\"CREATE\",\"ResourceRecordSet\":{\"Name\":\"$apphostname\",\"Type\":\"A\",\"SetIdentifier\":\"Region1-Primary\",\"Weight\":100,\"HealthCheckId\":\"$HealthCheckRegion1AppId\",\"AliasTarget\":{\"HostedZoneId\":\"$Region1ALBHostedZoneId\",\"DNSName\":\"$Region1ALB\",\"EvaluateTargetHealth\":false}}}]}"

# Create alias record for Region 2 (Singapore) with weight 0
aws route53 change-resource-record-sets --hosted-zone-id $HostedZoneId \
  --change-batch "{\"Changes\":[{\"Action\":\"CREATE\",\"ResourceRecordSet\":{\"Name\":\"$apphostname\",\"Type\":\"A\",\"SetIdentifier\":\"Region2-Secondary\",\"Weight\":0,\"HealthCheckId\":\"$HealthCheckRegion2AppId\",\"AliasTarget\":{\"HostedZoneId\":\"$Region2ALBHostedZoneId\",\"DNSName\":\"$Region2ALB\",\"EvaluateTargetHealth\":false}}}]}"
```

Verify the DNS records are created in your Route 53 Hosted Zone. When you access `http://web-demo.yourdomain.net`, you should initially be routed to the green "WebApp_Demo_TH" page in Region 1.
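Before moving on to the failover test, it can be reassuring to confirm everything from the CLI as well. A small verification sketch using the IDs and variables exported above (if `dig` is unavailable in your environment, `nslookup` works too):

```bash
# Health check status as seen by the Route 53 checkers (expect "Success" observations)
aws route53 get-health-check-status --health-check-id $HealthCheckRegion1AppId \
  --query 'HealthCheckObservations[].StatusReport.Status' --output text
aws route53 get-health-check-status --health-check-id $HealthCheckRegion2AppId \
  --query 'HealthCheckObservations[].StatusReport.Status' --output text

# Both weighted alias records should exist for the app hostname
aws route53 list-resource-record-sets --hosted-zone-id $HostedZoneId \
  --query "ResourceRecordSets[?Name=='$apphostname.']"

# DNS should currently resolve to the same addresses as the Region 1 (Thailand) ALB
dig +short $apphostname
dig +short $Region1ALB
```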
Step 4: Testing Application Failure and Failover
Now for the crucial test: simulating a failure in our primary region to observe the automatic failover.
- Simulate Failure in Primary Region:
Scale down the web application deployment in Region 1 (Thailand) to zero replicas. This will cause its health check to fail.

```bash
kubectx $(kubectx | grep "eks-cluster-th")
kubectl scale deployment webapp --replicas=0
```

After scaling down, monitor the Route 53 health checks. The health check for Region 1 should quickly transition to "Unhealthy." (See the monitoring snippet after this list for a CLI-based way to watch the transition.)
- Observe Automatic Failover:
Once Region 1's health check is unhealthy, try accessing `http://web-demo.yourdomain.net` again. Route 53, detecting the failure of the primary weighted record, automatically directs traffic to the healthy secondary record (Region 2). You should now see the red "WebApp_Demo_SG" page, confirming a successful failover.
- Note on Failover Speed: The speed of redirection depends on several factors, including your Route 53 health check configuration (interval, failure threshold), client-side DNS TTL settings, and ALB HTTP keep-alive configuration.
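To watch the failover happen end-to-end rather than just refreshing the browser, a small polling loop like the one below can help. It reuses the exports from Step 3; run the two parts in separate CloudShell tabs if you want to see the health check and the DNS answer change side by side.

```bash
# Watch the Region 1 health check observations flip to failures after the scale-down
watch -n 5 "aws route53 get-health-check-status --health-check-id $HealthCheckRegion1AppId \
  --query 'HealthCheckObservations[].StatusReport.Status' --output text"

# In another tab: poll DNS and the app itself; the answer should switch from the
# Thailand ALB's addresses to the Singapore ALB's once Route 53 marks Region 1 unhealthy
while true; do
  date
  dig +short $apphostname
  curl -s "http://$apphostname/" | grep -o "WebApp_Demo_[A-Z]*"
  sleep 10
done
```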
Step 5: Cleaning Up Your Environment
To avoid unnecessary costs, remember to clean up all the resources created during this workshop.
- Delete Route 53 Alias Records:
```bash
aws route53 change-resource-record-sets --hosted-zone-id $HostedZoneId \
  --change-batch "{\"Changes\":[{\"Action\":\"DELETE\",\"ResourceRecordSet\":{\"Name\":\"$apphostname\",\"Type\":\"A\",\"SetIdentifier\":\"Region1-Primary\",\"Weight\":100,\"HealthCheckId\":\"$HealthCheckRegion1AppId\",\"AliasTarget\":{\"HostedZoneId\":\"$Region1ALBHostedZoneId\",\"DNSName\":\"$Region1ALB\",\"EvaluateTargetHealth\":false}}}]}"

aws route53 change-resource-record-sets --hosted-zone-id $HostedZoneId \
  --change-batch "{\"Changes\":[{\"Action\":\"DELETE\",\"ResourceRecordSet\":{\"Name\":\"$apphostname\",\"Type\":\"A\",\"SetIdentifier\":\"Region2-Secondary\",\"Weight\":0,\"HealthCheckId\":\"$HealthCheckRegion2AppId\",\"AliasTarget\":{\"HostedZoneId\":\"$Region2ALBHostedZoneId\",\"DNSName\":\"$Region2ALB\",\"EvaluateTargetHealth\":false}}}]}"
```

- Delete Route 53 Health Checks:
```bash
aws route53 delete-health-check --health-check-id $HealthCheckRegion1AppId
aws route53 delete-health-check --health-check-id $HealthCheckRegion2AppId
```

- Delete EKS Clusters:
```bash
eksctl delete cluster --name=$Region1ClusterName --region=$Region1
eksctl delete cluster --name=$Region2ClusterName --region=$Region2
```
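After the deletions complete, it is worth double-checking that nothing billable is left behind. A quick sketch:

```bash
# No EKS clusters should remain in either region
aws eks list-clusters --region $Region1
aws eks list-clusters --region $Region2

# No leftover ALBs or Route 53 health checks
aws elbv2 describe-load-balancers --region $Region1 --query 'LoadBalancers[].LoadBalancerName'
aws elbv2 describe-load-balancers --region $Region2 --query 'LoadBalancers[].LoadBalancerName'
aws route53 list-health-checks --query 'HealthChecks[].Id'
```

If an ALB is still listed, it was likely left behind by the Ingress; delete it manually from the EC2 console (or delete the Ingress before deleting the cluster next time).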
Key Learnings from this Workshop
This practical exercise offers several important insights into multi-region failover design:
- Granular Failover Effect: If multiple services share the same ALB, Route 53 might not fail over the entire region when only specific services become unhealthy. This is due to the `EvaluateTargetHealth: false` setting in the Alias records, which tells Route 53 to rely solely on its own health checks.
- Active-Active Routing: For an active-active pattern, where traffic is distributed across both regions simultaneously, configure identical routing policies (e.g., equal weights) for both Alias records; see the sketch after this list.
- HTTPS Health Check Considerations: Route 53 health checkers do not validate SSL/TLS certificates when configured for HTTPS endpoints.
- Monitoring Private IPs: Route 53 health checks cannot directly monitor endpoints with private IP addresses. For private resources, consider using AWS CloudWatch alarms or AWS Lambda functions for health monitoring; see the example after this list.
- Cost of Health Checks: Be mindful of the number of Route 53 Health Checks you create, as they incur costs, especially at scale.
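For example, switching this workshop's setup to active-active is just a matter of giving both records a non-zero weight. A sketch using UPSERT on the records created in Step 3 (a 50/50 split is shown; any ratio works, and the health checks still remove an unhealthy region from rotation):

```bash
# Give both regional alias records equal weight so traffic is split roughly 50/50
aws route53 change-resource-record-sets --hosted-zone-id $HostedZoneId \
  --change-batch "{\"Changes\":[{\"Action\":\"UPSERT\",\"ResourceRecordSet\":{\"Name\":\"$apphostname\",\"Type\":\"A\",\"SetIdentifier\":\"Region1-Primary\",\"Weight\":50,\"HealthCheckId\":\"$HealthCheckRegion1AppId\",\"AliasTarget\":{\"HostedZoneId\":\"$Region1ALBHostedZoneId\",\"DNSName\":\"$Region1ALB\",\"EvaluateTargetHealth\":false}}}]}"

aws route53 change-resource-record-sets --hosted-zone-id $HostedZoneId \
  --change-batch "{\"Changes\":[{\"Action\":\"UPSERT\",\"ResourceRecordSet\":{\"Name\":\"$apphostname\",\"Type\":\"A\",\"SetIdentifier\":\"Region2-Secondary\",\"Weight\":50,\"HealthCheckId\":\"$HealthCheckRegion2AppId\",\"AliasTarget\":{\"HostedZoneId\":\"$Region2ALBHostedZoneId\",\"DNSName\":\"$Region2ALB\",\"EvaluateTargetHealth\":false}}}]}"
```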
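For the private-endpoint case mentioned above, one option is a health check of type CLOUDWATCH_METRIC, which mirrors the state of an existing CloudWatch alarm instead of probing the endpoint directly. A minimal sketch; the alarm name `private-api-unhealthy-hosts` is purely hypothetical and stands in for an alarm you already maintain (for example, on a target group's UnHealthyHostCount metric):

```bash
# Route 53 health check that follows an existing CloudWatch alarm rather than
# probing a (private) endpoint itself; the alarm name and region are placeholders
aws route53 create-health-check --caller-reference "private-api-$(date +%s)" \
  --health-check-config "{\"Type\":\"CLOUDWATCH_METRIC\",\"AlarmIdentifier\":{\"Region\":\"$Region2\",\"Name\":\"private-api-unhealthy-hosts\"},\"InsufficientDataHealthStatus\":\"Unhealthy\"}"
```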
Extending Multi-Region Resiliency to Other AWS Services
While this guide focused on EKS, a comprehensive multi-region strategy must encompass all your application's dependencies. Here's a brief overview of how to achieve multi-region readiness for other common AWS services:
- Amazon RDS: For MySQL, PostgreSQL, or Aurora, you can configure cross-region Read Replicas. Aurora Global Database offers even higher availability and lower RTO/RPO, but it is not available in every AWS Region.
- Amazon MSK (Kafka): MSK clusters are single-region, so cross-region replication must be set up explicitly, either with MSK Replicator or with open-source tooling such as MirrorMaker 2.
- Amazon S3: S3 natively supports Cross-Region Replication (CRR). You can configure policies to automatically replicate critical bucket data to another region.
- Amazon EBS: EBS volumes are zonal. To protect against regional failures, create snapshots and replicate them to the target region. You would then restore the snapshot to a new EBS volume in the failover region.
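As an illustration of the EBS approach, copying a snapshot into the failover region is a single call; the snapshot ID below is a placeholder, and the destination region is whichever region the command targets.

```bash
# Copy an EBS snapshot from the primary region into the failover region
# (snap-0123456789abcdef0 is a placeholder ID)
aws ec2 copy-snapshot \
  --source-region $Region1 \
  --source-snapshot-id snap-0123456789abcdef0 \
  --region $Region2 \
  --description "DR copy of primary-region data volume"
```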
For more in-depth practices, explore the "Creating a Multi-Region Application with AWS Services" series on the AWS Architecture Blog.
Conclusion
This article has demonstrated the fundamental steps to design and implement a multi-region failover system for applications running on Amazon EKS. Whether you're planning a large-scale project or enhancing an existing system, this approach provides a robust framework for increased resilience.
Ultimately, deciding how extensively to implement multi-region failover (for every service or only for critical ones) involves a careful balance. It's essential to weigh the additional investment (infrastructure costs, operational overhead) against your organization's acceptable Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO), and the potential business impact of downtime. Focusing multi-region efforts on truly critical services and data is often the most cost-effective strategy. Crucially, regularly planning, conducting, and testing your Disaster Recovery (DR) procedures across regions is paramount to ensuring your designed system performs as expected when an actual emergency occurs.
We hope this guide provides valuable insights. Feel free to share your questions or experiences with multi-region failover on EKS in the comments. May your infrastructure always be ready for any situation!
Reference: Original Workshop: https://aws.amazon.com/blogs/containers/implementing-granular-failover-in-multi-region-amazon-eks/