In today's interconnected world, even a brief outage in a single cloud region can significantly impact business operations and user experience. Recent incidents in AWS regions are a crucial reminder that system availability and disaster preparedness can no longer be overlooked.
This article provides a comprehensive, step-by-step guide to implementing a multi-region failover system for applications hosted on Amazon Elastic Kubernetes Service (EKS) using AWS Route 53. By the end of this workshop, you'll have a clear understanding of how to enhance the resiliency of your critical workloads, ensuring continuous operation even in the face of regional disruptions.
Workshop Prerequisites
To follow along and build this multi-region failover setup, you will need:
- An active AWS Account
- A Public Hosted Zone configured in AWS Route 53
- Access to AWS CloudShell (or a local environment with AWS CLI configured)
- The following command-line interfaces (CLIs) installed:
`aws`, `envsubst`, `eksctl`, `kubectl`, `kubectx`, `kubens`
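If you want to confirm the tooling is in place, a quick sanity check like the one below is enough. Note that `aws` and `kubectl` come preinstalled in CloudShell, while `envsubst`, `eksctl`, `kubectx`, and `kubens` are installed as part of Step 1, so this check is only meaningful after that point (or on a local machine where you manage the tools yourself).

```bash
# Quick sanity check that the required CLIs are installed and on the PATH
aws --version
kubectl version --client
command -v envsubst eksctl kubectx kubens
```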
Once you have these prerequisites ready, let's dive into building a robust, multi-region architecture.
Step 1: Setting Up Your Amazon EKS Clusters Across Multiple Regions
The foundation of our multi-region failover strategy involves deploying identical EKS clusters in two distinct AWS regions. For this workshop, we'll use ap-southeast-7 (Thailand) as our primary region and ap-southeast-1 (Singapore) as our secondary/failover region.
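Note that ap-southeast-7 (Thailand) is a newer, opt-in region, so it must be enabled for your account before you can create resources there. A minimal check using the AWS Account API might look like this (enabling a region can take a few minutes to take effect):

```bash
# Check whether the Thailand region is enabled for this account
# (opt-in regions report DISABLED until you enable them)
aws account get-region-opt-status --region-name ap-southeast-7

# Enable it if necessary
aws account enable-region --region-name ap-southeast-7
```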
- Initialize CloudShell and Install Tools:
Begin by opening AWS CloudShell. For consistency, all commands in this guide will be executed from a single CloudShell instance (e.g., in Singapore, as ap-southeast-7 may not offer CloudShell).
First, set up environment variables and install the necessary tools:

```bash
export ACCOUNT_ID=$(aws sts get-caller-identity --output text --query Account)
export Region1=ap-southeast-7   # Primary Region
export Region2=ap-southeast-1   # Secondary Region
export Region1ClusterName=eks-cluster-th
export Region2ClusterName=eks-cluster-sg
export ALBName=web-alb

# Replace with your actual domain
export apphostname=web-demo.yourdomain.net   # e.g., web-demo.cloudation101.net
export HostedZoneName=yourdomain.net         # e.g., cloudation101.net
export HostedZoneId=$(aws route53 list-hosted-zones-by-name --dns-name $HostedZoneName --query 'HostedZones[0].Id' --output text | cut -d'/' -f3)

# Install envsubst cli
sudo yum install gettext -y

# Install eksctl
ARCH=amd64
PLATFORM=$(uname -s)_$ARCH
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz"
tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp && rm eksctl_$PLATFORM.tar.gz
sudo install -m 0755 /tmp/eksctl /usr/local/bin && rm /tmp/eksctl

# Install kubectx, kubens cli
sudo git clone https://github.com/ahmetb/kubectx /opt/kubectx
sudo ln -s /opt/kubectx/kubectx /usr/local/bin/kubectx
sudo ln -s /opt/kubectx/kubens /usr/local/bin/kubens
```

- Create EKS Clusters:
Now, deploy EKS clusters in both your chosen regions. Each cluster typically takes about 15 minutes to provision.

```bash
# Create EKS cluster in Region 1 (Thailand)
eksctl create cluster --name=$Region1ClusterName --enable-auto-mode --region=$Region1

# Create EKS cluster in Region 2 (Singapore)
eksctl create cluster --name=$Region2ClusterName --enable-auto-mode --region=$Region2

# Validate cluster status (Status: Active)
aws eks describe-cluster --name $Region1ClusterName --region $Region1 --output json --query 'cluster.status'
aws eks describe-cluster --name $Region2ClusterName --region $Region2 --output json --query 'cluster.status'
```

Upon successful creation, you should see two active EKS clusters in your AWS console, one for each region.
- Test EKS Cluster Connectivity:
Verify that `kubectl` can connect to both EKS clusters.

```bash
# Switch to Region 1 cluster and list nodes
kubectx $(kubectx | grep "eks-cluster-th")
kubectl get node

# Switch to Region 2 cluster and list nodes
kubectx $(kubectx | grep "eks-cluster-sg")
kubectl get node
```

You should see the worker nodes listed for both clusters, confirming connectivity and node availability.
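eksctl adds both cluster contexts to your kubeconfig automatically. If you later open a fresh session or the contexts are missing for any reason, you can recreate them with `aws eks update-kubeconfig`; the `--alias` values below are optional and simply keep the `kubectx | grep` commands in this guide working.

```bash
# Re-add kubeconfig contexts for both clusters
aws eks update-kubeconfig --name $Region1ClusterName --region $Region1 --alias eks-cluster-th
aws eks update-kubeconfig --name $Region2ClusterName --region $Region2 --alias eks-cluster-sg

# Confirm both contexts are now visible
kubectx
```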
Step 2: Deploying Your Web Application
With the EKS clusters ready, we'll deploy a sample web application to each. This application will be accessible via an Application Load Balancer (ALB) and include a health check endpoint.
- Create IngressClass for ALB:
First, configure the `IngressClass` for the ALB in both clusters.

```bash
# Configure IngressClass for Region 1 (Thailand)
kubectx $(kubectx | grep "eks-cluster-th")
curl -s https://raw.githubusercontent.com/aws-samples/multi-region-ingress/refs/heads/main/ingressclassconfiguration.yaml | kubectl apply -f -
kubectl get ingressclass,ingressclassparams

# Configure IngressClass for Region 2 (Singapore)
kubectx $(kubectx | grep "eks-cluster-sg")
curl -s https://raw.githubusercontent.com/aws-samples/multi-region-ingress/refs/heads/main/ingressclassconfiguration.yaml | kubectl apply -f -
kubectl get ingressclass,ingressclassparams
```

- Deploy Web Application Components:
Deploy the web application (ServiceAccount, ClusterRole, ClusterRoleBinding, Deployment, and Service) to both EKS clusters. Note the `APP_NAME` and `BG_COLOR` environment variables, which help distinguish between the two regional deployments.

Deploy Web App in Region 1 (Thailand):
```bash
# Switch to Region 1 cluster
kubectx $(kubectx | grep "eks-cluster-th")

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: webapp-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: webapp-clusterrole
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: webapp-clusterrolebinding
subjects:
- kind: ServiceAccount
  name: webapp-sa
  namespace: default
roleRef:
  kind: ClusterRole
  name: webapp-clusterrole
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      serviceAccountName: webapp-sa
      containers:
      - image: dumlutimuralp/demoapp
        name: webapp
        env:
        - name: APP_NAME
          value: "WebApp_Demo_TH"
        - name: BG_COLOR
          value: "green"
        - name: HEALTH
          value: "webapphealth"
        - name: MY_NODENAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
---
apiVersion: v1
kind: Service
metadata:
  name: webappservice
spec:
  selector:
    app: webapp
  ports:
  - protocol: TCP
    port: 80
EOF
```

Deploy Web App in Region 2 (Singapore):
```bash
# Switch to Region 2 cluster
kubectx $(kubectx | grep "eks-cluster-sg")

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: webapp-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: webapp-clusterrole
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: webapp-clusterrolebinding
subjects:
- kind: ServiceAccount
  name: webapp-sa
  namespace: default
roleRef:
  kind: ClusterRole
  name: webapp-clusterrole
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      serviceAccountName: webapp-sa
      containers:
      - image: dumlutimuralp/demoapp
        name: webapp
        env:
        - name: APP_NAME
          value: "WebApp_Demo_SG"
        - name: BG_COLOR
          value: "red"
        - name: HEALTH
          value: "webapphealth"
        - name: MY_NODENAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
---
apiVersion: v1
kind: Service
metadata:
  name: webappservice
spec:
  selector:
    app: webapp
  ports:
  - protocol: TCP
    port: 80
EOF
```

- Create Ingress for Web Application:
Next, create an Ingress resource for each region, which will provision an AWS Application Load Balancer (ALB) to expose our web applications. Both ALBs will respond to the same hostname, `web-demo.yourdomain.net`.

```bash
# Switch to Region 1 cluster (Thailand) and create Ingress
kubectx $(kubectx | grep "eks-cluster-th")

cat <<EOF | envsubst | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webapp-ingress
  annotations:
    alb.ingress.kubernetes.io/load-balancer-name: $ALBName
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
  - host: $apphostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: webappservice
            port:
              number: 80
  - http:
      paths:
      - path: /webapphealth
        pathType: Prefix
        backend:
          service:
            name: webappservice
            port:
              number: 80
EOF

# Switch to Region 2 cluster (Singapore) and create Ingress
kubectx $(kubectx | grep "eks-cluster-sg")

cat <<EOF | envsubst | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webapp-ingress
  annotations:
    alb.ingress.kubernetes.io/load-balancer-name: $ALBName
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
  - host: $apphostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: webappservice
            port:
              number: 80
  - http:
      paths:
      - path: /webapphealth
        pathType: Prefix
        backend:
          service:
            name: webappservice
            port:
              number: 80
EOF
```

- Verify Deployments and Test Access:
Confirm that all Kubernetes resources (Deployment, Service, Ingress) are created in both regions.

```bash
kubectx $(kubectx | grep "eks-cluster-th")
kubectl get all,ingress

kubectx $(kubectx | grep "eks-cluster-sg")
kubectl get all,ingress
```

You can now access each web application directly via its ALB's domain name. You should see "WebApp_Demo_TH" with a green background for the Thailand region and "WebApp_Demo_SG" with a red background for the Singapore region.
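If you prefer to test from the command line instead of a browser, you can hit each ALB directly. Because the `/` rule in the Ingress matches on the `$apphostname` value, include a `Host` header when curling the ALB's own DNS name. This is a rough sketch that assumes the demo app echoes its `APP_NAME` value somewhere in the returned HTML; the `$Region1ALB`/`$Region2ALB` lookups are the same ones used again in Step 3.

```bash
# Look up each ALB's DNS name
export Region1ALB=$(aws elbv2 describe-load-balancers --region $Region1 --query "LoadBalancers[?contains(LoadBalancerName, '$ALBName')].DNSName | [0]" --output text)
export Region2ALB=$(aws elbv2 describe-load-balancers --region $Region2 --query "LoadBalancers[?contains(LoadBalancerName, '$ALBName')].DNSName | [0]" --output text)

# The "/" rule matches on the app hostname, so send it as a Host header
curl -s -H "Host: $apphostname" "http://$Region1ALB/" | grep -o "WebApp_Demo_[A-Z]*"
curl -s -H "Host: $apphostname" "http://$Region2ALB/" | grep -o "WebApp_Demo_[A-Z]*"

# The health endpoint has no host condition, so it can be hit directly (expect 200)
curl -s -o /dev/null -w "%{http_code}\n" "http://$Region1ALB/webapphealth"
curl -s -o /dev/null -w "%{http_code}\n" "http://$Region2ALB/webapphealth"
```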
Step 3: Configuring Route 53 Health Checks and Alias Records
This is the core of our failover mechanism. We'll use Route 53 to monitor the health of our applications in both regions and intelligently route traffic.
- Create Route 53 Health Checks:
Set up HTTP health checks in Route 53 to continuously monitor the `/webapphealth` endpoint of each ALB.

```bash
# Find ALB Domain names
export Region1ALB=$(aws elbv2 describe-load-balancers --region $Region1 --query "LoadBalancers[?contains(LoadBalancerName, '$ALBName')].DNSName | [0]" --output text)
export Region2ALB=$(aws elbv2 describe-load-balancers --region $Region2 --query "LoadBalancers[?contains(LoadBalancerName, '$ALBName')].DNSName | [0]" --output text)

# Create Route 53 health check for Region 1 (Thailand)
aws route53 create-health-check --caller-reference "RegionTH-WebApp-$(date +%s)" \
  --health-check-config "{\"Port\":80,\"Type\":\"HTTP\",\"ResourcePath\":\"/webapphealth\",\"FullyQualifiedDomainName\":\"$Region1ALB\",\"RequestInterval\":10,\"FailureThreshold\":2,\"MeasureLatency\":true,\"Inverted\":false,\"Disabled\":false,\"EnableSNI\":false}"

# Create Route 53 health check for Region 2 (Singapore)
aws route53 create-health-check --caller-reference "RegionSG-WebApp-$(date +%s)" \
  --health-check-config "{\"Port\":80,\"Type\":\"HTTP\",\"ResourcePath\":\"/webapphealth\",\"FullyQualifiedDomainName\":\"$Region2ALB\",\"RequestInterval\":10,\"FailureThreshold\":2,\"MeasureLatency\":true,\"Inverted\":false,\"Disabled\":false,\"EnableSNI\":false}"
```

Verify the health checks are created in the Route 53 console. Initially, both should show as "Healthy." (You can also verify the checks and records from the CLI, as shown after this list.)
- Create Route 53 Alias Records for Failover:
Now, create two Alias records for your application's hostname (`web-demo.yourdomain.net`), pointing to the ALBs in each region. We'll use a Weighted Routing Policy where Region 1 (Thailand) is the primary (weight 100) and Region 2 (Singapore) is the secondary (weight 0). This setup ensures traffic is always directed to Region 1 unless its health check fails.

```bash
# Find Health Check IDs
export HealthCheckRegion1AppId=$(aws route53 list-health-checks \
  --query "HealthChecks[?HealthCheckConfig.FullyQualifiedDomainName=='$Region1ALB' && HealthCheckConfig.ResourcePath=='/webapphealth'] | [0].Id" \
  --output text)
export HealthCheckRegion2AppId=$(aws route53 list-health-checks \
  --query "HealthChecks[?HealthCheckConfig.FullyQualifiedDomainName=='$Region2ALB' && HealthCheckConfig.ResourcePath=='/webapphealth'] | [0].Id" \
  --output text)

# Find Hosted Zone ID of ALBs
export Region1ALBHostedZoneId=$(aws elbv2 describe-load-balancers --region $Region1 --query "LoadBalancers[?DNSName=='$Region1ALB'].CanonicalHostedZoneId" --output text)
export Region2ALBHostedZoneId=$(aws elbv2 describe-load-balancers --region $Region2 --query "LoadBalancers[?DNSName=='$Region2ALB'].CanonicalHostedZoneId" --output text)

# Create alias record for Region 1 (Thailand) with weight 100
aws route53 change-resource-record-sets --hosted-zone-id $HostedZoneId \
  --change-batch "{\"Changes\":[{\"Action\":\"CREATE\",\"ResourceRecordSet\":{\"Name\":\"$apphostname\",\"Type\":\"A\",\"SetIdentifier\":\"Region1-Primary\",\"Weight\":100,\"HealthCheckId\":\"$HealthCheckRegion1AppId\",\"AliasTarget\":{\"HostedZoneId\":\"$Region1ALBHostedZoneId\",\"DNSName\":\"$Region1ALB\",\"EvaluateTargetHealth\":false}}}]}"

# Create alias record for Region 2 (Singapore) with weight 0
aws route53 change-resource-record-sets --hosted-zone-id $HostedZoneId \
  --change-batch "{\"Changes\":[{\"Action\":\"CREATE\",\"ResourceRecordSet\":{\"Name\":\"$apphostname\",\"Type\":\"A\",\"SetIdentifier\":\"Region2-Secondary\",\"Weight\":0,\"HealthCheckId\":\"$HealthCheckRegion2AppId\",\"AliasTarget\":{\"HostedZoneId\":\"$Region2ALBHostedZoneId\",\"DNSName\":\"$Region2ALB\",\"EvaluateTargetHealth\":false}}}]}"
```

Verify the DNS records are created in your Route 53 Hosted Zone. When you access `http://web-demo.yourdomain.net`, you should initially be routed to the green "WebApp_Demo_TH" page in Region 1.
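Before moving on to the failover test, it can be reassuring to confirm everything from the CLI as well. A small verification sketch using the IDs and variables exported above (if `dig` is unavailable in your environment, `nslookup` works too):

```bash
# Health check status as seen by the Route 53 checkers (expect "Success" observations)
aws route53 get-health-check-status --health-check-id $HealthCheckRegion1AppId \
  --query 'HealthCheckObservations[].StatusReport.Status' --output text
aws route53 get-health-check-status --health-check-id $HealthCheckRegion2AppId \
  --query 'HealthCheckObservations[].StatusReport.Status' --output text

# Both weighted alias records should exist for the app hostname
aws route53 list-resource-record-sets --hosted-zone-id $HostedZoneId \
  --query "ResourceRecordSets[?Name=='$apphostname.']"

# DNS should currently resolve to the same addresses as the Region 1 (Thailand) ALB
dig +short $apphostname
dig +short $Region1ALB
```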
Step 4: Testing Application Failure and Failover
Now for the crucial test: simulating a failure in our primary region to observe the automatic failover.
- Simulate Failure in Primary Region:
Scale down the web application deployment in Region 1 (Thailand) to zero replicas. This will cause its health check to fail.

```bash
kubectx $(kubectx | grep "eks-cluster-th")
kubectl scale deployment webapp --replicas=0
```

After scaling down, monitor the Route 53 health checks. The health check for Region 1 should quickly transition to "Unhealthy." (See the monitoring snippet after this list for a CLI-based way to watch the transition.)
- Observe Automatic Failover:
Once Region 1's health check is unhealthy, try accessing `http://web-demo.yourdomain.net` again. Route 53, detecting the failure of the primary weighted record, automatically directs traffic to the healthy secondary record (Region 2). You should now see the red "WebApp_Demo_SG" page, confirming a successful failover.
- Note on Failover Speed: The speed of redirection depends on several factors, including your Route 53 health check configuration (interval, failure threshold), client-side DNS TTL settings, and ALB HTTP keep-alive configuration.
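To watch the failover happen end-to-end rather than just refreshing the browser, a small polling loop like the one below can help. It reuses the exports from Step 3; run the two parts in separate CloudShell tabs if you want to see the health check and the DNS answer change side by side.

```bash
# Watch the Region 1 health check observations flip to failures after the scale-down
watch -n 5 "aws route53 get-health-check-status --health-check-id $HealthCheckRegion1AppId \
  --query 'HealthCheckObservations[].StatusReport.Status' --output text"

# In another tab: poll DNS and the app itself; the answer should switch from the
# Thailand ALB's addresses to the Singapore ALB's once Route 53 marks Region 1 unhealthy
while true; do
  date
  dig +short $apphostname
  curl -s "http://$apphostname/" | grep -o "WebApp_Demo_[A-Z]*"
  sleep 10
done
```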
Step 5: Cleaning Up Your Environment
To avoid unnecessary costs, remember to clean up all the resources created during this workshop.
- Delete Route 53 Alias Records:
```bash
aws route53 change-resource-record-sets --hosted-zone-id $HostedZoneId \
  --change-batch "{\"Changes\":[{\"Action\":\"DELETE\",\"ResourceRecordSet\":{\"Name\":\"$apphostname\",\"Type\":\"A\",\"SetIdentifier\":\"Region1-Primary\",\"Weight\":100,\"HealthCheckId\":\"$HealthCheckRegion1AppId\",\"AliasTarget\":{\"HostedZoneId\":\"$Region1ALBHostedZoneId\",\"DNSName\":\"$Region1ALB\",\"EvaluateTargetHealth\":false}}}]}"

aws route53 change-resource-record-sets --hosted-zone-id $HostedZoneId \
  --change-batch "{\"Changes\":[{\"Action\":\"DELETE\",\"ResourceRecordSet\":{\"Name\":\"$apphostname\",\"Type\":\"A\",\"SetIdentifier\":\"Region2-Secondary\",\"Weight\":0,\"HealthCheckId\":\"$HealthCheckRegion2AppId\",\"AliasTarget\":{\"HostedZoneId\":\"$Region2ALBHostedZoneId\",\"DNSName\":\"$Region2ALB\",\"EvaluateTargetHealth\":false}}}]}"
```

- Delete Route 53 Health Checks:
```bash
aws route53 delete-health-check --health-check-id $HealthCheckRegion1AppId
aws route53 delete-health-check --health-check-id $HealthCheckRegion2AppId
```

- Delete EKS Clusters:
```bash
eksctl delete cluster --name=$Region1ClusterName --region=$Region1
eksctl delete cluster --name=$Region2ClusterName --region=$Region2
```
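After the deletions complete, it is worth double-checking that nothing billable is left behind. A quick sketch:

```bash
# No EKS clusters should remain in either region
aws eks list-clusters --region $Region1
aws eks list-clusters --region $Region2

# No leftover ALBs or Route 53 health checks
aws elbv2 describe-load-balancers --region $Region1 --query 'LoadBalancers[].LoadBalancerName'
aws elbv2 describe-load-balancers --region $Region2 --query 'LoadBalancers[].LoadBalancerName'
aws route53 list-health-checks --query 'HealthChecks[].Id'
```

If an ALB is still listed, it was likely left behind by the Ingress; delete it manually from the EC2 console (or delete the Ingress before deleting the cluster next time).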
Key Learnings from this Workshop
This practical exercise offers several important insights into multi-region failover design:
- Granular Failover Effect: If multiple services share the same ALB, Route 53 might not fail over the entire region when only specific services become unhealthy. This is due to the `EvaluateTargetHealth: false` setting in the Alias records, which tells Route 53 to rely solely on its own health checks.
- Active-Active Routing: For an active-active pattern, where traffic is distributed across both regions simultaneously, configure identical routing policies (e.g., equal weights) for both Alias records; see the sketch after this list.
- HTTPS Health Check Considerations: Route 53 health checkers do not validate SSL/TLS certificates when configured for HTTPS endpoints.
- Monitoring Private IPs: Route 53 health checks cannot directly monitor endpoints with private IP addresses. For private resources, consider using AWS CloudWatch alarms or AWS Lambda functions for health monitoring; see the example after this list.
- Cost of Health Checks: Be mindful of the number of Route 53 Health Checks you create, as they incur costs, especially at scale.
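For example, switching this workshop's setup to active-active is just a matter of giving both records a non-zero weight. A sketch using UPSERT on the records created in Step 3 (a 50/50 split is shown; any ratio works, and the health checks still remove an unhealthy region from rotation):

```bash
# Give both regional alias records equal weight so traffic is split roughly 50/50
aws route53 change-resource-record-sets --hosted-zone-id $HostedZoneId \
  --change-batch "{\"Changes\":[{\"Action\":\"UPSERT\",\"ResourceRecordSet\":{\"Name\":\"$apphostname\",\"Type\":\"A\",\"SetIdentifier\":\"Region1-Primary\",\"Weight\":50,\"HealthCheckId\":\"$HealthCheckRegion1AppId\",\"AliasTarget\":{\"HostedZoneId\":\"$Region1ALBHostedZoneId\",\"DNSName\":\"$Region1ALB\",\"EvaluateTargetHealth\":false}}}]}"

aws route53 change-resource-record-sets --hosted-zone-id $HostedZoneId \
  --change-batch "{\"Changes\":[{\"Action\":\"UPSERT\",\"ResourceRecordSet\":{\"Name\":\"$apphostname\",\"Type\":\"A\",\"SetIdentifier\":\"Region2-Secondary\",\"Weight\":50,\"HealthCheckId\":\"$HealthCheckRegion2AppId\",\"AliasTarget\":{\"HostedZoneId\":\"$Region2ALBHostedZoneId\",\"DNSName\":\"$Region2ALB\",\"EvaluateTargetHealth\":false}}}]}"
```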
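For the private-endpoint case mentioned above, one option is a health check of type CLOUDWATCH_METRIC, which mirrors the state of an existing CloudWatch alarm instead of probing the endpoint directly. A minimal sketch; the alarm name `private-api-unhealthy-hosts` is purely hypothetical and stands in for an alarm you already maintain (for example, on a target group's UnHealthyHostCount metric):

```bash
# Route 53 health check that follows an existing CloudWatch alarm rather than
# probing a (private) endpoint itself; the alarm name and region are placeholders
aws route53 create-health-check --caller-reference "private-api-$(date +%s)" \
  --health-check-config "{\"Type\":\"CLOUDWATCH_METRIC\",\"AlarmIdentifier\":{\"Region\":\"$Region2\",\"Name\":\"private-api-unhealthy-hosts\"},\"InsufficientDataHealthStatus\":\"Unhealthy\"}"
```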
Extending Multi-Region Resiliency to Other AWS Services
While this guide focused on EKS, a comprehensive multi-region strategy must encompass all your application's dependencies. Here's a brief overview of how to achieve multi-region readiness for other common AWS services:
- Amazon RDS: For MySQL, PostgreSQL, or Aurora, you can configure cross-region Read Replicas. Aurora Global Database offers even higher availability and lower RTO/RPO, but it is not available in every AWS Region.
- Amazon MSK (Kafka): MSK clusters are single-region, so cross-region replication must be set up explicitly, either with MSK Replicator or with open-source tooling such as MirrorMaker 2.
- Amazon S3: S3 natively supports Cross-Region Replication (CRR). You can configure policies to automatically replicate critical bucket data to another region.
- Amazon EBS: EBS volumes are zonal. To protect against regional failures, create snapshots and replicate them to the target region. You would then restore the snapshot to a new EBS volume in the failover region.
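As an illustration of the EBS approach, copying a snapshot into the failover region is a single call; the snapshot ID below is a placeholder, and the destination region is whichever region the command targets.

```bash
# Copy an EBS snapshot from the primary region into the failover region
# (snap-0123456789abcdef0 is a placeholder ID)
aws ec2 copy-snapshot \
  --source-region $Region1 \
  --source-snapshot-id snap-0123456789abcdef0 \
  --region $Region2 \
  --description "DR copy of primary-region data volume"
```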
For more in-depth practices, explore the "Creating a Multi-Region Application with AWS Services" series on the AWS Architecture Blog.
Conclusion
This article has demonstrated the fundamental steps to design and implement a multi-region failover system for applications running on Amazon EKS. Whether you're planning a large-scale project or enhancing an existing system, this approach provides a robust framework for increased resilience.
Ultimately, deciding how extensively to implement multi-region failover (for every service or only for critical ones) involves a careful balance. It's essential to weigh the additional investment (infrastructure costs, operational overhead) against your organization's acceptable Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO), and the potential business impact of downtime. Focusing multi-region efforts on truly critical services and data is often the most cost-effective strategy. Crucially, regularly planning, conducting, and testing your Disaster Recovery (DR) procedures across regions is paramount to ensuring your designed system performs as expected when an actual emergency occurs.
We hope this guide provides valuable insights. Feel free to share your questions or experiences with multi-region failover on EKS in the comments. May your infrastructure always be ready for any situation!
Reference: Original Workshop: https://aws.amazon.com/blogs/containers/implementing-granular-failover-in-multi-region-amazon-eks/