In today's interconnected world, even a brief outage in a single cloud region can significantly impact business operations and user experience. Recent incidents in AWS regions are a crucial reminder that system availability and disaster preparedness can no longer be overlooked.

This article provides a comprehensive, step-by-step guide to implementing a multi-region failover system for applications hosted on Amazon Elastic Kubernetes Service (EKS) using AWS Route 53. By the end of this workshop, you'll have a clear understanding of how to enhance the resiliency of your critical workloads, ensuring continuous operation even in the face of regional disruptions.

Workshop Prerequisites

To follow along and build this multi-region failover setup, you will need:

  • An active AWS Account
  • A Public Hosted Zone configured in AWS Route 53
  • Access to AWS CloudShell (or a local environment with AWS CLI configured)
  • The following command-line interfaces (CLIs) installed: aws, envsubst, eksctl, kubectl, kubectx, kubens

Once you have these prerequisites ready, let's dive into building a robust, multi-region architecture.

Step 1: Setting Up Your Amazon EKS Clusters Across Multiple Regions

The foundation of our multi-region failover strategy involves deploying identical EKS clusters in two distinct AWS regions. For this workshop, we'll use ap-southeast-7 (Thailand) as our primary region and ap-southeast-1 (Singapore) as our secondary/failover region.

  1. Initialize CloudShell and Install Tools:
    Begin by opening AWS CloudShell. For consistency, all commands in this guide are executed from a single CloudShell instance (for example, in Singapore, since CloudShell may not be available in ap-southeast-7).
    First, set up environment variables and install the necessary tools:

    export ACCOUNT_ID=$(aws sts get-caller-identity --output text --query Account)
    export Region1=ap-southeast-7 # Primary Region
    export Region2=ap-southeast-1 # Secondary Region
    export Region1ClusterName=eks-cluster-th
    export Region2ClusterName=eks-cluster-sg
    export ALBName=web-alb
    
    # Replace with your actual domain
    export apphostname=web-demo.yourdomain.net # e.g., web-demo.cloudation101.net
    export HostedZoneName=yourdomain.net # e.g., cloudation101.net
    export HostedZoneId=$(aws route53 list-hosted-zones-by-name --dns-name $HostedZoneName --query 'HostedZones[0].Id' --output text |  cut -d'/' -f3)
    
    # Install envsubst cli
    sudo yum install gettext -y
    
    # Install eksctl
    ARCH=amd64
    PLATFORM=$(uname -s)_$ARCH
    curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz"
    tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp && rm eksctl_$PLATFORM.tar.gz
    sudo install -m 0755 /tmp/eksctl /usr/local/bin && rm /tmp/eksctl
    
    # Install kubectx, kubens cli
    sudo git clone https://github.com/ahmetb/kubectx /opt/kubectx
    sudo ln -s /opt/kubectx/kubectx /usr/local/bin/kubectx
    sudo ln -s /opt/kubectx/kubens /usr/local/bin/kubens
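
    With the tools in place, a quick sanity check helps catch anything missing from the PATH before proceeding. This is a minimal sketch; the exact versions reported do not matter:

    # Confirm each required CLI is installed and reachable
    for tool in aws envsubst eksctl kubectl kubectx kubens; do
      command -v "$tool" >/dev/null && echo "$tool: OK" || echo "$tool: MISSING"
    done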
    
  2. Create EKS Clusters:
    Now, deploy EKS clusters in both your chosen regions. Each cluster typically takes about 15 minutes to provision.

    # Create EKS cluster in Region 1 (Thailand)
    eksctl create cluster --name=$Region1ClusterName --enable-auto-mode --region=$Region1
    
    # Create EKS cluster in Region 2 (Singapore)
    eksctl create cluster --name=$Region2ClusterName --enable-auto-mode --region=$Region2
    
    # Validate each cluster's status (expected output: "ACTIVE")
    aws eks describe-cluster --name $Region1ClusterName --region $Region1 --output json --query 'cluster.status'
    aws eks describe-cluster --name $Region2ClusterName --region $Region2 --output json --query 'cluster.status'
    

    Upon successful creation, you should see two active EKS clusters in your AWS console, one for each region.
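
    eksctl writes a kubeconfig context for each cluster it creates. If either context is missing from your CloudShell session (for example, after the session was recycled), you can recreate them with the AWS CLI; a minimal sketch:

    # Recreate kubeconfig contexts for both clusters if they are missing
    aws eks update-kubeconfig --name $Region1ClusterName --region $Region1
    aws eks update-kubeconfig --name $Region2ClusterName --region $Region2

    # List the available contexts
    kubectx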

  3. Test EKS Cluster Connectivity:
    Verify that kubectl can connect to both EKS clusters.

    # Switch to Region 1 cluster and list nodes
    kubectx $(kubectx | grep "eks-cluster-th")
    kubectl get node
    
    # Switch to Region 2 cluster and list nodes
    kubectx $(kubectx | grep "eks-cluster-sg")
    kubectl get node
    

    Each command should list at least one node in the Ready state, confirming that kubectl can reach both clusters.

Step 2: Deploying Your Web Application

With the EKS clusters ready, we'll deploy a sample web application to each. This application will be accessible via an Application Load Balancer (ALB) and include a health check endpoint.

  1. Create IngressClass for ALB:
    First, configure the IngressClass specifically for ALB in both clusters.

    # Configure IngressClass for Region 1 (Thailand)
    kubectx $(kubectx | grep "eks-cluster-th")
    curl -s https://raw.githubusercontent.com/aws-samples/multi-region-ingress/refs/heads/main/ingressclassconfiguration.yaml | kubectl apply -f -
    kubectl get ingressclass,ingressclassparams
    
    # Configure IngressClass for Region 2 (Singapore)
    kubectx $(kubectx | grep "eks-cluster-sg")
    curl -s https://raw.githubusercontent.com/aws-samples/multi-region-ingress/refs/heads/main/ingressclassconfiguration.yaml | kubectl apply -f -
    kubectl get ingressclass,ingressclassparams
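
    The manifest applied above is maintained in the aws-samples repository, so its exact contents may change. Conceptually, an IngressClass wired to the EKS Auto Mode load balancing controller looks roughly like the sketch below (names and fields are illustrative, not the file itself):

    apiVersion: eks.amazonaws.com/v1
    kind: IngressClassParams
    metadata:
      name: alb          # parameters referenced by the IngressClass below
    ---
    apiVersion: networking.k8s.io/v1
    kind: IngressClass
    metadata:
      name: alb          # matches the ingressClassName used in Step 2.3
    spec:
      controller: eks.amazonaws.com/alb   # EKS Auto Mode's built-in ALB controller
      parameters:
        apiGroup: eks.amazonaws.com
        kind: IngressClassParams
        name: alb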
    
  2. Deploy Web Application Components:
    Deploy the web application (ServiceAccount, ClusterRole, Service, and Deployment) to both EKS clusters. Note the APP_NAME and BG_COLOR environment variables, which help distinguish between the two regional deployments.

    Deploy Web App in Region 1 (Thailand):

    # Switch to Region 1 cluster
    kubectx $(kubectx | grep "eks-cluster-th")
    
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: webapp-sa
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: webapp-clusterrole
    rules:
      - apiGroups: [""]
        resources: ["nodes"]
        verbs: ["get", "list"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: webapp-clusterrolebinding
    subjects:
      - kind: ServiceAccount
        name: webapp-sa
        namespace: default
    roleRef:
      kind: ClusterRole
      name: webapp-clusterrole
      apiGroup: rbac.authorization.k8s.io
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: webapp
      namespace: default
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: webapp
      template:
        metadata:
          labels:
            app: webapp
        spec:
          serviceAccountName: webapp-sa
          containers:
          - image: dumlutimuralp/demoapp
            name: webapp
            env:
            - name: APP_NAME
              value: "WebApp_Demo_TH"
            - name: BG_COLOR
              value: "green"
            - name: HEALTH
              value: "webapphealth"
            - name: MY_NODENAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: webappservice
    spec:
      selector:
        app: webapp
      ports:
        - protocol: TCP
          port: 80
    EOF
    

    Deploy Web App in Region 2 (Singapore):

    # Switch to Region 2 cluster
    kubectx $(kubectx | grep "eks-cluster-sg")
    
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: webapp-sa
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: webapp-clusterrole
    rules:
      - apiGroups: [""]
        resources: ["nodes"]
        verbs: ["get", "list"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: webapp-clusterrolebinding
    subjects:
      - kind: ServiceAccount
        name: webapp-sa
        namespace: default
    roleRef:
      kind: ClusterRole
      name: webapp-clusterrole
      apiGroup: rbac.authorization.k8s.io
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: webapp
      namespace: default
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: webapp
      template:
        metadata:
          labels:
            app: webapp
        spec:
          serviceAccountName: webapp-sa
          containers:
          - image: dumlutimuralp/demoapp
            name: webapp
            env:
            - name: APP_NAME
              value: "WebApp_Demo_SG"
            - name: BG_COLOR
              value: "red"
            - name: HEALTH
              value: "webapphealth"
            - name: MY_NODENAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: webappservice
    spec:
      selector:
        app: webapp
      ports:
        - protocol: TCP
          port: 80
    EOF
    
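    Before exposing the application, it is worth confirming that the webapp pod is running in each cluster:

    # Check the webapp pod in the Thailand cluster
    kubectx $(kubectx | grep "eks-cluster-th")
    kubectl get pods -l app=webapp

    # Check the webapp pod in the Singapore cluster
    kubectx $(kubectx | grep "eks-cluster-sg")
    kubectl get pods -l app=webapp
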
  3. Create Ingress for Web Application:
    Next, create an Ingress resource for each region, which will provision an AWS Application Load Balancer (ALB) to expose our web applications. Both ALBs will respond to the same hostname, web-demo.yourdomain.net.

    # Switch to Region 1 cluster (Thailand) and create Ingress
    kubectx $(kubectx | grep "eks-cluster-th")
    
    cat <<EOF | envsubst | kubectl apply -f -
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: webapp-ingress
      annotations:
        alb.ingress.kubernetes.io/load-balancer-name: $ALBName
        alb.ingress.kubernetes.io/scheme: internet-facing
        alb.ingress.kubernetes.io/target-type: ip
    spec:
      ingressClassName: alb
      rules:
      - host: $apphostname
        http:
          paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: webappservice
                port:
                   number: 80
      - http:
          paths:
          - path: /webapphealth
            pathType: Prefix
            backend:
              service:
                name: webappservice
                port:
                   number: 80
    EOF
    
    # Switch to Region 2 cluster (Singapore) and create Ingress
    kubectx $(kubectx | grep "eks-cluster-sg")
    
    cat <<EOF | envsubst | kubectl apply -f -
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: webapp-ingress
      annotations:
        alb.ingress.kubernetes.io/load-balancer-name: $ALBName
        alb.ingress.kubernetes.io/scheme: internet-facing
        alb.ingress.kubernetes.io/target-type: ip
    spec:
      ingressClassName: alb
      rules:
      - host: $apphostname
        http:
          paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: webappservice
                port:
                   number: 80
      - http:
          paths:
          - path: /webapphealth
            pathType: Prefix
            backend:
              service:
                name: webappservice
                port:
                   number: 80
    EOF
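
    Provisioning each ALB can take a few minutes. One way to retrieve the ALB DNS name once it is ready is to read it from the Ingress status (the field stays empty until the load balancer exists):

    # Read the ALB DNS name published on the Ingress status in each cluster
    kubectx $(kubectx | grep "eks-cluster-th")
    kubectl get ingress webapp-ingress -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'; echo

    kubectx $(kubectx | grep "eks-cluster-sg")
    kubectl get ingress webapp-ingress -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'; echo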
    
  4. Verify Deployments and Test Access:
    Confirm that all Kubernetes resources (Deployment, Service, Ingress) are created in both regions.

    kubectx $(kubectx | grep "eks-cluster-th")
    kubectl get all,ingress
    
    kubectx $(kubectx | grep "eks-cluster-sg")
    kubectl get all,ingress
    

    You can now access each web application directly via its ALB's domain name. You should see “WebApp_Demo_TH” with a green background for the Thailand region and “WebApp_Demo_SG” with a red background for the Singapore region.
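
    You can also test from the command line. Because the Ingress rule matches on the host header, pass it explicitly. The sketch below assumes the demo page body contains the APP_NAME value; TH_ALB and SG_ALB are local helper variables:

    # Query the Thailand ALB directly, sending the expected Host header
    kubectx $(kubectx | grep "eks-cluster-th")
    TH_ALB=$(kubectl get ingress webapp-ingress -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
    curl -s -H "Host: $apphostname" "http://$TH_ALB/" | grep -o "WebApp_Demo_[A-Z]*"

    # Repeat for the Singapore ALB
    kubectx $(kubectx | grep "eks-cluster-sg")
    SG_ALB=$(kubectl get ingress webapp-ingress -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
    curl -s -H "Host: $apphostname" "http://$SG_ALB/" | grep -o "WebApp_Demo_[A-Z]*"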

Step 3: Configuring Route 53 Health Checks and Alias Records

This is the core of our failover mechanism. We'll use Route 53 to monitor the health of our applications in both regions and intelligently route traffic.

  1. Create Route 53 Health Checks:
    Set up HTTP health checks in Route 53 to continuously monitor the /webapphealth endpoint of each ALB.

    # Find ALB Domain names
    export Region1ALB=$(aws elbv2 describe-load-balancers --region $Region1 --query "LoadBalancers[?contains(LoadBalancerName, '$ALBName')].DNSName | [0]" --output text)
    export Region2ALB=$(aws elbv2 describe-load-balancers --region $Region2 --query "LoadBalancers[?contains(LoadBalancerName, '$ALBName')].DNSName | [0]" --output text)
    
    # Create Route 53 health check for Region 1 (Thailand)
    aws route53 create-health-check --caller-reference "RegionTH-WebApp-$(date +%s)" \
    --health-check-config "{\"Port\":80,\"Type\":\"HTTP\",\"ResourcePath\":\"/webapphealth\",\"FullyQualifiedDomainName\":\"$Region1ALB\",\"RequestInterval\":10,\"FailureThreshold\":2,\"MeasureLatency\":true,\"Inverted\":false,\"Disabled\":false,\"EnableSNI\":false}"
    
    # Create Route 53 health check for Region 2 (Singapore)
    aws route53 create-health-check --caller-reference "RegionSG-WebApp-$(date +%s)" \
    --health-check-config "{\"Port\":80,\"Type\":\"HTTP\",\"ResourcePath\":\"/webapphealth\",\"FullyQualifiedDomainName\":\"$Region2ALB\",\"RequestInterval\":10,\"FailureThreshold\":2,\"MeasureLatency\":true,\"Inverted\":false,\"Disabled\":false,\"EnableSNI\":false}"
    

    Verify the health checks are created in the Route 53 console. Initially, both should show as “Healthy.”
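
    If you prefer the CLI over the console, you can list the health checks and inspect what the Route 53 checkers currently see:

    # List the health checks with their monitored endpoints
    aws route53 list-health-checks \
      --query "HealthChecks[].{Id:Id,Endpoint:HealthCheckConfig.FullyQualifiedDomainName,Path:HealthCheckConfig.ResourcePath}" \
      --output table

    # Show individual checker observations for one health check (replace <health-check-id>)
    # aws route53 get-health-check-status --health-check-id <health-check-id>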

  2. Create Route 53 Alias Records for Failover:
    Now, create two Alias records for your application's hostname (web-demo.yourdomain.net), pointing to the ALBs in each region. We'll use a Weighted Routing Policy where Region 1 (Thailand) is the primary (weight 100) and Region 2 (Singapore) is the secondary (weight 0). This setup ensures traffic is always directed to Region 1 unless its health check fails.

    # Find Health Check IDs
    export HealthCheckRegion1AppId=$(aws route53 list-health-checks \
      --query "HealthChecks[?HealthCheckConfig.FullyQualifiedDomainName=='$Region1ALB' && HealthCheckConfig.ResourcePath=='/webapphealth'] | [0].Id" \
      --output text)
    
    export HealthCheckRegion2AppId=$(aws route53 list-health-checks \
      --query "HealthChecks[?HealthCheckConfig.FullyQualifiedDomainName=='$Region2ALB' && HealthCheckConfig.ResourcePath=='/webapphealth'] | [0].Id" \
      --output text)
    
    # Find Hosted Zone ID of ALBs
    export Region1ALBHostedZoneId=$(aws elbv2 describe-load-balancers --region $Region1 --query "LoadBalancers[?DNSName=='$Region1ALB'].CanonicalHostedZoneId" --output text)
    export Region2ALBHostedZoneId=$(aws elbv2 describe-load-balancers --region $Region2 --query "LoadBalancers[?DNSName=='$Region2ALB'].CanonicalHostedZoneId" --output text)
    
    # Create alias record for Region 1 (Thailand) with weight 100
    aws route53 change-resource-record-sets --hosted-zone-id $HostedZoneId \
      --change-batch "{\"Changes\":[{\"Action\":\"CREATE\",\"ResourceRecordSet\":{\"Name\":\"$apphostname\",\"Type\":\"A\",\"SetIdentifier\":\"Region1-Primary\",\"Weight\":100,\"HealthCheckId\":\"$HealthCheckRegion1AppId\",\"AliasTarget\":{\"HostedZoneId\":\"$Region1ALBHostedZoneId\",\"DNSName\":\"$Region1ALB\",\"EvaluateTargetHealth\":false}}}]}"
    
    # Create alias record for Region 2 (Singapore) with weight 0
    aws route53 change-resource-record-sets --hosted-zone-id $HostedZoneId \
      --change-batch "{\"Changes\":[{\"Action\":\"CREATE\",\"ResourceRecordSet\":{\"Name\":\"$apphostname\",\"Type\":\"A\",\"SetIdentifier\":\"Region2-Secondary\",\"Weight\":0,\"HealthCheckId\":\"$HealthCheckRegion2AppId\",\"AliasTarget\":{\"HostedZoneId\":\"$Region2ALBHostedZoneId\",\"DNSName\":\"$Region2ALB\",\"EvaluateTargetHealth\":false}}}]}"
    

    Verify the DNS records are created in your Route 53 Hosted Zone. When you access `http://web-demo.yourdomain.net`, you should initially be routed to the green “WebApp_Demo_TH” in Region 1.
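
    To check which record is currently being served, you can resolve the hostname and compare it against the Region 1 ALB (assuming dig is available; install bind-utils or use nslookup otherwise):

    # While Region 1 is healthy, both commands should return the same set of IP addresses
    dig +short $apphostname
    dig +short $Region1ALB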

Step 4: Testing Application Failure and Failover

Now for the crucial test: simulating a failure in our primary region to observe the automatic failover.

  1. Simulate Failure in Primary Region:
    Scale down the web application deployment in Region 1 (Thailand) to zero replicas. This will cause its health check to fail.

    kubectx $(kubectx | grep "eks-cluster-th")
    kubectl scale deployment webapp --replicas=0
    

    After scaling down, monitor the Route 53 Health Checks. The health check for Region 1 should quickly transition to “Unhealthy.”
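
    You can also follow the transition from the CLI. A minimal sketch, assuming $HealthCheckRegion1AppId from Step 3 is still set in your shell:

    # Poll the Region 1 health check every 10 seconds until the checkers report failure
    while true; do
      aws route53 get-health-check-status \
        --health-check-id $HealthCheckRegion1AppId \
        --query 'HealthCheckObservations[].StatusReport.Status' \
        --output text
      sleep 10
    done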

  2. Observe Automatic Failover:
    Once Region 1's health check is unhealthy, try accessing `http://web-demo.yourdomain.net` again. Route 53, detecting the failure in the primary weighted record, will automatically direct traffic to the healthy secondary record (Region 2). You should now see the red “WebApp_Demo_SG” page, confirming a successful failover! A simple command-line way to watch the switch is sketched after the note below.

    • Note on Failover Speed: The speed of redirection depends on several factors, including your Route 53 health check configuration (interval, failure threshold), client-side DNS TTL settings, and ALB HTTP Keepalive configurations.
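
    The switch can also be watched from the command line. A minimal sketch, assuming the demo page body contains the APP_NAME value:

    # Request the application every 10 seconds and print which regional deployment answered
    while true; do
      curl -s "http://$apphostname/" | grep -o "WebApp_Demo_[A-Z]*"
      sleep 10
    done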

Step 5: Cleaning Up Your Environment

To avoid unnecessary costs, remember to clean up all the resources created during this workshop.

  1. Delete Route 53 Alias Records:
    aws route53 change-resource-record-sets --hosted-zone-id $HostedZoneId \
      --change-batch "{\"Changes\":[{\"Action\":\"DELETE\",\"ResourceRecordSet\":{\"Name\":\"$apphostname\",\"Type\":\"A\",\"SetIdentifier\":\"Region1-Primary\",\"Weight\":100,\"HealthCheckId\":\"$HealthCheckRegion1AppId\",\"AliasTarget\":{\"HostedZoneId\":\"$Region1ALBHostedZoneId\",\"DNSName\":\"$Region1ALB\",\"EvaluateTargetHealth\":false}}}]}"
    
    aws route53 change-resource-record-sets --hosted-zone-id $HostedZoneId \
      --change-batch "{\"Changes\":[{\"Action\":\"DELETE\",\"ResourceRecordSet\":{\"Name\":\"$apphostname\",\"Type\":\"A\",\"SetIdentifier\":\"Region2-Secondary\",\"Weight\":0,\"HealthCheckId\":\"$HealthCheckRegion2AppId\",\"AliasTarget\":{\"HostedZoneId\":\"$Region2ALBHostedZoneId\",\"DNSName\":\"$Region2ALB\",\"EvaluateTargetHealth\":false}}}]}"
    
  2. Delete Route 53 Health Checks:
    aws route53 delete-health-check --health-check-id $HealthCheckRegion1AppId
    aws route53 delete-health-check --health-check-id $HealthCheckRegion2AppId
    
  3. Delete EKS Clusters:
    eksctl delete cluster --name=$Region1ClusterName --region=$Region1
    eksctl delete cluster --name=$Region2ClusterName --region=$Region2
    

Key Learnings from this Workshop

This practical exercise offers several important insights into multi-region failover design:

  1. Granular Failover Effect: If multiple services share the same ALB, Route 53 might not failover the entire region if only specific services become unhealthy. This is due to the EvaluateTargetHealth: false setting in the Alias records, which tells Route 53 to rely solely on its own health checks.
  2. Active-Active Routing: For an active-active failover pattern, where traffic is distributed across both regions simultaneously, configure identical routing policies (e.g., equal weights) for both Alias records.
  3. HTTPS Health Check Considerations: Route 53 health checkers do not validate SSL/TLS certificates when configured for HTTPS endpoints.
  4. Monitoring Private IPs: Route 53 health checks cannot directly monitor endpoints with private IP addresses. For private resources, consider using AWS CloudWatch alarms or AWS Lambda functions for health monitoring.
  5. Cost of Health Checks: Be mindful of the number of Route 53 Health Checks you create, as they incur costs, especially at scale.

Extending Multi-Region Resiliency to Other AWS Services

While this guide focused on EKS, a comprehensive multi-region strategy must encompass all your application's dependencies. Here's a brief overview of how to achieve multi-region readiness for other common AWS services:

  • Amazon RDS: For MySQL, PostgreSQL, or Aurora, you can configure cross-region Read Replicas. Aurora Global Database offers even higher availability and lower RTO/RPO but might have regional limitations.
  • Amazon MSK (Kafka): MSK clusters do not span regions natively, so replication tooling is required: MSK Replicator (AWS-managed) or open-source options such as MirrorMaker 2 can copy topics between regional clusters.
  • Amazon S3: S3 natively supports Cross-Region Replication (CRR). You can configure policies to automatically replicate critical bucket data to another region.
  • Amazon EBS: EBS volumes are zonal. To protect against regional failures, create snapshots and replicate them to the target region, then restore the snapshot to a new EBS volume in the failover region (see the sketch after this list).
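
As a concrete illustration of the EBS point above, cross-region snapshot copies can be scripted with the AWS CLI. A minimal sketch (the snapshot ID is a placeholder, and the command is sent to the destination region):

    # Copy an existing EBS snapshot from the primary region into the secondary region
    aws ec2 copy-snapshot \
      --region $Region2 \
      --source-region $Region1 \
      --source-snapshot-id snap-0123456789abcdef0 \
      --description "DR copy of application data volume"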

For more in-depth practices, explore the “Creating a Multi-Region Application with AWS Services” series on the AWS Architecture Blog.

Conclusion

This article has demonstrated the fundamental steps to design and implement a multi-region failover system for applications running on Amazon EKS. Whether you're planning a large-scale project or enhancing an existing system, this approach provides a robust framework for increased resilience.

Ultimately, deciding how extensively to implement multi-region failover (for every service or only for critical ones) involves a careful balance. It's essential to weigh the additional investment (infrastructure costs, operational overhead) against your organization's acceptable Recovery Time Objective (RTO) and Recovery Point Objective (RPO), and the potential business impact of downtime. Focusing multi-region efforts on truly critical services and data is often the most cost-effective strategy. Crucially, regularly planning, conducting, and testing your Disaster Recovery (DR) procedures across regions is paramount to ensure the system performs as expected when an actual emergency occurs.

We hope this guide provides valuable insights. Feel free to share your questions or experiences with multi-region failover on EKS in the comments. May your infrastructure always be ready for any situation!

Reference: Original Workshop: https://aws.amazon.com/blogs/containers/implementing-granular-failover-in-multi-region-amazon-eks/
