AWS Disaster Recovery: Warm Standby Strategy

In the AWS Certified Solutions Architect – Associate (SAA-C03) exam, understanding Disaster Recovery (DR) strategies is critical. Warm Standby is a mid-tier DR strategy that balances cost and recovery speed. It involves maintaining a scaled-down but functional version of your application environment in a secondary AWS Region, ready to scale up at a moment’s notice.

The Real-World Analogy: Think of a spare vehicle kept in your driveway. With “Pilot Light,” you have only the keys and a battery charger, so the vehicle must be started before you can use it. With Warm Standby, the engine is already idling. The spare is smaller than your primary SUV (say, a compact car), but when the SUV breaks down you can drive off immediately while you wait for a full-size rental (scaling) to arrive.

Core Concepts of Warm Standby

Warm Standby is often confused with “Pilot Light.” The key difference is that in Warm Standby, business-critical services are already running and able to serve traffic, albeit at a lower capacity. In Pilot Light, the data layer is live but application servers are not running: they exist only as AMIs or as stopped EC2 instances.

Key Characteristics:

  • RTO (Recovery Time Objective): Minutes. Because services are already running, failover involves DNS changes and auto-scaling, not infrastructure deployment.
  • RPO (Recovery Point Objective): Seconds to Minutes. Usually involves asynchronous data replication (e.g., RDS Cross-Region Read Replicas).
  • Cost: Medium. You are paying for “always-on” smaller instances and data transfer.

Comparison of DR Strategies

Strategy                 | RTO / RPO       | Cost | Infrastructure Status
Backup & Restore         | Hours / 24h     | $    | None. Restore from S3/EBS snapshots.
Pilot Light              | Tens of minutes | $$   | Database live; app servers off/stopped.
Warm Standby             | Minutes         | $$$  | Scaled-down version always running.
Multi-Site Active-Active | Near zero       | $$$$ | Full capacity running in 2+ Regions.

Decision Matrix: If-Then Guide

  • IF the requirement says “Recovery within minutes” and “Minimize costs compared to Active-Active”… THEN choose Warm Standby.
  • IF the requirement says “Database must be up-to-date but App servers can be started via CloudFormation/Terraform”… THEN choose Pilot Light.
  • IF the requirement mentions “Zero downtime” or “Zero data loss”… THEN choose Multi-Site Active-Active.
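
The decision matrix and the comparison table above can be condensed into a small lookup. The sketch below is only a study aid: the function name `choose_dr_strategy` and the minute cut-offs (1, 10, and 60) are illustrative assumptions, not AWS-defined thresholds.

```python
def choose_dr_strategy(rto_minutes: float) -> str:
    """Map a required RTO (in minutes) to a DR strategy name.

    The cut-offs are illustrative assumptions chosen to mirror the
    table's rough tiers (near zero / minutes / tens of minutes / hours).
    """
    if rto_minutes < 0:
        raise ValueError("rto_minutes must be non-negative")
    if rto_minutes < 1:
        return "Multi-Site Active-Active"
    if rto_minutes <= 10:
        return "Warm Standby"
    if rto_minutes <= 60:
        return "Pilot Light"
    return "Backup & Restore"


if __name__ == "__main__":
    for rto in (0, 5, 30, 240):
        print(rto, "->", choose_dr_strategy(rto))
```

Real exam questions also weigh cost and RPO wording, so treat the output as a first guess rather than a rule.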

Exam Tips and Gotchas

  • Route 53 Failover: The SAA exam expects you to know that Route 53 Health Checks and Failover Routing Policies are the primary mechanism to trigger the shift from Primary to Warm Standby.
  • Auto Scaling Group (ASG): In a Warm Standby, your ASG in the secondary region has a “Minimum Capacity” of 1 or 2, while the Primary might have 10. Failover involves updating the ASG “Desired Capacity” to 10. The group’s “Maximum Capacity” must already allow that value, or be raised in the same update.
  • RDS Cross-Region: Use RDS Read Replicas for the standby region. During failover, you must promote the Read Replica to a standalone Primary database.
  • Data Transfer Costs: Remember that Cross-Region Data Transfer (replication) is a significant part of the Warm Standby cost model.
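
The ASG and RDS steps above might be scripted roughly as follows. This is a minimal sketch, not a production runbook: the function name and its arguments are hypothetical, and it assumes the two clients are boto3 `autoscaling` and `rds` clients created for the standby Region. Only `promote_read_replica` and `update_auto_scaling_group` are real boto3 operations; the DNS shift is assumed to be handled separately by Route 53.

```python
def fail_over_to_standby(autoscaling, rds, asg_name, replica_id, full_capacity):
    """Promote the standby database and scale the standby fleet up.

    `autoscaling` and `rds` are expected to behave like boto3 clients
    for the standby Region; passing them in keeps the function testable.
    """
    # Promotion breaks replication and makes the replica a writable,
    # standalone instance. It takes time, so start it first.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    # DesiredCapacity may not exceed MaxSize, so set both together.
    # (Assumption: full_capacity is also the ceiling you want.)
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MaxSize=full_capacity,
        DesiredCapacity=full_capacity,
    )


# Example wiring (all resource names are hypothetical placeholders):
#   import boto3
#   fail_over_to_standby(
#       boto3.client("autoscaling", region_name="us-west-2"),
#       boto3.client("rds", region_name="us-west-2"),
#       asg_name="web-standby",
#       replica_id="app-db-replica",
#       full_capacity=10,
#   )
```

Both calls return before the work is finished, so a real runbook would also wait for the database to become available and for the new instances to pass health checks.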

Topics covered in this guide:

  • Definition of Warm Standby vs. Pilot Light.
  • RTO and RPO targets for SAA-C03 scenarios.
  • Route 53 Failover and Health Check integration.
  • Scaling strategies (ASG minimums) during a disaster event.
  • Database replication patterns (RDS Promotion).

AWS Warm Standby Architecture

[Diagram: Region A (Primary, full scale) runs an ASG at full capacity and the primary DB, which replicates to Region B (Standby, scaled down), where an ASG runs at minimum capacity 1 alongside a Read Replica.]

Service Ecosystem

Route 53: Uses DNS failover to point to the Standby ELB.

RDS/Aurora: Cross-region replicas ensure data is ready.

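IAM: IAM is a global service, so roles are available in both Regions automatically. Make sure their policies also grant access to the standby Region’s resources.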
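
To make the Route 53 entry above more concrete, here is a sketch of the failover record pair expressed as the `ChangeBatch` structure that the Route 53 `ChangeResourceRecordSets` API accepts. The helper names, the domain, and the load balancer DNS names and zone IDs in the usage comment are hypothetical placeholders. The code only builds the dictionary and does not call AWS.

```python
def failover_record(name, role, elb_dns, elb_zone_id, health_check_id=None):
    """Build one UPSERT change for an alias A record with failover routing."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be 'PRIMARY' or 'SECONDARY'")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-endpoint",
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": elb_zone_id,   # the load balancer's zone, not yours
            "DNSName": elb_dns,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name, primary, secondary, primary_health_check_id):
    """primary / secondary are (elb_dns, elb_zone_id) tuples."""
    return {
        "Comment": "Warm standby failover pair",
        "Changes": [
            failover_record(name, "PRIMARY", *primary,
                            health_check_id=primary_health_check_id),
            failover_record(name, "SECONDARY", *secondary),
        ],
    }


# Usage sketch (placeholder values):
#   batch = failover_change_batch(
#       "app.example.com.",
#       ("primary-alb.example.com.", "ZPRIMARYEXAMPLE"),
#       ("standby-alb.example.com.", "ZSTANDBYEXAMPLE"),
#       "hc-primary-example",
#   )
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="ZMYZONEEXAMPLE", ChangeBatch=batch)
```

With this pair in place, Route 53 answers with the secondary record only while the primary is considered unhealthy.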

Performance & Scaling

During failover, the ASG in Region B triggers a scale-out event. Performance is degraded for the first few minutes until the “Desired Capacity” is reached.

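Tip: If you expect a massive traffic spike immediately upon failover, consider asking AWS Support to pre-warm your Elastic Load Balancers, because a load balancer otherwise scales up gradually as traffic grows.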
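
The degraded window can be estimated with simple arithmetic. The helper below is a hypothetical illustration that assumes identical instances and evenly spread load, which real fleets only approximate.

```python
def capacity_fraction(running: int, desired: int) -> float:
    """Share of full capacity available with `running` of `desired` instances."""
    if desired <= 0:
        raise ValueError("desired must be positive")
    return min(running, desired) / desired


# A standby of 2 instances taking over for a fleet of 10:
print(capacity_fraction(2, 10))  # 0.2
```

In other words, a 2-instance standby can absorb about 20% of full load at the moment of failover, which is why the scale-out should be triggered as early as possible.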

Cost Optimization

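To save money, use T-series (burstable) instances in the standby region for the minimal footprint, then switch to M or C series during a disaster by pointing the ASG at a new Launch Template version. The new instance type applies only to instances launched after the change, so the existing T-series instances stay in service until they are replaced.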

Production Use Case:

A banking application requires that if the us-east-1 region fails, the system must be back online in under 10 minutes. The team keeps 2 small EC2 instances running in us-west-2 along with an RDS cross-region Read Replica. Because those instances are already serving, the standby endpoint passes its health check, and users can log in as soon as DNS fails over while the rest of the fleet boots up.
