AWS Disaster Recovery: Warm Standby Strategy
In the AWS Certified Solutions Architect – Associate (SAA-C03) exam, understanding Disaster Recovery (DR) strategies is critical. Warm Standby is a mid-tier DR strategy that balances cost and recovery speed. It involves maintaining a scaled-down but functional version of your application environment in a secondary AWS Region, ready to scale up at a moment’s notice.
Core Concepts of Warm Standby
Warm Standby is often confused with “Pilot Light.” The key difference is that in Warm Standby, business-critical services are already running and serving requests, albeit at a lower capacity. In Pilot Light, the data layer is live but application servers are not running: they exist only as AMIs or as stopped EC2 instances.
Key Characteristics:
- RTO (Recovery Time Objective): Minutes. Because services are already running, failover involves DNS changes and auto-scaling, not infrastructure deployment.
- RPO (Recovery Point Objective): Seconds to Minutes. Usually involves asynchronous data replication (e.g., RDS Cross-Region Read Replicas).
- Cost: Medium. You are paying for “always-on” smaller instances and data transfer.
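The “minutes” RTO can be broken down with some rough arithmetic. The sketch below is only an illustration: the 30-second health check interval and failure threshold of 3 are common Route 53 health check settings, while the 60-second DNS TTL and 300-second scale-out time are assumed values, not figures from this guide.

```python
def estimate_failover_seconds(
    check_interval_s: int = 30,   # health check request interval (assumed)
    failure_threshold: int = 3,   # consecutive failures before "unhealthy" (assumed)
    dns_ttl_s: int = 60,          # how long resolvers may cache the old answer (assumed)
    scale_out_s: int = 300,       # time for the standby ASG to reach full size (assumed)
) -> tuple[int, int]:
    """Return (seconds until degraded service, seconds until full capacity)."""
    # Time for the health check to declare the primary endpoint unhealthy.
    detection_s = check_interval_s * failure_threshold
    # The standby is already running, so it serves traffic once DNS has shifted.
    degraded_service_s = detection_s + dns_ttl_s
    # Full capacity is only reached after the scale-out completes.
    full_capacity_s = degraded_service_s + scale_out_s
    return degraded_service_s, full_capacity_s


degraded, full = estimate_failover_seconds()
print(f"Reduced capacity after ~{degraded}s, full capacity after ~{full}s")
```

With these assumed inputs, the standby starts taking traffic after roughly 150 seconds and reaches full capacity after roughly 450 seconds, which is why Warm Standby lands in the “minutes” range.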
Comparison of DR Strategies
| Strategy | RTO / RPO | Cost | Infrastructure Status |
|---|---|---|---|
| Backup & Restore | Hours / up to 24 hours | $ | None. Restore from S3/EBS snapshots. |
| Pilot Light | Tens of minutes | $$ | Database live; App servers off/stopped. |
| Warm Standby | Minutes | $$$ | Scaled-down version always running. |
| Multi-Site Active-Active | Near Zero | $$$$ | Full capacity running in 2+ regions. |
Decision Matrix: If-Then Guide
- IF the requirement says “Recovery within minutes” and “Minimize costs compared to Active-Active”… THEN choose Warm Standby.
- IF the requirement says “Database must be up-to-date but App servers can be started via CloudFormation/Terraform”… THEN choose Pilot Light.
- IF the requirement mentions “Zero downtime” or “Zero data loss”… THEN choose Multi-Site Active-Active.
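The If-Then guide above can be written as a small lookup function. This is a minimal sketch for study purposes: the function name and the numeric cutoffs (1, 10, and 60 minutes) are assumptions chosen to mirror the comparison table, not official AWS thresholds.

```python
def choose_dr_strategy(rto_minutes: float, zero_data_loss: bool = False) -> str:
    """Map a recovery requirement to one of the four DR strategies.

    The cutoffs are illustrative assumptions, not AWS-defined limits.
    """
    # "Zero downtime" or "zero data loss" points to running everywhere at once.
    if zero_data_loss or rto_minutes < 1:
        return "Multi-Site Active-Active"
    # Recovery within minutes needs services that are already running.
    if rto_minutes <= 10:
        return "Warm Standby"
    # Tens of minutes leaves time to start application servers.
    if rto_minutes <= 60:
        return "Pilot Light"
    # Hours of tolerance allows restoring from backups.
    return "Backup & Restore"


print(choose_dr_strategy(rto_minutes=5))
print(choose_dr_strategy(rto_minutes=30))
```

In exam questions the wording matters more than exact numbers, so treat the cutoffs as a memory aid rather than a rule.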
Exam Tips and Gotchas
- Route 53 Failover: The SAA exam expects you to know that Route 53 Health Checks and Failover Routing Policies are the primary mechanism to trigger the shift from Primary to Warm Standby.
- Auto Scaling Group (ASG): In a Warm Standby, your ASG in the secondary region has a “Minimum Capacity” of 1 or 2, while the Primary might have 10. Failover involves raising the ASG “Desired Capacity” to 10, so the “Maximum Capacity” of the standby ASG must already allow that size.
- RDS Cross-Region: Use RDS Read Replicas for the standby region. During failover, you must promote the Read Replica to a standalone Primary database. Promotion stops replication from the old source and cannot be reversed.
- Data Transfer Costs: Remember that Cross-Region Data Transfer (replication) is a significant part of the Warm Standby cost model.
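To make the Route 53 tip concrete, here is a hedged sketch of the pair of failover records. It only builds the request payload in the shape used by the Route 53 `ChangeResourceRecordSets` API; the domain name, load balancer DNS names, hosted zone IDs, and health check ID are all hypothetical placeholders.

```python
def failover_alias_record(name, role, lb_dns_name, lb_zone_id, health_check_id=None):
    """Build one UPSERT change for a failover alias record (PRIMARY or SECONDARY)."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be 'PRIMARY' or 'SECONDARY'")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-endpoint",  # must be unique per record
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": lb_zone_id,       # the load balancer's own zone ID
            "DNSName": lb_dns_name,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        # Associates a health check with the record, typically the primary one.
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name, primary, secondary):
    """Combine the primary and standby records into a single change batch."""
    return {
        "Comment": "Warm Standby failover pair",
        "Changes": [
            failover_alias_record(name, "PRIMARY", **primary),
            failover_alias_record(name, "SECONDARY", **secondary),
        ],
    }


# Hypothetical placeholder values, for illustration only.
batch = failover_change_batch(
    "app.example.com.",
    primary={"lb_dns_name": "primary-alb.example.com.", "lb_zone_id": "ZPRIMARYEXAMPLE",
             "health_check_id": "hc-primary-example"},
    secondary={"lb_dns_name": "standby-alb.example.com.", "lb_zone_id": "ZSTANDBYEXAMPLE"},
)
print([c["ResourceRecordSet"]["Failover"] for c in batch["Changes"]])
```

With boto3, a batch like this would be passed as the `ChangeBatch` argument of `change_resource_record_sets` on a Route 53 client, together with the `HostedZoneId` of the domain's hosted zone.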
Topics covered in this guide:
- Definition of Warm Standby vs. Pilot Light.
- RTO and RPO targets for SAA-C03 scenarios.
- Route 53 Failover and Health Check integration.
- Scaling strategies (ASG minimums) during a disaster event.
- Database replication patterns (RDS Promotion).
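- Route 53: Uses DNS failover to point traffic at the load balancer in the standby Region.
- RDS/Aurora: Cross-region replicas keep a near-current copy of the data in the standby Region.
- IAM: IAM is a global service, so roles are available in both Regions; check that their policies do not hard-code Region-specific resource ARNs.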
During failover, the ASG in Region B triggers a scale-out event. Performance is degraded for the first few minutes, until the new instances launch, pass health checks, and the “Desired Capacity” is reached.
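The failover steps described in this guide (promote the replica, then scale out the standby ASG) could be scripted along these lines. This is a sketch under assumptions: the function name, the replica identifier, and the ASG name are hypothetical, and the clients are passed in as arguments so the logic can be exercised without AWS access. The two method names match the boto3 RDS and Auto Scaling clients.

```python
def fail_over_to_standby(rds_client, autoscaling_client,
                         replica_id, asg_name, full_capacity):
    """Promote the standby database, then scale the standby fleet to full size."""
    # Step 1: promote the cross-Region read replica to a standalone primary.
    rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)

    # Step 2: raise desired capacity; MaxSize is raised too so the request is valid.
    autoscaling_client.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MaxSize=full_capacity,
        DesiredCapacity=full_capacity,
    )
    return ["promoted " + replica_id, f"scaled {asg_name} to {full_capacity}"]


# Example wiring (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# rds = boto3.client("rds", region_name="us-west-2")
# asg = boto3.client("autoscaling", region_name="us-west-2")
# fail_over_to_standby(rds, asg, "standby-replica", "standby-asg", 10)
```

Promotion and scale-out both take time to complete after the calls return, so a real runbook would also poll for the database and instances to become available before declaring recovery.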
Tip: If you expect a massive traffic spike immediately upon failover, ask AWS Support ahead of time to pre-warm the Elastic Load Balancer in the standby Region.
To save money, use T-series instances (burstable) in the standby region for the minimal footprint, then switch to M or C series during a disaster via ASG Launch Template updates.
Production Use Case:
A banking application requires that if the us-east-1 region fails, the system must be back online in under 10 minutes. The team keeps 2 small EC2 instances running in us-west-2, along with an RDS instance that replicates from the primary database. Because the standby endpoint is already serving, its health check passes, and users can log in immediately while the rest of the fleet boots up.