AWS Disaster Recovery: Warm Standby Strategy

In the AWS Certified Solutions Architect – Associate (SAA-C03) exam, understanding Disaster Recovery (DR) strategies is critical. Warm Standby is a mid-tier DR strategy that balances cost and recovery speed. It involves maintaining a scaled-down but functional version of your application environment in a secondary AWS Region, ready to scale up at a moment’s notice.

The Real-World Analogy: Think of a second vehicle kept in your driveway with the engine already idling. With “Pilot Light” you keep only the keys and a battery charger; with Warm Standby the vehicle is running. It is not a full-sized SUV like your primary car (it might be a small moped), but it works right now. When the primary car breaks down, you hop on the moped and start driving immediately while you wait for a rental car (scaling) to arrive.

Core Concepts of Warm Standby

Warm Standby is often confused with “Pilot Light.” The key difference is that in Warm Standby, business-critical services are already running and can serve traffic, albeit at a lower capacity. In Pilot Light, only the core data layer is live; application servers are either not deployed at all (they exist only as AMIs) or sit as stopped EC2 instances.

Key Characteristics:

RTO (Recovery Time Objective): Minutes. Because services are already running, failover involves DNS changes and auto-scaling, not infrastructure deployment.
RPO (Recovery Point Objective): Seconds to Minutes. Usually involves asynchronous data replication (e.g., RDS Cross-Region Read Replicas).
Cost: Medium. You are paying for “always-on” smaller instances and data transfer.

Comparison of DR Strategies

Strategy	RTO / RPO	Cost	Infrastructure Status
Backup & Restore	Hours (RTO) / up to 24h (RPO)	$	None. Restore from S3/EBS snapshots.
Pilot Light	Tens of minutes	$$	Database live; App servers off/stopped.
Warm Standby	Minutes	$$$	Scaled-down version always running.
Multi-Site Active-Active	Near Zero	$$$$	Full capacity running in 2+ regions.

Decision Matrix: If-Then Guide

IF the requirement says “Recovery within minutes” and “Minimize costs compared to Active-Active”… THEN choose Warm Standby.
IF the requirement says “Database must be up-to-date but App servers can be started via CloudFormation/Terraform”… THEN choose Pilot Light.
IF the requirement mentions “Near-zero downtime” or “Near-zero data loss”… THEN choose Multi-Site Active-Active.
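The If-Then guide above can be sketched as a small Python function. This is only an illustration: the function name and the numeric thresholds (1, 10, and 60 minutes) are assumptions chosen to match the rough tiers in the comparison table, not official AWS or exam cut-offs.

```python
def choose_dr_strategy(rto_minutes: float, rpo_minutes: float) -> str:
    """Map recovery targets to a DR strategy (thresholds are illustrative assumptions)."""
    # Near-zero recovery time needs full capacity already running elsewhere.
    if rto_minutes < 1:
        return "Multi-Site Active-Active"
    # "Recovery within minutes": a scaled-down copy must already be running.
    if rto_minutes <= 10:
        return "Warm Standby"
    # Tens of minutes, or a database that must stay nearly current while
    # app servers can be started on demand.
    if rto_minutes <= 60 or rpo_minutes < 60:
        return "Pilot Light"
    # Hours of downtime and data loss are acceptable.
    return "Backup & Restore"


if __name__ == "__main__":
    print(choose_dr_strategy(rto_minutes=5, rpo_minutes=1))
    print(choose_dr_strategy(rto_minutes=240, rpo_minutes=1440))
```

In an exam question, the cost wording usually breaks ties: when two strategies both meet the RTO, the cheaper one is the expected answer.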

Exam Tips and Gotchas

Route 53 Failover: The SAA exam expects you to know that Route 53 Health Checks and Failover Routing Policies are the primary mechanism to trigger the shift from Primary to Warm Standby.
Auto Scaling Group (ASG): In a Warm Standby, your ASG in the secondary region has a “Minimum Capacity” of 1 or 2, while the Primary might have 10. Failover involves updating the ASG “Desired Capacity” to 10, which only works if the group's “Maximum Capacity” is at least 10.
RDS Cross-Region: Use RDS Read Replicas for the standby region. During failover, you must promote the Read Replica to a standalone Primary database.
Data Transfer Costs: Remember that Cross-Region Data Transfer (replication) is a significant part of the Warm Standby cost model.
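The ASG and RDS steps from the tips above can be sketched as a short failover routine. This is a minimal sketch, not a production runbook: the function name and the resource identifiers are hypothetical, and the two client arguments are assumed to behave like boto3 `rds` and `autoscaling` clients created for the standby region. Passing the clients in keeps the logic testable without AWS credentials.

```python
from typing import Any


def fail_over_to_standby(
    rds: Any,
    autoscaling: Any,
    replica_id: str,
    asg_name: str,
    full_capacity: int,
) -> None:
    """Promote the standby database and scale the standby fleet to full size."""
    # Step 1: promote the cross-region read replica so it accepts writes.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    # Step 2: look up the current ceiling, because DesiredCapacity
    # cannot exceed MaxSize.
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"][0]

    # Step 3: raise the ceiling only if needed, then request full capacity.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MaxSize=max(group["MaxSize"], full_capacity),
        DesiredCapacity=full_capacity,
    )


# With real AWS access this might be called as follows (requires boto3 and
# credentials; the identifiers are placeholders):
#   import boto3
#   fail_over_to_standby(
#       boto3.client("rds", region_name="us-west-2"),
#       boto3.client("autoscaling", region_name="us-west-2"),
#       replica_id="app-db-replica",
#       asg_name="app-standby-asg",
#       full_capacity=10,
#   )
```

Promotion is one-way: once the replica is a standalone database it no longer replicates from the old primary, so failing back needs its own plan.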

Topics covered in this guide:

Definition of Warm Standby vs. Pilot Light.
RTO and RPO targets for SAA-C03 scenarios.
Route 53 Failover and Health Check integration.
Scaling strategies (ASG minimums) during a disaster event.
Database replication patterns (RDS Promotion).

AWS Warm Standby Architecture

Service Ecosystem

Route 53: Uses DNS failover to point to the Standby ELB.

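RDS/Aurora: Cross-region replicas keep a near-current copy of the data in the standby region, ready to be promoted.

IAM: IAM is a global service, so roles exist in both regions automatically. Check that their policies also grant access to the standby region's resources (for example, region-specific ARNs).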
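To make the Route 53 piece concrete, here is a sketch of the failover record pair expressed as the `ChangeBatch` dictionary that the Route 53 `ChangeResourceRecordSets` API accepts. The helper names, domain name, ELB DNS names, hosted zone IDs, and health check ID are all placeholders invented for illustration.

```python
from typing import Optional


def failover_record(
    name: str,
    role: str,
    elb_dns_name: str,
    elb_hosted_zone_id: str,
    health_check_id: Optional[str] = None,
) -> dict:
    """Build one alias record for a failover routing policy."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be 'PRIMARY' or 'SECONDARY'")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-elb",
        "Failover": role,
        # Alias records point at the load balancer and carry no TTL.
        "AliasTarget": {
            "HostedZoneId": elb_hosted_zone_id,
            "DNSName": elb_dns_name,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return record


def failover_change_batch(primary: dict, secondary: dict) -> dict:
    """Wrap both records in an UPSERT change batch."""
    return {
        "Comment": "Warm standby failover pair",
        "Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": primary},
            {"Action": "UPSERT", "ResourceRecordSet": secondary},
        ],
    }


# Placeholder values; real ELB DNS names and zone IDs come from your account.
batch = failover_change_batch(
    failover_record("app.example.com.", "PRIMARY",
                    "primary-elb.example.com.", "ZPRIMARYEXAMPLE",
                    health_check_id="hc-primary-example"),
    failover_record("app.example.com.", "SECONDARY",
                    "standby-elb.example.com.", "ZSTANDBYEXAMPLE"),
)
# With boto3 this would be submitted roughly as:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your hosted zone id>", ChangeBatch=batch)
```

Route 53 answers with the PRIMARY record while its health check passes and switches to the SECONDARY record when it fails, which is why no manual DNS edit is needed at failover time.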

Performance & Scaling

During failover, the ASG in Region B triggers a scale-out event. Performance is degraded for the first few minutes until the “Desired Capacity” is reached.

Tip: If you expect a massive traffic spike immediately upon failover, ask AWS Support ahead of time to pre-warm the standby Elastic Load Balancer.
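The degraded window can be put into numbers with a rough model. The sketch below assumes, purely for illustration, that all new instances launch at the same moment and become healthy after a fixed boot time; real scale-out is staggered, and the function and parameter names are hypothetical.

```python
def serving_fraction(
    standby_instances: int,
    desired_capacity: int,
    boot_minutes: float,
    minutes_since_failover: float,
) -> float:
    """Fraction of full capacity available at a given time after failover."""
    if desired_capacity <= 0:
        raise ValueError("desired_capacity must be positive")
    # Before the new instances finish booting, only the standby fleet serves.
    if minutes_since_failover >= boot_minutes:
        in_service = desired_capacity
    else:
        in_service = standby_instances
    return min(in_service, desired_capacity) / desired_capacity


if __name__ == "__main__":
    # 2 standby instances, target of 10, assumed 5-minute boot time.
    for minute in (0, 3, 5):
        print(minute, serving_fraction(2, 10, 5, minute))
```

With 2 standby instances against a target of 10, the standby carries only a fifth of normal capacity until the rest of the fleet is in service, so the minimum footprint should be sized for the load you are willing to shed during those minutes.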

Cost Optimization

To save money, use T-series instances (burstable) in the standby region for the minimal footprint, then switch to M or C series during a disaster via ASG Launch Template updates.
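The instance-type switch described above could look roughly like this. It is a sketch under stated assumptions: the function name and all identifiers are hypothetical, and the `ec2` and `autoscaling` arguments are assumed to behave like the corresponding boto3 clients for the standby region.

```python
from typing import Any


def switch_standby_instance_type(
    ec2: Any,
    autoscaling: Any,
    launch_template_id: str,
    source_version: str,
    asg_name: str,
    instance_type: str,
) -> int:
    """Create a launch template version with a new instance type and attach it to the ASG."""
    # New version: copy of source_version with only the instance type overridden.
    response = ec2.create_launch_template_version(
        LaunchTemplateId=launch_template_id,
        SourceVersion=source_version,
        LaunchTemplateData={"InstanceType": instance_type},
    )
    new_version = response["LaunchTemplateVersion"]["VersionNumber"]

    # Point the ASG at the new version so future launches use the larger type.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        LaunchTemplate={
            "LaunchTemplateId": launch_template_id,
            "Version": str(new_version),
        },
    )
    return new_version
```

Only instances launched after the update use the new type. The burstable instances that were already running keep theirs until they are replaced, for example through an ASG instance refresh.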

Production Use Case:

A banking application requires that if the us-east-1 region fails, the system must be back online in under 10 minutes. The team keeps 2 small EC2 instances running in us-west-2 along with a cross-region RDS read replica. Because those instances are already serving, the standby endpoint passes its Route 53 health check, and users can log in immediately while the rest of the fleet boots up.