AWS Disaster Recovery (DR) Strategies

In the AWS Cloud, Disaster Recovery is the process of preparing for and recovering from a disaster. Every strategy is judged against two targets: the Recovery Time Objective (RTO), which is how quickly you need to be back up, and the Recovery Point Objective (RPO), which is how much data loss (measured in time) you can tolerate.
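To make the two targets concrete, here is a minimal sketch that measures the RPO and RTO actually achieved in an incident and compares them with the targets. The timeline, the target values, and the function names are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recovery_point: datetime, disaster_at: datetime) -> timedelta:
    """Data-loss window: time between the last usable copy and the disaster."""
    return disaster_at - last_recovery_point


def achieved_rto(disaster_at: datetime, service_restored_at: datetime) -> timedelta:
    """Downtime: time between the disaster and service being restored."""
    return service_restored_at - disaster_at


# Hypothetical incident timeline (not taken from any real event).
last_backup = datetime(2024, 1, 10, 2, 0)    # nightly backup finished at 02:00
disaster = datetime(2024, 1, 10, 9, 30)      # region outage begins at 09:30
restored = datetime(2024, 1, 10, 15, 30)     # service back at 15:30

rpo = achieved_rpo(last_backup, disaster)    # 7 hours 30 minutes of data lost
rto = achieved_rto(disaster, restored)       # 6 hours of downtime

# Hypothetical business targets.
rpo_target = timedelta(hours=4)
rto_target = timedelta(hours=8)

print("RPO met:", rpo <= rpo_target)  # False: 7h30m of loss exceeds the 4h target
print("RTO met:", rto <= rto_target)  # True: 6h of downtime is within the 8h target
```

In this made-up timeline the RTO target is met but the RPO target is not, which is the kind of gap that pushes a design from nightly backups toward continuous replication.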

The Real-World Analogy

Imagine your business depends on a car.

Backup & Restore: You have a spare tire in the trunk. If one pops, you stop and change it (Slow recovery).
Pilot Light: You have a second car in the garage, but the battery is disconnected and the fluids are drained. It takes a little time to start (Medium recovery).
Warm Standby: The second car is idling in the driveway. You just need to hop in and drive (Fast recovery).
Multi-Site: You are driving two cars simultaneously with a driver in each (Instant recovery).

The Four DR Strategies

1. Backup and Restore (RPO: Hours, RTO: 24h+)

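This is the lowest-cost strategy. You back up your data and application artifacts (for example to S3, or with AWS Backup) and, ideally, copy those backups to a second region. In the event of a disaster, you provision new resources (EC2, RDS) and restore the data from the backups.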

Key Services: AWS Backup, S3 Cross-Region Replication (CRR).
Exam Focus: Use this when the business can tolerate significant downtime to save costs.
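As an illustration of the S3 CRR piece, the sketch below builds a replication configuration as a plain Python dictionary. The role ARN, bucket ARN, rule ID, and prefix are hypothetical placeholders, and the helper name build_crr_config is made up for this example. With boto3, a dictionary of this shape would typically be passed as the ReplicationConfiguration argument of put_bucket_replication; versioning must be enabled on both the source and destination buckets.

```python
def build_crr_config(role_arn: str, dest_bucket_arn: str, prefix: str = "") -> dict:
    """Build an S3 replication configuration with a single rule.

    All identifiers passed in are placeholders supplied by the caller.
    """
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "dr-replication",        # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix matches every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": dest_bucket_arn},
            }
        ],
    }


config = build_crr_config(
    role_arn="arn:aws:iam::123456789012:role/example-replication-role",
    dest_bucket_arn="arn:aws:s3:::example-dr-bucket",
    prefix="backups/",
)

# With boto3 (not imported here), the call would look roughly like:
#   s3.put_bucket_replication(Bucket="example-source-bucket",
#                             ReplicationConfiguration=config)
print(config["Rules"][0]["Destination"]["Bucket"])
```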

2. Pilot Light (RPO: Minutes, RTO: Hours)

Data is kept up to date in a DR region (for example with a cross-region RDS Read Replica), but the compute resources (EC2) are switched off or not yet provisioned. You might only have a small database server running. In a disaster, you “turn on” the rest of the infrastructure (usually via Auto Scaling groups or CloudFormation) and promote the replica so it can accept writes.

Key Services: RDS Read Replicas, Route 53 Failover.
Exam Focus: Core data is “live,” but application servers are provisioned only during a disaster.
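The "turn on" step can be sketched as a small runbook function. This is only an outline under stated assumptions: the function name activate_pilot_light, the resource names, and the capacity numbers are hypothetical, and the clients are passed in so the example can run against a recording stand-in instead of real AWS. The two method names match boto3's RDS promote_read_replica and Auto Scaling update_auto_scaling_group calls; a real runbook would also wait for the promotion to finish and then update DNS.

```python
class RecordingClient:
    """Stand-in for a boto3 client that just records the calls made to it."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any method name is accepted and logged with its keyword arguments.
        def method(**kwargs):
            self.calls.append((name, kwargs))
            return {}
        return method


def activate_pilot_light(rds, autoscaling, replica_id: str, asg_name: str, desired: int) -> None:
    """Promote the DR database and bring up the application fleet."""
    # 1. Make the cross-region read replica a standalone, writable instance.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)
    # 2. Scale the Auto Scaling group that was kept at zero up to serving size.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=desired,
        MaxSize=desired * 2,
        DesiredCapacity=desired,
    )


# Dry run with recording clients; in real use these would be
# boto3.client("rds", region_name=...) and boto3.client("autoscaling", region_name=...).
rds = RecordingClient()
autoscaling = RecordingClient()
activate_pilot_light(rds, autoscaling, "example-dr-replica", "example-dr-asg", desired=4)

for name, kwargs in rds.calls + autoscaling.calls:
    print(name, kwargs)
```

Passing the clients in as parameters keeps the runbook testable, which matters because a DR procedure that is never rehearsed tends to fail when it is finally needed.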

3. Warm Standby (RPO: Seconds, RTO: Minutes)

A scaled-down but fully functional version of your environment is always running in another region. It can handle a small amount of traffic. During a disaster, the environment is scaled up to handle the full production load.

Key Services: Route 53 Weighted Routing, Auto Scaling.
Exam Focus: Faster than Pilot Light because the app is already running, just at a smaller scale.
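One way to send a small share of traffic to the standby is a pair of Route 53 weighted records. The sketch below only builds the change batch as a dictionary; the domain names, set identifiers, TTL, and the 90/10 split are hypothetical. With boto3, a batch of this shape would typically be passed as the ChangeBatch argument of change_resource_record_sets.

```python
def weighted_change_batch(name: str, primary_dns: str, standby_dns: str,
                          standby_weight: int = 10, total: int = 100) -> dict:
    """Build a change batch with two weighted CNAME records.

    Route 53 weights are integers from 0 to 255; traffic is split in
    proportion to each record's weight over the sum of all weights.
    """
    def record(set_id: str, target: str, weight: int) -> dict:
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "CNAME",
                "SetIdentifier": set_id,   # distinguishes records sharing a name
                "Weight": weight,
                "TTL": 60,                 # short TTL so weight changes take effect quickly
                "ResourceRecords": [{"Value": target}],
            },
        }

    return {
        "Comment": "Warm standby traffic split (illustrative)",
        "Changes": [
            record("primary", primary_dns, total - standby_weight),
            record("standby", standby_dns, standby_weight),
        ],
    }


batch = weighted_change_batch(
    "app.example.com.",
    "primary-alb.example.com",
    "standby-alb.example.com",
)
weights = {c["ResourceRecordSet"]["SetIdentifier"]: c["ResourceRecordSet"]["Weight"]
           for c in batch["Changes"]}
print(weights)
```

During a failover the same function could be called with standby_weight=100 (and the standby scaled up first) to move all traffic to the DR region.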

4. Multi-Site Active-Active (RPO: Near Zero, RTO: Near Zero)

Your application runs in two or more regions simultaneously. Traffic is distributed across them. If one region fails, the other regions simply continue to handle the traffic.

Key Services: Route 53 Latency/Geoproximity Routing, Aurora Global Database, DynamoDB Global Tables.
Exam Focus: Most expensive, but provides the lowest RTO/RPO. Suited to mission-critical apps that cannot tolerate downtime.
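The behavior "if one region fails, the others continue to handle the traffic" can be modeled with a toy routing function. This is a simplified model for intuition, not Route 53's actual algorithm; the region names, latency figures, and the function name pick_region are hypothetical.

```python
from typing import Dict, Optional


def pick_region(latency_ms: Dict[str, float], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the healthy region with the lowest latency, or None if all are down."""
    candidates = {region: ms for region, ms in latency_ms.items()
                  if healthy.get(region, False)}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# Hypothetical latencies as seen by one client.
latency = {"us-east-1": 20.0, "eu-west-1": 95.0}

normal = pick_region(latency, {"us-east-1": True, "eu-west-1": True})
during_outage = pick_region(latency, {"us-east-1": False, "eu-west-1": True})

print(normal)         # us-east-1: closest healthy region
print(during_outage)  # eu-west-1: traffic moves to the surviving region
```

Because every region is already serving production traffic, no scale-up or promotion step sits between the failure and recovery, which is where the near-zero RTO comes from.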

Strategy	RPO (Data Loss)	RTO (Downtime)	Cost	Complexity
Backup & Restore	High (Hours)	High (24h+)	$ (Lowest)	Low
Pilot Light	Medium (Minutes)	Medium (Hours)	$$	Medium
Warm Standby	Low (Seconds)	Low (Minutes)	$$$	High
Multi-Site	Near Zero	Near Zero	$$$$ (Highest)	Very High

Decision Matrix / If–Then Guide

IF requirement is “Minimize Cost” CHOOSE Backup & Restore

IF requirement is “RTO < 15 mins” CHOOSE Warm Standby

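IF requirement is “Near-Zero Data Loss and Downtime” CHOOSE Multi-Site (for example with DynamoDB Global Tables; note that cross-region replication is asynchronous, so RPO is near zero rather than strictly zero)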

IF requirement is “Database is live, EC2 is off” CHOOSE Pilot Light
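The if-then rules above can be sketched as a function that returns the cheapest strategy whose typical RTO and RPO fit the requirement. The numeric thresholds below are illustrative assumptions loosely derived from the comparison table (hours, minutes, seconds), not official AWS figures, and the function name choose_strategy is made up for this example.

```python
# (name, typical RTO in minutes, typical RPO in minutes), ordered cheapest first.
# The numbers are rough, assumed stand-ins for "24h+", "hours", "minutes", "near zero".
STRATEGIES = [
    ("Backup & Restore", 24 * 60, 4 * 60),
    ("Pilot Light", 2 * 60, 10),
    ("Warm Standby", 10, 1),
    ("Multi-Site", 0, 0),
]


def choose_strategy(rto_minutes: float, rpo_minutes: float) -> str:
    """Pick the cheapest strategy that meets both the RTO and the RPO requirement."""
    for name, typical_rto, typical_rpo in STRATEGIES:
        if typical_rto <= rto_minutes and typical_rpo <= rpo_minutes:
            return name
    # Multi-Site has thresholds of zero, so this is only reached for negative input.
    raise ValueError("RTO and RPO requirements must be non-negative")


print(choose_strategy(rto_minutes=48 * 60, rpo_minutes=24 * 60))  # Backup & Restore
print(choose_strategy(rto_minutes=4 * 60, rpo_minutes=30))        # Pilot Light
print(choose_strategy(rto_minutes=15, rpo_minutes=5))             # Warm Standby
print(choose_strategy(rto_minutes=1, rpo_minutes=0))              # Multi-Site
```

Ordering the list from cheapest to most expensive encodes the exam heuristic: choose the least costly option that still satisfies the stated RTO and RPO.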

Exam Tips and Gotchas

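AWS Elastic Disaster Recovery (AWS DRS, the successor to CloudEndure Disaster Recovery): If the exam mentions protecting on-premises workloads by continuously replicating them to AWS with low RTO and RPO, AWS DRS is often the right answer. If the scenario is a one-time migration rather than ongoing DR, consider AWS Application Migration Service (MGN) instead.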
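Route 53 Health Checks: Route 53 cannot fail over automatically unless it can tell that the primary is unhealthy. Attach a health check to the primary failover record (or enable Evaluate Target Health on an alias record).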
S3 Durability vs. Availability: S3 provides 11 nines (99.999999999%) of durability, but that durability applies within a single region. For DR, you need Cross-Region Replication (CRR) to protect against a full region failure.
Read Replicas vs. Multi-AZ: Multi-AZ is for High Availability (same region). Read Replicas (Cross-Region) are for Disaster Recovery.
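To illustrate the health-check tip, the sketch below builds a health check configuration and a PRIMARY/SECONDARY failover record pair as plain dictionaries. The domain names, IP addresses, health check ID, path, and helper names are hypothetical placeholders. With boto3, these shapes would typically go to create_health_check (as HealthCheckConfig) and change_resource_record_sets (as the ResourceRecordSet entries of a change batch).

```python
def health_check_config(fqdn: str, path: str = "/health") -> dict:
    """HTTPS health check: unhealthy after 3 consecutive failed probes, 30 s apart."""
    return {
        "Type": "HTTPS",
        "FullyQualifiedDomainName": fqdn,
        "Port": 443,
        "ResourcePath": path,
        "RequestInterval": 30,
        "FailureThreshold": 3,
    }


def failover_records(name: str, primary_ip: str, secondary_ip: str,
                     primary_health_check_id: str) -> list:
    """Build the PRIMARY and SECONDARY record sets for failover routing."""
    primary = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": "primary",
        "Failover": "PRIMARY",
        "TTL": 60,
        "ResourceRecords": [{"Value": primary_ip}],
        # Without this, Route 53 has no signal that the primary is down.
        "HealthCheckId": primary_health_check_id,
    }
    secondary = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": "secondary",
        "Failover": "SECONDARY",
        "TTL": 60,
        "ResourceRecords": [{"Value": secondary_ip}],
    }
    return [primary, secondary]


check = health_check_config("primary.example.com")
records = failover_records(
    "app.example.com.",
    primary_ip="192.0.2.10",        # documentation-range addresses
    secondary_ip="198.51.100.10",
    primary_health_check_id="example-health-check-id",
)
print([r["Failover"] for r in records])
```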

Topics covered in this guide:

RTO and RPO Definitions
Backup and Restore Strategy
Pilot Light Architecture
Warm Standby Implementation
Multi-site Active-Active Patterns
AWS Elastic Disaster Recovery (DRS)
Route 53 Failover Mechanisms

[Figure: DR Strategy Spectrum]

Service Ecosystem

Route 53: Global traffic management & failover.
AWS Backup: Centralized backup across services.
S3 CRR: Automated cross-region data sync.
Aurora Global DB: Sub-second data replication.

Performance & Scaling

Use Infrastructure as Code (CloudFormation/CDK) to rapidly deploy resources in the DR region. This ensures environment parity and reduces human error during high-stress failover events.
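As a sketch of the environment-parity idea, the example below generates one CloudFormation template and two parameter sets, so the primary and DR regions are deployed from the same definition and differ only in capacity. The resource name AppFleet, the parameter names, and the capacity values are hypothetical, and the template has not been validated against a live account.

```python
import json


def dr_stack_template() -> dict:
    """One template for both regions; capacity is the only thing that varies."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Application fleet, deployable to primary and DR regions",
        "Parameters": {
            "DesiredCapacity": {"Type": "Number", "Default": 0},
            "MaxCapacity": {"Type": "Number", "Default": 10},
            "LaunchTemplateId": {"Type": "String"},
            "LaunchTemplateVersion": {"Type": "String"},
            "SubnetIds": {"Type": "List<AWS::EC2::Subnet::Id>"},
        },
        "Resources": {
            "AppFleet": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    "MinSize": {"Ref": "DesiredCapacity"},
                    "MaxSize": {"Ref": "MaxCapacity"},
                    "DesiredCapacity": {"Ref": "DesiredCapacity"},
                    "VPCZoneIdentifier": {"Ref": "SubnetIds"},
                    "LaunchTemplate": {
                        "LaunchTemplateId": {"Ref": "LaunchTemplateId"},
                        "Version": {"Ref": "LaunchTemplateVersion"},
                    },
                },
            }
        },
    }


# Same template, different parameters per region (values are illustrative).
primary_params = {"DesiredCapacity": 8, "MaxCapacity": 16}
dr_standby_params = {"DesiredCapacity": 0, "MaxCapacity": 16}   # pilot light: fleet off
dr_failover_params = {"DesiredCapacity": 8, "MaxCapacity": 16}  # after failover

template_body = json.dumps(dr_stack_template(), indent=2)
print(template_body[:80])
```

Failing over then becomes a stack update with dr_failover_params rather than a sequence of manual console steps.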

Cost Optimization

Avoid Multi-Site for non-critical dev/test environments. Use S3 Glacier for long-term backup storage to minimize costs while maintaining compliance.
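Moving older backups to S3 Glacier is usually done with a lifecycle rule. The sketch below builds such a rule as a dictionary; the prefix, the 30-day transition, the roughly seven-year expiration, and the helper name are hypothetical values chosen for illustration. With boto3, a dictionary of this shape would typically be passed as the LifecycleConfiguration argument of put_bucket_lifecycle_configuration.

```python
def backup_lifecycle_config(prefix: str = "backups/",
                            glacier_after_days: int = 30,
                            expire_after_days: int = 2555) -> dict:
    """Transition backups to Glacier after a delay, then expire them."""
    if expire_after_days <= glacier_after_days:
        raise ValueError("expiration must come after the Glacier transition")
    return {
        "Rules": [
            {
                "ID": "archive-old-backups",   # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    # "GLACIER" is the storage class for S3 Glacier Flexible Retrieval.
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": expire_after_days},  # about 7 years by default
            }
        ]
    }


lifecycle = backup_lifecycle_config()
rule = lifecycle["Rules"][0]
print(rule["Transitions"][0], rule["Expiration"])
```

Keep in mind that restores from Glacier storage classes take longer than restores from S3 Standard, so archived backups should not be the copies an RTO depends on.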

Production Use Case: E-Commerce Site

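A global retailer uses Warm Standby. They keep a small fleet of EC2 instances and an RDS Read Replica in a secondary region. When the primary region goes down, Route 53 health checks detect the failure and fail DNS over to the secondary region. The secondary Auto Scaling group is then scaled out to production size (for example by a CloudWatch alarm on the health check or by a failover runbook), and the Read Replica is promoted to a standalone primary.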