Overview of Disaster Recovery
Disaster Recovery in AWS focuses on how an organization responds when a service disruption occurs. It is a critical component of the Reliability Pillar of the AWS Well-Architected Framework. The primary goal is to minimize the impact on users by meeting two key metrics:
- RPO (Recovery Point Objective): Maximum acceptable data loss (measured in time).
- RTO (Recovery Time Objective): Maximum acceptable downtime before service is restored.
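As a worked example of the two metrics, the sketch below measures the achieved RPO and RTO for a made-up incident timeline and compares them with targets. The timestamps, the 4-hour targets, and the function names are assumptions chosen for illustration only.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recovery_point: datetime, failure_time: datetime) -> timedelta:
    """Data-loss window: time between the last usable recovery point and the failure."""
    return failure_time - last_recovery_point


def achieved_rto(failure_time: datetime, service_restored: datetime) -> timedelta:
    """Downtime: time between the failure and the moment service is restored."""
    return service_restored - failure_time


# Hypothetical incident timeline (not real data).
last_backup = datetime(2024, 1, 1, 2, 0)
failure = datetime(2024, 1, 1, 9, 30)
restored = datetime(2024, 1, 1, 13, 0)

# Hypothetical business targets.
rpo_target = timedelta(hours=4)
rto_target = timedelta(hours=4)

rpo_actual = achieved_rpo(last_backup, failure)  # 7 h 30 min of data lost
rto_actual = achieved_rto(failure, restored)     # 3 h 30 min of downtime

rpo_met = rpo_actual <= rpo_target  # False: backups were too infrequent
rto_met = rto_actual <= rto_target  # True: restore finished within target
```

In this timeline the RTO target is met but the RPO target is missed, which points at backup frequency rather than restore speed.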
The “Transportation” Analogy
To understand the four DR models, imagine you are preparing for a flat tire during a road trip:
- Backup & Restore: You have a spare tire in the trunk, but you have to pull over and manually change it (High RTO).
- Pilot Light: You have a small “donut” tire already mounted on a special fifth wheel that just needs to be inflated to take the load (Lower RTO).
- Warm Standby: You are driving a truck that has dual rear wheels. If one pops, the other is already spinning and carrying the load, though you might need to slow down (Very Low RTO).
- Multi-Site Active-Active: You are driving two identical cars at the same time. If one disappears, you are already in the other one (Near-Zero RTO).
Core Concepts & Comparison
| Strategy | RPO | RTO | Cost | Complexity |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | $ | Low |
| Pilot Light | Minutes | Tens of minutes | $$ | Medium |
| Warm Standby | Seconds | Minutes | $$$ | High |
| Multi-Site | Near Zero | Near Zero | $$$$ | Very High |
Decision Matrix: If/Then Scenarios
- If the requirement is the lowest possible cost and high downtime is acceptable, then use Backup & Restore (S3 + AWS Backup).
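- If you need only the core components (such as replicated data stores) always running, with application servers provisioned but switched off until failover, then use Pilot Light.
- If you need a “business-critical” app to fail over in minutes, with a scaled-down but fully functional copy already running in a second region and able to take traffic at reduced capacity, then use Warm Standby.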
- If the application is “mission-critical” and cannot afford any downtime, then use Multi-Site Active-Active (Route 53 + Global Accelerator).
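The if/then rules above amount to picking the cheapest strategy whose recovery figures fit the targets. A minimal sketch of that logic follows; the per-strategy RPO/RTO values are illustrative placeholders chosen to match the orders of magnitude in the comparison table, not AWS guarantees, and `cheapest_strategy` is a hypothetical helper.

```python
from datetime import timedelta
from typing import NamedTuple, Optional


class Strategy(NamedTuple):
    name: str
    rpo: timedelta  # illustrative worst-case data loss
    rto: timedelta  # illustrative worst-case downtime


# Ordered from cheapest to most expensive. All figures are assumptions.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=24), timedelta(hours=24)),
    Strategy("Pilot Light", timedelta(minutes=30), timedelta(hours=1)),
    Strategy("Warm Standby", timedelta(minutes=1), timedelta(minutes=15)),
    Strategy("Multi-Site Active-Active", timedelta(seconds=1), timedelta(seconds=1)),
]


def cheapest_strategy(rpo_target: timedelta, rto_target: timedelta) -> Optional[str]:
    """Return the first (cheapest) strategy that satisfies both targets, or None."""
    for strategy in STRATEGIES:
        if strategy.rpo <= rpo_target and strategy.rto <= rto_target:
            return strategy.name
    return None
```

For example, `cheapest_strategy(timedelta(minutes=5), timedelta(minutes=30))` skips Pilot Light because its assumed RPO is too high and returns Warm Standby.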
Exam Tips: Golden Nuggets
- Route 53: The usual answer for DNS-based failover between regions, using failover routing with Health Checks.
- Aurora Global Database: Best for cross-region DR with a typical RPO of < 1 second and RTO of < 1 minute.
- S3 Cross-Region Replication (CRR): Keeps a copy of your objects in a bucket in another region, a building block for most DR strategies. Requires versioning on both the source and destination buckets.
- CloudFormation: Crucial for “Pilot Light” and “Warm Standby” to ensure infrastructure parity in the secondary region.
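To make the Route 53 failover tip concrete, here is a sketch that builds a primary/secondary failover record pair in the `ChangeBatch` shape accepted by Route 53's `ChangeResourceRecordSets` API. The domain names, health check ID, and helper name are hypothetical; the function only builds a dictionary and makes no AWS calls.

```python
def failover_change_batch(record_name, primary_dns, secondary_dns,
                          primary_health_check_id, ttl=60):
    """Build a Route 53 ChangeBatch with PRIMARY and SECONDARY failover records."""

    def record(role, target, health_check_id=None):
        rrset = {
            "Name": record_name,
            "Type": "CNAME",
            "SetIdentifier": f"{role.lower()}-region",  # must be unique per record
            "Failover": role,                           # "PRIMARY" or "SECONDARY"
            "TTL": ttl,                                 # short TTL speeds up failover
            "ResourceRecords": [{"Value": target}],
        }
        if health_check_id:
            # Route 53 answers with the secondary when this check is unhealthy.
            rrset["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "DNS failover between primary and DR region",
        "Changes": [
            record("PRIMARY", primary_dns, primary_health_check_id),
            record("SECONDARY", secondary_dns),
        ],
    }
```

With boto3, the result would typically be passed as `ChangeBatch` to `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`, where the hosted zone ID is your own.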
Key Services
- AWS Backup: Centralized backup management.
- Route 53: Failover routing policies.
- RDS Read Replicas: Asynchronous cross-region replication; a replica can be promoted to a standalone primary during failover.
Common Pitfalls
- Hardcoded IPs: Use DNS names instead for failover.
- No Testing: DR plans that aren’t tested fail during real events.
- Service Quotas: Forgetting to increase limits in the DR region.
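The service-quota pitfall can be caught ahead of time by comparing limits between the two regions. The sketch below assumes the quota values have already been collected into plain dictionaries (for instance from the Service Quotas API); the quota names and numbers are made up for illustration.

```python
def quota_gaps(primary: dict, dr: dict) -> dict:
    """Return quotas where the DR region's limit is lower than the primary's.

    Maps quota name -> (primary_limit, dr_limit). A quota missing from the
    DR dictionary is treated as 0 so that it is always reported.
    """
    gaps = {}
    for name, primary_limit in primary.items():
        dr_limit = dr.get(name, 0)
        if dr_limit < primary_limit:
            gaps[name] = (primary_limit, dr_limit)
    return gaps


# Hypothetical quota values for two regions.
primary_quotas = {"on_demand_vcpus": 512, "elastic_ips": 20, "vpcs": 5}
dr_quotas = {"on_demand_vcpus": 64, "elastic_ips": 20}

gaps = quota_gaps(primary_quotas, dr_quotas)
```

Here `gaps` flags the vCPU and VPC limits, which would need increase requests before the DR region could absorb full production load.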
Quick Patterns
- S3 to S3: Cross-Region Replication (CRR).
- EBS to S3: EBS Snapshots (stored in S3 by AWS) copied to other regions.
- AMI Copy: Copy custom images to the DR region; AMIs are regional, so an image must exist in the region where you launch.
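The snapshot and AMI patterns above can be sketched with the EC2 `copy_snapshot` and `copy_image` operations, both of which are called on a client for the destination region. The helper names are hypothetical, and the client is passed in as a parameter so the sketch runs without AWS credentials; in real use it would be something like `boto3.client("ec2", region_name="us-west-2")`.

```python
def copy_snapshot_to_dr(ec2_dr, snapshot_id: str, source_region: str) -> str:
    """Copy an EBS snapshot into the DR region; returns the new snapshot ID.

    `ec2_dr` is assumed to be an EC2 client bound to the DR (destination) region.
    """
    response = ec2_dr.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=f"DR copy of {snapshot_id} from {source_region}",
    )
    return response["SnapshotId"]


def copy_ami_to_dr(ec2_dr, image_id: str, source_region: str, name: str) -> str:
    """Copy a custom AMI into the DR region; returns the new image ID."""
    response = ec2_dr.copy_image(
        Name=name,
        SourceImageId=image_id,
        SourceRegion=source_region,
    )
    return response["ImageId"]
```

The copies get new IDs in the destination region, so launch templates or CloudFormation parameters for the DR stack need to reference those IDs rather than the originals.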