AWS Business Continuity Planning (BCP) & Disaster Recovery
Business Continuity Planning (BCP) in the AWS Cloud focuses on ensuring that mission-critical systems remain operational or can be recovered quickly during a disaster. For the SAA-C03 exam, this primarily revolves around Disaster Recovery (DR) strategies, leveraging global infrastructure to minimize downtime and data loss.
The “Spare Tire” Analogy
Think of BCP like the emergency preparation for a long road trip. Backup & Restore is like having a can of tire sealant (slow, but gets you there). Pilot Light is like carrying a deflated spare tire and a jack. Warm Standby is having a full-sized spare tire already mounted on a rack. Multi-Site is like driving two identical cars side-by-side; if one breaks down, you simply hop into the other without stopping.
Core Concepts: The Reliability Pillar
The AWS Well-Architected Framework’s Reliability Pillar emphasizes that systems must recover from infrastructure or service disruptions. Key metrics include:
- RPO (Recovery Point Objective): Maximum acceptable amount of data loss measured in time (e.g., “We can lose 4 hours of data”).
- RTO (Recovery Time Objective): Maximum acceptable downtime to restore service (e.g., “The system must be back up in 15 minutes”).
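To make the two metrics concrete, here is a minimal sketch that checks a backup schedule and a measured recovery time against the example targets above. The function names and the 6-hour backup interval are illustrative assumptions, not part of any AWS API.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Worst case: the failure happens just before the next backup would run,
    so everything written since the previous backup is lost."""
    return backup_interval


def meets_objectives(backup_interval: timedelta,
                     measured_recovery_time: timedelta,
                     rpo: timedelta,
                     rto: timedelta) -> dict:
    """Compare a backup schedule and a measured restore time with RPO/RTO."""
    return {
        "rpo_met": worst_case_data_loss(backup_interval) <= rpo,
        "rto_met": measured_recovery_time <= rto,
    }


# Hypothetical workload: backups every 6 hours, restore drill took 10 minutes.
result = meets_objectives(
    backup_interval=timedelta(hours=6),
    measured_recovery_time=timedelta(minutes=10),
    rpo=timedelta(hours=4),     # "We can lose 4 hours of data"
    rto=timedelta(minutes=15),  # "The system must be back up in 15 minutes"
)
print(result)  # {'rpo_met': False, 'rto_met': True}
```

In this example the RTO is satisfied, but backups every 6 hours cannot honor a 4-hour RPO, so the backup frequency would need to increase.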
Comparison of Disaster Recovery Strategies
| Strategy | RPO / RTO | Cost | Key AWS Services |
|---|---|---|---|
| Backup & Restore | Hours | $ (Lowest) | AWS Backup, S3, EBS Snapshots |
| Pilot Light | Minutes / Hours | $$ | Route 53, RDS Read Replicas, AMI |
| Warm Standby | Seconds / Minutes | $$$ | Auto Scaling, RDS Cross-Region Read Replicas |
| Multi-Site (Active-Active) | Near Zero | $$$$ (Highest) | Route 53 (Latency/Failover), Global Accelerator |
Decision Matrix: If/Then Scenarios
- If the requirement is the lowest cost and high RTO is acceptable, Then use Backup & Restore.
- If you need to recover in minutes but want to keep costs low, Then use Pilot Light (Keep data live, but keep compute “off”).
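- If you require zero data loss (RPO = 0), Then use synchronous replication such as RDS Multi-AZ within a Region. S3 Cross-Region Replication is asynchronous, so it lowers cross-Region RPO but does not guarantee zero.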
- If you need to handle a regional failure with no manual intervention, Then use Multi-Site Active-Active with Route 53 health checks.
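The strategy-selection rules above can be sketched as a small function. This is only a study aid: the `Requirements` fields and the precedence order (strictest requirement wins) are assumptions made for illustration, and the RPO = 0 rule is left out because it concerns the data layer rather than the overall strategy.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    # Hypothetical flags mirroring the If/Then scenarios in these notes.
    automatic_regional_failover: bool = False
    recover_in_minutes: bool = False


def choose_strategy(req: Requirements) -> str:
    """Return the DR strategy suggested by the decision matrix.

    Assumed precedence: the strictest requirement wins.
    """
    if req.automatic_regional_failover:
        return "Multi-Site Active-Active"
    if req.recover_in_minutes:
        return "Pilot Light"
    # Lowest cost, high RTO acceptable.
    return "Backup & Restore"


print(choose_strategy(Requirements()))
print(choose_strategy(Requirements(recover_in_minutes=True)))
print(choose_strategy(Requirements(automatic_regional_failover=True)))
```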
Exam Tips: Golden Nuggets
- Multi-AZ vs. Multi-Region: Multi-AZ is for High Availability (local failure); Multi-Region is for Disaster Recovery (regional disaster).
- AWS Elastic Disaster Recovery (AWS DRS, the successor to CloudEndure Disaster Recovery): Look for it when a question describes continuous block-level replication from on-premises to AWS or from AWS to AWS.
- S3 Versioning: Essential for BCP to protect against accidental deletion or ransomware.
- Distractor Alert: “Multi-AZ” does not protect against a region-wide outage; only a Multi-Region strategy does.
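As a sketch of the S3 Versioning tip, the helper below turns versioning on for a bucket through the boto3 S3 client's `put_bucket_versioning` call. The helper name and the bucket name in the usage comment are hypothetical, and the client is passed in so the function can be exercised with a stub.

```python
def enable_versioning(s3_client, bucket_name: str) -> None:
    """Enable versioning so overwritten or deleted objects keep prior versions.

    `s3_client` is expected to be a boto3 S3 client (or a test double with the
    same `put_bucket_versioning` method).
    """
    s3_client.put_bucket_versioning(
        Bucket=bucket_name,
        VersioningConfiguration={"Status": "Enabled"},
    )


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   enable_versioning(boto3.client("s3"), "example-dr-bucket")
```

Once versioning is enabled, a delete only adds a delete marker, so earlier object versions remain recoverable.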
Regional Failover Architecture
Key Services
- Route 53: DNS Failover & Health Checks.
- AWS Backup: Centralized backup management.
- AWS DRS: Block-level replication.
- S3 CRR: Cross-Region Replication.
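For the Route 53 failover item, the following sketch builds a `ChangeBatch` with a PRIMARY and a SECONDARY failover record, in the shape accepted by the boto3 Route 53 client's `change_resource_record_sets`. The domain names, set identifiers, and health check ID are placeholders, and the helper names are hypothetical.

```python
from typing import Optional


def failover_record(name: str, target: str, role: str, set_id: str,
                    health_check_id: Optional[str] = None, ttl: int = 60) -> dict:
    """Build one UPSERT change for a failover-routed CNAME record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be 'PRIMARY' or 'SECONDARY'")
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": set_id,
        "Failover": role,
        "TTL": ttl,  # a short TTL lets clients pick up the failover quickly
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name: str, primary_target: str, secondary_target: str,
                          primary_health_check_id: str) -> dict:
    """Pair a health-checked primary record with a standby secondary record."""
    return {
        "Comment": "DNS failover between two Regions",
        "Changes": [
            failover_record(name, primary_target, "PRIMARY", "primary-region",
                            primary_health_check_id),
            failover_record(name, secondary_target, "SECONDARY", "dr-region"),
        ],
    }


batch = failover_change_batch(
    "app.example.com",
    "primary-lb.example.com",
    "dr-lb.example.com",
    "hypothetical-health-check-id",
)
print(len(batch["Changes"]))  # 2

# Usage (assumes boto3 and a real hosted zone ID):
#   import boto3
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your hosted zone ID>", ChangeBatch=batch)
```

When the health check attached to the primary record fails, Route 53 starts answering queries with the secondary record.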
Common Pitfalls
- Not testing DR plans regularly (Game Days).
- Hardcoded IP addresses instead of DNS names.
- Ignoring Service Quotas in the DR region.
- Forgetting IAM role parity across regions.
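The Service Quotas pitfall can be checked mechanically. The sketch below compares two quota snapshots and reports where the DR Region is lower than the primary. The quota names and values are made up for illustration; in practice the snapshots might be collected with the Service Quotas API (for example boto3's `service-quotas` client and `list_service_quotas`), which is an assumption about your tooling.

```python
from typing import Dict, Tuple


def quota_gaps(primary: Dict[str, float],
               dr: Dict[str, float]) -> Dict[str, Tuple[float, float]]:
    """Return quotas where the DR Region's limit is below the primary's.

    Maps quota name -> (primary value, DR value). A quota missing from the
    DR snapshot is treated as 0 so it is always flagged.
    """
    gaps = {}
    for name, primary_value in primary.items():
        dr_value = dr.get(name, 0.0)
        if dr_value < primary_value:
            gaps[name] = (primary_value, dr_value)
    return gaps


# Hypothetical snapshots of applied quota values per Region.
primary_quotas = {"on-demand-vcpus": 512.0, "elastic-ips": 20.0, "vpcs": 5.0}
dr_quotas = {"on-demand-vcpus": 64.0, "elastic-ips": 20.0}

print(quota_gaps(primary_quotas, dr_quotas))
# {'on-demand-vcpus': (512.0, 64.0), 'vpcs': (5.0, 0.0)}
```

Any gap reported here means a full failover could stall on a quota limit, so increases should be requested before a Game Day rather than during an outage.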
Quick Patterns
- Static Site: S3 + CloudFront (Origin Failover).
- Databases: RDS Cross-Region Read Replicas.
- Networking: Transit Gateway for Inter-Region peering.