AWS Business Continuity Planning (BCP) & Disaster Recovery

Business Continuity Planning (BCP) in the AWS Cloud focuses on ensuring that mission-critical systems remain operational or can be recovered quickly during a disaster. For the SAA-C03 exam, this primarily revolves around Disaster Recovery (DR) strategies, leveraging global infrastructure to minimize downtime and data loss.

The “Spare Tire” Analogy

Think of BCP like the emergency preparation for a long road trip. Backup & Restore is like having a can of tire sealant (slow, but gets you there). Pilot Light is like carrying a deflated spare tire and a jack. Warm Standby is having a full-sized spare tire already mounted on a rack. Multi-Site is like driving two identical cars side-by-side; if one breaks down, you simply hop into the other without stopping.

Core Concepts: The Reliability Pillar

The AWS Well-Architected Framework’s Reliability Pillar emphasizes that systems must recover from infrastructure or service disruptions. Key metrics include:

RPO (Recovery Point Objective): Maximum acceptable amount of data loss measured in time (e.g., “We can lose 4 hours of data”).
RTO (Recovery Time Objective): Maximum acceptable downtime to restore service (e.g., “The system must be back up in 15 minutes”).
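
The two metrics can be checked with simple arithmetic. The sketch below is illustrative only and assumes periodic backups, where the worst-case data loss equals one full backup interval. The function names `meets_rpo` and `meets_rto` are hypothetical helpers, not part of any AWS SDK.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    # Worst case: the failure hits just before the next backup runs,
    # so everything written since the last backup (one full interval) is lost.
    return backup_interval <= rpo


def meets_rto(estimated_recovery_time: timedelta, rto: timedelta) -> bool:
    # The plan is acceptable only if restoring service fits inside the RTO.
    return estimated_recovery_time <= rto


if __name__ == "__main__":
    rpo = timedelta(hours=4)     # "We can lose 4 hours of data"
    rto = timedelta(minutes=15)  # "The system must be back up in 15 minutes"

    # Nightly backups checked against a 4-hour RPO.
    print("Nightly backup meets RPO:", meets_rpo(timedelta(hours=24), rpo))
    # A recovery drill that took 10 minutes, checked against a 15-minute RTO.
    print("10-minute recovery meets RTO:", meets_rto(timedelta(minutes=10), rto))
```

A nightly backup fails the 4-hour RPO in this example, which is the kind of mismatch exam questions often hinge on.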

Comparison of Disaster Recovery Strategies

Strategy	RPO / RTO	Cost	Key AWS Services
Backup & Restore	Hours	$ (Lowest)	AWS Backup, S3, EBS Snapshots
Pilot Light	Minutes / Hours	$$	Route 53, RDS Read Replicas, AMI
Warm Standby	Seconds / Minutes	$$$	Auto Scaling, RDS Multi-AZ
Multi-Site (Active-Active)	Near Zero	$$$$ (Highest)	Route 53 (Latency/Failover), Global Accelerator

Decision Matrix: If/Then Scenarios

If the requirement is the lowest cost and high RTO is acceptable, Then use Backup & Restore.
If you need to recover in minutes but want to keep costs low, Then use Pilot Light (Keep data live, but keep compute “off”).
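If you require zero data loss (RPO = 0), Then use synchronous replication, such as RDS Multi-AZ within a Region. S3 Cross-Region Replication is asynchronous, so it gives a near-zero RPO rather than a guaranteed zero.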
If you need to handle a regional failure with no manual intervention, Then use Multi-Site Active-Active with Route 53 health checks.
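
The cost-versus-recovery trade-off behind these rules can be sketched as a small lookup that returns the cheapest strategy able to meet a given RTO. The cut-off values are illustrative assumptions loosely based on the comparison table, not AWS-defined limits, and `pick_dr_strategy` is a hypothetical helper.

```python
from datetime import timedelta

# Ordered from tightest (most expensive) to loosest tier.
# The cut-offs are assumptions for illustration, not AWS-defined limits.
STRATEGY_BY_MAX_RTO = [
    (timedelta(minutes=1), "Multi-Site (Active-Active)"),
    (timedelta(minutes=10), "Warm Standby"),
    (timedelta(hours=1), "Pilot Light"),
]


def pick_dr_strategy(rto: timedelta) -> str:
    """Return the first tier whose cut-off covers the required RTO."""
    for max_rto, strategy in STRATEGY_BY_MAX_RTO:
        if rto <= max_rto:
            return strategy
    # Anything looser than the last cut-off tolerates the cheapest option.
    return "Backup & Restore"


if __name__ == "__main__":
    for target in (timedelta(seconds=30), timedelta(minutes=5),
                   timedelta(minutes=45), timedelta(hours=8)):
        print(target, "->", pick_dr_strategy(target))
```

Real decisions also weigh RPO, budget, and operational effort. The single-input lookup only mirrors the pattern of the If/Then rules above.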

Exam Tips: Golden Nuggets

Multi-AZ vs. Multi-Region: Multi-AZ is for High Availability (failure of a single Availability Zone); Multi-Region is for Disaster Recovery (regional disaster).
AWS Elastic Disaster Recovery (AWS DRS): The successor to CloudEndure Disaster Recovery. Look for it in questions about continuous replication from on-premises to AWS or from AWS to AWS.
S3 Versioning: Essential for BCP. It keeps prior object versions, so you can recover from accidental deletion or from objects overwritten by ransomware.
Distractor Alert: “Multi-AZ” does not protect against a region-wide outage; only a Multi-Region strategy does.

Regional Failover Architecture

Key Services

Route 53: DNS Failover & Health Checks.
AWS Backup: Centralized backup management.
AWS DRS: Block-level replication.
S3 CRR: Cross-Region Replication.
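
As a rough illustration of Route 53 DNS failover, the sketch below builds a change batch containing one PRIMARY and one SECONDARY failover record. It only constructs the dictionary and does not call AWS. The domain, IP addresses, health check ID, and the helper name `failover_change_batch` are placeholders assumed for this example.

```python
def failover_change_batch(domain, primary_ip, secondary_ip,
                          primary_health_check_id, ttl=60):
    """Build a Route 53 change batch with PRIMARY/SECONDARY failover A records."""

    def record(role, ip, health_check_id=None):
        rrset = {
            "Name": domain,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-region",  # must be unique per record
            "Failover": role,                           # "PRIMARY" or "SECONDARY"
            "TTL": ttl,                                 # short TTL speeds up failover
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id:
            # Route 53 stops answering with this record when the check fails.
            rrset["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Failover routing between two Regions (illustrative)",
        "Changes": [
            record("PRIMARY", primary_ip, primary_health_check_id),
            record("SECONDARY", secondary_ip),
        ],
    }


if __name__ == "__main__":
    # All values below are placeholders, not real resources.
    batch = failover_change_batch(
        domain="app.example.com.",
        primary_ip="203.0.113.10",
        secondary_ip="198.51.100.10",
        primary_health_check_id="hypothetical-health-check-id",
    )
    for change in batch["Changes"]:
        rrset = change["ResourceRecordSet"]
        print(rrset["Failover"], rrset["ResourceRecords"][0]["Value"])
```

With boto3 installed, a dictionary of this shape could be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, together with a `HostedZoneId`.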

Common Pitfalls

Not testing DR plans regularly (Game Days).
Hardcoded IP addresses instead of DNS names.
Ignoring Service Quotas in the DR region.
Forgetting IAM role parity across regions.
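
The Service Quotas pitfall lends itself to an automated parity check. The sketch below compares two hand-entered dictionaries of quota values and reports where the DR region falls short. The quota names and numbers are made up for illustration, and `quota_gaps` is a hypothetical helper. In practice the values would come from the Service Quotas console or API.

```python
def quota_gaps(primary_quotas, dr_quotas):
    """Return {quota_name: (primary_limit, dr_limit)} where the DR region is lower."""
    gaps = {}
    for name, primary_limit in primary_quotas.items():
        # A quota missing from the DR data is treated as 0 so it gets flagged.
        dr_limit = dr_quotas.get(name, 0)
        if dr_limit < primary_limit:
            gaps[name] = (primary_limit, dr_limit)
    return gaps


if __name__ == "__main__":
    # Hypothetical quota names and values, for illustration only.
    primary = {"ec2_on_demand_vcpus": 512, "elastic_ips": 20, "rds_instances": 40}
    dr = {"ec2_on_demand_vcpus": 64, "elastic_ips": 20}

    for name, (want, have) in quota_gaps(primary, dr).items():
        print(f"{name}: primary={want}, dr={have} -> request an increase")
```

Running a check like this before a Game Day helps avoid discovering a quota shortfall in the middle of a failover.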

Quick Patterns

Static Site: S3 + CloudFront (Origin Failover).
Databases: RDS Cross-Region Read Replicas.
Networking: Transit Gateway for Inter-Region peering.