Disaster Recovery Strategies: The Pilot Light Pattern
In the AWS ecosystem, achieving High Availability (HA) and Disaster Recovery (DR) is about balancing Recovery Time Objective (RTO), Recovery Point Objective (RPO), and Cost. The “Pilot Light” strategy is a middle-ground approach often tested in the SAA-C03 exam for scenarios requiring low cost but relatively fast recovery.
The Real-World Analogy
Think of a gas water heater. The pilot light is a tiny flame that is always burning. It doesn’t heat the water itself, but when you need a hot shower, it provides the spark to ignite the main burners instantly. Without the pilot light, you’d have to find matches and manually start the heater, which takes much longer.
Core Concepts of Pilot Light
In a Pilot Light architecture, your most critical data is always “on” (replicated and live), while your application infrastructure (EC2 instances, specialized compute) is “off” or exists only as templates (AMIs/CloudFormation). If a disaster occurs, you “turn on” the rest of the infrastructure to match the live data.
Key Architectural Components
- Data Replication: Databases are kept running in a secondary region. This is typically achieved with Amazon RDS cross-region read replicas, Aurora Global Database, or DynamoDB global tables.
- Infrastructure Templates: Application servers are NOT running. Instead, you store Amazon Machine Images (AMIs) and Infrastructure as Code (Terraform/CloudFormation) scripts ready to deploy.
- DNS Failover: Amazon Route 53 uses health checks to detect failure in the primary region and, through failover routing, answers DNS queries with the secondary region’s record. The secondary endpoint can only serve traffic once its infrastructure has been provisioned.
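The DNS failover component can be sketched as a pair of Route 53 failover records. The helper below only builds the `ChangeBatch` dictionary. The function name, record name, endpoints, and health check ID are hypothetical placeholders, and CNAME records are assumed (alias records are another option).

```python
def failover_change_batch(record_name, primary_dns, secondary_dns,
                          health_check_id, ttl=60):
    """Build a Route 53 ChangeBatch with a PRIMARY/SECONDARY failover pair.

    All argument values are supplied by the caller; nothing here is
    specific to a real environment.
    """
    def record(role, target, extra):
        rrset = {
            "Name": record_name,
            "Type": "CNAME",
            "SetIdentifier": f"{role.lower()}-region",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
        }
        rrset.update(extra)
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Pilot Light failover pair (illustrative)",
        "Changes": [
            # The primary record is tied to a health check; when the check
            # fails, Route 53 answers with the secondary record instead.
            record("PRIMARY", primary_dns, {"HealthCheckId": health_check_id}),
            record("SECONDARY", secondary_dns, {}),
        ],
    }


batch = failover_change_batch(
    "app.example.com.",
    "primary-alb.example.com",    # hypothetical primary endpoint
    "secondary-alb.example.com",  # hypothetical secondary endpoint
    "hc-placeholder-id",          # hypothetical health check ID
)
```

With boto3, a batch like this would be passed to `change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)` on a Route 53 client. A low TTL keeps clients from caching the primary answer for long after a failover.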
Comparison of DR Strategies
| Strategy | RTO (Time) | RPO (Data Loss) | Cost | Complexity |
|---|---|---|---|---|
| Backup & Restore | Hours/Days | High (Last Backup) | $ (Lowest) | Simple |
| Pilot Light | Minutes/Hours | Low (Live Data) | $$ | Moderate |
| Warm Standby | Minutes | Near Zero | $$$ | Complex |
| Multi-Site (Active-Active) | Near Zero | Zero | $$$$ (Highest) | Very Complex |
Decision Matrix: When to Choose Pilot Light
- IF the requirement is to keep costs low but ensure data loss is minimal… THEN choose Pilot Light.
- IF the business can tolerate 30-60 minutes of downtime to spin up servers… THEN choose Pilot Light.
- IF the application requires “Zero RTO” (instant failover)… THEN Pilot Light is NOT the answer (Choose Multi-Site).
- IF the primary region fails… THEN failover requires provisioning compute in the secondary region, for example by manually or automatically scaling out an Auto Scaling group (ASG) that launches instances from pre-baked AMIs.
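The decision matrix can be read as “pick the cheapest strategy that still meets the RTO and RPO targets.” The sketch below encodes that idea. The numeric thresholds are illustrative assumptions loosely based on the comparison table, not AWS-published limits.

```python
# (strategy, best-case RTO in minutes, best-case RPO in minutes), cheapest first.
# These numbers are assumptions for illustration only.
STRATEGIES = [
    ("Backup & Restore", 240, 1440),
    ("Pilot Light", 30, 5),
    ("Warm Standby", 5, 1),
    ("Multi-Site (Active-Active)", 0, 0),
]


def choose_dr_strategy(rto_minutes, rpo_minutes):
    """Return the cheapest strategy whose assumed RTO/RPO meet the targets."""
    for name, best_rto, best_rpo in STRATEGIES:
        if best_rto <= rto_minutes and best_rpo <= rpo_minutes:
            return name
    # Unreachable with the table above, since Multi-Site is (0, 0).
    return STRATEGIES[-1][0]


print(choose_dr_strategy(rto_minutes=60, rpo_minutes=5))  # Pilot Light
print(choose_dr_strategy(rto_minutes=0, rpo_minutes=0))   # Multi-Site (Active-Active)
```

Ordering the list by cost means the first match is also the cheapest acceptable option, which mirrors how exam scenarios are usually phrased (“most cost-effective solution that meets the requirements”).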
Exam Tips and Gotchas
- The Distractor: Exam questions often offer “Warm Standby” as a distractor for Pilot Light. Remember: In Pilot Light, app servers are NOT running. In Warm Standby, a “scaled-down” version (e.g., 1 small instance) IS running.
- Golden Nugget: Pilot Light relies heavily on RDS cross-region read replicas. During failover, you must promote the read replica to a standalone primary DB instance.
- Automation: For the SAA-C03, assume Pilot Light involves CloudFormation or Auto Scaling to quickly provision the “missing” compute layer.
- Storage: EBS Snapshots should be copied to the destination region regularly if not using RDS.
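The promotion and provisioning steps from these tips can be sketched as a small failover routine. It takes boto3-style RDS and Auto Scaling clients as arguments so that it can be exercised without AWS access. The function name and all identifiers are hypothetical, and a real runbook would need error handling and a stronger check that the promotion has finished.

```python
def ignite_pilot_light(rds, autoscaling, replica_id, asg_name,
                       min_size, max_size, desired_capacity):
    """Promote the DR database, then scale the compute layer up from zero.

    `rds` and `autoscaling` are expected to behave like
    boto3.client("rds") and boto3.client("autoscaling") in the DR region.
    """
    # 1. Promote the cross-region read replica to a standalone instance.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    # 2. Block until the instance reports "available". Assumption: this
    #    wait is a sufficient signal for a sketch; production code should
    #    also confirm the instance is no longer a replica.
    waiter = rds.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier=replica_id)

    # 3. Raise the ASG limits so instances launch from the pre-baked AMI
    #    referenced by the group's launch template.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=min_size,
        MaxSize=max_size,
        DesiredCapacity=desired_capacity,
    )
```

In use, the clients would be created for the secondary region, for example `boto3.client("rds", region_name="us-west-2")`, where the region is only an example. Promoting the database before scaling compute means new instances connect to a writable database as soon as they boot.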
Topics Covered
Key subtopics covered in this guide:
- Definition of Pilot Light vs. other DR strategies.
- RTO and RPO characteristics for SAA-C03 scenarios.
- Core AWS services involved: Route 53, RDS, AMI, CloudFormation.
- Cost-optimization benefits of keeping compute “off.”
- The failover process: Promoting databases and scaling compute.
Infographic: Pilot Light Architecture
- Route 53: Health checks and DNS failover records (CNAME or alias).
- RDS: Cross-region replication is the “flame.”
- IAM & KMS: KMS keys are regional, so ensure the keys needed to decrypt replicated data and snapshots exist in both regions.
- Savings: Compute costs can drop by an estimated 70-90% because instances aren’t running.
- Trade-off: You still pay for storage (EBS snapshots, AMIs) and the minimal database instance footprint.
- Recovery: Scaling is not instantaneous. You must account for the time it takes for EC2 instances to launch and pass health checks.
- Tip: Use Auto Scaling warm pools (pre-initialized instances) to speed up the “ignition” phase.
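To make the recovery note concrete, the sketch below adds up the phases that contribute to Pilot Light RTO. The 30-second interval and failure threshold of 3 match Route 53’s default health check settings. The promotion, boot, and health check durations are placeholder assumptions that would need to be measured in a real DR drill.

```python
def estimate_rto_seconds(check_interval_s=30, failure_threshold=3,
                         db_promotion_s=600, instance_boot_s=300,
                         health_check_pass_s=120, dns_ttl_s=60):
    """Rough RTO estimate, assuming the phases run one after another."""
    # Time for Route 53 to declare the primary endpoint unhealthy.
    detection_s = check_interval_s * failure_threshold
    return (detection_s + db_promotion_s + instance_boot_s
            + health_check_pass_s + dns_ttl_s)


rto = estimate_rto_seconds()
print(f"{rto} s = {rto / 60:.1f} min")  # 1170 s = 19.5 min
```

Shrinking any single term lowers the total: warm pools reduce `instance_boot_s`, and a shorter DNS TTL reduces the time clients keep resolving to the failed region.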