Disaster Recovery Strategies: The Pilot Light Pattern

In the AWS ecosystem, achieving High Availability (HA) and Disaster Recovery (DR) is about balancing Recovery Time Objective (RTO), Recovery Point Objective (RPO), and Cost. The “Pilot Light” strategy is a middle-ground approach often tested in the SAA-C03 exam for scenarios requiring low cost but relatively fast recovery.

The Real-World Analogy

Think of a gas water heater. The pilot light is a tiny flame that is always burning. It doesn’t heat the water itself, but when you need a hot shower, it provides the spark to ignite the main burners instantly. Without the pilot light, you’d have to find matches and manually start the heater, which takes much longer.

Core Concepts of Pilot Light

In a Pilot Light architecture, your most critical data is always “on” (replicated and live), while your application infrastructure (EC2 instances, specialized compute) is “off” or exists only as templates (AMIs/CloudFormation). If a disaster occurs, you “turn on” the rest of the infrastructure to match the live data.

Key Architectural Components

  • Data Replication: Databases are kept running in a secondary region. This is typically achieved with Amazon RDS cross-Region read replicas, Aurora Global Database, or DynamoDB global tables.
  • Infrastructure Templates: Application servers are NOT running. Instead, you store Amazon Machine Images (AMIs) and Infrastructure as Code (Terraform/CloudFormation) scripts ready to deploy.
  • DNS Failover: Amazon Route 53 uses health checks to detect failure in the primary region and update DNS records to point to the secondary region once the infrastructure is provisioned.
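
As a rough illustration of the DNS Failover component, the sketch below builds the request body for a pair of Route 53 failover records. The record name, the target hostnames, and the health check ID are hypothetical placeholders, and CNAME is only one possible record type; treat this as one way the records might be laid out, not a definitive configuration.

```python
def failover_change_batch(record_name, primary_target, secondary_target,
                          primary_health_check_id, ttl=60):
    """Build a Route 53 ChangeBatch with PRIMARY/SECONDARY failover records."""

    def record(role, target, health_check_id=None):
        rrset = {
            "Name": record_name,
            "Type": "CNAME",
            "SetIdentifier": f"{role.lower()}-endpoint",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,        # a short TTL lets clients pick up the switch sooner
            "ResourceRecords": [{"Value": target}],
        }
        # Only the primary record needs a health check to trigger failover.
        if health_check_id is not None:
            rrset["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Pilot Light failover records (illustrative)",
        "Changes": [
            record("PRIMARY", primary_target, primary_health_check_id),
            record("SECONDARY", secondary_target),
        ],
    }


# All names below are hypothetical.
batch = failover_change_batch(
    "app.example.com.",
    "primary-alb.example.com",
    "dr-alb.example.com",
    "hypothetical-health-check-id",
)
```

With boto3, a dictionary like this would be passed as the ChangeBatch argument of the Route 53 client's change_resource_record_sets call, together with the hosted zone ID.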

Comparison of DR Strategies

Strategy                   | RTO (Time)    | RPO (Data Loss)    | Cost           | Complexity
Backup & Restore           | Hours/Days    | High (Last Backup) | $ (Lowest)     | Simple
Pilot Light                | Minutes/Hours | Low (Live Data)    | $$             | Moderate
Warm Standby               | Minutes       | Near Zero          | $$$            | Complex
Multi-Site (Active-Active) | Near Zero     | Zero               | $$$$ (Highest) | Very Complex

Decision Matrix: When to Choose Pilot Light

  • IF the requirement is to keep costs low but ensure data loss is minimal… THEN choose Pilot Light.
  • IF the business can tolerate 30-60 minutes of downtime to spin up servers… THEN choose Pilot Light.
  • IF the application requires “Zero RTO” (instant failover)… THEN Pilot Light is NOT the answer (Choose Multi-Site).
  • IF the primary region is compromised… THEN trigger an Auto Scaling group (ASG), manually or through automation, to launch instances from pre-baked AMIs.
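
The decision rules above can be sketched as a small helper. The 30-minute boundary comes from the downtime figure in the list; the 240-minute boundary for Backup & Restore is an assumption added purely for illustration, and real requirements rarely reduce to two inputs.

```python
def choose_dr_strategy(tolerable_downtime_min, needs_low_data_loss):
    """Pick a DR strategy from tolerable downtime and data-loss sensitivity.

    Thresholds are illustrative: 30 minutes follows the guide's
    "30-60 minutes" rule; 240 minutes is a hypothetical cut-off.
    """
    if tolerable_downtime_min <= 0:
        # "Zero RTO" means Pilot Light is not the answer.
        return "Multi-Site (Active-Active)"
    if tolerable_downtime_min < 30:
        # Too little time to spin up servers from templates.
        return "Warm Standby"
    if tolerable_downtime_min >= 240 and not needs_low_data_loss:
        # Hours of downtime and loss back to the last backup are acceptable.
        return "Backup & Restore"
    return "Pilot Light"


examples = [
    choose_dr_strategy(0, True),
    choose_dr_strategy(10, True),
    choose_dr_strategy(45, True),
    choose_dr_strategy(1440, False),
]
```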

Exam Tips and Gotchas

  • The Distractor: Exam questions often offer “Warm Standby” as a distractor for Pilot Light. Remember: In Pilot Light, app servers are NOT running. In Warm Standby, a “scaled-down” version (e.g., 1 small instance) IS running.
  • Golden Nugget: Pilot Light relies heavily on RDS Cross-Region Read Replicas. During failover, you must promote the Read Replica to a standalone primary DB instance.
  • Automation: For the SAA-C03, assume Pilot Light involves CloudFormation or Auto Scaling to quickly provision the “missing” compute layer.
  • Storage: EBS Snapshots should be copied to the destination region regularly if not using RDS.
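
To make the failover steps concrete, here is a minimal sketch of an "ignition" routine: promote the replica, then scale the dormant Auto Scaling group. The method names mirror the boto3 RDS and Auto Scaling client calls promote_read_replica and update_auto_scaling_group; the resource names are hypothetical, and the clients are passed in so the example can run as a dry run with a recording stand-in instead of real AWS clients.

```python
def ignite_pilot_light(rds, autoscaling, replica_id, asg_name, desired_capacity):
    """Run the two failover steps against the DR region's clients."""
    # Step 1: promote the cross-Region read replica to a standalone instance.
    # Promotion is asynchronous; a real runbook would wait for it to finish.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    # Step 2: scale the dormant Auto Scaling group up from zero.
    # Assumes the group's MaxSize is already >= desired_capacity.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=desired_capacity,
        DesiredCapacity=desired_capacity,
    )
    return ["promote_read_replica", "update_auto_scaling_group"]


class RecordingClient:
    """Stand-in for a boto3 client that only records calls (dry run)."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any unknown attribute becomes a method that logs its keyword args.
        def method(**kwargs):
            self.calls.append((name, kwargs))
            return {}
        return method


# Dry run with hypothetical resource names; no AWS access is needed.
rds_client, asg_client = RecordingClient(), RecordingClient()
steps = ignite_pilot_light(rds_client, asg_client, "app-db-replica", "app-asg-dr", 2)
```

In a real runbook the two stand-ins would be replaced by boto3 clients created for the DR region, and the Route 53 switch would follow once the instances pass health checks.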

Topics Covered

Key subtopics in this guide:

  • Definition of Pilot Light vs. other DR strategies.
  • RTO and RPO characteristics for SAA-C03 scenarios.
  • Core AWS services involved: Route 53, RDS, AMI, CloudFormation.
  • Cost-optimization benefits of keeping compute “off.”
  • The failover process: Promoting databases and scaling compute.

Infographic: Pilot Light Architecture

[Figure: The Primary Region (Active) runs the EC2 fleet and the master DB; a live sync feeds the Read Replica in the DR Region (Pilot Light), where AMIs are kept ready.]

Service Ecosystem

Route 53: Health checks and failover routing records (for example, CNAME or alias records).

RDS: Cross-region replication is the “flame.”

IAM & KMS: KMS keys are Region-specific, so ensure the keys (and the IAM permissions to use them) are available in both regions for encrypted data.

Cost Optimization

Savings: Because instances aren’t running, you avoid most DR-region compute costs (roughly 70-90% as a rule of thumb; the exact figure depends on the workload).

Trade-off: You pay for storage (EBS snapshots, AMIs) and the minimal database instance footprint.
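
The trade-off can be worked through with a small calculation. Every price and quantity below is a made-up placeholder rather than an AWS list price, and the model ignores data transfer and replication charges; it only shows how the compute line disappears while the database and storage lines remain.

```python
def monthly_dr_cost(instance_count, instance_hourly_usd, db_monthly_usd,
                    storage_monthly_usd, compute_running, hours=730):
    """Estimate the monthly DR-region bill under a simplified cost model."""
    # Compute is billed only when the app servers are actually running.
    compute = instance_count * instance_hourly_usd * hours if compute_running else 0.0
    # The replica database and snapshot/AMI storage are paid either way.
    return compute + db_monthly_usd + storage_monthly_usd


# Hypothetical inputs: 4 instances at $0.10/hour, a $150/month replica,
# and $20/month of snapshot and AMI storage.
full_duplicate = monthly_dr_cost(4, 0.10, 150.0, 20.0, compute_running=True)
pilot_light = monthly_dr_cost(4, 0.10, 150.0, 20.0, compute_running=False)
compute_saved = full_duplicate - pilot_light
```

Under these assumed numbers the pilot light footprint is just the database plus storage, and the whole difference between the two bills is the idle compute.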

Performance & Scaling

Recovery: Scaling is not instantaneous. You must account for the time it takes for EC2 instances to launch and pass health checks.

Tip: Use Warm Pools for Auto Scaling to speed up the “Ignition” phase.
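
A back-of-the-envelope model can show where a warm pool helps. The sketch assumes that database promotion and compute start-up run in parallel and that DNS switches only after both finish; that ordering, and all the minute values, are illustrative assumptions rather than measured figures.

```python
def estimate_rto_minutes(detection, db_promotion, instance_boot,
                         health_check_pass, dns_ttl):
    """Rough RTO: detect, bring up DB and compute in parallel, then switch DNS."""
    compute_ready = instance_boot + health_check_pass
    # The slower of the two parallel tracks determines when traffic can move.
    return detection + max(db_promotion, compute_ready) + dns_ttl


# Hypothetical timings in minutes.
cold_start = estimate_rto_minutes(detection=3, db_promotion=5,
                                  instance_boot=8, health_check_pass=4, dns_ttl=1)
warm_pool = estimate_rto_minutes(detection=3, db_promotion=5,
                                 instance_boot=2, health_check_pass=4, dns_ttl=1)
minutes_saved = cold_start - warm_pool
```

In this model a warm pool shortens recovery only while compute start-up is the slower track; once database promotion dominates, faster instance boot no longer reduces the total.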
