High Availability (HA) Concepts for SAA-C03

In the AWS ecosystem, High Availability refers to a system’s ability to remain operational and accessible even when individual components fail. Unlike “Fault Tolerance,” which aims for zero downtime (often at a high cost), HA focuses on 99.9% to 99.999% uptime by ensuring rapid, automated recovery and redundancy.
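To make the "nines" concrete, the sketch below converts an availability target into the downtime it allows per year. The function name and the 365-day year are assumptions chosen for illustration; this is plain arithmetic, not an AWS SLA calculation.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes (assumes a non-leap year)

def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime allowed by an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction

for target in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:.1f} minutes of downtime per year")
```

Each extra nine cuts the allowed downtime by a factor of ten: roughly 8.8 hours per year at 99.9%, under an hour at 99.99%, and a little over five minutes at 99.999%.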

The Coffee Shop Analogy: Imagine a coffee shop with only one espresso machine. If it breaks, the shop closes (Single Point of Failure). An HA Coffee Shop has two machines and two baristas. If one machine breaks, the other keeps serving customers. You might experience a slight delay, but the shop stays open.

Topics covered in this guide:

Difference between Availability, Durability, and Fault Tolerance.
Multi-AZ Architecture as the foundation of HA.
Regional vs. Zonal Service availability.
Elastic Load Balancing (ELB) and Auto Scaling Groups (ASG).
Database HA (RDS Multi-AZ vs. Read Replicas).
The role of Route 53 in HA.

1. Core Concepts & Definitions

Availability vs. Durability

Students often confuse these two. Availability is the percentage of time a service is “up” and able to respond to requests (e.g., S3 Standard is designed for 99.99% availability). Durability is the likelihood that stored data will not be lost over a given year (e.g., S3 is designed for 99.999999999% durability, or “eleven nines”).
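A short calculation shows how large the gap between the two figures is. The sketch below treats the durability figure as an annual, per-object probability (an assumption made for illustration) and uses exact fractions to avoid rounding noise. The function names are hypothetical.

```python
from fractions import Fraction

# Eleven nines, treated here as an annual per-object survival probability (assumption).
S3_DURABILITY = Fraction(99_999_999_999, 100_000_000_000)

def expected_objects_lost_per_year(object_count: int) -> Fraction:
    """Expected number of objects lost in one year."""
    return object_count * (1 - S3_DURABILITY)

def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_objects_lost_per_year(object_count)

print(years_per_lost_object(10_000_000))  # prints 10000
```

With 10 million objects, this works out to an expected loss of one object every 10,000 years. Availability at 99.99%, by contrast, still allows close to an hour of unreachable time each year.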

High Availability vs. Fault Tolerance

Feature	High Availability (HA)	Fault Tolerance (FT)
Downtime	Minimal (seconds to minutes)	Zero
Cost	Moderate	Very High (requires 2x resources)
Complexity	Standard Multi-AZ	Specialized (e.g., fully redundant active-active components)

2. Architectural Patterns for HA

The Multi-AZ Principle

The gold standard for SAA-C03 is the Multi-AZ deployment. By distributing resources across at least two Availability Zones within a Region, you protect your application from the failure of a single AZ (one or more data centers).

Compute: Use an ASG with a “Minimum Capacity” of 2, spread across 2 AZs.
Database: Use RDS Multi-AZ for synchronous replication and automatic failover.
Networking: Use an Application Load Balancer (ALB) to route traffic only to healthy instances in multiple AZs.
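A rough calculation shows why a second AZ helps so much. The sketch below assumes that AZ failures are independent and uses an illustrative 99% per-AZ figure; neither is an AWS-published number, and the function names are hypothetical.

```python
import math

def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent replicas where any one can serve."""
    return 1 - (1 - single) ** copies

def serial_availability(*tiers: float) -> float:
    """Availability of a request path that needs every tier to be up."""
    return math.prod(tiers)

one_az = 0.99  # illustrative per-AZ figure, not an AWS number
two_az = redundant_availability(one_az, 2)

print(f"one tier in 1 AZ:           {one_az:.4f}")
print(f"one tier across 2 AZs:      {two_az:.4f}")
print(f"two such tiers in sequence: {serial_availability(two_az, two_az):.4f}")
```

Redundancy within a tier multiplies the failure probabilities, while chaining tiers multiplies the availabilities. That is why every tier (compute, database, networking) needs its own Multi-AZ story: a single non-redundant tier caps the availability of the whole path.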

3. Service-Specific HA Strategies

RDS & Aurora

For the exam, remember: RDS Multi-AZ is for Disaster Recovery/HA (Synchronous), while Read Replicas are for Performance/Scaling (Asynchronous). Amazon Aurora is HA by default, storing 6 copies of your data across 3 AZs.
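The synchronous versus asynchronous distinction can be sketched with a toy model. The class below is purely illustrative (it is not how RDS is implemented, and all names are hypothetical): a synchronous write reaches the standby before it is acknowledged, while an asynchronous replica only catches up when replication runs.

```python
class ReplicatedDatabase:
    """Toy model of a primary with one standby or replica."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []
        self.replica = []

    def write(self, record: str) -> None:
        self.primary.append(record)
        if self.synchronous:
            # Multi-AZ style: the standby has the record before the write returns.
            self.replica.append(record)

    def replicate(self) -> None:
        """Asynchronous catch-up: copy everything the replica is missing."""
        self.replica = list(self.primary)

    def fail_over(self) -> list:
        """Lose the primary, promote the replica, and return the lost records."""
        lost = self.primary[len(self.replica):]
        self.primary = list(self.replica)
        return lost

multi_az = ReplicatedDatabase(synchronous=True)
multi_az.write("order-1")
multi_az.write("order-2")
print(multi_az.fail_over())  # prints []

read_replica = ReplicatedDatabase(synchronous=False)
read_replica.write("order-1")
read_replica.replicate()
read_replica.write("order-2")  # not yet replicated when the primary fails
print(read_replica.fail_over())  # prints ['order-2']
```

The asynchronous case loses whatever was written during the replication lag, which is why Read Replicas are a scaling tool rather than a zero-data-loss failover mechanism.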

S3 & EFS

S3 is a Regional service; it is inherently HA. Amazon EFS (Elastic File System) is also Regional, allowing multiple EC2 instances in different AZs to access the same file system simultaneously.

Exam Tips and Gotchas

Single Point of Failure (SPOF): Any architecture with a single EC2 instance, a single AZ, or a non-replicated database is NOT Highly Available.
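NAT Gateways: A NAT Gateway lives in a single AZ. For high availability, deploy one NAT Gateway in each AZ and point each private subnet's route table at the gateway in its own AZ. If the AZ hosting the only NAT Gateway goes down, private subnets in every other AZ that route through it lose outbound internet access.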
Route 53 Health Checks: Route 53 can remove an unhealthy endpoint from its DNS response. This is the first line of defense for HA across Regions.
Sticky Sessions: Be careful with “Session Affinity” on ALBs. While useful, it can lead to uneven load distribution, potentially impacting HA during a spike.
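The NAT Gateway gotcha above can be checked with a small model. The sketch below is a plain-Python illustration, not an AWS API call; the subnet, gateway, and AZ names are hypothetical. It reports which private subnets in healthy AZs lose outbound access when one AZ fails.

```python
def subnets_without_egress(subnet_az, subnet_nat, nat_az, failed_az):
    """Return subnets in healthy AZs whose NAT Gateway sits in the failed AZ."""
    stranded = []
    for subnet, az in subnet_az.items():
        if az == failed_az:
            continue  # the subnet itself is down, so egress is moot
        if nat_az[subnet_nat[subnet]] == failed_az:
            stranded.append(subnet)
    return sorted(stranded)

subnet_az = {"private-a": "az-a", "private-b": "az-b"}

# Design 1: one shared NAT Gateway in az-a.
shared = subnets_without_egress(
    subnet_az,
    subnet_nat={"private-a": "nat-a", "private-b": "nat-a"},
    nat_az={"nat-a": "az-a"},
    failed_az="az-a",
)
print(shared)  # prints ['private-b']

# Design 2: one NAT Gateway per AZ, each subnet routes to its local gateway.
per_az = subnets_without_egress(
    subnet_az,
    subnet_nat={"private-a": "nat-a", "private-b": "nat-b"},
    nat_az={"nat-a": "az-a", "nat-b": "az-b"},
    failed_az="az-a",
)
print(per_az)  # prints []
```

In the shared design, a failure in one AZ strands a subnet in a perfectly healthy AZ, which is exactly the cross-AZ dependency the exam expects you to spot.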

Decision Matrix / If–Then Guide

If the requirement is…	Then choose…
Automatic failover for a SQL database	RDS Multi-AZ Deployment
Shared storage for EC2 across multiple AZs	Amazon EFS
Protect against a whole AWS Region failing	Route 53 with Multi-Region Failover Routing
Zero data loss for a critical application	Synchronous Replication (Multi-AZ)
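The Region-failover choice can be pictured as a simple active-passive decision. The sketch below is a toy model of failover routing, not the Route 53 API; the endpoint names are hypothetical, and it deliberately leaves out the case where both endpoints are unhealthy.

```python
def failover_answer(primary: str, secondary: str, healthy: set) -> str:
    """Answer with the primary while its health check passes, else the secondary."""
    return primary if primary in healthy else secondary

PRIMARY = "app.us-east-1.example.com"    # hypothetical endpoint
SECONDARY = "app.eu-west-1.example.com"  # hypothetical endpoint

print(failover_answer(PRIMARY, SECONDARY, healthy={PRIMARY, SECONDARY}))
print(failover_answer(PRIMARY, SECONDARY, healthy={SECONDARY}))
```

The first call returns the primary Region's endpoint; the second returns the standby Region because the primary's health check has failed. Clients only see the change once their cached DNS answer expires, so a short TTL matters for fast failover.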

Infographic: The HA Architecture Flow

Service Ecosystem

IAM: Control permissions for failover actions.
CloudWatch: Trigger ASG scaling based on health metrics.
VPC: Use subnets in different AZs for network isolation.

Performance & Scaling

Horizontal Scaling: Adding more instances (Ideal for HA).
Vertical Scaling: Increasing instance size (Risky for HA as it requires a restart).
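Horizontal scaling only delivers HA if the surviving AZs can carry the full load. A common sizing rule, sketched below under the assumption that instances are spread evenly and that exactly one AZ may fail, is to provision enough instances per AZ that the remaining AZs still cover peak demand. The function name is hypothetical.

```python
import math

def instances_per_az(peak_instances: int, az_count: int) -> int:
    """Instances needed in each AZ so that losing one AZ still covers peak load."""
    if az_count < 2:
        raise ValueError("surviving an AZ failure needs at least two AZs")
    return math.ceil(peak_instances / (az_count - 1))

peak = 6  # instances needed to serve peak traffic (illustrative)
for azs in (2, 3):
    per_az = instances_per_az(peak, azs)
    print(f"{azs} AZs: {per_az} per AZ, {per_az * azs} total")
```

With a peak need of 6 instances, two AZs require 12 instances in total while three AZs require only 9, which is one reason three-AZ designs are often cheaper to make resilient.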

Cost Optimization

Use Reserved Instances for baseline HA capacity and Spot Instances for additional scaling capacity where workload interruption is acceptable.

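Production Use Case: A banking application uses ALB + ASG across 3 AZs. The database is Amazon Aurora with at least one replica in another AZ. If an entire Availability Zone loses power, the ALB automatically stops sending traffic to targets there, the ASG launches replacement instances in the healthy AZs, and Aurora promotes a replica in a healthy AZ to primary, typically in under a minute and often within 30 seconds.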