Designing Highly Available Systems

In the AWS ecosystem, High Availability (HA) ensures that a system remains operational and accessible even when individual components fail. Unlike “Fault Tolerance,” which aims for zero downtime at a high cost, HA focuses on minimizing downtime to an acceptable level through redundancy and automated recovery.

The “Twin-Engine Plane” Analogy

Think of a highly available system like a modern commercial jet. These planes are designed with two or more engines. If one engine fails (a component failure), the plane doesn’t fall out of the sky; the remaining engine provides enough power to reach the destination safely. In AWS, the “engines” are your instances or databases, and the “plane” is your application. By spreading these engines across different Availability Zones (AZs), you ensure that even if a “bird strike” (data center outage) hits one, the journey continues.
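The analogy can be put into numbers. The sketch below is a back-of-the-envelope calculation, not an AWS formula: it assumes each "engine" fails independently of the others and that any single surviving engine can carry the full load. The 99% figure is an illustrative assumption, not a published SLA.

```python
def combined_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components when any one suffices.

    Assumes failures are independent, which real AZs only approximate.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one copy")
    # The system is down only when every copy is down at the same time.
    return 1.0 - (1.0 - component_availability) ** copies


HOURS_PER_YEAR = 365 * 24

for copies in (1, 2, 3):
    availability = combined_availability(0.99, copies)
    downtime_hours = (1.0 - availability) * HOURS_PER_YEAR
    print(f"{copies} engine(s): {availability:.6f} available, "
          f"about {downtime_hours:.2f} hours down per year")
```

Under these assumptions, a second 99% engine lifts the combined figure to 99.99%. Correlated failures, such as a bad deployment pushed to every AZ at once, break the independence assumption and reduce the benefit.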

Core Concepts & Well-Architected Framework

The Reliability Pillar of the AWS Well-Architected Framework provides the foundation for HA. Key principles include:

Horizontal Scaling: Replace one large resource with multiple smaller ones to reduce the impact of a single failure.
Multi-AZ Deployment: Distribute resources across physically separate Availability Zones within a Region.
Self-Healing: Use health checks and automation (Auto Scaling) to replace unhealthy components without human intervention.
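The three principles above can be combined in one toy reconciliation loop. This is a simplified model for intuition only, not the algorithm Auto Scaling actually runs; the `Instance` class, the `reconcile` function, and the instance IDs are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Instance:
    instance_id: str
    az: str
    healthy: bool = True


def reconcile(fleet, desired, azs):
    """Drop unhealthy instances, then launch replacements in the emptiest AZ."""
    survivors = [inst for inst in fleet if inst.healthy]
    per_az = Counter(inst.az for inst in survivors)
    while len(survivors) < desired:
        # Pick the AZ with the fewest instances to keep the fleet balanced.
        target_az = min(azs, key=lambda az: per_az[az])
        survivors.append(Instance(f"i-replacement-{len(survivors)}", target_az))
        per_az[target_az] += 1
    return survivors


fleet = [
    Instance("i-a1", "us-east-1a"),
    Instance("i-a2", "us-east-1a", healthy=False),
    Instance("i-b1", "us-east-1b", healthy=False),
    Instance("i-b2", "us-east-1b", healthy=False),
]
healed = reconcile(fleet, desired=4, azs=["us-east-1a", "us-east-1b"])
for inst in healed:
    print(inst.instance_id, inst.az)
```

Many small instances (horizontal scaling) spread over two AZs (Multi-AZ) mean that losing three of four still leaves one serving traffic while the loop restores capacity (self-healing).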

Comparison: Elastic Load Balancing (ELB) Variants

Feature	Application Load Balancer (ALB)	Network Load Balancer (NLB)	Gateway Load Balancer (GWLB)
OSI Layer	Layer 7 (HTTP/HTTPS)	Layer 4 (TCP/UDP/TLS)	Layer 3 (IP Packets)
Best Use Case	Web apps, Microservices, Containers	Ultra-low latency, Gaming, Static IPs	3rd-party Virtual Appliances (Firewalls)
Performance	High (ms latency)	Ultra-High (sub-ms latency)	High
Target Types	IP, Instance, Lambda	IP, Instance, ALB	IP, Instance

Scenario-Based Decision Matrix

If the requirement is to handle millions of requests per second with ultra-low latency, then use Network Load Balancer.
If you need to route traffic based on the URL path (/images vs /api), then use Application Load Balancer.
If you need a database with automatic failover to a standby instance, then enable RDS Multi-AZ.
If you need to survive a total AWS Regional failure, then deploy in multiple Regions and use Route 53 failover or latency-based routing between them.
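To make the path-based routing entry concrete, here is a minimal model of how an ALB listener picks a target group: rules are checked in priority order (lowest number first), and a default action catches everything else. The rule table and target group names are hypothetical, and `fnmatchcase` only approximates ALB path patterns (it also treats `[` as special, which ALB does not).

```python
from fnmatch import fnmatchcase

# (priority, path pattern, target group name); all names are hypothetical.
RULES = [
    (20, "/api/*", "api-targets"),
    (10, "/images/*", "images-targets"),
]
DEFAULT_TARGET_GROUP = "web-targets"


def route(path: str) -> str:
    """Return the target group of the first matching rule, by priority."""
    for _priority, pattern, target_group in sorted(RULES):
        # Matching is case-sensitive, as it is for ALB path conditions.
        if fnmatchcase(path, pattern):
            return target_group
    # No rule matched: fall through to the listener's default action.
    return DEFAULT_TARGET_GROUP


for path in ("/images/logo.png", "/api/v1/users", "/index.html"):
    print(path, "->", route(path))
```

A Network Load Balancer cannot make this decision because it works at Layer 4 and never inspects the HTTP path.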

Exam Tips: SAA-C03 Golden Nuggets

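Multi-AZ vs Read Replicas: Multi-AZ is for HA (synchronous replication, automatic failover); Read Replicas are for performance (asynchronous replication, scaling reads).
The “Free” HA: S3 and DynamoDB are highly available by design without manual configuration. S3 Standard stores data across at least 3 AZs (the One Zone storage classes are the exception), and DynamoDB replicates data across multiple AZs in a Region.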
NAT Gateways: They are redundant within an AZ, but not across AZs. For HA, you must create one NAT Gateway in each AZ.
Health Checks: Route 53 and ELB use health checks to redirect traffic. If an exam question mentions “removing unhealthy instances,” look for ELB + Auto Scaling.
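The NAT Gateway tip can be expressed as a small planning check. The sketch below assumes you already know which AZ each private subnet and NAT Gateway lives in; the function name and all resource IDs are hypothetical, and nothing here calls AWS.

```python
def plan_nat_routes(private_subnets, nat_gateways):
    """Map each private subnet to the NAT Gateway in its own AZ.

    private_subnets: {subnet_id: az}
    nat_gateways:    {az: nat_gateway_id}
    Raises ValueError if any AZ with private subnets has no NAT Gateway.
    """
    missing = sorted({az for az in private_subnets.values() if az not in nat_gateways})
    if missing:
        raise ValueError(f"no NAT Gateway in: {', '.join(missing)}")
    # Each subnet's default route points at the gateway in the same AZ,
    # so an outage in one AZ cannot cut off egress for the others.
    return {subnet: nat_gateways[az] for subnet, az in private_subnets.items()}


subnets = {"subnet-app-a": "us-east-1a", "subnet-app-b": "us-east-1b"}
gateways = {"us-east-1a": "nat-aaa", "us-east-1b": "nat-bbb"}
print(plan_nat_routes(subnets, gateways))
```

With a single shared gateway, the same call raises an error for the uncovered AZ, which is the single-point-of-failure case the tip warns about.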

High Availability Architecture Flow

Key Services

Route 53: DNS Failover & Latency routing.
Auto Scaling: Maintains instance count.
ELB: Distributes load across AZs.
RDS Multi-AZ: Synchronous replication.
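As an illustration of Route 53 DNS failover, the sketch below builds the kind of change batch that the Route 53 ChangeResourceRecordSets API accepts for a primary/secondary record pair. It only constructs a dictionary and does not call AWS. The domain, IP addresses, and health check ID are placeholders, and the helper name is hypothetical.

```python
import json


def failover_change_batch(name, primary_ip, secondary_ip, health_check_id, ttl=60):
    """Build a PRIMARY/SECONDARY failover record pair for one DNS name."""

    def record(role, ip):
        record_set = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-endpoint",
            "Failover": role,
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip}],
        }
        if role == "PRIMARY":
            # Route 53 answers with the secondary only while this check fails.
            record_set["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": record_set}

    return {
        "Comment": "Failover pair (illustrative values)",
        "Changes": [record("PRIMARY", primary_ip), record("SECONDARY", secondary_ip)],
    }


batch = failover_change_batch(
    name="app.example.com.",
    primary_ip="192.0.2.10",        # documentation-range placeholder
    secondary_ip="198.51.100.10",   # documentation-range placeholder
    health_check_id="hypothetical-health-check-id",
)
print(json.dumps(batch, indent=2))
```

A short TTL matters here: clients keep using a cached answer until it expires, so the TTL puts a floor on how quickly failover becomes visible.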

Common Pitfalls

Deploying all resources in a single AZ.
Hardcoding IP addresses instead of DNS names.
Missing health checks on Load Balancers.
Ignoring Cross-Zone Load Balancing settings.
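The cross-zone pitfall is easiest to see with numbers. The sketch below is a simplified model that assumes clients spread evenly across the load balancer nodes, one node per AZ; the function name and the instance counts are illustrative.

```python
def traffic_share_per_instance(instances_per_az, cross_zone):
    """Fraction of total traffic each instance receives, keyed by AZ.

    instances_per_az: {az: number of healthy instances}
    """
    populated = {az: n for az, n in instances_per_az.items() if n > 0}
    total_instances = sum(populated.values())
    shares = {}
    for az, n in populated.items():
        if cross_zone:
            # Every node can reach every instance, so load is spread evenly.
            shares[az] = 1 / total_instances
        else:
            # Each node keeps its equal slice of traffic inside its own AZ.
            shares[az] = 1 / len(populated) / n
    return shares


fleet = {"us-east-1a": 2, "us-east-1b": 8}
print("cross-zone off:", traffic_share_per_instance(fleet, cross_zone=False))
print("cross-zone on: ", traffic_share_per_instance(fleet, cross_zone=True))
```

With cross-zone off and an uneven fleet, the two instances in the smaller AZ each carry four times the load of an instance in the larger AZ. Turning it on evens this out.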

Quick Patterns

Active-Active: Both AZs serve traffic (Highest HA).
Active-Passive: Secondary AZ on standby.
Pilot Light: Core components (typically the replicated data layer) kept live in another Region; the rest is provisioned on failover.