High Availability, Scalability & Fault Tolerance
In the AWS ecosystem, resilient design means eliminating single points of failure. This module focuses on how to build systems that grow with demand (Scalability), remain available during localized outages (High Availability), and continue operating even when components fail (Fault Tolerance).
The Restaurant Analogy
Scalability: If your restaurant gets crowded, you add more tables (Horizontal Scaling) or buy a bigger stove (Vertical Scaling).
High Availability: You open two identical restaurants in different parts of town. If one street is closed for repairs, customers go to the other.
Fault Tolerance: Your kitchen has two ovens. If one explodes, the other is already hot and the chef doesn’t even stop chopping onions. The customer never notices a delay.
Core Concepts & The Well-Architected Framework
These concepts primarily fall under the Reliability Pillar of the AWS Well-Architected Framework. The goal is to provide consistent service despite hardware failures, load spikes, or network interruptions.
1. Scalability
- Vertical Scaling (Scaling Up): Increasing the power of an existing resource (e.g., moving from a `t3.micro` to an `m5.large`). Limited by hardware ceilings.
- Horizontal Scaling (Scaling Out): Adding more resources to your pool (e.g., adding more EC2 instances). This is the preferred approach for cloud-native architectures.
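The difference can be sketched as a simple capacity model (a hypothetical illustration; the instance counts, throughput units, and ceiling are invented, not AWS figures):

```python
# Hypothetical capacity model contrasting the two scaling strategies.
# Throughput numbers and the hardware ceiling are illustrative only.
HARDWARE_CEILING = 128  # biggest single machine you can buy (illustrative)

def vertical_scale(capacity_per_node: int, factor: int) -> int:
    """Scaling up: one bigger node. Capped by the largest size available."""
    return min(capacity_per_node * factor, HARDWARE_CEILING)

def horizontal_scale(capacity_per_node: int, node_count: int) -> int:
    """Scaling out: more identical nodes. No single-machine ceiling."""
    return capacity_per_node * node_count

print(vertical_scale(16, 4))     # 64  -- a bigger box
print(vertical_scale(16, 16))    # 128 -- hits the hardware ceiling
print(horizontal_scale(16, 16))  # 256 -- the fleet keeps growing
```

The ceiling is why horizontal scaling wins for cloud-native workloads: adding nodes has no hard upper bound, while a single machine eventually runs out of bigger sizes to move to.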
2. High Availability (HA)
HA is usually achieved by deploying resources across multiple Availability Zones (AZs). If one AZ goes down due to a power outage or flood, the application continues to run in the other AZ.
3. Fault Tolerance
Fault tolerance is a higher bar than HA. It implies zero downtime and zero data loss. It often requires redundant components that are “hot” and ready to take over instantly.
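The "hot spare" idea behind fault tolerance can be sketched in a few lines: two redundant components are both live, and a request transparently falls through to the healthy one. (The names here are hypothetical; this simulates the pattern, not any specific AWS service.)

```python
# Simulates a fault-tolerant pair: both "ovens" are hot, so a request
# fails over instantly to the healthy component with no visible downtime.
class Component:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} served {request}"

def fault_tolerant_call(components, request):
    for c in components:
        try:
            return c.handle(request)
        except RuntimeError:
            continue  # fail over to the next hot component
    raise RuntimeError("total outage: no healthy components")

pair = [Component("oven-a", healthy=False), Component("oven-b")]
print(fault_tolerant_call(pair, "order-42"))  # oven-b served order-42
```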
Service Comparison: Load Balancing
| Feature | Application Load Balancer (ALB) | Network Load Balancer (NLB) | Gateway Load Balancer (GWLB) |
|---|---|---|---|
| OSI Layer | Layer 7 (HTTP/HTTPS) | Layer 4 (TCP/UDP/TLS) | Layer 3 (IP Packets) |
| Best For | Microservices & Containers | Ultra-low latency / Gaming | Third-party Virtual Appliances |
| Key Feature | Path/Host-based routing | Static IP / Elastic IP support | Transparent inspection |
| Performance | High | Ultra-High (Millions of RPS) | High |
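The ALB's Layer-7, path-based routing can be illustrated with a toy rule table (target-group names are invented; a real ALB configures this through listener rules, not application code):

```python
# Toy Layer-7 router: match the URL path prefix, forward to a target group.
# Target-group names are hypothetical, for illustration only.
RULES = [
    ("/api",    "api-target-group"),
    ("/orders", "orders-target-group"),
]
DEFAULT = "web-target-group"

def route(path):
    for prefix, target in RULES:
        if path.startswith(prefix):
            return target
    return DEFAULT  # no rule matched: fall through to the default action

print(route("/api/v1/users"))  # api-target-group
print(route("/orders/123"))    # orders-target-group
print(route("/home"))          # web-target-group
```

An NLB, operating at Layer 4, never sees the URL path at all, which is exactly why path-based routing requirements point to an ALB.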
Scenario-Based Decision Matrix
- IF the requirement is to handle a sudden spike of millions of requests per second… THEN use Network Load Balancer.
- IF you need to route traffic based on the URL path (e.g., /api vs /orders)… THEN use Application Load Balancer.
- IF you need to ensure an RDS database survives an AZ failure… THEN enable Multi-AZ Deployment.
- IF you need to scale based on CPU utilization… THEN use EC2 Auto Scaling with a Target Tracking policy.
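Target tracking works like a thermostat: it adjusts capacity so the metric converges on the target value. The core proportional calculation can be sketched as follows (the real service adds cooldowns, warm-up periods, and smoothing on top of this):

```python
import math

def desired_capacity(current_capacity, current_cpu, target_cpu):
    """Proportional scaling: the capacity needed so average CPU lands on
    the target. Rounds up so the fleet never under-provisions."""
    return max(1, math.ceil(current_capacity * current_cpu / target_cpu))

# Fleet of 4 averaging 80% CPU, targeting 50% -> scale out
print(desired_capacity(4, 80.0, 50.0))   # 7
# Fleet of 10 averaging 20% CPU, targeting 50% -> scale in
print(desired_capacity(10, 20.0, 50.0))  # 4
```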
Exam Tips: Golden Nuggets
- ELB vs. ASG: The Load Balancer distributes traffic; the Auto Scaling Group adds/removes instances. They work together but are distinct services.
- Multi-AZ vs. Read Replicas: Multi-AZ is for Disaster Recovery (High Availability). Read Replicas are for Scaling read performance.
- Statelessness: To scale horizontally effectively, store session data in DynamoDB or ElastiCache, not on the local EC2 instance.
- Health Checks: Ensure your ELB health checks point to a path that actually returns a success response; otherwise the ELB will mark healthy instances as unhealthy and stop routing traffic to them.
Architecting for Resilience
Key Services
- Route 53: Global DNS with health checks.
- ELB: Distributes incoming app traffic.
- ASG: Automates fleet size based on load.
- RDS Multi-AZ: Synchronous replication for failover.
Common Pitfalls
- Single AZ: Using only one AZ is a major exam “No-Go”.
- Scaling Lag: Forgetting that new instances take time to boot.
- Hardcoded IPs: Load Balancer IP addresses can change over time; always use the Load Balancer's DNS name instead.
Quick Patterns
- Stateless: Move data to DynamoDB.
- Self-Healing: Use ASG to replace failed instances.
- Loose Coupling: Use SQS between tiers to buffer load.