High Availability, Scalability & Fault Tolerance
In the AWS ecosystem, resilient design means eliminating single points of failure. This module focuses on how to build systems that grow with demand (Scalability), remain available during localized outages (High Availability), and continue operating even when components fail (Fault Tolerance).
The Restaurant Analogy
Scalability: If your restaurant gets crowded, you add more tables (Horizontal Scaling) or buy a bigger stove (Vertical Scaling).
High Availability: You open two identical restaurants in different parts of town. If one street is closed for repairs, customers go to the other.
Fault Tolerance: Your kitchen has two ovens. If one explodes, the other is already hot and the chef doesn’t even stop chopping onions. The customer never notices a delay.
Core Concepts & The Well-Architected Framework
These concepts primarily fall under the Reliability Pillar of the AWS Well-Architected Framework. The goal is to provide consistent service despite hardware failures, load spikes, or network interruptions.
1. Scalability
- Vertical Scaling (Scaling Up): Increasing the power of an existing resource (e.g., moving from a `t3.micro` to an `m5.large`). Limited by hardware ceilings.
- Horizontal Scaling (Scaling Out): Adding more resources to your pool (e.g., adding more EC2 instances). This is the preferred approach for cloud-native architectures.
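The difference can be sketched as a simple capacity model (a hypothetical illustration; the instance counts, throughput units, and ceiling are invented, not AWS figures):

```python
# Hypothetical capacity model contrasting the two scaling strategies.
# Throughput numbers and the hardware ceiling are illustrative only.
HARDWARE_CEILING = 128  # biggest single machine you can buy (illustrative)

def vertical_scale(capacity_per_node: int, factor: int) -> int:
    """Scaling up: one bigger node. Capped by the largest size available."""
    return min(capacity_per_node * factor, HARDWARE_CEILING)

def horizontal_scale(capacity_per_node: int, node_count: int) -> int:
    """Scaling out: more identical nodes. No single-machine ceiling."""
    return capacity_per_node * node_count

print(vertical_scale(16, 4))     # 64  -- a bigger box
print(vertical_scale(16, 16))    # 128 -- hits the hardware ceiling
print(horizontal_scale(16, 16))  # 256 -- the fleet keeps growing
```

The ceiling is why horizontal scaling wins for cloud-native workloads: adding nodes has no hard upper bound, while a single machine eventually runs out of bigger sizes to move to.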
2. High Availability (HA)
HA is usually achieved by deploying resources across multiple Availability Zones (AZs). If one AZ goes down due to a power outage or flood, the application continues to run in the other AZ.
3. Fault Tolerance
Fault tolerance is a higher bar than HA. It implies zero downtime and zero data loss. It often requires redundant components that are “hot” and ready to take over instantly.
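The "hot spare" idea behind fault tolerance can be sketched in a few lines: two redundant components are both live, and a request transparently falls through to the healthy one. (The names here are hypothetical; this simulates the pattern, not any specific AWS service.)

```python
# Simulates a fault-tolerant pair: both "ovens" are hot, so a request
# fails over instantly to the healthy component with no visible downtime.
class Component:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} served {request}"

def fault_tolerant_call(components, request):
    for c in components:
        try:
            return c.handle(request)
        except RuntimeError:
            continue  # fail over to the next hot component
    raise RuntimeError("total outage: no healthy components")

pair = [Component("oven-a", healthy=False), Component("oven-b")]
print(fault_tolerant_call(pair, "order-42"))  # oven-b served order-42
```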
Service Comparison: Load Balancing
| Feature | Application Load Balancer (ALB) | Network Load Balancer (NLB) | Gateway Load Balancer (GWLB) |
|---|---|---|---|
| OSI Layer | Layer 7 (HTTP/HTTPS) | Layer 4 (TCP/UDP/TLS) | Layer 3 (IP Packets) |
| Best For | Microservices & Containers | Ultra-low latency / Gaming | Third-party Virtual Appliances |
| Key Feature | Path/Host-based routing | Static IP / Elastic IP support | Transparent inspection |
| Performance | High | Ultra-High (Millions of RPS) | High |
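The ALB's Layer-7, path-based routing can be illustrated with a toy rule table (target-group names are invented; a real ALB configures this through listener rules, not application code):

```python
# Toy Layer-7 router: match the URL path prefix, forward to a target group.
# Target-group names are hypothetical, for illustration only.
RULES = [
    ("/api",    "api-target-group"),
    ("/orders", "orders-target-group"),
]
DEFAULT = "web-target-group"

def route(path):
    for prefix, target in RULES:
        if path.startswith(prefix):
            return target
    return DEFAULT  # no rule matched: fall through to the default action

print(route("/api/v1/users"))  # api-target-group
print(route("/orders/123"))    # orders-target-group
print(route("/home"))          # web-target-group
```

An NLB, operating at Layer 4, never sees the URL path at all, which is exactly why path-based routing requirements point to an ALB.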
Scenario-Based Decision Matrix
- IF the requirement is to handle a sudden spike of millions of requests per second… THEN use Network Load Balancer.
- IF you need to route traffic based on the URL path (e.g., /api vs /orders)… THEN use Application Load Balancer.
- IF you need to ensure an RDS database survives an AZ failure… THEN enable Multi-AZ Deployment.
- IF you need to scale based on CPU utilization… THEN use EC2 Auto Scaling with a Target Tracking policy.
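Target tracking works like a thermostat: it adjusts capacity so the metric converges on the target value. The core proportional calculation can be sketched as follows (the real service adds cooldowns, warm-up periods, and smoothing on top of this):

```python
import math

def desired_capacity(current_capacity, current_cpu, target_cpu):
    """Proportional scaling: the capacity needed so average CPU lands on
    the target. Rounds up so the fleet never under-provisions."""
    return max(1, math.ceil(current_capacity * current_cpu / target_cpu))

# Fleet of 4 averaging 80% CPU, targeting 50% -> scale out
print(desired_capacity(4, 80.0, 50.0))   # 7
# Fleet of 10 averaging 20% CPU, targeting 50% -> scale in
print(desired_capacity(10, 20.0, 50.0))  # 4
```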
Exam Tips: Golden Nuggets
- ELB vs. ASG: The Load Balancer distributes traffic; the Auto Scaling Group adds/removes instances. They work together but are distinct services.
- Multi-AZ vs. Read Replicas: Multi-AZ is for Disaster Recovery (High Availability). Read Replicas are for Scaling read performance.
- Statelessness: To scale horizontally effectively, store session data in DynamoDB or ElastiCache, not on the local EC2 instance.
- Health Checks: Ensure your ELB health checks point to a path that actually returns a success response; otherwise the ELB will mark healthy instances as unhealthy and stop routing traffic to them.
Architecting for Resilience
Key Services
- Route 53: Global DNS with health checks.
- ELB: Distributes incoming app traffic.
- ASG: Automates fleet size based on load.
- RDS Multi-AZ: Synchronous replication for failover.
Common Pitfalls
- Single AZ: Using only one AZ is a major exam “No-Go”.
- Scaling Lag: Forgetting that new instances take time to boot.
- Hardcoded IPs: Load Balancer IP addresses can change over time; always use the Load Balancer's DNS name instead.
Quick Patterns
- Stateless: Move data to DynamoDB.
- Self-Healing: Use ASG to replace failed instances.
- Loose Coupling: Use SQS between tiers to buffer load.