AWS Health Checks: The Vital Signs of Your Architecture

In AWS, high availability isn’t just about having multiple resources; it’s about ensuring those resources are actually functioning. Health checks are the mechanism AWS uses to monitor the status of your resources and automatically steer traffic away from failure.

The Restaurant Analogy

Imagine a busy restaurant with 10 tables. The host (Load Balancer) doesn’t just seat people at any table. They first check if the table is clean and the chairs are ready (Health Check). If a waiter spills a drink (Resource Failure), the host marks that table as “unavailable” until it’s cleaned, ensuring guests always have a good experience.

Core Concepts

1. ELB Health Checks (Target Groups)

Elastic Load Balancers (ALB, NLB, GWLB) use health checks to determine whether an instance or IP address in a Target Group is ready to receive traffic. If a target fails the configured number of consecutive checks, the load balancer stops sending it requests until it again passes the required number of consecutive checks.

Healthy Threshold: Number of consecutive successful checks required to mark a target “Healthy”.
Unhealthy Threshold: Number of consecutive failed checks required to mark a target “Unhealthy”.
Timeout: How long the load balancer waits for a response before counting the check as failed.
Interval: Time between consecutive health check attempts.
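
To make the interaction between the two thresholds concrete, here is a minimal Python sketch of the state machine they describe. It is a simplified model, not AWS's implementation: the `TargetHealth` class, its default values, and the assumption that a new target starts out of rotation are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class TargetHealth:
    """Simplified model of one target's health state (illustrative only)."""
    healthy_threshold: int = 3    # consecutive passes needed to become healthy
    unhealthy_threshold: int = 2  # consecutive failures needed to become unhealthy
    healthy: bool = False         # assumption: a new target starts out of rotation
    _streak: int = 0              # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Feed in one health check result and return the resulting state."""
        if check_passed == self.healthy:
            # The result agrees with the current state, so any opposing streak is broken.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.healthy_threshold if check_passed else self.unhealthy_threshold
        if self._streak >= needed:
            self.healthy = check_passed
            self._streak = 0
        return self.healthy


def seconds_to_mark_unhealthy(interval_s: int, unhealthy_threshold: int) -> int:
    """Rough detection time: one interval per required failure (ignores the timeout)."""
    return interval_s * unhealthy_threshold
```

With a 30-second interval and an unhealthy threshold of 2, a crashed target keeps receiving traffic for roughly a minute, which is why lower intervals and thresholds give faster failover at the cost of more false alarms.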

2. Route 53 Health Checks

Route 53 health checks monitor endpoints (IPs, domain names) from outside the VPC using a global network of health checkers. This is critical for DNS Failover.

Endpoint Monitoring: Checks a specific URL or IP.
Calculated Health Checks: Combines multiple health checks using OR, AND, or NOT logic.
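CloudWatch Alarm Monitoring: Tracks the state of a CloudWatch alarm (for example, on a disk space metric) and turns unhealthy when the alarm fires, which triggers DNS failover. This is how you health-check resources that have no public endpoint.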
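
The AND/OR/NOT logic of a calculated health check can be sketched in a few lines. Route 53 expresses it as a threshold on the number of healthy child checks plus an optional inversion; the function below is a plain-Python model of that rule, and the name `calculated_status` is an assumption made for illustration.

```python
def calculated_status(children: list[bool], health_threshold: int,
                      inverted: bool = False) -> bool:
    """Model of a calculated health check.

    children         -- current status of each child health check
    health_threshold -- how many children must be healthy
    inverted         -- flip the final result (the NOT case)
    """
    healthy_children = sum(children)
    status = healthy_children >= health_threshold
    return not status if inverted else status


# AND: every child must be healthy (threshold equals the number of children).
# OR:  at least one child must be healthy (threshold of 1).
children = [True, False, True]
and_result = calculated_status(children, health_threshold=len(children))
or_result = calculated_status(children, health_threshold=1)
```

A threshold between 1 and the number of children gives an "at least N of M" rule, which is often more useful than strict AND or OR.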

Comparison: ELB vs. Route 53 Health Checks

Feature	ELB Health Checks	Route 53 Health Checks
Primary Purpose	Routing traffic to instances/containers.	DNS Failover and Global routing.
Location	Internal to the AWS Region.	External (Global checkers).
Target	EC2 instances, IP addresses in a VPC, Lambda functions (ALB only).	Public IPs, Domain Names, CloudWatch Alarms.
Visibility	Private or Public.	Public Endpoints only (unless using a CloudWatch Alarm).

Decision Matrix / If–Then Guide

Requirement	Recommended Strategy
Route traffic away from a crashed EC2 instance within a region.	Use ALB/NLB Health Checks.
Failover from a Primary Region to a Secondary Region (DR).	Use Route 53 Failover Policy with Health Checks.
Terminate and replace a “stuck” EC2 instance.	Set the ASG health check type to ELB so that instances failing ELB health checks are replaced.
Monitor a legacy on-premises server for DNS failover.	Use Route 53 Health Checks on the public IP.

Exam Tips and Gotchas

Security Groups: A common exam scenario involves health checks failing because the EC2 instance’s Security Group doesn’t allow inbound traffic on the health check port from the load balancer (for an ALB, from the load balancer’s own Security Group).
ASG vs. ELB: By default, Auto Scaling Groups only monitor EC2 “Status Checks” (hardware/system). You must explicitly tell ASG to use ELB health checks if you want it to replace instances that are “running” but failing web requests.
Grace Periods: Always set a HealthCheckGracePeriod in ASG to prevent the system from killing instances while they are still booting up.
HTTP Codes: For ALB, you can define which codes (e.g., 200, 301) are considered “Healthy.”
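
The ASG and HTTP code tips map onto a handful of API parameters. The sketch below builds the keyword arguments for boto3's `modify_target_group` and `update_auto_scaling_group` calls; the helper names, the `/health` path, the `web-asg` group name, and the numeric values are assumptions chosen for illustration, and the calls themselves are left commented out because they need real resources and credentials.

```python
def target_group_health_check(path: str = "/health",
                              success_codes: str = "200,301") -> dict:
    """Health check settings for an ALB target group (illustrative values)."""
    return {
        "HealthCheckProtocol": "HTTP",
        "HealthCheckPath": path,
        "HealthCheckIntervalSeconds": 30,
        "HealthCheckTimeoutSeconds": 5,
        "HealthyThresholdCount": 3,
        "UnhealthyThresholdCount": 2,
        # Which HTTP status codes count as a passing check.
        "Matcher": {"HttpCode": success_codes},
    }


def asg_elb_health_check(asg_name: str, grace_period_s: int = 300) -> dict:
    """Make the ASG honor ELB health checks, with a boot-time grace period."""
    return {
        "AutoScalingGroupName": asg_name,
        "HealthCheckType": "ELB",  # the default type only uses EC2 status checks
        "HealthCheckGracePeriod": grace_period_s,
    }


# Applying the settings (requires boto3, credentials, and existing resources):
# import boto3
# boto3.client("elbv2").modify_target_group(
#     TargetGroupArn=target_group_arn, **target_group_health_check())
# boto3.client("autoscaling").update_auto_scaling_group(
#     **asg_elb_health_check("web-asg"))
```

The grace period should be at least as long as the instance takes to boot and start serving the health check path; otherwise the ASG may replace instances that were about to become healthy.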

Topics covered:

Difference between ELB and Route 53 Health Checks
Health Check configuration parameters (Thresholds, Intervals)
Auto Scaling Group (ASG) integration and Grace Periods
Route 53 DNS Failover and Calculated Health Checks
Troubleshooting Security Group interference

Figure: AWS Health Check Flow

Service Ecosystem

Integrations

CloudWatch: Trigger alarms based on health check failure rates.
ASG: Automatically terminate instances that fail ELB health checks.
IAM: Control who can modify health check settings.

Performance

Optimization

Use Shallow Health Checks (checking if the web server is up) for high-frequency pings, and Deep Health Checks (checking DB connectivity) sparingly to avoid overloading backend resources.
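
One way to keep deep checks from overloading a backend is to cache their result, so that frequent pings from many health checkers trigger at most one real probe per time window. The sketch below assumes a hypothetical `probe` callable (for example, a function that runs a trivial database query); the class name, TTL, and status codes are illustrative choices rather than anything AWS prescribes.

```python
import time
from typing import Callable, Optional


def shallow_check() -> int:
    """Shallow check: if this code runs at all, the web process is up."""
    return 200


class DeepCheck:
    """Deep check that probes a dependency at most once per `ttl_s` seconds."""

    def __init__(self, probe: Callable[[], bool], ttl_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._probe = probe          # hypothetical dependency probe, e.g. a DB ping
        self._ttl_s = ttl_s
        self._clock = clock          # injectable so the cache can be tested
        self._last_run: Optional[float] = None
        self._last_ok = False

    def status(self) -> int:
        """Return an HTTP status code, re-probing only when the cache is stale."""
        now = self._clock()
        if self._last_run is None or now - self._last_run >= self._ttl_s:
            try:
                self._last_ok = self._probe()
            except Exception:
                # A probe that raises counts as a failed dependency.
                self._last_ok = False
            self._last_run = now
        return 200 if self._last_ok else 503
```

A web framework would expose `shallow_check` on the path the load balancer polls and `DeepCheck.status` on a separate path used for less frequent monitoring.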

Cost Optimization

Route 53 Pricing

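Route 53 health checks are billed per health check per month, and checks against non-AWS endpoints cost more than checks against AWS endpoints. Costs add up once you monitor 100+ endpoints, so check only the endpoints that drive routing decisions and use Calculated Health Checks to roll related checks up into a single status.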

Production Use Case: Multi-Region Failover

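A global e-commerce site uses Route 53 with latency-based routing and attaches a Route 53 health check to each regional ALB. If us-east-1 suffers an outage, Route 53 detects the failed health check and stops returning that region’s record, so users who would have been routed to us-east-1 are answered with the next-best healthy region, such as us-west-2. Failover is not instantaneous: detecting the outage takes roughly the health check interval multiplied by the failure threshold, and cached DNS answers keep pointing at the failed region until the record’s TTL expires.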
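
The failover delay in a setup like this can be estimated with simple arithmetic. The function below is a rough sketch under stated assumptions: it adds the detection time (request interval times failure threshold) to the record TTL and ignores checker consensus, propagation delay, and clients that cache DNS longer than the TTL. The function name and the example TTL are illustrative.

```python
def estimated_failover_seconds(request_interval_s: int,
                               failure_threshold: int,
                               record_ttl_s: int) -> int:
    """Rough time until clients stop being sent to a failed region."""
    detection = request_interval_s * failure_threshold  # time to mark unhealthy
    return detection + record_ttl_s                     # plus cached answers expiring


# Standard 30-second checks vs. fast 10-second checks, both with a
# failure threshold of 3 and an assumed 60-second record TTL.
standard = estimated_failover_seconds(30, 3, 60)
fast = estimated_failover_seconds(10, 3, 60)
```

Under these assumptions the standard interval gives about two and a half minutes and the fast interval about a minute and a half, so short TTLs and fast checks matter when the recovery time objective is tight.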