AWS Management and Monitoring: Metrics and Alarms
In the AWS ecosystem, Amazon CloudWatch acts as the central nervous system for monitoring. Metrics provide the raw data, while Alarms provide the automated response mechanism. For the SAA-C03 exam, understanding how to observe performance and react to changes is critical for building resilient, self-healing architectures.
The Real-World Analogy
Think of CloudWatch Metrics as the speedometer and fuel gauge in your car. They constantly report data points (speed, RPM, fuel level). CloudWatch Alarms are the dashboard warning lights. If your speed exceeds 80 mph, a light flashes (Alarm), and if you have cruise control enabled, the car might automatically slow down (Automated Action).
1. CloudWatch Metrics: The Data Foundation
Metrics are time-ordered data points identified by a name, a namespace, and zero or more dimensions. They are stored for up to 15 months, with older data rolled up to coarser granularity, allowing for historical trend analysis.
Key Concepts
- Namespaces: Containers for metrics (e.g., AWS/EC2, AWS/S3).
- Dimensions: Name/value pairs that uniquely identify a metric (e.g., InstanceId=i-12345). Filtering by dimensions is a common exam requirement.
- Statistics: Aggregations over a period of time (Average, Sum, Minimum, Maximum, SampleCount, and Percentiles like P99).
- Resolution:
  - Standard Resolution: Data is provided in 1-minute intervals.
  - High Resolution: Data is provided in sub-minute intervals (down to 1 second).
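To make the idea of statistics over a period concrete, the following is a minimal sketch in plain Python that groups raw samples into fixed periods and computes the statistics listed above. The function name `aggregate` and the sample data are illustrative assumptions, and the nearest-rank percentile used here is not necessarily the method CloudWatch applies internally.

```python
import math
from collections import defaultdict


def aggregate(datapoints, period_seconds=60):
    """Group (epoch_seconds, value) samples into periods and compute statistics."""
    buckets = defaultdict(list)
    for ts, value in datapoints:
        # Align each sample to the start of the period it falls in.
        buckets[ts - ts % period_seconds].append(value)

    stats = {}
    for start, values in sorted(buckets.items()):
        ordered = sorted(values)
        # Nearest-rank p99 (an assumption; CloudWatch's exact method may differ).
        rank = max(1, math.ceil(0.99 * len(ordered)))
        stats[start] = {
            "SampleCount": len(ordered),
            "Sum": sum(ordered),
            "Average": sum(ordered) / len(ordered),
            "Minimum": ordered[0],
            "Maximum": ordered[-1],
            "p99": ordered[rank - 1],
        }
    return stats


# Hypothetical samples: three in the first minute, two in the second.
samples = [(0, 10.0), (20, 30.0), (40, 20.0), (60, 50.0), (90, 70.0)]
result = aggregate(samples, period_seconds=60)
```

With a 60-second period the five samples collapse into two data points, which is the same trade-off an alarm or dashboard makes when it requests a statistic over a period rather than raw values.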
Standard vs. Custom Metrics
| Feature | Standard Metrics | Custom Metrics |
|---|---|---|
| Source | AWS Services (EC2, RDS, S3, etc.) | Your applications, scripts, or the CloudWatch Agent |
| Cost | Mostly free (included with service) | Charged per metric/month |
| Examples | CPUUtilization, DiskReadBytes | Memory Utilization, Page Load Time, Logins |
| Memory/Disk | EC2 does NOT report Memory by default | Requires CloudWatch Agent for Memory/Disk metrics |
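As a rough illustration of what publishing a custom metric involves, the sketch below assembles the request parameters for the PutMetricData API in plain Python. The namespace `Custom/App`, the metric name `MemoryUtilization`, and the helper name `build_memory_metric` are hypothetical; the actual call is shown only as a comment because it assumes boto3 is installed and credentials are configured.

```python
from datetime import datetime, timezone


def build_memory_metric(instance_id, used_percent, high_resolution=False):
    """Return keyword arguments shaped for the CloudWatch PutMetricData API."""
    if not 0 <= used_percent <= 100:
        raise ValueError("used_percent must be between 0 and 100")
    return {
        "Namespace": "Custom/App",  # hypothetical; custom namespaces must not start with "AWS/"
        "MetricData": [
            {
                "MetricName": "MemoryUtilization",  # hypothetical metric name
                "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                "Timestamp": datetime.now(timezone.utc),
                "Value": float(used_percent),
                "Unit": "Percent",
                # 1 = high resolution (1-second), 60 = standard resolution.
                "StorageResolution": 1 if high_resolution else 60,
            }
        ],
    }


payload = build_memory_metric("i-12345", 72.5)
# Assuming boto3 and valid credentials, the payload could be sent with:
#   boto3.client("cloudwatch").put_metric_data(**payload)
```

In practice the CloudWatch Agent collects and publishes memory and disk metrics for you; a hand-built payload like this is more typical for application-level values such as logins or page load time.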
2. CloudWatch Alarms: The Reactive Layer
Alarms watch a single metric (or the result of a math expression) over a specified time period and perform one or more actions based on the value of the metric relative to a threshold.
Alarm States
- OK: The metric is within the defined threshold.
- ALARM: The metric is outside the threshold.
- INSUFFICIENT_DATA: The alarm has just started, the metric is not available, or not enough data is available for the metric to determine the state.
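The three states can be pictured with a deliberately simplified model of alarm evaluation. The sketch below is not CloudWatch's actual algorithm: it assumes a greater-than comparison, looks only at the most recent periods, and skips periods with no data, whereas the real service applies a configurable missing-data policy. The function and parameter names are illustrative.

```python
def evaluate_alarm(values, threshold, evaluation_periods=3, datapoints_to_alarm=2):
    """Return "OK", "ALARM", or "INSUFFICIENT_DATA" for the most recent periods.

    `values` holds one aggregated value per period, oldest first;
    None marks a period in which no data arrived.
    """
    window = values[-evaluation_periods:]
    present = [v for v in window if v is not None]
    if not present:
        # Nothing to evaluate: the alarm cannot decide either way.
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for v in present if v > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


state_high = evaluate_alarm([50, 95, 96], threshold=90)
state_spike = evaluate_alarm([95, 50, 60], threshold=90)
state_empty = evaluate_alarm([None, None, None], threshold=90)
```

Requiring two breaching data points out of three keeps a single spike from changing the state, which is the usual reason to evaluate more than one period.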
Alarm Actions
- Amazon SNS: Send notifications (Email, SMS, PagerDuty).
- Auto Scaling: Trigger a Scaling Policy (Scale out/in).
- EC2 Actions: Stop, Terminate, Reboot, or Recover (recovers an instance to new hardware if the underlying host fails).
- Systems Manager (SSM): Create OpsItems in OpsCenter or incidents in Incident Manager.
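Alarm actions are referenced by ARN. As a sketch of how an SNS notification and an EC2 Recover action might be attached to one alarm, the code below builds parameters shaped for the PutMetricAlarm API. The helper names, the alarm name, the two-period evaluation setting, and the SNS topic ARN are assumptions for illustration only.

```python
EC2_ACTIONS = {"stop", "terminate", "reboot", "recover"}


def ec2_action_arn(region, action):
    """Build the ARN of a built-in EC2 alarm action for the given region."""
    if action not in EC2_ACTIONS:
        raise ValueError(f"unsupported EC2 action: {action}")
    return f"arn:aws:automate:{region}:ec2:{action}"


def recovery_alarm(instance_id, region, sns_topic_arn):
    """Return PutMetricAlarm-style parameters for automatic host recovery."""
    return {
        "AlarmName": f"recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 2,  # assumed value; tune to your tolerance
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [ec2_action_arn(region, "recover"), sns_topic_arn],
    }


# The topic ARN below is a placeholder, not a real resource.
alarm = recovery_alarm(
    "i-12345", "us-east-1", "arn:aws:sns:us-east-1:123456789012:ops-alerts"
)
```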
Exam Tips and Gotchas
- The Memory Trap: Standard EC2 metrics include CPU, Network, and Disk I/O. They do not include Memory (RAM) utilization or Disk Space. To monitor these, you must install the CloudWatch Unified Agent.
- High-Resolution Alarms: If you need to react within seconds (e.g., sub-minute scaling), you must use High-Resolution metrics and set the alarm period accordingly (10s or 30s).
- Composite Alarms: These allow you to combine multiple alarms using logical operators (AND/OR). This helps reduce “alarm fatigue” by only triggering when multiple conditions are met.
- Missing Data: You can configure how alarms treat missing data points: as notBreaching, breaching, ignore (keep the current state), or missing (the default). This is vital for sparse metrics.
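A composite alarm is defined by a rule expression over the states of other alarms. The sketch below builds such an expression and wraps it in parameters shaped for the PutCompositeAlarm API. The child alarm names, the composite alarm name, and the SNS topic ARN are hypothetical, and only the simple case of a single AND or OR across ALARM states is covered.

```python
def alarm_rule(operator, alarm_names):
    """Join child alarms into a composite-alarm rule such as ALARM("a") AND ALARM("b")."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    terms = [f'ALARM("{name}")' for name in alarm_names]
    return f" {operator} ".join(terms)


rule = alarm_rule("AND", ["cpu-high", "latency-high"])  # hypothetical child alarms

composite = {
    "AlarmName": "app-degraded",  # hypothetical
    "AlarmRule": rule,
    # Placeholder topic ARN; notify only when both child alarms are in ALARM.
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
}
```

Attaching the notification to the composite alarm rather than to each child alarm is what cuts down on noise: the children can change state freely without paging anyone.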
3. Decision Matrix / If–Then Guide
| If the requirement is… | Choose this solution… |
|---|---|
| Monitor EC2 RAM usage | CloudWatch Agent + Custom Metric |
| Detect unusual traffic patterns (machine learning) | CloudWatch Anomaly Detection |
| Scale based on SQS queue depth | Custom Metric (Backlog per Instance) |
| Automate host recovery for EC2 | CloudWatch Alarm on the EC2 system status check (StatusCheckFailed_System) + Recover Action |
| Consolidate multiple alarms into one | Composite Alarms |
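The "Backlog per Instance" row is easier to remember with the arithmetic written out. The sketch below assumes the usual formulation: backlog per instance is the number of visible queue messages divided by the number of running instances, and the acceptable backlog is the tolerated latency divided by the average time to process one message. All function names and numbers are illustrative.

```python
import math


def backlog_per_instance(visible_messages, running_instances):
    """Current queue backlog divided across the running instances."""
    if running_instances <= 0:
        raise ValueError("running_instances must be positive")
    return visible_messages / running_instances


def acceptable_backlog(max_latency_seconds, avg_processing_seconds):
    """Messages one instance can hold in backlog while still meeting the latency goal."""
    return max_latency_seconds / avg_processing_seconds


def desired_instances(visible_messages, max_latency_seconds, avg_processing_seconds):
    """Instances needed to bring backlog per instance down to the acceptable value."""
    target = acceptable_backlog(max_latency_seconds, avg_processing_seconds)
    return max(1, math.ceil(visible_messages / target))


# Hypothetical numbers: 600 queued messages, 4 instances,
# 0.5 s per message, and a 10 s latency goal.
current = backlog_per_instance(600, 4)
target = acceptable_backlog(10, 0.5)
needed = desired_instances(600, 10, 0.5)
```

The current backlog per instance is the value you would publish as the custom metric, and the acceptable backlog is the value a target tracking policy would aim for.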
Topics covered:
- CloudWatch Metric structure (Namespaces, Dimensions, Statistics).
- Standard vs. High-Resolution metrics.
- The role of the CloudWatch Unified Agent for OS-level metrics.
- Alarm states and automated response actions (SNS, ASG, EC2).
- Architecture patterns for high availability and self-healing.
- Composite Alarms and Anomaly Detection.
CloudWatch Monitoring Architecture
Integrations
- IAM: Control who can view metrics or modify alarms.
- EventBridge: Trigger workflows based on alarm state changes.
- CloudTrail: Audit who created or deleted an alarm.
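For the EventBridge integration, a rule needs an event pattern that selects alarm state changes. The sketch below builds a pattern of that kind and checks it against a sample event with a small matcher. The matcher implements only exact-value matching, a small subset of EventBridge's pattern semantics, and the alarm name and sample event are hypothetical.

```python
import json


def alarm_state_pattern(alarm_names, states=("ALARM",)):
    """Build an event pattern for CloudWatch alarm state-change events."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": list(alarm_names),
            "state": {"value": list(states)},
        },
    }


def matches(pattern, event):
    """Exact-value matching only: every pattern key must match the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


pattern = alarm_state_pattern(["cpu-high"])  # hypothetical alarm name
pattern_json = json.dumps(pattern)  # rules take the pattern as a JSON string

sample_event = {  # trimmed, hypothetical event
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "cpu-high", "state": {"value": "ALARM"}},
}
is_match = matches(pattern, sample_event)
```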
Cost Control
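Avoid calling PutMetricData for every individual data point, since API requests are billed; batch multiple data points into each call instead. For high-volume export, use Metric Streams, which deliver metrics in near real time through Kinesis Data Firehose to destinations such as S3 for cost-effective analysis.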
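One way to cut down on PutMetricData calls is to buffer data points and send them in batches. The sketch below shows only the batching step in plain Python. The default `batch_size` of 100 is an arbitrary assumption, so check the current per-request limits for PutMetricData before choosing a value.

```python
def batch_metric_data(metric_data, batch_size=100):
    """Split a list of metric data entries into batches, one per API request."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        metric_data[i:i + batch_size]
        for i in range(0, len(metric_data), batch_size)
    ]


# 250 hypothetical data points become 3 requests instead of 250.
points = [{"MetricName": "Logins", "Value": float(n)} for n in range(250)]
batches = batch_metric_data(points, batch_size=100)
```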
Production Use Case
Scenario: A legacy app on EC2 crashes when CPU utilization stays at or above 90% for 5 minutes.
Solution: CloudWatch Alarm monitoring CPUUtilization. Action: Trigger SNS to DevOps and an EC2 Reboot action to clear hung processes.
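The scenario above might translate into alarm parameters along these lines. This is a sketch shaped for the PutMetricAlarm API, not a tested production configuration: the alarm name, the SNS topic ARN, and the choice of a single 300-second period (rather than five 60-second periods, which would need detailed monitoring) are assumptions.

```python
def cpu_reboot_alarm(instance_id, region, sns_topic_arn):
    """Return PutMetricAlarm-style parameters: notify DevOps and reboot on sustained high CPU."""
    return {
        "AlarmName": f"cpu-high-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,           # one 5-minute period
        "EvaluationPeriods": 1,
        "Threshold": 90.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [
            sns_topic_arn,                          # notify the DevOps team
            f"arn:aws:automate:{region}:ec2:reboot",  # clear hung processes
        ],
    }


# Placeholder instance ID and topic ARN.
params = cpu_reboot_alarm(
    "i-12345", "us-east-1", "arn:aws:sns:us-east-1:123456789012:devops"
)
# Assuming boto3 and valid credentials:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```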