AWS Management and Monitoring: Metrics and Alarms

In the AWS ecosystem, Amazon CloudWatch acts as the central nervous system for monitoring. Metrics provide the raw data, while Alarms provide the automated response mechanism. For the SAA-C03 exam, understanding how to observe performance and react to changes is critical for building resilient, self-healing architectures.

The Real-World Analogy

Think of CloudWatch Metrics as the speedometer and fuel gauge in your car. They constantly report data points (speed, RPM, fuel level). CloudWatch Alarms are the dashboard warning lights. If your speed exceeds 80 mph, a light flashes (Alarm), and if you have cruise control enabled, the car might automatically slow down (Automated Action).

1. CloudWatch Metrics: The Data Foundation

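Metrics are time-ordered data points identified by a name, a namespace, and zero or more dimensions. They are retained for up to 15 months, but at progressively coarser granularity: 1-minute data points are kept for 15 days, 5-minute data points for 63 days, and 1-hour data points for the full 15 months. This still allows historical trend analysis.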

Key Concepts

Namespaces: Containers for metrics (e.g., AWS/EC2, AWS/S3).
Dimensions: Name/value pairs that uniquely identify a metric (e.g., InstanceId=i-12345). Filtering by dimensions is a common exam requirement.
Statistics: Aggregations over a period of time (Average, Sum, Minimum, Maximum, SampleCount, and percentiles such as p99).
Resolution:
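- Standard Resolution: Data has 1-minute granularity. (EC2 basic monitoring publishes every 5 minutes; detailed monitoring publishes every 1 minute.)
- High Resolution: Custom metrics can be published at sub-minute granularity (down to 1 second).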
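
To make the idea of a statistic over a period concrete, here is a minimal sketch in plain Python. It does not call AWS; the `aggregate` function and the sample readings are hypothetical and only mimic how raw data points are grouped into periods and then summarized.

```python
from statistics import mean


def aggregate(datapoints, period_seconds):
    """Group (timestamp_seconds, value) pairs into fixed periods and
    compute the basic statistics for each period."""
    buckets = {}
    for ts, value in datapoints:
        period_start = ts - (ts % period_seconds)
        buckets.setdefault(period_start, []).append(value)
    return {
        start: {
            "SampleCount": len(values),
            "Sum": sum(values),
            "Average": mean(values),
            "Minimum": min(values),
            "Maximum": max(values),
        }
        for start, values in sorted(buckets.items())
    }


# Hypothetical CPU readings taken once per minute (timestamp in seconds).
samples = [(0, 10.0), (60, 20.0), (120, 30.0), (300, 50.0), (360, 70.0)]

# Aggregating with a 300-second period yields two periods: 0 and 300.
five_minute = aggregate(samples, 300)
```

The same five readings produce different answers depending on the period and statistic chosen, which is why exam questions usually state both.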

Standard vs. Custom Metrics

Feature	Standard Metrics	Custom Metrics
Source	AWS Services (EC2, RDS, S3, etc.)	Your applications, scripts, or the CloudWatch Agent
Cost	Basic monitoring is free; EC2 detailed (1-minute) monitoring is charged	Charged per metric/month
Examples	CPUUtilization, DiskReadBytes	Memory Utilization, Page Load Time, Logins
Memory/Disk Space	EC2 does NOT report memory or disk space usage by default	Requires the CloudWatch Agent for memory and disk space metrics
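
As a rough illustration of what publishing a custom metric looks like, the sketch below builds the request that boto3's `put_metric_data` accepts. The namespace `Custom/App`, the metric name `MemoryUtilization`, and the helper function are assumptions made for this example. In practice the CloudWatch Agent publishes memory metrics for you, so treat this as a picture of the request shape rather than a recommended setup.

```python
from datetime import datetime, timezone


def memory_metric_request(instance_id, used_percent, high_resolution=False):
    """Build keyword arguments for CloudWatch PutMetricData.

    Namespace and metric name are hypothetical. StorageResolution is 1 for a
    high-resolution metric and 60 for standard resolution.
    """
    return {
        "Namespace": "Custom/App",
        "MetricData": [
            {
                "MetricName": "MemoryUtilization",
                "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                "Timestamp": datetime.now(timezone.utc),
                "Value": float(used_percent),
                "Unit": "Percent",
                "StorageResolution": 1 if high_resolution else 60,
            }
        ],
    }


request = memory_metric_request("i-12345", 72.5)

# With boto3 installed and credentials configured, this could be sent as:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(**request)
```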

2. CloudWatch Alarms: The Reactive Layer

Alarms watch a single metric (or the result of a math expression) over a specified time period and perform one or more actions based on the value of the metric relative to a threshold.

Alarm States

OK: The metric is within the defined threshold.
ALARM: The metric is outside the threshold.
INSUFFICIENT_DATA: The alarm has just started, the metric is not available, or not enough data is available for the metric to determine the state.
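
The three states can be pictured with a deliberately simplified model. The sketch below is not CloudWatch's actual evaluation logic (the real service also applies the missing-data setting and a wider evaluation range); it only shows the "M out of N data points" idea for a greater-than threshold. The function name and inputs are hypothetical.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Simplified 'M out of N' evaluation for a GreaterThanThreshold alarm.

    datapoints: one value per period, oldest first; None marks a period
    with no data.
    """
    window = datapoints[-evaluation_periods:]
    present = [v for v in window if v is not None]
    if not present:
        # Nothing to evaluate in the window.
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for v in present if v > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Three consecutive breaching periods out of three: ALARM.
state_a = evaluate_alarm([50, 95, 96, 97], 90, 3, 3)
# One period dipped below the threshold: OK.
state_b = evaluate_alarm([95, 50, 96], 90, 3, 3)
# No data at all in the window: INSUFFICIENT_DATA.
state_c = evaluate_alarm([None, None], 90, 2, 1)
```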

Alarm Actions

Amazon SNS: Send notifications (email, SMS, or HTTPS endpoints such as PagerDuty).
Auto Scaling: Trigger a Scaling Policy (Scale out/in).
EC2 Actions: Stop, Terminate, Reboot, or Recover (Recover moves the instance to healthy hardware when the underlying host fails a system status check).
Systems Manager (SSM): Create OpsItems in OpsCenter or incidents in Incident Manager.
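
Alarm actions are expressed as ARNs. As a small sketch, the helper below builds the action list for an alarm that both runs an EC2 action and notifies an SNS topic. The function names, the region, and the topic ARN are hypothetical; the `arn:aws:automate:<region>:ec2:<action>` form is the one used for EC2 alarm actions.

```python
EC2_ALARM_ACTIONS = ("stop", "terminate", "reboot", "recover")


def ec2_action_arn(region, action):
    """Return the alarm-action ARN for an EC2 action in the given region."""
    if action not in EC2_ALARM_ACTIONS:
        raise ValueError(f"unsupported EC2 alarm action: {action}")
    return f"arn:aws:automate:{region}:ec2:{action}"


def alarm_actions(region, ec2_action, sns_topic_arn):
    """Combine an EC2 action with an SNS notification for one alarm."""
    return [ec2_action_arn(region, ec2_action), sns_topic_arn]


# Hypothetical topic ARN, for illustration only.
actions = alarm_actions(
    "us-east-1", "recover", "arn:aws:sns:us-east-1:123456789012:devops-alerts"
)
```

The resulting list would be passed as `AlarmActions` when creating the alarm. The Recover action is normally paired with the `StatusCheckFailed_System` metric.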

Exam Tips and Gotchas

The Memory Trap: Standard EC2 metrics include CPU, Network, and Disk I/O. They do not include Memory (RAM) utilization or Disk Space. To monitor these, you must install the CloudWatch Unified Agent.
High-Resolution Alarms: If you need to react within seconds (e.g., sub-minute scaling), you must use High-Resolution metrics and set the alarm period accordingly (10s or 30s).
Composite Alarms: These allow you to combine multiple alarms using logical operators (AND, OR, NOT). This helps reduce “alarm fatigue” by only triggering when multiple conditions are met.
Missing Data: You can configure how an alarm treats missing data points: missing (the default), ignore, breaching, or notBreaching. This is vital for sparse metrics.
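
A composite alarm is defined by a rule expression over other alarms. The sketch below assembles such a rule and the request that boto3's `put_composite_alarm` accepts. The alarm names, the topic ARN, and the helper function are made up for illustration.

```python
def composite_rule(alarm_names, operator="AND"):
    """Join child alarms into a composite alarm rule expression."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


# Only notify when CPU and latency alarms are both in ALARM state.
rule = composite_rule(["high-cpu", "high-latency"])

composite_request = {
    "AlarmName": "app-degraded",
    "AlarmRule": rule,
    # Hypothetical topic ARN.
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:devops-alerts"],
}

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("cloudwatch").put_composite_alarm(**composite_request)
```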

3. Decision Matrix / If–Then Guide

If the requirement is…	Choose this solution…
Monitor EC2 RAM usage	CloudWatch Agent + Custom Metric
Detect unusual traffic patterns (machine learning)	CloudWatch Anomaly Detection
Scale based on SQS queue depth	Custom Metric (Backlog per Instance)
Automate host recovery for EC2	CloudWatch Alarm + EC2 Status Check + Recover Action
Consolidate multiple alarms into one	Composite Alarms
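
The "Backlog per Instance" row is easier to remember with the arithmetic written out. This sketch uses made-up numbers and hypothetical function names. In a real setup the queue depth would come from the SQS `ApproximateNumberOfMessages` attribute, and the result would be published as a custom metric for a target tracking policy.

```python
import math


def backlog_per_instance(visible_messages, running_instances):
    """Messages waiting in the queue divided by the instances processing them."""
    return visible_messages / max(running_instances, 1)


def acceptable_backlog(max_latency_seconds, avg_processing_seconds):
    """How many messages one instance can hold while staying within the
    latency target."""
    return max_latency_seconds / avg_processing_seconds


def instances_needed(visible_messages, target_backlog):
    """Fleet size that brings backlog per instance down to the target."""
    return max(1, math.ceil(visible_messages / target_backlog))


# Illustrative numbers: 1500 queued messages, 10 instances,
# 0.5 s per message, 50 s acceptable latency.
current = backlog_per_instance(1500, 10)
target = acceptable_backlog(50, 0.5)
desired = instances_needed(1500, target)
```

Here the current backlog per instance is above the target, so the group would need to scale out.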

Summary of key subtopics covered in this guide:

CloudWatch Metric structure (Namespaces, Dimensions, Statistics).
Standard vs. High-Resolution metrics.
The role of the CloudWatch Unified Agent for OS-level metrics.
Alarm states and automated response actions (SNS, ASG, EC2).
Architecture patterns for high availability and self-healing.
Composite Alarms and Anomaly Detection.

Ecosystem

Integrations

IAM: Control who can view metrics or modify alarms.
EventBridge: Trigger workflows based on alarm state changes.
CloudTrail: Audit who created or deleted an alarm.
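
For the EventBridge integration, a rule matches alarm state changes with an event pattern. The sketch below builds one such pattern as a Python dictionary and serializes it to JSON. The helper name and the alarm name `high-cpu` are hypothetical.

```python
import json


def alarm_state_change_pattern(alarm_names=None, state="ALARM"):
    """Event pattern matching CloudWatch alarm transitions into a given state."""
    detail = {"state": {"value": [state]}}
    if alarm_names:
        # Restrict the rule to specific alarms.
        detail["alarmName"] = list(alarm_names)
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": detail,
    }


pattern_json = json.dumps(alarm_state_change_pattern(["high-cpu"]))
```

The JSON string would be supplied as the event pattern when creating the rule, with a target such as a Lambda function or Step Functions workflow.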

Optimization

Cost Control

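Avoid calling the PutMetricData API once per data point. Batch multiple data points into each request, or publish pre-aggregated values, to reduce API charges. For high-volume export, use Metric Streams, which deliver metrics in near real time through Amazon Data Firehose (formerly Kinesis Data Firehose) to destinations such as S3 for lower-latency, cost-effective analysis.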
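
One way to cut down on PutMetricData calls is to group data points before sending them. The sketch below shows the batching step only. The helper name, the namespace, and the batch size of 20 are assumptions for illustration; check the current service quotas for the real per-request limit.

```python
def build_batches(namespace, datapoints, batch_size=20):
    """Split datapoints into PutMetricData-style requests of at most
    batch_size entries each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        {"Namespace": namespace, "MetricData": datapoints[i:i + batch_size]}
        for i in range(0, len(datapoints), batch_size)
    ]


# 45 hypothetical data points become 3 requests instead of 45.
points = [
    {"MetricName": "PageLoadTime", "Value": float(n), "Unit": "Milliseconds"}
    for n in range(45)
]
requests = build_batches("Custom/App", points)

# With boto3 installed and credentials configured:
#   import boto3
#   client = boto3.client("cloudwatch")
#   for req in requests:
#       client.put_metric_data(**req)
```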

Case Study

Production Use Case

Scenario: A legacy app on EC2 hangs when CPU stays at or above 90% for 5 minutes.

Solution: A CloudWatch Alarm on CPUUtilization with a 90% threshold evaluated over 5 minutes. Actions: notify the DevOps team through SNS and run the EC2 Reboot action to clear hung processes.
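
A possible shape for this alarm, expressed as the keyword arguments boto3's `put_metric_alarm` accepts, is sketched below. The alarm name, region, and topic ARN are hypothetical. Using a single 300-second period is one reading of "for 5 minutes"; five 60-second periods would also work if detailed monitoring is enabled.

```python
def cpu_reboot_alarm_request(instance_id, region, sns_topic_arn):
    """Build PutMetricAlarm arguments: average CPU >= 90% over one
    5-minute period notifies SNS and reboots the instance."""
    return {
        "AlarmName": f"high-cpu-reboot-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 90.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": [
            sns_topic_arn,
            f"arn:aws:automate:{region}:ec2:reboot",
        ],
    }


# Hypothetical instance ID and topic ARN.
alarm_request = cpu_reboot_alarm_request(
    "i-12345", "us-east-1", "arn:aws:sns:us-east-1:123456789012:devops-alerts"
)

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_request)
```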