Amazon CloudWatch: The Pulse of AWS
Amazon CloudWatch is a monitoring and observability service built for DevOps engineers, developers, site reliability engineers (SREs), and IT managers. It provides data and actionable insights to monitor applications, respond to system-wide performance changes, and optimize resource utilization.
The Analogy: The Modern Hospital Patient Monitor
Imagine an Intensive Care Unit (ICU). CloudWatch Metrics are the continuous streams of data like heart rate and blood pressure. CloudWatch Logs are the detailed nurse’s notes recording every specific event or medication administered. CloudWatch Alarms are the sirens that trigger immediately if the heart rate drops below a safe threshold, alerting the crash team (Auto Scaling or SNS) to take action.
Core Concepts & Well-Architected Lens
Operational Excellence
CloudWatch enables teams to understand the health of their systems. By using CloudWatch Logs Insights, you can query log data to identify the root cause of deployment failures or application errors in seconds.
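As a rough sketch of what such a query might look like, the helper below builds the arguments for the CloudWatch Logs `StartQuery` API. The function name `build_error_query`, the log group name, and the choice to filter on the literal text `ERROR` are all assumptions made for illustration; adapt the query string to your own log format.

```python
import time
from typing import Optional


def build_error_query(log_group: str, minutes: int = 60, limit: int = 20,
                      now: Optional[float] = None) -> dict:
    """Build keyword arguments for the CloudWatch Logs StartQuery API.

    The query returns the most recent log events whose message contains
    the text ERROR within the last `minutes` minutes.
    """
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - minutes * 60,  # epoch seconds
        "endTime": end,
        "queryString": (
            "fields @timestamp, @logStream, @message "
            "| filter @message like /ERROR/ "
            "| sort @timestamp desc "
            f"| limit {limit}"
        ),
    }


# To run it (requires boto3 and AWS credentials):
#   import boto3
#   logs = boto3.client("logs")
#   query_id = logs.start_query(**build_error_query("/hypothetical/app"))["queryId"]
#   results = logs.get_query_results(queryId=query_id)
```

Queries run asynchronously, so in practice you would poll `get_query_results` until its status reports completion.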
Reliability
Through Alarms, CloudWatch lets a system self-heal. For example, if an EC2 instance fails a system status check, a CloudWatch Alarm can trigger an EC2 action to recover the instance automatically.
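One way to express the recovery alarm described above is as a set of `PutMetricAlarm` parameters. This is a minimal sketch: the helper name `build_recover_alarm`, the alarm name, and the two-period evaluation window are illustrative assumptions, not recommended values.

```python
def build_recover_alarm(instance_id: str, region: str) -> dict:
    """Build PutMetricAlarm arguments that recover an EC2 instance
    when its system status check fails for two consecutive minutes."""
    return {
        "AlarmName": f"recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,             # seconds per evaluation period
        "EvaluationPeriods": 2,   # assumption: two failing minutes in a row
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in EC2 action ARN that triggers instance recovery.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


# To create the alarm (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(
#       **build_recover_alarm("i-0123456789abcdef0", "us-east-1"))
```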
Performance Efficiency
Monitoring resource utilization (CPU, Network) allows you to use Auto Scaling to match demand, ensuring you aren’t over-provisioned (wasting money) or under-provisioned (hurting performance).
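A target tracking policy is one common way to tie CPU utilization to Auto Scaling. The sketch below builds arguments for the Auto Scaling `PutScalingPolicy` API; the helper name, the policy naming scheme, and the 50% default target are assumptions chosen for the example.

```python
def build_cpu_target_tracking(asg_name: str, target_cpu: float = 50.0) -> dict:
    """Build PutScalingPolicy arguments that keep the group's average
    CPU utilization near `target_cpu` percent."""
    if not 0 < target_cpu <= 100:
        raise ValueError("target_cpu must be a percentage in (0, 100]")
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": f"{asg_name}-cpu-target",  # hypothetical naming scheme
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": target_cpu,
        },
    }


# To apply it (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("autoscaling").put_scaling_policy(
#       **build_cpu_target_tracking("hypothetical-web-asg"))
```

With target tracking, Auto Scaling creates and manages the underlying CloudWatch alarms itself, so you do not define them by hand.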
Service Comparison: Logs vs. Metrics
| Feature | CloudWatch Metrics | CloudWatch Logs |
|---|---|---|
| Primary Purpose | Numerical time-series data for performance. | Textual records of events and errors. |
| Retention | Up to 15 months (with decreasing resolution). | Indefinite by default (configurable from 1 day to 10 years, or Never Expire). |
| Resolution | Standard (1 min) or High (1 sec). | Real-time ingestion. |
| Trigger Source | Used to trigger Alarms and Auto Scaling. | Used for Metric Filters to create Metrics from text. |
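The last row, turning log text into a metric, can be sketched as a `PutMetricFilter` request. The filter name, metric name, and the `Hypothetical/App` namespace below are placeholders invented for this example.

```python
def build_error_metric_filter(log_group: str,
                              namespace: str = "Hypothetical/App") -> dict:
    """Build PutMetricFilter arguments that count log events
    containing the term ERROR."""
    return {
        "logGroupName": log_group,
        "filterName": "error-count",       # hypothetical name
        "filterPattern": "ERROR",          # matches events containing this term
        "metricTransformations": [{
            "metricName": "ErrorCount",    # hypothetical name
            "metricNamespace": namespace,
            "metricValue": "1",            # emit 1 for each matching event
            "defaultValue": 0.0,           # emit 0 when nothing matches
        }],
    }


# To create the filter (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("logs").put_metric_filter(
#       **build_error_metric_filter("/hypothetical/app"))
```

The resulting `ErrorCount` metric behaves like any other metric, so it can back an alarm or a dashboard widget.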
Decision Matrix: Monitoring Scenarios
- If you need to monitor EC2 Memory or Disk Space usage… Then you MUST install the CloudWatch Unified Agent (these are not default metrics).
- If you need to react to an AWS API call (e.g., “Who deleted this S3 bucket?”)… Then use CloudTrail integrated with CloudWatch Logs.
- If you need to visualize trends over several months… Then use CloudWatch Dashboards.
- If you want to detect “Unknown Unknowns” or anomalies… Then enable CloudWatch Anomaly Detection.
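For the first scenario above, the agent needs a configuration file that tells it which memory and disk metrics to collect. The sketch below generates a minimal one; the helper name, the default mount point, and the 60-second interval are assumptions, and a real deployment would usually add more sections.

```python
import json


def build_agent_config(mount_points=("/",), interval_seconds: int = 60) -> str:
    """Return a minimal CloudWatch agent configuration (as JSON text)
    that collects memory and disk usage percentages."""
    config = {
        "metrics": {
            "metrics_collected": {
                "mem": {
                    "measurement": ["mem_used_percent"],
                    "metrics_collection_interval": interval_seconds,
                },
                "disk": {
                    "measurement": ["used_percent"],
                    "resources": list(mount_points),
                    "metrics_collection_interval": interval_seconds,
                },
            }
        }
    }
    return json.dumps(config, indent=2)


# Print the JSON so it can be saved as the agent's configuration file.
print(build_agent_config())
```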
Exam Tips: Golden Nuggets
- Standard vs. Custom: CPU, Network, and Disk I/O are default EC2 metrics. Memory, swap, disk space, and application-level stats are Custom Metrics, published by the agent or by your own code.
- Resolution: Standard resolution is 60 seconds. High resolution can go down to 1 second (useful for fast-scaling needs).
- Log Groups: Organize your logs. Retention policies are set at the Log Group level, not the individual stream level.
- Alarms: An alarm has three states: OK, ALARM, and INSUFFICIENT_DATA.
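To make the custom metric and resolution tips concrete, here is a small sketch that builds a `PutMetricData` request. The helper name, the `Hypothetical/App` namespace, and the metric name in the usage comment are invented for illustration.

```python
def build_custom_metric(name: str, value: float, unit: str = "Percent",
                        high_resolution: bool = False) -> dict:
    """Build PutMetricData arguments for a single custom data point.

    StorageResolution is 60 for standard resolution and 1 for
    high resolution (one-second granularity).
    """
    return {
        "Namespace": "Hypothetical/App",  # placeholder namespace
        "MetricData": [{
            "MetricName": name,
            "Value": float(value),
            "Unit": unit,
            "StorageResolution": 1 if high_resolution else 60,
        }],
    }


# To publish it (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(
#       **build_custom_metric("QueueDepthPercent", 42, high_resolution=True))
```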
CloudWatch Architectural Flow
Key Services
- Container Insights: For ECS/EKS.
- Lambda Insights: For serverless performance.
- Contributor Insights: Find the “Top-N” contributors (e.g., busiest IPs).
Common Pitfalls
- Expecting RAM metrics without the Agent.
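- Assuming Logs are encrypted with your own KMS key by default (log data is always encrypted at rest, but using a customer managed key requires associating a KMS key with the Log Group).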
- Ignoring the cost of High Resolution metrics.
Quick Patterns
- Centralized Logging: Use a subscription filter to stream logs through Kinesis Data Firehose (now Amazon Data Firehose) to S3.
- Alerting: CloudWatch Alarm -> SNS Topic -> PagerDuty/Email.
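The centralized logging pattern is typically wired up with a subscription filter on the log group. The following sketch builds arguments for the `PutSubscriptionFilter` API; the helper name, the filter name, and the ARNs in the usage comment are placeholders, and the Firehose delivery stream and IAM role are assumed to exist already.

```python
def build_firehose_subscription(log_group: str, firehose_arn: str,
                                role_arn: str) -> dict:
    """Build PutSubscriptionFilter arguments that forward every event
    in a log group to a Firehose delivery stream."""
    return {
        "logGroupName": log_group,
        "filterName": "ship-to-s3",      # hypothetical name
        "filterPattern": "",             # empty pattern matches all events
        "destinationArn": firehose_arn,
        "roleArn": role_arn,             # role CloudWatch Logs assumes to write
    }


# To create the subscription (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("logs").put_subscription_filter(
#       **build_firehose_subscription(
#           "/hypothetical/app",
#           "arn:aws:firehose:us-east-1:123456789012:deliverystream/hypothetical",
#           "arn:aws:iam::123456789012:role/hypothetical-logs-to-firehose"))
```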