AWS Management and Monitoring: Amazon CloudWatch
Amazon CloudWatch is a monitoring and observability service built for DevOps engineers, developers, site reliability engineers (SREs), and IT managers. It provides data and actionable insights to monitor applications, respond to system-wide performance changes, and optimize resource utilization.
Core Components of CloudWatch
1. CloudWatch Metrics
Metrics are time-ordered data points published to CloudWatch. Many AWS services publish metrics at no charge by default (Standard monitoring, which AWS documentation calls basic monitoring). Custom metrics can be published with the AWS CLI or an SDK.
- Standard Monitoring: 5-minute intervals (Free for many services).
- Detailed Monitoring: 1-minute intervals (Paid, common for EC2).
- High-Resolution Metrics: Up to 1-second resolution (used for sub-minute spikes).
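As a rough illustration of publishing a custom metric, the sketch below builds a request shaped like the PutMetricData API call, where `StorageResolution` selects between standard (60) and high-resolution (1) storage. The helper name `build_put_metric_data`, the namespace `MyApp/Checkout`, and the metric `QueueDepth` are made up for this example; sending the request assumes boto3 and AWS credentials, which are not exercised here.

```python
from datetime import datetime, timezone


def build_put_metric_data(namespace, name, value, unit="Count",
                          high_resolution=False, dimensions=None):
    """Build keyword arguments shaped like a PutMetricData request.

    Hypothetical helper for illustration; it only assembles a dict.
    """
    # Namespaces beginning with "AWS/" are reserved for AWS services.
    if namespace.startswith("AWS/"):
        raise ValueError("custom namespaces must not start with 'AWS/'")

    datum = {
        "MetricName": name,
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
        # 1 = high-resolution (1-second), 60 = standard (1-minute)
        "StorageResolution": 1 if high_resolution else 60,
    }
    if dimensions:
        datum["Dimensions"] = [
            {"Name": key, "Value": val} for key, val in dimensions.items()
        ]
    return {"Namespace": namespace, "MetricData": [datum]}


# Example values below are assumptions, not part of any real application.
params = build_put_metric_data(
    "MyApp/Checkout", "QueueDepth", 42,
    high_resolution=True, dimensions={"Environment": "prod"},
)

# With boto3 installed and credentials configured, this could be sent with:
#   boto3.client("cloudwatch").put_metric_data(**params)
```

Keeping the request construction separate from the API call makes the payload easy to inspect before anything is billed as a custom metric.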
2. CloudWatch Logs
CloudWatch Logs centralizes logs from your systems, applications, and AWS services. You can search and filter them, and either store them indefinitely or set a retention (expiration) policy.
- Log Groups: A collection of log streams that share the same retention and access control.
- Log Streams: A sequence of log events from a specific source (e.g., an instance id).
- Metric Filters: Extract data from logs to create numerical metrics (e.g., count “404” errors in Apache logs).
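To make the metric filter idea concrete, here is a local simulation of what such a filter does: it scans log events and turns matches into a number. This is not CloudWatch's own filter pattern syntax; the regular expression, the function `count_status`, and the sample Apache-style log lines are assumptions for illustration only.

```python
import re

# Common Log Format: host ident user [time] "request" status bytes
APACHE_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) \S+'
)


def count_status(log_events, status_code):
    """Count log events whose HTTP status field equals status_code."""
    count = 0
    for line in log_events:
        match = APACHE_LINE.match(line)
        if match and match.group("status") == str(status_code):
            count += 1
    return count


# Hypothetical log events standing in for one log stream.
sample_events = [
    '10.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2024:13:55:37 +0000] "GET /missing HTTP/1.1" 404 209',
    '10.0.0.3 - - [10/Oct/2024:13:55:38 +0000] "GET /page/404 HTTP/1.1" 200 512',
    '10.0.0.4 - - [10/Oct/2024:13:55:39 +0000] "GET /gone HTTP/1.1" 404 209',
]

not_found = count_status(sample_events, 404)
```

Matching on the status field rather than the raw string “404” avoids counting the third line, where 404 appears only in the URL path.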
3. CloudWatch Alarms
Alarms watch a single metric over a specified time period and perform one or more actions based on the value of the metric relative to a threshold.
- Actions: SNS Notifications, Auto Scaling policies, or EC2 Actions (Stop, Terminate, Reboot, Recover).
- Composite Alarms: Alarms that trigger based on the state of multiple other alarms.
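The threshold logic above can be sketched as a small function. This is a simplified model, assuming a greater-than comparison and no missing data (real alarms also have configurable missing-data handling); `evaluate_alarm` and the sample datapoints are hypothetical.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods,
                   datapoints_to_alarm=None):
    """Return 'ALARM', 'OK', or 'INSUFFICIENT_DATA'.

    Simplified model of a greater-than-threshold alarm: the alarm fires
    when at least `datapoints_to_alarm` of the most recent
    `evaluation_periods` datapoints breach the threshold.
    """
    if datapoints_to_alarm is None:
        datapoints_to_alarm = evaluation_periods

    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return "INSUFFICIENT_DATA"

    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Hypothetical per-period averages, oldest first.
cpu_state = evaluate_alarm([55, 82, 91, 88], threshold=80,
                           evaluation_periods=3)
error_state = evaluate_alarm([1, 0, 7, 2], threshold=5,
                             evaluation_periods=3, datapoints_to_alarm=2)

# A composite alarm combines child alarm states; here, a logical AND.
composite_state = (
    "ALARM" if cpu_state == "ALARM" and error_state == "ALARM" else "OK"
)
```

Combining the two child states with AND shows why composite alarms reduce noise: high CPU alone does not page anyone unless errors are also elevated.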
4. CloudWatch Events / EventBridge
Amazon EventBridge (formerly CloudWatch Events) is a serverless event bus that connects applications using events from your own applications, integrated SaaS applications, and AWS services. It is the “glue” for event-driven architectures.
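EventBridge rules select events with a JSON event pattern. The sketch below is a minimal local matcher that covers only exact-value matching on nested fields; it does not implement prefix, numeric, or other operators, and it is not how the service is implemented. The function `matches` and the bucket name `example-uploads` are assumptions for illustration.

```python
def matches(pattern, event):
    """Return True if event satisfies an exact-value event pattern.

    Each pattern leaf is a list of allowed values; nested dicts
    describe nested event fields.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Recurse into nested objects such as "detail".
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif isinstance(actual, list):
            # An array field matches if any element is an allowed value.
            if not any(item in expected for item in actual):
                return False
        elif actual not in expected:
            return False
    return True


# Pattern for S3 "Object Created" events on a hypothetical bucket.
pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": ["example-uploads"]}},
}

upload_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {"bucket": {"name": "example-uploads"},
               "object": {"key": "invoice.pdf"}},
}
other_event = dict(upload_event, source="aws.ec2")

is_upload = matches(pattern, upload_event)
is_other = matches(pattern, other_event)
```

Fields present in the event but absent from the pattern (such as `object.key` here) are ignored, so a pattern only needs to name the fields it cares about.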
Comparison: Monitoring vs. Auditing vs. Governance
| Feature | Amazon CloudWatch | AWS CloudTrail | AWS Config |
|---|---|---|---|
| Focus | Performance & Health | API Auditing / User Activity | Resource Inventory & Compliance |
| Primary Data | Metrics, Logs, Alarms | Who did what? (API calls) | Configuration History / State |
| Use Case | “Is my CPU usage high?” | “Who deleted this S3 bucket?” | “Is port 22 open on my SG?” |
Decision Matrix / If–Then Guide
- If you need to monitor EC2 Memory or Disk Usage… then you must install the Unified CloudWatch Agent (these are not available by default).
- If you need to react to a specific state change (e.g., an S3 upload)… then use EventBridge to trigger a Lambda function.
- If you need to store logs long-term for low cost… then export CloudWatch Logs to S3, or stream them through a subscription filter to Kinesis Data Firehose.
- If you need to monitor application performance across accounts… then use CloudWatch Cross-Account Observability.
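For the memory and disk case, the Unified CloudWatch Agent reads a JSON configuration file. The sketch below generates a minimal configuration that collects `mem_used_percent` and disk `used_percent`. The 60-second interval and the choice of the `/` mount point are assumptions; a real deployment would likely need more settings.

```python
import json

# Minimal agent configuration sketch (values are illustrative assumptions).
agent_config = {
    "agent": {
        # How often, in seconds, the agent collects metrics.
        "metrics_collection_interval": 60,
    },
    "metrics": {
        "metrics_collected": {
            # Memory is not reported by EC2 itself; the agent reads it
            # from inside the operating system.
            "mem": {"measurement": ["mem_used_percent"]},
            # Disk space used on the root volume.
            "disk": {"resources": ["/"], "measurement": ["used_percent"]},
        }
    },
}

config_json = json.dumps(agent_config, indent=2)
print(config_json)
```

The instance also needs an IAM role that lets the agent publish metrics, which is outside the scope of this sketch.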
Exam Tips and Gotchas
- Memory is NOT a default metric: This is a classic SAA-C03 question. CPU, Network, Disk I/O (instance store volumes), and status checks are default; Memory and disk space usage require the Agent.
- Resolution Matters: Standard metrics are 1 min or 5 min. High-resolution metrics can go down to 1 second.
- CloudWatch Logs Agent vs. Unified Agent: Always prefer the “Unified CloudWatch Agent” as it handles both logs and system-level metrics.
- Metric Retention: Metrics are not stored forever. Data points with a period of < 60s are kept for 3 hours; 1-minute data points for 15 days; 5-minute data points for 63 days; 1-hour data points for 455 days (15 months).
- Namespace: A container for CloudWatch metrics. AWS services use AWS/ServiceName (e.g., AWS/EC2).
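The retention rule can be turned into a small lookup that answers "how fine-grained is the data I can still query at a given age?". It uses the tiers CloudWatch documents (under 60 s: 3 hours; 1 minute: 15 days; 5 minutes: 63 days; 1 hour: 455 days). The function name `finest_period_available` is hypothetical, and the 1-second tier applies only to metrics that were published as high-resolution.

```python
from datetime import timedelta

# (period in seconds, how long data at that period is retained)
RETENTION_TIERS = [
    (1, timedelta(hours=3)),       # high-resolution data points
    (60, timedelta(days=15)),      # 1-minute data points
    (300, timedelta(days=63)),     # 5-minute data points
    (3600, timedelta(days=455)),   # 1-hour data points
]


def finest_period_available(age):
    """Return the smallest period (seconds) still retained at this age.

    Returns None once the data is older than the longest tier.
    """
    for period, keep_for in RETENTION_TIERS:
        if age <= keep_for:
            return period
    return None


one_hour_old = finest_period_available(timedelta(hours=1))
ten_days_old = finest_period_available(timedelta(days=10))
thirty_days_old = finest_period_available(timedelta(days=30))
too_old = finest_period_available(timedelta(days=500))
```

In practice this means a sub-minute spike is only visible at full detail for a few hours; older data is available only at coarser periods.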
Topics covered:
Key subtopics in this guide:
- Standard vs. Detailed Monitoring
- Custom Metrics and High Resolution
- CloudWatch Logs, Log Groups, and Metric Filters
- CloudWatch Alarms and Actions (SNS, ASG, EC2)
- EventBridge (CloudWatch Events) for automation
- The Unified CloudWatch Agent requirements
- Comparison between CloudWatch, CloudTrail, and Config
Amazon CloudWatch Architecture
- Integrations: Works natively with almost every AWS service. Use IAM Roles to allow EC2 instances to push logs/metrics. Use KMS to encrypt log groups for sensitive data.
- High Resolution: Use for critical apps where 1-minute visibility is too slow.
- Anomaly Detection: Uses ML to adjust alarm thresholds based on historical patterns.
- Log Retention: Log groups default to “Never Expire”; set retention to 30 or 90 days to save costs.
- Detailed Monitoring: Enable only on production instances to avoid extra per-metric charges.
Production Use Case: A retail website uses CloudWatch Alarms on “RequestCount” from an ALB. If traffic spikes, an Auto Scaling policy adds instances. Simultaneously, a Metric Filter scans logs for “PaymentFailed” strings, triggering an SNS alert to the DevOps team.