AWS Management and Monitoring: Amazon CloudWatch

Amazon CloudWatch is a monitoring and observability service built for DevOps engineers, developers, site reliability engineers (SREs), and IT managers. It provides data and actionable insights to monitor applications, respond to system-wide performance changes, and optimize resource utilization.

The Analogy: Think of CloudWatch as the “Central Nervous System” of your AWS infrastructure. Just as your nerves send signals to your brain about pain or temperature, CloudWatch collects metrics and logs from your resources. If something is wrong, it triggers an “Alarm” (like a reflex) to fix the issue or notify you.

Core Components of CloudWatch

1. CloudWatch Metrics

Metrics represent time-ordered data points. By default, many AWS services provide free metrics (Standard monitoring). Custom metrics can be pushed using the CLI or SDK.

Standard Monitoring: 5-minute intervals (Free for many services).
Detailed Monitoring: 1-minute intervals (Paid, common for EC2).
High-Resolution Metrics: Up to 1-second resolution (used for sub-minute spikes).
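
Example (sketch): The difference between a standard and a high-resolution custom metric comes down to one field in the PutMetricData request. The helper below builds a single metric datum as a plain dictionary. The function name, the metric name "CheckoutLatency", and the namespace "MyApp" are made up for illustration; the boto3 call is shown only as a comment and assumes boto3 and credentials are available.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit="Count", high_resolution=False, dimensions=None):
    """Build one entry for the MetricData list of a PutMetricData request."""
    datum = {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        "Timestamp": datetime.now(timezone.utc),
        # StorageResolution 1 = high resolution, 60 = standard resolution.
        "StorageResolution": 1 if high_resolution else 60,
    }
    if dimensions:
        datum["Dimensions"] = [{"Name": k, "Value": v} for k, v in dimensions.items()]
    return datum


# Hypothetical metric and dimension names, for illustration only.
datum = build_metric_datum(
    "CheckoutLatency", 182.5, unit="Milliseconds",
    high_resolution=True, dimensions={"Service": "checkout"},
)
print(datum["StorageResolution"])

# With boto3 installed and credentials configured, the datum could be sent with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(Namespace="MyApp", MetricData=[datum])
```

Custom namespaces must not start with "AWS/", which is reserved for AWS services.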

2. CloudWatch Logs

CloudWatch Logs lets you centralize logs from your systems, applications, and AWS services. You can search and filter them, and either keep them indefinitely or set a retention period per log group.

Log Groups: A collection of log streams that share the same retention and access control.
Log Streams: A sequence of log events from a specific source (e.g., an instance ID).
Metric Filters: Extract data from logs to create numerical metrics (e.g., count “404” errors in Apache logs).
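
Example (sketch): To see what a metric filter does conceptually, the snippet below counts “404” responses in a few Apache-style log lines using plain Python. This is a local stand-in, not the CloudWatch Logs filter pattern syntax; the sample log lines and the function name are invented for illustration. A real metric filter is attached to a log group and emits the count as a metric.

```python
import re

# Matches the HTTP status code that follows the quoted request line.
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def count_status(log_lines, status="404"):
    """Count log lines whose HTTP status field equals `status`."""
    count = 0
    for line in log_lines:
        match = STATUS_RE.search(line)
        if match and match.group(1) == status:
            count += 1
    return count


# Invented sample lines in Apache common log format.
sample = [
    '10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2023:13:55:37 +0000] "GET /missing.html HTTP/1.1" 404 209',
    '10.0.0.3 - - [10/Oct/2023:13:55:38 +0000] "GET /404.html HTTP/1.1" 200 512',
    '10.0.0.4 - - [10/Oct/2023:13:55:39 +0000] "GET /gone.png HTTP/1.1" 404 209',
]
print(count_status(sample))
```

Note the third line: the URL contains “404” but the status is 200, so it is not counted. Matching on the status field rather than on a bare substring avoids that false positive.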

3. CloudWatch Alarms

A metric alarm watches a single metric (or the result of a metric math expression) over a specified time period and performs one or more actions based on the value of the metric relative to a threshold.

Actions: SNS Notifications, Auto Scaling policies, or EC2 Actions (Stop, Terminate, Reboot, Recover).
Composite Alarms: Alarms that trigger based on the state of multiple other alarms.
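
Example (sketch): The alarm logic above can be modeled in a few lines. This is a simplified model of a “greater than threshold” alarm with an M-out-of-N rule (datapoints to alarm out of evaluation periods), plus a toy composite alarm. It ignores missing-data handling, and the function names are hypothetical; real composite alarms use rule expressions over named alarms.

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm=None):
    """Return "ALARM", "OK", or "INSUFFICIENT_DATA" for a greater-than-threshold alarm."""
    if datapoints_to_alarm is None:
        datapoints_to_alarm = evaluation_periods
    # Only the most recent `evaluation_periods` datapoints are considered.
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


def composite_state(child_states, rule="ANY"):
    """Toy composite alarm: ALARM if any (or all) child alarms are in ALARM."""
    in_alarm = [state == "ALARM" for state in child_states]
    triggered = any(in_alarm) if rule == "ANY" else all(in_alarm)
    return "ALARM" if triggered else "OK"


cpu = [40, 85, 70, 95]  # invented CPU utilization datapoints, oldest first
print(alarm_state(cpu, threshold=80, evaluation_periods=3))
print(alarm_state(cpu, threshold=80, evaluation_periods=3, datapoints_to_alarm=2))
print(composite_state(["OK", "ALARM"], rule="ALL"))
```

Requiring 3 of 3 breaching datapoints keeps the alarm quiet during a brief dip, while 2 of 3 reacts to the same series. Composite alarms are a common way to cut alert noise by requiring several conditions at once.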

4. CloudWatch Events / EventBridge

Amazon EventBridge (formerly CloudWatch Events) is a serverless event bus that makes it easy to connect applications using data from your own applications, integrated SaaS applications, and AWS services. It is the “Glue” for event-driven architectures.
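
Example (sketch): EventBridge rules select events with a JSON event pattern, where each field lists the values it accepts. The matcher below is a deliberately reduced version that supports only exact-value lists and nested objects (no prefix, numeric, or anything-but operators). The function name is hypothetical and the instance ID is a placeholder.

```python
def matches(pattern, event):
    """Return True if `event` satisfies a simplified EventBridge-style pattern."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # `expected` is a list of allowed values for this field.
            return False
    return True


pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},
}

event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}

print(matches(pattern, event))
```

Fields that appear in the event but not in the pattern (here, instance-id) are ignored, so patterns only need to name the fields they care about.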

Comparison: Monitoring vs. Auditing vs. Governance

Feature	Amazon CloudWatch	AWS CloudTrail	AWS Config
Focus	Performance & Health	API Auditing / User Activity	Resource Inventory & Compliance
Primary Data	Metrics, Logs, Alarms	Who did what? (API calls)	Configuration History / State
Use Case	“Is my CPU usage high?”	“Who deleted this S3 bucket?”	“Is port 22 open on my SG?”

Decision Matrix / If–Then Guide

If you need to monitor EC2 Memory or Disk Usage… then you must install the Unified CloudWatch Agent (these are not available by default).
If you need to react to a specific state change (e.g., an S3 upload)… then use EventBridge to trigger a Lambda function.
If you need to store logs long-term for low cost… then export CloudWatch Logs to S3, or stream them to S3 through a subscription filter and Kinesis Data Firehose (now Amazon Data Firehose).
If you need to monitor application performance across accounts… then use CloudWatch Cross-Account Observability.
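
Example (sketch): For the first rule above (EC2 memory and disk usage), the Unified CloudWatch Agent is driven by a JSON configuration file. The snippet below generates a minimal metrics section from Python. The function name and defaults are assumptions for illustration, and the field names should be checked against the agent configuration reference before use.

```python
import json


def agent_metrics_config(disk_paths=("/",), interval_seconds=60):
    """Build a minimal CloudWatch agent config that collects memory and disk usage."""
    return {
        "metrics": {
            "metrics_collected": {
                "mem": {
                    "measurement": ["mem_used_percent"],
                    "metrics_collection_interval": interval_seconds,
                },
                "disk": {
                    "measurement": ["used_percent"],
                    "resources": list(disk_paths),
                    "metrics_collection_interval": interval_seconds,
                },
            }
        }
    }


config = agent_metrics_config(disk_paths=("/", "/data"))
print(json.dumps(config, indent=2))
```

The instance also needs an IAM role that allows the agent to publish metrics; without it the agent runs but nothing reaches CloudWatch.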

Exam Tips and Gotchas

Memory is NOT a default metric: This is a classic SAA-C03 question. CPU, Network, Disk I/O, and status checks are default EC2 metrics; Memory and disk space usage require the Agent.
Resolution Matters: Standard metrics are 1 min or 5 min. High-resolution metrics can go down to 1 second.
CloudWatch Logs Agent vs. Unified Agent: Always prefer the “Unified CloudWatch Agent” as it handles both logs and system-level metrics.
Metric Retention: Metrics are not stored forever. Data points with a period of < 60s are kept for 3 hours; 1-minute data points for 15 days; 5-minute data points for 63 days; 1-hour data points for 455 days (15 months).
Namespace: A container for CloudWatch metrics. AWS services use AWS/ServiceName (e.g., AWS/EC2).
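
Example (sketch): The retention tiers are easier to remember as a lookup. Given how old a data point is, the function below returns the finest period (in seconds) at which it can still be retrieved. It includes the 63-day tier for 5-minute data points that CloudWatch documents alongside the others. The function and constant names are hypothetical, and the value 1 stands in for any sub-minute period.

```python
from datetime import timedelta

# (period in seconds, how long data points at that period are retained)
RETENTION_TIERS = [
    (1, timedelta(hours=3)),       # high-resolution (< 60 s) data points
    (60, timedelta(days=15)),      # 1-minute data points
    (300, timedelta(days=63)),     # 5-minute data points
    (3600, timedelta(days=455)),   # 1-hour data points (15 months)
]


def finest_period_available(age):
    """Return the smallest period still retained for data of the given age, or None."""
    for period, retention in RETENTION_TIERS:
        if age <= retention:
            return period
    return None


for days in (0.1, 10, 30, 200, 500):
    print(days, finest_period_available(timedelta(days=days)))
```

In practice this means a spike visible at 1-second detail today can only be seen as part of a 1-minute aggregate tomorrow, and as part of a 1-hour aggregate after about two months.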

Topics Covered

Summary of key subtopics covered in this guide:

Standard vs. Detailed Monitoring
Custom Metrics and High Resolution
CloudWatch Logs, Log Groups, and Metric Filters
CloudWatch Alarms and Actions (SNS, ASG, EC2)
EventBridge (CloudWatch Events) for automation
The Unified CloudWatch Agent requirements
Comparison between CloudWatch, CloudTrail, and Config

Amazon CloudWatch Architecture

Service Ecosystem

Integrations: Works natively with almost every AWS service. Use IAM Roles to allow EC2 instances to push logs/metrics. Use KMS to encrypt log groups for sensitive data.

Performance & Scaling

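High Resolution: Use for critical apps where 1-minute granularity is too coarse.
Anomaly Detection: Uses ML to model a metric's expected range from historical patterns; alarms can then fire when the metric leaves that range instead of crossing a fixed threshold.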
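
Example (sketch): The idea behind an anomaly band can be shown with a much cruder model than the one CloudWatch uses. The snippet below builds an expected range as mean plus or minus a number of standard deviations and flags values outside it. This is only an illustrative stand-in: it has no notion of seasonality or trend, and the function names and sample values are invented.

```python
from statistics import mean, stdev


def expected_band(history, width=2.0):
    """Return (low, high) as mean +/- width * sample standard deviation."""
    mu = mean(history)
    sigma = stdev(history)
    return mu - width * sigma, mu + width * sigma


def is_anomalous(value, history, width=2.0):
    """Flag a value that falls outside the expected band."""
    low, high = expected_band(history, width)
    return value < low or value > high


history = [100, 102, 98, 101, 99, 100]  # invented requests-per-minute samples
low, high = expected_band(history)
print(round(low, 2), round(high, 2))
print(is_anomalous(150, history), is_anomalous(101, history))
```

A wider band (larger width) produces fewer alerts at the cost of missing smaller deviations, which is the same trade-off you tune on a real anomaly detection alarm.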

Cost Optimization

Log Retention: Log groups are set to “Never Expire” by default. Change this to a fixed period such as 30 or 90 days to save costs.
Detailed Monitoring: Enable only on production instances to avoid extra per-metric charges.
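
Example (sketch): A quick audit for the log retention tip is to find log groups that have no retention setting. In a DescribeLogGroups response, a log group that never expires has no retentionInDays field. The snippet below works on a hand-written list shaped like that response; the log group names and function names are invented, and the boto3 call is shown only as a comment.

```python
def groups_without_retention(log_groups):
    """Return names of log groups that have no retention policy (never expire)."""
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]


def retention_requests(log_groups, days=30):
    """Build one PutRetentionPolicy parameter set per log group lacking a policy."""
    return [
        {"logGroupName": name, "retentionInDays": days}
        for name in groups_without_retention(log_groups)
    ]


# Invented sample shaped like the "logGroups" list of a DescribeLogGroups response.
log_groups = [
    {"logGroupName": "/aws/lambda/orders", "retentionInDays": 90},
    {"logGroupName": "/app/web/access"},
    {"logGroupName": "/app/web/error"},
]

for request in retention_requests(log_groups, days=30):
    print(request)
    # With boto3 installed and credentials configured:
    #   boto3.client("logs").put_retention_policy(**request)
```

Only specific retention values are accepted by the API, so stick to common ones such as 30 or 90 days unless you have checked the allowed list.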

Production Use Case: A retail website uses CloudWatch Alarms on “RequestCount” from an ALB. If traffic spikes, an Auto Scaling policy adds instances. Simultaneously, a Metric Filter scans logs for “PaymentFailed” strings, triggering an SNS alert to the DevOps team.
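
Example (sketch): The alarm in this use case could be described with parameters like the ones below, built as a plain dictionary in the shape of a PutMetricAlarm request. The alarm name, load balancer identifier, threshold, and action ARN are all placeholders, and the function name is hypothetical. The action ARN here is an SNS topic; a scaling policy ARN would be supplied the same way.

```python
def request_count_alarm(alarm_name, load_balancer, threshold, action_arns,
                        period=60, evaluation_periods=3):
    """Build PutMetricAlarm parameters for an ALB RequestCount alarm."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "RequestCount",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Sum",  # RequestCount is a count, so Sum is the useful statistic
        "Period": period,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": list(action_arns),
    }


# Placeholder identifiers, for illustration only.
params = request_count_alarm(
    alarm_name="retail-alb-traffic-spike",
    load_balancer="app/retail-alb/0123456789abcdef",
    threshold=10000,
    action_arns=["arn:aws:sns:us-east-1:123456789012:devops-alerts"],
)
print(params["Namespace"], params["MetricName"], params["Threshold"])

# With boto3 installed and credentials configured:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```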