Overview
Monitoring and troubleshooting Compute Engine instances is a core pillar of the ACE exam. It involves observing resource utilization, diagnosing connectivity issues, and setting up automated responses to system health changes using Google Cloud’s operations suite (formerly Stackdriver).
The Analogy: The Modern Car Dashboard
Imagine driving a high-end car. Cloud Monitoring is your dashboard showing speed and fuel (Metrics). Cloud Logging is the “black box” recorder that logs every turn and gear shift (Events). The Ops Agent is like an aftermarket sensor you install on the engine to get specialized data like oil pressure or tire temperature that the standard dashboard doesn’t show by default.
Detail Elaboration: Google Cloud Operations Suite
To effectively manage VMs, you must understand the distinction between infrastructure-level metrics and OS-level metrics. By default, Google Cloud sees the “outside” of the VM (CPU, Network, Disk I/O). To see the “inside” (Memory usage, specific process logs), you must install the Ops Agent.
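One way to keep the "outside" versus "inside" split straight is to look at the metric type prefix. The sketch below assumes the usual convention that hypervisor-level metrics start with `compute.googleapis.com/` and Ops Agent metrics start with `agent.googleapis.com/`; the helper name `metric_source` is made up for illustration.

```python
# Metric type prefixes as they appear in Cloud Monitoring.
HYPERVISOR_PREFIX = "compute.googleapis.com/"  # visible without any agent
AGENT_PREFIX = "agent.googleapis.com/"         # reported by the Ops Agent


def metric_source(metric_type: str) -> str:
    """Classify a metric type by where it is collected (illustrative helper)."""
    if metric_type.startswith(AGENT_PREFIX):
        return "ops-agent"
    if metric_type.startswith(HYPERVISOR_PREFIX):
        return "hypervisor"
    return "other"


if __name__ == "__main__":
    for metric in (
        "compute.googleapis.com/instance/cpu/utilization",
        "agent.googleapis.com/memory/percent_used",
        "agent.googleapis.com/disk/percent_used",
    ):
        print(f"{metric} -> {metric_source(metric)}")
```

If a metric you need comes back as `ops-agent`, it will not show up until the agent is installed and running on the VM.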
Core Concepts
- Reliability: Use Uptime Checks to verify your application is reachable from global locations.
- Scalability: Attach Monitoring metrics to a Managed Instance Group (MIG) autoscaling policy (e.g., scale out when average CPU utilization exceeds 70%).
- Operational Excellence: Use Logs Explorer to query specific error codes (e.g., 404 or 500) across thousands of VMs instantly.
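To get a feel for the 70% CPU example, here is a back-of-the-envelope model of target-utilization scaling. It is a simplified proportional estimate, not the autoscaler's actual algorithm; the function name and the `min_size`/`max_size` defaults are assumptions chosen for illustration.

```python
import math


def recommended_size(current_size: int, avg_cpu: float,
                     target_cpu: float = 0.70,
                     min_size: int = 1, max_size: int = 10) -> int:
    """Estimate how many VMs bring average CPU back to the target.

    Simplified proportional model: total load (current_size * avg_cpu)
    is spread over enough VMs that each sits at or below target_cpu.
    """
    if not 0 < target_cpu <= 1:
        raise ValueError("target_cpu must be in (0, 1]")
    needed = math.ceil(current_size * avg_cpu / target_cpu)
    # Clamp to the group's configured limits.
    return max(min_size, min(max_size, needed))


if __name__ == "__main__":
    # 4 VMs averaging 90% CPU against a 70% target.
    print(recommended_size(4, 0.90))
```

With 4 VMs at 90% average CPU, the model suggests 6 VMs, since 4 × 0.90 / 0.70 is roughly 5.14 and the result is rounded up.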
Comparison: Monitoring vs. Logging vs. Error Reporting
| Feature | Cloud Monitoring | Cloud Logging | Error Reporting |
|---|---|---|---|
| Focus | Numerical Metrics (CPU, RAM) | Text-based Events/Logs | Application Crashes/Exceptions |
| Use Case | Dashboards & Alerting | Auditing & Debugging | Developer Debugging |
| Data Type | Time-series integers/floats | JSON/String entries | Stack traces |
| Cost Driver | Volume of custom metrics | Log ingestion volume (GB) | Included with Logging |
Scenario-Based Decision Matrix
If the requirement is…
- To monitor RAM usage: You MUST install the Ops Agent (Compute Engine does not see RAM by default).
- To troubleshoot a VM that won’t boot: Check the serial port output; if you need an interactive login, enable the serial console on the VM.
- To find out who deleted a VM: Check the Cloud Audit Logs (Admin Activity) in Logs Explorer.
- To be notified when a website is down: Create an Uptime Check and an Alerting Policy.
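For the "who deleted a VM" scenario, the query might look like the sketch below. It assumes the Admin Activity log name and the `protoPayload` field layout of Cloud Audit Logs; the project ID, the sample entry, and both helper names are hypothetical.

```python
def vm_deletion_filter(project_id: str) -> str:
    """Build a Logging query for VM deletions in Admin Activity audit logs."""
    return (
        f'logName="projects/{project_id}/logs/'
        'cloudaudit.googleapis.com%2Factivity" '
        'AND resource.type="gce_instance" '
        # ':' is the "has" (substring) operator, so API version prefixes
        # in the method name still match.
        'AND protoPayload.methodName:"compute.instances.delete"'
    )


def actor_of(entry: dict) -> str:
    """Pull the caller's identity out of an audit log entry (as a dict)."""
    return (
        entry.get("protoPayload", {})
        .get("authenticationInfo", {})
        .get("principalEmail", "unknown")
    )


if __name__ == "__main__":
    print(vm_deletion_filter("my-project"))  # hypothetical project ID
    # Hypothetical, heavily trimmed audit log entry.
    sample = {
        "protoPayload": {
            "methodName": "v1.compute.instances.delete",
            "authenticationInfo": {"principalEmail": "alice@example.com"},
        }
    }
    print(actor_of(sample))
```

Paste the generated filter into Logs Explorer and read the `principalEmail` field of the matching entry to identify the caller.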
Exam Tips: Golden Nuggets
- The Agent Rule: If an exam question asks about monitoring Memory (RAM) or Disk Space *inside* the OS, the answer almost always involves the Ops Agent.
- Firewall Troubleshooting: If a VM is running but unreachable, check VPC Firewall Rules and use Connectivity Tests.
- Alerting: You can send alerts to Email, SMS, Slack, PagerDuty, or even trigger a Cloud Pub/Sub message for automation.
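- Default Retention: Logs are kept for a limited time: 30 days by default in the _Default log bucket, while Admin Activity audit logs in the _Required bucket are kept for 400 days. For longer-term storage, route logs to Cloud Storage or BigQuery with a Log Router sink.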
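The retention tip comes down to simple date arithmetic. The sketch below assumes the 30-day default for standard logs; the function name and the dates are illustrative only.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION_DAYS = 30  # assumed default for standard logs


def is_still_retained(entry_time: datetime, now: datetime,
                      retention_days: int = DEFAULT_RETENTION_DAYS) -> bool:
    """Return True if a log entry is younger than the retention period."""
    return now - entry_time < timedelta(days=retention_days)


if __name__ == "__main__":
    now = datetime(2024, 3, 1, tzinfo=timezone.utc)
    incident = datetime(2024, 1, 15, tzinfo=timezone.utc)  # 46 days earlier
    print(is_still_retained(incident, now))
```

An incident from 46 days ago falls outside the 30-day window, which is the situation where an export to Cloud Storage or BigQuery would have been needed.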
Infographic: Observability Flow
Key GCP Services
- Ops Agent: Combined logging/metrics agent.
- Metrics Explorer: Visualizer for ad-hoc queries.
- Log Router: Sinks for exporting logs.
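A Log Router sink points at a destination written as a service-prefixed path. The sketch below builds those destination strings for the three common targets; the helper name and the example project, bucket, and dataset names are hypothetical.

```python
def sink_destination(kind: str, project_id: str, name: str) -> str:
    """Build the destination string for a Log Router sink."""
    if kind == "storage":
        # Cloud Storage bucket names are global, so no project is included.
        return f"storage.googleapis.com/{name}"
    if kind == "bigquery":
        return f"bigquery.googleapis.com/projects/{project_id}/datasets/{name}"
    if kind == "pubsub":
        return f"pubsub.googleapis.com/projects/{project_id}/topics/{name}"
    raise ValueError(f"unsupported sink kind: {kind!r}")


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(sink_destination("storage", "central-logs", "my-log-archive"))
    print(sink_destination("bigquery", "central-logs", "all_projects_logs"))
```

Pointing sinks from several projects at the same BigQuery dataset path is the basis of a centralized logging setup.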
Common Pitfalls
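- Forgetting to grant the VM’s service account the Monitoring Metric Writer role (roles/monitoring.metricWriter); without it the Ops Agent cannot write metrics. Logs need the Logs Writer role (roles/logging.logWriter) in the same way.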
- Assuming standard metrics include Disk Space or RAM.
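The first pitfall can be caught with a quick set comparison. This sketch assumes the Ops Agent needs both Monitoring Metric Writer and Logs Writer (the latter is added here because the agent also ships logs); the helper name and the sample role list are illustrative.

```python
# Roles the VM's service account needs for the Ops Agent to write data.
REQUIRED_ROLES = frozenset({
    "roles/monitoring.metricWriter",  # write metrics
    "roles/logging.logWriter",        # write logs
})


def missing_agent_roles(granted_roles) -> list:
    """Return the required roles that are not in granted_roles, sorted."""
    return sorted(REQUIRED_ROLES - set(granted_roles))


if __name__ == "__main__":
    # Hypothetical service account that was only given logging access.
    print(missing_agent_roles(["roles/logging.logWriter"]))
```

An empty result means the service account has what the agent needs; anything listed is a likely reason metrics or logs are not arriving.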
Architecture Patterns
- Centralized Logging: Exporting logs from multiple projects into one BigQuery dataset.
- Auto-Healing: Attaching an HTTP health check to a Managed Instance Group so that VMs failing the check are recreated automatically.
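Auto-healing hinges on a health check declaring a VM unhealthy after several failed probes in a row. The sketch below is a simplified model of that consecutive-failure rule, not the actual health checker; the function name and the default threshold of 3 are assumptions.

```python
from typing import Iterable


def becomes_unhealthy(probe_results: Iterable[bool],
                      unhealthy_threshold: int = 3) -> bool:
    """Return True if the probes ever fail `unhealthy_threshold` times in a row.

    probe_results is an ordered series of probe outcomes (True = success).
    """
    consecutive_failures = 0
    for ok in probe_results:
        # A success resets the streak; a failure extends it.
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= unhealthy_threshold:
            return True
    return False


if __name__ == "__main__":
    # Two failures, a recovery, then three failures in a row.
    print(becomes_unhealthy([False, False, True, False, False, False]))
```

A single failed probe does not trigger recreation in this model; only an unbroken streak of failures does, which is why a brief blip does not cost you a VM.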