VM Monitoring & Troubleshooting

Overview

Monitoring and troubleshooting Compute Engine instances is a core topic of the Associate Cloud Engineer (ACE) exam. It involves observing resource utilization, diagnosing connectivity issues, and setting up automated responses to system health changes using Google Cloud’s operations suite (formerly Stackdriver, now branded Google Cloud Observability).

The Analogy: The Modern Car Dashboard

Imagine driving a high-end car. Cloud Monitoring is your dashboard showing speed and fuel (Metrics). Cloud Logging is the “black box” recorder that logs every turn and gear shift (Events). The Ops Agent is like an aftermarket sensor you install on the engine to get specialized data like oil pressure or tire temperature that the standard dashboard doesn’t show by default.

Detail Elaboration: Google Cloud Operations Suite

To effectively manage VMs, you must understand the distinction between infrastructure-level metrics and OS-level metrics. By default, Google Cloud sees the “outside” of the VM (CPU, Network, Disk I/O). To see the “inside” (Memory usage, specific process logs), you must install the Ops Agent.
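The outside/inside split shows up in the metric type itself. The sketch below assumes that infrastructure metrics use the compute.googleapis.com/ prefix and Ops Agent metrics use the agent.googleapis.com/ prefix; the helper name needs_ops_agent is hypothetical, and a prefix check is a simplification rather than an official classification rule.

```python
# Metric type prefixes (assumed convention for this sketch):
# - compute.googleapis.com/ is reported by the platform for every VM
# - agent.googleapis.com/ is reported only when the Ops Agent runs in the guest OS
INFRA_PREFIX = "compute.googleapis.com/"
AGENT_PREFIX = "agent.googleapis.com/"


def needs_ops_agent(metric_type: str) -> bool:
    """Return True when the metric is collected from inside the guest OS."""
    if metric_type.startswith(AGENT_PREFIX):
        return True
    if metric_type.startswith(INFRA_PREFIX):
        return False
    raise ValueError(f"unrecognized metric type: {metric_type}")


if __name__ == "__main__":
    for metric in (
        "compute.googleapis.com/instance/cpu/utilization",
        "agent.googleapis.com/memory/percent_used",
        "agent.googleapis.com/disk/percent_used",
    ):
        print(metric, "-> needs Ops Agent:", needs_ops_agent(metric))
```

If a chart for an agent.googleapis.com metric is empty, the usual first check is whether the agent is installed and running on that VM.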

Core Concepts

Reliability: Use Uptime Checks to verify your application is reachable from global locations.
Scalability: Link Monitoring metrics to Instance Groups for Autoscaling (e.g., scale out when CPU > 70%).
Operational Excellence: Use Logs Explorer to query specific error codes (e.g., 404 or 500) across thousands of VMs with a single filter.
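The autoscaling example (scale out when CPU > 70%) comes down to simple arithmetic. The sketch below is a simplified model, not the autoscaler's actual implementation: it assumes total load spreads evenly across the group, and it ignores minimum/maximum size limits and stabilization delays. The function name recommended_size is hypothetical.

```python
import math


def recommended_size(current_vms: int, avg_cpu_pct: float, target_cpu_pct: float = 70) -> int:
    """Smallest group size that brings average CPU back to the target.

    Simplified model: total load (current_vms * avg_cpu_pct) is assumed
    to spread evenly across the resized group.
    """
    if current_vms < 1 or avg_cpu_pct < 0 or not 0 < target_cpu_pct <= 100:
        raise ValueError("invalid autoscaling inputs")
    # Never recommend fewer than one VM, even when the group is idle.
    return max(1, math.ceil(current_vms * avg_cpu_pct / target_cpu_pct))


if __name__ == "__main__":
    print(recommended_size(4, 90))   # 4 VMs at 90% against a 70% target
    print(recommended_size(10, 30))  # 10 VMs at 30% against a 70% target
```

With 4 VMs averaging 90% CPU, the load is 360 "VM-percent"; dividing by the 70% target gives about 5.14, so the group grows to 6 VMs.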

Comparison: Monitoring vs. Logging vs. Error Reporting

Feature	Cloud Monitoring	Cloud Logging	Error Reporting
Focus	Numerical Metrics (CPU, RAM)	Text-based Events/Logs	Application Crashes/Exceptions
Use Case	Dashboards & Alerting	Auditing & Debugging	Developer Debugging
Data Type	Time-series integers/floats	JSON/String entries	Stack traces
Cost Driver	Volume of custom metrics	Log ingestion volume (GB)	Included with Logging

Scenario-Based Decision Matrix

If the requirement is…

To monitor RAM usage: You MUST install the Ops Agent (Compute Engine does not see RAM by default).
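To troubleshoot a VM that won’t boot: Check the serial port output (in the console or with gcloud compute instances get-serial-port-output); enable the interactive serial console if you need to log in.
To find out who deleted a VM: Check the Admin Activity entries of Cloud Audit Logs in Logs Explorer.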
To be notified when a website is down: Create an Uptime Check and an Alerting Policy.
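For the "who deleted a VM" case, a Logs Explorer filter along these lines narrows the Admin Activity audit log to instance deletions. This is a sketch under assumptions: the helper name vm_deletion_filter and the example project and instance names are hypothetical, and the method name is matched as a substring because the exact value carries an API version prefix.

```python
from typing import Optional


def vm_deletion_filter(project_id: str, instance_name: Optional[str] = None) -> str:
    """Build a Logging query for Compute Engine instance deletions."""
    clauses = [
        # Admin Activity audit log for the project (the slash is URL-encoded as %2F).
        f'logName="projects/{project_id}/logs/cloudaudit.googleapis.com%2Factivity"',
        'resource.type="gce_instance"',
        # ":" is the substring ("has") operator in the Logging query language.
        'protoPayload.methodName:"compute.instances.delete"',
    ]
    if instance_name:
        # Substring match on the resource path; it also matches longer names
        # that begin with instance_name.
        clauses.append(f'protoPayload.resourceName:"instances/{instance_name}"')
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(vm_deletion_filter("my-project", "web-1"))
```

In the matching entries, the caller's identity appears under protoPayload.authenticationInfo.principalEmail.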

Exam Tips: Golden Nuggets

The Agent Rule: If an exam question asks about monitoring Memory (RAM) or Disk Space *inside* the OS, the answer almost always involves the Ops Agent.
Firewall Troubleshooting: If a VM is running but unreachable, check VPC Firewall Rules and use Connectivity Tests.
Alerting: You can send alerts to Email, SMS, Slack, PagerDuty, or even trigger a Cloud Pub/Sub message for automation.
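Default Retention: Logs in the _Default bucket are kept for 30 days unless you change the retention setting; Admin Activity audit logs in the _Required bucket are kept for 400 days. For long-term storage or analysis, route logs to Cloud Storage or BigQuery with a sink.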
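An alerting policy pairs a condition with notification channels. The sketch below builds a policy body as a plain dictionary, following the field names of the Cloud Monitoring AlertPolicy resource as I understand them; the function name cpu_alert_policy, the display names, and the channel ID are hypothetical, and the dictionary is only printed here, not sent to any API.

```python
import json
from typing import Any, Dict, List


def cpu_alert_policy(threshold: float, duration_s: int, channels: List[str]) -> Dict[str, Any]:
    """Policy that fires when VM CPU utilization stays above threshold for duration_s seconds."""
    return {
        "displayName": "High CPU on Compute Engine VMs",
        "combiner": "OR",
        "conditions": [{
            "displayName": f"CPU utilization above {threshold:.0%}",
            "conditionThreshold": {
                # CPU utilization is reported as a fraction between 0.0 and 1.0.
                "filter": 'resource.type = "gce_instance" AND '
                          'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
                "comparison": "COMPARISON_GT",
                "thresholdValue": threshold,
                "duration": f"{duration_s}s",
                "aggregations": [
                    {"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MEAN"}
                ],
            },
        }],
        # Channel resource names (email, Slack, Pub/Sub, ...) created beforehand.
        "notificationChannels": channels,
    }


if __name__ == "__main__":
    policy = cpu_alert_policy(0.8, 300, ["projects/my-project/notificationChannels/123"])
    print(json.dumps(policy, indent=2))
```

Pointing the policy at a Pub/Sub notification channel is what enables the automation case: a subscriber can react to the alert message programmatically.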

Infographic: Observability Flow

Key GCP Services

Ops Agent: Combined logging/metrics agent.
Metrics Explorer: Visualizer for ad-hoc metric queries.
Log Router: Sinks for exporting logs.

Common Pitfalls

Forgetting to give the VM Service Account the “Monitoring Metric Writer” role (and “Logs Writer” for logs), so the Ops Agent cannot send data.
Assuming standard metrics include Disk Space or RAM.

Architecture Patterns

Centralized Logging: Exporting logs from multiple projects into one BigQuery dataset.
Auto-Healing: Using HTTP Health Checks in a managed instance group (MIG) to recreate failed VMs.
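Auto-healing hinges on the health check's unhealthy threshold: a VM is treated as failed only after a run of consecutive failed probes, so a single timeout does not trigger a recreate. The sketch below simulates that rule; the function name first_unhealthy_probe and the sample probe sequences are hypothetical, and real health checks also involve check intervals and timeouts that are not modeled here.

```python
from typing import Iterable, Optional


def first_unhealthy_probe(probe_results: Iterable[bool], unhealthy_threshold: int) -> Optional[int]:
    """Return the 0-based index of the probe at which the VM is marked unhealthy.

    probe_results holds True for a successful probe and False for a failed one.
    Returns None if the threshold of consecutive failures is never reached.
    """
    if unhealthy_threshold < 1:
        raise ValueError("unhealthy_threshold must be at least 1")
    consecutive_failures = 0
    for index, ok in enumerate(probe_results):
        # A successful probe resets the failure streak.
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= unhealthy_threshold:
            return index
    return None


if __name__ == "__main__":
    probes = [True, False, False, True, False, False, False]
    print(first_unhealthy_probe(probes, unhealthy_threshold=3))
```

In the sample sequence, the two failures at positions 1 and 2 are cancelled by the success at position 3, so the VM is only marked unhealthy at the last probe.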

Key Fact: Cloud Monitoring can monitor resources on AWS and On-Premises using the BindPlane integration.