Observability Agent

Elite Site Reliability Engineer for production monitoring, logging, alerting, distributed tracing, SLO/SLI definition, incident response, and runbook creation.

Overview

The Observability agent specializes in monitoring, incident response, and operational excellence. It uses metrics, logging, and tracing to keep production systems visible and reliable, and to make them quickly recoverable when issues occur.

Core Capabilities

Metrics & Monitoring - Prometheus + Grafana, Datadog, New Relic, CloudWatch dashboards
Distributed Tracing - Jaeger, Zipkin, OpenTelemetry instrumentation, trace correlation
Logging - ELK Stack, Loki, structured JSON logging, log aggregation
Alerting - Alert rules, severity levels, runbook integration, on-call routing
SLO/SLI Definition - Error budgets, burn rate alerts, availability targets
Incident Response - Runbook creation, postmortem templates, incident timelines
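
The error budget and burn rate behind the SLO/SLI capability can be sketched in a few lines. This is a minimal illustration, not the agent's own implementation: the function names (error_budget, burn_rate, should_page) and the idea of paging only when a short and a long window both exceed a threshold are assumptions chosen for the example.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail under the SLO (0.999 -> about 0.001)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget.

    A value of 1.0 means the budget is being spent at exactly the pace
    that would use it up by the end of the SLO window; higher is faster.
    """
    if total <= 0:
        # No traffic in the window, so no budget was consumed.
        return 0.0
    return (failed / total) / error_budget(slo_target)


def should_page(short_window_rate: float, long_window_rate: float,
                threshold: float) -> bool:
    """Hypothetical multi-window check: page only if both windows burn too fast."""
    return short_window_rate > threshold and long_window_rate > threshold
```

With a 99% availability target, 5 failures out of 1000 requests give a burn rate of roughly 0.5, so the service is spending its budget at about half the sustainable pace. The paging threshold is left as a parameter because a suitable value depends on the SLO window and the alerting policy in use.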

When to Use

Setting up monitoring dashboards and metrics
Investigating production incidents and root causes
Defining SLOs and error budgets for services
Implementing distributed tracing across microservices
Creating runbooks and incident response procedures
Writing postmortems after production incidents

Monitoring Methodologies

THE FOUR GOLDEN SIGNALS
  Latency      Time to service a request
  Traffic      Demand on your system (req/sec)
  Errors       Rate of failed requests
  Saturation   How "full" your service is (CPU, memory)

THE RED METHOD (Request-focused)
  Rate         Requests per second
  Errors       Failed requests per second
  Duration     Distribution of request latencies
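
As a rough sketch of the RED method, the three signals can be computed from a batch of request records observed in one time window. The Request type, the red_summary function, and the choice to count status codes of 500 and above as failures are all assumptions made for illustration; a real setup would normally derive these values from a metrics backend rather than from raw records.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP-style status code


def percentile(sorted_values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def red_summary(requests: Sequence[Request], window_s: float) -> dict:
    """Rate, Errors, and Duration for the requests seen in one window."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    durations = sorted(r.duration_s for r in requests)
    # Assumption: only server-side errors (5xx) count as failed requests.
    failed = sum(1 for r in requests if r.status >= 500)
    p50: Optional[float] = percentile(durations, 50) if durations else None
    p99: Optional[float] = percentile(durations, 99) if durations else None
    return {
        "rate_per_s": len(requests) / window_s,
        "errors_per_s": failed / window_s,
        "p50_s": p50,
        "p99_s": p99,
    }
```

Duration is reported as percentiles rather than a mean because the method asks for the distribution of latencies, and a mean hides slow outliers.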

THE USE METHOD (Resource-focused)
  Utilization  % time resource is busy
  Saturation   Queue length, waiting work
  Errors       Count of error events

Related Agents

DevOps - Deployment and infrastructure
Bug Find - Root cause analysis
Performance - Load testing and optimization
