Operations Agent
Observability Agent
Elite Site Reliability Engineer for production monitoring, logging, alerting, distributed tracing, SLO/SLI definition, incident response, and runbook creation.
Overview
The Observability agent specializes in monitoring, incident response, and operational excellence. It uses metrics, logging, and tracing to keep production systems visible and reliable, and to make them quick to recover when issues occur.
Core Capabilities
- Metrics & Monitoring - Prometheus + Grafana, Datadog, New Relic, CloudWatch dashboards
- Distributed Tracing - Jaeger, Zipkin, OpenTelemetry instrumentation, trace correlation
- Logging - ELK Stack, Loki, structured JSON logging, log aggregation
- Alerting - Alert rules, severity levels, runbook integration, on-call routing
- SLO/SLI Definition - Error budgets, burn rate alerts, availability targets
- Incident Response - Runbook creation, postmortem templates, incident timelines
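To make the SLO/SLI capability concrete, the arithmetic behind error budgets and burn rates can be sketched as follows. This is a minimal illustration, not the agent's actual implementation: the function names, the 30-day window, and the 14.4 paging threshold (a burn rate that would consume 2% of a 30-day budget in one hour) are assumptions chosen for the example.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests that may fail while still meeting the SLO."""
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full outage the budget allows over the window (assumed 30 days)."""
    return error_budget(slo_target) * window_days * 24 * 60


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """How fast the budget is being spent: 1.0 means exactly on budget."""
    if total == 0:
        return 0.0  # no traffic, so nothing is being burned
    observed_error_rate = failed / total
    return observed_error_rate / error_budget(slo_target)


def should_page(failed: int, total: int, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Hypothetical paging rule: page when the burn rate exceeds the threshold."""
    return burn_rate(failed, total, slo_target) > threshold


if __name__ == "__main__":
    slo = 0.999  # 99.9% availability target
    print(f"Allowed downtime: {allowed_downtime_minutes(slo):.1f} min per 30 days")
    print(f"Burn rate: {burn_rate(10, 1000, slo):.1f}x")
    print(f"Page on-call: {should_page(20, 1000, slo)}")
```

A burn rate above 1.0 means the service will exhaust its error budget before the window ends if the current error rate continues. Real burn rate alerts usually combine a short and a long window to avoid paging on brief spikes.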
When to Use
- Setting up monitoring dashboards and metrics
- Investigating production incidents and root causes
- Defining SLOs and error budgets for services
- Implementing distributed tracing across microservices
- Creating runbooks and incident response procedures
- Writing postmortems after production incidents
Monitoring Methodologies
The Four Golden Signals
- Latency - Time to service a request
- Traffic - Demand on your system (req/sec)
- Errors - Rate of failed requests
- Saturation - How "full" your service is (CPU, memory)
The RED Method (request-focused)
- Rate - Requests per second
- Errors - Failed requests per second
- Duration - Distribution of request latencies
The USE Method (resource-focused)
- Utilization - Percentage of time the resource is busy
- Saturation - Queue length, waiting work
- Errors - Count of error events
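As an illustration of the RED method, the sketch below computes rate, errors, and duration percentiles from a batch of request records. The `Request` type, the `red_summary` function, and the choice to treat HTTP 5xx responses as failures are assumptions made for this example; a production setup would normally derive these values from a metrics backend rather than from raw records.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def percentile(sorted_values: list, p: float) -> Optional[float]:
    """Nearest-rank percentile of an already sorted list; None if empty."""
    if not sorted_values:
        return None
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def red_summary(requests: list, window_seconds: float) -> dict:
    """Summarize Rate, Errors, and Duration over one observation window."""
    total = len(requests)
    failed = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx = failure
    durations = sorted(r.duration_ms for r in requests)
    return {
        "rate_per_sec": total / window_seconds,
        "errors_per_sec": failed / window_seconds,
        "p50_ms": percentile(durations, 50),
        "p99_ms": percentile(durations, 99),
    }


if __name__ == "__main__":
    sample = [
        Request(10, 200),
        Request(20, 200),
        Request(30, 500),
        Request(40, 200),
    ]
    print(red_summary(sample, window_seconds=2))
```

Duration is reported as percentiles rather than an average because a mean hides the slow tail that users actually notice.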
Related Agents
- DevOps - Deployment and infrastructure
- Bug Find - Root cause analysis
- Performance - Load testing and optimization