Agent skill
monitoring-skill
Monitoring and observability with Prometheus, Grafana, ELK Stack, and distributed tracing.
Install this agent skill to your Project
npx add-skill https://github.com/pluginagentmarketplace/custom-plugin-devops/tree/main/skills/monitoring
SKILL.md
Monitoring & Observability Skill
Overview
Master the three pillars of observability: metrics, logs, and traces.
Parameters
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| pillar | string | No | all | Observability pillar |
| tool | string | No | prometheus | Tool focus |
Core Topics
MANDATORY
- Prometheus metrics and PromQL
- Grafana dashboards
- ELK Stack basics
- SLIs, SLOs, error budgets
- Alerting rules
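To make the SLO and error-budget topic concrete, here is a minimal sketch of the arithmetic for a request-based SLO. The function name, the dictionary keys, and the sample numbers are illustrative assumptions; they are not part of the skill itself.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarise error-budget consumption for a request-based SLO.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the share of requests that may fail without breaking the SLO.
    allowed_failures = (1.0 - slo_target) * total_requests
    # The SLI is the observed fraction of good requests.
    sli = 1.0 - failed_requests / total_requests
    consumed = failed_requests / allowed_failures

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        "budget_remaining": 1.0 - consumed,
    }


if __name__ == "__main__":
    # Hypothetical window: one million requests, 250 of them failed.
    print(error_budget(0.999, 1_000_000, 250))
```

With a 99.9% target over one million requests, roughly 1,000 failures are allowed, so 250 failures consume about a quarter of the budget.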
OPTIONAL
- Distributed tracing
- OpenTelemetry
- Custom exporters
- Log correlation
ADVANCED
- High cardinality handling
- Recording rules
- Federation
- Continuous profiling
Quick Reference
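# PromQL
# Request rate per service
sum(rate(http_requests_total[5m])) by (service)
# p99 latency aggregated across series (the le label must be kept)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Error rate as a percentage of all requests
100 * sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Prometheus API
curl http://localhost:9090/api/v1/targets
curl 'http://localhost:9090/api/v1/query?query=up'
# Reload the configuration (Prometheus must be started with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
# Alertmanager
amtool silence add alertname="HighLatency" --duration=2h --comment="planned maintenance"
amtool alert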
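The curl calls above pass a trivial query (up); longer PromQL expressions need URL encoding. The following sketch shows one way to build an instant-query URL and flatten a vector response, assuming the standard /api/v1/query response shape. The helper names and the sample response body are hypothetical.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the /api/v1/query endpoint with the query URL-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def vector_to_dict(body: str, label: str) -> dict:
    """Map the value of one label to the sample value for an instant-vector result."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is {"metric": {...labels...}, "value": [timestamp, "value-as-string"]}.
    return {s["metric"].get(label, ""): float(s["value"][1]) for s in data["result"]}


if __name__ == "__main__":
    print(instant_query_url(
        "http://localhost:9090",
        "sum(rate(http_requests_total[5m])) by (service)",
    ))
    # Hypothetical response body, shaped like a real instant-query result.
    sample = (
        '{"status":"success","data":{"resultType":"vector","result":'
        '[{"metric":{"service":"api"},"value":[1700000000,"12.5"]}]}}'
    )
    print(vector_to_dict(sample, "service"))
```

Fetching the URL (for example with urllib.request.urlopen) is left out so the sketch runs without a Prometheus server.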
SRE Golden Signals
| Signal | Metric |
|---|---|
| Latency | histogram_quantile(0.99, ...) |
| Traffic | sum(rate(requests_total[5m])) |
| Errors | rate(errors_total[5m]) |
| Saturation | node_memory_MemAvailable_bytes |
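For the latency signal, it helps to see how a quantile is estimated from cumulative histogram buckets. The sketch below is a simplified approximation of the linear interpolation that histogram_quantile performs on classic histograms; it assumes positive bucket bounds and a final +Inf bucket, and the function name and sample buckets are hypothetical.

```python
import math


def estimate_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by upper bound and end with (math.inf, total_count).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one with bound +Inf")

    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # assumed lower edge of the first bucket
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                # Nothing to interpolate over (only reachable when rank is 0).
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Hypothetical request-duration buckets in seconds.
    sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    print(estimate_quantile(0.95, sample))
```

Because the estimate is interpolated within a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies that matter.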
Troubleshooting
Common Failures
| Symptom | Root Cause | Solution |
|---|---|---|
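| No data | Scrape failing | Check the /targets page for scrape errors |
| Alert not firing | PromQL error | Evaluate the alert expression in the Prometheus UI |
| High cardinality | Too many label values | Drop or relabel high-cardinality labels |
| Slow queries | Too much data | Aggregate, or precompute with recording rules |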
Debug Checklist
- Check targets: /targets
- Test query in UI
- Check logs: journalctl -u prometheus
- Verify time sync (NTP)
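The first checklist step can also be scripted. This sketch filters the JSON returned by /api/v1/targets down to the targets that are not healthy, assuming the usual activeTargets layout with health, scrapeUrl, lastError, and labels fields. The function name and the sample payload are hypothetical.

```python
import json


def unhealthy_targets(body: str) -> list:
    """Return job, scrape URL, and last error for every target whose health is not "up"."""
    payload = json.loads(body)
    targets = payload["data"]["activeTargets"]
    return [
        {
            "job": t.get("labels", {}).get("job", ""),
            "scrapeUrl": t.get("scrapeUrl", ""),
            "lastError": t.get("lastError", ""),
        }
        for t in targets
        if t.get("health") != "up"
    ]


if __name__ == "__main__":
    # Hypothetical response body with one healthy and one failing target.
    sample = json.dumps({
        "status": "success",
        "data": {"activeTargets": [
            {"labels": {"job": "node"}, "scrapeUrl": "http://10.0.0.1:9100/metrics",
             "health": "up", "lastError": ""},
            {"labels": {"job": "api"}, "scrapeUrl": "http://10.0.0.2:8080/metrics",
             "health": "down", "lastError": "connection refused"},
        ]},
    })
    for target in unhealthy_targets(sample):
        print(target)
```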
Recovery Procedures
Prometheus OOM
- Check cardinality
- Reduce retention
- Add federation
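For the "check cardinality" step, one option is to rank metrics by how many series they contribute and then see which labels drive the count. The sketch below works on a list of label sets, such as the one returned by the /api/v1/series endpoint. The function names and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def series_per_metric(series: list, top: int = 10) -> list:
    """Rank metric names by number of series, highest first."""
    counts = Counter(labels.get("__name__", "") for labels in series)
    return counts.most_common(top)


def distinct_label_values(series: list, metric: str) -> dict:
    """Count the distinct values of each label for one metric."""
    values = defaultdict(set)
    for labels in series:
        if labels.get("__name__") != metric:
            continue
        for name, value in labels.items():
            if name != "__name__":
                values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


if __name__ == "__main__":
    # Hypothetical label sets; a per-user label inflates the series count.
    sample = [
        {"__name__": "http_requests_total", "service": "api", "user_id": str(i)}
        for i in range(5)
    ] + [{"__name__": "up", "job": "node"}]
    print(series_per_metric(sample))
    print(distinct_label_values(sample, "http_requests_total"))
```

A label with many distinct values, such as the per-user label in the sample, is usually the first candidate to drop or aggregate away.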
Resources