Agent skill
prometheus-configuration
Set up Prometheus for comprehensive metric collection, storage, and monitoring of infrastructure and applications. Use when implementing metrics collection, setting up monitoring infrastructure, or configuring alerting systems.
Install this agent skill to your Project
npx add-skill https://github.com/wshobson/agents/tree/main/plugins/observability-monitoring/skills/prometheus-configuration
SKILL.md
Prometheus Configuration
Complete guide to Prometheus setup, metric collection, scrape configuration, and recording rules.
Purpose
Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications.
When to Use
- Set up Prometheus monitoring
- Configure metric scraping
- Create recording rules
- Design alert rules
- Implement service discovery
Prometheus Architecture
┌──────────────┐
│ Applications │ ← Instrumented with client libraries
└──────┬───────┘
│ /metrics endpoint
↓
┌──────────────┐
│ Prometheus │ ← Scrapes metrics periodically
│ Server │
└──────┬───────┘
│
├─→ AlertManager (alerts)
├─→ Grafana (visualization)
└─→ Long-term storage (Thanos/Cortex)
Installation
Kubernetes with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.retention=30d \
--set prometheus.prometheusSpec.storageVolumeSize=50Gi
Docker Compose
version: "3.8"
services:
prometheus:
image: prom/prometheus:latest
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus-data:/prometheus
command:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention.time=30d"
volumes:
prometheus-data:
Configuration File
prometheus.yml:
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: "production"
region: "us-west-2"
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
# Load rules files
rule_files:
- /etc/prometheus/rules/*.yml
# Scrape configurations
scrape_configs:
# Prometheus itself
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
# Node exporters
- job_name: "node-exporter"
static_configs:
- targets:
- "node1:9100"
- "node2:9100"
- "node3:9100"
relabel_configs:
- source_labels: [__address__]
target_label: instance
regex: "([^:]+)(:[0-9]+)?"
replacement: "${1}"
# Kubernetes pods with annotations
- job_name: "kubernetes-pods"
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels:
[__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: pod
# Application metrics
- job_name: "my-app"
static_configs:
- targets:
- "app1.example.com:9090"
- "app2.example.com:9090"
metrics_path: "/metrics"
scheme: "https"
tls_config:
ca_file: /etc/prometheus/ca.crt
cert_file: /etc/prometheus/client.crt
key_file: /etc/prometheus/client.key
Reference: See assets/prometheus.yml.template
Scrape Configurations
Static Targets
scrape_configs:
- job_name: "static-targets"
static_configs:
- targets: ["host1:9100", "host2:9100"]
labels:
env: "production"
region: "us-west-2"
File-based Service Discovery
scrape_configs:
- job_name: "file-sd"
file_sd_configs:
- files:
- /etc/prometheus/targets/*.json
- /etc/prometheus/targets/*.yml
refresh_interval: 5m
targets/production.json:
[
{
"targets": ["app1:9090", "app2:9090"],
"labels": {
"env": "production",
"service": "api"
}
}
]
Kubernetes Service Discovery
scrape_configs:
- job_name: "kubernetes-services"
kubernetes_sd_configs:
- role: service
relabel_configs:
- source_labels:
[__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels:
[__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
Reference: See references/scrape-configs.md
Recording Rules
Create pre-computed metrics for frequently queried expressions:
# /etc/prometheus/rules/recording_rules.yml
groups:
- name: api_metrics
interval: 15s
rules:
# HTTP request rate per service
- record: job:http_requests:rate5m
expr: sum by (job) (rate(http_requests_total[5m]))
# Error rate percentage
- record: job:http_requests_errors:rate5m
expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
- record: job:http_requests_error_rate:percentage
expr: |
(job:http_requests_errors:rate5m / job:http_requests:rate5m) * 100
# P95 latency
- record: job:http_request_duration:p95
expr: |
histogram_quantile(0.95,
sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
)
- name: resource_metrics
interval: 30s
rules:
# CPU utilization percentage
- record: instance:node_cpu:utilization
expr: |
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory utilization percentage
- record: instance:node_memory:utilization
expr: |
100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)
# Disk usage percentage
- record: instance:node_disk:utilization
expr: |
100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
Reference: See references/recording-rules.md
Alert Rules
# /etc/prometheus/rules/alert_rules.yml
groups:
- name: availability
interval: 30s
rules:
- alert: ServiceDown
expr: up{job="my-app"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.instance }} is down"
description: "{{ $labels.job }} has been down for more than 1 minute"
- alert: HighErrorRate
expr: job:http_requests_error_rate:percentage > 5
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate for {{ $labels.job }}"
description: "Error rate is {{ $value }}% (threshold: 5%)"
- alert: HighLatency
expr: job:http_request_duration:p95 > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High latency for {{ $labels.job }}"
description: "P95 latency is {{ $value }}s (threshold: 1s)"
- name: resources
interval: 1m
rules:
- alert: HighCPUUsage
expr: instance:node_cpu:utilization > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value }}%"
- alert: HighMemoryUsage
expr: instance:node_memory:utilization > 85
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value }}%"
- alert: DiskSpaceLow
expr: instance:node_disk:utilization > 90
for: 5m
labels:
severity: critical
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Disk usage is {{ $value }}%"
Validation
# Validate configuration
promtool check config prometheus.yml
# Validate rules
promtool check rules /etc/prometheus/rules/*.yml
# Test query
promtool query instant http://localhost:9090 'up'
Reference: See scripts/validate-prometheus.sh
Best Practices
- Use consistent naming for metrics (prefix_name_unit)
- Set appropriate scrape intervals (15-60s typical)
- Use recording rules for expensive queries
- Implement high availability (multiple Prometheus instances)
- Configure retention based on storage capacity
- Use relabeling for metric cleanup
- Monitor Prometheus itself
- Implement federation for large deployments
- Use Thanos/Cortex for long-term storage
- Document custom metrics
Troubleshooting
Check scrape targets:
curl http://localhost:9090/api/v1/targets
Check configuration:
curl http://localhost:9090/api/v1/status/config
Test query:
curl 'http://localhost:9090/api/v1/query?query=up'
Related Skills
grafana-dashboards- For visualizationslo-implementation- For SLO monitoringdistributed-tracing- For request tracing
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
protocol-reverse-engineering
Master network protocol reverse engineering including packet analysis, protocol dissection, and custom protocol documentation. Use when analyzing network traffic, understanding proprietary protocols, or debugging network communication.
binary-analysis-patterns
Master binary analysis patterns including disassembly, decompilation, control flow analysis, and code pattern recognition. Use when analyzing executables, understanding compiled code, or performing static analysis on binaries.
anti-reversing-techniques
Understand anti-reversing, obfuscation, and protection techniques encountered during software analysis. Use this skill when analyzing malware evasion techniques, when implementing anti-debugging protections for CTF challenges, when reverse engineering packed binaries, or when building security research tools that need to detect virtualized environments.
memory-forensics
Master memory forensics techniques including memory acquisition, process analysis, and artifact extraction using Volatility and related tools. Use when analyzing memory dumps, investigating incidents, or performing malware analysis from RAM captures.
nx-workspace-patterns
Configure and optimize Nx monorepo workspaces. Use when setting up Nx, configuring project boundaries, optimizing build caching, or implementing affected commands.
auth-implementation-patterns
Master authentication and authorization patterns including JWT, OAuth2, session management, and RBAC to build secure, scalable access control systems. Use when implementing auth systems, securing APIs, or debugging security issues.
Didn't find tool you were looking for?