Agent skill

Observability & Monitoring

Structured logging, metrics, distributed tracing, and alerting strategies

Stars 8
Forks 0

Install this agent skill to your Project

npx add-skill https://github.com/ArieGoldkin/ai-agent-hub/tree/main/skills/observability-monitoring

SKILL.md

Observability & Monitoring Skill

Comprehensive frameworks for implementing observability including structured logging, metrics, distributed tracing, and alerting.

When to Use

  • Setting up application monitoring
  • Implementing structured logging
  • Adding metrics and dashboards
  • Configuring distributed tracing
  • Creating alerting rules
  • Debugging production issues

Three Pillars of Observability

┌─────────────────┬─────────────────┬─────────────────┐
│     LOGS        │     METRICS     │     TRACES      │
├─────────────────┼─────────────────┼─────────────────┤
│ What happened   │ How is system   │ How do requests │
│ at specific     │ performing      │ flow through    │
│ point in time   │ over time       │ services        │
└─────────────────┴─────────────────┴─────────────────┘

Structured Logging

Log Levels

Level Use Case
ERROR Unhandled exceptions, failed operations
WARN Deprecated API, retry attempts
INFO Business events, successful operations
DEBUG Development troubleshooting

Best Practice

typescript
// Good: Structured with context
logger.info('User action completed', {
  action: 'purchase',
  userId: user.id,
  orderId: order.id,
  duration_ms: 150
});

// Bad: String interpolation
logger.info(`User ${user.id} completed purchase`);

See templates/structured-logging.ts for Winston setup and request middleware

Metrics Collection

RED Method (Rate, Errors, Duration)

Essential metrics for any service:

  • Rate - Requests per second
  • Errors - Failed requests per second
  • Duration - Request latency distribution

Prometheus Buckets

typescript
// HTTP request latency
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]

// Database query latency
buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1]

See templates/prometheus-metrics.ts for full metrics configuration

Distributed Tracing

OpenTelemetry Setup

Auto-instrument common libraries:

  • Express/HTTP
  • PostgreSQL
  • Redis

Manual Spans

typescript
tracer.startActiveSpan('processOrder', async (span) => {
  span.setAttribute('order.id', orderId);
  // ... work
  span.end();
});

See templates/opentelemetry-tracing.ts for full setup

Alerting Strategy

Severity Levels

Level Response Time Examples
Critical (P1) < 15 min Service down, data loss
High (P2) < 1 hour Major feature broken
Medium (P3) < 4 hours Increased error rate
Low (P4) Next day Warnings

Key Alerts

Alert Condition Severity
ServiceDown up == 0 for 1m Critical
HighErrorRate 5xx > 5% for 5m Critical
HighLatency p95 > 2s for 5m High
LowCacheHitRate < 70% for 10m Medium

See templates/alerting-rules.yml for Prometheus alerting rules

Health Checks

Kubernetes Probes

Probe Purpose Endpoint
Liveness Is app running? /health
Readiness Ready for traffic? /ready
Startup Finished starting? /startup

Readiness Response

json
{
  "status": "healthy|degraded|unhealthy",
  "checks": {
    "database": { "status": "pass", "latency_ms": 5 },
    "redis": { "status": "pass", "latency_ms": 2 }
  },
  "version": "1.0.0",
  "uptime": 3600
}

See templates/health-checks.ts for implementation

Observability Checklist

Implementation

  • JSON structured logging
  • Request correlation IDs
  • RED metrics (Rate, Errors, Duration)
  • Business metrics
  • Distributed tracing
  • Health check endpoints

Alerting

  • Service outage alerts
  • Error rate thresholds
  • Latency thresholds
  • Resource utilization alerts

Dashboards

  • Service overview
  • Error analysis
  • Performance metrics

Extended Thinking Triggers

Use Opus 4.5 extended thinking for:

  • Incident investigation - Correlating logs, metrics, traces
  • Alert tuning - Reducing noise, catching real issues
  • Architecture decisions - Choosing monitoring solutions
  • Performance debugging - Cross-service latency analysis

Templates Reference

Template Purpose
structured-logging.ts Winston logger with request middleware
prometheus-metrics.ts HTTP, DB, cache metrics with middleware
opentelemetry-tracing.ts Distributed tracing setup
alerting-rules.yml Prometheus alerting rules
health-checks.ts Liveness, readiness, startup probes

Expand your agent's capabilities with these related and highly-rated skills.

ArieGoldkin/ai-agent-hub

brainstorming

Use when creating or developing anything, before writing code or implementation plans - refines rough ideas into fully-formed designs through structured Socratic questioning, alternative exploration, and incremental validation

8 0
Explore
ArieGoldkin/ai-agent-hub

security-checklist

Use this skill when implementing security measures or conducting security audits. Provides OWASP Top 10 mitigations, authentication patterns, input validation strategies, and compliance guidelines. Ensures applications are secure against common vulnerabilities.

8 0
Explore
ArieGoldkin/ai-agent-hub

prototype-to-production

Convert design prototypes (HTML, CSS, Figma exports) into production-ready components. Analyzes prototype structure, extracts design tokens, identifies reusable patterns, and generates typed React components. Adapts to existing project tech stack with React + TypeScript as default.

8 0
Explore
ArieGoldkin/ai-agent-hub

Performance Optimization

Full-stack performance analysis, optimization patterns, and monitoring strategies

8 0
Explore
ArieGoldkin/ai-agent-hub

ai-native-development

Build AI-first applications with RAG pipelines, embeddings, vector databases, agentic workflows, and LLM integration. Master prompt engineering, function calling, streaming responses, and cost optimization for 2025+ AI development.

8 0
Explore
ArieGoldkin/ai-agent-hub

react-server-components-framework

Design and implement React Server Components with Next.js 15 App Router. Master server-first architecture, streaming SSR, Server Actions, and modern data fetching patterns for 2025+ frontend development.

8 0
Explore

Didn't find tool you were looking for?

Be as detailed as possible for better results