Agent skill
anth-observability
Set up observability for Claude API integrations with metrics, logging, and alerting for latency, cost, errors, and token usage. Trigger with phrases like "anthropic monitoring", "claude observability", "anthropic metrics", "track claude usage", "claude dashboard".
Install this agent skill to your Project
npx add-skill https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/main/plugins/saas-packs/anthropic-pack/skills/anth-observability
SKILL.md
Anthropic Observability
Overview
Instrument Claude API calls with structured logging, Prometheus metrics, and cost tracking. Every API response includes usage data and rate limit headers — capture these for dashboards and alerting.
Structured Logging
import anthropic
import logging
import time
import json
logger = logging.getLogger("claude")
def create_with_logging(client: anthropic.Anthropic, **kwargs) -> anthropic.types.Message:
start = time.monotonic()
request_meta = {
"model": kwargs.get("model"),
"max_tokens": kwargs.get("max_tokens"),
"tool_count": len(kwargs.get("tools", [])),
"stream": kwargs.get("stream", False),
}
try:
response = client.messages.create(**kwargs)
duration_ms = int((time.monotonic() - start) * 1000)
logger.info(json.dumps({
"event": "claude.request",
"request_id": response._request_id,
"model": response.model,
"input_tokens": response.usage.input_tokens,
"output_tokens": response.usage.output_tokens,
"cache_read_tokens": getattr(response.usage, "cache_read_input_tokens", 0),
"stop_reason": response.stop_reason,
"duration_ms": duration_ms,
"content_blocks": len(response.content),
}))
return response
except anthropic.APIStatusError as e:
duration_ms = int((time.monotonic() - start) * 1000)
logger.error(json.dumps({
"event": "claude.error",
"status": e.status_code,
"error_type": getattr(e, "type", "unknown"),
"duration_ms": duration_ms,
"request_id": e.response.headers.get("request-id", "unknown"),
}))
raise
Prometheus Metrics
from prometheus_client import Counter, Histogram, Gauge
claude_requests = Counter(
"claude_requests_total", "Total Claude API requests",
["model", "stop_reason", "status"]
)
claude_latency = Histogram(
"claude_latency_seconds", "Claude API latency",
["model"], buckets=[0.5, 1, 2, 5, 10, 30, 60]
)
claude_tokens = Counter(
"claude_tokens_total", "Token usage",
["model", "direction"] # direction: input|output|cache_read
)
claude_cost = Counter(
"claude_cost_usd", "Estimated cost in USD",
["model"]
)
claude_rate_limit_remaining = Gauge(
"claude_rate_limit_remaining", "Remaining rate limit",
["dimension"] # dimension: requests|tokens
)
def track_metrics(response, duration: float):
model = response.model
claude_requests.labels(model=model, stop_reason=response.stop_reason, status="ok").inc()
claude_latency.labels(model=model).observe(duration)
claude_tokens.labels(model=model, direction="input").inc(response.usage.input_tokens)
claude_tokens.labels(model=model, direction="output").inc(response.usage.output_tokens)
# Cost estimation
pricing = {"claude-haiku-4-20250514": (0.80, 4.0), "claude-sonnet-4-20250514": (3.0, 15.0)}
rates = pricing.get(model, (3.0, 15.0))
cost = (response.usage.input_tokens * rates[0] + response.usage.output_tokens * rates[1]) / 1e6
claude_cost.labels(model=model).inc(cost)
Key Metrics Dashboard
| Metric | Description | Alert Threshold |
|---|---|---|
claude_requests_total{status="error"} |
Error count | > 5% of total |
claude_latency_seconds p99 |
Tail latency | > 10s |
claude_cost_usd daily |
Daily spend | > 80% budget |
claude_rate_limit_remaining{dimension="requests"} |
RPM headroom | < 10% remaining |
claude_tokens_total{direction="output"} rate |
Output throughput | Spike detection |
Usage API (Server-Side)
# Anthropic's Usage & Cost API for billing reconciliation
# GET https://api.anthropic.com/v1/usage
# Returns daily token usage and cost per model
Error Handling
| Observability Gap | Risk | Fix |
|---|---|---|
| No request_id logged | Can't debug with support | Capture response._request_id |
| Missing cost tracking | Budget surprise | Track per-request cost |
| No latency histogram | Can't spot slow queries | Add Prometheus/Datadog histograms |
Resources
Next Steps
For incident response, see anth-incident-runbook.
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
dockerfile-generator
Dockerfile Generator - Auto-activating skill for DevOps Basics. Triggers on: dockerfile generator, dockerfile generator Part of the DevOps Basics skill category.
branch-naming-helper
Branch Naming Helper - Auto-activating skill for DevOps Basics. Triggers on: branch naming helper, branch naming helper Part of the DevOps Basics skill category.
readme-generator
Readme Generator - Auto-activating skill for DevOps Basics. Triggers on: readme generator, readme generator Part of the DevOps Basics skill category.
makefile-generator
Makefile Generator - Auto-activating skill for DevOps Basics. Triggers on: makefile generator, makefile generator Part of the DevOps Basics skill category.
gitignore-generator
Gitignore Generator - Auto-activating skill for DevOps Basics. Triggers on: gitignore generator, gitignore generator Part of the DevOps Basics skill category.
pre-commit-hook-setup
Pre Commit Hook Setup - Auto-activating skill for DevOps Basics. Triggers on: pre commit hook setup, pre commit hook setup Part of the DevOps Basics skill category.
Didn't find tool you were looking for?