Agent skill: observability

Use when diagnosing operation failures, stuck or slow operations, querying Jaeger traces, working with Grafana dashboards, debugging distributed system issues, or investigating worker selection and service communication problems.

Install: `npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/development/observability`

# Observability & Debugging
Load this skill when:
- Diagnosing operation failures, stuck operations, or slow operations
- Working with Jaeger traces or Grafana dashboards
- Debugging distributed system issues
- Investigating worker selection or service communication problems
## First Rule: Check Observability Before Logs
When users report issues with operations, use Jaeger first — not logs. KTRDR has comprehensive OpenTelemetry instrumentation that provides complete visibility into distributed operations.
This enables first-response diagnosis instead of iterative detective work.
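A quick way to confirm the tracing backend is reachable before querying (a minimal sketch, assuming the Jaeger query API is exposed on localhost:16686 as in the examples below):

```bash
# List the services currently reporting traces to Jaeger.
# Assumes the Jaeger query API at localhost:16686 (same endpoint used throughout this skill).
curl -s "http://localhost:16686/api/services" | jq '.data'
```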
## When to Query Jaeger
Query Jaeger when user reports:
| Symptom | What Jaeger Shows |
|---|---|
| "Operation stuck" | Which phase is stuck and why |
| "Operation failed" | Exact error with full context |
| "Operation slow" | Bottleneck span immediately |
| "No workers selected" | Worker selection decision |
| "Missing data" | Data flow from IB to cache |
| "Service not responding" | HTTP call attempt and result |
## Quick Start Workflow

### Step 1: Get operation ID

From CLI output or API response (e.g., `op_training_20251113_123456_abc123`).
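If the ID is buried in captured CLI output, a pattern match works; this sketch assumes a hypothetical `ktrdr_output.log` capture and that IDs follow the `op_<type>_<date>_<time>_<suffix>` format shown above:

```bash
# Extract the most recent operation ID from captured CLI output (hypothetical file name).
grep -oE 'op_[a-z]+_[0-9]{8}_[0-9]{6}_[a-z0-9]+' ktrdr_output.log | tail -1
```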
### Step 2: Query Jaeger API

```bash
OPERATION_ID="op_training_20251113_123456_abc123"
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OPERATION_ID&limit=1" | jq
```
### Step 3: Analyze trace structure

```bash
# Get span summary with durations
# (service names live in the trace-level "processes" map, keyed by each span's processID)
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OPERATION_ID" | jq '
.data[0] as $t | $t.spans[] |
{
  span: .operationName,
  service: $t.processes[.processID].serviceName,
  duration_ms: (.duration / 1000),
  error: ([.tags[] | select(.key == "error" and (.value == true or .value == "true"))] | length > 0)
}' | jq -s 'sort_by(.duration_ms) | reverse'
```
### Step 4: Extract relevant attributes

```bash
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OPERATION_ID" | jq '
.data[0].spans[] |
{
  span: .operationName,
  attributes: (.tags | map({key: .key, value: .value}) | from_entries)
}'
```
## Common Diagnostic Patterns

### Pattern 1: Operation Stuck

```bash
# Check for worker selection and dispatch
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OP_ID" | jq '
.data[0].spans[] |
select(.operationName == "worker_registry.select_worker") |
.tags[] |
select(.key | startswith("worker_registry.")) |
{key: .key, value: .value}'
```
Look for:
- `worker_registry.total_workers: 0` → No workers started
- `worker_registry.capable_workers: 0` → No capable workers
- `worker_registry.selection_status: NO_WORKERS_AVAILABLE` → All busy
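To get those three values in one line, a small variation of the query above (a sketch; the field names come from the attribute reference below):

```bash
# One-line worker-selection verdict for the stuck operation.
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OP_ID" | jq -r '
[.data[0].spans[]
 | select(.operationName == "worker_registry.select_worker")
 | .tags[]
 | select(.key == "worker_registry.total_workers"
          or .key == "worker_registry.capable_workers"
          or .key == "worker_registry.selection_status")
 | "\(.key)=\(.value)"]
| join(", ")'
```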
### Pattern 2: Operation Failed

```bash
# Extract error details
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OP_ID" | jq '
.data[0] as $t | $t.spans[] |
select([.tags[] | select(.key == "error" and (.value == true or .value == "true"))] | length > 0) |
{
  span: .operationName,
  service: $t.processes[.processID].serviceName,
  exception_type: ([.tags[] | select(.key == "exception.type") | .value][0]),
  exception_message: ([.tags[] | select(.key == "exception.message") | .value][0])
}'
```
Common errors:
- `ConnectionRefusedError` → Service not running (check `http.url`)
- `ValueError` → Invalid input parameters
- `DataNotFoundError` → Data not loaded (check `data.symbol`, `data.timeframe`)
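When the exception type alone is not enough, the full stack trace is usually available too; this sketch assumes `exception.stacktrace` is recorded as a span tag, the same way `exception.type` and `exception.message` appear above:

```bash
# Print the full stack trace from failed spans.
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OP_ID" | jq -r '
.data[0].spans[]
| select([.tags[] | select(.key == "error" and (.value == true or .value == "true"))] | length > 0)
| .tags[]
| select(.key == "exception.stacktrace")
| .value'
```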
### Pattern 3: Operation Slow

```bash
# Find bottleneck span (longest duration)
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OP_ID" | jq '
.data[0].spans[] |
{
  span: .operationName,
  duration_ms: (.duration / 1000)
}' | jq -s 'sort_by(.duration_ms) | reverse | .[0]'
```
Common bottlenecks:
- `training.training_loop` → Check `training.device` (GPU vs CPU)
- `data.fetch` → Check `ib.latency_ms`
- `ib.fetch_historical` → Check `data.bars_requested`
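For a rough per-service breakdown of where the time went, a sketch (it sums raw span durations, so nested parent/child spans are double-counted):

```bash
# Total span duration per service, largest first.
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OP_ID" | jq '
.data[0] as $t
| [$t.spans[] | {service: $t.processes[.processID].serviceName, duration_ms: (.duration / 1000)}]
| group_by(.service)
| map({service: .[0].service, total_ms: (map(.duration_ms) | add)})
| sort_by(.total_ms) | reverse'
```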
### Pattern 4: Service Communication Failure

```bash
# Check HTTP calls between services
# (missing tags yield null rather than dropping the span from the output)
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OP_ID" | jq '
.data[0].spans[] |
select(.operationName | startswith("POST") or startswith("GET")) |
{
  http_call: .operationName,
  url: ([.tags[] | select(.key == "http.url") | .value][0]),
  status: ([.tags[] | select(.key == "http.status_code") | .value][0]),
  error: ([.tags[] | select(.key == "error.type") | .value][0])
}'
```
Look for:
- `http.status_code: null` → Connection failed
- `error.type: ConnectionRefusedError` → Target service not running
- `http.url` → Shows which service was being called
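To confirm whether the target is reachable now, re-try the exact URL the failing span was calling (a sketch; a POST endpoint may reject a plain GET, but a connection refusal is still conclusive):

```bash
# Pull the first http.url from the trace and probe it.
TARGET_URL=$(curl -s "http://localhost:16686/api/traces?tag=operation.id:$OP_ID" | jq -r '
  [.data[0].spans[] | .tags[] | select(.key == "http.url") | .value][0]')
curl -sv "$TARGET_URL" -o /dev/null
```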
## Key Span Attributes Reference

### Operation Attributes

- `operation.id` — Operation identifier
- `operation.type` — TRAINING, BACKTESTING, DATA_DOWNLOAD
- `operation.status` — PENDING, RUNNING, COMPLETED, FAILED
### Worker Selection

- `worker_registry.total_workers` — Total registered workers
- `worker_registry.available_workers` — Available workers
- `worker_registry.capable_workers` — Capable workers for this operation
- `worker_registry.selected_worker_id` — Which worker was chosen
- `worker_registry.selection_status` — SUCCESS, NO_WORKERS_AVAILABLE, NO_CAPABLE_WORKERS
### Progress Tracking

- `progress.percentage` — Current progress (0-100)
- `progress.phase` — Current execution phase
- `operations_service.instance_id` — OperationsService instance (check for mismatches)
### Error Context

- `exception.type` — Python exception class
- `exception.message` — Error message
- `exception.stacktrace` — Full stack trace
- `error.symbol`, `error.strategy` — Business context
### Performance

- `http.status_code` — HTTP response status
- `http.url` — Target URL for HTTP calls
- `ib.latency_ms` — IB Gateway latency
- `training.device` — cuda:0 or cpu
- `gpu.utilization_percent` — GPU usage
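To pull every attribute in one of these namespaces across the whole trace, filter tag keys by prefix (swap `training.` for `worker_registry.`, `data.`, etc.):

```bash
# Dump all attributes whose key starts with a given prefix.
PREFIX="training."
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OP_ID" | jq --arg p "$PREFIX" '
[.data[0].spans[] | .tags[] | select(.key | startswith($p)) | {key: .key, value: .value}]'
```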
## Response Template
When diagnosing with observability, use this structure:
🔍 **Trace Analysis for operation_id: {operation_id}**
**Trace Summary**:
- Trace ID: {trace_id}
- Total Duration: {duration_ms}ms
- Services: {list of services}
- Status: {OK/ERROR}
**Execution Flow**:
1. {span_name} ({service}) - {duration_ms}ms
2. {span_name} ({service}) - {duration_ms}ms
...
**Diagnosis**:
{identified_issue_with_evidence_from_spans}
**Root Cause**:
{root_cause_explanation_with_span_attributes}
**Solution**:
{recommended_fix_with_commands}
## Grafana Dashboards
Check Grafana for quick diagnostics before diving into traces.
| Dashboard | Path | Use Case |
|---|---|---|
| System Overview | /d/ktrdr-system-overview | Service health, error rates, latency |
| Worker Status | /d/ktrdr-worker-status | Worker capacity, resource usage |
| Operations | /d/ktrdr-operations | Operation counts, success rates |
### Quick Workflows
- "Is it working?" → System Overview: Healthy Services count
- "Why is it slow?" → System Overview: P95 Latency panel
- "Workers missing?" → Worker Status: Healthy Workers and Health Matrix
- "Operations failing?" → Operations: Success Rate and Status Distribution
## Benefits of Observability-First Debugging
- Diagnosis in FIRST response (not 10+ messages later)
- Complete context (all services, all phases, all attributes)
- Objective evidence (no guessing or assumptions)
- Distributed visibility (Backend → Worker → Host Service)
- Performance insights (identify bottlenecks immediately)
- Root cause analysis (trace error from source to root)
## Full Documentation

For comprehensive workflows and scenarios: `docs/debugging/observability-debugging-workflows.md`