TraceKit Docs

Alert Rules

Configure alert rules in TraceKit to get notified when your services need attention. Set up intelligent alerts based on error rates, latency, throughput, and health scores to detect errors, latency spikes, and anomalies in your distributed systems.

Quick Start

  1. Set up notification channels (Slack, Telegram, Discord, Teams, PagerDuty, or OpsGenie)
  2. Create an alert rule with conditions
  3. Get notified when thresholds are breached

Alert Types

Error Rate

Monitor the percentage of failed requests. Ideal for detecting when your service starts experiencing issues.

Example Use Case: Alert when authentication service error rate exceeds 5% over 5 minutes

Setting      Value
Alert Type   Error Rate
Scope        Service -- auth-service
Condition    error_rate > 5%
Time Window  5 minutes
Severity     Critical

Best Practice: Set thresholds based on your baseline. 5-10% is typical for warning, 15%+ for critical.
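As a rough illustration of the condition above, here is a minimal Python sketch of how an error-rate check over a window of requests could work. The `Request` record and function names are hypothetical for this example, not TraceKit SDK types:

```python
from dataclasses import dataclass

@dataclass
class Request:
    """Minimal request record for illustration (not a TraceKit type)."""
    failed: bool

def error_rate(requests):
    """Percentage of failed requests in a time window."""
    if not requests:
        return 0.0
    failures = sum(1 for r in requests if r.failed)
    return 100.0 * failures / len(requests)

def breaches_threshold(requests, threshold_pct=5.0):
    """True when the window's error rate exceeds the alert condition."""
    return error_rate(requests) > threshold_pct

# 3 failures out of 40 requests -> 7.5% error rate, above the 5% threshold
window = [Request(failed=i < 3) for i in range(40)]
print(error_rate(window))          # 7.5
print(breaches_threshold(window))  # True
```

In the real rule, TraceKit evaluates this condition continuously over the configured time window; the sketch only shows the arithmetic behind the threshold.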

Latency

Track response times and get alerted when requests are too slow. Choose from average, P50, P95, or P99 metrics.

Example Use Case: Alert when P95 latency exceeds 1000ms (1 second) for API endpoints

Setting      Value
Alert Type   Latency
Metric       P95
Scope        Service -- api-gateway
Condition    p95 > 1000ms
Time Window  5 minutes

Metric Guide:

  • Average: Good for overall trends
  • P95: Recommended for user experience (95% of requests)
  • P99: Catch worst-case scenarios
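To make the metric guide concrete, the sketch below computes percentiles from a list of latency samples using the nearest-rank method. This is an illustration of what P50/P95/P99 mean, not TraceKit's internal implementation, which may use a different interpolation:

```python
import math

def percentile(latencies_ms, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

# 100 samples with latencies 1ms..100ms
samples = list(range(1, 101))
print(percentile(samples, 50))  # 50 -> median request
print(percentile(samples, 95))  # 95 -> 95% of requests are at or below this
print(percentile(samples, 99))  # 99 -> worst-case tail
```

The P95 value is what the example rule above compares against 1000ms: 95% of requests finish at or below it, so it tracks typical user experience while ignoring the extreme tail.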

Throughput

Monitor requests per minute. Perfect for detecting when services stop processing traffic or get overwhelmed.

Service Down Detection:

Setting      Value
Condition    req_per_min < 1
Time Window  10 minutes
Severity     Critical

Traffic Spike Detection:

Setting      Value
Condition    req_per_min > 1000
Time Window  5 minutes
Severity     Warning

Throughput alerts are perfect for detecting service outages (low threshold) or DDoS attacks (high threshold).
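The two example rules above can be sketched as a single check against both thresholds. The function name and return strings are illustrative only, not part of any TraceKit SDK:

```python
def throughput_alert(request_count, window_minutes, low=1, high=1000):
    """Classify a window's requests-per-minute against both example thresholds."""
    rpm = request_count / window_minutes
    if rpm < low:
        return "critical: service may be down"
    if rpm > high:
        return "warning: traffic spike"
    return "ok"

print(throughput_alert(3, 10))     # 0.3 rpm  -> "critical: service may be down"
print(throughput_alert(12000, 5))  # 2400 rpm -> "warning: traffic spike"
print(throughput_alert(500, 5))    # 100 rpm  -> "ok"
```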

Health Score

Composite metric combining error rate and latency into a single health score (0-100). Higher is better.

Example Use Case: Alert when overall service health drops below 70

Setting      Value
Alert Type   Health Score
Scope        Global (All Services)
Condition    health_score < 70
Time Window  15 minutes

Formula: Health Score = (Error Rate Score x 50) + (Latency Score x 50). Score 100 = perfect, 0 = complete failure.
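A worked example of the formula, assuming each sub-score is normalized to the range 0.0 (worst) to 1.0 (best); how TraceKit derives the error-rate and latency sub-scores is not specified here:

```python
def health_score(error_rate_score, latency_score):
    """Composite 0-100 health score per the documented formula.

    Assumes each sub-score is already normalized to 0.0 (worst) .. 1.0 (best).
    """
    return error_rate_score * 50 + latency_score * 50

print(health_score(1.0, 1.0))  # 100.0 -> perfect health
print(health_score(0.9, 0.5))  # 70.0  -> exactly at the example alert threshold
print(health_score(0.0, 0.0))  # 0.0   -> complete failure
```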

Scope Types

  • Global -- Monitor all services together. Good for overall system health.
  • Service -- Monitor a specific service. Most common use case.
  • Endpoint -- Monitor a specific endpoint like "POST /api/users".

Best Practices

Set Appropriate Time Windows

Short windows (1-5 min) detect issues quickly but may cause false positives. Longer windows (15-30 min) are more stable but slower to alert.
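The trade-off can be seen numerically: the same one-minute error burst looks very different depending on window length. A small sketch over per-minute counts (illustrative arithmetic, not TraceKit's evaluator):

```python
def rate_over_window(errors_per_min, totals_per_min, window):
    """Error rate (%) over the trailing `window` minutes of per-minute counts."""
    errors = sum(errors_per_min[-window:])
    total = sum(totals_per_min[-window:])
    return 100.0 * errors / total

# One bursty minute (30 errors) in otherwise clean traffic of 100 req/min
errors = [0] * 29 + [30]
totals = [100] * 30
print(rate_over_window(errors, totals, 5))   # 6.0 -> a 5-minute window breaches a 5% threshold
print(rate_over_window(errors, totals, 30))  # 1.0 -> a 30-minute window absorbs the spike
```

The short window fires on a transient blip; the long window only fires on sustained problems, at the cost of slower detection.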

Use Cooldowns to Prevent Spam

Set cooldown periods (15-60 min) to avoid getting flooded with notifications for the same issue. You'll be notified periodically until the issue is resolved.
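The cooldown behavior described above amounts to suppressing repeat notifications until enough time has passed. A minimal sketch (times in minutes; the function is hypothetical, not a TraceKit API):

```python
def should_notify(last_sent_min, now_min, cooldown_min=30):
    """Send a notification only when the cooldown since the last one has elapsed.

    `last_sent_min` is None when no notification has been sent yet.
    """
    if last_sent_min is None:
        return True
    return now_min - last_sent_min >= cooldown_min

print(should_notify(None, 0))  # True  -> first alert fires immediately
print(should_notify(0, 10))    # False -> suppressed during the cooldown
print(should_notify(0, 30))    # True  -> re-notified once the cooldown elapses
```

This is why an unresolved issue still produces periodic reminders: each time the cooldown elapses while the condition holds, a fresh notification goes out.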

Layer Your Alerts

Combine multiple alert types: Error rate alerts catch failures, latency alerts catch slowdowns, and throughput alerts catch outages.

Start with Baselines

Monitor your services for a few days to understand normal behavior before setting alert thresholds. Use your P95 latency as a starting point.

SDK Setup Guides

Alerts work with trace data sent by any TraceKit SDK. Set up your SDK to start sending traces, then create alert rules.

Next Steps

Ready to set up your first alert?

  1. Set Up Channels
  2. Create Alert Rules
