Alert Rules
Get notified when your services need attention. Set up alerts based on error rates, latency, throughput, and health scores to catch failures, latency spikes, and anomalies across your distributed services.
Quick Start
- Set up notification channels (Slack, Telegram, Discord, Teams, PagerDuty, or OpsGenie)
- Create an alert rule with conditions
- Get notified when thresholds are breached
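If it helps to picture a rule as data before you build one in the UI, here is a minimal sketch of the pieces a rule needs: type, scope, condition, time window, cooldown, severity, and channels. The field names and shape are illustrative assumptions, not TraceKit's actual schema.

```typescript
// Illustrative only: these field names are assumptions, not TraceKit's actual rule schema.
type AlertType = "error_rate" | "latency" | "throughput" | "health_score";
type Severity = "warning" | "critical";

interface AlertRule {
  name: string;
  type: AlertType;
  scope: { level: "global" | "service" | "endpoint"; target?: string };
  condition: { metric: string; operator: ">" | "<"; threshold: number };
  timeWindowMinutes: number;
  cooldownMinutes: number;
  severity: Severity;
  channels: string[]; // e.g. ["slack", "pagerduty"]
}

// The error-rate example from the table below, expressed as data.
const authErrorRate: AlertRule = {
  name: "auth-service error rate",
  type: "error_rate",
  scope: { level: "service", target: "auth-service" },
  condition: { metric: "error_rate", operator: ">", threshold: 5 }, // percent
  timeWindowMinutes: 5,
  cooldownMinutes: 30,
  severity: "critical",
  channels: ["slack"],
};
```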
Alert Types
Error Rate
Monitor the percentage of failed requests. Ideal for detecting when your service starts experiencing issues.
Example Use Case: Alert when authentication service error rate exceeds 5% over 5 minutes
| Setting | Value |
|---|---|
| Alert Type | Error Rate |
| Scope | Service -- auth-service |
| Condition | error_rate > 5% |
| Time Window | 5 minutes |
| Severity | Critical |
Best Practice: Set thresholds based on your baseline. 5-10% is typical for warning, 15%+ for critical.
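To make the condition concrete, here is a rough sketch of how an error-rate check over a rolling window could be evaluated. The span shape (`timestampMs`, `isError`) and the evaluation logic are assumptions for illustration, not TraceKit's internals.

```typescript
interface Span { timestampMs: number; isError: boolean }

// Percentage of failed requests within the last `windowMinutes`.
function errorRate(spans: Span[], windowMinutes: number, nowMs = Date.now()): number {
  const cutoff = nowMs - windowMinutes * 60_000;
  const recent = spans.filter((s) => s.timestampMs >= cutoff);
  if (recent.length === 0) return 0;
  const failed = recent.filter((s) => s.isError).length;
  return (failed / recent.length) * 100;
}

// Condition from the table above: error_rate > 5% over 5 minutes.
const spans: Span[] = [
  { timestampMs: Date.now() - 60_000, isError: true },
  { timestampMs: Date.now() - 30_000, isError: false },
];
const shouldAlert = errorRate(spans, 5) > 5;
```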
Latency
Track response times and get alerted when requests are too slow. Choose from average, P50, P95, or P99 metrics.
Example Use Case: Alert when P95 latency exceeds 1000ms (1 second) for API endpoints
| Setting | Value |
|---|---|
| Alert Type | Latency |
| Metric | P95 |
| Scope | Service -- api-gateway |
| Condition | p95 > 1000ms |
| Time Window | 5 minutes |
Metric Guide:
- Average: Good for overall trends
- P95: Recommended for user experience (the latency 95% of requests stay under)
- P99: Catches worst-case, long-tail latency
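If you want to sanity-check these metrics against your own latency data, a windowed percentile can be computed roughly as below (nearest-rank method; the interpolation TraceKit uses may differ).

```typescript
// Nearest-rank percentile; TraceKit's exact interpolation may differ.
function percentile(latenciesMs: number[], p: number): number {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const latencies = [120, 180, 250, 300, 410, 520, 640, 980, 1200, 2100];
const average = latencies.reduce((sum, v) => sum + v, 0) / latencies.length;
const p95 = percentile(latencies, 95);
const p99 = percentile(latencies, 99);

// Condition from the table above: alert when p95 > 1000ms.
const shouldAlert = p95 > 1000;
```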
Throughput
Monitor requests per minute. Perfect for detecting when services stop processing traffic or get overwhelmed.
Service Down Detection:
| Setting | Value |
|---|---|
| Condition | req_per_min < 1 |
| Time Window | 10 minutes |
| Severity | Critical |
Traffic Spike Detection:
| Setting | Value |
|---|---|
| Condition | req_per_min > 1000 |
| Time Window | 5 minutes |
| Severity | Warning |
Use a low threshold to catch service outages and a high threshold to catch traffic floods such as DDoS attacks.
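As a sketch of how both checks work, the example below counts recent requests and compares the rate against the two thresholds above. The timestamp list and helper function are illustrative, not TraceKit's implementation.

```typescript
// Requests per minute over a rolling window, checked against both thresholds
// from the tables above. Timestamps and thresholds are illustrative.
function requestsPerMinute(timestampsMs: number[], windowMinutes: number, nowMs = Date.now()): number {
  const cutoff = nowMs - windowMinutes * 60_000;
  const count = timestampsMs.filter((t) => t >= cutoff).length;
  return count / windowMinutes;
}

const recentRequests: number[] = [/* span start times from the last 10 minutes */];

const serviceDown = requestsPerMinute(recentRequests, 10) < 1;     // critical
const trafficSpike = requestsPerMinute(recentRequests, 5) > 1000;  // warning
```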
Health Score
Composite metric combining error rate and latency into a single health score (0-100). Higher is better.
Example Use Case: Alert when overall service health drops below 70
| Setting | Value |
|---|---|
| Alert Type | Health Score |
| Scope | Global (All Services) |
| Condition | health_score < 70 |
| Time Window | 15 minutes |
Formula: Health Score = (Error Rate Score x 50) + (Latency Score x 50). Score 100 = perfect, 0 = complete failure.
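The formula above doesn't spell out how each component score is normalized, so the sketch below assumes each one is scaled to 0-1 (1 = perfect), which makes the weighted sum land on the 0-100 range. That normalization is an assumption for illustration.

```typescript
// Assumes each component score is normalized to 0..1 (1 = perfect), so the
// weighted sum lands on 0..100. The exact normalization is an assumption here.
function healthScore(errorRateScore: number, latencyScore: number): number {
  return errorRateScore * 50 + latencyScore * 50;
}

// Example: 2% errors -> 0.98 error-rate score; latency at 80% of budget -> 0.8.
const score = healthScore(0.98, 0.8); // 89

// Condition from the table above: alert when health_score < 70 for 15 minutes.
const shouldAlert = score < 70;
```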
Scope Types
- Global -- Monitor all services together. Good for overall system health.
- Service -- Monitor a specific service. Most common use case.
- Endpoint -- Monitor a specific endpoint like "POST /api/users".
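A rough mental model: scope decides which spans a rule evaluates. The matcher below is illustrative, assuming spans carry `service` and `endpoint` fields; it is not TraceKit's actual matching code.

```typescript
type Scope =
  | { level: "global" }
  | { level: "service"; service: string }
  | { level: "endpoint"; service: string; endpoint: string };

interface SpanMeta { service: string; endpoint: string } // e.g. "POST /api/users"

// Does a span fall inside a rule's scope? (Illustrative matcher, not TraceKit's.)
function inScope(span: SpanMeta, scope: Scope): boolean {
  switch (scope.level) {
    case "global":
      return true;
    case "service":
      return span.service === scope.service;
    case "endpoint":
      return span.service === scope.service && span.endpoint === scope.endpoint;
  }
}
```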
Best Practices
Set Appropriate Time Windows
Short windows (1-5 min) detect issues quickly but may cause false positives. Longer windows (15-30 min) are more stable but slower to alert.
Use Cooldowns to Prevent Spam
Set cooldown periods (15-60 min) to avoid getting flooded with notifications for the same issue. You'll be notified periodically until the issue is resolved.
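The effect of a cooldown can be modeled as a simple suppression check: once a notification fires, the rule stays quiet for the cooldown period, then fires again if the condition still breaches. This is a rough sketch, not TraceKit's scheduler.

```typescript
const lastNotifiedAt = new Map<string, number>(); // ruleId -> last notification time (ms)

// Fire at most once per cooldown window while the condition keeps breaching.
function shouldNotify(ruleId: string, breaching: boolean, cooldownMinutes: number, nowMs = Date.now()): boolean {
  if (!breaching) return false;
  const last = lastNotifiedAt.get(ruleId);
  if (last !== undefined && nowMs - last < cooldownMinutes * 60_000) return false;
  lastNotifiedAt.set(ruleId, nowMs);
  return true;
}

// With a 30-minute cooldown, a breach that lasts two hours produces roughly four notifications.
```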
Layer Your Alerts
Combine multiple alert types: Error rate alerts catch failures, latency alerts catch slowdowns, and throughput alerts catch outages.
Start with Baselines
Monitor your services for a few days to understand normal behavior before setting alert thresholds. Use your P95 latency as a starting point.
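One way to turn a few days of observation into starting thresholds is to add headroom to the observed baseline. The multipliers below are a rule of thumb (an assumption), not values TraceKit prescribes; the 5% and 15% floors simply echo the error-rate guidance above.

```typescript
// Derive starting thresholds from observed baselines. The multipliers are a
// rule of thumb (assumption), not values prescribed by TraceKit.
function suggestThresholds(baselineP95Ms: number, baselineErrorRatePct: number) {
  return {
    latencyP95Ms: Math.round(baselineP95Ms * 1.5),            // warn at 1.5x normal P95
    errorRateWarningPct: Math.max(5, baselineErrorRatePct * 2),
    errorRateCriticalPct: Math.max(15, baselineErrorRatePct * 4),
  };
}

// Example: baseline P95 of 400ms and 0.8% errors.
console.log(suggestThresholds(400, 0.8));
// -> { latencyP95Ms: 600, errorRateWarningPct: 5, errorRateCriticalPct: 15 }
```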
SDK Setup Guides
Alerts work with trace data sent by any TraceKit SDK. Set up your SDK to start sending traces, then create alert rules for the services it reports.
Next Steps
Ready to set up your first alert?
- Notification Channels -- Set up Slack, Discord, Microsoft Teams, PagerDuty, OpsGenie, email, or Telegram to receive alert notifications.
- Source Maps -- Upload source maps to resolve minified JavaScript stack traces to original source locations. Supports CLI upload, Vite and Webpack build plugins, and automatic debug ID injection.