Alert Threshold Recommender
Get data-driven alert thresholds based on your baseline metrics. Includes ready-to-paste configs for Prometheus, Grafana, and Datadog.
Your Baseline Metrics
Warning
300.00ms
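With a baseline mean of 200ms, the 2-sigma warning threshold is 300.00ms. If latency is roughly normally distributed, about 95.4% of normal traffic falls within 2 sigma of the mean, so the warning fires only on deviations outside that range.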
Critical
350.00ms
With a baseline mean of 200ms, the 3-sigma critical threshold is 350.00ms. About 99.7% of normal traffic falls within 3 sigma of the mean, so the critical alert triggers only for extreme outliers.
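The arithmetic behind the two thresholds above can be sketched in Python. The 50ms standard deviation is an assumption: the page does not show it, but it is the value implied by a 200ms mean with thresholds at 300ms and 350ms. The function names are hypothetical and are not part of the tool.

```python
def sigma_thresholds(mean_ms, stddev_ms, warn_k=2.0, crit_k=3.0):
    """Return (warning, critical) thresholds in ms as mean + k * stddev."""
    if stddev_ms < 0:
        raise ValueError("stddev_ms must be non-negative")
    return mean_ms + warn_k * stddev_ms, mean_ms + crit_k * stddev_ms


def to_prometheus_seconds(threshold_ms):
    """Format a millisecond threshold in seconds, for a *_seconds metric."""
    return f"{threshold_ms / 1000:.4f}"


# Assumed inputs: mean from the page, stddev inferred from the thresholds.
warning_ms, critical_ms = sigma_thresholds(mean_ms=200, stddev_ms=50)
print(warning_ms, critical_ms)  # 300.0 350.0
print(to_prometheus_seconds(warning_ms), to_prometheus_seconds(critical_ms))  # 0.3000 0.3500
```

The conversion to seconds matters because the generated Prometheus expressions compare against a metric measured in seconds, while the thresholds are displayed in milliseconds.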
Prometheus
groups:
  - name: latency_alerts
    rules:
      - alert: HighLatencyWarning
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.3000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected (warning)"
          description: "P95 latency is above 300.0ms"
      - alert: HighLatencyCritical
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.3500
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High latency detected (critical)"
          description: "P99 latency is above 350.0ms"
Datadog
{
  "name": "Latency Alert",
  "type": "metric alert",
  "query": "avg(last_5m):avg:trace.http.request.duration{env:production} > 350",
  "message": "{{#is_warning}}Latency above warning threshold (300.0ms){{/is_warning}} {{#is_alert}}Latency above critical threshold (350.0ms){{/is_alert}}",
  "options": {
    "thresholds": {
      "warning": 300,
      "critical": 350
    },
    "notify_no_data": false,
    "evaluation_delay": 60
  }
}
How Alert Thresholds Work
Good alerts fire on real problems without drowning you in noise. The two main approaches are statistical (standard deviation from the mean) and percentile-based (P95/P99 of your actual distribution).
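Standard deviation works well when your metric is roughly normally distributed. For such a metric, about 95.4% of values fall within 2 sigma of the mean and about 99.7% within 3 sigma, so the 2-sigma warning and 3-sigma critical thresholds fire only on values outside those ranges.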
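The 95.4% and 99.7% figures come from the normal distribution and can be checked with the standard library. This is a minimal sketch that assumes the metric is exactly normal. The helper names are hypothetical.

```python
import math


def coverage_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


def upper_tail(k):
    """Fraction of a normal distribution above mean + k * sigma."""
    return (1 - coverage_within(k)) / 2


for k in (2, 3):
    print(f"{k}-sigma: {coverage_within(k):.1%} inside, {upper_tail(k):.2%} above")
```

A latency alert only fires on the high side, so under the normality assumption the expected fraction of samples above the threshold is the upper tail: roughly 2.3% at 2 sigma and 0.13% at 3 sigma.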
Percentile-based thresholds work better for skewed distributions (like latency). P95 and P99 are read directly from your observed data, so they do not assume any particular distribution shape.
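As an illustration of the percentile approach, the sketch below reads P95 and P99 from a list of samples with the nearest-rank method. The sample values are made up for the example, and nearest-rank is only one of several common percentile definitions, so other tools may report slightly different numbers.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical skewed latency samples in ms: mostly fast, with a long tail.
latencies_ms = [100] * 90 + [200] * 5 + [400] * 4 + [2000]

print(percentile(latencies_ms, 95))  # 200
print(percentile(latencies_ms, 99))  # 400
```

With a long tail like this, a single very slow request inflates the mean and standard deviation, while P95 and P99 still reflect what most requests actually experience.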