TracekitTracekit

Alert Threshold Recommender

Get data-driven alert thresholds based on your baseline metrics. Includes ready-to-paste configs for Prometheus, Grafana, and Datadog.

Your Baseline Metrics

Warning

300.00ms

Based on your baseline mean of 200ms and 2-sigma, warning at 300.00ms catches deviations beyond 95.4% of normal traffic.

Critical

350.00ms

Based on your baseline mean of 200ms and 3-sigma, critical at 350.00ms triggers only for extreme outliers beyond 99.7% of normal traffic.

Prometheus Rules
groups:
- name: latency_alerts
  rules:
  - alert: HighLatencyWarning
    expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.3000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High latency detected (warning)"
      description: "P95 latency is above 300.0ms"
  - alert: HighLatencyCritical
    expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.3500
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High latency detected (critical)"
      description: "P99 latency is above 350.0ms"
Datadog Monitor
{
  "name": "Latency Alert",
  "type": "metric alert",
  "query": "avg(last_5m):avg:trace.http.request.duration{env:production} > 350",
  "message": "{{#is_warning}}Latency above warning threshold (300.0ms){{/is_warning}} {{#is_alert}}Latency above critical threshold (350.0ms){{/is_alert}}",
  "options": {
    "thresholds": {
      "warning": 300,
      "critical": 350
    },
    "notify_no_data": false,
    "evaluation_delay": 60
  }
}

How Alert Thresholds Work

Good alerts fire on real problems without drowning you in noise. The two main approaches are statistical (standard deviation from the mean) and percentile-based (P95/P99 of your actual distribution).

Standard deviation works well when your metric is roughly normally distributed. The 2-sigma warning catches 95.4% of normal variation; 3-sigma catches 99.7%.

Percentile-based thresholds work better for skewed distributions (like latency). P95 and P99 are directly observed from your data.

Learn how TraceKit automates alert configuration

Set up intelligent alerts in minutes -- Start free with TraceKit