Alert Threshold Recommender

Get data-driven alert thresholds based on your baseline metrics. Includes ready-to-paste configs for Prometheus, Grafana, and Datadog.

Signal Type

Method

Sensitivity

Your Baseline Metrics

Mean (ms)

Std Dev (ms)

P50 (ms)

P95 (ms)

P99 (ms)

Warning

300.00ms

Based on your baseline mean of 200ms and 2-sigma, warning at 300.00ms catches deviations beyond 95.4% of normal traffic.

Critical

350.00ms

Based on your baseline mean of 200ms and 3-sigma, critical at 350.00ms triggers only for extreme outliers beyond 99.7% of normal traffic.

Prometheus Rules

groups:
- name: latency_alerts
  rules:
  - alert: HighLatencyWarning
    expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.3000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High latency detected (warning)"
      description: "P95 latency is above 300.0ms"
  - alert: HighLatencyCritical
    expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.3500
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High latency detected (critical)"
      description: "P99 latency is above 350.0ms"

Datadog Monitor

{
  "name": "Latency Alert",
  "type": "metric alert",
  "query": "avg(last_5m):avg:trace.http.request.duration{env:production} > 350",
  "message": "{{#is_warning}}Latency above warning threshold (300.0ms){{/is_warning}} {{#is_alert}}Latency above critical threshold (350.0ms){{/is_alert}}",
  "options": {
    "thresholds": {
      "warning": 300,
      "critical": 350
    },
    "notify_no_data": false,
    "evaluation_delay": 60
  }
}

How Alert Thresholds Work

Good alerts fire on real problems without drowning you in noise. The two main approaches are statistical (standard deviation from the mean) and percentile-based (P95/P99 of your actual distribution).

Standard deviation works well when your metric is roughly normally distributed. The 2-sigma warning catches 95.4% of normal variation; 3-sigma catches 99.7%.

Percentile-based thresholds work better for skewed distributions (like latency). P95 and P99 are directly observed from your data.

Learn how TraceKit automates alert configuration

Set up intelligent alerts in minutes -- Start free with TraceKit