Anomaly Detection and Alerting System Designer

Design a practical anomaly detection and alerting system for your business metrics — including threshold logic, statistical methods, alert routing, and false positive reduction strategies.

Prompt Template

You are a data engineering and analytics expert. Help me design an anomaly detection and alerting system for my key business metrics.

Business context:
- Industry: [industry]
- Metrics to monitor: [list 3–6 key metrics, e.g., daily revenue, signup rate, API error rate, page load time]
- Data update frequency: [real-time / hourly / daily]
- Current monitoring: [describe what you have now, e.g., manual dashboards, basic threshold alerts]
- Data stack: [tools you use, e.g., BigQuery, Snowflake, Grafana, Datadog, Looker]
- Team receiving alerts: [who gets them — data team, engineering, ops, leadership]
- Pain point: [e.g., too many false positives, missed real anomalies, alert fatigue]

Please design:

1. **Metric Classification**
   - Categorize each metric by type: count, rate, latency, ratio
   - Define normal patterns: seasonal, day-of-week, trend, stationary
   - Identify which metrics need one-sided vs two-sided detection

2. **Detection Methods** (for each metric)
   - Static threshold (for simple metrics)
   - Z-score / modified Z-score (for normally distributed data)
   - Moving average + standard deviation bands
   - Seasonal decomposition (for metrics with weekly/monthly patterns)
   - Percentage change from previous period
   - Recommend the best method for each metric with reasoning

3. **Alert Severity Framework**
   - P1 (critical) / P2 (warning) / P3 (info) definitions
   - Threshold multipliers for each severity
   - Escalation paths

4. **False Positive Reduction**
   - Minimum duration before alerting (avoid spike noise)
   - Composite alert rules (e.g., alert only if metric A AND metric B are anomalous)
   - Day-of-week and holiday awareness
   - Cool-down periods between repeat alerts

5. **Implementation Plan**
   - SQL or Python pseudocode for 2 detection methods
   - Alert routing setup (Slack, PagerDuty, email)
   - Dashboard design for anomaly visualization

6. **Maintenance & Tuning**
   - How to review and tune thresholds monthly
   - Alert quality scorecard (precision, recall, noise ratio)

Example Output

Anomaly Detection System: E-Commerce SaaS Platform

1. Metric Classification

| Metric | Type | Pattern | Detection Side |
|--------|------|---------|----------------|
| Daily revenue | Count ($) | Weekly seasonal + trend | Two-sided (drops AND unexpected spikes) |
| Signup rate | Rate (per hour) | Day-of-week pattern | One-sided (drops only) |
| API error rate | Ratio (%) | Stationary (~0.5%) | One-sided (increases only) |
| P95 page load time | Latency (ms) | Stationary with spikes | One-sided (increases only) |
| Cart abandonment rate | Ratio (%) | Weekly seasonal | Two-sided |

2. Detection Methods

Daily Revenue — Seasonal Decomposition + Residual Z-Score

Why: Revenue has strong day-of-week patterns (weekends lower). A simple threshold would fire every Saturday.

```sql
-- BigQuery: 28-day seasonal baseline (per day of week)
WITH baseline AS (
  SELECT
    EXTRACT(DAYOFWEEK FROM date) AS dow,
    AVG(revenue) AS avg_rev,
    STDDEV(revenue) AS std_rev
  FROM metrics.daily_revenue
  WHERE date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
                 AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  GROUP BY dow
)
SELECT
  t.date,
  t.revenue,
  b.avg_rev,
  (t.revenue - b.avg_rev) / NULLIF(b.std_rev, 0) AS z_score
FROM metrics.daily_revenue t
JOIN baseline b
  ON EXTRACT(DAYOFWEEK FROM t.date) = b.dow
WHERE t.date = CURRENT_DATE()
```
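For teams that would rather run detection in Python than in the warehouse, the modified Z-score from the method menu above can be sketched as follows. This is an illustrative stand-alone snippet (function names are mine, not from any library); the 3.5 cutoff is the conventional starting point for modified Z-scores, not a universal rule:

```python
import statistics

def modified_z_scores(values):
    """Modified Z-score: uses median and MAD instead of mean and
    standard deviation, so a single extreme point doesn't inflate
    the baseline. 0.6745 rescales MAD to be comparable to one sigma."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        # All points identical (or nearly): no spread, no anomaly signal.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]

def is_anomalous(value, history, threshold=3.5):
    """True if the newest value is an outlier relative to its history."""
    scores = modified_z_scores(history + [value])
    return abs(scores[-1]) > threshold
```

Because the median and MAD ignore extremes, this variant stays stable even when the history itself contains a past incident, which is exactly when a plain Z-score baseline gets skewed.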

API Error Rate — Static Threshold + Rate of Change

Why: Error rate should be near-zero; any significant increase is meaningful.

- P3 (info): Error rate > 1% for 5+ minutes
- P2 (warning): Error rate > 3% for 5+ minutes
- P1 (critical): Error rate > 5% for 2+ minutes OR >10% instantaneous
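The tiered thresholds above combine a level check with a minimum duration. A minimal Python sketch of that logic (class and tier values are illustrative, mirroring the example thresholds, assuming one reading per minute) could look like:

```python
from collections import deque

# (severity, rate threshold, minimum consecutive minutes), most severe first.
TIERS = [("P1", 0.05, 2), ("P2", 0.03, 5), ("P3", 0.01, 5)]

class ErrorRateMonitor:
    def __init__(self, window_minutes=5):
        # Keep only the trailing window; must cover the longest tier duration.
        self.recent = deque(maxlen=window_minutes)

    def observe(self, error_rate):
        """Record a per-minute error rate; return a severity or None."""
        self.recent.append(error_rate)
        if error_rate > 0.10:  # instantaneous critical, no duration needed
            return "P1"
        for severity, threshold, duration in TIERS:
            window = list(self.recent)[-duration:]
            if len(window) == duration and all(r > threshold for r in window):
                return severity
        return None
```

Checking the most severe tier first means a sustained 6% error rate reports as P1 after two minutes rather than waiting out the longer P2 window.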

4. False Positive Reduction

1. **Minimum duration:** No alert fires unless anomaly persists for ≥2 consecutive data points
2. **Composite rules:** Revenue drop + signup drop = P1. Revenue drop alone = P2 (could be a payment processor issue)
3. **Holiday suppression:** Maintain a holiday calendar; widen thresholds by 2× on known anomalous days
4. **Cool-down:** After a P1 alert, suppress same-metric alerts for 30 min unless severity escalates
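The cool-down rule is small but easy to get wrong: suppressed alerts must not reset the timer, and an escalation must break through. A sketch under those assumptions (class name and 30-minute default are mine, matching rule 4):

```python
import time

class AlertThrottle:
    """Suppress repeat alerts for the same metric during a cool-down,
    unless the new alert is more severe than the one that fired."""
    SEVERITY_RANK = {"P3": 1, "P2": 2, "P1": 3}

    def __init__(self, cooldown_seconds=1800):  # 30 minutes
        self.cooldown = cooldown_seconds
        self.last_fired = {}  # metric -> (timestamp, severity)

    def should_fire(self, metric, severity, now=None):
        now = now if now is not None else time.time()
        prev = self.last_fired.get(metric)
        if prev is not None:
            prev_time, prev_sev = prev
            in_cooldown = (now - prev_time) < self.cooldown
            escalated = (self.SEVERITY_RANK[severity]
                         > self.SEVERITY_RANK[prev_sev])
            if in_cooldown and not escalated:
                return False  # suppressed; do NOT reset the timer
        self.last_fired[metric] = (now, severity)
        return True
```

Note that `last_fired` is only updated when an alert actually goes out, so a stream of suppressed repeats cannot extend the cool-down indefinitely.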

6. Monthly Tuning Scorecard

| Metric | Alerts Fired | True Positives | False Positives | Precision | Action |
|--------|--------------|----------------|-----------------|-----------|--------|
| Revenue | 8 | 6 | 2 | 75% | Widen weekend threshold |
| Error rate | 3 | 3 | 0 | 100% | No change |
| Target | — | — | — | >80% | — |
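The precision column above is just true positives divided by alerts fired, compared against the 80% target. A tiny helper (hypothetical function name; the target value comes from the scorecard's Target row) to generate rows from incident-review counts:

```python
def scorecard_row(metric, true_positives, false_positives, target=0.80):
    """Summarize one metric's monthly alert quality.

    Precision = true positives / alerts fired; rows below the target
    get flagged for tuning (widen threshold or switch methods)."""
    fired = true_positives + false_positives
    precision = true_positives / fired if fired else None
    needs_tuning = precision is not None and precision < target
    return {
        "metric": metric,
        "alerts_fired": fired,
        "precision": precision,
        "action": "tune" if needs_tuning else "no change",
    }
```

Feeding in the example numbers (`scorecard_row("Revenue", 6, 2)`) reproduces the 75% row above, which falls below target and gets flagged.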

Tips for Best Results

- 💡 Start with simple methods (static thresholds, Z-scores) before jumping to ML-based detection — they're easier to debug and explain.
- 💡 The #1 killer of alerting systems is false positives. Optimize for precision first, then improve recall.
- 💡 Always include day-of-week and holiday awareness — more than half of false positives come from predictable calendar patterns.
- 💡 Review alert quality monthly. If a metric has <70% precision, either widen its threshold or switch detection methods.