Anomaly Detection and Alerting System Designer
Design a practical anomaly detection and alerting system for your business metrics — including threshold logic, statistical methods, alert routing, and false positive reduction strategies.
Prompt Template
You are a data engineering and analytics expert. Help me design an anomaly detection and alerting system for my key business metrics.

Business context:
- Industry: [industry]
- Metrics to monitor: [list 3–6 key metrics, e.g., daily revenue, signup rate, API error rate, page load time]
- Data update frequency: [real-time / hourly / daily]
- Current monitoring: [describe what you have now, e.g., manual dashboards, basic threshold alerts]
- Data stack: [tools you use, e.g., BigQuery, Snowflake, Grafana, Datadog, Looker]
- Team receiving alerts: [who gets them — data team, engineering, ops, leadership]
- Pain point: [e.g., too many false positives, missed real anomalies, alert fatigue]

Please design:

1. **Metric Classification**
   - Categorize each metric by type: count, rate, latency, ratio
   - Define normal patterns: seasonal, day-of-week, trend, stationary
   - Identify which metrics need one-sided vs two-sided detection
2. **Detection Methods** (for each metric)
   - Static threshold (for simple metrics)
   - Z-score / modified Z-score (for normally distributed data)
   - Moving average + standard deviation bands
   - Seasonal decomposition (for metrics with weekly/monthly patterns)
   - Percentage change from previous period
   - Recommend the best method for each metric with reasoning
3. **Alert Severity Framework**
   - P1 (critical) / P2 (warning) / P3 (info) definitions
   - Threshold multipliers for each severity
   - Escalation paths
4. **False Positive Reduction**
   - Minimum duration before alerting (avoid spike noise)
   - Composite alert rules (e.g., alert only if metric A AND metric B are anomalous)
   - Day-of-week and holiday awareness
   - Cool-down periods between repeat alerts
5. **Implementation Plan**
   - SQL or Python pseudocode for 2 detection methods
   - Alert routing setup (Slack, PagerDuty, email)
   - Dashboard design for anomaly visualization
6. **Maintenance & Tuning**
   - How to review and tune thresholds monthly
   - Alert quality scorecard (precision, recall, noise ratio)
Example Output
Anomaly Detection System: E-Commerce SaaS Platform
1. Metric Classification
| Metric | Type | Pattern | Detection Side |
|--------|------|---------|----------------|
| Daily revenue | Count ($) | Weekly seasonal + trend | Two-sided (drops AND unexpected spikes) |
| Signup rate | Rate (per hour) | Day-of-week pattern | One-sided (drops only) |
| API error rate | Ratio (%) | Stationary (~0.5%) | One-sided (increases only) |
| P95 page load time | Latency (ms) | Stationary with spikes | One-sided (increases only) |
| Cart abandonment rate | Ratio (%) | Weekly seasonal | Two-sided |
2. Detection Methods
Daily Revenue — Seasonal Decomposition + Residual Z-Score
Why: Revenue has strong day-of-week patterns (weekends lower). A simple threshold would fire every Saturday.
```sql
-- BigQuery: 28-day day-of-week baseline vs. today's revenue
WITH baseline AS (
  SELECT
    EXTRACT(DAYOFWEEK FROM date) AS dow,
    AVG(revenue) AS avg_rev,
    STDDEV(revenue) AS std_rev
  FROM metrics.daily_revenue
  WHERE date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
                 AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  GROUP BY dow
)
SELECT
  t.date,
  t.revenue,
  b.avg_rev,
  (t.revenue - b.avg_rev) / NULLIF(b.std_rev, 0) AS z_score
FROM metrics.daily_revenue t
JOIN baseline b
  ON EXTRACT(DAYOFWEEK FROM t.date) = b.dow
WHERE t.date = CURRENT_DATE()
```
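For teams that run detection in a scheduled Python job rather than in the warehouse, the same day-of-week baseline can be sketched in pandas. Column names (`date`, `revenue`) and the 28-day lookback mirror the SQL above; this is an illustrative sketch, not a drop-in implementation:

```python
import numpy as np
import pandas as pd

def dow_zscore(df: pd.DataFrame, lookback_days: int = 28) -> float:
    """Z-score of the latest day's revenue against a day-of-week baseline.

    Expects a DataFrame with 'date' (datetime) and 'revenue' columns;
    the most recent day is compared against prior days with the same
    day of week inside the lookback window.
    """
    df = df.sort_values("date")
    today = df.iloc[-1]
    history = df.iloc[:-1].tail(lookback_days)
    same_dow = history[history["date"].dt.dayofweek == today["date"].dayofweek]
    mean, std = same_dow["revenue"].mean(), same_dow["revenue"].std()
    if not std or np.isnan(std):
        return 0.0  # no variance in baseline: treat as not anomalous
    return float((today["revenue"] - mean) / std)
```

The guard against a zero or undefined standard deviation plays the same role as `NULLIF(b.std_rev, 0)` in the query.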
API Error Rate — Static Threshold + Rate of Change
Why: Error rate should be near-zero; any significant increase is meaningful.
- P3 (info): Error rate > 1% for 5+ minutes
- P2 (warning): Error rate > 3% for 5+ minutes
- P1 (critical): Error rate > 5% for 2+ minutes OR >10% instantaneous
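The tiered rules above translate naturally into a small classifier over the trailing minutes of the error-rate series. A minimal sketch, with thresholds taken from the bullets (values are illustrative, not prescriptive):

```python
def classify_error_rate(samples, info=0.01, warn=0.03, crit=0.05, spike=0.10):
    """Map recent per-minute error-rate readings (fractions, newest last)
    to an alert severity, mirroring the P1/P2/P3 rules above."""
    if samples and samples[-1] > spike:
        return "P1"  # instantaneous spike above 10%
    if len(samples) >= 2 and all(s > crit for s in samples[-2:]):
        return "P1"  # above 5% for 2+ consecutive minutes
    if len(samples) >= 5 and all(s > warn for s in samples[-5:]):
        return "P2"  # above 3% sustained for 5+ minutes
    if len(samples) >= 5 and all(s > info for s in samples[-5:]):
        return "P3"  # above 1% sustained for 5+ minutes
    return None      # no alert
```

Checking the most severe condition first means a hard spike pages immediately instead of waiting out the 5-minute window.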
4. False Positive Reduction
1. **Minimum duration:** No alert fires unless anomaly persists for ≥2 consecutive data points
2. **Composite rules:** Revenue drop + signup drop = P1. Revenue drop alone = P2 (could be a payment processor issue)
3. **Holiday suppression:** Maintain a holiday calendar; widen thresholds by 2× on known anomalous days
4. **Cool-down:** After a P1 alert, suppress same-metric alerts for 30 min unless severity escalates
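Rules 1 and 4 are stateful: the alerter has to remember what it fired recently. One way to sketch the cool-down gate, assuming a 30-minute window and an escalation bypass (class and method names here are hypothetical):

```python
import time

class AlertSuppressor:
    """Cool-down gate: drop repeat alerts for the same metric inside the
    window, unless the new alert's severity is strictly higher."""

    RANK = {"P3": 1, "P2": 2, "P1": 3}

    def __init__(self, cooldown_s=1800):  # 30-minute window by default
        self.cooldown_s = cooldown_s
        self._last = {}  # metric name -> (timestamp, severity)

    def should_fire(self, metric, severity, now=None):
        now = time.time() if now is None else now
        prev = self._last.get(metric)
        if prev is not None:
            prev_ts, prev_sev = prev
            if (now - prev_ts < self.cooldown_s
                    and self.RANK[severity] <= self.RANK[prev_sev]):
                return False  # suppressed: within cool-down, no escalation
        self._last[metric] = (now, severity)
        return True
```

Suppressed alerts do not reset the window, so a flapping metric cannot silence itself indefinitely.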
6. Monthly Tuning Scorecard
| Metric | Alerts Fired | True Positives | False Positives | Precision | Action |
|--------|-------------|-----------------|-----------------|-----------|--------|
| Revenue | 8 | 6 | 2 | 75% | Widen weekend threshold |
| Error rate | 3 | 3 | 0 | 100% | No change |
| Target | — | — | — | >80% | — |
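The precision column is simply true positives divided by total alerts fired. A tiny helper (illustrative) reproduces the table's figures:

```python
def alert_precision(true_pos, false_pos):
    """Share of fired alerts that were real anomalies: TP / (TP + FP)."""
    fired = true_pos + false_pos
    return true_pos / fired if fired else 0.0
```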
Tips for Best Results
- 💡 Start with simple methods (static thresholds, Z-scores) before jumping to ML-based detection — they're easier to debug and explain.
- 💡 The #1 killer of alerting systems is false positives. Optimize for precision first, then improve recall.
- 💡 Always include day-of-week and holiday awareness — more than half of false positives come from predictable calendar patterns.
- 💡 Review alert quality monthly. If a metric has <70% precision, either widen its threshold or switch detection methods.
Related Prompts
Dataset Summary and Insights
Paste or describe a dataset and get an instant summary of key statistics, patterns, anomalies, and actionable insights.
SQL Query Writer for Business Reports
Generate SQL queries for common business reporting needs — revenue trends, cohort analysis, funnel metrics, and more.
Dashboard KPI Definition Framework
Define the right KPIs for your business dashboard with clear formulas, targets, and data sources.