
Statistical anomaly detection as input to the analysis engine #693

@erikdarlingdata

Description

Summary

PerformanceMonitor's inference engine uses rule-based analysis: collect facts, score them with amplifiers, traverse relationship graphs to produce findings. This works well for known patterns but can miss novel problems. Statistical anomaly detection complements rule-based analysis by answering a different question: rules say "this pattern means X," anomaly detection says "this metric is behaving unusually" — even when no rule exists for that specific situation.

Combining rule-based analysis with statistical anomaly detection is widely considered more effective than either approach alone: rules catch known-bad patterns, anomaly detection catches deviations no rule anticipated.

How It Would Work

Anomaly detection runs as an additional scoring input to the existing analysis engine, not as a replacement:

  1. Collect metric snapshots (already happening)
  2. Compare current values against historical baselines (ties into dynamic baselines — issue tracking that separately)
  3. Score the degree of deviation (z-score, percentile rank, or similar)
  4. Feed anomaly scores into the inference engine as facts alongside the existing rule-based facts
  5. Amplify rule-based findings when anomaly detection confirms the deviation is statistically unusual

Example Flow

  • Rule detects: "CXPACKET waits are > 30% of total waits" (known pattern, scores Medium)
  • Anomaly detection adds: "CXPACKET waits are 4.2 standard deviations above the baseline for this time window" (statistically unusual)
  • Combined: The finding's severity is amplified because both the rule and the anomaly detector agree this is significant
  • vs: "CXPACKET waits are > 30% of total waits" but anomaly score is low (this is normal for this workload) → severity stays Medium or is dampened
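The amplify/dampen step above can be sketched as a small function. This is an illustrative sketch, not the project's actual API: the `Severity` enum, `combine_severity` name, and the ±3σ/±1σ thresholds are all assumptions.

```python
from enum import IntEnum

class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

def combine_severity(rule_severity: Severity, z_score: float,
                     amplify_at: float = 3.0,
                     dampen_below: float = 1.0) -> Severity:
    """Boost a rule-based finding when the anomaly detector agrees the
    deviation is unusual; dampen it when the value is normal for this
    workload; otherwise leave the rule's severity alone."""
    if abs(z_score) >= amplify_at:
        return Severity(min(rule_severity + 1, Severity.HIGH))
    if abs(z_score) < dampen_below:
        return Severity(max(rule_severity - 1, Severity.LOW))
    return rule_severity

# CXPACKET > 30% scores Medium; a 4.2-sigma deviation bumps it to High.
print(combine_severity(Severity.MEDIUM, 4.2).name)  # HIGH
```

The symmetric dampening path is what keeps "30% CXPACKET is just Tuesday for this box" from paging anyone.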

What to Detect Anomalies On

Metric-level anomalies (is this value unusual?)

  • CPU utilization
  • Wait stat proportions and absolute values
  • Batch requests/sec (sudden drops or spikes)
  • Query duration aggregates
  • Memory utilization
  • I/O latency
  • Session counts
  • Blocking/deadlock event counts

Pattern-level anomalies (is this combination unusual?)

  • CPU high + batch requests low = something is stuck (not just busy)
  • Wait profile shift: top wait type changed from SOS_SCHEDULER_YIELD to LCK_M_X
  • Query mix change: execution count distribution across databases shifted
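Pattern-level checks like the first two bullets reduce to boolean combinations over per-metric signals. A hedged sketch; the thresholds and function names here are invented for illustration:

```python
def stuck_not_busy(cpu_pct: float, batch_req_z: float) -> bool:
    """High CPU while batch requests/sec sits far *below* its baseline:
    the server is working hard but not serving requests."""
    return cpu_pct > 85.0 and batch_req_z < -2.0

def wait_profile_shifted(prev_top_wait: str, curr_top_wait: str) -> bool:
    """The dominant wait type changed between snapshots."""
    return prev_top_wait != curr_top_wait

print(stuck_not_busy(93.0, -3.1))                              # True
print(wait_profile_shifted("SOS_SCHEDULER_YIELD", "LCK_M_X"))  # True
```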

Statistical Approaches (start simple)

Phase 1: Z-score against rolling baseline

  • For each metric, maintain a rolling mean and standard deviation (last 30 days, same hour-of-day, same day-of-week)
  • Current z-score = (current_value - mean) / std_dev
  • Flag anything beyond ±3σ as anomalous
  • This is simple, interpretable, and effective for most time-series metrics

Phase 2 (future): More sophisticated methods

  • Seasonal decomposition (STL) for metrics with strong weekly patterns
  • Isolation Forest for detecting multivariate anomalies (unusual combinations of metrics)
  • CUSUM or exponentially weighted moving average for detecting gradual drifts
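Of the Phase 2 options, the EWMA drift detector is simple enough to sketch. The smoothing factor and relative threshold below are assumptions, not tuned values:

```python
def ewma_drift(values, alpha: float = 0.1, threshold: float = 0.2):
    """Return True once the exponentially weighted moving average drifts
    more than `threshold` (relative) away from the starting level.
    Catches gradual creep that a point-in-time z-score can miss."""
    ewma = start = values[0]
    for v in values[1:]:
        ewma = alpha * v + (1 - alpha) * ewma
        if start and abs(ewma - start) / abs(start) > threshold:
            return True
    return False

# A slow ramp from 10 toward 16 trips the detector; a flat series does not.
print(ewma_drift([10 + i * 0.2 for i in range(30)]))  # True
print(ewma_drift([10.0] * 30))                        # False
```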

Integration Points

Analysis Engine (both Dashboard and Lite)

  • Add an `AnomalyScorer` that runs alongside existing fact collectors
  • Produce `AnomalyFact` objects with the metric name, current value, baseline, z-score, and severity
  • Existing amplifier system can use anomaly scores to boost or dampen finding severity
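One possible shape for the `AnomalyFact` described above. The field names mirror the bullet (metric name, current value, baseline, z-score, severity); the dataclass itself and the `make_fact` helper are assumptions about how facts might be represented, not the engine's real types.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnomalyFact:
    metric: str            # e.g. "cpu_utilization_pct"
    current_value: float
    baseline_mean: float
    z_score: float
    severity: str          # "informational" when nothing crosses the threshold

def make_fact(metric: str, value: float, mean: float, std: float,
              sigma: float = 3.0) -> AnomalyFact:
    z = (value - mean) / std if std else 0.0
    sev = "anomalous" if abs(z) > sigma else "informational"
    return AnomalyFact(metric, value, mean, z, sev)

fact = make_fact("cpu_utilization_pct", 92.0, 51.0, 10.0)
print(fact.z_score, fact.severity)  # 4.1 anomalous
```

Emitting these as plain facts lets the existing amplifier system consume them with no special-casing, which is the "input, not replacement" point of the design.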

MCP Tools

  • `analyze_server` output could include anomaly context: "CPU utilization 92% (4.1σ above baseline for Tuesday 2pm)"
  • `get_analysis_facts` could expose anomaly scores alongside rule-based scores
  • New tool or parameter: `get_anomalies` to list all currently anomalous metrics

UI (both Dashboard and Lite)

  • Anomalous metrics highlighted in the progressive summary view (issue #689, "Progressive server summary as landing view per server")
  • Trend charts could mark anomalous data points (dot color change or marker)
  • Pairs with baseline bands (from dynamic baselines issue) — anomalies are the points outside the band

Design Notes

  • This is an enhancement to the existing engine architecture, not a new system
  • Phase 1 (z-score) requires only basic statistics — no ML libraries needed
  • The key value is combining anomaly scores with rule-based findings, not replacing rules
  • False positive management: anomaly detection will flag things that are unusual but not problematic. The rule engine provides the "is this actually bad?" judgment. Anomalies without matching rules should be surfaced at lower severity (informational)
  • Requires sufficient historical data (2-4 weeks minimum) — same constraint as dynamic baselines
  • Applies to both Dashboard and Lite, plus MCP analysis tools
