Summary
PerformanceMonitor's inference engine uses rule-based analysis: collect facts, score them with amplifiers, traverse relationship graphs to produce findings. This works well for known patterns but can miss novel problems. Statistical anomaly detection complements rule-based analysis by answering a different question: rules say "this pattern means X," anomaly detection says "this metric is behaving unusually" — even when no rule exists for that specific situation.
In practice, combining rule-based analysis with statistical anomaly detection tends to outperform either approach alone: rules catch known-bad patterns, and anomaly detection catches deviations no rule anticipated.
How It Would Work
Anomaly detection runs as an additional scoring input to the existing analysis engine, not as a replacement:
- Collect metric snapshots (already happening)
- Compare current values against historical baselines (ties into dynamic baselines — issue tracking that separately)
- Score the degree of deviation (z-score, percentile rank, or similar)
- Feed anomaly scores into the inference engine as facts alongside the existing rule-based facts
- Amplify rule-based findings when anomaly detection confirms the deviation is statistically unusual
Example Flow
- Rule detects: "CXPACKET waits are > 30% of total waits" (known pattern, scores Medium)
- Anomaly detection adds: "CXPACKET waits are 4.2 standard deviations above the baseline for this time window" (statistically unusual)
- Combined: The finding's severity is amplified because both the rule and the anomaly detector agree this is significant
- By contrast: "CXPACKET waits are > 30% of total waits" but the anomaly score is low (this is normal for this workload) → severity stays Medium or is dampened
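The amplify/dampen logic above can be sketched as a small function. This is an illustrative sketch only; the `Severity` enum, `combine_severity` name, and the ±3σ / ±1σ thresholds are assumptions, not the project's actual API.

```python
# Hypothetical sketch: combine a rule-based severity with an anomaly
# z-score. Names and thresholds are illustrative.
from enum import IntEnum

class Severity(IntEnum):
    INFO = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3

def combine_severity(rule_severity: Severity, z_score: float) -> Severity:
    """Amplify when the anomaly detector agrees, dampen when it disagrees."""
    if abs(z_score) >= 3.0:   # statistically unusual: amplify one step
        return Severity(min(rule_severity + 1, Severity.HIGH))
    if abs(z_score) < 1.0:    # normal for this workload: dampen one step
        return Severity(max(rule_severity - 1, Severity.INFO))
    return rule_severity      # inconclusive: keep the rule's verdict

# CXPACKET > 30% (Medium) confirmed at 4.2 sigma -> amplified
print(combine_severity(Severity.MEDIUM, 4.2).name)  # HIGH
# Same rule fires, but the deviation is normal here -> dampened
print(combine_severity(Severity.MEDIUM, 0.4).name)  # LOW
```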
What to Detect Anomalies On
Metric-level anomalies (is this value unusual?)
- CPU utilization
- Wait stat proportions and absolute values
- Batch requests/sec (sudden drops or spikes)
- Query duration aggregates
- Memory utilization
- I/O latency
- Session counts
- Blocking/deadlock event counts
Pattern-level anomalies (is this combination unusual?)
- CPU high + batch requests low = something is stuck (not just busy)
- Wait profile shift: top wait type changed from SOS_SCHEDULER_YIELD to LCK_M_X
- Query mix change: execution count distribution across databases shifted
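The "stuck, not busy" combination could be expressed as a check over two per-metric z-scores. A minimal sketch, assuming hypothetical thresholds; the function name and the ±2σ cutoffs are illustrative, not from the codebase:

```python
# Pattern-level check: each metric alone may look fine, but high CPU
# combined with collapsed throughput suggests the server is stuck
# rather than busy. Thresholds are assumptions for illustration.
def stuck_not_busy(cpu_z: float, batch_req_z: float) -> bool:
    """Flag when CPU deviates high while batch requests/sec deviates low."""
    return cpu_z > 2.0 and batch_req_z < -2.0

print(stuck_not_busy(3.1, -2.5))  # True: high CPU, collapsed throughput
print(stuck_not_busy(3.1, 2.8))   # False: high CPU but also high load
```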
Statistical Approaches (start simple)
Phase 1: Z-score against rolling baseline
- For each metric, maintain a rolling mean and standard deviation (last 30 days, same hour-of-day, same day-of-week)
- Current z-score = (current_value - mean) / std_dev
- Flag anything beyond ±3σ as anomalous
- This is simple, interpretable, and effective for most time-series metrics
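The Phase 1 steps can be sketched with per-(day-of-week, hour-of-day) running statistics using Welford's online algorithm. This sketch omits the 30-day window expiry for brevity and keeps all history; class and method names are illustrative, not the project's API.

```python
# Minimal Phase 1 sketch: maintain per-bucket mean/std online, then
# score new samples as z-scores against that bucket's baseline.
import math
from collections import defaultdict
from datetime import datetime

class RollingBaseline:
    def __init__(self) -> None:
        # bucket key -> [count, mean, M2] for Welford's online variance
        self._stats = defaultdict(lambda: [0, 0.0, 0.0])

    @staticmethod
    def _bucket(ts: datetime) -> tuple:
        # Same hour-of-day, same day-of-week
        return (ts.weekday(), ts.hour)

    def record(self, ts: datetime, value: float) -> None:
        s = self._stats[self._bucket(ts)]
        s[0] += 1
        delta = value - s[1]
        s[1] += delta / s[0]
        s[2] += delta * (value - s[1])

    def z_score(self, ts: datetime, value: float) -> float:
        n, mean, m2 = self._stats[self._bucket(ts)]
        if n < 2:
            return 0.0  # not enough history yet
        std = math.sqrt(m2 / (n - 1))
        return 0.0 if std == 0 else (value - mean) / std

baseline = RollingBaseline()
tuesday_2pm = datetime(2024, 1, 2, 14)   # a Tuesday afternoon
for v in [40, 42, 38, 41, 39]:           # historical CPU % samples
    baseline.record(tuesday_2pm, v)
z = baseline.z_score(tuesday_2pm, 92)    # today's reading
print(f"{z:.1f}")                        # 32.9
print(abs(z) > 3)                        # True: flag beyond +/-3 sigma
```

Welford's algorithm avoids storing raw samples per bucket, which keeps the per-metric memory footprint constant.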
Phase 2 (future): More sophisticated methods
- Seasonal decomposition (STL) for metrics with strong weekly patterns
- Isolation Forest for detecting multivariate anomalies (unusual combinations of metrics)
- CUSUM or exponentially weighted moving average for detecting gradual drifts
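To illustrate why the Phase 2 methods earn their keep: a gradual drift can stay under a ±3σ point check on every individual sample while the exponentially weighted moving average steadily pulls away from the baseline. A hedged sketch; the smoothing factor and control limit below are assumptions, not tuned values.

```python
# EWMA drift sketch: no single sample trips a point check, but the
# smoothed average crossing `limit` standard deviations reveals a
# sustained shift. alpha and limit are illustrative assumptions.
def ewma_drift(samples, mean, std, alpha=0.2, limit=1.0):
    """Return True if the EWMA of samples drifts more than
    `limit` standard deviations away from the baseline mean."""
    ewma = mean
    for x in samples:
        ewma = alpha * x + (1 - alpha) * ewma
        if abs(ewma - mean) > limit * std:
            return True
    return False

# Baseline: mean 40, std 2. Each sample is only ~1-2 sigma high, so
# no single point is anomalous, yet the sustained drift is caught.
print(ewma_drift([42, 42, 43, 42, 43, 43, 44], mean=40.0, std=2.0))  # True
print(ewma_drift([40, 41, 39, 40, 41], mean=40.0, std=2.0))          # False
```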
Integration Points
Analysis Engine (both Dashboard and Lite)
- Add an `AnomalyScorer` that runs alongside existing fact collectors
- Produce `AnomalyFact` objects with the metric name, current value, baseline, z-score, and severity
- Existing amplifier system can use anomaly scores to boost or dampen finding severity
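The integration shape described above might look roughly like this. The `AnomalyFact` field names follow the issue text (metric name, current value, baseline, z-score, severity); everything else is an illustrative guess at the structure, not the actual codebase.

```python
# Sketch of the integration point: a scorer that emits AnomalyFact
# objects alongside rule-based facts for the amplifier system.
from dataclasses import dataclass

@dataclass
class AnomalyFact:
    metric: str
    current_value: float
    baseline_mean: float
    z_score: float
    severity: str

def score_metric(metric: str, value: float, mean: float, std: float) -> AnomalyFact:
    """Produce an AnomalyFact with a coarse severity bucket."""
    z = 0.0 if std == 0 else (value - mean) / std
    if abs(z) >= 3:
        severity = "anomalous"
    elif abs(z) >= 2:
        severity = "suspect"
    else:
        severity = "normal"
    return AnomalyFact(metric, value, mean, z, severity)

# Mirrors the MCP example: CPU at 92% against a baseline mean of 51%
fact = score_metric("cpu_utilization_pct", 92.0, mean=51.0, std=10.0)
print(fact.severity, round(fact.z_score, 1))  # anomalous 4.1
```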
MCP Tools
- `analyze_server` output could include anomaly context: "CPU utilization 92% (4.1σ above baseline for Tuesday 2pm)"
- `get_analysis_facts` could expose anomaly scores alongside rule-based scores
- New tool or parameter: `get_anomalies` to list all currently anomalous metrics
UI (both Dashboard and Lite)
- Anomalous metrics highlighted in the progressive summary view (issue #689, "Progressive server summary as landing view per server")
- Trend charts could mark anomalous data points (dot color change or marker)
- Pairs with baseline bands (from dynamic baselines issue) — anomalies are the points outside the band
Design Notes
- This is an enhancement to the existing engine architecture, not a new system
- Phase 1 (z-score) requires only basic statistics — no ML libraries needed
- The key value is combining anomaly scores with rule-based findings, not replacing rules
- False positive management: anomaly detection will flag things that are unusual but not problematic. The rule engine provides the "is this actually bad?" judgment. Anomalies without matching rules should be surfaced at lower severity (informational)
- Requires sufficient historical data (2-4 weeks minimum) — same constraint as dynamic baselines
- Applies to both Dashboard and Lite, plus MCP analysis tools