
Statistical anomaly detection as input to the analysis engine #693

@erikdarlingdata

Description

Summary

PerformanceMonitor's inference engine uses rule-based analysis: collect facts, score them with amplifiers, traverse relationship graphs to produce findings. This works well for known patterns but can miss novel problems. Statistical anomaly detection complements rule-based analysis by answering a different question: rules say "this pattern means X," anomaly detection says "this metric is behaving unusually" — even when no rule exists for that specific situation.

Combining rule-based analysis with statistical anomaly detection is widely considered more effective than either approach alone: rules catch known-bad patterns, anomaly detection catches deviations no rule anticipated.

How It Would Work

Anomaly detection runs as an additional scoring input to the existing analysis engine, not as a replacement:

  1. Collect metric snapshots (already happening)
  2. Compare current values against historical baselines (ties into dynamic baselines — issue tracking that separately)
  3. Score the degree of deviation (z-score, percentile rank, or similar)
  4. Feed anomaly scores into the inference engine as facts alongside the existing rule-based facts
  5. Amplify rule-based findings when anomaly detection confirms the deviation is statistically unusual

Example Flow

  • Rule detects: "CXPACKET waits are > 30% of total waits" (known pattern, scores Medium)
  • Anomaly detection adds: "CXPACKET waits are 4.2 standard deviations above the baseline for this time window" (statistically unusual)
  • Combined: The finding's severity is amplified because both the rule and the anomaly detector agree this is significant
  • vs: "CXPACKET waits are > 30% of total waits" but anomaly score is low (this is normal for this workload) → severity stays Medium or is dampened
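The amplify/dampen step above can be sketched as a small function. This is an illustrative sketch, not the project's actual API: the `Severity` enum, `combine_severity` name, and the ±3σ/±1σ thresholds are all assumptions.

```python
from enum import IntEnum

class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

def combine_severity(rule_severity: Severity, z_score: float,
                     amplify_at: float = 3.0,
                     dampen_below: float = 1.0) -> Severity:
    """Boost a rule-based finding when the anomaly detector agrees the
    deviation is unusual; dampen it when the value is normal for this
    workload; otherwise leave the rule's severity alone."""
    if abs(z_score) >= amplify_at:
        return Severity(min(rule_severity + 1, Severity.HIGH))
    if abs(z_score) < dampen_below:
        return Severity(max(rule_severity - 1, Severity.LOW))
    return rule_severity

# CXPACKET > 30% scores Medium; a 4.2-sigma deviation bumps it to High.
print(combine_severity(Severity.MEDIUM, 4.2).name)  # HIGH
```

The symmetric dampening path is what keeps "30% CXPACKET is just Tuesday for this box" from paging anyone.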

What to Detect Anomalies On

Metric-level anomalies (is this value unusual?)

  • CPU utilization
  • Wait stat proportions and absolute values
  • Batch requests/sec (sudden drops or spikes)
  • Query duration aggregates
  • Memory utilization
  • I/O latency
  • Session counts
  • Blocking/deadlock event counts

Pattern-level anomalies (is this combination unusual?)

  • CPU high + batch requests low = something is stuck (not just busy)
  • Wait profile shift: top wait type changed from SOS_SCHEDULER_YIELD to LCK_M_X
  • Query mix change: execution count distribution across databases shifted
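Pattern-level checks like the first two bullets reduce to boolean combinations over per-metric signals. A hedged sketch; the thresholds and function names here are invented for illustration:

```python
def stuck_not_busy(cpu_pct: float, batch_req_z: float) -> bool:
    """High CPU while batch requests/sec sits far *below* its baseline:
    the server is working hard but not serving requests."""
    return cpu_pct > 85.0 and batch_req_z < -2.0

def wait_profile_shifted(prev_top_wait: str, curr_top_wait: str) -> bool:
    """The dominant wait type changed between snapshots."""
    return prev_top_wait != curr_top_wait

print(stuck_not_busy(93.0, -3.1))                              # True
print(wait_profile_shifted("SOS_SCHEDULER_YIELD", "LCK_M_X"))  # True
```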

Statistical Approaches (start simple)

Phase 1: Z-score against rolling baseline

  • For each metric, maintain a rolling mean and standard deviation (last 30 days, same hour-of-day, same day-of-week)
  • Current z-score = (current_value - mean) / std_dev
  • Flag anything beyond ±3σ as anomalous
  • This is simple, interpretable, and effective for most time-series metrics

Phase 2 (future): More sophisticated methods

  • Seasonal decomposition (STL) for metrics with strong weekly patterns
  • Isolation Forest for detecting multivariate anomalies (unusual combinations of metrics)
  • CUSUM or exponentially weighted moving average for detecting gradual drifts
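Of the Phase 2 options, the EWMA drift detector is simple enough to sketch. The smoothing factor and relative threshold below are assumptions, not tuned values:

```python
def ewma_drift(values, alpha: float = 0.1, threshold: float = 0.2):
    """Return True once the exponentially weighted moving average drifts
    more than `threshold` (relative) away from the starting level.
    Catches gradual creep that a point-in-time z-score can miss."""
    ewma = start = values[0]
    for v in values[1:]:
        ewma = alpha * v + (1 - alpha) * ewma
        if start and abs(ewma - start) / abs(start) > threshold:
            return True
    return False

# A slow ramp from 10 toward 16 trips the detector; a flat series does not.
print(ewma_drift([10 + i * 0.2 for i in range(30)]))  # True
print(ewma_drift([10.0] * 30))                        # False
```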

Integration Points

Analysis Engine (both Dashboard and Lite)

  • Add an `AnomalyScorer` that runs alongside existing fact collectors
  • Produce `AnomalyFact` objects with the metric name, current value, baseline, z-score, and severity
  • Existing amplifier system can use anomaly scores to boost or dampen finding severity
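One possible shape for the `AnomalyFact` described above. The field names mirror the bullet (metric name, current value, baseline, z-score, severity); the dataclass itself and the `make_fact` helper are assumptions about how facts might be represented, not the engine's real types.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnomalyFact:
    metric: str            # e.g. "cpu_utilization_pct"
    current_value: float
    baseline_mean: float
    z_score: float
    severity: str          # "informational" when nothing crosses the threshold

def make_fact(metric: str, value: float, mean: float, std: float,
              sigma: float = 3.0) -> AnomalyFact:
    z = (value - mean) / std if std else 0.0
    sev = "anomalous" if abs(z) > sigma else "informational"
    return AnomalyFact(metric, value, mean, z, sev)

fact = make_fact("cpu_utilization_pct", 92.0, 51.0, 10.0)
print(fact.z_score, fact.severity)  # 4.1 anomalous
```

Emitting these as plain facts lets the existing amplifier system consume them with no special-casing, which is the "input, not replacement" point of the design.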

MCP Tools

  • `analyze_server` output could include anomaly context: "CPU utilization 92% (4.1σ above baseline for Tuesday 2pm)"
  • `get_analysis_facts` could expose anomaly scores alongside rule-based scores
  • New tool or parameter: `get_anomalies` to list all currently anomalous metrics

UI (both Dashboard and Lite)

  • Anomalous metrics highlighted in the progressive summary view (issue #689, "Progressive server summary as landing view per server")
  • Trend charts could mark anomalous data points (dot color change or marker)
  • Pairs with baseline bands (from dynamic baselines issue) — anomalies are the points outside the band

Design Notes

  • This is an enhancement to the existing engine architecture, not a new system
  • Phase 1 (z-score) requires only basic statistics — no ML libraries needed
  • The key value is combining anomaly scores with rule-based findings, not replacing rules
  • False positive management: anomaly detection will flag things that are unusual but not problematic. The rule engine provides the "is this actually bad?" judgment. Anomalies without matching rules should be surfaced at lower severity (informational)
  • Requires sufficient historical data (2-4 weeks minimum) — same constraint as dynamic baselines
  • Applies to both Dashboard and Lite, plus MCP analysis tools
