Machine-Learning/Statistics at main · SamavartaX5/Machine-Learning · GitHub

1. What is Statistics?

Statistics is the study of collecting, organizing, analyzing, and interpreting data.

2. Types of Statistics

Descriptive Statistics → summarizes data (mean, median, variance)
Inferential Statistics → uses sample data to draw conclusions about the population

3. Population vs Sample

Population = the entire set of items under study (size N)
Sample = a subset drawn from the population (size n)

4. Measures of Central Tendency

Mean = sum of values / n
Median = middle value after sorting (average of the two middle values when n is even)
Mode = most frequent value
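
As a minimal sketch, the three measures can be computed with Python's standard statistics module. The data values are made up for illustration and are not from these notes.

```python
import statistics

# Illustrative values (an assumption, not data from the notes)
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean_value = statistics.mean(data)      # sum of values / n
median_value = statistics.median(data)  # n is even here, so the two middle values are averaged
mode_value = statistics.mode(data)      # most frequent value

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
```

When several values tie for the highest frequency, statistics.multimode returns all of them instead of a single mode.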

5. Measures of Dispersion

Population Variance = sum of (xi − mean)^2 / N
Sample Variance = sum of (xi − sample_mean)^2 / (n − 1)

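Reason for (n − 1): deviations are measured from the sample mean, which sits closer to the sample than the true population mean does, so dividing by n tends to underestimate the population variance. Dividing by (n − 1) removes this bias (Bessel’s correction).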

Standard Deviation = square root of variance
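
The formulas above can be sketched directly in Python and cross-checked against the standard library. The data values are an illustrative assumption.

```python
import math
import statistics

# Illustrative values (an assumption, not data from the notes)
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean
squared_deviations = sum((x - mean) ** 2 for x in data)

population_variance = squared_deviations / n    # divide by N
sample_variance = squared_deviations / (n - 1)  # divide by (n - 1), Bessel's correction
population_std_dev = math.sqrt(population_variance)
sample_std_dev = math.sqrt(sample_variance)

# Cross-check against the standard library implementations
assert math.isclose(population_variance, statistics.pvariance(data))
assert math.isclose(sample_variance, statistics.variance(data))

print(population_variance, sample_variance)
print(population_std_dev, sample_std_dev)
```

The sample variance comes out slightly larger than the population variance, since the same sum is divided by a smaller number.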

6. Types of Variables

Quantitative (numeric)

Discrete → countable (1, 2, 3)
Continuous → measurable (height, weight)

Qualitative (categorical)

Non-numeric (gender, color, type)

7. Histogram

A graph showing the frequency distribution of data
X-axis → value ranges (bins)
Y-axis → frequency
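
A rough text-only sketch of the same idea: group values into bins and count how many fall into each. The bin width of 2 and the helper name bin_start are assumptions chosen for illustration.

```python
from collections import Counter

data = [1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 8, 8, 9]
bin_width = 2  # assumed bin width, chosen only for illustration


def bin_start(value, width):
    """Return the left edge of the bin [start, start + width) containing value."""
    return (value // width) * width


# Frequency of values per bin (this is what the Y-axis shows)
counts = Counter(bin_start(x, bin_width) for x in data)

for start in sorted(counts):
    bar = "#" * counts[start]
    print(f"[{start}, {start + bin_width}): {bar}")
```

Each printed row is one bin (X-axis) and the number of # marks is its frequency (Y-axis).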

8. Percentiles

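A percentile is the value below which P% of the data lies

Position = (P/100) × (n + 1), counted from 1 in the sorted data (one common convention; interpolate between neighbors when the position is not a whole number)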

Example: 50th percentile = median
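
A minimal sketch of the position formula with linear interpolation between neighboring values. The function name percentile and the choice to clamp positions to the range 1..n are assumptions; other percentile conventions give slightly different results.

```python
def percentile(values, p):
    """Value at the p-th percentile using position = (p / 100) * (n + 1)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)
    position = (p / 100) * (n + 1)       # 1-based position in the sorted data
    position = min(max(position, 1), n)  # clamp to a valid position (assumption)
    lower = int(position)                # whole part of the position
    fraction = position - lower          # how far to move toward the next value
    if lower >= n:
        return float(ordered[-1])
    below = ordered[lower - 1]
    above = ordered[lower]
    return below + fraction * (above - below)


data = [1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 8, 8, 9]
print(percentile(data, 25))
print(percentile(data, 50))
print(percentile(data, 75))
```

For this data the 50th percentile equals the median, as stated above.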

9. Quartiles

Q1 → 25th percentile
Q2 → 50th percentile (median)
Q3 → 75th percentile

10. Interquartile Range (IQR)

IQR = Q3 − Q1
Represents the spread of the middle 50% of the data

11. Outliers

Lower Fence = Q1 − 1.5 × IQR
Upper Fence = Q3 + 1.5 × IQR

Values below the lower fence or above the upper fence = outliers
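
The fence rule can be sketched as follows. The function name find_outliers is hypothetical, and the quartiles come from statistics.quantiles with its default method, which is only one of several quartile conventions.

```python
import statistics


def find_outliers(values):
    """Return (lower_fence, upper_fence, outliers) using the 1.5 x IQR rule."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers


# Illustrative data with one extreme value (27) appended
data = [1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 8, 8, 9, 27]
lower_fence, upper_fence, outliers = find_outliers(data)
print(lower_fence, upper_fence, outliers)
```

Note that the extreme value itself shifts the quartiles slightly, so the fences depend on whether it is included in the data.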

12. Five Number Summary

Minimum, Q1, Median, Q3, Maximum
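
A short sketch of the summary in Python. The function name five_number_summary is hypothetical, and the quartiles depend on the convention used; here they come from statistics.quantiles with its default method.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


data = [1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 8, 8, 9]
summary = five_number_summary(data)
print(summary)
```

These five values are what a box plot draws: the box spans Q1 to Q3 with a line at the median.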

13. Covariance

Cov(X, Y) = sum((xi − x_mean)(yi − y_mean)) / (n − 1)

Interpretation:
Positive → X and Y tend to move together
Negative → X and Y tend to move in opposite directions
Zero → no linear relation
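
A minimal sketch of the sample covariance formula. The function name and the hours/scores values are assumptions made up for illustration.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum((xi - x_mean) * (yi - y_mean)) / (n - 1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    products = ((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


# Illustrative values (an assumption, not data from the notes)
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

cov_together = sample_covariance(hours, scores)        # positive: move together
cov_opposite = sample_covariance(hours, scores[::-1])  # negative: move opposite
print(cov_together, cov_opposite)
```

The magnitude of covariance depends on the units of X and Y, which is why correlation is used when a unit-free measure is needed.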

14. Correlation

Correlation = Cov(X, Y) / (std_dev_X × std_dev_Y)

Range: -1 to 1

1 → perfect positive linear relation
-1 → perfect negative linear relation
0 → no linear relation

Limitation: only captures linear relationships
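
The formula and its limitation can be sketched together: a straight line gives a correlation of 1, while a symmetric curve gives 0 even though Y depends entirely on X. The function name pearson and the data are illustrative assumptions.

```python
import math


def pearson(xs, ys):
    """Correlation = Cov(X, Y) / (std_dev_X * std_dev_Y), using sample (n - 1) formulas."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / (n - 1)
    std_x = math.sqrt(sum((x - x_mean) ** 2 for x in xs) / (n - 1))
    std_y = math.sqrt(sum((y - y_mean) ** 2 for y in ys) / (n - 1))
    return cov / (std_x * std_y)


xs = [-2, -1, 0, 1, 2]
line = [2 * x + 1 for x in xs]  # exact linear relationship
curve = [x * x for x in xs]     # exact but non-linear relationship

r_line = pearson(xs, line)
r_curve = pearson(xs, curve)
print(r_line, r_curve)
```

The second result shows why a correlation of 0 should be read as "no linear relation" rather than "no relation".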

15. Types of Correlation

Pearson → linear relationship
Spearman → rank-based, captures monotonic relationships (including non-linear ones)
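
One way to sketch the difference: Spearman correlation is Pearson correlation applied to the ranks of the data. The helper names below are hypothetical, tied values receive the average of their ranks, and the data are made up for illustration.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation from sums of deviations (the (n - 1) factors cancel)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]  # monotonic but not linear

print(pearson(xs, ys))   # below 1, the relationship is not a straight line
print(spearman(xs, ys))  # the ranks agree perfectly
```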

16. Key Insight

Sample statistics are estimates of the corresponding population values
Sample variance divides by (n − 1) so that it does not systematically underestimate the population variance

17. Quick Example

Data: 1,2,2,3,3,4,5,5,6,6,7,8,8,9

Median = 5
Q1 ≈ 3
Q3 ≈ 7
IQR = 4

Lower Fence = -3
Upper Fence = 13

Any value outside this range → outlier. No value in this data set qualifies, but a value such as 27 would.
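
The numbers above can be reproduced with a short sketch. It assumes the quartiles are taken as the medians of the lower and upper halves of the sorted data, which is one common convention and gives exactly 3 and 7 here; interpolation-based conventions give values close to these.

```python
import statistics

data = [1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 8, 8, 9]
ordered = sorted(data)
half = len(ordered) // 2

lower_half = ordered[:half]   # first 7 values
upper_half = ordered[-half:]  # last 7 values

median = statistics.median(ordered)
q1 = statistics.median(lower_half)  # median of the lower half (assumed convention)
q3 = statistics.median(upper_half)  # median of the upper half (assumed convention)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Check the data plus one hypothetical extreme value against the fences
outliers = [x for x in data + [27] if x < lower_fence or x > upper_fence]

print(median, q1, q3, iqr)
print(lower_fence, upper_fence, outliers)
```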