<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Another Book on Data Science - Linear Regression</title>
<meta name="description" content="data science, R, Python, programming, machine learning">
<meta name="author" content="Nailong Zhang">
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
<script src="https://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<!-- Le styles -->
<link rel="stylesheet" href="bootstrap-1.1.0.min.css">
<link rel="stylesheet" href="style.css">
<link rel="stylesheet" href="small-screens.css">
<link rel="stylesheet" href="vs.css">
<link rel="stylesheet" href="code.css">
<link rel="stylesheet" href="application.css">
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-142297640-1', 'anotherbookondatascience.com');
ga('send', 'pageview');
</script>
</head>
<body>
<div class="topbar">
<div class="fill">
<div class="container-fluid fixed">
<h3><a href="index.html">Linear Regression</a></h3>
<ul class="nav secondary-nav">
<li><a href="chapter4.html">«Previous</a></li>
<li><a href="chapter6.html">Next»</a></li>
</ul>
</div>
</div>
</div>
<div class="container-fluid" style="padding-top: 60px;">
<p>Sections in this Chapter:</p>
<ul>
<li><a href="#basics">Basics of linear regression</a></li>
<li><a href="#hypothesis">Linear hypothesis testing</a></li>
<li><a href="#ridge">Ridge regression</a></li>
</ul>
<p>There are numerous books <sup class="footnote" id="fnr1"><a href="#fn1">1</a></sup> <sup class="footnote" id="fnr2"><a href="#fn2">2</a></sup> available on the theory of linear regression. Why, then, write a chapter about these models in yet another book on data science? Many readers are interested in how to implement their own regression models rather than in using off-the-shelf software packages. By the end of this chapter, we will have built our own linear regression models in R/Python. We will also see how the same functions can be reused across different regressions, such as plain linear regression and regression with an L2 penalty.</p>
<h2 id="basics">Basics of linear regression</h2>
<p>We start from the matrix form of linear regression.</p>
<p>$$<br />
\begin{equation}<br />
\boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{\epsilon},<br />
\label{eq:1}<br />
\end{equation}<br />
$$</p>
<p>where the column vector $\boldsymbol{y}$ contains $n$ observations on the dependent variable, $\boldsymbol{X}$ is a $n\times (p+1)$ matrix ($n>p$) of independent variables with constant vector $\boldsymbol{1}$ in the first column, $\boldsymbol{\beta}$ is a column vector of unknown population parameters to estimate based on the data, and $\boldsymbol{\epsilon}$ is the error term (or noise). For the sake of illustration, \eqref{eq:1} can be extended as in</p>
<p>$$<br />
\begin{equation}\label{eq:2}<br />
\begin{bmatrix}<br />
\boldsymbol{y}_1 \\<br />
\boldsymbol{y}_2 \\<br />
\vdots \\<br />
\boldsymbol{y}_n<br />
\end{bmatrix}<br />
=<br />
\begin{bmatrix}<br />
1 & \boldsymbol{X}_{11} & \boldsymbol{X}_{21} & \cdots & \boldsymbol{X}_{p1} \\<br />
1 & \boldsymbol{X}_{12} & \boldsymbol{X}_{22} & \cdots & \boldsymbol{X}_{p2} \\<br />
\vdots & \vdots & \vdots & \vdots & \vdots \\<br />
1 & \boldsymbol{X}_{1n} & \boldsymbol{X}_{2n} & \cdots & \boldsymbol{X}_{pn}<br />
\end{bmatrix}<br />
\begin{bmatrix}<br />
\boldsymbol{\beta}_0 \\<br />
\boldsymbol{\beta}_1 \\<br />
\vdots \\<br />
\boldsymbol{\beta}_p<br />
\end{bmatrix}<br />
+<br />
\begin{bmatrix}<br />
\boldsymbol{\epsilon}_1 \\<br />
\boldsymbol{\epsilon}_2 \\<br />
\vdots \\<br />
\boldsymbol{\epsilon}_n<br />
\end{bmatrix}<br />
\end{equation}<br />
$$</p>
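<p>As a quick sketch (not one of the book’s code listings; the toy values below are made up), the design matrix in \eqref{eq:2} can be assembled in <code>numpy</code> by prepending the constant column of ones:</p>

```python
import numpy as np

# Hypothetical toy data: n = 3 observations of p = 2 predictors.
x_raw = np.array([[2.0, 3.0],
                  [1.0, 5.0],
                  [4.0, 2.0]])
# Prepend the constant column of ones, giving the n x (p+1) matrix X.
X = np.hstack((np.ones((x_raw.shape[0], 1)), x_raw))
print(X.shape)  # (3, 3)
```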
<p>We apply the ordinary least squares (<span class="caps">OLS</span>)<sup class="footnote" id="fnr3"><a href="#fn3">3</a></sup> approach to estimate the model parameter $\boldsymbol{\beta}$ since it requires fewer assumptions than other estimation methods such as maximum likelihood estimation<sup class="footnote" id="fnr4"><a href="#fn4">4</a></sup>. Denoting the estimated model parameter by $\boldsymbol{\hat\beta}$, we define the residual vector of the system as $\boldsymbol{e}=\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\hat\beta}$. The idea of <span class="caps">OLS</span> is to find the $\boldsymbol{\hat\beta}$ that minimizes the sum of squared residuals (<span class="caps">SSR</span>), i.e.,</p>
<p>$$<br />
\begin{equation}\label{eq:3}<br />
\min_{\boldsymbol{\hat\beta}} \quad \boldsymbol{e}’\boldsymbol{e}<br />
\end{equation}<br />
$$</p>
<p>Now the question is how to solve the optimization problem \eqref{eq:3}. First, let’s expand the <span class="caps">SSR</span>.</p>
<p>$$<br />
\begin{equation}\label{eq:4}<br />
\begin{split}<br />
\boldsymbol{e}’\boldsymbol{e}&=(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\hat\beta})‘(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\hat\beta}) \\<br />
&= \boldsymbol{y}’\boldsymbol{y}-2\hat{\beta}’\boldsymbol{X}’\boldsymbol{y}+\hat{\beta}’\boldsymbol{X}’\boldsymbol{X}\hat{\boldsymbol{\beta}}<br />
\end{split}<br />
\end{equation}<br />
$$</p>
<p>The first and second order derivatives are calculated as follows</p>
<p>$$<br />
\begin{equation}\label{eq:5}<br />
\begin{cases}<br />
\frac{\partial \boldsymbol{e}’\boldsymbol{e}}{\partial \boldsymbol{\hat\beta}}<br />
=-2\boldsymbol{X}’\boldsymbol{y}+2\boldsymbol{X}’\boldsymbol{X}\boldsymbol{\hat\beta}\\<br />
\frac{\partial^2 \boldsymbol{e}’\boldsymbol{e}}{\partial \boldsymbol{\hat\beta} \partial \boldsymbol{\hat\beta}’}<br />
=2\boldsymbol{X}’\boldsymbol{X}.<br />
\end{cases}<br />
\end{equation}<br />
$$</p>
<p>We see that the second order derivative is positive semidefinite, which implies that the <span class="caps">SSR</span> in <span class="caps">OLS</span> is a convex function (see section 3.1.4<sup class="footnote" id="fnr5"><a href="#fn5">5</a></sup>), and for an unconstrained convex optimization problem the necessary as well as sufficient condition for optimality is that the first order derivative equals 0 (see section 4.2.3<sup class="footnote" id="fnr5"><a href="#fn5">5</a></sup>). Optimization of convex functions is very important in machine learning; in fact, the parameter estimation of many machine learning models is formulated as a convex optimization problem.</p>
<p>Based on the analysis above, the solution of \eqref{eq:3} is given in \eqref{eq:6}.</p>
<p>$$<br />
\begin{equation}\label{eq:6}<br />
\boldsymbol{\hat\beta}=(\boldsymbol{X}’\boldsymbol{X})^{-1}\boldsymbol{X}’\boldsymbol{y}<br />
\end{equation}<br />
$$</p>
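<p>As a sanity check on \eqref{eq:6} (a sketch with simulated data, not from the book’s code listings), the closed-form estimator can be evaluated directly in <code>numpy</code> on a well-conditioned design matrix:</p>

```python
import numpy as np

# Simulated data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n, p = 50, 3
X = np.hstack((np.ones((n, 1)), rng.normal(size=(n, p))))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.01 * rng.normal(size=n)

# Equation (6): beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ y)
# With small noise, beta_hat recovers beta_true closely.
```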
<p>Now it seems we are ready to write our own linear regression model in R/Python. The solution in \eqref{eq:6} involves matrix transposition, multiplication and inversion, all of which are supported in both R and Python. In Python, we can use the <code>numpy</code> module for the matrix operations.</p>
<p>However, in practice we don’t solve linear regression with \eqref{eq:6} directly. Why? Let’s see an example with</p>
<p>$$<br />
x=\begin{bmatrix}<br />
1e+6 & -1 \\<br />
-1 & 1e-6 <br />
\end{bmatrix}.<br />
$$</p>
<div class="codewrapper">
<div class="codeleft">
<language>R</language>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span></span><span class="lineno">1 </span><span class="o">></span> x<span class="o">=</span><span class="kt">array</span><span class="p">(</span><span class="kt">c</span><span class="p">(</span><span class="m">10</span><span class="o">^</span><span class="m">6</span><span class="p">,</span> <span class="m">-1</span><span class="p">,</span> <span class="m">-1</span><span class="p">,</span> <span class="m">10</span><span class="o">^</span><span class="m">-6</span><span class="p">),</span> <span class="kt">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="m">2</span><span class="p">))</span>
<span class="lineno">2 </span><span class="o">></span> <span class="kp">solve</span><span class="p">(</span><span class="kp">t</span><span class="p">(</span>x<span class="p">)</span> <span class="o">%*%</span> x<span class="p">)</span> <span class="c1"># solve() would return the inverse matrix</span>
<span class="lineno">3 </span>Error <span class="kr">in</span> <span class="kp">solve.default</span><span class="p">(</span><span class="kp">t</span><span class="p">(</span>x<span class="p">)</span> <span class="o">%*%</span> x<span class="p">)</span> <span class="o">:</span>
<span class="lineno">4 </span> system is computationally singular<span class="o">:</span> reciprocal condition number <span class="o">=</span> <span class="m">2.22044e-28</span></code></pre></figure></div>
<div class="coderight">
<language>Python</language>
<figure class="highlight"><pre><code class="language-python3" data-lang="python3"><span></span><span class="lineno">1 </span><span class="o">>>></span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="lineno">2 </span><span class="o">>>></span> <span class="n">x</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([[</span><span class="mf">1e6</span><span class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mf">1e-6</span><span class="p">]])</span>
<span class="lineno">3 </span><span class="o">>>></span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">inv</span><span class="p">(</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">transpose</span><span class="p">(),</span> <span class="n">x</span><span class="p">))</span>
<span class="lineno">4 </span><span class="n">array</span><span class="p">([[</span><span class="mf">4.50359963e+03</span><span class="p">,</span> <span class="mf">4.50359963e+09</span><span class="p">],</span>
<span class="lineno">5 </span> <span class="p">[</span><span class="mf">4.50359963e+09</span><span class="p">,</span> <span class="mf">4.50359963e+15</span><span class="p">]])</span></code></pre></figure></div>
</div>
<p>The R code above throws an error because of the singularity of $\boldsymbol{X}’\boldsymbol{X}$. Interestingly, the corresponding Python code doesn’t behave in the same way as R, which has been reported as an issue on GitHub<sup class="footnote" id="fnr6"><a href="#fn6">6</a></sup>.</p>
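<p>The singularity can be seen numerically from the condition number of $\boldsymbol{X}’\boldsymbol{X}$ (a brief sketch; <code>np.linalg.cond</code> reports the condition number, whose reciprocal corresponds to the diagnostic printed by R):</p>

```python
import numpy as np

x = np.array([[1e6, -1.0],
              [-1.0, 1e-6]])
xtx = x.T @ x
# det(x) = 1e6 * 1e-6 - (-1)*(-1) = 0, so x (and hence x'x) is
# singular; the condition number of x'x is astronomically large.
print(np.linalg.cond(xtx) > 1e12)  # True
```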
<p>When the matrix $\boldsymbol{X}’\boldsymbol{X}$ is singular, how do we solve the <span class="caps">OLS</span> problem? In this book, we will focus on the QR-decomposition-based solution. Singular value decomposition (<span class="caps">SVD</span>) can also be used to solve <span class="caps">OLS</span>, but it will not be covered in this book.</p>
<p>In linear algebra, a QR decomposition<sup class="footnote" id="fnr7"><a href="#fn7">7</a></sup> of a matrix $\boldsymbol{X}$ factorizes $\boldsymbol{X}$ into a product, i.e., $\boldsymbol{X}=\boldsymbol{Q}\boldsymbol{R}$, where $\boldsymbol{Q}$ is an orthogonal matrix and $\boldsymbol{R}$ is an upper triangular matrix. Since the matrix $\boldsymbol{Q}$ is orthogonal ($\boldsymbol{Q}’=\boldsymbol{Q}^{-1}$), we have</p>
<p>$$<br />
\begin{equation}<br />
\begin{split}<br />
\boldsymbol{\hat\beta}&=(\boldsymbol{X}’\boldsymbol{X})^{-1}\boldsymbol{X}’\boldsymbol{y} \\<br />
& = (\boldsymbol{R}’\boldsymbol{Q}’\boldsymbol{Q}\boldsymbol{R})^{-1}\boldsymbol{R}’\boldsymbol{Q}’\boldsymbol{y} \\<br />
& = (\boldsymbol{R}’\boldsymbol{R})^{-1}\boldsymbol{R}’\boldsymbol{Q}’\boldsymbol{y} \\<br />
& = \boldsymbol{R}^{-1}\boldsymbol{Q}’\boldsymbol{y} <br />
\end{split}<br />
\label{eq:7}<br />
\end{equation}<br />
$$</p>
<p>Now we are ready to write our simple R/Python functions for linear regression with the help of QR decomposition according to $\eqref{eq:7}$.</p>
<div class="codewrapper">
<div class="codeleft">
<language>R</language>
<p>chapter4/qr_solver.R<br />
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span></span><span class="lineno">1 </span>qr_solver<span class="o">=</span><span class="kr">function</span><span class="p">(</span>x<span class="p">,</span>y<span class="p">){</span>
<span class="lineno">2 </span> <span class="kp">qr.coef</span><span class="p">(</span><span class="kp">qr</span><span class="p">(</span>x<span class="p">),</span>y<span class="p">)</span>
<span class="lineno">3 </span><span class="p">}</span></code></pre></figure></p>
</div>
<div class="coderight">
<language>Python</language>
<p>chapter4/qr_solver.py<br />
<figure class="highlight"><pre><code class="language-python3" data-lang="python3"><span></span><span class="lineno">1 </span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="lineno">2 </span>
<span class="lineno">3 </span><span class="k">def</span> <span class="nf">qr_solver</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">):</span>
<span class="lineno">4 </span> <span class="n">q</span><span class="p">,</span><span class="n">r</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">qr</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="lineno">5 </span> <span class="n">p</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">q</span><span class="o">.</span><span class="n">T</span><span class="p">,</span><span class="n">y</span><span class="p">)</span>
<span class="lineno">6 </span> <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">inv</span><span class="p">(</span><span class="n">r</span><span class="p">),</span><span class="n">p</span><span class="p">)</span></code></pre></figure></p>
</div>
</div>
<p>Of course, we don’t need to implement our own <span class="caps">OLS</span> solvers in a production environment; well-written and well-tested functions such as <code>np.linalg.lstsq</code> can save us the time and effort of doing it from scratch.</p>
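<p>For instance, here is a minimal sketch of <code>np.linalg.lstsq</code> on simulated data (the data below are made up for illustration):</p>

```python
import numpy as np

# Hypothetical simulated data for illustration.
rng = np.random.default_rng(1)
X = np.hstack((np.ones((20, 1)), rng.normal(size=(20, 2))))
y = X @ np.array([0.5, 1.0, -2.0]) + 0.01 * rng.normal(size=20)

# np.linalg.lstsq minimizes ||X b - y||^2 and also handles
# rank-deficient X; rcond=None requests the machine-precision cutoff.
beta_hat, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
# beta_hat is close to the true coefficients [0.5, 1.0, -2.0].
```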
<p>Ok, now we have finished the training part of a linear regression model in both R and Python. After we train a model, we want to use it, i.e., to make predictions based on it. For most machine learning models, training is much more complex than prediction (exceptions include lazy-learning models such as <span class="caps">KNN</span>). Let’s continue developing our own linear regression model by including the prediction function and enclosing everything in an object.</p>
<language>R</language>
<p>chapter4/linear_regression.R<br />
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span></span><span class="lineno"> 1 </span><span class="kn">library</span><span class="p">(</span>R6<span class="p">)</span>
<span class="lineno"> 2 </span>LR <span class="o">=</span> R6Class<span class="p">(</span>
<span class="lineno"> 3 </span> <span class="s">"LR"</span><span class="p">,</span>
<span class="lineno"> 4 </span> public <span class="o">=</span> <span class="kt">list</span><span class="p">(</span>
<span class="lineno"> 5 </span> coef <span class="o">=</span> <span class="kc">NULL</span><span class="p">,</span>
<span class="lineno"> 6 </span> initialize <span class="o">=</span> <span class="kr">function</span><span class="p">()</span> <span class="p">{</span>
<span class="lineno"> 7 </span>
<span class="lineno"> 8 </span> <span class="p">},</span>
<span class="lineno"> 9 </span> fit <span class="o">=</span> <span class="kr">function</span><span class="p">(</span>x<span class="p">,</span> y<span class="p">)</span> <span class="p">{</span>
<span class="lineno">10 </span> self<span class="o">$</span>qr_solver<span class="p">(</span><span class="kp">cbind</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> x<span class="p">),</span> y<span class="p">)</span>
<span class="lineno">11 </span> <span class="p">},</span>
<span class="lineno">12 </span> qr_solver <span class="o">=</span> <span class="kr">function</span><span class="p">(</span>x<span class="p">,</span> y<span class="p">)</span> <span class="p">{</span>
<span class="lineno">13 </span> self<span class="o">$</span>coef <span class="o">=</span> <span class="kp">qr.coef</span><span class="p">(</span><span class="kp">qr</span><span class="p">(</span>x<span class="p">),</span> y<span class="p">)</span>
<span class="lineno">14 </span> <span class="p">},</span>
<span class="lineno">15 </span> predict <span class="o">=</span> <span class="kr">function</span><span class="p">(</span>new_x<span class="p">)</span> <span class="p">{</span>
<span class="lineno">16 </span> <span class="kp">cbind</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> new_x<span class="p">)</span> <span class="o">%*%</span> self<span class="o">$</span>coef
<span class="lineno">17 </span> <span class="p">}</span>
<span class="lineno">18 </span> <span class="p">)</span>
<span class="lineno">19 </span><span class="p">)</span></code></pre></figure></p>
<language>Python</language>
<p>chapter4/linear_regression.py<br />
<figure class="highlight"><pre><code class="language-python3" data-lang="python3"><span></span><span class="lineno"> 1 </span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="lineno"> 2 </span>
<span class="lineno"> 3 </span><span class="k">class</span> <span class="nc">LR</span><span class="p">:</span>
<span class="lineno"> 4 </span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="lineno"> 5 </span> <span class="bp">self</span><span class="o">.</span><span class="n">coef</span> <span class="o">=</span> <span class="kc">None</span>
<span class="lineno"> 6 </span>
<span class="lineno"> 7 </span> <span class="k">def</span> <span class="nf">qr_solver</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="lineno"> 8 </span> <span class="n">q</span><span class="p">,</span> <span class="n">r</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">qr</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="lineno"> 9 </span> <span class="n">p</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">q</span><span class="o">.</span><span class="n">T</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="lineno">10 </span> <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">inv</span><span class="p">(</span><span class="n">r</span><span class="p">),</span> <span class="n">p</span><span class="p">)</span>
<span class="lineno">11 </span>
<span class="lineno">12 </span> <span class="k">def</span> <span class="nf">fit</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="lineno">13 </span> <span class="bp">self</span><span class="o">.</span><span class="n">coef</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">qr_solver</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">((</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">1</span><span class="p">)),</span> <span class="n">x</span><span class="p">)),</span> <span class="n">y</span><span class="p">)</span>
<span class="lineno">14 </span>
<span class="lineno">15 </span> <span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="lineno">16 </span> <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">((</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">1</span><span class="p">)),</span> <span class="n">x</span><span class="p">)),</span> <span class="bp">self</span><span class="o">.</span><span class="n">coef</span><span class="p">)</span></code></pre></figure></p>
<p>Now let’s try to use our linear regression model to solve a real regression problem with the Boston dataset<sup class="footnote" id="fnr8"><a href="#fn8">8</a></sup>, and check the results.</p>
<language>R</language>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span></span><span class="lineno"> 1 </span><span class="o">></span> <span class="kn">source</span><span class="p">(</span><span class="s">'linear_regression.R'</span><span class="p">)</span>
<span class="lineno"> 2 </span><span class="o">></span>
<span class="lineno"> 3 </span><span class="o">></span> <span class="kn">library</span><span class="p">(</span>MASS<span class="p">)</span> <span class="c1"># load Boston data from this package</span>
<span class="lineno"> 4 </span><span class="o">></span>
<span class="lineno"> 5 </span><span class="o">></span> lr <span class="o">=</span> LR<span class="o">$</span>new<span class="p">()</span>
<span class="lineno"> 6 </span><span class="o">></span> <span class="c1"># -i means excluding the ith column from the data.frame</span>
<span class="lineno"> 7 </span><span class="o">></span> lr<span class="o">$</span>fit<span class="p">(</span><span class="kp">data.matrix</span><span class="p">(</span>Boston<span class="p">[,</span><span class="o">-</span><span class="kp">ncol</span><span class="p">(</span>Boston<span class="p">)]),</span>Boston<span class="o">$</span>medv<span class="p">)</span>
<span class="lineno"> 8 </span><span class="o">></span> <span class="kp">print</span><span class="p">(</span>lr<span class="o">$</span>coef<span class="p">)</span>
<span class="lineno"> 9 </span> crim zn indus chas
<span class="lineno">10 </span> <span class="m">3.645949e+01</span> <span class="m">-1.080114e-01</span> <span class="m">4.642046e-02</span> <span class="m">2.055863e-02</span> <span class="m">2.686734e+00</span>
<span class="lineno">11 </span> nox rm age dis rad
<span class="lineno">12 </span><span class="m">-1.776661e+01</span> <span class="m">3.809865e+00</span> <span class="m">6.922246e-04</span> <span class="m">-1.475567e+00</span> <span class="m">3.060495e-01</span>
<span class="lineno">13 </span> tax ptratio black lstat
<span class="lineno">14 </span><span class="m">-1.233459e-02</span> <span class="m">-9.527472e-01</span> <span class="m">9.311683e-03</span> <span class="m">-5.247584e-01</span>
<span class="lineno">15 </span><span class="o">></span> <span class="c1"># let's make prediction on the same data</span>
<span class="lineno">16 </span><span class="o">></span> pred<span class="o">=</span>lr<span class="o">$</span>predict<span class="p">(</span><span class="kp">data.matrix</span><span class="p">(</span>Boston<span class="p">[,</span><span class="o">-</span><span class="kp">ncol</span><span class="p">(</span>Boston<span class="p">)]))</span>
<span class="lineno">17 </span><span class="o">></span> <span class="kp">print</span><span class="p">(</span>pred<span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">5</span><span class="p">])</span>
<span class="lineno">18 </span><span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">30.00384</span> <span class="m">25.02556</span> <span class="m">30.56760</span> <span class="m">28.60704</span> <span class="m">27.94352</span>
<span class="lineno">19 </span><span class="o">></span>
<span class="lineno">20 </span><span class="o">></span> <span class="c1"># compare it with the R built-in linear regression model</span>
<span class="lineno">21 </span><span class="o">></span> rlm <span class="o">=</span> lm<span class="p">(</span>medv <span class="o">~</span> <span class="m">.</span><span class="p">,</span> data<span class="o">=</span>Boston<span class="p">)</span>
<span class="lineno">22 </span><span class="o">></span> <span class="kp">print</span><span class="p">(</span>rlm<span class="o">$</span>coef<span class="p">)</span>
<span class="lineno">23 </span> <span class="p">(</span>Intercept<span class="p">)</span> crim zn indus chas
<span class="lineno">24 </span> <span class="m">3.645949e+01</span> <span class="m">-1.080114e-01</span> <span class="m">4.642046e-02</span> <span class="m">2.055863e-02</span> <span class="m">2.686734e+00</span>
<span class="lineno">25 </span> nox rm age dis rad
<span class="lineno">26 </span><span class="m">-1.776661e+01</span> <span class="m">3.809865e+00</span> <span class="m">6.922246e-04</span> <span class="m">-1.475567e+00</span> <span class="m">3.060495e-01</span>
<span class="lineno">27 </span> tax ptratio black lstat
<span class="lineno">28 </span><span class="m">-1.233459e-02</span> <span class="m">-9.527472e-01</span> <span class="m">9.311683e-03</span> <span class="m">-5.247584e-01</span>
<span class="lineno">29 </span><span class="o">></span> <span class="kp">print</span><span class="p">(</span>rlm<span class="o">$</span>fitted<span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">5</span><span class="p">])</span>
<span class="lineno">30 </span> <span class="m">1</span> <span class="m">2</span> <span class="m">3</span> <span class="m">4</span> <span class="m">5</span>
<span class="lineno">31 </span><span class="m">30.00384</span> <span class="m">25.02556</span> <span class="m">30.56760</span> <span class="m">28.60704</span> <span class="m">27.94352</span> </code></pre></figure><language>Python</language>
<figure class="highlight"><pre><code class="language-python3" data-lang="python3"><span></span><span class="lineno"> 1 </span><span class="o">>>></span> <span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="k">import</span> <span class="n">load_boston</span>
<span class="lineno"> 2 </span><span class="o">>>></span> <span class="kn">from</span> <span class="nn">linear_regression</span> <span class="k">import</span> <span class="n">LR</span>
<span class="lineno"> 3 </span><span class="o">>>></span> <span class="n">boston</span> <span class="o">=</span> <span class="n">load_boston</span><span class="p">()</span>
<span class="lineno"> 4 </span><span class="o">>>></span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">boston</span><span class="o">.</span><span class="n">data</span><span class="p">,</span> <span class="n">boston</span><span class="o">.</span><span class="n">target</span>
<span class="lineno"> 5 </span><span class="o">>>></span> <span class="c1"># first, let's run our own linear regression</span>
<span class="lineno"> 6 </span><span class="o">...</span> <span class="n">lr</span> <span class="o">=</span> <span class="n">LR</span><span class="p">()</span>
<span class="lineno"> 7 </span><span class="o">>>></span> <span class="n">lr</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="lineno"> 8 </span><span class="o">>>></span> <span class="nb">print</span><span class="p">(</span><span class="n">lr</span><span class="o">.</span><span class="n">coef</span><span class="p">)</span>
<span class="lineno"> 9 </span><span class="p">[</span> <span class="mf">3.64594884e+01</span> <span class="o">-</span><span class="mf">1.08011358e-01</span> <span class="mf">4.64204584e-02</span> <span class="mf">2.05586264e-02</span>
<span class="lineno">10 </span> <span class="mf">2.68673382e+00</span> <span class="o">-</span><span class="mf">1.77666112e+01</span> <span class="mf">3.80986521e+00</span> <span class="mf">6.92224640e-04</span>
<span class="lineno">11 </span> <span class="o">-</span><span class="mf">1.47556685e+00</span> <span class="mf">3.06049479e-01</span> <span class="o">-</span><span class="mf">1.23345939e-02</span> <span class="o">-</span><span class="mf">9.52747232e-01</span>
<span class="lineno">12 </span> <span class="mf">9.31168327e-03</span> <span class="o">-</span><span class="mf">5.24758378e-01</span><span class="p">]</span>
<span class="lineno">13 </span><span class="o">>>></span> <span class="nb">print</span><span class="p">(</span><span class="n">lr</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)[:</span><span class="mi">5</span><span class="p">])</span>
<span class="lineno">14 </span><span class="p">[</span><span class="mf">30.00384338</span> <span class="mf">25.02556238</span> <span class="mf">30.56759672</span> <span class="mf">28.60703649</span> <span class="mf">27.94352423</span><span class="p">]</span>
<span class="lineno">15 </span><span class="o">>>></span>
<span class="lineno">16 </span><span class="o">>>></span> <span class="c1"># now, use sklearn's linear regression model</span>
<span class="lineno">17 </span><span class="o">...</span> <span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="k">import</span> <span class="n">LinearRegression</span>
<span class="lineno">18 </span><span class="o">>>></span> <span class="n">reg</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">()</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="lineno">19 </span><span class="o">>>></span> <span class="nb">print</span><span class="p">(</span><span class="n">reg</span><span class="o">.</span><span class="n">coef_</span><span class="p">)</span>
<span class="lineno">20 </span><span class="p">[</span><span class="o">-</span><span class="mf">1.08011358e-01</span> <span class="mf">4.64204584e-02</span> <span class="mf">2.05586264e-02</span> <span class="mf">2.68673382e+00</span>
<span class="lineno">21 </span> <span class="o">-</span><span class="mf">1.77666112e+01</span> <span class="mf">3.80986521e+00</span> <span class="mf">6.92224640e-04</span> <span class="o">-</span><span class="mf">1.47556685e+00</span>
<span class="lineno">22 </span> <span class="mf">3.06049479e-01</span> <span class="o">-</span><span class="mf">1.23345939e-02</span> <span class="o">-</span><span class="mf">9.52747232e-01</span> <span class="mf">9.31168327e-03</span>
<span class="lineno">23 </span> <span class="o">-</span><span class="mf">5.24758378e-01</span><span class="p">]</span>
<span class="lineno">24 </span><span class="o">>>></span> <span class="nb">print</span><span class="p">(</span><span class="n">reg</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)[:</span><span class="mi">5</span><span class="p">])</span>
<span class="lineno">25 </span><span class="p">[</span><span class="mf">30.00384338</span> <span class="mf">25.02556238</span> <span class="mf">30.56759672</span> <span class="mf">28.60703649</span> <span class="mf">27.94352423</span><span class="p">]</span></code></pre></figure><p>The results from our own linear regression model are almost identical to those from the <code>lm()</code> function and the <code>sklearn.linear_model</code> module, which means we have done a good job so far.</p>
<h2 id="hypothesis">Linear hypothesis testing</h2>
<p>I’m not a big fan of applying hypothesis testing in data science or machine learning. But sometimes it is irreplaceable, or at least useful, for example in the Design of Experiments (<span class="caps">DOE</span>).</p>
<p>Most applications of hypothesis testing in machine learning are about feature/variable selection. There is a lot of debate on whether hypothesis-testing-based feature selection is good or bad. The major criticism of such an approach is that the entire process is built on the data used for model training, while the model’s performance on testing data is not considered at all.</p>
<p>I think it is still worth giving a brief introduction to hypothesis testing in linear regression, as it remains popular among data scientists with a statistician’s mindset. I assume the reader already has a basic idea of hypothesis testing, p-values, and significance levels.</p>
<p>If you have done linear regression in statistical software (R, Stata, <span class="caps">SPSS</span>, Minitab, etc.), you may have noticed that the output contains the p-value<sup class="footnote" id="fnr9"><a href="#fn9">9</a></sup> and t-statistic of the coefficient of each variable. If the p-value is less than a pre-determined significance level (0.1 or 0.05 is usually used in practice), the null hypothesis (always denoted as $H_0$) should be rejected. The hypothesis against the null hypothesis is called the alternative hypothesis (denoted as $H_1$). An example of $H_0$ and $H_1$ regarding model \eqref{eq:2} could be stated as:</p>
<p>$$<br />
\begin{equation}<br />
\begin{split}<br />
H_0: \boldsymbol{\beta}_1&=0 \\<br />
H_1:\boldsymbol{\beta}_1 &\neq0.<br />
\end{split}<br />
\label{eq:8}<br />
\end{equation}<br />
$$</p>
<p>If the p-value of $\boldsymbol{\beta}_1$ suggests not rejecting $H_0$, we may exclude the corresponding feature and re-fit the linear model.</p>
<p>The theory behind the hypothesis testing in \eqref{eq:8} is not complex. But we have to make an additional assumption for our linear regression model. More specifically, we assume the error term $\boldsymbol{\epsilon}$ follows a multivariate normal distribution, i.e., $\boldsymbol{\epsilon}{\sim}N(\boldsymbol{0},\sigma^2\boldsymbol{I})$. Based on this assumption, we can conclude that $\boldsymbol{\hat\beta}$ also follows a multivariate normal distribution. Furthermore, we can construct a test statistic<sup class="footnote" id="fnr10"><a href="#fn10">10</a></sup> containing $\boldsymbol{\hat\beta_j}$ which follows a t-distribution<sup class="footnote" id="fnr11"><a href="#fn11">11</a></sup>. The additional assumption we made for hypothesis testing is not required for least squares estimation. It is also not one of the assumptions of the Gauss-Markov theorem<sup class="footnote" id="fnr12"><a href="#fn12">12</a></sup>.</p>
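<p>To make the test statistic concrete, here is a minimal numerical sketch (my own illustration, not code from this book) that computes t-statistics and p-values for OLS coefficients under the normality assumption, using $\hat\sigma^2=\boldsymbol{e}’\boldsymbol{e}/(n-p-1)$ and $se(\hat\beta_j)=\sqrt{\hat\sigma^2[(\boldsymbol{X}’\boldsymbol{X})^{-1}]_{jj}}$:</p>

```python
import numpy as np
from scipy import stats


def coef_t_test(X, y):
    """t-statistics and p-values for OLS coefficients,
    assuming epsilon ~ N(0, sigma^2 I). Illustrative sketch only."""
    n, p = X.shape
    Xd = np.hstack((np.ones((n, 1)), X))      # prepend intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    dof = n - p - 1
    sigma2 = resid @ resid / dof              # unbiased estimate of sigma^2
    cov = sigma2 * np.linalg.inv(Xd.T @ Xd)   # covariance of beta_hat
    se = np.sqrt(np.diag(cov))
    t_stat = beta / se
    p_val = 2 * stats.t.sf(np.abs(t_stat), dof)  # two-sided p-values
    return beta, t_stat, p_val


# simulated data: y depends on the first column only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=200)
beta, t_stat, p_val = coef_t_test(X, y)
```

With this simulated data, the p-value for the first feature's coefficient should be tiny (reject $H_0$), while the second feature's p-value is typically large.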
<p>The example given in \eqref{eq:8} focuses on a single coefficient. Actually, it is a special case of a more general test, called the linear hypothesis test. In linear hypothesis testing, we are interested in a linear combination of the coefficients, i.e.,</p>
<p>$$<br />
\begin{equation}<br />
\begin{split}<br />
H_0: \boldsymbol{A}\boldsymbol{\beta}+\boldsymbol{b}&=0\\<br />
H_1: \boldsymbol{A}\boldsymbol{\beta} +\boldsymbol{b}&\neq0.<br />
\end{split}<br />
\label{eq:9}<br />
\end{equation}<br />
$$</p>
<p>Linear hypothesis testing is usually conducted by constructing an F-distributed test statistic. This general form of hypothesis testing is very powerful. For example, we may start from a full model with a larger list of variables to train a linear regression model; we may also exclude some variables from the full list to train a reduced model. By setting proper values of $\boldsymbol{A}$ and $\boldsymbol{b}$ in \eqref{eq:9}, we can conduct a linear hypothesis test to accept or reject the reduced model. For example, if the full model involves three variables and the reduced model only keeps the first variable, we will set</p>
<p>$$\boldsymbol{A}=\begin{bmatrix}<br />
0 & 0 & 0 \\<br />
0 & 1 & 0 \\<br />
0 & 0 & 1 <br />
\end{bmatrix}<br />
$$<br />
and <br />
$$\boldsymbol{b}=\begin{bmatrix}<br />
0 \\<br />
0 \\<br />
0 \\<br />
\end{bmatrix}.<br />
$$</p>
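<p>The full-vs-reduced comparison above can also be run numerically. Below is a sketch (my own illustration, not code from this book) of the equivalent nested-model F-test, where $F=\big((SSR_{reduced}-SSR_{full})/q\big)\big/\big(SSR_{full}/(n-p_{full})\big)$ and $q$ is the number of restrictions:</p>

```python
import numpy as np
from scipy import stats


def f_test_nested(X_full, X_reduced, y):
    """F-test comparing a full OLS model against a nested reduced model.
    Illustrative sketch only."""
    def rss(X):
        Xd = np.hstack((np.ones((len(y), 1)), X))  # add intercept
        beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
        e = y - Xd @ beta
        return e @ e, Xd.shape[1]

    rss_f, p_f = rss(X_full)
    rss_r, p_r = rss(X_reduced)
    q = p_f - p_r                       # number of restrictions tested
    dof = len(y) - p_f
    F = ((rss_r - rss_f) / q) / (rss_f / dof)
    return F, stats.f.sf(F, q, dof)     # F statistic and p-value


rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
# only the first variable matters here, so dropping the last two is harmless
y = 0.5 + 1.5 * X[:, 0] + rng.normal(size=300)
F, p = f_test_nested(X, X[:, :1], y)

# a case where a dropped variable does matter: the test should reject
y2 = 0.5 + 1.5 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=300)
F2, p2 = f_test_nested(X, X[:, :1], y2)
```

In the first case the p-value is typically large (keep the reduced model); in the second it is essentially zero (reject the reduced model).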
<h2 id="ridge">Ridge regression</h2>
<p>What is the problem with solving the linear regression model specified by \eqref{eq:3}? There is nothing wrong at all with that approach. But I would like to cite my favorite quote – “Essentially, all models are wrong, but some are useful”<sup class="footnote" id="fnr13"><a href="#fn13">13</a></sup>. \eqref{eq:3} provides one solution to model \eqref{eq:1}. There are alternative solutions, such as lasso regression and ridge regression. In this section, let’s focus on ridge regression.</p>
<p>What is ridge regression? Ridge regression doesn’t change the model itself. Instead, it changes the way the model parameters are estimated. With naive <span class="caps">OLS</span>, we minimize the <span class="caps">SSR</span> directly. In ridge regression, we change the objective function (commonly called the loss function in machine learning) by adding a penalty term</p>
<p>$$<br />
\begin{equation}\label{eq:10}<br />
\min_{\boldsymbol{\hat\beta}} \quad \boldsymbol{e}’\boldsymbol{e} + \lambda\boldsymbol{\beta}’\boldsymbol{\beta}.<br />
\end{equation}<br />
$$</p>
<p>The optimization problem \eqref{eq:10} is still an unconstrained optimization problem and it is convex. It can also be formulated as a constrained convex optimization problem as</p>
<p>$$<br />
\begin{equation}<br />
\begin{split}<br />
\min_{\boldsymbol{\hat\beta}}& \quad \boldsymbol{e}’\boldsymbol{e}\\<br />
subject\quad to\quad &\boldsymbol{\beta}’\boldsymbol{\beta}\leq{t}.<br />
\end{split}<br />
\label{eq:11}<br />
\end{equation}<br />
$$</p>
<p>The theory behind ridge regression can be found in The Elements of Statistical Learning<sup class="footnote" id="fnr1"><a href="#fn1">1</a></sup>. Let’s turn our attention to the implementation of ridge regression. The solution to \eqref{eq:11} can be obtained in the same way as the solution to \eqref{eq:3}, i.e.,</p>
<p>$$<br />
\begin{equation}\label{eq:12}<br />
\boldsymbol{\hat\beta}=(\boldsymbol{X}’\boldsymbol{X}+\lambda\boldsymbol{I})^{-1}\boldsymbol{X}’\boldsymbol{y}.<br />
\end{equation}<br />
$$</p>
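<p>The closed form in \eqref{eq:12} is easy to check numerically. Below is an illustrative sketch of mine (not how this chapter implements ridge regression); note that with $\lambda=0$ it reduces to ordinary least squares, and a larger $\lambda$ shrinks the coefficient vector toward zero:</p>

```python
import numpy as np


def ridge_closed_form(X, y, lam):
    """beta_hat = (X'X + lambda*I)^{-1} X'y -- illustrative only;
    in practice QR/SVD-based solvers are preferred for stability."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=100)

beta_0 = ridge_closed_form(X, y, 0.0)      # lambda = 0: plain OLS
beta_10 = ridge_closed_form(X, y, 10.0)    # lambda = 10: shrunken coefficients

ols_check, *_ = np.linalg.lstsq(X, y, rcond=None)  # reference OLS solution
```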
<p>Again, in practice we don’t use \eqref{eq:12} to implement ridge regression, for the same reasons that we don’t use \eqref{eq:6} to solve linear regression without a penalty.</p>
<p>Actually, we don’t need new techniques to solve \eqref{eq:10}. Let’s apply a transformation to the objective function in \eqref{eq:10}:</p>
<p>$$<br />
\begin{equation}<br />
\begin{split}<br />
&\boldsymbol{e}’\boldsymbol{e} + \lambda\boldsymbol{\beta}’\boldsymbol{\beta}=\sum_{i=1}^{n}(\boldsymbol{y}_i-\boldsymbol{x}’_i\boldsymbol{\beta})^2+\sum_{i=1}^{p}(0-\sqrt{\lambda}\boldsymbol{\beta}_i)^2 ,<br />
\end{split}<br />
\label{eq:13}<br />
\end{equation}<br />
$$</p>
<p>where $\boldsymbol{x}_i=[1,\boldsymbol{X}_{1i},\boldsymbol{X}_{2i},…,\boldsymbol{X}_{pi}]’$.</p>
<p>Let’s define an augmented data set:</p>
<p>$$<br />
\begin{equation}\label{eq:14}<br />
\boldsymbol{X}_\lambda=<br />
\begin{bmatrix}<br />
1 & \boldsymbol{X}_{11} & \boldsymbol{X}_{21} & \cdots & \boldsymbol{X}_{p1} \\<br />
1 & \boldsymbol{X}_{12} & \boldsymbol{X}_{22} & \cdots & \boldsymbol{X}_{p2} \\<br />
\vdots & \vdots & \vdots & \vdots & \vdots \\<br />
1 & \boldsymbol{X}_{1n} & \boldsymbol{X}_{2n} & \cdots & \boldsymbol{X}_{pn} \\<br />
\sqrt{\lambda} & 0 & 0 & \cdots & 0\\<br />
0 & \sqrt{\lambda} & 0 & \cdots & 0\\<br />
0 & 0 & \sqrt{\lambda} & \cdots & 0\\<br />
\vdots & \vdots & \vdots & \ddots & \vdots\\<br />
0 & 0 & 0 & 0 & \sqrt{\lambda}\\<br />
\end{bmatrix} , \quad <br />
\boldsymbol{y}_{\lambda} =<br />
\begin{bmatrix}<br />
\boldsymbol{y}_1 \\<br />
\boldsymbol{y}_2 \\<br />
\vdots \\<br />
\boldsymbol{y}_n\\<br />
0\\<br />
0\\<br />
0\\<br />
\vdots\\<br />
0<br />
\end{bmatrix}.<br />
\end{equation}<br />
$$</p>
<p>If we regress $\boldsymbol{y}_{\lambda}$ on $\boldsymbol{X}_\lambda$, the <span class="caps">OLS</span> solution is just what we are looking for. However, the penalty is usually not applied to the intercept. Thus, we modify $\boldsymbol{y}_{\lambda}$ and $\boldsymbol{X}_\lambda$ to</p>
<p>$$<br />
\begin{equation}\label{eq:15}<br />
\boldsymbol{X}_\lambda=<br />
\begin{bmatrix}<br />
1 & \boldsymbol{X}_{11} & \boldsymbol{X}_{21} & \cdots & \boldsymbol{X}_{p1} \\<br />
1 & \boldsymbol{X}_{12} & \boldsymbol{X}_{22} & \cdots & \boldsymbol{X}_{p2} \\<br />
\vdots & \vdots & \vdots & \vdots & \vdots \\<br />
1 & \boldsymbol{X}_{1n} & \boldsymbol{X}_{2n} & \cdots & \boldsymbol{X}_{pn} \\<br />
0 & \sqrt{\lambda} & 0 & \cdots & 0\\<br />
0 & 0 & \sqrt{\lambda} & \cdots & 0\\<br />
\vdots & \vdots & \vdots & \ddots & \vdots\\<br />
0 & 0 & 0 & 0 & \sqrt{\lambda}\\<br />
\end{bmatrix} , \quad <br />
\boldsymbol{y}_{\lambda} =<br />
\begin{bmatrix}<br />
\boldsymbol{y}_1 \\<br />
\boldsymbol{y}_2 \\<br />
\vdots \\<br />
\boldsymbol{y}_n\\<br />
0\\<br />
0\\<br />
\vdots\\<br />
0<br />
\end{bmatrix}.<br />
\end{equation}<br />
$$</p>
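<p>Before moving to the full implementation, we can sanity-check the augmented-data trick numerically. The sketch below (my own illustration) uses a design matrix without an intercept column for brevity, so it checks the \eqref{eq:14}-style augmentation: running plain OLS on $(\boldsymbol{X}_\lambda, \boldsymbol{y}_{\lambda})$ reproduces the closed-form ridge solution \eqref{eq:12} exactly:</p>

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, lam = 50, 3, 2.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# closed-form ridge solution, as in eq. (12)
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# augmented-data trick: stack sqrt(lambda)*I under X and zeros under y,
# then solve ordinary least squares on the augmented system
X_aug = np.vstack((X, np.sqrt(lam) * np.eye(p)))
y_aug = np.concatenate((y, np.zeros(p)))
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
```

The two solutions agree because the normal equations of the augmented OLS problem are exactly $(\boldsymbol{X}’\boldsymbol{X}+\lambda\boldsymbol{I})\boldsymbol{\hat\beta}=\boldsymbol{X}’\boldsymbol{y}$.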
<p>Now we are ready to implement our own ridge regression model based on the description above. It is also common to normalize<sup class="footnote" id="fnr14"><a href="#fn14">14</a></sup> the independent variables before applying ridge regression, so that the different variables are on the same order of magnitude.</p>
<language>R</language>
<p>chapter4/linear_regression_ridge.R<br />
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span></span><span class="lineno"> 1 </span><span class="kn">library</span><span class="p">(</span>R6<span class="p">)</span>
<span class="lineno"> 2 </span>LR_Ridge <span class="o">=</span> R6Class<span class="p">(</span>
<span class="lineno"> 3 </span> <span class="s">"LR_Ridge"</span><span class="p">,</span>
<span class="lineno"> 4 </span> public <span class="o">=</span> <span class="kt">list</span><span class="p">(</span>
<span class="lineno"> 5 </span> coef <span class="o">=</span> <span class="kc">NULL</span><span class="p">,</span>
<span class="lineno"> 6 </span> mu <span class="o">=</span> <span class="kc">NULL</span><span class="p">,</span>
<span class="lineno"> 7 </span> sd <span class="o">=</span> <span class="kc">NULL</span><span class="p">,</span>
<span class="lineno"> 8 </span> lambda <span class="o">=</span> <span class="kc">NULL</span><span class="p">,</span>
<span class="lineno"> 9 </span> initialize <span class="o">=</span> <span class="kr">function</span><span class="p">(</span>lambda<span class="p">)</span> <span class="p">{</span>
<span class="lineno">10 </span> self<span class="o">$</span>lambda <span class="o">=</span> lambda
<span class="lineno">11 </span> <span class="p">},</span>
<span class="lineno">12 </span> scale <span class="o">=</span> <span class="kr">function</span><span class="p">(</span>x<span class="p">)</span> <span class="p">{</span>
<span class="lineno">13 </span> self<span class="o">$</span>mu <span class="o">=</span> <span class="kp">apply</span><span class="p">(</span>x<span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="kp">mean</span><span class="p">)</span>
<span class="lineno">14 </span> self<span class="o">$</span>sd <span class="o">=</span> <span class="kp">apply</span><span class="p">(</span>x<span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="kr">function</span><span class="p">(</span>e<span class="p">)</span> <span class="p">{</span>
<span class="lineno">15 </span> <span class="kp">sqrt</span><span class="p">((</span><span class="kp">length</span><span class="p">(</span>e<span class="p">)</span> <span class="o">-</span> <span class="m">1</span><span class="p">)</span> <span class="o">/</span> <span class="kp">length</span><span class="p">(</span>e<span class="p">))</span> <span class="o">*</span> sd<span class="p">(</span>e<span class="p">)</span>
<span class="lineno">16 </span> <span class="p">})</span>
<span class="lineno">17 </span> <span class="p">},</span>
<span class="lineno">18 </span> transform <span class="o">=</span> <span class="kr">function</span><span class="p">(</span>x<span class="p">)</span> <span class="p">{</span>
<span class="lineno">19 </span> <span class="kr">return</span><span class="p">(</span><span class="kp">t</span><span class="p">((</span><span class="kp">t</span><span class="p">(</span>x<span class="p">)</span> <span class="o">-</span> self<span class="o">$</span>mu<span class="p">)</span> <span class="o">/</span> self<span class="o">$</span>sd<span class="p">))</span>
<span class="lineno">20 </span> <span class="p">},</span>
<span class="lineno">21 </span> fit <span class="o">=</span> <span class="kr">function</span><span class="p">(</span>x<span class="p">,</span> y<span class="p">)</span> <span class="p">{</span>
<span class="lineno">22 </span> self<span class="o">$</span><span class="kp">scale</span><span class="p">(</span>x<span class="p">)</span>
<span class="lineno">23 </span> x_transformed <span class="o">=</span> self<span class="o">$</span><span class="kp">transform</span><span class="p">(</span>x<span class="p">)</span>
<span class="lineno">24 </span> x_lambda <span class="o">=</span> <span class="kp">rbind</span><span class="p">(</span>x_transformed<span class="p">,</span> <span class="kp">diag</span><span class="p">(</span><span class="kp">rep</span><span class="p">(</span><span class="kp">sqrt</span><span class="p">(</span>self<span class="o">$</span>lambda<span class="p">),</span> <span class="kp">ncol</span><span class="p">(</span>x<span class="p">))))</span>
<span class="lineno">25 </span> y_lambda <span class="o">=</span> <span class="kt">c</span><span class="p">(</span>y<span class="p">,</span> <span class="kp">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="kp">ncol</span><span class="p">(</span>x<span class="p">)))</span>
<span class="lineno">26 </span> self<span class="o">$</span>qr_solver<span class="p">(</span><span class="kp">cbind</span><span class="p">(</span><span class="kt">c</span><span class="p">(</span><span class="kp">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="kp">nrow</span><span class="p">(</span>
<span class="lineno">27 </span> x
<span class="lineno">28 </span> <span class="p">)),</span> <span class="kp">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="kp">ncol</span><span class="p">(</span>
<span class="lineno">29 </span> x
<span class="lineno">30 </span> <span class="p">))),</span> x_lambda<span class="p">),</span> y_lambda<span class="p">)</span>
<span class="lineno">31 </span> <span class="p">},</span>
<span class="lineno">32 </span> qr_solver <span class="o">=</span> <span class="kr">function</span><span class="p">(</span>x<span class="p">,</span> y<span class="p">)</span> <span class="p">{</span>
<span class="lineno">33 </span> self<span class="o">$</span>coef <span class="o">=</span> <span class="kp">qr.coef</span><span class="p">(</span><span class="kp">qr</span><span class="p">(</span>x<span class="p">),</span> y<span class="p">)</span>
<span class="lineno">34 </span> <span class="p">},</span>
<span class="lineno">35 </span> predict <span class="o">=</span> <span class="kr">function</span><span class="p">(</span>new_x<span class="p">)</span> <span class="p">{</span>
<span class="lineno">36 </span> new_x_transformed <span class="o">=</span> self<span class="o">$</span><span class="kp">transform</span><span class="p">(</span>new_x<span class="p">)</span>
<span class="lineno">37 </span> <span class="kp">cbind</span><span class="p">(</span><span class="kp">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="kp">nrow</span><span class="p">(</span>new_x<span class="p">)),</span> new_x_transformed<span class="p">)</span> <span class="o">%*%</span> self<span class="o">$</span>coef
<span class="lineno">38 </span> <span class="p">}</span>
<span class="lineno">39 </span> <span class="p">)</span>
<span class="lineno">40 </span><span class="p">)</span></code></pre></figure></p>
<language>Python</language>
<p>chapter4/linear_regression_ridge.py<br />
<figure class="highlight"><pre><code class="language-python3" data-lang="python3"><span></span><span class="lineno"> 1 </span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="lineno"> 2 </span><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="k">import</span> <span class="n">StandardScaler</span>
<span class="lineno"> 3 </span>
<span class="lineno"> 4 </span>
<span class="lineno"> 5 </span><span class="k">class</span> <span class="nc">LR_Ridge</span><span class="p">:</span>
<span class="lineno"> 6 </span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">l</span><span class="p">):</span>
<span class="lineno"> 7 </span> <span class="bp">self</span><span class="o">.</span><span class="n">l</span> <span class="o">=</span> <span class="n">l</span>
<span class="lineno"> 8 </span> <span class="bp">self</span><span class="o">.</span><span class="n">coef</span> <span class="o">=</span> <span class="kc">None</span>
<span class="lineno"> 9 </span> <span class="bp">self</span><span class="o">.</span><span class="n">scaler</span> <span class="o">=</span> <span class="n">StandardScaler</span><span class="p">()</span>
<span class="lineno">10 </span>
<span class="lineno">11 </span> <span class="k">def</span> <span class="nf">qr_solver</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="lineno">12 </span> <span class="n">q</span><span class="p">,</span> <span class="n">r</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">qr</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="lineno">13 </span> <span class="n">p</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">q</span><span class="o">.</span><span class="n">T</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="lineno">14 </span> <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">inv</span><span class="p">(</span><span class="n">r</span><span class="p">),</span> <span class="n">p</span><span class="p">)</span>
<span class="lineno">15 </span>
<span class="lineno">16 </span> <span class="k">def</span> <span class="nf">fit</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="lineno">17 </span> <span class="n">x_transformed</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">scaler</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="lineno">18 </span> <span class="n">x_lambda</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vstack</span><span class="p">(</span>
<span class="lineno">19 </span> <span class="p">(</span><span class="n">x_transformed</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">diag</span><span class="p">([</span><span class="bp">self</span><span class="o">.</span><span class="n">l</span><span class="o">**</span><span class="mf">0.5</span><span class="p">]</span><span class="o">*</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">])))</span>
<span class="lineno">20 </span> <span class="n">x_lambda</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">(</span>
<span class="lineno">21 </span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">vstack</span><span class="p">((</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">1</span><span class="p">)),</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="mi">1</span><span class="p">)))),</span> <span class="n">x_lambda</span><span class="p">))</span>
<span class="lineno">22 </span> <span class="n">y_lambda</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">((</span><span class="n">y</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],))))</span>
<span class="lineno">23 </span> <span class="bp">self</span><span class="o">.</span><span class="n">coef</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">qr_solver</span><span class="p">(</span><span class="n">x_lambda</span><span class="p">,</span> <span class="n">y_lambda</span><span class="p">)</span>
<span class="lineno">24 </span>
<span class="lineno">25 </span> <span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="lineno">26 </span> <span class="n">new_x_transformed</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">scaler</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="lineno">27 </span> <span class="n">new_x_transformed</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">(</span>
<span class="lineno">28 </span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="mi">1</span><span class="p">)),</span> <span class="n">new_x_transformed</span><span class="p">)</span>
<span class="lineno">29 </span> <span class="p">)</span>
<span class="lineno">30 </span> <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">new_x_transformed</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">coef</span><span class="p">)</span></code></pre></figure></p>
<p>In R, we implement our own scaler; in the Python implementation, we use <code>StandardScaler</code> from the <code>sklearn.preprocessing</code> module, which is very handy. Please pay attention to how we calculate the standard deviation<sup class="footnote" id="fnr15"><a href="#fn15">15</a></sup> in R – we are using the formula for the population standard deviation rather than the formula for the sample standard deviation. In practice, which formula we use doesn’t matter much. I chose the population formula so that the R results are consistent with those of <code>StandardScaler</code>, since <code>StandardScaler</code> uses the population formula.</p>
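<p>A quick sketch (my own illustration) of the difference: NumPy’s <code>std</code> defaults to the population formula (<code>ddof=0</code>, dividing by $n$), while the sample formula divides by $n-1$ (<code>ddof=1</code>); <code>StandardScaler</code> matches the former:</p>

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])
scaler = StandardScaler().fit(x)

pop_sd = x[:, 0].std(ddof=0)     # population formula: divide by n
sample_sd = x[:, 0].std(ddof=1)  # sample formula: divide by n - 1
# scaler.scale_ holds the per-column standard deviation StandardScaler uses,
# which agrees with the population formula
```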
<p>Selecting the best $\lambda$ requires solving the <span class="caps">OLS</span> problem repeatedly with different values of $\lambda$, which implies the QR decomposition procedure would be called multiple times. Using the <span class="caps">SVD</span> decomposition could be more efficient for selecting the best $\lambda$, but it won’t be covered in the current version of this book.</p>
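<p>For the curious reader, the SVD shortcut can be sketched as follows (an illustrative sketch of mine, not this book’s implementation). With $\boldsymbol{X}=\boldsymbol{U}\boldsymbol{D}\boldsymbol{V}’$, the ridge solution is $\boldsymbol{\hat\beta}(\lambda)=\boldsymbol{V}\,\mathrm{diag}\big(d_i/(d_i^2+\lambda)\big)\,\boldsymbol{U}’\boldsymbol{y}$, so a single decomposition serves every candidate $\lambda$:</p>

```python
import numpy as np


def ridge_path_svd(X, y, lambdas):
    """Ridge solutions for many lambdas from one SVD of X.
    beta(lambda) = V diag(d / (d^2 + lambda)) U'y"""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    Uty = U.T @ y                      # computed once, reused for all lambdas
    return [Vt.T @ (d / (d**2 + lam) * Uty) for lam in lambdas]


rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))
y = rng.normal(size=60)
betas = ridge_path_svd(X, y, [0.1, 1.0, 10.0])

# reference: closed-form ridge solution for lambda = 1.0
beta_check = np.linalg.solve(X.T @ X + 1.0 * np.eye(4), X.T @ y)
```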
<p>We are ready to run our own ridge regression on the Boston dataset.</p>
<language>R</language>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span></span><span class="lineno"> 1 </span><span class="o">></span> <span class="kn">source</span><span class="p">(</span><span class="s">'linear_regression_ridge.R'</span><span class="p">)</span>
<span class="lineno"> 2 </span><span class="o">></span>
<span class="lineno"> 3 </span><span class="o">></span> <span class="kn">library</span><span class="p">(</span>MASS<span class="p">)</span> <span class="c1"># load Boston data from this package</span>
<span class="lineno"> 4 </span><span class="o">></span>
<span class="lineno"> 5 </span><span class="o">></span> <span class="c1"># let's try lambda = 0.5</span>
<span class="lineno"> 6 </span><span class="o">></span> ridge <span class="o">=</span> LR_Ridge<span class="o">$</span>new<span class="p">(</span><span class="m">0.5</span><span class="p">)</span>
<span class="lineno"> 7 </span><span class="o">></span> ridge<span class="o">$</span>fit<span class="p">(</span><span class="kp">data.matrix</span><span class="p">(</span>Boston<span class="p">[,</span><span class="o">-</span><span class="kp">ncol</span><span class="p">(</span>Boston<span class="p">)]),</span>Boston<span class="o">$</span>medv<span class="p">)</span>
<span class="lineno"> 8 </span><span class="o">></span> <span class="kp">print</span><span class="p">(</span>ridge<span class="o">$</span>coef<span class="p">)</span>
<span class="lineno"> 9 </span> crim zn indus chas nox
<span class="lineno">10 </span><span class="m">22.53280632</span> <span class="m">-0.92396151</span> <span class="m">1.07393055</span> <span class="m">0.12895159</span> <span class="m">0.68346136</span> <span class="m">-2.04275750</span>
<span class="lineno">11 </span> rm age dis rad tax ptratio
<span class="lineno">12 </span> <span class="m">2.67854971</span> <span class="m">0.01627328</span> <span class="m">-3.09063352</span> <span class="m">2.62636926</span> <span class="m">-2.04312573</span> <span class="m">-2.05646414</span>
<span class="lineno">13 </span> black lstat
<span class="lineno">14 </span> <span class="m">0.84905910</span> <span class="m">-3.73711409</span>
<span class="lineno">15 </span><span class="o">></span> <span class="c1"># let's make prediction on the same data</span>
<span class="lineno">16 </span><span class="o">></span> pred<span class="o">=</span>ridge<span class="o">$</span>predict<span class="p">(</span><span class="kp">data.matrix</span><span class="p">(</span>Boston<span class="p">[,</span><span class="o">-</span><span class="kp">ncol</span><span class="p">(</span>Boston<span class="p">)]))</span>
<span class="lineno">17 </span><span class="o">></span> <span class="kp">print</span><span class="p">(</span>pred<span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">5</span><span class="p">])</span>
<span class="lineno">18 </span><span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">30.01652</span> <span class="m">25.02429</span> <span class="m">30.56839</span> <span class="m">28.61521</span> <span class="m">27.95385</span></code></pre></figure><language>Python</language>
<figure class="highlight"><pre><code class="language-python3" data-lang="python3"><span></span><span class="lineno"> 1 </span><span class="o">>>></span> <span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="k">import</span> <span class="n">load_boston</span>
<span class="lineno"> 2 </span><span class="o">>>></span> <span class="kn">from</span> <span class="nn">linear_regression_ridge</span> <span class="k">import</span> <span class="n">LR_Ridge</span>
<span class="lineno"> 3 </span><span class="o">>>></span>
<span class="lineno"> 4 </span><span class="o">>>></span> <span class="n">boston</span> <span class="o">=</span> <span class="n">load_boston</span><span class="p">()</span>
<span class="lineno"> 5 </span><span class="o">>>></span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">boston</span><span class="o">.</span><span class="n">data</span><span class="p">,</span> <span class="n">boston</span><span class="o">.</span><span class="n">target</span>
<span class="lineno"> 6 </span><span class="o">>>></span> <span class="c1"># first, let's run our own linear regression</span>
<span class="lineno"> 7 </span><span class="o">...</span>
<span class="lineno"> 8 </span><span class="o">>>></span> <span class="n">ridge</span> <span class="o">=</span> <span class="n">LR_Ridge</span><span class="p">(</span><span class="mf">0.5</span><span class="p">)</span>
<span class="lineno"> 9 </span><span class="o">>>></span> <span class="n">ridge</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="lineno">10 </span><span class="o">>>></span> <span class="nb">print</span><span class="p">(</span><span class="n">ridge</span><span class="o">.</span><span class="n">coef</span><span class="p">)</span>
<span class="lineno">11 </span><span class="p">[</span> <span class="mf">2.25328063e+01</span> <span class="o">-</span><span class="mf">9.23961511e-01</span> <span class="mf">1.07393055e+00</span> <span class="mf">1.28951591e-01</span>
<span class="lineno">12 </span> <span class="mf">6.83461360e-01</span> <span class="o">-</span><span class="mf">2.04275750e+00</span> <span class="mf">2.67854971e+00</span> <span class="mf">1.62732755e-02</span>
<span class="lineno">13 </span> <span class="o">-</span><span class="mf">3.09063352e+00</span> <span class="mf">2.62636926e+00</span> <span class="o">-</span><span class="mf">2.04312573e+00</span> <span class="o">-</span><span class="mf">2.05646414e+00</span>
<span class="lineno">14 </span> <span class="mf">8.49059103e-01</span> <span class="o">-</span><span class="mf">3.73711409e+00</span><span class="p">]</span>
<span class="lineno">15 </span><span class="o">>>></span> <span class="nb">print</span><span class="p">(</span><span class="n">ridge</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)[:</span><span class="mi">5</span><span class="p">])</span>
<span class="lineno">16 </span><span class="p">[</span><span class="mf">30.01652397</span> <span class="mf">25.02429359</span> <span class="mf">30.56839459</span> <span class="mf">28.61520864</span> <span class="mf">27.95385422</span><span class="p">]</span></code></pre></figure><p>It’s excited to see the outputs from R and Python are quite consistent.</p>
<p>Linear regression is not as simple as it seems. To learn more about it, I recommend the book Econometric Analysis<sup class="footnote" id="fnr2"><a href="#fn2">2</a></sup>.</p>
<hr />
<p style="vertical-align:middle;" class="footnote" id="fn1"><a href="#fnr1"><sup>1</sup></a> Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Springer series in statistics New York, NY, <span class="caps">USA</span>:, second edition, 2009.</p>
<p class="footnote" id="fn2"><a href="#fnr2"><sup>2</sup></a> William H Greene. Econometric analysis. Pearson Education India, 2003.</p>
<p class="footnote" id="fn3"><a href="#fnr3"><sup>3</sup></a> https://en.wikipedia.org/wiki/Ordinary_least_squares</p>
<p class="footnote" id="fn4"><a href="#fnr4"><sup>4</sup></a> https://en.wikipedia.org/wiki/Maximum_likelihood_estimation</p>
<p class="footnote" id="fn5"><a href="#fnr5"><sup>5</sup></a> Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.</p>
<p class="footnote" id="fn6"><a href="#fnr6"><sup>6</sup></a> https://github.com/numpy/numpy/issues/10471</p>
<p class="footnote" id="fn7"><a href="#fnr7"><sup>7</sup></a> https://en.wikipedia.org/wiki/QR_decomposition</p>
<p class="footnote" id="fn8"><a href="#fnr8"><sup>8</sup></a> https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html</p>
<p class="footnote" id="fn9"><a href="#fnr9"><sup>9</sup></a> https://www.statsdirect.com/help/basics/p_values.htm</p>
<p class="footnote" id="fn10"><a href="#fnr10"><sup>10</sup></a> https://en.wikipedia.org/wiki/Test_statistic</p>
<p class="footnote" id="fn11"><a href="#fnr11"><sup>11</sup></a> https://en.wikipedia.org/wiki/Student\%27s_t-distribution</p>
<p class="footnote" id="fn12"><a href="#fnr12"><sup>12</sup></a> https://en.wikipedia.org/wiki/Gauss-Markov_theorem</p>
<p class="footnote" id="fn13"><a href="#fnr13"><sup>13</sup></a> https://en.wikipedia.org/wiki/All_models_are_wrong</p>
<p class="footnote" id="fn14"><a href="#fnr14"><sup>14</sup></a> https://en.wikipedia.org/wiki/Normalization_(statistics)</p>
</div> <!-- /container-fluid -->
<!-- Footer -->
<div id="footer">
<div class="inner">
<div class="container-fluid">
<div class="row">
<div class="span16">
<p style="text-align:center">
© 2022 Anotherbookondatascience.com
</p>
</div>
</div>
</div>
</div>
</div>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [ ['$','$'] ],
processEscapes: true
},
TeX: { equationNumbers: { autoNumber: "AMS" },
Macros: {
sb: "_"
} }
});
</script>
<script
type="text/javascript"
charset="utf-8"
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"
>
</script>
</body>
</html>