Lab 2

Introduction to Statistical Learning - PISE

Author: Aldo Solari

Affiliation: Ca’ Foscari University of Venice

Linear Regression

Libraries

  • library() loads packages
  • Packages provide functions and datasets beyond those in base R
  • Basic linear regression tools are in base R
  • More specialized tools require additional packages
  • Here we load:
    • MASS
    • ISLR2
library(MASS)
library(ISLR2)

Attaching package: 'ISLR2'
The following object is masked from 'package:MASS':

    Boston
  • If a package is not installed, library() returns an error
  • Some packages (e.g. MASS) usually come with R
  • Others (e.g. ISLR2) may need to be installed first
  • Install once with install.packages("ISLR2")
  • Then load the package in each new R session with library()
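A common pattern (a sketch; `requireNamespace()` checks whether a package is available without attaching it) is to install only when needed:

```r
# Install ISLR2 only if it is not already available, then load it
if (!requireNamespace("ISLR2", quietly = TRUE)) {
  install.packages("ISLR2")
}
library(ISLR2)
```

This avoids re-downloading the package every time the script is run.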

Simple Linear Regression

  • Boston is a dataset in ISLR2
  • Response variable: medv (median house value)
  • Predictors include variables such as:
    • rm
    • age
    • lstat
head(Boston)
     crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7
  • Use ?Boston to see the dataset documentation

  • Fit a simple linear regression with lm()

  • Syntax: lm(y ~ x, data = dataset)

  • Here:

    • response = medv
    • predictor = lstat
lm.fit <- lm(medv ~ lstat)
Error in eval(predvars, data, env): object 'medv' not found
  • This gives an error because R does not know where medv and lstat are stored
  • Fix:
    • specify data = Boston, or
    • use attach(Boston)
lm.fit <- lm(medv ~ lstat, data = Boston)
attach(Boston)
lm.fit <- lm(medv ~ lstat)
  • Typing lm.fit gives basic model output
  • summary(lm.fit) gives more details:
    • coefficients
    • standard errors
    • p-values
    • R^2
    • F-statistic
lm.fit

Call:
lm(formula = medv ~ lstat)

Coefficients:
(Intercept)        lstat  
      34.55        -0.95  
summary(lm.fit)

Call:
lm(formula = medv ~ lstat)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.168  -3.990  -1.318   2.034  24.500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,    Adjusted R-squared:  0.5432 
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
  • Use names() to see what is stored in the fitted model object
  • Use extractor functions such as coef() to access components safely
names(lm.fit)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        
coef(lm.fit)
(Intercept)       lstat 
 34.5538409  -0.9500494 
  • Use confint() to compute confidence intervals for coefficients
confint(lm.fit)
                2.5 %     97.5 %
(Intercept) 33.448457 35.6592247
lstat       -1.026148 -0.8739505
  • Use predict() to compute:
    • confidence intervals for the mean response
    • prediction intervals for a new observation
predict(lm.fit, data.frame(lstat = c(5, 10, 15)),
    interval = "confidence")
       fit      lwr      upr
1 29.80359 29.00741 30.59978
2 25.05335 24.47413 25.63256
3 20.30310 19.73159 20.87461
predict(lm.fit, data.frame(lstat = c(5, 10, 15)),
    interval = "prediction")
       fit       lwr      upr
1 29.80359 17.565675 42.04151
2 25.05335 12.827626 37.27907
3 20.30310  8.077742 32.52846
  • Confidence and prediction intervals are centered at the same fitted value

  • Prediction intervals are wider than confidence intervals
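These two properties can be checked directly. A minimal sketch, using the built-in `cars` data so it runs without any extra packages:

```r
# Sketch: confidence and prediction intervals share the same center,
# but the prediction interval is wider (built-in cars data)
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = 15)
ci <- predict(fit, new, interval = "confidence")
pi <- predict(fit, new, interval = "prediction")
ci[, "fit"] == pi[, "fit"]                                  # same fitted value
(pi[, "upr"] - pi[, "lwr"]) > (ci[, "upr"] - ci[, "lwr"])   # PI is wider
```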

  • Plot the data and add the least-squares regression line

plot(lstat, medv)
abline(lm.fit)

  • The plot suggests some possible non-linearity

  • We will revisit this later

  • abline() can add any line to a plot

  • abline(a, b) draws a line with:

    • intercept = a
    • slope = b
  • Plot options:

    • lwd controls line width
    • col controls color
    • pch controls plotting symbol
plot(lstat, medv)
abline(lm.fit, lwd = 3)
abline(lm.fit, lwd = 3, col = "red")

plot(lstat, medv, col = "red")

plot(lstat, medv, pch = 20)

plot(lstat, medv, pch = "+")

plot(1:20, 1:20, pch = 1:20)

Non-linear Transformations of the Predictors

  • lm() can include transformed predictors
  • Use I() when arithmetic operators would otherwise be interpreted specially in a formula
  • Example: I(lstat^2) adds a quadratic term
lm.fit2 <- lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(lm.fit2)

Call:
lm(formula = medv ~ lstat + I(lstat^2), data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.2834  -3.8313  -0.5295   2.3095  25.4148 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 42.862007   0.872084   49.15   <2e-16 ***
lstat       -2.332821   0.123803  -18.84   <2e-16 ***
I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.524 on 503 degrees of freedom
Multiple R-squared:  0.6407,    Adjusted R-squared:  0.6393 
F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16
  • The quadratic term appears to improve the model
  • Compare the linear and quadratic models with anova()
lm.fit <- lm(medv ~ lstat, data = Boston)
anova(lm.fit, lm.fit2)
Analysis of Variance Table

Model 1: medv ~ lstat
Model 2: medv ~ lstat + I(lstat^2)
  Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
1    504 19472                                 
2    503 15347  1    4125.1 135.2 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Model 1: linear fit with lstat

  • Model 2: quadratic fit with lstat and lstat^2

  • anova() tests whether the larger model improves fit significantly

  • Very small p-value indicates strong evidence in favor of the quadratic model

  • Check the residual diagnostics for the quadratic model

par(mfrow = c(2, 2))
plot(lm.fit2)

  • With the quadratic term included, residual patterns are reduced

  • For higher-order polynomials, poly() is more convenient than writing each power manually

  • Example: fifth-order polynomial fit

lm.fit5 <- lm(medv ~ poly(lstat, 5), data = Boston)
summary(lm.fit5)

Call:
lm(formula = medv ~ poly(lstat, 5), data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.5433  -3.1039  -0.7052   2.0844  27.1153 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       22.5328     0.2318  97.197  < 2e-16 ***
poly(lstat, 5)1 -152.4595     5.2148 -29.236  < 2e-16 ***
poly(lstat, 5)2   64.2272     5.2148  12.316  < 2e-16 ***
poly(lstat, 5)3  -27.0511     5.2148  -5.187 3.10e-07 ***
poly(lstat, 5)4   25.4517     5.2148   4.881 1.42e-06 ***
poly(lstat, 5)5  -19.2524     5.2148  -3.692 0.000247 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.215 on 500 degrees of freedom
Multiple R-squared:  0.6817,    Adjusted R-squared:  0.6785 
F-statistic: 214.2 on 5 and 500 DF,  p-value: < 2.2e-16
  • Higher-order polynomial terms can improve fit

  • In this example, terms beyond fifth order are not significant

  • By default, poly() uses orthogonal polynomials

  • These are not the raw powers of the predictor

  • Fitted values are the same as with raw polynomials, but coefficients differ

  • Use raw = TRUE if raw powers are desired
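The equivalence of the fits can be verified directly. A minimal sketch using the built-in `cars` data: the orthogonal and raw bases give identical fitted values but different coefficients.

```r
# Sketch (built-in cars data): orthogonal vs raw polynomial bases
fit_orth <- lm(dist ~ poly(speed, 2), data = cars)
fit_raw  <- lm(dist ~ poly(speed, 2, raw = TRUE), data = cars)
all.equal(fitted(fit_orth), fitted(fit_raw))   # TRUE: same fitted values
coef(fit_orth)                                 # coefficients differ ...
coef(fit_raw)                                  # ... because the bases differ
```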

  • We are not restricted to polynomial transformations

  • Example: log transformation

summary(lm(medv ~ log(rm), data = Boston))

Call:
lm(formula = medv ~ log(rm), data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.487  -2.875  -0.104   2.837  39.816 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -76.488      5.028  -15.21   <2e-16 ***
log(rm)       54.055      2.739   19.73   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.915 on 504 degrees of freedom
Multiple R-squared:  0.4358,    Adjusted R-squared:  0.4347 
F-statistic: 389.3 on 1 and 504 DF,  p-value: < 2.2e-16

Cross-Validation

  • In this lab, we explore resampling methods
  • Focus:
    • validation set approach
    • leave-one-out cross-validation (LOOCV)
    • k-fold cross-validation
  • Some commands may take a while to run

The Validation Set Approach

  • Use the validation set approach to estimate test error
  • Work with the Auto data set
  • Set a random seed for reproducibility
  • This ensures the same training/validation split each time
library(ISLR2)
set.seed(1)
train <- sample(392, 196)
  • Split the data into two halves
  • Here, train contains the indices of the training observations
  • Fit the model only on the training set using subset = train
lm.fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
  • Use predict() to compute fitted values for all observations
  • Compute validation MSE on the observations not in the training set
  • -train selects the validation observations
attach(Auto)
mean((mpg - predict(lm.fit, Auto))[-train]^2)
[1] 23.26601
  • Estimate test MSE for quadratic and cubic models
  • Use poly() to fit polynomial regressions
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, 
    subset = train)
mean((mpg - predict(lm.fit2, Auto))[-train]^2)
[1] 18.71646
lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto, 
    subset = train)
mean((mpg - predict(lm.fit3, Auto))[-train]^2)
[1] 18.79401
  • Different training/validation splits give slightly different results
  • Repeat with a different random seed
set.seed(2)
train <- sample(392, 196)
lm.fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
mean((mpg - predict(lm.fit, Auto))[-train]^2)
[1] 25.72651
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, 
    subset = train)
mean((mpg - predict(lm.fit2, Auto))[-train]^2)
[1] 20.43036
lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto, 
    subset = train)
mean((mpg - predict(lm.fit3, Auto))[-train]^2)
[1] 20.38533
  • Main result:
    • quadratic model improves over linear
    • little evidence that cubic improves further

Leave-One-Out Cross-Validation

  • LOOCV can be computed with cv.glm()
  • cv.glm() works with models fit using glm()
  • Without a family argument, glm() uses family = gaussian, i.e. ordinary linear regression
  • So glm() and lm() give the same fit here
glm.fit <- glm(mpg ~ horsepower, data = Auto)
coef(glm.fit)
(Intercept)  horsepower 
 39.9358610  -0.1578447 
lm.fit <- lm(mpg ~ horsepower, data = Auto)
coef(lm.fit)
(Intercept)  horsepower 
 39.9358610  -0.1578447 
  • Use glm() because it works with cv.glm()
  • cv.glm() is in the boot package
library(boot)
glm.fit <- glm(mpg ~ horsepower, data = Auto)
cv.err <- cv.glm(Auto, glm.fit)
cv.err$delta
[1] 24.23151 24.23114
  • cv.glm() returns a list

  • delta contains the cross-validation estimates

  • For LOOCV, the two values are essentially the same
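As an aside, for a least-squares fit the LOOCV error need not be computed by refitting n times: it has a closed form based on the leverages, LOOCV = mean[(e_i / (1 - h_i))^2]. A sketch using the built-in `cars` data:

```r
# Sketch: closed-form LOOCV MSE for a linear model via leverages,
# avoiding the n refits (built-in cars data)
fit <- lm(dist ~ speed, data = cars)
h <- hatvalues(fit)                        # leverage of each observation
loocv_mse <- mean((residuals(fit) / (1 - h))^2)
loocv_mse
```

This matches the estimate from an explicit leave-one-out loop, but is far cheaper to compute.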

  • Compute LOOCV error for polynomial models of degree 1 to 10

  • Store the results in a vector

cv.error <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error[i] <- cv.glm(Auto, glm.fit)$delta[1]
}
cv.error
 [1] 24.23151 19.24821 19.33498 19.42443 19.03321 18.97864 18.83305 18.96115
 [9] 19.06863 19.49093
  • Main pattern:
    • large drop from linear to quadratic
    • little improvement beyond degree 2

k-Fold Cross-Validation

  • cv.glm() can also perform k-fold CV
  • Here we use:
    • k = 10
  • Again evaluate polynomial models of degree 1 to 10
set.seed(17)
cv.error.10 <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error.10[i] <- cv.glm(Auto, glm.fit, K = 10)$delta[1]
}
cv.error.10
 [1] 24.27207 19.26909 19.34805 19.29496 19.03198 18.89781 19.12061 19.14666
 [9] 18.87013 20.95520
  • k-fold CV is faster to compute than LOOCV here
  • Results are similar:
    • quadratic model improves over linear
    • higher-order terms add little
  • With LOOCV, the two values in delta are almost identical
  • With k-fold CV, they differ slightly:
    • first value = standard k-fold CV estimate
    • second value = bias-corrected estimate
  • On this data set, the difference is very small
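The two components of delta can be inspected on any glm() fit. A minimal sketch, assuming the boot package is available and using the built-in `cars` data:

```r
library(boot)

set.seed(1)
fit <- glm(dist ~ speed, data = cars)   # linear regression via glm()
cv5 <- cv.glm(cars, fit, K = 5)
cv5$delta   # first: raw k-fold CV estimate; second: bias-corrected estimate
```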