Lab 2

Introduction to Statistical Learning - PISE

Aldo Solari

Ca’ Foscari University of Venice

Setup

  • Set chunk options
  • Allow errors in output

Linear Regression

Libraries

  • library() loads packages
  • Packages contain functions and datasets not always included in base R
  • Basic linear regression tools are in base R
  • More specialized tools require additional packages
  • Here we load:
    • MASS
    • ISLR2
  • If a package is not installed, library() returns an error
  • Some packages (e.g. MASS) usually come with R
  • Others (e.g. ISLR2) may need to be installed first
  • Install once with install.packages("ISLR2")
  • Then load the package in each new R session with library()

Simple Linear Regression

  • Boston is a dataset in ISLR2
  • Response variable: medv (median house value)
  • Predictors include variables such as:
    • rm
    • age
    • lstat
     crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7
  • Use ?Boston to see the dataset documentation

  • Fit a simple linear regression with lm()

  • Syntax: lm(y ~ x, data = dataset)

  • Here:

    • response = medv
    • predictor = lstat
Error in eval(predvars, data, env): object 'medv' not found
  • This gives an error because R does not know where medv and lstat are stored
  • Fix:
    • specify data = Boston, or
    • use attach(Boston)
  • Typing lm.fit gives basic model output
  • summary(lm.fit) gives more details:
    • coefficients
    • standard errors
    • p-values
    • R^2
    • F-statistic

Call:
lm(formula = medv ~ lstat)

Coefficients:
(Intercept)        lstat  
      34.55        -0.95  

Call:
lm(formula = medv ~ lstat)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.168  -3.990  -1.318   2.034  24.500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,    Adjusted R-squared:  0.5432 
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
  • Use names() to see what is stored in the fitted model object
  • Use extractor functions such as coef() to access components safely
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        
(Intercept)       lstat 
 34.5538409  -0.9500494 
  • Use confint() to compute confidence intervals for coefficients
                2.5 %     97.5 %
(Intercept) 33.448457 35.6592247
lstat       -1.026148 -0.8739505
  • Use predict() to compute:
    • confidence intervals for the mean response
    • prediction intervals for a new observation
       fit      lwr      upr
1 29.80359 29.00741 30.59978
2 25.05335 24.47413 25.63256
3 20.30310 19.73159 20.87461
       fit       lwr      upr
1 29.80359 17.565675 42.04151
2 25.05335 12.827626 37.27907
3 20.30310  8.077742 32.52846
  • Confidence and prediction intervals are centered at the same fitted value

  • Prediction intervals are wider than confidence intervals

  • Plot the data and add the least-squares regression line

  • The plot suggests some possible non-linearity

  • We will revisit this later

  • abline() can add any line to a plot

  • abline(a, b) draws a line with:

    • intercept = a
    • slope = b
  • Plot options:

    • lwd controls line width
    • col controls color
    • pch controls plotting symbol

Diagnostic Plots

  • plot(lm.fit) produces standard diagnostic plots for a linear model
  • By default, plots appear one at a time
  • Use par(mfrow = c(2, 2)) to show four plots in one window

  • Residuals can also be extracted directly
  • residuals() returns residuals
  • rstudent() returns studentized residuals

  • Residual plots again suggest some non-linearity

  • Use hatvalues() to compute leverage values

  • Use which.max() to find the largest one

375 
375 
  • which.max() returns the index of the largest element

Multiple Linear Regression

  • Fit multiple regression models with the same lm() function
  • Syntax: lm(y ~ x1 + x2 + x3, data = ...)

Call:
lm(formula = medv ~ lstat + age, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.981  -3.978  -1.283   1.968  23.158 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
age          0.03454    0.01223   2.826  0.00491 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.173 on 503 degrees of freedom
Multiple R-squared:  0.5513,    Adjusted R-squared:  0.5495 
F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16
  • To use all predictors, write . as shorthand

Call:
lm(formula = medv ~ ., data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.1304  -2.7673  -0.5814   1.9414  26.2526 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  41.617270   4.936039   8.431 3.79e-16 ***
crim         -0.121389   0.033000  -3.678 0.000261 ***
zn            0.046963   0.013879   3.384 0.000772 ***
indus         0.013468   0.062145   0.217 0.828520    
chas          2.839993   0.870007   3.264 0.001173 ** 
nox         -18.758022   3.851355  -4.870 1.50e-06 ***
rm            3.658119   0.420246   8.705  < 2e-16 ***
age           0.003611   0.013329   0.271 0.786595    
dis          -1.490754   0.201623  -7.394 6.17e-13 ***
rad           0.289405   0.066908   4.325 1.84e-05 ***
tax          -0.012682   0.003801  -3.337 0.000912 ***
ptratio      -0.937533   0.132206  -7.091 4.63e-12 ***
lstat        -0.552019   0.050659 -10.897  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.798 on 493 degrees of freedom
Multiple R-squared:  0.7343,    Adjusted R-squared:  0.7278 
F-statistic: 113.5 on 12 and 493 DF,  p-value: < 2.2e-16
  • Components of the summary object can be extracted by name
  • Examples:
    • summary(lm.fit)$r.sq
    • summary(lm.fit)$sigma
  • vif() from the car package computes variance inflation factors
  • VIFs help assess multicollinearity
    crim       zn    indus     chas      nox       rm      age      dis 
1.767486 2.298459 3.987181 1.071168 4.369093 1.912532 3.088232 3.954037 
     rad      tax  ptratio    lstat 
7.445301 9.002158 1.797060 2.870777 
  • Exclude one predictor using - in the formula
  • Example: remove age

Call:
lm(formula = medv ~ . - age, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.1851  -2.7330  -0.6116   1.8555  26.3838 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  41.525128   4.919684   8.441 3.52e-16 ***
crim         -0.121426   0.032969  -3.683 0.000256 ***
zn            0.046512   0.013766   3.379 0.000785 ***
indus         0.013451   0.062086   0.217 0.828577    
chas          2.852773   0.867912   3.287 0.001085 ** 
nox         -18.485070   3.713714  -4.978 8.91e-07 ***
rm            3.681070   0.411230   8.951  < 2e-16 ***
dis          -1.506777   0.192570  -7.825 3.12e-14 ***
rad           0.287940   0.066627   4.322 1.87e-05 ***
tax          -0.012653   0.003796  -3.333 0.000923 ***
ptratio      -0.934649   0.131653  -7.099 4.39e-12 ***
lstat        -0.547409   0.047669 -11.483  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.794 on 494 degrees of freedom
Multiple R-squared:  0.7343,    Adjusted R-squared:  0.7284 
F-statistic: 124.1 on 11 and 494 DF,  p-value: < 2.2e-16
  • Alternatively, update an existing model with update()

Interaction Terms

  • Include interaction terms with :
  • Example: lstat:age
  • Use * as shorthand for:
    • main effects
    • interaction term
  • lstat * age is shorthand for:
    • lstat + age + lstat:age

Call:
lm(formula = medv ~ lstat * age, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.806  -4.045  -1.333   2.085  27.552 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
age         -0.0007209  0.0198792  -0.036   0.9711    
lstat:age    0.0041560  0.0018518   2.244   0.0252 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.149 on 502 degrees of freedom
Multiple R-squared:  0.5557,    Adjusted R-squared:  0.5531 
F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16

Non-linear Transformations of the Predictors

  • lm() can include transformed predictors
  • Use I() when arithmetic operators would otherwise be interpreted specially in a formula
  • Example: I(lstat^2) adds a quadratic term

Call:
lm(formula = medv ~ lstat + I(lstat^2))

Residuals:
     Min       1Q   Median       3Q      Max 
-15.2834  -3.8313  -0.5295   2.3095  25.4148 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 42.862007   0.872084   49.15   <2e-16 ***
lstat       -2.332821   0.123803  -18.84   <2e-16 ***
I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.524 on 503 degrees of freedom
Multiple R-squared:  0.6407,    Adjusted R-squared:  0.6393 
F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16
  • The quadratic term appears to improve the model
  • Compare the linear and quadratic models with anova()
Analysis of Variance Table

Model 1: medv ~ lstat
Model 2: medv ~ lstat + I(lstat^2)
  Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
1    504 19472                                 
2    503 15347  1    4125.1 135.2 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Model 1: linear fit with lstat

  • Model 2: quadratic fit with lstat and lstat^2

  • anova() tests whether the larger model improves fit significantly

  • Very small p-value indicates strong evidence in favor of the quadratic model

  • Check the residual diagnostics for the quadratic model

  • With the quadratic term included, residual patterns are reduced

  • For higher-order polynomials, poly() is more convenient than writing each power manually

  • Example: fifth-order polynomial fit


Call:
lm(formula = medv ~ poly(lstat, 5))

Residuals:
     Min       1Q   Median       3Q      Max 
-13.5433  -3.1039  -0.7052   2.0844  27.1153 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       22.5328     0.2318  97.197  < 2e-16 ***
poly(lstat, 5)1 -152.4595     5.2148 -29.236  < 2e-16 ***
poly(lstat, 5)2   64.2272     5.2148  12.316  < 2e-16 ***
poly(lstat, 5)3  -27.0511     5.2148  -5.187 3.10e-07 ***
poly(lstat, 5)4   25.4517     5.2148   4.881 1.42e-06 ***
poly(lstat, 5)5  -19.2524     5.2148  -3.692 0.000247 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.215 on 500 degrees of freedom
Multiple R-squared:  0.6817,    Adjusted R-squared:  0.6785 
F-statistic: 214.2 on 5 and 500 DF,  p-value: < 2.2e-16
  • Higher-order polynomial terms can improve fit

  • In this example, terms beyond fifth order are not significant

  • By default, poly() uses orthogonal polynomials

  • These are not the raw powers of the predictor

  • Fitted values are the same as with raw polynomials, but coefficients differ

  • Use raw = TRUE if raw powers are desired

  • We are not restricted to polynomial transformations

  • Example: log transformation


Call:
lm(formula = medv ~ log(rm), data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.487  -2.875  -0.104   2.837  39.816 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -76.488      5.028  -15.21   <2e-16 ***
log(rm)       54.055      2.739   19.73   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.915 on 504 degrees of freedom
Multiple R-squared:  0.4358,    Adjusted R-squared:  0.4347 
F-statistic: 389.3 on 1 and 504 DF,  p-value: < 2.2e-16

Qualitative Predictors

  • Carseats is another dataset in ISLR2
  • Response variable: Sales
  • Predictors include both quantitative and qualitative variables
  Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
1  9.50       138     73          11        276   120       Bad  42        17
2 11.22       111     48          16        260    83      Good  65        10
3 10.06       113     35          10        269    80    Medium  59        12
4  7.40       117    100           4        466    97    Medium  55        14
5  4.15       141     64           3        340   128       Bad  38        13
6 10.81       124    113          13        501    72       Bad  78        16
  Urban  US
1   Yes Yes
2   Yes Yes
3   Yes Yes
4   Yes Yes
5   Yes  No
6    No Yes
  • ShelveLoc is a qualitative predictor
  • Levels:
    • Bad
    • Medium
    • Good
  • R automatically creates dummy variables for qualitative predictors

Call:
lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9208 -0.7503  0.0177  0.6754  3.3413 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         6.5755654  1.0087470   6.519 2.22e-10 ***
CompPrice           0.0929371  0.0041183  22.567  < 2e-16 ***
Income              0.0108940  0.0026044   4.183 3.57e-05 ***
Advertising         0.0702462  0.0226091   3.107 0.002030 ** 
Population          0.0001592  0.0003679   0.433 0.665330    
Price              -0.1008064  0.0074399 -13.549  < 2e-16 ***
ShelveLocGood       4.8486762  0.1528378  31.724  < 2e-16 ***
ShelveLocMedium     1.9532620  0.1257682  15.531  < 2e-16 ***
Age                -0.0579466  0.0159506  -3.633 0.000318 ***
Education          -0.0208525  0.0196131  -1.063 0.288361    
UrbanYes            0.1401597  0.1124019   1.247 0.213171    
USYes              -0.1575571  0.1489234  -1.058 0.290729    
Income:Advertising  0.0007510  0.0002784   2.698 0.007290 ** 
Price:Age           0.0001068  0.0001333   0.801 0.423812    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.011 on 386 degrees of freedom
Multiple R-squared:  0.8761,    Adjusted R-squared:  0.8719 
F-statistic:   210 on 13 and 386 DF,  p-value: < 2.2e-16
  • Use contrasts() to see the dummy-variable coding used by R
       Good Medium
Bad       0      0
Good      1      0
Medium    0      1
  • Use ?contrasts for more information

  • Here:

    • ShelveLocGood = 1 if location is good, 0 otherwise
    • ShelveLocMedium = 1 if location is medium, 0 otherwise
    • Bad is the baseline category
  • Positive coefficients indicate higher sales relative to the baseline category

Writing Functions

  • R includes many built-in functions

  • We can also write our own functions when needed

  • Example: create a function LoadLibraries()

  • Before defining it, calling the function gives an error

Error: object 'LoadLibraries' not found
Error in LoadLibraries(): could not find function "LoadLibraries"
  • Define the function with function() { ... }
  • The braces {} contain the commands in the function body
  • Typing the function name prints its definition
<srcref: file "" chars 1:18 to 5:1>
  • Calling the function executes the commands
[1] "The libraries have been loaded."