Introduction to Statistical Learning - PISE
Ca’ Foscari University of Venice
This unit will cover the following topics: linear regression and least squares, diagnostics and transformations of the linear model, qualitative predictors and interaction terms, and model selection (best subset and stepwise selection, C_p, BIC, adjusted R^2, and cross-validation).
These data give the Volume (cubic feet), Height (feet), and Diameter (inches, measured at 54 inches above ground) for a sample of n = 31 black cherry trees in the Allegheny National Forest, Pennsylvania.
The data were collected in order to estimate the volume of a tree Y (and therefore the timber yield) from its diameter X_1 and height X_2. The first six observations are shown below.
| Volume (ft³) | Diameter (in) | Height (ft) |
|---|---|---|
| 10.3 | 8.3 | 70 |
| 10.3 | 8.6 | 65 |
| 10.2 | 8.8 | 63 |
| 16.4 | 10.5 | 72 |
| 18.8 | 10.7 | 81 |
| 19.7 | 10.8 | 83 |
Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X_1,X_2,\ldots,X_p is linear.
The linear model is given by Y = \beta_0 + \beta_1X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \varepsilon
Given estimates \hat{\beta}_0,\hat{\beta}_1,\ldots,\hat{\beta}_p we can make predictions using the formula \hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \ldots + \hat\beta_p x_p
We estimate \beta_0,\beta_1,\ldots,\beta_p as the values that minimize the sum of squared residuals
\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat y_i)^2= \sum_{i=1}^{n} (y_i - \hat{\beta}_0- \hat{\beta}_1 x_{i1}- \hat{\beta}_2 x_{i2}- \ldots - \hat{\beta}_p x_{ip})^2
We interpret \beta_j as the average effect on Y of a one-unit increase in X_j, holding all other predictors fixed.
But predictors usually change together! For example, in the tree data, larger Diameter is typically associated with greater Height. Holding Height fixed while increasing Diameter may be unrealistic.
Model 1 (baseline) regresses Volume on both predictors: \mathrm{Volume} = \beta_0 + \beta_1 \times \mathrm{Diameter} + \beta_2 \times \mathrm{Height} + \varepsilon
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | -57.9877 | 8.6382 | -6.7129 | 0.0000 |
| Diameter | 4.7082 | 0.2643 | 17.8161 | 0.0000 |
| Height | 0.3393 | 0.1302 | 2.6066 | 0.0145 |
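As a minimal sketch of how such a fit could be produced in Python with statsmodels (assuming the 31 observations are available in a hypothetical file trees.csv with columns Volume, Diameter, Height):

```python
# Minimal sketch: least-squares fit of Volume ~ Diameter + Height.
# Assumes a hypothetical trees.csv containing the 31 observations.
import pandas as pd
import statsmodels.formula.api as smf

trees = pd.read_csv("trees.csv")  # hypothetical file name

model1 = smf.ols("Volume ~ Diameter + Height", data=trees).fit()
print(model1.summary())  # estimates, std. errors, t and p values
```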
When we fit a linear regression model to a particular data set, many problems may occur. Most common among these are the following:
Non-linearity of the response-predictor relationships.
Correlation of error terms.
Non-constant variance of error terms.
Outliers.
High-leverage points.
Collinearity.
Residual plots are a useful graphical tool for identifying non-linearity.
Plot the residuals, e_i = y_i - \hat{y}_i, against the fitted values \hat{y}_i.
A systematic pattern in this plot may indicate a violation of the linearity assumption.
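A sketch of such a residual plot, reusing the model1 fit from the earlier sketch:

```python
# Sketch: residuals e_i = y_i - y_hat_i against fitted values y_hat_i.
import matplotlib.pyplot as plt

plt.scatter(model1.fittedvalues, model1.resid)
plt.axhline(0, linestyle="--", color="grey")  # reference line at zero
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```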
Model 2 applies a cube-root transformation to the response, leading to the model \mathrm{Volume}^{1/3} = \beta_0 + \beta_1 \times \mathrm{Diameter} + \beta_2 \times \mathrm{Height} + \varepsilon in which both sides have the dimension of length.
After fitting the model, we obtain the predicted cube-root volume
\widehat{\mathrm{Volume}^{1/3}} = \hat\beta_0 + \hat\beta_1 \mathrm{Diameter} + \hat\beta_2 \mathrm{Height}.
To recover Volume on the original scale, we cube the fitted value: \widehat{\mathrm{Volume}} = \left( \widehat{\mathrm{Volume}^{1/3}} \right)^3.
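A sketch of this model (again assuming the trees DataFrame from the first sketch; patsy formulas can apply numpy functions directly to the response):

```python
# Sketch: fit the cube-root response model, then cube back to ft^3.
import numpy as np
import statsmodels.formula.api as smf

model2 = smf.ols("np.cbrt(Volume) ~ Diameter + Height", data=trees).fit()
pred_cbrt = model2.fittedvalues   # predictions of Volume^(1/3)
pred_volume = pred_cbrt ** 3      # back on the original Volume scale
```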
A tree trunk can be viewed as a cylinder (or a cone): \mathrm{Volume} = (\text{Area of base}) \times (\text{Height}). Since the base area is proportional to the square of the diameter, \text{Area of base} = \pi \left(\frac{\mathrm{Diameter}}{2}\right)^2 we obtain \mathrm{Volume} = \frac{\pi}{4} \times \mathrm{Diameter}^2 \times \mathrm{Height}.
The previous reasoning suggests the regression model (Model 3) \mathrm{Volume} = \beta \times \mathrm{Diameter}^2 \times \mathrm{Height} + \varepsilon, where \beta = \frac{\pi}{4} if trees were perfect cylinders or \beta = \frac{\pi}{12} if they were perfect cones (since Diameter is in inches while Volume and Height are in feet, the fitted \beta also absorbs the conversion factor 1/144).
| Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|
| 0.0021 | 0.0000 | 77.4365 | 0.0000 |
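A sketch of this geometric model, fit as a regression through the origin:

```python
# Sketch: no-intercept regression of Volume on Diameter^2 * Height.
# The "- 1" in the formula removes the intercept.
import statsmodels.formula.api as smf

trees["D2H"] = trees["Diameter"] ** 2 * trees["Height"]
model3 = smf.ols("Volume ~ D2H - 1", data=trees).fit()
print(model3.params)
# With Diameter in inches, a perfect cylinder would give
# beta = (pi/4)/144 ~= 0.0055 and a perfect cone (pi/12)/144 ~= 0.0018.
```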
Model 4 applies a logarithmic transformation to both the response and predictors:
\log \mathrm{Volume} = \beta_0 + \beta_1 \times \log \mathrm{Diameter} + \beta_2 \times \log \mathrm{Height} + \varepsilon.
For the tree data, geometry suggests
\mathrm{Volume} = \beta \times \mathrm{Diameter}^2 \times \mathrm{Height},
and taking logs gives
\log \mathrm{Volume} = \log \beta + 2 \times \log \mathrm{Diameter} + 1 \times \log \mathrm{Height}. Thus the log–log specification allows these exponents to be estimated from the data rather than fixed by geometry.
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | -6.6316 | 0.7998 | -8.2917 | 0 |
| log(Diameter) | 1.9826 | 0.0750 | 26.4316 | 0 |
| log(Height) | 1.1171 | 0.2044 | 5.4644 | 0 |
More generally, consider a model for the log-transformed response: \log Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon.
If \varepsilon \sim N(0,\sigma^2), then \log Y \mid X \sim N(\mu,\sigma^2), where \mu = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p and X=(X_1,\dots,X_p)^T.
Then Y \mid X \sim \text{Lognormal}(\mu,\sigma^2). For a lognormal random variable: \text{Median}(Y \mid X) = \exp(\mu), \qquad \mathbb{E}(Y \mid X) = \exp(\mu + \sigma^2/2).
Thus \exp(\hat\mu)= \exp(\hat\beta_0+\hat\beta_1 X_1+\cdots+\hat\beta_p X_p) estimates \text{Median}(Y \mid X), while \mathbb{E}(Y \mid X) is larger because the distribution of Y|X is right-skewed.
To estimate \mathbb{E}(Y \mid X), we use the bias-corrected predictor \exp(\hat\mu)\exp(\hat\sigma^2/2) = \exp(\hat\mu + \hat\sigma^2/2); this gives the bias-corrected log–log predictions (Model 4bc) in the comparison below.
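A sketch of the log–log fit and of both back-transformed predictors (mse_resid is the statsmodels estimate of \sigma^2):

```python
# Sketch: log-log fit, then back-transform with and without bias correction.
import numpy as np
import statsmodels.formula.api as smf

model4 = smf.ols("np.log(Volume) ~ np.log(Diameter) + np.log(Height)",
                 data=trees).fit()
mu_hat = model4.fittedvalues     # estimates of E(log Volume | X)
sigma2_hat = model4.mse_resid    # estimate of sigma^2
pred_median = np.exp(mu_hat)                  # estimates Median(Y | X)
pred_mean = np.exp(mu_hat + sigma2_hat / 2)   # estimates E(Y | X)
```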
To compare the candidate models using LOOCV (leave-one-out cross-validation) on the original Volume scale, we proceed as follows (see the sketch after this list):
Leave one tree out
Fit the model on the remaining 30 observations
Predict the left-out tree
Compute squared error on the original Volume (ft³) scale
Repeat for all 31 trees
Compare root mean squared prediction error (LOOCV RMSE)
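A minimal sketch of this loop, under the assumption that each candidate model is wrapped in a function (the hypothetical fit_predict argument) returning predictions in ft³:

```python
# Sketch: LOOCV mean squared error on the original Volume scale.
import numpy as np
import statsmodels.formula.api as smf

def loocv_mse(df, fit_predict):
    errors = []
    for i in range(len(df)):
        train = df.drop(df.index[i])   # leave one tree out
        test = df.iloc[[i]]
        pred = fit_predict(train, test)
        errors.append((test["Volume"].iloc[0] - pred) ** 2)
    return np.mean(errors)

# Example wrapper for the cube-root model (Model 2):
def cbrt_fit_predict(train, test):
    fit = smf.ols("np.cbrt(Volume) ~ Diameter + Height", data=train).fit()
    return fit.predict(test).iloc[0] ** 3   # cube back to ft^3

rmse2 = np.sqrt(loocv_mse(trees, cbrt_fit_predict))
```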
| Model | Description | LOOCV MSE | LOOCV RMSE |
|---|---|---|---|
| Model 1 | Baseline | 18.158 | 4.261 |
| Model 2 | Cube-root | 7.314 | 2.705 |
| Model 3 | Cylinder | 6.399 | 2.530 |
| Model 4 | Log-log | 6.984 | 2.643 |
| Model 4bc | Log-log (bias-corrected) | 6.958 | 2.638 |
We now turn to the Credit data set. The response variable is Balance (average credit card debt for each individual).
The quantitative predictors are:
Age — age in years
Cards — number of credit cards
Education — years of education
Income — income (in thousands of dollars)
Limit — credit limit
Rating — credit rating
Some predictors are not quantitative but qualitative, meaning they take values from a discrete set of categories.
These are also called categorical predictors or factor variables.
In regression, qualitative predictors must be represented using indicator (dummy) variables.
In the Credit data, in addition to the seven quantitative variables (Balance plus the six predictors above), there are four qualitative variables:
Gender
Student (student status)
Married (marital status)
Ethnicity (Caucasian, African American, or Asian)
Ethnicity has three levels, and we create two dummy variables:
x_{i1} = \begin{cases} 1 & \text{if the } i\text{th person is Asian} \\ 0 & \text{otherwise} \end{cases}, \qquad x_{i2} = \begin{cases} 1 & \text{if the } i\text{th person is Caucasian} \\ 0 & \text{otherwise} \end{cases}
Using both dummy variables in the regression model y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i, we obtain
y_i = \begin{cases} \beta_0 + \beta_1 + \varepsilon_i & \text{if Asian} \\ \beta_0 + \beta_2 + \varepsilon_i & \text{if Caucasian} \\ \beta_0 + \varepsilon_i & \text{if African American} \end{cases}
There is always one fewer dummy variable than the number of levels.
The level without a dummy variable (African American here) is called the baseline level.
The coefficients \beta_1 and \beta_2 measure differences relative to the baseline.
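A sketch of this dummy coding in Python (assuming a hypothetical Credit.csv with the variables described above); with pandas, drop_first=True makes the alphabetically first level, African American, the baseline:

```python
# Sketch: dummy variables for the three-level Ethnicity factor.
import pandas as pd
import statsmodels.formula.api as smf

credit = pd.read_csv("Credit.csv")  # hypothetical file name

# Explicit dummies: two columns for three levels (baseline dropped).
dummies = pd.get_dummies(credit["Ethnicity"], drop_first=True)

# The formula interface creates the dummies automatically:
fit = smf.ols("Balance ~ Ethnicity", data=credit).fit()
print(fit.summary())  # compare with the table below
```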
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 531.000 | 46.319 | 11.464 | 0.000 |
| EthnicityAsian | -18.686 | 65.021 | -0.287 | 0.774 |
| EthnicityCaucasian | -12.503 | 56.681 | -0.221 | 0.826 |
Consider the Credit data set. Suppose we wish to predict Balance using Income (quantitative) and Student (qualitative).
Define an indicator variable x_i = \begin{cases} 1 & \text{if the } i\text{th person is a student} \\ 0 & \text{otherwise} \end{cases}
Without an interaction term, the model is \text{balance}_i = \beta_0 + \beta_1 \text{income}_i +\beta_2 x_i+\varepsilon_i.
This implies
\text{balance}_i = \begin{cases} (\beta_0 + \beta_2) + \beta_1 \text{income}_i + \varepsilon_i & \text{if student} \\ \beta_0 + \beta_1 \text{income}_i + \varepsilon_i & \text{if not student} \end{cases}
Include an interaction between Income and Student:
\text{balance}_i = \beta_0 + \beta_1 \text{income}_i + \beta_2 x_i + \beta_3 (\text{income}_i x_i) + \varepsilon_i,
This implies
\text{balance}_i = \begin{cases} (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\text{income}_i + \varepsilon_i & \text{if student} \\ \beta_0 + \beta_1 \text{income}_i + \varepsilon_i & \text{if not student} \end{cases}
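A sketch of both fits, reusing the credit DataFrame from the earlier sketch; in the formula language, Income * Student expands to Income + Student + Income:Student:

```python
# Sketch: Balance regressions without and with the interaction term.
import statsmodels.formula.api as smf

no_inter = smf.ols("Balance ~ Income + Student", data=credit).fit()
with_inter = smf.ols("Balance ~ Income * Student", data=credit).fit()
# beta_3 (the Income:Student coefficient) is the difference in slopes
# between students and non-students.
print(with_inter.params)
```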
Figure: Credit data. Left: no interaction between Income and Student. Right: with an interaction term between Income and Student.
The most direct approach is called best subsets regression: we compute the least squares fit for all possible subsets and then choose between them based on some criterion that balances training error with model size.
However, we often cannot examine all possible models, since there are 2^p of them; for example, when p = 40 there are over a trillion models!
Let \mathcal M_0 denote the null model, which contains no predictors. This model predicts the sample mean of the response for all observations.
For k = 1, 2, \dots, p:
Fit all \binom{p}{k} models that contain exactly k predictors.
Pick the best among these models, i.e. the one with the smallest RSS (equivalently, the largest R^2), and call it \mathcal M_k.
From the models \mathcal M_0, \mathcal M_1, \dots, \mathcal M_p, choose a final model using cross-validated prediction error, Cp (AIC), BIC, or adjusted R^2
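A brute-force sketch of the procedure (feasible only for small p, as the 2^p count above warns):

```python
# Sketch: best subset selection; for each size k, keep the fit with the
# smallest RSS among all C(p, k) candidate models.
from itertools import combinations
import statsmodels.formula.api as smf

def best_subset(df, response, predictors):
    best = {}
    for k in range(1, len(predictors) + 1):
        for subset in combinations(predictors, k):
            fit = smf.ols(response + " ~ " + " + ".join(subset),
                          data=df).fit()
            if k not in best or fit.ssr < best[k].ssr:
                best[k] = fit   # best[k] plays the role of M_k
    return best
```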
Figure: For each possible model containing a subset of the ten predictors in the Credit data set, the RSS and R^2 are displayed. The red frontier tracks the best model for a given number of predictors, according to RSS and R^2. Though the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two dummy variables.
Forward stepwise selection starts with the null model (no predictors).
Predictors are added one at a time.
At each step, the predictor that provides the greatest additional improvement in model fit is added.
The procedure continues until all predictors are included.
Let \mathcal M_0 denote the null model (no predictors).
For k = 0, 1, \dots, p-1:
Consider all p - k models that augment the predictors in \mathcal M_k with one additional predictor.
Choose the best among these p - k models (smallest RSS or largest R^2) and call it \mathcal M_{k+1}.
From the sequence of models \mathcal M_0, \mathcal M_1, \dots, \mathcal M_p, select a final model using cross-validated prediction error, Cp (AIC), BIC, or adjusted R^2
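A sketch of the forward pass, at each step adding the predictor that most reduces the RSS:

```python
# Sketch: forward stepwise selection.
import statsmodels.formula.api as smf

def forward_stepwise(df, response, predictors):
    selected, models = [], {}
    remaining = list(predictors)
    while remaining:
        # RSS of each model that adds one predictor to the current set
        scores = [(smf.ols(response + " ~ " + " + ".join(selected + [p]),
                           data=df).fit().ssr, p) for p in remaining]
        _, best_p = min(scores)
        selected.append(best_p)
        remaining.remove(best_p)
        models[len(selected)] = smf.ols(
            response + " ~ " + " + ".join(selected), data=df).fit()
    return models   # models[k] plays the role of M_k
```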
The first four selected models using best subset selection and forward stepwise selection:
| # Variables | Best Subset | Forward Stepwise |
|---|---|---|
| One | rating | rating |
| Two | rating, income | rating, income |
| Three | rating, income, student | rating, income, student |
| Four | cards, income, student, limit | rating, income, student, limit |
The first three models are identical, but the four-variable models differ.
The model containing all of the predictors will always have the smallest RSS and the largest R^2, since these quantities are related to the training error.
We wish to choose a model with low test error, not a model with low training error. Recall that training error is usually a poor estimate of test error.
Therefore, RSS and R^2 are not suitable for selecting the best model among a collection of models with different numbers of predictors.
We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting.
We can directly estimate the test error, using either a validation set approach or a cross-validation approach, as discussed in previous lectures
The adjustment-based techniques (C_p, AIC, BIC, and adjusted R^2) correct the training error for model size and can be used to select among a set of models with different numbers of variables.
The next figure displays C_p, BIC, and adjusted R^2 for the best model of each size produced by best subset selection on the Credit data set.
C_p = \frac{1}{n} \left( \text{RSS} + 2 d \hat{\sigma}^2 \right),
where d is the total number of parameters used and \hat{\sigma}^2 is an estimate of the variance of the error \varepsilon associated with each response measurement.
\text{BIC} = \frac{1}{n} \left( \text{RSS} + \log(n)\, d \hat{\sigma}^2 \right).
Like C_p, the Bayesian Information Criterion (BIC) tends to be small for models with low test error.
We typically select the model with the lowest BIC.
Notice that BIC replaces the 2 d \hat{\sigma}^2 term used by C_p with a \log(n)\, d \hat{\sigma}^2 term, where n is the number of observations.
Since \log(n) > 2 for any n > 7, BIC generally places a heavier penalty on models with many variables.
As a result, BIC tends to select smaller models than C_p.
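As an illustration, a small sketch computing both criteria exactly as defined above for a statsmodels OLS fit; sigma2_hat is an estimate of \sigma^2, e.g. the residual mean square of the full model:

```python
# Sketch: Cp and BIC as defined above, for a fitted OLS model.
import numpy as np

def cp_bic(fit, sigma2_hat):
    n = fit.nobs
    d = fit.df_model + 1   # parameters used, including the intercept
    cp = (fit.ssr + 2 * d * sigma2_hat) / n
    bic = (fit.ssr + np.log(n) * d * sigma2_hat) / n
    return cp, bic
```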
Each model selection procedure returns a sequence of models \mathcal M_k indexed by model size k = 0, 1, 2, \dots. Our goal is to choose \hat{k}.
We compute the validation set error or the cross-validation error for each model \mathcal M_k and select the value \hat{k} for which the estimated test error is smallest; we then return the model \mathcal M_{\hat{k}}.
This approach has an advantage over AIC, BIC, C_p, and adjusted R^2: it provides a direct estimate of test error and does not require an estimate of the error variance \sigma^2.
It can also be applied in a wider range of model selection problems, including settings where the model degrees of freedom are difficult to determine or where estimating \sigma^2 is challenging.
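Combined with the LOOCV helper sketched earlier, the final selection step is short; cv_errors and models below are hypothetical containers produced by the procedures above:

```python
# Sketch: pick k-hat minimizing the estimated test error, return M_{k-hat}.
# cv_errors: dict mapping model size k to its estimated test MSE.
# models: dict mapping model size k to the fitted model M_k.
k_hat = min(cv_errors, key=cv_errors.get)
final_model = models[k_hat]
```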
Video SL 3.1 Simple Linear Regression - 13:02
Video SL 3.3 Multiple Linear Regression - 15:38
Video SL 3.4 Some Important Questions - 14:52
Video SL 3.5 Extensions of the Linear Model - 14:17
Video SL 6.1 Introduction and Best Subset Selection - 13:45
Video SL 6.3 Backward Stepwise Selection - 5:27