Lab 4

Decision Trees, Bagging, Random Forests, and Boosting

Author: Aldo Solari

Affiliation: Ca’ Foscari University of Venice

Decision Trees

Fitting Regression Trees

The tree library in R is used to construct regression trees

library(tree)

Fit a regression tree to the Boston dataset (from the MASS package)
Split the data into a training half and a test half
Train the tree on the training half only

library(MASS)
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
tree.boston <- tree(medv ~ ., Boston, subset = train)
summary(tree.boston)


Regression tree:
tree(formula = medv ~ ., data = Boston, subset = train)
Variables actually used in tree construction:
[1] "rm"    "lstat" "crim"  "age"  
Number of terminal nodes:  7 
Residual mean deviance:  10.38 = 2555 / 246 
Distribution of residuals:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-10.1800  -1.7770  -0.1775   0.0000   1.9230  16.5800

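summary() shows that only 4 of the 13 predictors are used in the tree: rm, lstat, crim, age
In regression trees, deviance = sum of squared errors (SSE)
Residual mean deviance = deviance / (n − number of terminal nodes), here 2555 / (253 − 7)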
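
The deviance definition can be checked by hand on a toy example. The sketch below uses simulated data (not the Boston set) and a hand-made one-split "tree"; the variable names are illustrative only.

```r
# Toy data (simulated, not the Boston set): one predictor, one response.
set.seed(1)
x <- runif(100)
y <- ifelse(x < 0.5, 10, 20) + rnorm(100)

# A one-split "tree": each leaf predicts the mean response of its observations.
leaf <- ifelse(x < 0.5, "left", "right")
leaf_mean <- ave(y, leaf)   # per-observation prediction (mean of its leaf)

# Deviance of a regression tree = sum of squared errors over all leaves.
deviance_sse <- sum((y - leaf_mean)^2)

# Residual mean deviance = deviance / (n - number of terminal nodes).
n_leaves <- length(unique(leaf))
residual_mean_deviance <- deviance_sse / (length(y) - n_leaves)
residual_mean_deviance
```

The same ratio is what summary() reports as "Residual mean deviance" for a fitted tree.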

plot(tree.boston)
text(tree.boston, pretty = 0)

lstat: percentage of the population with lower socioeconomic status
rm: average number of rooms per dwelling
Interpretation:
- Higher rm → higher house prices
- Lower lstat → higher house prices
- Example: rm ≥ 7.553 → predicted median home value ≈ $45,400 (medv is in $1000s)
Model flexibility:
- A larger tree can be grown by passing control = tree.control(nobs = length(train), mindev = 0) to tree()
Next step:
- Use cv.tree() to check by cross-validation whether pruning improves performance

cv.boston <- cv.tree(tree.boston)
plot(cv.boston$size, cv.boston$dev, type = "b")

Cross-validation selects the most complex tree in this case
Tree can still be pruned manually if desired

prune.boston <- prune.tree(tree.boston, best = 5)
plot(prune.boston)
text(prune.boston, pretty = 0)

Based on cross-validation, the unpruned tree is selected
Use the unpruned tree to make predictions on the test set

yhat <- predict(tree.boston, newdata = Boston[-train, ])
boston.test <- Boston[-train, "medv"]
plot(yhat, boston.test)
abline(0, 1)

mean((yhat - boston.test)^2)

[1] 35.28688

Test set MSE ≈ 35.29
RMSE = √35.29 ≈ 5.940 (medv is in $1000s)
Test predictions are typically within about $5,940 of the true median home value
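
The conversion from test MSE to a dollar figure is simple arithmetic. A minimal sketch, using the MSE printed above and the fact that medv is recorded in thousands of dollars:

```r
# Test MSE printed above for the unpruned tree (medv is in $1000s).
test_mse <- 35.28688

# RMSE is on the same scale as the response.
test_rmse <- sqrt(test_mse)

# Convert from thousands of dollars to dollars.
rmse_dollars <- 1000 * test_rmse

round(test_rmse, 3)    # about 5.94
round(rmse_dollars)    # about 5940
```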

Bagging and Random Forests

Apply bagging and random forests using the randomForest package in R
Bagging = special case of random forest with m = p (all predictors used at each split)
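
The idea behind bagging can be sketched in base R without the randomForest package. This is an illustration under simplifying assumptions, not how randomForest is implemented: the data are simulated, there is a single predictor (so m = p = 1 trivially), and the base learner is a one-split stump instead of a deep tree. fit_stump() and predict_stump() are hypothetical helpers written for this sketch.

```r
# Hypothetical helper: regression stump = the single split on x minimising SSE.
fit_stump <- function(x, y) {
  xs <- sort(unique(x))
  if (length(xs) < 2) return(list(cut = Inf, left = mean(y), right = mean(y)))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between distinct values
  sse <- sapply(cuts, function(cc) {
    l <- y[x < cc]
    r <- y[x >= cc]
    sum((l - mean(l))^2) + sum((r - mean(r))^2)
  })
  best <- cuts[which.min(sse)]
  list(cut = best, left = mean(y[x < best]), right = mean(y[x >= best]))
}

# Hypothetical helper: predict with a fitted stump.
predict_stump <- function(stump, x) ifelse(x < stump$cut, stump$left, stump$right)

# Simulated data (not the Boston set).
set.seed(1)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
x_new <- seq(0, 1, length.out = 50)

# Bagging: fit one stump per bootstrap sample, then average the predictions.
B <- 100
preds <- replicate(B, {
  idx <- sample(n, replace = TRUE)              # bootstrap sample of the rows
  predict_stump(fit_stump(x[idx], y[idx]), x_new)
})
bagged_pred <- rowMeans(preds)                  # average over the B stumps
```

A single stump gives a step function with one jump; averaging over bootstrap samples smooths it, which is the variance reduction that bagging aims for.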

library(randomForest)

randomForest 4.7-1.1

Type rfNews() to see new features/changes/bug fixes.

set.seed(1)
bag.boston <- randomForest(medv ~ ., data = Boston,
    subset = train, mtry = 12, importance = TRUE)
bag.boston


Call:
 randomForest(formula = medv ~ ., data = Boston, mtry = 12, importance = TRUE,      subset = train) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 12

          Mean of squared residuals: 11.10176
                    % Var explained: 85.56

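mtry = 12 → 12 predictors are considered at each split
MASS::Boston has 13 predictors (all columns except medv), so this is close to, but not exactly, bagging; exact bagging would set mtry = 13
Evaluate model performance on the test set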

yhat.bag <- predict(bag.boston, newdata = Boston[-train, ])
plot(yhat.bag, boston.test)
abline(0, 1)

mean((yhat.bag - boston.test)^2)

[1] 23.38773

Bagged model test MSE ≈ 23.39
Substantially lower than the single tree (about 2/3 of its test MSE)
The number of trees can be controlled with the ntree argument

bag.boston <- randomForest(medv ~ ., data = Boston,
    subset = train, mtry = 12, ntree = 25)
yhat.bag <- predict(bag.boston, newdata = Boston[-train, ])
mean((yhat.bag - boston.test)^2)

[1] 25.19144

Random forests are grown like bagged trees, but with a smaller mtry
Default mtry:
- Regression: p/3
- Classification: √p
Here, mtry = 6 is used
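
The default values of mtry can be reproduced with a few lines of base R. default_mtry() is a hypothetical helper written for this note, not a function of the randomForest package; it mirrors the defaults that randomForest() applies when mtry is not supplied.

```r
# Hypothetical helper mirroring the randomForest() defaults:
#   regression:     max(floor(p / 3), 1)
#   classification: floor(sqrt(p))
default_mtry <- function(p, type = c("regression", "classification")) {
  type <- match.arg(type)
  if (type == "regression") max(floor(p / 3), 1) else floor(sqrt(p))
}

p <- 13  # predictors in MASS::Boston (14 columns minus the response medv)
default_mtry(p, "regression")      # 4
default_mtry(p, "classification")  # 3
```

So mtry = 6 is a choice between the regression default (4) and the full set of predictors.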

set.seed(1)
rf.boston <- randomForest(medv ~ ., data = Boston,
    subset = train, mtry = 6, importance = TRUE)
yhat.rf <- predict(rf.boston, newdata = Boston[-train, ])
mean((yhat.rf - boston.test)^2)

[1] 19.62021

Random forest test MSE ≈ 19.62
Improves on the bagging result (≈ 23.39)
Use importance() to evaluate variable importance

importance(rf.boston)

          %IncMSE IncNodePurity
crim    16.697017    1076.08786
zn       3.625784      88.35342
indus    4.968621     609.53356
chas     1.061432      52.21793
nox     13.518179     709.87339
rm      32.343305    7857.65451
age     13.272498     612.21424
dis      9.032477     714.94674
rad      2.878434      95.80598
tax      9.118801     364.92479
ptratio  8.467062     823.93341
black    7.579482     275.62272
lstat   27.129817    6027.63740

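Two measures of variable importance:
- %IncMSE: mean decrease in prediction accuracy (increase in MSE) when the variable is permuted in the out-of-bag samples
- IncNodePurity: total decrease in node impurity from splits on the variable, averaged over all trees (for regression, impurity = training RSS)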
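
The permutation idea behind the first measure can be sketched in base R. This is a simplified analogue, not the randomForest computation: it uses simulated data, a linear model as a stand-in for the forest, and a held-out set instead of per-tree out-of-bag samples. All names are illustrative.

```r
set.seed(1)
n <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 3 * dat$x1 + 1 * dat$x2 + rnorm(n)   # x3 is pure noise

train_idx <- sample(n, n / 2)
fit <- lm(y ~ ., data = dat[train_idx, ])      # stand-in for a fitted forest
held_out <- dat[-train_idx, ]

mse <- function(model, newdata) mean((newdata$y - predict(model, newdata))^2)
baseline <- mse(fit, held_out)

# Permute one predictor at a time and record the % increase in held-out MSE.
perm_importance <- sapply(c("x1", "x2", "x3"), function(v) {
  shuffled <- held_out
  shuffled[[v]] <- sample(shuffled[[v]])       # break the link between v and y
  100 * (mse(fit, shuffled) - baseline) / baseline
})
perm_importance
```

Permuting a variable the model relies on (x1) inflates the error far more than permuting an irrelevant one (x3), which is the reading applied to the %IncMSE column.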

varImpPlot(rf.boston)

Most important variables in the random forest, on both measures:
- rm (number of rooms, a proxy for house size)
- lstat (socioeconomic status of the community)

Boosting

Use the gbm package to fit boosted regression trees
Apply gbm() to the Boston dataset
Set distribution = "gaussian" for regression problems

library(gbm)

Loaded gbm 2.2.2

This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3

set.seed(1)
boost.boston <- gbm(medv ~ ., data = Boston[train, ],
    distribution = "gaussian", n.trees = 5000,
    interaction.depth = 4)

summary() provides:
- Relative influence plot
- Numerical measures of variable importance

summary(boost.boston)

            var    rel.inf
rm           rm 43.9919329
lstat     lstat 33.1216941
crim       crim  4.2604167
dis         dis  4.0111090
nox         nox  3.4353017
black     black  2.8267554
age         age  2.6113938
ptratio ptratio  2.5403035
tax         tax  1.4565654
indus     indus  0.8008740
rad         rad  0.6546400
zn           zn  0.1446149
chas       chas  0.1443986

rm and lstat are by far the most important variables (together about 77% of relative influence)
Partial dependence plots can be generated for these variables

plot(boost.boston, i = "rm")

plot(boost.boston, i = "lstat")

Partial dependence plots show marginal effect of variables on response
Effects observed:
- Higher rm → higher house prices
- Higher lstat → lower house prices
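
A partial dependence curve can also be computed by brute force: fix the variable at each grid value for every row, predict, and average. The sketch below assumes simulated data and a linear model as a stand-in for the boosted model; partial_dependence() and the column names rooms, status, price are hypothetical, and gbm may compute its plots differently internally.

```r
set.seed(1)
n <- 300
dat <- data.frame(rooms = runif(n, 4, 9), status = runif(n, 2, 35))
dat$price <- 5 * dat$rooms - 0.5 * dat$status + rnorm(n)

fit <- lm(price ~ rooms + status, data = dat)   # stand-in for the boosted model

# Hypothetical helper: average prediction with `var` forced to each grid value.
partial_dependence <- function(model, data, var, grid) {
  sapply(grid, function(v) {
    tmp <- data
    tmp[[var]] <- v                        # set the variable to v in every row
    mean(predict(model, newdata = tmp))    # average over the other predictors
  })
}

grid <- seq(4, 9, length.out = 20)
pd_rooms <- partial_dependence(fit, dat, "rooms", grid)
```

Plotting pd_rooms against grid gives the curve; here it rises with rooms, matching the direction of the rm effect described above.
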
Use boosted model to predict medv on the test set

yhat.boost <- predict(boost.boston,
    newdata = Boston[-train, ], n.trees = 5000)
mean((yhat.boost - boston.test)^2)

[1] 18.84709

Boosting test MSE ≈ 18.85
Slightly better than the random forest (≈ 19.62) and clearly better than bagging (≈ 23.39)
The model can be tuned via the shrinkage parameter λ (argument shrinkage)
Example: use λ = 0.2

boost.boston <- gbm(medv ~ ., data = Boston[train, ],
    distribution = "gaussian", n.trees = 5000,
    interaction.depth = 4, shrinkage = 0.2, verbose = FALSE)
yhat.boost <- predict(boost.boston,
    newdata = Boston[-train, ], n.trees = 5000)
mean((yhat.boost - boston.test)^2)

[1] 18.33455

Using λ = 0.2 gives a slightly lower test MSE (≈ 18.33) than the first fit (≈ 18.85), which left shrinkage at its default value
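
How λ enters the fit can be sketched with a hand-rolled boosting loop in base R. This is a toy illustration, not the gbm implementation: the data are simulated, each tree is a one-split stump, and fit_stump(), predict_stump(), boost_stumps(), and predict_boost() are hypothetical helpers.

```r
# Hypothetical helper: regression stump = the single split on x minimising SSE.
fit_stump <- function(x, y) {
  xs <- sort(unique(x))
  if (length(xs) < 2) return(list(cut = Inf, left = mean(y), right = mean(y)))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  sse <- sapply(cuts, function(cc) {
    l <- y[x < cc]
    r <- y[x >= cc]
    sum((l - mean(l))^2) + sum((r - mean(r))^2)
  })
  best <- cuts[which.min(sse)]
  list(cut = best, left = mean(y[x < best]), right = mean(y[x >= best]))
}
predict_stump <- function(stump, x) ifelse(x < stump$cut, stump$left, stump$right)

# Boosting: repeatedly fit a stump to the residuals and add a shrunken copy.
boost_stumps <- function(x, y, n_trees, lambda) {
  r <- y                                     # residuals start as the response
  stumps <- vector("list", n_trees)
  for (b in seq_len(n_trees)) {
    stumps[[b]] <- fit_stump(x, r)           # small tree fitted to residuals
    r <- r - lambda * predict_stump(stumps[[b]], x)   # shrunken update
  }
  list(stumps = stumps, lambda = lambda)
}
predict_boost <- function(model, x) {
  total <- numeric(length(x))
  for (s in model$stumps) total <- total + predict_stump(s, x)
  model$lambda * total                       # sum of the shrunken stumps
}

# Simulated training and test data.
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
x_test <- runif(200)
y_test <- sin(2 * pi * x_test) + rnorm(200, sd = 0.3)

fit_slow <- boost_stumps(x, y, n_trees = 200, lambda = 0.001)
fit_fast <- boost_stumps(x, y, n_trees = 200, lambda = 0.2)
mse_slow <- mean((y_test - predict_boost(fit_slow, x_test))^2)
mse_fast <- mean((y_test - predict_boost(fit_fast, x_test))^2)
c(slow = mse_slow, fast = mse_fast)
```

With only 200 trees, λ = 0.001 has barely moved away from a zero fit, so its test MSE is much higher. This shows the trade-off between λ and the number of trees rather than a general advantage of larger λ: a small λ typically needs many more trees to reach a comparable fit.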