Decision Trees, Bagging, Random Forests, and Boosting
Ca’ Foscari University of Venice
The tree library in R is used to construct regression trees. Here a regression tree is fit to the Boston dataset from the MASS package, using half of the observations as a training set.
library(tree)
library(MASS)
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
tree.boston <- tree(medv ~ ., Boston, subset = train)
summary(tree.boston)
Regression tree:
tree(formula = medv ~ ., data = Boston, subset = train)
Variables actually used in tree construction:
[1] "rm" "lstat" "crim" "age"
Number of terminal nodes: 7
Residual mean deviance: 10.38 = 2555 / 246
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-10.1800 -1.7770 -0.1775 0.0000 1.9230 16.5800
summary() shows that only 4 variables are actually used in constructing the tree, including:
lstat: % of the population with lower socioeconomic status
rm: average number of rooms per dwelling
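The interpretation below reads the splits off the fitted tree; a minimal sketch of how to draw it with the standard tree-package plotting functions:
plot(tree.boston)              # draw the tree structure
text(tree.boston, pretty = 0)  # label the splits at each node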
Interpretation:
Higher rm → higher house prices
Lower lstat → higher house prices
rm ≥ 7.553 → predicted price ≈ $45,400
Model flexibility: a larger tree can be grown by passing control = tree.control(..., mindev = 0) to tree().
Next step: use cv.tree() to check via cross-validation whether pruning the tree improves performance (see the sketch below).
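A minimal sketch of these two steps, assuming the tree-package defaults (the object names and the best = 5 value are purely illustrative):
# grow a much larger tree by relaxing the deviance threshold
big.tree <- tree(medv ~ ., Boston, subset = train,
                 control = tree.control(nobs = length(train), mindev = 0))
# cross-validate over tree size and plot deviance against size
cv.boston <- cv.tree(tree.boston)
plot(cv.boston$size, cv.boston$dev, type = "b")
# prune to a chosen size if a smaller tree performs comparably
prune.boston <- prune.tree(tree.boston, best = 5)
plot(prune.boston)
text(prune.boston, pretty = 0)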
Here the unpruned tree is used to make predictions on the test set:
yhat <- predict(tree.boston, newdata = Boston[-train, ])
boston.test <- Boston[-train, "medv"]
plot(yhat, boston.test)
abline(0, 1)
mean((yhat - boston.test)^2)
[1] 35.28688
The test-set MSE of the single regression tree is 35.29; its square root is about 5.94, so the predictions are on average within roughly $5,940 of the true median home value.
Bagging and random forests are implemented via the randomForest package in R. Bagging is applied first.
library(randomForest)
set.seed(1)
bag.boston <- randomForest(medv ~ ., data = Boston,
subset = train, mtry = 12, importance = TRUE)
bag.boston
Call:
randomForest(formula = medv ~ ., data = Boston, mtry = 12, importance = TRUE, subset = train)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 12
Mean of squared residuals: 11.10176
% Var explained: 85.56
Setting mtry = 12 means that (essentially) all of the predictors are considered at each split, so this fit corresponds to bagging. The bagged model is evaluated on the test set:
yhat.bag <- predict(bag.boston, newdata = Boston[-train, ])
plot(yhat.bag, boston.test)
abline(0, 1)
mean((yhat.bag - boston.test)^2)
[1] 23.38773
Bagging gives a test MSE of 23.39, a substantial improvement over the single regression tree (35.29).
The number of trees grown can be changed with the ntree argument, for example ntree = 25:
bag.boston <- randomForest(medv ~ ., data = Boston,
subset = train, mtry = 12, ntree = 25)
yhat.bag <- predict(bag.boston, newdata = Boston[-train, ])
mean((yhat.bag - boston.test)^2)
[1] 25.19144
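To judge how many trees are needed, the fitted object can be plotted directly; for a regression forest, plot() shows the out-of-bag MSE as a function of the number of trees grown. A minimal sketch:
plot(bag.boston, main = "Bagging: OOB error vs. number of trees")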
Growing a random forest proceeds in exactly the same way, except that a smaller value of mtry is used (by default, randomForest() considers p/3 of the predictors at each split for regression trees). Here mtry = 6 is used:
set.seed(1)
rf.boston <- randomForest(medv ~ ., data = Boston,
subset = train, mtry = 6, importance = TRUE)
yhat.rf <- predict(rf.boston, newdata = Boston[-train, ])
mean((yhat.rf - boston.test)^2)
[1] 19.62021
The random forest yields a test MSE of 19.62, an improvement over bagging.
The importance() function reports two measures of variable importance: %IncMSE, the mean decrease in prediction accuracy on the out-of-bag samples when the variable is permuted, and IncNodePurity, the total decrease in node purity from splits on that variable, averaged over all trees.
importance(rf.boston)
          %IncMSE IncNodePurity
crim 16.697017 1076.08786
zn 3.625784 88.35342
indus 4.968621 609.53356
chas 1.061432 52.21793
nox 13.518179 709.87339
rm 32.343305 7857.65451
age 13.272498 612.21424
dis 9.032477 714.94674
rad 2.878434 95.80598
tax 9.118801 364.92479
ptratio 8.467062 823.93341
black 7.579482 275.62272
lstat 27.129817 6027.63740
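These measures can also be visualized with the package's built-in dotchart; a minimal sketch:
varImpPlot(rf.boston)   # plots %IncMSE and IncNodePurity for each predictor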
Across both measures, the two most important variables are lstat (socioeconomic status) and rm (house size, i.e. number of rooms).
The gbm package is used next to fit boosted regression trees, running gbm() on the Boston training data with distribution = "gaussian" since this is a regression problem. summary() reports the relative influence of each variable:
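The fitting call itself does not appear above; a sketch consistent with the later predict() call (n.trees = 5000), with interaction.depth = 4 taken as an assumed setting, would be:
library(gbm)
set.seed(1)
boost.boston <- gbm(medv ~ ., data = Boston[train, ],
                    distribution = "gaussian", n.trees = 5000,
                    interaction.depth = 4)   # depth-4 trees: assumed, not from the output
summary(boost.boston)                        # produces the relative influence table below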

var rel.inf
rm rm 43.9919329
lstat lstat 33.1216941
crim crim 4.2604167
dis dis 4.0111090
nox nox 3.4353017
black black 2.8267554
age age 2.6113938
ptratio ptratio 2.5403035
tax tax 1.4565654
indus indus 0.8008740
rad rad 0.6546400
zn zn 0.1446149
chas chas 0.1443986
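Partial dependence plots show the marginal effect of the top variables after integrating out the others; a minimal sketch using gbm's plot method:
plot(boost.boston, i = "rm")     # predicted medv rises with the number of rooms
plot(boost.boston, i = "lstat")  # predicted medv falls as lstat increases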
lstat and rm are by far the most important variables: higher rm → higher house prices, higher lstat → lower house prices. The boosted model is now used to predict medv on the test set:
yhat.boost <- predict(boost.boston,
newdata = Boston[-train, ], n.trees = 5000)
mean((yhat.boost - boston.test)^2)
[1] 18.84709
The boosted model's test MSE is 18.85, slightly better than the random forest.
Boosting can also be run with a different value of the shrinkage parameter; here shrinkage = 0.2:
boost.boston <- gbm(medv ~ ., data = Boston[train, ],
distribution = "gaussian", n.trees = 5000,
interaction.depth = 4, shrinkage = 0.2, verbose = FALSE)
yhat.boost <- predict(boost.boston,
newdata = Boston[-train, ], n.trees = 5000)
mean((yhat.boost - boston.test)^2)
[1] 18.33455
With shrinkage = 0.2 the test MSE drops slightly further, to 18.33.