In this practical work, we will build some decision trees for both regression and classification problems. Note that there are many packages to do this in . The
tree package is the basic package to do so, while the
rpart17 package seems more widely suggested and provides better plotting features. So we will use the
It is recommended for correct and better using of functions that you consult their documentations. Every function is well documented indeed. You can do so by writing
help(function_name)in the console.
Especially for functions with multiple use, for example,
glm() is a function that fits generalizes linear models, one of them is logistic regression when
type = "binomial".
Another example, is the function
predict(), that is a generic function for predictions from the results of various model fitting functions. Its first argument is a object, a model, and the rest of the arguments depends on the nature of the object. If you want to consult the documentation about using
predict() for a tree built with
If you want to run a function from a certain package without loading the package, you can write
package::function(). For example
rpart::rpart(). It is also helpful to remember the name of the package in which the function is defined.
To demonstrate regression trees, we will use the
Boston dataset that we used during the first two practical works, from the
MASS package. Recall that
medv is the response.
1. Load the Boston dataset from
MASS package. Split the dataset randomly in half.
library(MASS) library(caTools) set.seed(18) Boston_idx = sample(1:nrow(Boston), nrow(Boston) / 2) # You don't know what we just did? # open the documentation of the function sample by # writing ?sample in the R console. # Note that this is one of the ways to split it randomly and it is not necessary the best. Boston_train = Boston[Boston_idx,] Boston_test = Boston[-Boston_idx,]
2. Fit a regression tree to the training data using the
rpart() function from the
rpart package. Name the tree
3. Plot the obtained tree using the following code.
4. A better plot can be obtained using the
rpart.plot18 package. Re-plot the tree using it. You can use the
rpart.plot() function which by default, when the output is continuous, each node shows: the predicted value, and the percentage of observations in the node. You can also use the
5. Print the obtained tree and print its summary. Between the things that you can see in the summary, the CP (complexity parameter) table and the importance of each variable in the model. Print the CP table using the
printcp() function to see the cross validation results. Plot a comparison figure using the
You will notice the obtained tree is pruned. This is because
rpart prunes the tree by default by performing 10-fold cross-validation.
rpart keeps track of something called the complexity of a tree. The complexity measure is a combination of the size of a tree and the ability of the tree to separate the classes of the target variable. If the next best split in growing a tree does not reduce the tree’s overall complexity by a certain amount,
rpart will terminate the growing process. This amount is specified by the complexity parameter,
cp, in the call to
cp to a negative amount (like -1) ensures that the tree will be fully grown. You can try it and then plot the tree.
Notice that the default
cp value may over prune the tree (the default is the one with the lowest
xerror). As a rule of thumb, it’s best to prune a decision tree using the
cp of smallest tree that is within one standard deviation of the tree with the smallest
xerror. In the example above (see the CP table and the figure obtained with
plotcp()), the best
xerror is 0.30517 with standard deviation 0.056180. So, we want the smallest tree with
xerror less than 0.30517 + 0.05618 = 0.361. This is the tree with
cp = 0.025293, so we’ll want to prune our tree with a
cp slightly greater than
Next we will compare this regression tree to a linear model and will use RMSE as our metric. RMSE is the Root Mean Square Error, which is the square root of the MSE.
6. Write a function that returns the RMSE of two vectors.
7. Use the function
predict() to predict the response on the test set. Then calculate the RMSE obtained with tree model.
8. Fit a linear regression model on the training set. Then predict the response on the test set using the linear model. Calculate the RMSE and compare the performance of the tree and the linear regression model.
Here the most obvious linear regression beats the tree! We’ll improve on this tree by considering ensembles of trees.
You can visually compare the performance of both models by plotting the Actual (reality) response values against the predicted values. The model with closer points are to the diagonal (
y=x) line is the better one. You can try to reproduce the figure below.
By aggregating many decision trees, using methods like bagging, random forests, and boosting, the predictive performance of trees can be substantially improved. We will now use these concepts, called ensemble methods.
Bagging, or Bootstrap aggregation, is a general-purpose procedure for reducing the variance of a statistical learning method, it is particularly useful and frequently used in the context of decision trees. The idea is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions. Generally we do not have access to multiple training sets. Instead, we can bootstrap, by taking repeated samples from the (single) training data set.
To apply bagging to regression trees, we simply construct \(B\) regression trees using B bootstrapped training sets, and average the resulting predictions. These trees are grown deep, and are not pruned. Hence each individual tree has high variance, but low bias. Averaging these \(B\) trees reduces the variance.
9. Fit a bagged model, using the
randomForest() function from the
Bagging is actually a special case of a random forest where
mtry is equal to \(p\), the number of predictors.
10. Predict the response on the test set using the bagging model. Calculate the RMSE. Is the performance of the model better than linear regression or a simple tree?
Note that the “Mean of squared residuals” which is output by
randomForest() is the Out of Bag estimate of the error. Here is its plot:
Now try a random forest. For regression, on suggestion is to use
mtry equal to \(p/3\).19
11. Fit a random forest on the training set and compare its performance with the previous models by calculating the predictions and the RMSE.
12. Use the function
importance() from the
randomForest package to see the most important predictors in the obtained random forest model. What are the three most important predictors? Did you find the same results when you selected the best predictors for the linear regression model during session 2?
13. Plot the importance of the predictors to the model using the
Last and not least, let us try a boosted model, which by default will produce a nice variable importance plot as well as plots of the marginal effects of the predictors. To do so, we will use the
14. Using the
gbm() function like following, fit a boosted model on the training set. Then compare its performance with the previous models by calculating the predictions and the RMSE.
#ans> Using 5000 trees...
15. Show the summary of the boosted model. A figure of the variable importance will be shown.
16. Reproduce the following comparison: A table in which we show the obtained RMSE with each tested model, you can create a \(5 \times 2\) data.frame in which you put the names of the models and the corresponding RMSE. To visualize the data frame in the compiled html report you can use the
kable() function from the
knitr package. Or, compare the models by plotting the Actual (reality) response values against the predicted values.
A classification tree is very similar to a regression tree, except that the classification tree is used to predict a qualitative response rather than a quantitative one. Recall that for a regression tree, the predicted response for an observation is given by the mean response of the training observations that belong to the same terminal node. In contrast, for a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs.
For the rest of this PW, you must:
- Import the
spamdataset and explore it. Be aware that it is preferable that the response column is of type factor.
- Split the dataset into training and test sets (choose your own seed when using
- Fit (using
- A logistic regression model.
- A simple classification tree.
- Bagging, Random Forests22, and Boosting models.
- For each model, predict the response on the test set and evaluate the performance of the model, using the prediction accuracy (create a function that returns the accuracy for two binary vectors).
This dataset consists of information from 4601 email messages, in a study to try to predict whether the email was junk email, or “spam”. For all 4601 email messages, the true outcome, spam or not, is available, along with 57 predictors as described below:
- 48 quantitative predictors: the percentage of words in the email that match a given word. Examples include business, address, internet; etc.
- 6 quantitative predictors: the percentage of characters in the email that match a given character. The characters are
- The average length of uninterrupted sequences of capital letters:
- The length of the longest uninterrupted sequence of capital letters:
- The sum of the length of uninterrupted sequences of capital letters:
Note that the spam dataset given here is already treated and ready to be explored. To achieve this stage, some steps are required to treat the raw data, like Tokenization, Stemming, and Lemmatization. In this dataset the most important words are already selected and other variables are added. Curious students can read more about these steps. Two famous packages for text mining are tm and tidytext.
So far in this PW, we fit bagging, boosting and random forest models, but did not tune any of them, we simply used certain, somewhat arbitrary, parameters. Actually, to make these models better the parameters should be tuned. The parameters include:
- Bagging: Actually just a subset of Random Forest with
- Random Forest:
caret package provides excellent functions to accomplish this. Note that with these tree-based ensemble methods there are two resampling solutions for tuning the model:
- Out of Bag
Using Out of Bag samples is advantageous with these methods as compared to Cross-Validation since it removes the need to refit the model and is thus much more computationally efficient. Unfortunately OOB methods cannot be used with
gbm models. See the
caret documentation: Short intro, Long intro for details.