# PW 5

In this practical work, we will build some decision trees for both *regression* and *classification* problems. Note that there are many packages to do this in . The `tree`

package is the basic package to do so, while the `rpart`

^{17} package seems more widely suggested and provides better plotting features. So we will use the `rpart`

package.

It is recommended for correct and better using of functions that you consult their documentations. Every function is well documented indeed. You can do so by writing `?function_name`

or `help(function_name)`

in the console.

Especially for functions with multiple use, for example, `glm()`

is a function that fits generalizes linear models, one of them is logistic regression when `type = "binomial"`

.

Another example, is the function `predict()`

, that is a generic function for predictions from the results of various model fitting functions. Its first argument is a object, a model, and the rest of the arguments depends on the nature of the object. If you want to consult the documentation about using `predict()`

for a tree built with `rpart()`

, do `?predict.rpart`

or `help(predict.rpart)`

If you want to run a function from a certain package without loading the package, you can write `package::function()`

. For example `MASS::lda()`

or `rpart::rpart()`

. It is also helpful to remember the name of the package in which the function is defined.

## Regression Trees

### Single tree

To demonstrate regression trees, we will use the `Boston`

dataset that we used during the first two practical works, from the `MASS`

package. Recall that `medv`

is the response.

**1**. Load the Boston dataset from `MASS`

package. Split the dataset randomly in half.

```
library(MASS)
library(caTools)
set.seed(18)
= sample(1:nrow(Boston), nrow(Boston) / 2)
Boston_idx # You don't know what we just did?
# open the documentation of the function sample by
# writing ?sample in the R console.
# Note that this is one of the ways to split it randomly and it is not necessary the best.
= Boston[Boston_idx,]
Boston_train = Boston[-Boston_idx,] Boston_test
```

**2**. Fit a regression tree to the training data using the `rpart()`

function from the `rpart`

package. Name the tree `Boston_tree`

.

**3**. Plot the obtained tree using the following code.

```
plot(Boston_tree)
text(Boston_tree, pretty = 0)
title(main = "Regression Tree")
```

**4**. A better plot can be obtained using the `rpart.plot`

^{18} package. Re-plot the tree using it. You can use the `rpart.plot()`

function which by default, when the output is continuous, each node shows: the predicted value, and the percentage of observations in the node. You can also use the `prp()`

function.

```
#ans> Error in library(rpart.plot): there is no package called 'rpart.plot'
#ans> Error in rpart.plot(Boston_tree): impossible de trouver la fonction "rpart.plot"
#ans> Error in prp(Boston_tree): impossible de trouver la fonction "prp"
```

**5**. Print the obtained tree and print its summary. Between the things that you can see in the summary, the CP (complexity parameter) table and the importance of each variable in the model. Print the CP table using the `printcp()`

function to see the cross validation results. Plot a comparison figure using the `plotcp()`

function.

You will notice the obtained tree is **pruned**. This is because `rpart`

prunes the tree by default by performing 10-fold cross-validation.

`rpart`

keeps track of something called the complexity of a tree. The complexity measure is a combination of the size of a tree and the ability of the tree to separate the classes of the target variable. If the next best split in growing a tree does not reduce the tree’s overall complexity by a certain amount, `rpart`

will terminate the growing process. This amount is specified by the complexity parameter, `cp`

, in the call to `rpart()`

. Setting `cp`

to a negative amount (like -1) ensures that the tree will be fully grown. You can try it and then plot the tree.

Notice that the default `cp`

value may over prune the tree (the default is the one with the lowest `xerror`

). As a rule of thumb, it’s best to prune a decision tree using the `cp`

of smallest tree that is within one standard deviation of the tree with the smallest `xerror`

. In the example above (see the CP table and the figure obtained with `plotcp()`

), the best `xerror`

is 0.30517 with standard deviation 0.056180. So, we want the smallest tree with `xerror`

less than 0.30517 + 0.05618 = 0.361. This is the tree with `cp = 0.025293`

, so we’ll want to prune our tree with a `cp`

slightly greater than `0.025293`

.

Next we will compare this regression tree to a linear model and will use RMSE as our metric. RMSE is the **R**oot **M**ean **S**quare **E**rror, which is the square root of the MSE.

**6**. Write a function that returns the RMSE of two vectors.

**7**. Use the function `predict()`

to predict the response on the test set. Then calculate the RMSE obtained with tree model.

**8**. Fit a linear regression model on the training set. Then predict the response on the test set using the linear model. Calculate the RMSE and compare the performance of the tree and the linear regression model.

Here the most obvious linear regression beats the tree! We’ll improve on this tree by considering ensembles of trees.

You can visually compare the performance of both models by plotting the Actual (reality) response values against the predicted values. The model with closer points are to the diagonal (`y=x`

) line is the better one. You can try to reproduce the figure below.

By aggregating many decision trees, using methods like *bagging*,
*random forests*, and *boosting*, the predictive performance of trees can be
substantially improved. We will now use these concepts, called *ensemble methods*.

### Bagging

Bagging, or *Bootstrap aggregation*, is a general-purpose procedure for reducing the variance of a statistical learning method, it is particularly useful and frequently used in the context of decision trees. The idea is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions. Generally we do not have access to multiple training sets. Instead, we can bootstrap, by taking repeated samples from the (single) training data set.

To apply bagging to regression trees, we simply construct \(B\) regression trees using B bootstrapped training sets, and average the resulting predictions. These trees are grown deep, and are not pruned. Hence each individual tree has high variance, but low bias. Averaging these \(B\) trees reduces the variance.

**9**. Fit a bagged model, using the `randomForest()`

function from the `randomForest`

package.

Bagging is actually a special case of a random forest where `mtry`

is equal to \(p\), the number of predictors.

**10**. Predict the response on the test set using the bagging model. Calculate the RMSE. Is the performance of the model better than linear regression or a simple tree?

Note that the “Mean of squared residuals” which is output by `randomForest()`

is the **Out of Bag** estimate of the error. Here is its plot:

### Random Forests

Now try a random forest. For regression, on suggestion is to use `mtry`

equal to \(p/3\).^{19}

**11**. Fit a random forest on the training set and compare its performance with the previous models by calculating the predictions and the RMSE.

**12**. Use the function `importance()`

from the `randomForest`

package to see the most important predictors in the obtained random forest model. What are the three most important predictors? Did you find the same results when you selected the best predictors for the linear regression model during session 2?

**13**. Plot the importance of the predictors to the model using the `varImpPlot()`

function.

### Boosting

Last and not least, let us try a *boosted* model, which by default will produce a nice **variable importance** plot as well as plots of the marginal effects of the predictors. To do so, we will use the `gbm`

package^{20}.

**14**. Using the `gbm()`

function like following, fit a boosted model on the training set. Then compare its performance with the previous models by calculating the predictions and the RMSE.

```
library(gbm)
#ans> Error in library(gbm): there is no package called 'gbm'
= gbm(medv ~ ., data = Boston_train, distribution = "gaussian",
Boston_boost n.trees = 5000, interaction.depth = 4, shrinkage = 0.01)
#ans> Error in gbm(medv ~ ., data = Boston_train, distribution = "gaussian", : impossible de trouver la fonction "gbm"
```

`#ans> Error in predict(Boston_boost, newdata = Boston_test): objet 'Boston_boost' introuvable`

**15**. Show the summary of the boosted model. A figure of the variable importance will be shown.

### Comparison

**16**. Reproduce the following comparison: A table in which we show the obtained RMSE with each tested model, you can create a \(5 \times 2\) data.frame in which you put the names of the models and the corresponding RMSE. To visualize the data frame in the compiled html report you can use the `kable()`

function from the `knitr`

package. Or, compare the models by plotting the Actual (reality) response values against the predicted values.

`#ans> Error in plot(Boston_boost_pred, Boston_test$medv, xlab = "Predicted", : objet 'Boston_boost_pred' introuvable`

## Classification Trees

A classification tree is very similar to a regression tree, except that the classification tree is used to predict a qualitative response rather than a quantitative one. Recall that for a regression tree, the predicted response for an observation is given by the mean response of the training observations that belong to the
same terminal node. In contrast, for a classification tree, we predict that
each observation belongs to the ** most commonly occurring class** of training
observations in the region to which it belongs.

To construct classification trees, we will use the **spam**^{21} dataset, available here . A description of the dataset is given below.

For the rest of this PW, you must:

- Import the
`spam`

dataset and explore it. Be aware that it is preferable that the response column is of type factor. - Split the dataset into training and test sets (choose your own seed when using
`set.seed()`

). - Fit (using
`rpart`

and`gbm`

packages):- A
**logistic**regression model. - A simple classification tree.
- Bagging, Random Forests
^{22}, and Boosting models.

- A
- For each model, predict the response on the test set and evaluate the performance of the model, using the
*prediction accuracy*(create a function that returns the accuracy for two binary vectors).

### The Spam dataset

This dataset consists of information from 4601 email messages, in a study to try to predict whether the email was junk email, or “spam.” For all 4601 email messages, the true outcome, spam or not, is available, along with 57 predictors as described below:

- 48 quantitative predictors: the
*percentage*of words in the email that match a given word. Examples include*business*,*address*,*internet*; etc. - 6 quantitative predictors: the percentage of characters in the email that match a given character. The characters are
`;`

,`(`

,`[`

,`!`

,`$`

and`#`

. - The average length of uninterrupted sequences of capital letters:
`crl.ave`

. - The length of the longest uninterrupted sequence of capital letters:
`crl.long`

. - The sum of the length of uninterrupted sequences of capital letters:
`crl.tot`

.

Note that the spam dataset given here is already treated and ready to be explored. To achieve this stage, some steps are required to treat the raw data, like Tokenization, Stemming, and Lemmatization. In this dataset the most important words are already selected and other variables are added. Curious students can read more about these steps. Two famous packages for text mining are tm and tidytext.

### Extra: Tuning

So far in this PW, we fit bagging, boosting and random forest models, but did not **tune** any of them, we simply used certain, somewhat arbitrary, parameters. Actually, to make these models better the parameters should be tuned. The parameters include:

- Bagging: Actually just a subset of Random Forest with
`mtry`

= \(p\). - Random Forest:
`mtry`

- Boosting:
`n.trees`

,`interaction.depth`

,`shrinkage`

,`n.minobsinnode`

The `caret`

package provides excellent functions to accomplish this. Note that with these tree-based ensemble methods there are two resampling solutions for tuning the model:

- Out of Bag
- Cross-Validation

Using Out of Bag samples is advantageous with these methods as compared to Cross-Validation since it removes the need to refit the model and is thus much more computationally efficient. Unfortunately OOB methods cannot be used with `gbm`

models. See the `caret`

documentation: Short intro, Long intro for details.

◼

An Introduction to Recursive Partitioning Using the

`rpart`

Routines - Details of the`rpart`

package.↩︎`rpart.plot`

Package - Detailed manual on plotting with`rpart`

using the`rpart.plot`

package.↩︎For classification a suggestion is

`mtry`

= \(\sqrt{p}\).↩︎**g**eneralized**b**oosted**m**odels package↩︎For classification, the suggested

`mtry`

for a random forest is \(\sqrt{p}\).↩︎