Multiple Linear Regression
In this practical work, we will continue the analysis of the Boston data set that we started last week (section 1.9.2). Recall that this data set records the median value of houses for 506 neighborhoods around Boston. Our task is to predict the median house value (medv).
1. Load the Boston dataset from
2. Split the dataset into a training set and a testing set (keep all the variables of the Boston data set).
3. Check if there is a linear relationship between the variables
4. Fit a model of housing prices as a function of age, then plot the observations and the regression line.
5. Train a regression model using both lstat and age as predictors of median house value. (Remember that we transformed lstat; use the same transformation here.) What is the obtained model?
6. Print the summary of the obtained regression model.
7. Is the model as a whole significant? Your answer to this question must be detailed.
8. Are the predictors significant?
9. Train a new model using all the variables of the dataset. (We can use . as a shortcut instead of writing out all the variable names.)
10. When using all the variables as predictors, we didn’t transform lstat. Retrain the model using log(lstat) instead of lstat.
11. Did \(R^2\) improve?
12. To see whether some variables are correlated, print the correlation matrix using the cor() function (round the correlations to 2 digits).
13. Visualize the correlations using the corrplot package. To do so, install the corrplot package, load it, then use the function corrplot.mixed(). See this link for examples and to understand how to use it.
14. What is the correlation between
15. Run the model again without tax. What happens to \(R^2\)? And to the F-statistic?
Of course \(R^2\) should drop slightly, because we removed one of the variables. But check that the F-statistic of the model gets higher, which means that the p-value gets lower and thus the model is more significant without tax.
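The comparison described above can be sketched as follows. This is only a minimal sketch: it assumes that the Boston data come from the MASS package and that the full model uses log(lstat) as in exercise 10. The object names fit_all and fit_no_tax are illustrative.

```r
library(MASS)  # assumption: Boston is taken from the MASS package

# Full model: every predictor, with lstat replaced by log(lstat)
fit_all <- lm(medv ~ . - lstat + log(lstat), data = Boston)

# Same model without tax
fit_no_tax <- lm(medv ~ . - lstat - tax + log(lstat), data = Boston)

s_all    <- summary(fit_all)
s_no_tax <- summary(fit_no_tax)

# R-squared cannot increase when a predictor is removed
c(all = s_all$r.squared, no_tax = s_no_tax$r.squared)

# F-statistic of the overall test: value, numerator df, denominator df
s_all$fstatistic
s_no_tax$fstatistic
```

The fstatistic element holds the value together with its degrees of freedom, so the two models can be compared without reading the printed summaries.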
16. Calculate the mean squared error (MSE) for the last model.
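One possible way to compute the MSE is sketched below. Several choices here are assumptions rather than part of the exercise: the data are taken from the MASS package, the split is 80/20 with an arbitrary seed, and the error is evaluated on the testing set (the exercise does not say which set to use).

```r
library(MASS)  # assumption: Boston is taken from the MASS package

set.seed(1)  # arbitrary seed, only to make the split reproducible

# Hypothetical 80/20 split; the exercise does not prescribe the proportion
n         <- nrow(Boston)
train_idx <- sample(n, size = round(0.8 * n))
train     <- Boston[train_idx, ]
test      <- Boston[-train_idx, ]

# Last model of the exercises: all predictors, log(lstat) instead of lstat, no tax
fit <- lm(medv ~ . - lstat - tax + log(lstat), data = train)

# MSE = mean of the squared differences between observed and predicted medv
pred     <- predict(fit, newdata = test)
mse_test <- mean((test$medv - pred)^2)
mse_test
```

Replacing test by train in the last three lines gives the training MSE instead.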
Next we will apply an analysis of variance (ANOVA) in order to test whether there is a significant difference of means between two groups \(i\) and \(j\) (group \(i\) is the suburbs bounding the river and group \(j\) the suburbs that do not). The hypotheses are
\[ H_0 : \mu_i = \mu_j \]
\[ H_1 : \mu_i \neq \mu_j \]
where \(\mu_i\) is the mean of medv in group \(i\).
17. In the Boston data set there is a categorical variable chas, which corresponds to the Charles River (= 1 if a suburb bounds the river; 0 otherwise). Use the command str() to see how this variable is stored in the dataset. How many of the suburbs in this data set bound the Charles River?
18. Create boxplots of the median value of houses with respect to the variable chas. Do we observe a difference in the median value of houses depending on whether the suburb bounds the Charles River?
19. Calculate \(\mu_i\) and \(\mu_j\) (in one line using the function
20. Apply an ANOVA test of medv with respect to chas (use the function aov()). Print the result and its summary. What do you conclude?
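The test above can be sketched as follows, assuming the Boston data come from the MASS package. The use of tapply() for the group means is only one possible choice and may not be the function the exercise has in mind. The last lines check a known property of the two-group case: the ANOVA F statistic equals the square of the pooled-variance two-sample t statistic.

```r
library(MASS)  # assumption: Boston is taken from the MASS package

# Group means of medv: one value for chas = 0, one for chas = 1
group_means <- tapply(Boston$medv, Boston$chas, mean)
group_means

# One-way ANOVA of medv with respect to chas
fit_aov <- aov(medv ~ factor(chas), data = Boston)
fit_aov
summary(fit_aov)

# With only two groups, F equals the squared pooled-variance t statistic
f_value <- summary(fit_aov)[[1]][["F value"]][1]
t_value <- t.test(medv ~ chas, data = Boston, var.equal = TRUE)$statistic
c(F = f_value, t_squared = unname(t_value)^2)
```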
We are going to use the categorical variable chas, which corresponds to the Charles River (= 1 if a suburb bounds the river; 0 otherwise). Using the str() command, you can notice that this variable is not coded as a factor; it takes the values 0 or 1, so it is already a dummy variable.
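A short sketch of what this means in practice, assuming the Boston data come from the MASS package. The object names fit_num and fit_fac are illustrative, and medv ~ chas is used only as the simplest possible example.

```r
library(MASS)  # assumption: Boston is taken from the MASS package

# chas is stored as a number, not as a factor, and only takes the values 0 and 1
str(Boston$chas)
table(Boston$chas)

# Because the column is already a 0/1 dummy, using it directly or wrapping it
# in factor() gives the same estimates (only the coefficient name differs)
fit_num <- lm(medv ~ chas, data = Boston)
fit_fac <- lm(medv ~ factor(chas), data = Boston)
coef(fit_num)
coef(fit_fac)
```

In both fits the slope is the difference in mean medv between suburbs that bound the river and suburbs that do not.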
21. Fit a new model where the predictors are the Charles River and the crime rate. Interpret the coefficients of this model and conclude whether the presence of the river adds valuable information for explaining the house price.
22. Is chas significant as well in the presence of more predictors?
As you saw in section 2.3.1, we may sometimes try models with interaction terms. Say we have two predictors \(X_1\) and \(X_2\); interactions are added in lm through the operators : and *. The operator : only adds the term \(X_1X_2\), whereas * adds \(X_1\), \(X_2\), and \(X_1X_2\).
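The difference between the two operators can be checked on a small example. This sketch assumes the Boston data come from the MASS package; lstat and age are used purely for illustration, and the object names are hypothetical.

```r
library(MASS)  # assumption: Boston is taken from the MASS package

fit_star  <- lm(medv ~ lstat * age, data = Boston)              # main effects + interaction
fit_long  <- lm(medv ~ lstat + age + lstat:age, data = Boston)  # same model, written out
fit_colon <- lm(medv ~ lstat:age, data = Boston)                # interaction term only

names(coef(fit_star))   # "(Intercept)" "lstat" "age" "lstat:age"
names(coef(fit_colon))  # "(Intercept)" "lstat:age"

# The * form and the written-out form give identical coefficients
all.equal(coef(fit_star), coef(fit_long))
```

A model fitted with : alone omits the main effects, which is rarely what is wanted, so * is usually the safer default.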
23. Fit a model with a first-order interaction term where predictors are
age. Print its summary.
24. Fit a model with all the first-order interaction terms.
In RStudio, there are some packages that make it easy to create reproducible web-based reports. To do so, click on File -> Knit document or File -> Compile report... The output is an HTML report containing the results of your code.
If your file is named report.R, your report is named report.html.
25. Compile a report based on your script.
- Make sure to have the latest version of
- If you have problems with compiling (problems installing packages, etc.), close RStudio, reopen it with administrator privileges, and retry.
- Be ready to submit your report (your .html file) at the end of each class.
- Your report must be named: