PW 2
Multiple Linear Regression
In this practical work, we will continue the analysis of the Boston data set that we started last week (section 1.9.2). Recall that this dataset records the median value of houses for 506 neighborhoods around Boston. Our task is to predict the median house value (medv).
1. Load the Boston dataset from the MASS package.
2. Split the dataset into a training set and a testing set (keep all the variables of the Boston data set).
3. Check whether there is a linear relationship between the variables medv and age (use the cor() function).
4. Fit a model of housing prices as a function of age, and plot the observations and the regression line.
5. Train a regression model using both lstat and age as predictors of median house value. (Remember that we transformed lstat; use the same transformation here.) What is the obtained model?
6. Print the summary of the obtained regression model.
7. Are the predictors significant?
8. Is the model as a whole significant? The answer to this question must be detailed.
9. Train a new model using all the variables of the dataset. (We can use the dot (.) as a shortcut instead of writing down all the variable names.)
10. When using all the variables as predictors, we didn’t transform lstat. Retrain the model using log(lstat) instead of lstat.
11. Did \(R^2\) improve?
12. To see whether there are correlated variables, print the correlation matrix using the cor() function (round the correlations to 2 digits).
13. Visualize the correlations using the corrplot package. To do so, install the corrplot package, load it, then use the function corrplot.mixed(). See the corrplot documentation for examples and to understand how to use it.
14. What is the correlation between tax and rad?
15. Run the model again without tax. What happens to \(R^2\)? And to the F-statistic?
Of course \(R^2\) should go a little lower because we deleted one of the variables. But check whether the model significance (F-statistic) gets higher; a higher F-statistic goes with a lower p-value, in which case the model as a whole is more significant without tax.
16. Calculate the mean squared error (MSE) for the last model.
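The regression steps above can be sketched in R as follows. This is only one possible outline, not a reference solution: the seed, the 80/20 split ratio, the object names (train, test, fit_age, fit_all, fit_notax) and the choice to compute the MSE on the testing set are all assumptions made for illustration.

```r
library(MASS)  # provides the Boston data set

set.seed(1)                                    # assumed seed, only for reproducibility
n <- nrow(Boston)
train_idx <- sample(n, size = round(0.8 * n))  # assumed 80/20 split
train <- Boston[train_idx, ]
test  <- Boston[-train_idx, ]

# Linear association between medv and age
cor(train$medv, train$age)

# Simple regression of medv on age, with the fitted line over the observations
fit_age <- lm(medv ~ age, data = train)
plot(medv ~ age, data = train)
abline(fit_age, col = "red")

# Two predictors, with lstat log-transformed
fit_two <- lm(medv ~ log(lstat) + age, data = train)
summary(fit_two)

# All variables as predictors, with log(lstat) in place of lstat
fit_all <- lm(medv ~ . - lstat + log(lstat), data = train)

# Correlation matrix rounded to 2 digits, and the tax/rad correlation
round(cor(train), 2)
cor(train$tax, train$rad)

# Same model without tax
fit_notax <- lm(medv ~ . - lstat - tax + log(lstat), data = train)

# Compare R^2 and the F-statistic of the two models
r2 <- c(all    = summary(fit_all)$r.squared,
        no_tax = summary(fit_notax)$r.squared)
fstat <- c(all    = summary(fit_all)$fstatistic[["value"]],
           no_tax = summary(fit_notax)$fstatistic[["value"]])
r2
fstat

# Mean squared error of the last model on the testing set (an assumption:
# the training-set MSE would be mean(residuals(fit_notax)^2) instead)
pred <- predict(fit_notax, newdata = test)
mse  <- mean((test$medv - pred)^2)
mse
```

Because the split is random, the numbers change with the seed, so any conclusion about \(R^2\) and the F-statistic should be read from your own output.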
ANOVA
Next we will apply an analysis of variance (ANOVA) to test whether there is a significant difference of means between two groups \(i\) and \(j\) (group \(i\) is the suburbs bounding the river and group \(j\) the suburbs that do not). The hypotheses are
\[ H_0 : \mu_i = \mu_j \]
\[ H_1 : \mu_i \neq \mu_j \]
where \(\mu_i\) is the mean of medv in group \(i\).
17. In the Boston data set there is a categorical variable chas, which corresponds to the Charles River (= 1 if a suburb bounds the river; 0 otherwise). Use the str() command to see how this variable is stored in the dataset. How many of the suburbs in this data set bound the Charles River?
18. Create boxplots of the median value of houses with respect to the variable chas. Do we observe a difference in the median value of houses depending on proximity to the Charles River?
19. Calculate \(\mu_i\) and \(\mu_j\) (in one line using the function aggregate()).
20. Apply an ANOVA test of medv with respect to chas (use the function aov()). Print the result and its summary. What do you conclude?
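A minimal sketch of the ANOVA steps above, run on the full Boston data set. The object names (group_means, fit_aov) and the axis labels are assumptions chosen for illustration; wrapping chas in factor() is a choice made here so that it is treated as a grouping variable.

```r
library(MASS)  # provides the Boston data set

# chas is stored as an integer taking values 0 and 1
str(Boston$chas)
table(Boston$chas)  # number of suburbs in each group

# Boxplots of medv for the two groups
boxplot(medv ~ chas, data = Boston,
        xlab = "chas (1 = suburb bounds the river)", ylab = "medv")

# Group means of medv in one line
group_means <- aggregate(medv ~ chas, data = Boston, FUN = mean)
group_means

# One-way ANOVA of medv with respect to chas
fit_aov <- aov(medv ~ factor(chas), data = Boston)
fit_aov
summary(fit_aov)
```

With only two groups, this one-way ANOVA is equivalent to a two-sample t-test with equal variances: the F-statistic is the square of the t-statistic.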
Qualitative predictors
Before starting the next question, please read section 2.3.1 and Appendix C about using qualitative predictors in regression.
We are going to use the categorical variable chas, which corresponds to the Charles River (= 1 if a suburb bounds the river; 0 otherwise). Using the str() command, you can see that this variable is not coded as a factor, but it takes the values 0 or 1, so it is already a dummy variable.
21. Fit a new model where the predictors are the Charles River (chas) and the crime rate (crim). Interpret the coefficients of this model and conclude whether the presence of the river adds valuable information for explaining the house price.
22. Is chas still significant in the presence of more predictors?
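As a hedged sketch of the two questions above: the object names fit_chas and fit_full are assumptions, and using all remaining variables as the "more predictors" model is only one possible choice.

```r
library(MASS)  # provides the Boston data set

# medv explained by the river dummy and the crime rate
fit_chas <- lm(medv ~ chas + crim, data = Boston)
summary(fit_chas)

# Because chas is a 0/1 dummy, its coefficient is the estimated difference
# in mean medv between suburbs that bound the river and those that do not,
# for a fixed value of crim
coef(fit_chas)

# chas in the presence of all the other predictors (assumed comparison model)
fit_full <- lm(medv ~ ., data = Boston)
summary(fit_full)$coefficients["chas", ]
```

The last line prints the estimate, standard error, t value and p-value for chas alone, which is enough to judge its significance in the larger model.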
Interaction terms
As you saw in section 2.3.1, we may sometimes try models with interaction terms. Say we have two predictors \(X_1\) and \(X_2\); the way to add these interactions in lm is through the operators : and *. The operator : only adds the term \(X_1X_2\), while * adds \(X_1\), \(X_2\), and \(X_1X_2\).
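To see what each operator expands to, one can inspect the columns of the design matrix that R builds from a formula. The data frame toy and its columns x1 and x2 below are hypothetical, made up only for this illustration.

```r
# Hypothetical toy data with two numeric predictors
toy <- data.frame(x1 = c(1, 2, 3, 4), x2 = c(2, 1, 0, 3))

# ':' adds only the product term (plus the intercept)
cols_colon <- colnames(model.matrix(~ x1:x2, data = toy))
cols_colon  # "(Intercept)" "x1:x2"

# '*' adds both main effects and the product term
cols_star <- colnames(model.matrix(~ x1 * x2, data = toy))
cols_star   # "(Intercept)" "x1" "x2" "x1:x2"

# '.^2' expands to all main effects and all pairwise interactions
cols_all <- colnames(model.matrix(~ .^2, data = toy))
cols_all    # "(Intercept)" "x1" "x2" "x1:x2"
```

The same formula syntax carries over to lm(): for example, a formula of the form y ~ .^2 requests every predictor together with every first-order interaction.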
23. Fit a model with a first-order interaction term where the predictors are lstat and age. Print its summary.
24. Fit a model with all the first-order interaction terms.
Reporting
In RStudio, there are packages that make it easy to create reproducible web-based reports. To do so, click on File -> Knit document or File -> Compile report... The output is an HTML report containing the results of your code.
If your file is named report.R, your report is named report.html.
25. Compile a report based on your script.
- Make sure to have the latest version of RStudio.
- If you have problems with compiling (problems installing packages, etc.), close RStudio, reopen it with administrative privileges, and retry.
- Be ready to submit your report (your .html file) at the end of each class.
- Your report must be named: YourLastName_YourFirstName_WeekNumber.html