PW 2

Multiple Linear Regression

In this practical work, we will continue the analysis of the Boston data set that we started last week (section 1.9.2). Recall that this dataset records the median value of houses for 506 neighborhoods around Boston. Our task is to predict the median house value (medv).

7. Load the Boston dataset from the MASS package.

9. Split the dataset into a training set and a testing set (keep all the variables of the Boston data set).

11. Check whether there is a linear relationship between the variables medv and age (use the cor() function).

13. Fit a model of housing prices as a function of age, then plot the observations and the regression line.

15. Train a regression model using both lstat and age as predictors of median house value. (Remember that we transformed lstat; use the same transformation here.) What is the resulting model?

6. Print the summary of the obtained regression model.

19. Is the model as a whole significant? Your answer to this question must be detailed.

21. Are the predictors significant?

23. Train a new model using all the variables of the dataset. (We can use . as a shortcut instead of writing out all the variable names.)

25. When using all the variables as predictors, we didn’t transform lstat. Retrain the model using log(lstat) instead of lstat.

27. Did \(R^2\) improve?

29. To see whether any variables are correlated, print the correlation matrix using the cor() function (round the correlations to 2 digits).

13. Visualize the correlations using the corrplot package. To do so, install the corrplot package, load it, then use the function corrplot.mixed(). See this link for examples and to understand how to use it.

14. What is the correlation between tax and rad?

35. Run the model again without tax. What happens to \(R^2\)? And to the F-statistic?

Of course, \(R^2\) should drop slightly because we removed one of the variables (on the training data it cannot increase when a predictor is dropped). But check the model significance: the F-statistic gets higher, which means the p-value gets lower, and thus the model is more significant without tax.
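One way to see why both can happen at once is the standard identity linking the overall F-statistic to \(R^2\) for a linear model with an intercept. The symbols \(n\) (number of observations) and \(p\) (number of predictors) are introduced here for illustration only:

\[ F = \frac{R^2 / p}{(1 - R^2) / (n - p - 1)} \]

Dropping a predictor lowers \(p\) by one. If \(R^2\) barely changes, the numerator \(R^2 / p\) grows while the denominator changes little, so \(F\) tends to increase.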

16. Calculate the mean squared error (MSE) for the last model.

ANOVA

Next, we will apply an analysis of variance (ANOVA) to test whether there is a significant difference of means between two groups \(i\) and \(j\) (let group \(i\) be the suburbs that bound the river and group \(j\) the suburbs that do not). The hypotheses are

\[ H_0 : \mu_i = \mu_j \]

\[ H_1 : \mu_i \neq \mu_j \]

where \(\mu_i\) is the mean of medv in group \(i\), and likewise \(\mu_j\) in group \(j\).
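As a rough sketch of what such a test looks like in R, the example below uses the built-in mtcars data rather than Boston, so the questions that follow are left for you to solve. The variables mpg (response) and am (a 0/1 grouping variable) are stand-ins chosen for illustration; they are not part of this practical work.

```r
# Illustration on the built-in mtcars data (an assumption, not the Boston data):
# does mean mpg differ between the two groups am = 0 and am = 1?
data(mtcars)

# Group means, one row per level of am
group_means <- aggregate(mpg ~ am, data = mtcars, FUN = mean)
print(group_means)

# One-way ANOVA; factor() makes R treat am as a grouping variable
fit <- aov(mpg ~ factor(am), data = mtcars)
anova_table <- summary(fit)[[1]]
print(anova_table)

# Extract the F statistic and its p-value from the first row of the table
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]

# With only two groups, the ANOVA F statistic equals the square of the
# equal-variance two-sample t statistic
t_res <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)
t_squared <- unname(t_res$statistic)^2
```

A small p-value leads to rejecting \(H_0\), that is, to concluding that the two group means differ.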

17. In the Boston data set there is a categorical variable chas which corresponds to the Charles River (= 1 if a suburb bounds the river; 0 otherwise). Use the str() command to see how this variable is stored in the dataset. How many of the suburbs in this data set bound the Charles River?

18. Create boxplots of the median value of houses with respect to the variable chas. Do we observe a difference in the median value of houses depending on whether the suburb bounds the Charles River?

19. Calculate \(\mu_i\) and \(\mu_j\) (in one line using the function aggregate()).

20. Apply an ANOVA test of medv with respect to chas (use the function aov()). Print the result and its summary. What do you conclude?

Qualitative predictors

Before starting the next question, please read section 2.3.1 and Appendix D about using qualitative predictors in regression.

We are going to use the categorical variable chas which corresponds to the Charles River (= 1 if a suburb bounds the river; 0 otherwise). Using the str() command, you can see that this variable is not coded as a factor, but since it takes the values 0 or 1, it is already dummy-coded.
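To illustrate what "already dummy-coded" means in practice, here is a minimal sketch on the built-in mtcars data (again a stand-in, not the Boston data). The 0/1 variable am plays the role that chas plays in this section; fitting it as a number or as a factor should give the same coefficients.

```r
# Illustration on mtcars (an assumption): am is numeric with values 0/1,
# so it already behaves like a dummy variable
data(mtcars)
str(mtcars$am)

# Fit the same simple regression with am as a number and as a factor
fit_num <- lm(mpg ~ am, data = mtcars)
fit_fac <- lm(mpg ~ factor(am), data = mtcars)
coef_num <- unname(coef(fit_num))
coef_fac <- unname(coef(fit_fac))

# Group means of the response, for comparison with the coefficients
means <- tapply(mtcars$mpg, mtcars$am, mean)

# Intercept = mean of the group coded 0;
# slope = difference between the two group means
print(coef_num)
print(means)
```

With a single 0/1 predictor, the intercept is the mean response of the group coded 0, and the slope is the difference between the two group means.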

21. Fit a new model where the predictors are the Charles River and the crime rate. Interpret the coefficients of this model and conclude whether the presence of the river adds valuable information for explaining the house price.

22. Is chas still significant in the presence of more predictors?

Interaction terms

As you saw in section 2.3.1, we may sometimes try models with interaction terms. Say we have two predictors \(X_1\) and \(X_2\); interactions are added in lm through the operators : and *. The operator : adds only the term \(X_1X_2\), while * adds \(X_1\), \(X_2\), and \(X_1X_2\).
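The difference between the two operators can be checked by looking at the coefficient names of small fitted models. The sketch below uses the built-in mtcars data with wt and hp as stand-in predictors (an assumption for illustration; they are unrelated to the Boston variables).

```r
# Illustration on mtcars (an assumption): wt and hp stand in for X1 and X2
data(mtcars)

# ':' adds only the product term
fit_colon <- lm(mpg ~ wt:hp, data = mtcars)
names_colon <- names(coef(fit_colon))
print(names_colon)

# '*' adds both main effects and the product term
fit_star <- lm(mpg ~ wt * hp, data = mtcars)
names_star <- names(coef(fit_star))
print(names_star)

# 'wt * hp' is shorthand for 'wt + hp + wt:hp'
fit_long <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

# '(a + b + c)^2' expands to all main effects plus all pairwise interactions
fit_pairs <- lm(mpg ~ (wt + hp + qsec)^2, data = mtcars)
names_pairs <- names(coef(fit_pairs))
print(names_pairs)
```

In every case the intercept is included by default, so the : model has two coefficients and the * model has four.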

23. Fit a model with a first-order interaction term where the predictors are lstat and age. Print its summary.

24. Fit a model with all the first-order interaction terms.

Reporting

In RStudio, there are packages that make it easy to create reproducible web-based reports. To do so, click on File -> Knit document or File -> Compile report... The output is an HTML report containing the results of your code. If your file is named report.R, your report is named report.html.
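The same kind of report can usually be produced from the R console. The sketch below assumes the rmarkdown package and pandoc are installed (RStudio normally bundles pandoc); the script content and the temporary file location are hypothetical and only serve the illustration.

```r
# Write a tiny throwaway script (hypothetical content) to a temporary folder
script <- file.path(tempdir(), "report.R")
writeLines(c(
  "#' A minimal report",
  "x <- c(1, 2, 3)",
  "mean(x)"
), script)

# Render only if rmarkdown and pandoc are available on this machine
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  # render() accepts an .R script and returns the path of the output file
  out <- rmarkdown::render(script, quiet = TRUE)
  print(basename(out))
}
```

Lines starting with #' in the script are turned into text in the report; everything else is run as R code and shown with its output.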

25. Compile a report based on your script.

  • Make sure you have the latest version of RStudio.
  • If you have problems with compiling (problems installing packages, etc.), close RStudio, reopen it with administrator privileges, and retry.
  • Be ready to submit your report (your .html file) at the end of each class.
  • Your report must be named: YourLastName_YourFirstName_WeekNumber.html