# PW 2

## Multiple Linear Regression

In this practical work, we will continue the analysis of the Boston data set that we started last week (section 1.9.2). Recall that this dataset records the median value of houses for 506 neighborhoods around Boston. Our task is to predict the median house value (`medv`).

**1**. Load the Boston dataset from the `MASS` package.

**2**. Split the dataset into a training set and a testing set. (Keep all the variables of the Boston data set.)
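One possible way to approach questions 1 and 2 is sketched below. The seed, the 80/20 ratio, and the object names `train_idx`, `train`, and `test` are assumptions made for illustration; the text does not prescribe them.

```r
# Load the Boston data from the MASS package.
library(MASS)

set.seed(1)  # hypothetical seed, only so that the split is reproducible

n <- nrow(Boston)

# Randomly pick 80% of the row indices for training (the ratio is an assumption).
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

train <- Boston[train_idx, ]   # every column of Boston is kept
test  <- Boston[-train_idx, ]  # the remaining rows form the testing set

print(dim(train))
print(dim(test))
```

Indexing with `-train_idx` guarantees that the two sets are disjoint and together cover the whole data set.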

**3**. Check whether there is a linear relationship between the variables `medv` and `age` (use the `cor()` function).

**4**. Fit a model of housing prices as a function of age, and plot the observations and the regression line.
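Questions 3 and 4 could be sketched as follows. For brevity this example uses the full `Boston` data frame; if you created a training set, it can be substituted for `Boston`. The names `r` and `fit_age` are illustrative.

```r
library(MASS)

# Pearson correlation between medv and age; a value close to 0
# would indicate a weak linear relationship.
r <- cor(Boston$medv, Boston$age)
print(r)

# Simple linear regression of medv on age.
fit_age <- lm(medv ~ age, data = Boston)
print(coef(fit_age))

# Scatter plot of the observations with the fitted regression line on top.
plot(Boston$age, Boston$medv, xlab = "age", ylab = "medv", pch = 20)
abline(fit_age, col = "red", lwd = 2)
```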

**5**. Train a regression model using both `lstat` and `age` as predictors of median house value. (Remember that we transformed `lstat`; use the same transformation here.) What is the obtained model?

**6**. Print the summary of the obtained regression model.

**7**. Is the model as a whole significant? Justify your answer in detail.

**8**. Are the predictors significant?
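A minimal sketch for questions 5 to 8, assuming the transformation in question is `log(lstat)` (the one named later in this sheet) and fitting on the full `Boston` data for brevity. The names `fit2`, `s`, `f`, and `p_model` are illustrative.

```r
library(MASS)

# Multiple regression of medv on log(lstat) and age.
fit2 <- lm(medv ~ log(lstat) + age, data = Boston)

# The summary holds the coefficient table, R^2 and the F-statistic.
s <- summary(fit2)
print(s)

# Overall significance: F-statistic with its degrees of freedom,
# and the corresponding p-value computed from the F distribution.
f <- s$fstatistic  # named vector: value, numdf, dendf
p_model <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
print(p_model)

# Individual significance: one t-test per coefficient (last column = p-values).
print(coef(s))
```

The F-test compares the fitted model against the intercept-only model, whereas each t-test checks a single coefficient given the other predictors.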

**9**. Train a new model using all the variables of the dataset. (We can use `.` as a shortcut instead of writing out all the variable names.)

**10**. When using all the variables as predictors, we didn't transform `lstat`. Retrain the model using `log(lstat)` instead of `lstat`.

**11**. Did \(R^2\) improve?
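Questions 9 to 11 could be approached along these lines. The formula `. - lstat + log(lstat)` is one way to swap the raw variable for its logarithm while keeping the `.` shortcut; the object names are illustrative and the full data set is used for brevity.

```r
library(MASS)

# Model with every other column of Boston as a predictor.
fit_all <- lm(medv ~ ., data = Boston)

# Same model, but lstat is removed and log(lstat) is added in its place.
fit_log <- lm(medv ~ . - lstat + log(lstat), data = Boston)

# Compare the coefficients of determination of the two fits.
r2_all <- summary(fit_all)$r.squared
r2_log <- summary(fit_log)$r.squared
print(c(lstat = r2_all, log_lstat = r2_log))
```

Both models have the same number of predictors, so their \(R^2\) values can be compared directly.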

**12**. To see whether there are correlated variables, print the correlation matrix using the `cor()` function (round the correlations to 2 digits).

**13**. Visualize the correlations using the `corrplot` package. To do so, install the `corrplot` package, load it, then use the function `corrplot.mixed()`. See this link for examples and to understand how to use it.

**14**. What is the correlation between `tax` and `rad`?
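A sketch for questions 12 to 14. Since `corrplot` is not part of a base R installation, the plotting call is guarded so the script still runs when the package is missing; that guard and the name `cor_mat` are additions for illustration.

```r
library(MASS)

# Correlation matrix of all variables, rounded to 2 digits.
cor_mat <- round(cor(Boston), 2)
print(cor_mat)

# Single entry of the matrix: correlation between tax and rad.
print(cor_mat["tax", "rad"])

# Draw the mixed correlation plot only if corrplot is installed
# (install it once with install.packages("corrplot")).
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot.mixed(cor(Boston))
}
```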

**15**. Run the model again without `tax`. What happens to \(R^2\)? And to the F-statistic?

Of course, \(R^2\) should go slightly lower because we removed one of the variables. But check the model significance: the F-statistic gets higher, which means the p-value gets lower, and thus the model is more significant without `tax`.

**16**. Calculate the mean squared error (MSE) for the last model.
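Questions 15 and 16 might be sketched as below. The MSE computed here is the in-sample one, on the data used for fitting; this is an assumption, and with a train/test split you would rather fit on the training set and evaluate on the testing set, as indicated in the last comment. Object names are illustrative.

```r
library(MASS)

# Model with log(lstat), with and without tax.
fit_log   <- lm(medv ~ . - lstat + log(lstat), data = Boston)
fit_notax <- lm(medv ~ . - lstat + log(lstat) - tax, data = Boston)

s_log   <- summary(fit_log)
s_notax <- summary(fit_notax)

# Side-by-side comparison of R^2 and the F-statistic.
comparison <- rbind(
  with_tax    = c(r2 = s_log$r.squared,   f = unname(s_log$fstatistic["value"])),
  without_tax = c(r2 = s_notax$r.squared, f = unname(s_notax$fstatistic["value"]))
)
print(comparison)

# In-sample mean squared error of the model without tax.
mse <- mean(residuals(fit_notax)^2)
print(mse)

# With a split (hypothetical objects `train` and `test`), the test MSE would be:
# mean((test$medv - predict(fit_on_train, newdata = test))^2)
```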

**ANOVA**

Next we will apply an analysis of variance (ANOVA) in order to test whether there is a significant difference of means between two groups \(i\) and \(j\) (consider group \(i\) to be the suburbs bounding the river and group \(j\) the suburbs that do not). The hypotheses are

\[ H_0 : \mu_i = \mu_j \]

\[ H_1 : \mu_i \neq \mu_j \]

where \(\mu_i\) is the mean of `medv` in group \(i\).

**17**. In the Boston data set there is a categorical variable `chas` which corresponds to the Charles River (= 1 if a suburb bounds the river; 0 otherwise). Use the command `str()` to see how this variable is stored in the dataset. How many of the suburbs in this data set bound the Charles River?

**18**. Create boxplots of the median value of houses with respect to the variable `chas`. Do we observe a difference in the median value of houses depending on whether the suburb bounds the Charles River?

**19**. Calculate \(\mu_i\) and \(\mu_j\) (in one line, using the function `aggregate()`).

**20**. Apply an ANOVA test of `medv` with respect to `chas` (use the function `aov()`). Print the result and its summary. What do you conclude?
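The ANOVA questions 17 to 20 could be sketched as follows. `chas` is used as stored (0/1); with only two groups, this gives the same F-test as converting it to a factor first. The names `group_means` and `anova_fit` are illustrative.

```r
library(MASS)

# How chas is stored, and how many suburbs fall in each group.
str(Boston$chas)
counts <- table(Boston$chas)
print(counts)

# Boxplots of medv for suburbs away from (0) and bounding (1) the river.
boxplot(medv ~ chas, data = Boston, xlab = "chas", ylab = "medv")

# Group means of medv in one line with aggregate().
group_means <- aggregate(medv ~ chas, data = Boston, FUN = mean)
print(group_means)

# One-way ANOVA of medv with respect to chas.
anova_fit <- aov(medv ~ chas, data = Boston)
print(anova_fit)
print(summary(anova_fit))
```

A small p-value in the summary table would lead to rejecting \(H_0 : \mu_i = \mu_j\).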

**Qualitative predictors**

**Before starting the next question, please read section 2.3.1 and Appendix D about using qualitative predictors in regression**.

We are going to use the categorical variable `chas`, which corresponds to the Charles River (= 1 if a suburb bounds the river; 0 otherwise). Using the `str()` command, you can see that this variable is not coded as a factor but takes the values 0 or 1, so it is already a dummy variable.

**21**. Fit a new model where the predictors are the Charles River and the crime rate. Interpret the coefficients of this model and conclude whether the presence of the river adds valuable information for explaining the house price.

**22**. Is `chas` still significant in the presence of more predictors?
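A sketch for questions 21 and 22, assuming that the crime rate is the column `crim` of `Boston` and that "more predictors" means all remaining variables. Both choices, and the object names, are illustrative.

```r
library(MASS)

# medv explained by the river dummy and the crime rate.
fit_chas <- lm(medv ~ chas + crim, data = Boston)
print(summary(fit_chas))

# The chas coefficient estimates the difference in mean medv between suburbs
# that bound the river and those that do not, at a fixed crime rate.
print(coef(fit_chas)["chas"])

# Same question in the presence of all the other predictors:
# the row for chas holds its estimate, standard error, t value and p-value.
fit_more <- lm(medv ~ ., data = Boston)
chas_row <- coef(summary(fit_more))["chas", ]
print(chas_row)
```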

**Interaction terms**

As you saw in section 2.3.1, we may sometimes try models with interaction terms. Say we have two predictors \(X_1\) and \(X_2\); in `lm`, interactions are added through the operators `:` and `*`. The operator `:` only adds the term \(X_1X_2\), while `*` adds \(X_1\), \(X_2\), and \(X_1X_2\).

**23**. Fit a model with a first-order interaction term where the predictors are `lstat` and `age`. Print its summary.

**24**. Fit a model with all the first-order interaction terms.
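The difference between `:` and `*`, and one possible reading of questions 23 and 24, can be sketched as follows. Using `.^2` to request all pairwise interactions is an assumption about what "all the first-order interaction terms" means; object names are illustrative.

```r
library(MASS)

# `*` expands to both main effects plus their interaction ...
fit_star  <- lm(medv ~ lstat * age, data = Boston)

# ... so it is equivalent to writing the three terms out with `:`.
fit_colon <- lm(medv ~ lstat + age + lstat:age, data = Boston)
same_fit  <- isTRUE(all.equal(coef(fit_star), coef(fit_colon)))
print(same_fit)

print(summary(fit_star))

# All main effects and all pairwise interactions between the predictors.
fit_pairs <- lm(medv ~ .^2, data = Boston)
n_terms   <- length(coef(fit_pairs))
print(n_terms)
```

With \(p\) predictors, the last model has \(1 + p + \binom{p}{2}\) coefficients, so it grows quickly and is harder to interpret.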

## Reporting

There are packages that make it **easy to create reproducible web-based reports**. To do so, click on `File -> Knit document` or `File -> Compile report..`. The output is an HTML report containing the results of your code. If your file is named `report.R`, your report is named `report.html`.

**25**. Compile a report based on your script.

- Make sure to have the latest version of `Rstudio`.
- If you have problems with compiling (problems installing packages, etc.), close `Rstudio`, reopen it with administrator privileges, and retry.
- Be ready to submit your report (your `.html` file) at the end of each class.
- Your report must be named: `YourLastName_YourFirstName_WeekNumber.html`