Practical Work 1

1.8 Some basics

1.8.1 Basic Commands

R uses functions to perform operations. To run a function called funcname, we type funcname(input1, input2), where the inputs (or arguments) input1 and input2 tell R how to run the function. A function can have any number of inputs. For example, to create a vector of numbers, we use the function c() (for concatenate).
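For instance, a minimal sketch (the code block the next paragraph refers to is not shown here; the vector vec below is reconstructed from the output printed later in the list example and is reused there):

# Create a vector of numbers with c() and save it as vec
vec <- c(-4.12, -1, 1.1, 1, 3, 4)
# Typing the name of an object prints its contents
vec
#ans> [1] -4.12 -1.00  1.10  1.00  3.00  4.00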

Note that the > is not part of the command; rather, it is printed by R to indicate that it is ready for another command to be entered. We can also save things using = rather than <-. Note that in these notes the output of the code above is prefixed by #ans>, while in the console it is not.

Hitting the up arrow multiple times will display the previous commands, which can then be edited. This is useful since one often wishes to repeat a similar command.

The ls() function allows us to look at a list of all of the objects, such as data and functions, that we have saved so far. The rm() function can be used to delete any object that we don't want.
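A minimal sketch (removedObject is just a throwaway example):

# Create a throwaway object, list what is saved, then delete it
removedObject <- 1:10
ls()
rm(removedObject)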

1.8.3 Matrices, data frames and lists

# A matrix is a two-dimensional array (filled by columns by default)
A <- matrix(1:4, nrow = 2, ncol = 2)
A
#ans>      [,1] [,2]
#ans> [1,]    1    3
#ans> [2,]    2    4

# Another matrix
B <- matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE)
B
#ans>      [,1] [,2]
#ans> [1,]    1    2
#ans> [2,]    3    4

# Binding by rows or columns
rbind(1:3, 4:6)
#ans>      [,1] [,2] [,3]
#ans> [1,]    1    2    3
#ans> [2,]    4    5    6
cbind(1:3, 4:6)
#ans>      [,1] [,2]
#ans> [1,]    1    4
#ans> [2,]    2    5
#ans> [3,]    3    6

# Entry-wise operations
A + 1
#ans>      [,1] [,2]
#ans> [1,]    2    4
#ans> [2,]    3    5
A * B
#ans>      [,1] [,2]
#ans> [1,]    1    6
#ans> [2,]    6   16

# Accessing elements
A[2, 1] # Element (2, 1)
#ans> [1] 2
A[1, ] # First row
#ans> [1] 1 3
A[, 2] # Second column
#ans> [1] 3 4

# A data frame is a matrix with column names
# Useful when you have multiple variables
myDf <- data.frame(var1 = 1:2, var2 = 3:4)
myDf
#ans>   var1 var2
#ans> 1    1    3
#ans> 2    2    4

# You can change names
names(myDf) <- c("newname1", "newname2")
myDf
#ans>   newname1 newname2
#ans> 1        1        3
#ans> 2        2        4

# The nice thing is that you can access variables by their names with the $ operator
myDf$newname1
#ans> [1] 1 2

# And also create new variables (they must have the same
# length as the rest of the variables)
myDf$myNewVariable <- c(0, 1)
myDf
#ans>   newname1 newname2 myNewVariable
#ans> 1        1        3             0
#ans> 2        2        4             1

# A list is a collection of arbitrary variables
myList <- list(vec = vec, A = A, myDf = myDf)

# Access elements by names
myList$vec
#ans> [1] -4.12 -1.00  1.10  1.00  3.00  4.00
myList$A
#ans>      [,1] [,2]
#ans> [1,]    1    3
#ans> [2,]    2    4
myList$myDf
#ans>   newname1 newname2 myNewVariable
#ans> 1        1        3             0
#ans> 2        2        4             1

# Reveal the structure of an object
str(myList)
#ans> List of 3
#ans>  $ vec : num [1:6] -4.12 -1 1.1 1 3 4
#ans>  $ A   : int [1:2, 1:2] 1 2 3 4
#ans>  $ myDf:'data.frame': 2 obs. of  3 variables:
#ans>   ..$ newname1     : int [1:2] 1 2
#ans>   ..$ newname2     : int [1:2] 3 4
#ans>   ..$ myNewVariable: num [1:2] 0 1
str(myDf)
#ans> 'data.frame': 2 obs. of  3 variables:
#ans>  $ newname1     : int  1 2
#ans>  $ newname2     : int  3 4
#ans>  $ myNewVariable: num  0 1

# A less lengthy output
names(myList)
#ans> [1] "vec"  "A"    "myDf"

1.8.4 Graphics

The plot() function is the primary way to plot data in R. For instance, plot(x, y) produces a scatterplot of the numbers in x versus the numbers in y. There are many additional options that can be passed to the plot() function. For example, passing in the argument xlab will result in a label on the x-axis. To find out more information about the plot() function, type ?plot.
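For instance, a minimal sketch with made-up data:

# Scatterplot of y versus x with custom axis labels and a title
x <- 1:10
y <- x^2
plot(x, y, xlab = "x values", ylab = "y values", main = "A simple scatterplot")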

1.8.5 Distributions

Do the following:

  • Compute the 90%, 95% and 99% quantiles of an \(F\) distribution with df1 = 1 and df2 = 5. (Answer: c(4.060420, 6.607891, 16.258177); see the sketch after this list.)
  • Sample 100 points from a Poisson with lambda = 5.
  • Plot the density of a \(t\) distribution with df = 1 (use a sequence spanning from -4 to 4). Add lines of different colors with the densities for df = 5, df = 10, df = 50 and df = 100.
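As a hint for the first two items, R's distribution functions follow the d/p/q/r prefix convention (density, distribution function, quantile function, random generation):

# Quantiles of an F distribution with df1 = 1 and df2 = 5
qf(c(0.90, 0.95, 0.99), df1 = 1, df2 = 5)  # should match the answer given above
# Sample 100 points from a Poisson with lambda = 5
rpois(100, lambda = 5)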

1.8.6 Working directory

Your working directory is the folder on your computer in which you are currently working. When you ask R to open a certain file, it will look in the working directory for this file, and when you tell R to save a data file or figure, it will save it in the working directory.

To set your working directory within RStudio you can go to Tools / Set working directory, or use the command setwd(), putting the complete path of the directory between the parentheses; do not forget to put the path in quotation marks "".

To find out the current working directory, use getwd().
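A minimal sketch (the path is hypothetical; replace it with a folder on your own computer):

# Set the working directory (note the quotation marks)
setwd("C:/Users/myName/Documents/myFolder")
# Check the current working directory
getwd()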

1.8.7 Loading Data

The read.table() function is one of the primary ways to import a data set into R. The help file ?read.table contains details about how to use this function. We can use the function write.table() to export data.

Next we will show how to load the data set Auto.data (download it from here).

  • If the file is in CSV format, we use read.csv() instead.
  • Always try to look at the file before importing it into R: open it in a text editor and check, for example, whether the first row contains the variable names and whether the columns are separated by , or ; or some other delimiter.
  • For text editors, I suggest Sublime Text or Atom.
Take a look at this (very) short introduction to R. It can be useful.
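A minimal sketch of importing Auto.data, assuming the file sits in the working directory (whether header = TRUE and na.strings = "?" are appropriate depends on what you see when you inspect the file):

# Read the file; header = TRUE uses the first row as variable names
Auto <- read.table("Auto.data", header = TRUE, na.strings = "?")
# Inspect the first rows
head(Auto)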

1.9 Regression

1.9.1 The lm function

We are going to employ the EU dataset. The EU dataset contains 28 rows with the member states of the European Union (Country), the number of seats assigned in different years (Seats2011, Seats2014), the Cambridge Compromise apportionment (CamCom2011), and the countries' populations (Population2010, Population2013). Click here to download the EU dataset.

There are two ways to tell R where the file you want to load/use/import is located, or where to save a file when you write/export/save:

  1. write the complete path of the files.
  2. set a working directory and put the files in it.

The \(F\)-test for the significance of the regression, which appears in the model summary below, is

\[\begin{align*} F=\frac{\text{SSR}/1}{\text{SSE}/(n-2)}\stackrel{H_0}{\sim} F_{1,n-2}. \end{align*}\]

# lm (for linear model) has the syntax: 
# lm(formula = response ~ predictor, data = data)
# The response is the y in the model. The predictor is x.
# For example (after loading the EU dataset)
mod <- lm(formula = Seats2011 ~ Population2010, data = EU)

# We have saved the linear model into mod, which now contains all the output of lm
# You can see it by typing
mod
#ans> 
#ans> Call:
#ans> lm(formula = Seats2011 ~ Population2010, data = EU)
#ans> 
#ans> Coefficients:
#ans>    (Intercept)  Population2010  
#ans>       7.91e+00        1.08e-06

# mod is indeed a list of objects whose names are
names(mod)
#ans>  [1] "coefficients"  "residuals"     "effects"       "rank"         
#ans>  [5] "fitted.values" "assign"        "qr"            "df.residual"  
#ans>  [9] "na.action"     "xlevels"       "call"          "terms"        
#ans> [13] "model"

# We can access these elements by $
# For example
mod$coefficients
#ans>    (Intercept) Population2010 
#ans>       7.91e+00       1.08e-06

# The residuals
mod$residuals
#ans>        Germany         France United Kingdom          Italy          Spain 
#ans>         2.8675        -3.7031        -1.7847         0.0139        -3.5084 
#ans>         Poland        Romania    Netherlands         Greece        Belgium 
#ans>         1.9272         1.9434         0.2142         1.8977         2.3994 
#ans>       Portugal Czech Republic        Hungary         Sweden        Austria 
#ans>         2.6175         2.7587         3.2898         2.0163         2.0575 
#ans>       Bulgaria        Denmark       Slovakia        Finland        Ireland 
#ans>         1.9328        -0.8790        -0.7606        -0.6813        -0.7284 
#ans>      Lithuania         Latvia       Slovenia        Estonia         Cyprus 
#ans>         0.4998        -1.3347        -2.1175        -3.3552        -2.7761 
#ans>     Luxembourg          Malta 
#ans>        -2.4514        -2.3553

# The fitted values
mod$fitted.values
#ans>        Germany         France United Kingdom          Italy          Spain 
#ans>          96.13          77.70          74.78          72.99          57.51 
#ans>         Poland        Romania    Netherlands         Greece        Belgium 
#ans>          49.07          31.06          25.79          20.10          19.60 
#ans>       Portugal Czech Republic        Hungary         Sweden        Austria 
#ans>          19.38          19.24          18.71          17.98          16.94 
#ans>       Bulgaria        Denmark       Slovakia        Finland        Ireland 
#ans>          16.07          13.88          13.76          13.68          12.73 
#ans>      Lithuania         Latvia       Slovenia        Estonia         Cyprus 
#ans>          11.50          10.33          10.12           9.36           8.78 
#ans>     Luxembourg          Malta 
#ans>           8.45           8.36

# Summary of the model
sumMod <- summary(mod)
sumMod
#ans> 
#ans> Call:
#ans> lm(formula = Seats2011 ~ Population2010, data = EU)
#ans> 
#ans> Residuals:
#ans>    Min     1Q Median     3Q    Max 
#ans> -3.703 -1.951  0.014  1.980  3.290 
#ans> 
#ans> Coefficients:
#ans>                Estimate Std. Error t value Pr(>|t|)    
#ans> (Intercept)    7.91e+00   5.66e-01    14.0  2.6e-13 ***
#ans> Population2010 1.08e-06   1.92e-08    56.3  < 2e-16 ***
#ans> ---
#ans> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#ans> 
#ans> Residual standard error: 2.29 on 25 degrees of freedom
#ans>   (1 observation deleted due to missingness)
#ans> Multiple R-squared:  0.992,   Adjusted R-squared:  0.992 
#ans> F-statistic: 3.17e+03 on 1 and 25 DF,  p-value: <2e-16

The following table contains a handy cheat sheet of equivalences between R code and some of the statistical concepts associated with linear regression.

R code                              Statistical concept
x                                   Predictor \(X_1,\ldots,X_n\)
y                                   Response \(Y_1,\ldots,Y_n\)
data <- data.frame(x = x, y = y)    Sample \((X_1,Y_1),\ldots,(X_n,Y_n)\)
model <- lm(y ~ x, data = data)     Fitted linear model
model$coefficients                  Fitted coefficients \(\hat\beta_0,\hat\beta_1\)
model$residuals                     Fitted residuals \(\hat\varepsilon_1,\ldots,\hat\varepsilon_n\)
model$fitted.values                 Fitted values \(\hat Y_1,\ldots,\hat Y_n\)
model$df.residual                   Degrees of freedom \(n-2\)
summaryModel <- summary(model)      Summary of the fitted linear model
summaryModel$sigma                  Fitted residual standard deviation \(\hat\sigma\)
summaryModel$r.squared              Coefficient of determination \(R^2\)
summaryModel$fstatistic             \(F\)-test statistic
anova(model)                        ANOVA table

Do the following:

  • Download the EU dataset from here as an .RData file and load it using the function load() (a sketch of the workflow appears after this list).
  • Compute the regression of CamCom2011 onto Population2010. Save that model as the variable myModel.
  • Access the objects residuals and coefficients of myModel.
  • Compute the summary of myModel and store it as the variable summaryMyModel.
  • Access the object sigma of summaryMyModel.
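A sketch of the workflow, assuming the file has been saved as EU.RData in the working directory:

load("EU.RData")  # loads the object EU into the workspace
myModel <- lm(CamCom2011 ~ Population2010, data = EU)
myModel$residuals
myModel$coefficients
summaryMyModel <- summary(myModel)
summaryMyModel$sigma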

1.9.2 Predicting House Value: Boston dataset

We are going to use a dataset called Boston which is part of the MASS package. It records the median value of houses for 506 neighborhoods around Boston. Our task is to predict the median house value (medv) using only one predictor (lstat: percent of households with low socioeconomic status).

STEP 1: Split the dataset
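The code for this step is not shown here; a minimal sketch of one common way to split the data (the 50/50 split and the names train and test are assumptions):

library(MASS)  # the Boston dataset lives in the MASS package
set.seed(1)    # for a reproducible split
trainIndex <- sample(1:nrow(Boston), size = nrow(Boston) / 2)
train <- Boston[trainIndex, ]
test <- Boston[-trainIndex, ]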

STEP 2: Check for Linearity

In order to perform linear regression in R, we will use the function lm() to fit a simple linear regression with medv as the response (dependent variable) and lstat as the predictor or independent variable, and then save it in model.

But before we run our model, let’s visually check if the relationship between x and y is linear.
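A sketch of this visual check, assuming the training set created above:

# Scatterplot of the response versus the predictor
plot(train$lstat, train$medv)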

On the figure above, modify the following (the relevant plot() arguments are sketched after this list):

  • Figure title.
  • Axis titles.
  • Shape of observations and their colors.
  • Sizes of the chosen shape.
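A sketch of the plot() arguments that control these elements (the titles, shape, colour and size used here are just illustrative choices):

plot(train$lstat, train$medv,
     main = "Median house value vs. socioeconomic status",  # figure title
     xlab = "lstat (%)", ylab = "medv ($1000s)",             # axis titles
     pch = 19, col = "steelblue",                            # shape and colour of observations
     cex = 0.8)                                              # size of the chosen shape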

According to the plot, we see that the relationship is not linear. Let’s try a transformation of our explanatory variable lstat.
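A sketch of the same plot with the transformed predictor:

plot(log(train$lstat), train$medv)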

Looking at the plot, the relationship is now more linear, so we can proceed and run lm():

STEP 3: Run the linear regression model
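A sketch of the fit, assuming the training set from STEP 1:

model <- lm(medv ~ log(lstat), data = train)
model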

Notice the basic information shown when we print model. This only gives us the slope \((-12.2)\) and the intercept \((51.8)\) of the linear model. Note that here we are looking at log(lstat) and not lstat anymore, so for every one-unit increase in log(lstat), the median house value decreases by about \(12.2\); equivalently, a \(1\%\) increase in lstat is associated with a decrease of roughly \(0.12\) in medv. For more detailed information, we can use the summary() function:
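For example:

summary(model)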

Now, we have access to p-values and standard errors for the coefficients, as well as the \(R^2\).

  • The output states that the slope is statistically significantly different from \(0\), with a t-value of \(-25.9\) (p-value \(< 0.05\)), which means that there is a significant relationship between the percentage of households with low socioeconomic income and the median house value.
  • This relationship is negative. That is, as the percentage of households with low socioeconomic income increases, the median house value decreases.
  • Looking at \(R^2\), we can deduce that \(62.7\%\) of the model variation is being explained by the predictor log(lstat). This is relatively low, but it would typically increase if we added more independent (explanatory) variables. We can use the names() function to see what other pieces of information are stored in our linear model (model).

To obtain the confidence interval for the linear model (model), we can use the confint() function:
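For example:

confint(model, level = 0.95)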

So, a \(95\%\) confidence interval for the slope of log(lstat) is \((-13.13, -11.28)\). Notice that this confidence interval gives us the same result as the hypothesis test performed earlier, by stating that we are \(95\%\) confident that the slope of log(lstat) is not zero (in fact, it is less than zero, which means that the relationship is negative).

STEP 4: Plot the regression model

Now, let’s plot our regression line on top of our data.
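A sketch (the points are plotted against log(lstat) so that the fitted line is straight on this scale):

plot(log(train$lstat), train$medv)
abline(model, col = "red", lwd = 2)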

Let's play with the look of the plot and make it prettier!

STEP 5: Assess the model

The final thing we will do is predict using our fitted model. We can use the predict() function for this purpose:
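A sketch, using the held-out test set from STEP 1:

predictions <- predict(model, newdata = test)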

Now let's assess our model by computing the mean squared error (MSE). To assess the model we created, we will use the test data!
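A sketch of the test MSE computation, using the predictions object from above:

mse <- mean((test$medv - predictions)^2)
mse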