PW 4

During this session we are going to continue the analysis of the Social_Network_Ads dataset. Recall that this dataset contains information about users of a social network and whether they bought a specific product. Last week we built a logistic regression model for the variable Purchased as a function of Age and EstimatedSalary. We will consider the same variables this week, but we will fit different models using methods such as LDA, QDA, and Naive Bayes.

Logistic Regression

1. First, let’s redo the pre-processing steps you were asked to do during the last session and fit a logistic regression model. Please read the following code carefully and make sure you understand it (read the comments!). Then copy what is necessary for today’s session to your report (but remove my comments!).

If you lost the dataset, you can download it from here.


# Loading the dataset. I have put it into a folder called "datasets"
dataset <- read.csv('http://www.mghassany.com/MLcourse/datasets/Social_Network_Ads.csv')

# Describing and Exploring the dataset

str(dataset) # to show the structure of the dataset. 
summary(dataset) # will show some statistics of every column. 
# Remark what it shows when the column is numerical vs. categorical.
# Remark that it makes no sense for the variable User.ID.

boxplot(Age ~ Purchased, data=dataset, col = "blue", main="Boxplot Age ~ Purchased")
# You know what a boxplot is, right? I will let you interpret it.
boxplot(EstimatedSalary ~ Purchased, data=dataset,col = "red",
 main="Boxplot EstimatedSalary ~ Purchased")
# Another boxplot

aov(EstimatedSalary ~ Purchased, data=dataset)
# ANOVA test, but we need to show its summary
# in order to see the p-value and to interpret it.

summary(aov(EstimatedSalary ~ Purchased, data=dataset))
# What do you conclude?
# Now another ANOVA test for the variable Age
summary(aov(Age ~ Purchased, data=dataset))

# There is a categorical variable in the dataset, which is Gender.
# Of course we cannot show a boxplot of Gender and Purchased.
# But we can show a table, or a mosaic plot, both tell the same thing.
table(dataset$Gender,dataset$Purchased)
# Remark for the function table(), that
# in lines we have the first argument, and in columns we have the second argument.
# Don't forget this when you use table() to show a confusion matrix!
mosaicplot(~ Purchased + Gender, data=dataset,
  main = "MosaicPlot of two categorical variables: Puchased & Gender",
  color = 2:3, las = 1)

# Since these 2 variables are categorical, we can apply
# a Chi-square test. The null hypothesis is the independence of
# these variables. You will notice that p-value = 0.4562, which is higher than 0.05 (5%),
# so we cannot reject the null hypothesis.
# Conclusion: there is no evidence of dependence between Gender and Purchased (who
# said that women buy more than men? hah!)

chisq.test(dataset$Purchased, dataset$Gender)

# Let's say we want to remove the first two columns as we are not going to use them.
# But we could in fact use a categorical variable as a predictor in logistic regression.
# It would be treated the same way as in linear regression. Check Appendix C.
# Try it by yourself if you would like to.
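# For illustration only (a hedged sketch; try it BEFORE removing the columns):
# glm(Purchased ~ Gender + Age + EstimatedSalary, family = binomial, data = dataset)
# R turns the character column Gender into a dummy variable (e.g. GenderMale),
# exactly as in linear regression.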
dataset = dataset[3:5]
str(dataset) # show the new structure of dataset


# splitting the dataset into training and testing sets
library(caTools)
set.seed(123) # CHANGE THE VALUE OF SEED. PUT YOUR STUDENT'S NUMBER INSTEAD OF 123.
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# scaling
# So here, we have two continuous predictors, Age and EstimatedSalary.
# There is a very big difference in their scales (units).
# That's why we scale them. But it is not always necessary.

training_set[-3] <- scale(training_set[-3]) # only the first two columns (Age and EstimatedSalary)
test_set[-3] <- scale(test_set[-3])

# Note that we replace the columns Age and EstimatedSalary in the training and
# test sets with their scaled versions. I noticed in a lot of reports that you scaled
# but did not do the replacement.
# Note too that if you do it column by column you will have a problem, because
# scale() replaces the column by a one-column matrix; you then need to transform it back to a vector.
# Last note: to select the columns Age and EstimatedSalary we can do it like I did, or with
# training_set[c(1,2)] or training_set[,c(1,2)] or training_set[,c("Age","EstimatedSalary")]
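# For illustration only (a hedged sketch of the column-by-column issue mentioned above):
# scale(training_set$Age) returns a one-column matrix, so you would need as.vector(),
# e.g. training_set$Age <- as.vector(scale(training_set$Age))
# (do not run it here, the columns are already scaled)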


# logistic regression

classifier.logreg <- glm(Purchased ~ Age + EstimatedSalary , family = binomial, data=training_set)
classifier.logreg
summary(classifier.logreg)

# prediction
pred.glm = predict(classifier.logreg, newdata = test_set[,-3], type="response")
# Do not forget to put type = "response".
# By the way, you know what you get when you do not put it, right?
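# For illustration (a hedged sketch): without type = "response", predict() on a glm
# returns the linear predictor (the log-odds) by default, not probabilities.
head(predict(classifier.logreg, newdata = test_set[,-3]))  # log-odds scale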

# Now let's assign observations to classes with respect to the probabilities
pred.glm_0_1 = ifelse(pred.glm >= 0.5, 1,0)
# I created a new vector, because we need the probabilities later for the ROC curve.

# show some values of the vectors
head(pred.glm)
head(pred.glm_0_1)

# confusion matrix
cm = table(test_set[,3], pred.glm_0_1)
cm
# First line to store it into cm, second line to show the matrix! 

# You remember my note about table() function and the order of the arguments?
cm = table(pred.glm_0_1, test_set[,3])
cm

# You can show the confusion matrix in a mosaic plot by the way
mosaicplot(cm,col=sample(1:8,2)) # colors are random between 8 colors.

# ROC
require(ROCR)
score <- prediction(pred.glm,test_set[,3]) # we use the predicted probabilities not the 0 or 1
performance(score,"auc") # y.values
plot(performance(score,"tpr","fpr"),col="green")
abline(0,1,lty=8)

So now we have a logistic regression model stored in classifier.logreg. It is a model of Purchased as a function of Age and EstimatedSalary. We will use this model to show the decision boundary in the next part of this PW. Then we will compare this model to other models obtained with discriminant analysis approaches.

Decision Boundary of Logistic Regression

Now you are going to visualize the decision boundary for logistic regression.

  • Since the decision boundary of logistic regression is linear (you know why, right?) and the dimension of the feature space is 2 (Age and EstimatedSalary), the decision boundary in this 2-dimensional space is a line that separates the predicted classes “0” and “1” (values of the response Purchased).
  • For logistic regression, we predict \(y=1\) if \(\beta^T X \geq 0\) (one side of the line) and \(y=0\) if \(\beta^T X < 0\) (the other side), where

\[ \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix} \,\, \text{and} \,\, X = \begin{pmatrix} 1 \\ X_1 \\ X_2 \end{pmatrix}\]

So we predict \(y=1\) if \(\beta_0 + \beta_1 X_1 + \beta_2 X_2 \geq 0\) which means that the equation of the decision boundary (a line here) is \(X_2 = - \frac{\beta_1}{\beta_2}X_1 - \frac{\beta_0}{\beta_2}\)
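Here is a minimal sketch of how the slope and intercept of this line can be extracted from the fitted coefficients (it assumes the model is stored in classifier.logreg as above; building and titling the plot is still up to you):

# estimated coefficients, in the order (Intercept), Age, EstimatedSalary
coefs <- coef(classifier.logreg)
slope     <- -coefs["Age"] / coefs["EstimatedSalary"]
intercept <- -coefs["(Intercept)"] / coefs["EstimatedSalary"]
# abline(intercept, slope) then adds the boundary to a plot of EstimatedSalary ~ Age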

2. Plot the decision boundary obtained with logistic regression. In order to do so, calculate the intercept and the slope of the line representing the decision boundary, then plot EstimatedSalary as a function of Age (from the test_set) and add the line using abline().

3. In order to verify that your line (decision boundary) is well plotted, color the points on the last Figure with respect to the predicted response.

Hints:

  • If your predictions are stored in y_pred, you can do it using bg = ifelse(y_pred == 1, 'color1', 'color2'), and set the argument pch to 21 (you can choose pch to be any value between 21 and 25, try it).
  • Then, add the line using abline(), and set the line width to 2 (lwd = 2) to make it more visible. Do not forget to title the Figure.

4. Now make the same plot, but color the points with respect to their real labels (the variable Purchased). From this figure, count the number of false positive predictions and compare it to the value obtained in the confusion matrix.

Linear Discriminant Analysis (LDA)

Let us apply linear discriminant analysis (LDA) now. First we will make use of the lda() function in the package MASS. Second, you are going to create the model and predict the classes by yourself, without using the lda() function. Finally, we will visualize the decision boundary of LDA.

5. Fit an LDA model of Purchased as a function of Age and EstimatedSalary. Name the model classifier.lda.

library(MASS)
classifier.lda <- lda(Purchased~Age+EstimatedSalary, data=training_set)

6. Call classifier.lda and understand what it computes.

Plus: if you enter the following, you will get a list of summary information about the computation:

classifier.lda$prior
classifier.lda$means

7. On the test set, predict the probability of purchasing the product by the users using the model classifier.lda. Remark that when we predict using LDA, we obtain a list instead of a vector of probabilities; run str() on your predictions to see what you get.

Remark: we get the predicted class here directly, without having to round the predicted probabilities as we did for logistic regression.
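A minimal sketch of what this looks like (assuming the predictions are stored in pred.lda):

pred.lda <- predict(classifier.lda, newdata = test_set[,-3])
str(pred.lda)             # a list with $class, $posterior and $x
head(pred.lda$class)      # the predicted classes
head(pred.lda$posterior)  # the posterior probabilities for classes "0" and "1"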

8. Compute the confusion matrix and compare the prediction results obtained by LDA to the ones obtained by logistic regression. What do you remark? (Hint: compare the accuracies.)
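As a hint, a minimal sketch of the accuracy computation (assuming your confusion matrix is stored in cm):

accuracy <- sum(diag(cm)) / sum(cm)  # proportion of correctly classified observations
accuracy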

9. Now let us plot the decision boundary obtained with LDA. You saw in the course that the decision boundary for LDA represents the set of values \(x\) where \(\delta_k(x) = \delta_c(x)\) for two classes \(k\) and \(c\). Recall that \[ \delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k \]

Here in our case, we have 2 classes (\(K=2\)) and 2 predictors (\(p=2\)). So the decision boundary (which is linear in the case of LDA, and a line in our case since \(p=2\)) satisfies the equation \(\delta_0(x) = \delta_1(x)\), since we have two classes “0” and “1”. In the case of LDA this leads to a linear boundary that is easy to plot. But in more complicated cases it is difficult to simplify the equations manually and plot the decision boundary. Anyway, there is a smart (but a little bit costly) method to plot the decision boundary in R using the function contour(); the corresponding code is the following (you must adapt it and use it to plot your decision boundary):

# create a grid corresponding to the scales of Age and EstimatedSalary
# and fill this grid with lot of points
X1 = seq(min(training_set[, 1]) - 1, max(training_set[, 1]) + 1, by = 0.01)
X2 = seq(min(training_set[, 2]) - 1, max(training_set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
# Adapt the variable names
colnames(grid_set) = c('Age', 'EstimatedSalary')

# plot 'Estimated Salary' ~ 'Age'
plot(test_set[, 1:2],
     main = 'Decision Boundary LDA',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))

# color the plotted points with their real label (class)
points(test_set[, 1:2], pch = 21, bg = ifelse(test_set[, 3] == 1, 'green4', 'red3'))

# Make predictions on the points of the grid, this will take some time
pred_grid = predict(classifier.lda, newdata = grid_set)$class

# Separate the predictions by a contour
contour(X1, X2, matrix(as.numeric(pred_grid), length(X1), length(X2)), add = TRUE)

LDA from scratch

10. Now let us build an LDA model for our dataset without using the lda() function. You are free to do it by creating a function or not. Go back to question 6 and see what you obtained by using lda(): it computes the prior probability of group membership and the estimated group means for each of the two groups. Additional information that is not shown, but is important, is the single covariance matrix shared by the groups.

In LDA, we compute for every observation \(x\) its discriminant score \(\delta_k(x)\). Then we attribute \(x\) to the class that has the highest \(\delta\). Recall that

\[\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k\]

So to compute \(\delta_k(x)\) we need to estimate \(\pi_k\), \(\mu_k\) and \(\Sigma\).

Note that \[x=\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}\] and here \(X_1\)=Age and \(X_2\)=EstimatedSalary.

So let us do it step by step; first we compute the estimates (a minimal sketch of these computations is given after step 10.6):

10.1 Subset the training set into two sets: class0 where Purchased = 0 and class1 where Purchased = 1.

10.2 Compute \(\pi_0\) and \(\pi_1\).

\[\pi_i = N_i / N, \,\, \text{where} \,\, N_i \,\, \text{is the number of data points in group } i\]

10.3 Compute \(\mu_0\) and \(\mu_1\). \[\mu_0 = \begin{pmatrix} \mu_0(X_1) \\ \mu_0(X_2) \end{pmatrix} \,\, \text{and} \,\, \mu_1 = \begin{pmatrix} \mu_1(X_1) \\ \mu_1(X_2) \end{pmatrix}\]

where, for example, \(\mu_0(X_1)\) is the mean of the variable \(X_1\) in the group \(0\) (the subset class0).

10.4 Compute \(\Sigma\). In the case of two classes like here, it is computed by calculating the following:

\[\Sigma = \frac{(N_0-1)\Sigma_0 + (N_1-1)\Sigma_1}{N_0+N_1-2}\]

where \(\Sigma_i\) is the estimated covariance matrix for specific group \(i\).

Remark: Recall that in LDA we use the same \(\Sigma\). But in QDA we do not.

10.5. Now that we have computed all the needed estimates, we can calculate \(\delta_0(x)\) and \(\delta_1(x)\) for any observation \(x\), and we will attribute \(x\) to the class with the highest \(\delta\). First, try it for \(x\) where \(x^T=(1,1.5)\); what is the class prediction for this specific \(x\)?

10.6. Compute the discriminant scores \(\delta\) for the test set (a matrix \(100\times 2\)), predict the classes and compare your results with the results obtained with the lda() function.
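A minimal sketch of these computations, under the assumption that the training and test sets are prepared as above (the object names are mine; adapt them and check every step against your own derivation):

# 10.1 split the predictors of the training set by class
class0 <- training_set[training_set$Purchased == 0, 1:2]
class1 <- training_set[training_set$Purchased == 1, 1:2]

# 10.2 prior probabilities
N0 <- nrow(class0); N1 <- nrow(class1)
pi0 <- N0 / (N0 + N1); pi1 <- N1 / (N0 + N1)

# 10.3 group means
mu0 <- colMeans(class0)
mu1 <- colMeans(class1)

# 10.4 pooled covariance matrix
Sigma <- ((N0 - 1) * cov(class0) + (N1 - 1) * cov(class1)) / (N0 + N1 - 2)
Sigma_inv <- solve(Sigma)

# discriminant score delta_k for a matrix of observations X (one row per x)
delta <- function(X, mu, pi_k) {
  as.matrix(X) %*% Sigma_inv %*% mu -
    0.5 * as.numeric(t(mu) %*% Sigma_inv %*% mu) + log(pi_k)
}

# 10.5 example for x^T = (1, 1.5): attribute x to the class with the larger score
x <- matrix(c(1, 1.5), nrow = 1)
c(delta(x, mu0, pi0), delta(x, mu1, pi1))

# 10.6 scores for the whole test set and the predicted classes
scores <- cbind(delta(test_set[, 1:2], mu0, pi0),
                delta(test_set[, 1:2], mu1, pi1))
pred.lda.scratch <- ifelse(scores[, 2] > scores[, 1], 1, 0)
table(pred.lda.scratch, test_set[, 3])  # compare with the lda() predictions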

Quadratic Discriminant Analysis (QDA)

Training and assessing a QDA model in R is very similar in syntax to training and assessing an LDA model. The only difference is the function name, qda().

11. Fit a QDA model of Purchased in function of Age and EstimatedSalary. Name the model classifier.qda.

# qda() is a function of library(MASS)
classifier.qda <- qda(Purchased~., data = training_set)

12. Make predictions on the test_set using the QDA model classifier.qda. Show the confusion matrix and compare the results with the predictions obtained using the LDA model classifier.lda.

13. Plot the decision boundary obtained with QDA. Color the points with the real labels.

Comparison

14. In order to compare the methods we used, plot on the same figure the ROC curve of each classifier we fitted and compare the corresponding AUCs. What was the best model for this dataset? Can you justify it?

Remark: If you use the ROCR package:

  • For logistic regression, use the predicted probabilities in prediction() (and not the rounded values “0” or “1”).
  • For LDA and QDA, put pred.lda$posterior[,2] in the prediction() function (those are the posterior probabilities that observations belong to class “1”).
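For example, a minimal sketch of the comparison (it assumes pred.glm holds the logistic regression probabilities and pred.lda, pred.qda the LDA and QDA prediction objects; adapt the names to yours):

library(ROCR)
scores.list <- list(
  "Logistic regression" = pred.glm,                 # predicted probabilities
  "LDA"                 = pred.lda$posterior[, 2],  # posterior P(Purchased = 1)
  "QDA"                 = pred.qda$posterior[, 2]
)
cols <- c("green", "blue", "red")
for (i in seq_along(scores.list)) {
  score <- prediction(scores.list[[i]], test_set[, 3])
  cat(names(scores.list)[i], "AUC =", performance(score, "auc")@y.values[[1]], "\n")
  plot(performance(score, "tpr", "fpr"), col = cols[i], add = (i > 1))
}
abline(0, 1, lty = 8)
legend("bottomright", legend = names(scores.list), col = cols, lty = 1)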