PW 7

In this practical work we will learn how to create a report in Rstudio using Rmarkdown files. Then we will apply the \(k\)-means clustering algorithm using the standard function kmeans() in , we will use the dataset Ligue1 2017/2018.

Reporting

Markdown

Markdown is a lightweight markup language with plain text formatting syntax designed so that it can be converted to HTML and many other formats (pdf, docx, etc..).

Click here to see an example of a markdown (.md) syntaxes and the result in HTML. The markdown syntaxes are on right and their HTML result is on left. You can modify the source text to see the result.

Extra: There is some markdown online editors you can use, like dillinger.io/. See the Markdown source file and the HTML preview. Play with the source text to see the result in the preview.

R Markdown

R Markdown is a variant of Markdown that has embedded code chunks, to be used with the knitr package to make it easy to create reproducible web-based reports.

First, in Rstudio create a new R Markdown file. A default template will be opened. There is some code in R chunks. Click on knit, save your file and see the produced output. The output is a html report containing the results of the codes. If your file is named report.Rmd, your report is named report.html.

Make sure to have the latest version of Rstudio.
If you have problems creating a R Markdown file (problem in installing packages, etc..) close your Rstudio and reopen it with administrative tools and retry.

Be ready to submit your report (your .html file) at the end of each class.
You report must be named:

YouLastName_YourFirstName_WeekNumber.html

You can find all the informations about R Markdown on this site: rmarkdown.rstudio.com.

You may also find the following resources helpful:

The report to be submitted

In Rstudio, start by creating a R Markdown file. When you create it a default template will be opened with the following first lines:

---
title: "Untitled"
output: html_document
---

These lines are the YAML header in which you choose the settings of your report (title, author, date, appearance, etc..)

For your submitted report, use the following YAML header:

---
title: "Week 7"
subtitle: "Clustering"
author: LastName FirstName
date: "`#r format(Sys.time())`" # remove the # to show the date
output:
  html_document:
    toc: true
    toc_depth: 2
    theme: flatly
---

Very Important Remark: Click on the settings button of Rstudio’s text editor and choose to Chunk Output in Console.

In the core of your report:

Put every exercise in a section, name the section Exercise i (i is the exercise’s number).
Paste the exercise content.
Write the code of the exercise in R chunks.
Run the chunk to make sure it works.
If there is a need, explain the results.
Click on knit

\(k\)-means clustering

1. Download the dataset: Ligue1 2017-2018 and import it into . Put the argument row.names to 1.

# You can import directly from my website (instead of downloading it..)
ligue1 <- read.csv("http://mghassany.com/MLcourse/datasets/ligue1_17_18.csv", row.names=1, sep=";")

2. Print the first two rows of the dataset and the total number of features in this dataset.

You can create an awesome HTML table by using the function kable from the knitr library. For example, if you want to show the first 5 lines and 5 columns of your dataset, you can use knitr::kable(ligue1[1:5,1:5]). Give it a try and see the result on your html report!

`pointsCards`

3. We will first consider a smaller dataset to easily understand the results of \(k\)-means. Create a new dataset in which you consider only Points and Yellow.cards from the original dataset. Name it pointsCards

4. Apply \(k\)-means on pointsCards. Chose \(k=2\) clusters and put the number of iterations to 20. Store your results into km. (Remark: kmeans() uses a random initialization of the clusters, so the results may vary from one call to another. Use set.seed() to have reproducible outputs).

5. Print and describe what is inside km.

6. What are the coordinates of the centers of the clusters (called also prototypes or centroids) ?

7. Plot the data (Yellow.cards vs Points). Color the points corresponding to their cluster.

8. Add to the previous plot the clusters centroids and add the names of the observations.

9. Re-run \(k\)-means on pointsCards using 3 and 4 clusters and store the results into km3 and km4 respectively. Visualize the results like in question 7 and 8.

How many clusters \(k\) do we need in practice? There is not a single answer: the advice is to try several and compare. Inspecting the ‘between_SS / total_SS’ for a good trade-off between the number of clusters and the percentage of total variation explained usually gives a good starting point for deciding on \(k\) (criterion to select \(k\) similar to PCA).

There is several methods of computing an optimal value of \(k\) with code on following stackoverflow answer: here .

10. Visualize the “within groups sum of squares” of the \(k\)-means clustering results (use the code in the link above).

11. Modify the code of the previous question in order to visualize the ‘between_SS / total_SS’. Interpret the results.

`Ligue 1`

So far, you have only taken the information of two variables for performing clustering. Now you will apply kmeans() on the original dataset ligue1. Using PCA, we can visualize the clustering performed with all the available variables in the dataset.

By default, kmeans() does not standardize the variables, which will affect the clustering result. As a consequence, the clustering of a dataset will be different if one variable is expressed in millions or in tenths. If you want to avoid this distortion, use scale to automatically center and standardize the dataset (the result will be a matrix, so you need to transform it to a data frame again).

12. Scale the dataset and transform it to a data frame again. Store the scaled dataset into ligue1_scaled.

13. Apply kmeans() on ligue1 and on ligue1_scaled using 3 clusters and 20 iterations. Store the results into km.ligue1 and km.ligue1.scaled respectively (do not forget to set a seed)

14. How many observations there are in each cluster of km.ligue1 and km.ligue1.scaled ? (you can use table()). Do you obtain the same results when you perform kmeans() on the scaled and unscaled data?

`PCA`

15. Apply PCA on ligue1 dataset and store you results in pcaligue1. Do we need to apply PCA on the scaled dataset? Justify your answer.

16. Plot the observations and the variables on the first two principal components (biplot). Interpret the results.

17. Visualize the teams on the first two principal components and color them with respect to their cluster.

# You can use the following code, based on `factoextra` library.
fviz_cluster(km.ligue1, data = ligue1, # km.ligue1 is where you stored your kmeans results
              palette = c("red", "blue", "green"), # 3 colors since 3 clusters
              ggtheme = theme_minimal(),
              main = "Clustering Plot"
)

18. Recall that the figure of question 17 is a visualization with PC1 and PC2 of the clustering done with all the variables, not on PC1 and PC2. Now apply the kmeans() clustering taking only the first two PCs instead the variables of original dataset. Visualize the results and compare with the question 17.

By applying \(k\)-means only on the PCs we obtain different and less accurate result, but it is still an insightful way.

`Implementing k-means`

In this part, you will perform \(k\)-means clustering manually, with \(k=2\), on a small example with \(n=6\) observations and \(p=2\) features. The observations are as follows.

Observation	\(X_1\)	\(X_2\)
1	1	4
2	1	3
3	0	4
4	5	1
5	6	2
6	4	0

19. Plot the observations.

20. Randomly assign a cluster label to each observation. You can use the sample() command in to do this. Report the cluster labels for each observation.

21. Compute the centroid for each cluster.

22. Create a function that calculates the Euclidean distance for two observations.

23. Assign each observation to the centroid to which it is closest, in terms of Euclidean distance. Report the cluster labels for each observation.

24. Repeat 21 and 23 until the answers obtained stop changing.

25. In your plot from 19, color the observations according to the cluster labels obtained.

◼