5 Density-based Clustering

5.1 Documents

Documents for this chapter

Table of contents 📄 from the book “Data Clustering Algorithms and Applications” by C. Aggarwal and C. Reddy.
Cluster analysis lecture notes 📄 from Prof. Erich Schubert.
Clustering lecture notes 📄 from Prof. Emilie Chouzenoux and Prof. Frédéric Pascal.
DBSCAN chapter 📄 from the book “The Unsupervised Learning” by A. Jones, C. Kruger and B. Johnston.
Density-based Methods 📄 from the book “Data Mining Concepts and Techniques” by J. Han, M. Kamber and J. Pei.
DBSCAN: Density-Based Clustering 📄 from the book “Practical Guide To Cluster Analysis in R” by A. Kassambara.
Cluster validation: Internal versus External indexes.

5.2 Lab

During an interview, you are asked to implement the DBSCAN algorithm from scratch on a generated two-dimensional dataset. To do this, you will need to turn the theory behind neighborhood searching into working code, with a recursive call that adds neighbors. As explained in the documents, you will scan the space within a given distance of a specified point to find the neighbors to add.
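The distance scan around a point can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and a NumPy array of points; the function name `region_query` and the toy data are made up for this example and do not come from the documents.

```python
import numpy as np


def region_query(points, index, eps):
    """Return the indices of all points within distance eps of points[index].

    The query point itself is included in its own neighborhood.
    """
    # Euclidean distance from the query point to every point in the dataset.
    distances = np.linalg.norm(points - points[index], axis=1)
    return np.flatnonzero(distances <= eps)


# Tiny illustrative dataset (hypothetical values).
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0]])
neighbors = region_query(points, 0, eps=0.5)
print(neighbors)  # [0 1 2]
```

Because the distance is computed along the last axis, the same function works unchanged for points with more than two coordinates.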

Given what you’ve learned about DBSCAN and distance metrics from the documents, build an implementation of DBSCAN from scratch in R or Python. You are free to use R or Python libraries to evaluate the distances.

These steps will help you to complete the activity:

Generate a random cluster dataset.
Visualize the data.
Create functions from scratch that allow you to call DBSCAN on a dataset.
Use your DBSCAN implementation to find clusters in the generated dataset. Choose the hyperparameters as you see fit and tune them based on the resulting clustering performance.
Visualize the clustering performance of your DBSCAN implementation from scratch.
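One possible shape for these steps is sketched below. It is not a reference solution: it assumes Euclidean distance, counts a point as its own neighbor when checking `min_pts`, and grows clusters with an explicit queue in place of the recursive call described above, which avoids Python's recursion limit on large clusters. The helper `make_blobs_2d` and all parameter values are hypothetical choices for illustration.

```python
from collections import deque

import numpy as np


def make_blobs_2d(centers, n_per_cluster, spread, seed=0):
    """Draw Gaussian blobs around the given 2-D centers (toy data generator)."""
    rng = np.random.default_rng(seed)
    blobs = [rng.normal(loc=c, scale=spread, size=(n_per_cluster, 2)) for c in centers]
    return np.vstack(blobs)


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Full pairwise distance matrix: O(n^2) memory, acceptable for a small lab dataset.
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    # Each neighborhood includes the point itself.
    neighborhoods = [np.flatnonzero(dist[i] <= eps) for i in range(n)]

    labels = np.full(n, -1)            # every point starts as noise
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighborhoods[i]) < min_pts:
            continue                   # not a core point; may be claimed as border later
        # Grow a new cluster outward from core point i.
        labels[i] = cluster_id
        queue = deque(neighborhoods[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster_id  # border or core point joins the cluster
            if visited[j]:
                continue
            visited[j] = True
            if len(neighborhoods[j]) >= min_pts:
                queue.extend(neighborhoods[j])  # j is a core point: keep expanding
        cluster_id += 1
    return labels


X = make_blobs_2d(centers=[(0.0, 0.0), (5.0, 5.0)], n_per_cluster=60, spread=0.3)
labels = dbscan(X, eps=0.8, min_pts=5)
print("clusters found:", len(set(labels.tolist()) - {-1}))
print("noise points:", int((labels == -1).sum()))
```

For the two visualization steps, the labels can be passed as colors to a scatter plot, for example `plt.scatter(X[:, 0], X[:, 1], c=labels)` with matplotlib.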

The desired outcome of this lab is for you to implement DBSCAN from the ground up before you use a fully packaged implementation such as the one in scikit-learn. Building a machine learning algorithm from scratch in this way is important: it helps you “earn” the right to use the easier implementations, while still being able to discuss DBSCAN in depth in the future.

Once your implementation is done, you can complete your work by¹:

Extending your algorithm to multidimensional datasets (more than two dimensions).
Implementing a tool to help tune the hyperparameters of DBSCAN.
Implementing a tool to compare the performance of DBSCAN with k-means and hierarchical clustering.
Implementing other density-based algorithms, such as HDBSCAN or OPTICS.
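As one possible starting point for the hyperparameter tool, a common heuristic for choosing the neighborhood radius is the sorted k-distance curve: compute each point's distance to its k-th nearest neighbor, sort these values, and look for the distance at which the curve bends sharply upward. The sketch below assumes Euclidean distance; the function name `k_distances`, the toy data, and the choice `k=4` are illustrative assumptions, and the knee is left to be read from a plot rather than detected automatically.

```python
import numpy as np


def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbor (self excluded)."""
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest other point.
    kth = np.sort(dist, axis=1)[:, k]
    return np.sort(kth)


# Toy data: two Gaussian blobs (hypothetical values).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(60, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(60, 2)),
])
curve = k_distances(X, k=4)
# Print a few quantiles of the curve; plotting `curve` shows where it bends.
print(np.quantile(curve, [0.5, 0.9, 1.0]))
```

The function makes no assumption about the number of columns in `points`, so it also applies to datasets with more than two dimensions.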


¹ It is up to you to choose between them.