Clustering For Novices

In an interactive, content+quiz format

Arun Jagota
7 min read · Jul 21, 2023

This article covers the basic concepts of clustering and hierarchical clustering, followed by three algorithms: K-means clustering, DBSCAN, and bottom-up hierarchical clustering.

Paper and pencil may help. Quiz-style questions appear throughout, and their answers come later in the article.

Clustering is the process of grouping similar data points together, keeping dissimilar ones in different clusters.

What might be a sensible clustering of the 1D data 1, 2, 3, 8, 9, 10, 20, 21, 22? Why?

In hierarchical clustering, the clusters themselves are arranged into a hierarchy. That is, clusters may have subclusters nested within them.
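To make the idea of nesting concrete, here is a minimal sketch of one way a hierarchy could be written down. The nested-tuple representation, the example values, and the helper names leaves and clusters_by_depth are all assumptions made for illustration; they are not taken from any particular library.

```python
# A hierarchy as nested tuples: a number is a single data point,
# a tuple is a cluster whose members are its children.
# (Illustrative values only.)
hierarchy = ((0, 1), ((20, 21), (30, 31)))


def leaves(node):
    """Return all data points contained in a node, left to right."""
    if not isinstance(node, tuple):
        return [node]
    points = []
    for child in node:
        points.extend(leaves(child))
    return points


def clusters_by_depth(node, depth=0, acc=None):
    """Map each depth to the list of clusters found at that depth."""
    if acc is None:
        acc = {}
    if isinstance(node, tuple):
        acc.setdefault(depth, []).append(leaves(node))
        for child in node:
            clusters_by_depth(child, depth + 1, acc)
    return acc


levels = clusters_by_depth(hierarchy)
for depth in sorted(levels):
    print(depth, levels[depth])
```

Depth 0 holds the single cluster containing every point, and each deeper level splits a cluster into its subclusters.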

What might be a sensible hierarchical clustering of the 1D data: 1, 2, 8, 9, 100,102,111,112. Why?

K-means Clustering

This is a clustering algorithm in which the number of clusters, K, is specified in advance. It follows that if the actual number of clusters in the data differs significantly from K, the algorithm may produce poor results.

The algorithm tracks and evolves the K clusters as K points in the space of the data points. These points eventually become the means of the clusters they represent, hence the term ‘means’ in…
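The idea of K points that evolve into cluster means can be sketched in a few lines for 1D data. This is a minimal illustration of the common assign-then-update loop, not necessarily the exact variant the article goes on to describe. The function name kmeans_1d, the explicit init_centers argument, the sample data, and the choice to leave an empty cluster's center where it is are all assumptions made for this sketch.

```python
def kmeans_1d(points, k, init_centers, max_iters=100):
    """Assign-then-update K-means on 1D data.

    Returns (centers, clusters). Starting centers are passed in
    explicitly so the run is deterministic (an assumption of this sketch).
    """
    centers = list(init_centers)
    assert len(centers) == k
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its old center (a simplifying assumption).
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # nothing moved, so we have converged
            break
        centers = new_centers
    return centers, clusters


points = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]

# K matches the two visible groups.
centers2, clusters2 = kmeans_1d(points, 2, init_centers=[1.0, 2.0])
print(centers2, clusters2)

# K is larger than the number of visible groups.
centers3, clusters3 = kmeans_1d(points, 3, init_centers=[1.0, 2.0, 3.0])
print(centers3, clusters3)
```

With K = 2 the centers settle at 2.0 and 12.0, the means of the two groups. With K = 3 and these starting centers, the left group is split into [1.0] and [2.0, 3.0], a small example of what can happen when K does not match the structure of the data.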

Arun Jagota

PhD, Computer Science, neural nets. 14+ years in industry: data science algos developer. 24+ patents issued. 50 academic pubs. Blogs on ML/data science topics.