Clustering

Clustering#

Another unsupervised machine learning technique is clustering. Clustering aims to identify similarities between data points and, therefore, cluster them. This is an unsupervised approach, not requiring labels to be used. Instead, the positions of the data points in the feature space may provide a guide for the labels.

../_images/clustering.png

Fig. 24 A visual representation of two different clusterings of the same dataset.#

Clustering aims to find natural groupings in data. However, as a result, it depends on the feature space in which the data is clustered. Some dimensionality reduction techniques are often used before data clustering to provide a more meaningful psuedo-feature space. But, many clustering algorithms can operate in N-dimensions, though our appreciation of them is limited to two or three.

In this section, we will look at three popular and powerful clustering algorithms: k-means clustering, Gaussian mixture models and hierarchical clustering, but this is by no means the only approach. The scikit-learn documentation has a nice overview of clustering methods that is great to look at.