k-Means Clustering#

The most straightforward approach to clustering is to group together data points that are as close as possible to each other in space. This is essentially what k-means clustering does: given a predefined number of clusters, it uses an Expectation-Maximisation (EM) algorithm to find the positions of the cluster centres that minimise the Euclidean distance between each data point and its assigned centre.

Expectation-Maximisation Algorithm#

The EM algorithm is an iterative optimisation approach that is popular for finding the best estimates of parameters in statistical models, such as k-means. Generally, the algorithm alternates between two stages:

  1. Expectation: using the current estimates of the model’s parameters to evaluate how well they describe the observed data (for k-means, assigning each data point to its nearest cluster centre).

  2. Maximisation: updating the parameters to improve how well they describe the observed data (for k-means, moving each cluster centre to the mean of the points assigned to it).

We will see the EM algorithm appear a few times, so it is nice to consider it in the abstract.
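To make these two stages concrete, the following is a minimal sketch of the EM-style loop behind k-means, written with NumPy. The function name kmeans_em, the fixed iteration count, and the random initialisation are illustrative choices, and the sketch assumes no cluster ever ends up empty.

import numpy as np

def kmeans_em(points, k, n_iter=10, seed=0):
    # An illustrative k-means loop, not a production implementation.
    rng = np.random.default_rng(seed)
    # Initialise the centres at k randomly chosen data points.
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Expectation: assign every point to its nearest centre.
        distances = np.linalg.norm(points[:, None] - centres[None], axis=2)
        labels = distances.argmin(axis=1)
        # Maximisation: move each centre to the mean of its assigned points
        # (this assumes every cluster keeps at least one point).
        centres = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return centres, labels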

k-Means In Action#

Before discussing the k-means algorithm, let’s look at what it can do for some of the data transformed into two principal components in the previous section.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardise the features (skipping the Diagnosis column) and apply PCA.
data = pd.read_csv('./../data/breast-cancer.csv')
scaled_data = StandardScaler().fit_transform(data[data.columns[1:]])
pca = PCA()
transformed = pca.fit_transform(scaled_data)
# Collect the principal components in a DataFrame, keeping the true labels.
pc = pd.DataFrame(transformed, columns=['PC{}'.format(i + 1) for i in range(transformed.shape[1])])
pc['Diagnosis'] = data['Diagnosis']
pc[['PC1', 'PC2']].head()
        PC1       PC2
0  5.224155  3.204428
1  1.728094 -2.540839
2  3.969757 -0.550075
3  3.596713  6.905070
4  3.151092 -1.358072

As with many statistical machine learning approaches, scikit-learn provides a useful k-means implementation.

from sklearn.cluster import KMeans

# Cluster the first two principal components into two groups.
kmeans = KMeans(n_clusters=2)
kmeans.fit(pc[['PC1', 'PC2']])
# Store the cluster assigned to each data point.
pc['Cluster'] = kmeans.labels_

The KMeans class follows the standard pattern of other scikit-learn implementations: an object is created and then fitted to the data. Once fitted, the optimised positions of the cluster centres can be inspected through the cluster_centers_ attribute (the exact values depend on the random initialisation).
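# The two optimised cluster centres in (PC1, PC2) space.
kmeans.cluster_centers_

We can visualise these results in the same way as previously, though this time the colour indicates the cluster selected for each point.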

import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot of the two principal components, coloured by cluster.
sns.jointplot(x='PC1', y='PC2', hue='Cluster', data=pc, kind='scatter')
plt.show()
[Figure: joint scatter plot of PC1 against PC2, coloured by the k-means cluster assignment.]

To assess how accurately the PCA-plus-clustering approach recovers the true labels, we map the Benign and Malignant diagnoses to 0 and 1, respectively.

# Encode the true diagnoses numerically to match the cluster labels.
true_labels = pc['Diagnosis'].map({'Malignant': 1, 'Benign': 0})
cluster_labels = pc['Cluster']
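One caveat worth noting: k-means numbers its clusters arbitrarily, so cluster 1 is not guaranteed to correspond to Malignant. A minimal check, which flips the labels if the majority alignment is reversed (here the fitted labels happen to align already):

# Flip the cluster labels if they disagree with the true labels
# more often than they agree.
if (true_labels == cluster_labels).mean() < 0.5:
    cluster_labels = 1 - cluster_labels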

Using this, a confusion matrix can be plotted.

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Order the matrix as [Malignant, Benign] to match the display labels,
# and normalise each row by the number of true instances of that class.
disp = ConfusionMatrixDisplay(confusion_matrix(true_labels, cluster_labels,
                                               labels=[1, 0], normalize='true'),
                              display_labels=['Malignant', 'Benign'])
disp.plot()
plt.show()
plt.show()
[Figure: row-normalised confusion matrix of true diagnosis against k-means cluster.]

We see that this approach produces a false negative rate (where a tumour was predicted to be benign but was actually malignant) of 1.4%. However, it also has a false positive rate of over 20%, meaning that over 20% of benign tumours were misidentified as malignant. This approach is unlikely to be acceptable for real-world use, other than perhaps as an early screening tool to flag cases for additional testing.
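These rates correspond to the off-diagonal entries of the row-normalised matrix; as a sketch, reusing the [Malignant, Benign] ordering from above:

cm = confusion_matrix(true_labels, cluster_labels, labels=[1, 0], normalize='true')
# cm[0, 1]: malignant tumours placed in the benign cluster (false negatives).
# cm[1, 0]: benign tumours placed in the malignant cluster (false positives).
print('False negative rate: {:.1%}'.format(cm[0, 1]))
print('False positive rate: {:.1%}'.format(cm[1, 0]))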

Another scoring metric that can be used is the accuracy score.

from sklearn.metrics import accuracy_score

# Fraction of points whose cluster matches the true diagnosis.
accuracy_score(true_labels, cluster_labels)
0.9033391915641477

This tells us that the accuracy, which is the ratio of correct predictions to total predictions, is around 90%.
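Equivalently, the same number can be computed by hand as the fraction of points whose cluster label matches the true label:

# Accuracy computed directly as the mean of the element-wise comparison.
(true_labels == cluster_labels).mean()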

As we work through the material, we will look at other metrics for scoring machine learning classification algorithms. But now, let’s look at the k-means algorithm.