How Does PCA Work?#

Data Cleaning and Scaling#

Before we look at the algorithm, we will read in the abalone data discussed previously and clean it for processing.

import pandas as pd

data = pd.read_csv('./../data/abalone.csv')
data
Sex Length/mm Diameter/mm Height/mm Whole Weight/g Shucked Weight/g Viscera Weight/g Shell Weight/g Rings
0 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.1500 15
1 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.0700 7
2 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.2100 9
3 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.1550 10
4 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.0550 7
... ... ... ... ... ... ... ... ... ...
4172 F 0.565 0.450 0.165 0.8870 0.3700 0.2390 0.2490 11
4173 M 0.590 0.440 0.135 0.9660 0.4390 0.2145 0.2605 10
4174 M 0.600 0.475 0.205 1.1760 0.5255 0.2875 0.3080 9
4175 F 0.625 0.485 0.150 1.0945 0.5310 0.2610 0.2960 10
4176 M 0.710 0.555 0.195 1.9485 0.9455 0.3765 0.4950 12

4177 rows × 9 columns

As mentioned previously, the sex data is categorical and is therefore incompatible with the PCA algorithm in the form that we will discuss, so we remove that column.

# Drop the categorical 'Sex' column, keeping the eight numerical features
cleaned = data[data.columns[1:]]

The remaining features are continuous variables and can be used for the PCA algorithm.
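As an aside, if we would rather not rely on the column order, a minimal equivalent is to keep only the numeric columns with the standard pandas select_dtypes method; the scaled_numeric name below is purely illustrative.

# Equivalent selection: keep only numeric columns, which drops the categorical 'Sex'
cleaned = data.select_dtypes(include='number')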

Warning

Here, we highlight that the rings data is not strictly continuous, as it can only take integer values. We refer to data of this type as discrete. However, as it does not describe a category of the feature, like sex, it is compatible with PCA.

If we look at the variances for each feature, we will see that some features (rings) cover much larger ranges than others (diameter). Therefore, if we naïvely applied the PCA algorithm, we would risk the first principal component containing almost exclusively information about the rings, as this feature has the largest variance.

cleaned.var()
Length/mm            0.014422
Diameter/mm          0.009849
Height/mm            0.001750
Whole Weight/g       0.240481
Shucked Weight/g     0.049268
Viscera Weight/g     0.012015
Shell Weight/g       0.019377
Rings               10.395266
dtype: float64

Hence, we must scale our data. There are a few approaches to this, but here we will scale our data such that the mean for each feature is 0 and the variance is 1. In scikit-learn, this is implemented as the StandardScaler, but here we will do it manually to show how it is achieved.

# Standardise each feature: subtract its mean and divide by its standard deviation
scaled = (cleaned - cleaned.mean()) / cleaned.std()
scaled.var()
Length/mm           1.0
Diameter/mm         1.0
Height/mm           1.0
Whole Weight/g      1.0
Shucked Weight/g    1.0
Viscera Weight/g    1.0
Shell Weight/g      1.0
Rings               1.0
dtype: float64
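For reference, a minimal sketch of the same step with scikit-learn's StandardScaler is shown below (the scaled_sk name is ours). Note that StandardScaler divides by the population standard deviation (ddof=0), whereas the pandas .std() above uses the sample standard deviation (ddof=1), so the two results differ by a factor of \(\sqrt{n/(n-1)}\), which is negligible for 4177 rows.

from sklearn.preprocessing import StandardScaler

# StandardScaler returns a NumPy array; wrap it back into a DataFrame for convenience
scaled_sk = pd.DataFrame(StandardScaler().fit_transform(cleaned), columns=cleaned.columns)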

Our data is now ready for the PCA algorithm.

The PCA Algorithm#

Strictly speaking, the PCA algorithm involves the calculation of the covariance matrix for the data, followed by determining its eigenvalues and eigenvectors. The eigenvectors are the principal components, and the corresponding eigenvalues are the explained variances, which we use to sort the components. We can see this in action for our data below.

import numpy as np

# Covariance matrix of the scaled data (equal to the correlation matrix here)
cov = scaled.cov()
# np.linalg.eig returns a result object with .eigenvalues and .eigenvectors (NumPy >= 1.25)
eigenthings = np.linalg.eig(cov)
# Sort by decreasing eigenvalue so that PC1 explains the most variance
indices = np.argsort(eigenthings.eigenvalues)[::-1]
explained_variance = eigenthings.eigenvalues[indices]
# Each row of this array is one principal component (an eigenvector of the covariance matrix)
components = eigenthings.eigenvectors[:, indices].T
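As a brief illustrative aside, the fraction of the total variance captured by each component is its eigenvalue divided by the sum of all eigenvalues; for standardised data the eigenvalues sum to the number of features (8 here), so the first component accounts for roughly 84 % of the variance.

# Fraction of the total variance carried by each component
explained_variance / explained_variance.sum()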

Let’s look at the covariance matrix of our input data after the linear transformation defined by the PCA results. Here we also divide each component by the square root of its explained variance, so the transformed data should have unit variance along every principal component (this is sometimes called whitening).

# The division and matrix product evaluate left to right, so each component row
# is scaled by 1/sqrt(eigenvalue) before the data are projected onto it
transformed = components / np.sqrt(explained_variance[:, np.newaxis]) @ scaled.T
scaled_with_pc = scaled.copy()
for i in range(8):
    scaled_with_pc[f'PC{i+1}'] = transformed.loc[i]
# Covariance matrix of the transformed data
scaled_with_pc[[f'PC{i+1}' for i in range(cleaned.shape[1])]].cov()
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
PC1 1.000000e+00 -1.956715e-16 3.083953e-16 1.780611e-15 -6.206189e-16 -1.050245e-15 -6.264040e-15 -3.776034e-15
PC2 -1.956715e-16 1.000000e+00 -6.805965e-18 3.890034e-16 1.203805e-15 -7.848128e-16 -1.353324e-15 -3.546120e-15
PC3 3.083953e-16 -6.805965e-18 1.000000e+00 -7.537606e-16 -1.711062e-16 -1.592170e-15 2.189181e-15 -1.998401e-15
PC4 1.780611e-15 3.890034e-16 -7.537606e-16 1.000000e+00 2.435685e-15 -8.866896e-16 -1.724887e-15 -8.866045e-15
PC5 -6.206189e-16 1.203805e-15 -1.711062e-16 2.435685e-15 1.000000e+00 7.588651e-16 -1.279011e-14 5.405638e-15
PC6 -1.050245e-15 -7.848128e-16 -1.592170e-15 -8.866896e-16 7.588651e-16 1.000000e+00 8.338158e-15 -4.759922e-15
PC7 -6.264040e-15 -1.353324e-15 2.189181e-15 -1.724887e-15 -1.279011e-14 8.338158e-15 1.000000e+00 -4.338207e-14
PC8 -3.776034e-15 -3.546120e-15 -1.998401e-15 -8.866045e-15 5.405638e-15 -4.759922e-15 -4.338207e-14 1.000000e+00

We can see that, up to floating-point error, this is the identity matrix: the principal components are uncorrelated, and the scaling by the square root of the explained variance gives each of them unit variance, as we would expect.
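A quick numerical check (a minimal sketch; pc_cov is our own name) confirms this.

# The off-diagonal terms are zero to within machine precision
pc_cov = scaled_with_pc[[f'PC{i+1}' for i in range(8)]].cov()
np.allclose(pc_cov, np.eye(8))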

However, calculating the covariance matrix can be slow when there are a large number of data points. Therefore, a more numerically stable (and flexible) approach is to use singular value decomposition, or SVD (this is how scikit-learn does it).

The SVD approach does not require the computation of the covariance matrix. Instead, the SVD is performed directly on the data.
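To see why this works, write the scaled (mean-centred) data matrix as \(\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^\top\), where, taking the thin SVD, \(\mathbf{S}\) is the square diagonal matrix of singular values. The covariance matrix is then

\[
\mathbf{C} = \frac{1}{n-1}\mathbf{X}^\top\mathbf{X} = \frac{1}{n-1}\mathbf{V}\mathbf{S}\mathbf{U}^\top\mathbf{U}\mathbf{S}\mathbf{V}^\top = \mathbf{V}\left(\frac{\mathbf{S}^2}{n-1}\right)\mathbf{V}^\top,
\]

so the columns of \(\mathbf{V}\) are the eigenvectors of \(\mathbf{C}\), and the squared singular values divided by \(n-1\) are its eigenvalues.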

# The thin SVD, np.linalg.svd(scaled, full_matrices=False), would avoid
# constructing the large (and unused) square U matrix
U, S, Vt = np.linalg.svd(scaled)

The explained variance is then the square of the singular values, divided by the number of samples minus one (as it is a sample variance).

explained_variance = np.square(S) / (scaled.shape[0] - 1)
explained_variance
array([6.71243915e+00, 6.95612967e-01, 2.58443121e-01, 1.65989843e-01,
       8.49496637e-02, 6.34727669e-02, 1.26941638e-02, 6.39832445e-03])
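As a quick sanity check (purely illustrative), these values agree with the eigenvalues of the covariance matrix computed earlier.

# The SVD route reproduces the eigenvalues of the covariance matrix
np.allclose(explained_variance, np.sort(np.linalg.eigvalsh(cov))[::-1])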

The components are then the rows of \(\mathbf{V}^\top\) (equivalently, the columns of \(\mathbf{V}\)), which are already sorted in terms of decreasing singular value. Note that each component is only defined up to a sign, so individual components may differ by a factor of \(-1\) from those found via the eigendecomposition.

components = Vt

# Repeat the whitening transformation using the SVD-derived components
transformed = components / np.sqrt(explained_variance[:, np.newaxis]) @ scaled.T
for i in range(8):
    scaled_with_pc[f'PC{i+1}'] = transformed.loc[i]
scaled_with_pc[[f'PC{i+1}' for i in range(cleaned.shape[1])]].cov()
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
PC1 1.000000e+00 1.514327e-16 -2.992498e-16 -7.341935e-16 1.972454e-15 -2.233207e-17 -7.195606e-15 4.078049e-15
PC2 1.514327e-16 1.000000e+00 -5.955219e-17 -1.854625e-16 -1.361193e-17 8.199061e-17 -2.460782e-16 -1.067686e-16
PC3 -2.992498e-16 -5.955219e-17 1.000000e+00 -5.768055e-16 9.868649e-16 3.488057e-17 2.216192e-16 -3.692236e-16
PC4 -7.341935e-16 -1.854625e-16 -5.768055e-16 1.000000e+00 -4.236075e-15 -1.242089e-16 2.169401e-17 -1.171902e-15
PC5 1.972454e-15 -1.361193e-17 9.868649e-16 -4.236075e-15 1.000000e+00 -7.529099e-17 -1.804431e-15 -5.904175e-16
PC6 -2.233207e-17 8.199061e-17 3.488057e-17 -1.242089e-16 -7.529099e-17 1.000000e+00 2.006909e-15 8.507456e-17
PC7 -7.195606e-15 -2.460782e-16 2.216192e-16 2.169401e-17 -1.804431e-15 2.006909e-15 1.000000e+00 -5.742533e-18
PC8 4.078049e-15 -1.067686e-16 -3.692236e-16 -1.171902e-15 -5.904175e-16 8.507456e-17 -5.742533e-18 1.000000e+00
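Finally, as a sketch for comparison, scikit-learn's PCA wraps this SVD-based procedure; fitting it to the same scaled data should reproduce the explained variances computed above (with components possibly differing in sign).

from sklearn.decomposition import PCA

# scikit-learn's PCA uses an SVD internally; explained_variance_ is S**2 / (n - 1)
pca = PCA()
pca.fit(scaled)
pca.explained_variance_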