# How Does PCA Work?
## Data Cleaning and Scaling
Before we look at the algorithm, we will read in the abalone data discussed previously and clean it for processing.
import pandas as pd
data = pd.read_csv('./../data/abalone.csv')
data
| | Sex | Length/mm | Diameter/mm | Height/mm | Whole Weight/g | Shucked Weight/g | Viscera Weight/g | Shell Weight/g | Rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.5140 | 0.2245 | 0.1010 | 0.1500 | 15 |
| 1 | M | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.0700 | 7 |
| 2 | F | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.2100 | 9 |
| 3 | M | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.1550 | 10 |
| 4 | I | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.0550 | 7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4172 | F | 0.565 | 0.450 | 0.165 | 0.8870 | 0.3700 | 0.2390 | 0.2490 | 11 |
| 4173 | M | 0.590 | 0.440 | 0.135 | 0.9660 | 0.4390 | 0.2145 | 0.2605 | 10 |
| 4174 | M | 0.600 | 0.475 | 0.205 | 1.1760 | 0.5255 | 0.2875 | 0.3080 | 9 |
| 4175 | F | 0.625 | 0.485 | 0.150 | 1.0945 | 0.5310 | 0.2610 | 0.2960 | 10 |
| 4176 | M | 0.710 | 0.555 | 0.195 | 1.9485 | 0.9455 | 0.3765 | 0.4950 | 12 |
4177 rows × 9 columns
As mentioned, the `sex` data is categorical and, therefore, incompatible with the PCA algorithm in the form that we will discuss. So, we should remove it.
cleaned = data[data.columns[1:]]
The remaining features are continuous variables and can be used for the PCA algorithm.
Warning
Here, we highlight that the `rings` data is not strictly continuous, as it can only take integer values. We refer to data of this type as discrete. However, as it does not describe a class of the feature, like `sex`, it is compatible with PCA.
If we look at the variances for each feature, we will see that some features (`rings`) cover much larger ranges than others (`diameter`). Therefore, if we naïvely applied the PCA algorithm, we would risk the first principal component containing information almost exclusively about the `rings`, as these have by far the largest variance.
cleaned.var()
Length/mm 0.014422
Diameter/mm 0.009849
Height/mm 0.001750
Whole Weight/g 0.240481
Shucked Weight/g 0.049268
Viscera Weight/g 0.012015
Shell Weight/g 0.019377
Rings 10.395266
dtype: float64
Hence, we must scale our data. There are a few approaches to this, but here we will scale our data such that each feature has a mean of 0 and a variance of 1. In `scikit-learn` this is done with the `StandardScaler()`, but here we will do it manually to show how it is achieved.
scaled = (cleaned - cleaned.mean()) / cleaned.std()
scaled.var()
Length/mm 1.0
Diameter/mm 1.0
Height/mm 1.0
Whole Weight/g 1.0
Shucked Weight/g 1.0
Viscera Weight/g 1.0
Shell Weight/g 1.0
Rings 1.0
dtype: float64
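For reference, the same scaling can be performed with scikit-learn; a minimal sketch, assuming `scikit-learn` is installed (note that `StandardScaler` divides by the population standard deviation, `ddof=0`, so its output differs very slightly from the pandas version above):

```python
from sklearn.preprocessing import StandardScaler

# StandardScaler: zero mean, unit variance for each feature.
# Note: it uses the population standard deviation (ddof=0), whereas pandas'
# std() uses the sample standard deviation (ddof=1), so the two results
# differ by a factor of sqrt(n / (n - 1)).
scaler = StandardScaler()
scaled_sk = pd.DataFrame(scaler.fit_transform(cleaned), columns=cleaned.columns)
scaled_sk.var()
```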
Our data is now ready for the PCA algorithm.
## The PCA Algorithm
Strictly speaking, the PCA algorithm involves calculating the covariance matrix for the data and then determining its eigenvalues and eigenvectors. The eigenvectors are the principal components, and the corresponding eigenvalues are the explained variances, so the components can be sorted by decreasing eigenvalue. We can see this in action for our data below.
import numpy as np

# Covariance matrix of the scaled data
cov = scaled.cov()
# Eigendecomposition (recent versions of NumPy return a result object with
# eigenvalues and eigenvectors attributes)
eigenthings = np.linalg.eig(cov)
# Sort the components by decreasing eigenvalue (explained variance)
indices = np.argsort(eigenthings.eigenvalues)[::-1]
explained_variance = eigenthings.eigenvalues[indices]
components = eigenthings.eigenvectors[:, indices].T
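Before moving on, it can be useful to express each eigenvalue as a fraction of the total variance; a quick, optional check of how much variance each principal component explains:

```python
# Fraction of the total variance explained by each principal component
explained_variance_ratio = explained_variance / explained_variance.sum()
print(explained_variance_ratio)
```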
Let’s look at the covariance matrix of our input data after the linear transformation by the matrix that our PCA algorithm defines. Here, we also divide each component by the square root of its explained variance, so that every transformed feature ends up with unit variance.
# Project the scaled data onto the principal components, scaling each
# component by the square root of its explained variance
transformed = (components / np.sqrt(explained_variance[:, np.newaxis])) @ scaled.T

scaled_with_pc = scaled.copy()
for i in range(8):
    scaled_with_pc[f'PC{i+1}'] = transformed.loc[i]

scaled_with_pc[[f'PC{i+1}' for i in range(cleaned.shape[1])]].cov()
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
|---|---|---|---|---|---|---|---|---|
| PC1 | 1.000000e+00 | -1.956715e-16 | 3.083953e-16 | 1.780611e-15 | -6.206189e-16 | -1.050245e-15 | -6.264040e-15 | -3.776034e-15 |
| PC2 | -1.956715e-16 | 1.000000e+00 | -6.805965e-18 | 3.890034e-16 | 1.203805e-15 | -7.848128e-16 | -1.353324e-15 | -3.546120e-15 |
| PC3 | 3.083953e-16 | -6.805965e-18 | 1.000000e+00 | -7.537606e-16 | -1.711062e-16 | -1.592170e-15 | 2.189181e-15 | -1.998401e-15 |
| PC4 | 1.780611e-15 | 3.890034e-16 | -7.537606e-16 | 1.000000e+00 | 2.435685e-15 | -8.866896e-16 | -1.724887e-15 | -8.866045e-15 |
| PC5 | -6.206189e-16 | 1.203805e-15 | -1.711062e-16 | 2.435685e-15 | 1.000000e+00 | 7.588651e-16 | -1.279011e-14 | 5.405638e-15 |
| PC6 | -1.050245e-15 | -7.848128e-16 | -1.592170e-15 | -8.866896e-16 | 7.588651e-16 | 1.000000e+00 | 8.338158e-15 | -4.759922e-15 |
| PC7 | -6.264040e-15 | -1.353324e-15 | 2.189181e-15 | -1.724887e-15 | -1.279011e-14 | 8.338158e-15 | 1.000000e+00 | -4.338207e-14 |
| PC8 | -3.776034e-15 | -3.546120e-15 | -1.998401e-15 | -8.866045e-15 | 5.405638e-15 | -4.759922e-15 | -4.338207e-14 | 1.000000e+00 |
We can see that this is a diagonal matrix (to within floating-point error), as we would expect: the principal components are uncorrelated with one another.
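If we want a numerical confirmation rather than eyeballing the table, we can check that this matrix is the identity to within floating-point noise:

```python
# Confirm that the covariance matrix of the principal components is the
# identity matrix, up to floating-point error
pc_cov = scaled_with_pc[[f'PC{i+1}' for i in range(8)]].cov()
print(np.allclose(pc_cov, np.eye(8)))
```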
However, calculating the covariance matrix can be slow when there are a large number of data points. Therefore, a more stable (and flexible) approach is to use singular value decomposition, or SVD (this is how `scikit-learn` does it).
The SVD approach does not require the computation of the covariance matrix. Instead, the SVD is performed directly on the data.
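To see why this works, write the (already mean-centred) data matrix as \(\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^\top\). The covariance matrix is then

\[
\mathbf{C} = \frac{\mathbf{X}^\top \mathbf{X}}{n - 1}
           = \frac{\mathbf{V}\mathbf{S}^\top\mathbf{U}^\top\mathbf{U}\mathbf{S}\mathbf{V}^\top}{n - 1}
           = \mathbf{V}\,\frac{\mathbf{S}^\top\mathbf{S}}{n - 1}\,\mathbf{V}^\top,
\]

where \(\mathbf{S}^\top\mathbf{S}\) is diagonal with entries \(s_i^2\). So the columns of \(\mathbf{V}\) are the eigenvectors of the covariance matrix, and the eigenvalues are \(s_i^2 / (n - 1)\), where \(s_i\) are the singular values and \(n\) is the number of samples.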
U, S, Vt = np.linalg.svd(scaled)
The explained variances are then the squares of the singular values, divided by the number of samples minus one (as it is a sample variance).
explained_variance = np.square(S) / (scaled.shape[0] - 1)
explained_variance
array([6.71243915e+00, 6.95612967e-01, 2.58443121e-01, 1.65989843e-01,
8.49496637e-02, 6.34727669e-02, 1.26941638e-02, 6.39832445e-03])
The components are then the rows of \(\mathbf{V}^\top\) (equivalently, the columns of \(\mathbf{V}\)), which are already sorted by decreasing singular value.
components = Vt

# Apply the same whitening transformation as before, now using the SVD results
transformed = (components / np.sqrt(explained_variance[:, np.newaxis])) @ scaled.T
for i in range(8):
    scaled_with_pc[f'PC{i+1}'] = transformed.loc[i]

scaled_with_pc[[f'PC{i+1}' for i in range(cleaned.shape[1])]].cov()
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
|---|---|---|---|---|---|---|---|---|
| PC1 | 1.000000e+00 | 1.514327e-16 | -2.992498e-16 | -7.341935e-16 | 1.972454e-15 | -2.233207e-17 | -7.195606e-15 | 4.078049e-15 |
| PC2 | 1.514327e-16 | 1.000000e+00 | -5.955219e-17 | -1.854625e-16 | -1.361193e-17 | 8.199061e-17 | -2.460782e-16 | -1.067686e-16 |
| PC3 | -2.992498e-16 | -5.955219e-17 | 1.000000e+00 | -5.768055e-16 | 9.868649e-16 | 3.488057e-17 | 2.216192e-16 | -3.692236e-16 |
| PC4 | -7.341935e-16 | -1.854625e-16 | -5.768055e-16 | 1.000000e+00 | -4.236075e-15 | -1.242089e-16 | 2.169401e-17 | -1.171902e-15 |
| PC5 | 1.972454e-15 | -1.361193e-17 | 9.868649e-16 | -4.236075e-15 | 1.000000e+00 | -7.529099e-17 | -1.804431e-15 | -5.904175e-16 |
| PC6 | -2.233207e-17 | 8.199061e-17 | 3.488057e-17 | -1.242089e-16 | -7.529099e-17 | 1.000000e+00 | 2.006909e-15 | 8.507456e-17 |
| PC7 | -7.195606e-15 | -2.460782e-16 | 2.216192e-16 | 2.169401e-17 | -1.804431e-15 | 2.006909e-15 | 1.000000e+00 | -5.742533e-18 |
| PC8 | 4.078049e-15 | -1.067686e-16 | -3.692236e-16 | -1.171902e-15 | -5.904175e-16 | 8.507456e-17 | -5.742533e-18 | 1.000000e+00 |
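Once again, the covariance matrix of the transformed data is diagonal, so the SVD route agrees with the eigendecomposition route. As a final sanity check, we can compare against scikit-learn's own implementation; a minimal sketch (note that the sign of each component is arbitrary, so individual components may differ from ours by a factor of \(-1\)):

```python
from sklearn.decomposition import PCA

# Fit scikit-learn's PCA (which uses SVD internally) on the scaled data
pca = PCA(n_components=8)
pca.fit(scaled)

# These should match our explained_variance and components (up to sign)
print(pca.explained_variance_)
print(pca.components_)
```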