# Principal Component Analysis

The first of the dimensionality reduction algorithms that we will look at is principal component analysis (PCA). 
PCA benefits from being computationally efficient compared to other dimensionality reduction approaches. 
Additionally, its results can be interpreted, which connects to ideas of explainable machine learning that we will meet again later. 
However, PCA assumes that there are linear trends in the data that can be used to bring different features together. 
This means that it may be less effective where the trends are non-linear. 


## What Is the Aim of PCA?

The *aim* of PCA is to transform the data into a new coordinate system defined by the data's **principal components**. 
These new axes (the principal components) are ordered by how much of the variance in the original data they explain. 
A matrix built from these principal components, each divided by the square root of the variance it explains, transforms the data into a vector space where every variable has a standard deviation of 1 and the covariance between variables is zero.
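In matrix terms, this can be sketched as follows, assuming the data have been centred (the symbols $C$, $V$, $\Lambda$ and $W$ are introduced here for illustration only): $C$ is the covariance matrix of the data, the columns of $V$ are the principal components, and $\Lambda$ is a diagonal matrix holding the variance each one explains.

```{math}
C = V \Lambda V^\top, \qquad
W = \Lambda^{-1/2} V^\top, \qquad
\operatorname{cov}(W\mathbf{x}) = W C W^\top
= \Lambda^{-1/2} V^\top V \Lambda V^\top V \Lambda^{-1/2} = I
```

The last step uses $V^\top V = I$, which holds because the principal components are orthonormal.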
Let's see that in action for a two-dimensional example.
````{margin}
```{note}
The data file can be obtained [here](../data/pca-example.csv). 
```
````

In [None]:
import pandas as pd

data = pd.read_csv('./../data/pca-example.csv')
data

This data has two columns, *x* and *y*, and 200 datapoints. 
We can visualise this. 

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 4))
data.plot(kind='scatter', x='x', y='y', ax=ax)
ax.axis('equal')
plt.show()

If we pass this data, as is, to a PCA algorithm (we will use the implementation in `scikit-learn`), we should get a pair of principal component vectors. 

````{margin}
```{note}
If you take a look at the documentation for [`sklearn.decomposition.PCA`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), you will see that if `n_components` is not given, all components are kept, which here is the number of features present.
```
````

In [None]:
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(data)
pca.components_

Each row of this array is a principal axis. 
We then scale each axis by its *explained variance* (or rather, we divide it by the square root of the explained variance). 
````{margin}
```{note}
To ensure that the arrays are broadcast correctly, i.e., that each explained variance scales its own row of `components_`, we use `np.newaxis` to turn the variances into a column.
```
````
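
The layout of `components_` and the effect of `np.newaxis` can be checked on a small example. The sketch below is illustrative only: it uses hypothetical synthetic data generated with NumPy rather than the data file above, and names such as `demo_pca` and `demo_weighted` are introduced here so that they do not overwrite the objects used in the text.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical synthetic data: 200 correlated 2-D points (not the course data file).
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[3.0, 1.5], [1.5, 1.0]], size=200)

demo_pca = PCA().fit(X)

# components_ has shape (n_components, n_features): one principal axis per row.
print(demo_pca.components_.shape)

# The rows are orthonormal, so components_ @ components_.T is the identity.
gram = demo_pca.components_ @ demo_pca.components_.T
print(np.allclose(gram, np.eye(2)))

# explained_variance_ has shape (2,). Adding an axis gives shape (2, 1), so
# row i of components_ is divided by sqrt(explained_variance_[i]).
scale = np.sqrt(demo_pca.explained_variance_[:, np.newaxis])
demo_weighted = demo_pca.components_ / scale

# The same result, computed explicitly one row at a time.
row_by_row = np.array([
    axis / np.sqrt(var)
    for axis, var in zip(demo_pca.components_, demo_pca.explained_variance_)
])
print(np.allclose(demo_weighted, row_by_row))
```

Without `np.newaxis`, the one-dimensional array of variances would be broadcast along the last axis and would scale the columns instead of the rows.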

In [None]:
pca.explained_variance_

In [None]:
import numpy as np

weighted_components = pca.components_ / np.sqrt(pca.explained_variance_[:, np.newaxis])
weighted_components

The matrix above can then be used to perform a linear transformation on the original data. 
We will add the results of the linear transformation into the original `data` object as `x_prime` and `y_prime`. 

In [None]:
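# Apply the 2x2 transformation; transposing puts one feature in each row.
transformed = weighted_components @ data.T
# Rows 0 and 1 of the result hold the transformed x and y coordinates.
data['x_prime'] = transformed.loc[0]
data['y_prime'] = transformed.loc[1]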

We can see that the result of the transformation has a covariance matrix that approximates an identity matrix. 

In [None]:
data[['x_prime', 'y_prime']].cov()

Then, when we plot this, the data are distributed as expected. 

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(8, 4))

data.plot(kind='scatter', x='x', y='y', ax=ax[0])
data.plot(kind='scatter', x='x_prime', y='y_prime', ax=ax[1])
ax[0].axis('equal')
ax[0].set_aspect('equal')
ax[1].axis('equal')
ax[1].set_aspect('equal')
plt.show()
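
The manual scaling used in this section can be compared with the `whiten=True` option of `sklearn.decomposition.PCA`. The sketch below is a check under stated assumptions, not a replacement for the steps above: the data are hypothetical synthetic points, the names `manual_pca`, `W`, `X_manual` and `X_builtin` are introduced here, and the mean (`mean_`) is subtracted before the transformation. Subtracting the mean does not change the covariance; it only centres the transformed points on the origin.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical synthetic data with a non-zero mean (not the course data file).
rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[2.0, -1.0], cov=[[3.0, 1.5], [1.5, 1.0]], size=200)

# Manual whitening: divide each principal axis (row) by the square root
# of its explained variance, then apply the matrix to the centred data.
manual_pca = PCA().fit(X)
W = manual_pca.components_ / np.sqrt(manual_pca.explained_variance_[:, np.newaxis])
X_manual = (X - manual_pca.mean_) @ W.T

# Built-in whitening from scikit-learn.
X_builtin = PCA(whiten=True).fit_transform(X)

# Both routes give the same transformed coordinates.
print(np.allclose(X_manual, X_builtin))

# The covariance matrix of the whitened data is the identity (up to rounding).
print(np.cov(X_manual, rowvar=False).round(6))
```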

### What Are the Principal Components?

The principal components are vectors that describe new dimensions in the data: the first points along the direction of highest variance, and each subsequent one along the direction of highest remaining variance. 
Each principal component must be orthogonal (i.e., at a right angle) to every other. 
So, if we visualise the first and second principal components of the above data, we see that the first component follows the direction of the data along the positive diagonal in the plot. 
Meanwhile, the second principal component must be orthogonal to this, so it sits at a right angle to the first. 

````{margin}
```{note}
These vectors have not been scaled by, for example, the explained variance (the importance) of each principal component.
```
````

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for i, ax in enumerate(axes):
    data.plot(kind='scatter', x='x', y='y', ax=ax)
    ax.plot([-pca.components_[i, 0] * 3, pca.components_[i, 0] * 3], 
            [-pca.components_[i, 1] * 3, pca.components_[i, 1] * 3], color='k', lw=2)
    ax.axis('equal')
    ax.set_aspect('equal')
plt.show()
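
The two properties described in this section, orthogonality and maximal variance, can be checked numerically. The sketch below uses only NumPy and assumes the textbook route of an eigen-decomposition of the sample covariance matrix, which is mathematically equivalent to PCA but is not necessarily the solver that `scikit-learn` runs internally. The data are hypothetical synthetic points, and the names `axes` and `variances` are introduced here.

```python
import numpy as np

# Hypothetical synthetic data: 200 correlated 2-D points.
rng = np.random.default_rng(2)
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[3.0, 1.5], [1.5, 1.0]], size=200)

# Eigen-decomposition of the sample covariance matrix (symmetric, so use eigh).
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; sort them into descending order.
order = np.argsort(eigenvalues)[::-1]
variances = eigenvalues[order]
axes = eigenvectors[:, order].T  # one principal axis per row

# The two axes are orthogonal: their dot product is numerically zero.
print(np.isclose(axes[0] @ axes[1], 0.0))

# The variance of the centred data projected onto each axis equals its eigenvalue.
X_centred = X - X.mean(axis=0)
projected_var = (X_centred @ axes.T).var(axis=0, ddof=1)
print(np.allclose(projected_var, variances))

# No other unit direction in the plane captures more variance than the first axis.
angles = np.linspace(0.0, np.pi, 180, endpoint=False)
directions = np.column_stack([np.cos(angles), np.sin(angles)])
direction_var = (X_centred @ directions.T).var(axis=0, ddof=1)
print(direction_var.max() <= variances[0] + 1e-12)
```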

## Where Does PCA Go Wrong?

PCA assumes there are linear relationships present in the data. 
Consider the following moon-shaped data. 

In [None]:
data = pd.read_csv('./../data/moon.csv')

In [None]:
fig, ax = plt.subplots(figsize=(4, 4))
data.plot(kind='scatter', x='x', y='y', ax=ax)
ax.axis('equal')
plt.show()

There are some relationships in the data that we should aim to describe. 
However, if we attempt to use the data's principal components to describe it, we find that both are equally weighted. 

In [None]:
pca = PCA()
pca.fit(data)
pca.explained_variance_

This isn't surprising if we visualise the principal components plotted on top of the data. 

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for i, ax in enumerate(axes):
    data.plot(kind='scatter', x='x', y='y', ax=ax)
    ax.plot([-pca.components_[i, 0] * 3, pca.components_[i, 0] * 3], 
            [-pca.components_[i, 1] * 3, pca.components_[i, 1] * 3], color='k', lw=2)
    ax.axis('equal')
    ax.set_aspect('equal')
plt.show()

The components cannot describe the curve in the data; instead, they sit along $x=0$ and $y=0$.
This is because the principal components can only be linear, and the non-linear relationships present in the data cannot be accurately modelled. 
Therefore, it is generally a good idea to check the linearity of your data before PCA is applied. 
```{warning}
There is no hard rule for how linear data must be to use PCA successfully. 
However, it is essential to consider the model (for PCA, it is a linear one) that is being assumed. 
```
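
The same limitation can be reproduced with an even simpler shape. The sketch below is an illustration under stated assumptions: it does not use the moon data file, but hypothetical points placed evenly on a unit circle, and the names `circle` and `circle_pca` are introduced here. The points are fully described by a single quantity (the angle around the circle), yet that relationship is non-linear, so PCA finds no preferred direction.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 points evenly spaced on a unit circle.
t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
circle = np.column_stack([np.cos(t), np.sin(t)])

circle_pca = PCA().fit(circle)

# Each component explains half of the variance, so neither can be dropped,
# even though one parameter (the angle t) determines every point.
print(circle_pca.explained_variance_ratio_)
```

An even split of explained variance such as this one suggests that a linear model is not capturing the structure in the data.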