# Polynomial Regression

An extension of linear regression is polynomial regression. 
This is where the independent variables are modelled as *n*-degree polynomials. 
This captures non-linear behaviour, which a model that is linear in the original features cannot represent. 
However, this comes at the risk of overfitting when the polynomial degree *n* is too large. 
In polynomial regression, the matrix $\mathbf{X}$ is expanded to include polynomial terms among the independent variables (the features of the data). 
This can include cross-terms between different features. 
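
The overfitting risk can be illustrated with a small sketch on synthetic data. Everything here is an assumption made for illustration: the quadratic `true_curve`, the noise level, the sample sizes and the degrees compared are all made up, and NumPy's `Polynomial.fit` stands in for the scikit-learn workflow used in the rest of this section.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def true_curve(x):
    """Hypothetical underlying relationship: a quadratic."""
    return 1.0 + 2.0 * x - 3.0 * x**2


# A small noisy training set and a larger test set from the same curve.
x_train_demo = rng.uniform(-1, 1, size=20)
y_train_demo = true_curve(x_train_demo) + rng.normal(scale=0.3, size=20)
x_test_demo = rng.uniform(-1, 1, size=200)
y_test_demo = true_curve(x_test_demo) + rng.normal(scale=0.3, size=200)

train_mse, test_mse = {}, {}
for degree in (1, 2, 10):
    # Least-squares polynomial fit of the given degree.
    fitted = Polynomial.fit(x_train_demo, y_train_demo, deg=degree)
    train_mse[degree] = np.mean((fitted(x_train_demo) - y_train_demo) ** 2)
    test_mse[degree] = np.mean((fitted(x_test_demo) - y_test_demo) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse[degree]:.3f}, "
          f"test MSE {test_mse[degree]:.3f}")
```

The training error cannot increase as the degree rises, because each lower-degree model is a special case of the higher-degree one. The test error carries no such guarantee, which is why it is the quantity to watch when choosing *n*.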

## A Quadratic Example 
Let's consider an example where *n* is 2. 
Again, we will use the student performance dataset to see if the additional parameters can improve the mean-squared error. 

In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('../data/student-performance.csv')  
data['Encoded EA'] = [1 if x == 'Yes' else 0 for x in data['Extracurricular Activities']]
train, test = train_test_split(data, test_size=0.2, random_state=42)
X = train.drop(['Performance Index', 'Extracurricular Activities'], axis=1)

The mathematics of polynomial regression is the same matrix equation discussed for linear regression; however, now there are many more parameters, $\beta$, being estimated. 
The student performance data has five features that we are training against. 

In [None]:
X.shape

However, if we add all the cross- and self-polynomial terms, there will be many more "features" to train against. 
If there are features $a$ and $b$ in linear regression, we would have features $a$, $b$, $a^2$, $ab$ and $b^2$, so you can see how the number of features can balloon. 
There is a `sklearn.preprocessing` transformer, `PolynomialFeatures`, to perform this extension of the matrix $\mathbf{X}$; we set `include_bias` to `False` as the scikit-learn implementation of `LinearRegression` handles this for us. 

In [None]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_ = poly.fit_transform(X)
y = train['Performance Index']
X_.shape
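
To see which columns the expansion produces, the sketch below applies the same transformer to a toy matrix with two features. The names `a` and `b` follow the text above; the three rows of numbers and the helper `n_poly_features` are made up for illustration.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two illustrative features, a and b (three made-up rows).
X_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])

poly_demo = PolynomialFeatures(degree=2, include_bias=False)
X_demo_poly = poly_demo.fit_transform(X_demo)

# Column names of the expanded matrix, using our own labels for the inputs.
feature_names = list(poly_demo.get_feature_names_out(["a", "b"]))
print(feature_names)   # ['a', 'b', 'a^2', 'a b', 'b^2']
print(X_demo_poly[0])  # first row: a=1, b=2 gives 1, 2, 1, 2, 4


def n_poly_features(n_features, degree):
    """Number of polynomial terms up to `degree`, excluding the bias column."""
    return comb(n_features + degree, degree) - 1


print(n_poly_features(2, 2))  # 5
print(n_poly_features(5, 2))  # 20
```

The count grows combinatorially: five input features at degree 2 already give 20 columns, which should match the second entry of `X_.shape` above.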

This *new* feature set can be passed to the standard `LinearRegression` estimator, as the matrix equations are the same. 

In [None]:
from sklearn.linear_model import LinearRegression

linear_regression = LinearRegression()
linear_regression.fit(X_, y)
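
As a rough check that nothing new is happening mathematically, the sketch below solves the same problem by hand on synthetic data. It assumes that the matrix equation referred to above is the ordinary least-squares problem; the data, the coefficients used to generate it and the variable names are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic data with two features and one cross-term in the target.
X_syn = rng.normal(size=(200, 2))
y_syn = (1.0 + 2.0 * X_syn[:, 0] - X_syn[:, 1]
         + 0.5 * X_syn[:, 0] * X_syn[:, 1]
         + rng.normal(scale=0.1, size=200))

# Expand to degree-2 features and fit with scikit-learn.
X_syn_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_syn)
model = LinearRegression().fit(X_syn_poly, y_syn)

# The same fit by hand: prepend a column of ones for the intercept
# and solve the least-squares problem directly.
design = np.column_stack([np.ones(len(X_syn_poly)), X_syn_poly])
beta, *_ = np.linalg.lstsq(design, y_syn, rcond=None)

print(np.allclose(beta[0], model.intercept_))
print(np.allclose(beta[1:], model.coef_))
```

Both routes should agree to numerical precision: the only difference from plain linear regression is that the design matrix has more columns.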

We can then check the model against the test dataset and compute the mean-squared error. 

In [None]:
from sklearn.metrics import mean_squared_error

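# Keep the test features as a DataFrame so the column names match those seen when `poly` was fitted
X_test = test.drop(['Performance Index', 'Extracurricular Activities'], axis=1)
# Reuse the transformer fitted on the training data; do not refit it on the test set
X_test_ = poly.transform(X_test)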
y_test = test['Performance Index'].values
mean_squared_error(y_test , linear_regression.predict(X_test_))
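
One way to keep the expansion and the regression in step between training and test data is to chain them in a scikit-learn pipeline. The sketch below is an alternative arrangement rather than the approach used in this section, and it runs on synthetic data that is deliberately non-linear (unlike the student performance data), so the gap between the two models is exaggerated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)

# Made-up data: the target depends on a squared term and a cross-term.
X_syn = rng.uniform(-2, 2, size=(300, 3))
y_syn = (X_syn[:, 0] ** 2 + X_syn[:, 1] * X_syn[:, 2]
         + rng.normal(scale=0.1, size=300))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# The pipeline fits the expansion on the training data and only
# transforms the test data when predicting.
quadratic_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
quadratic_model.fit(X_tr, y_tr)

linear_model = LinearRegression().fit(X_tr, y_tr)

mse_quadratic = mean_squared_error(y_te, quadratic_model.predict(X_te))
mse_linear = mean_squared_error(y_te, linear_model.predict(X_te))
print(f"linear MSE: {mse_linear:.3f}, quadratic MSE: {mse_quadratic:.3f}")
```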

There is an improvement in the MSE; however, it is not substantial. 
This indicates that the linear model was probably sufficient to describe the trends. 
We can see this by investigating the coefficients $\beta$. 

In [None]:
linear_regression.coef_

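We should observe that the zeroth to fourth values in the `coef_` array are larger in magnitude than the other terms. 
These larger values are associated with the linear terms, while the smaller ones belong to the polynomial terms, which suggests that the polynomial terms contribute little to modelling the data. 
Coefficient magnitudes also depend on the scale of each feature, so this comparison is only a rough guide. 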
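
Reading a bare array of coefficients is easier when each value is paired with its term. The sketch below does this on synthetic data whose target is linear by construction, so the squared and cross terms should come out close to zero. The data, the feature names `a` and `b` and the generating coefficients are assumptions for illustration; both features are drawn on the same scale, which keeps the magnitudes comparable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Made-up data: the target is purely linear in a and b.
X_lin = rng.normal(size=(500, 2))
y_lin = 3.0 * X_lin[:, 0] - 2.0 * X_lin[:, 1] + rng.normal(scale=0.1, size=500)

poly_lin = PolynomialFeatures(degree=2, include_bias=False)
X_lin_poly = poly_lin.fit_transform(X_lin)
model_lin = LinearRegression().fit(X_lin_poly, y_lin)

# Pair every coefficient with the term it multiplies.
coef_by_name = dict(zip(poly_lin.get_feature_names_out(["a", "b"]),
                        model_lin.coef_))
for name, value in coef_by_name.items():
    print(f"{name:>4}: {value: .3f}")
```

The same pairing could be applied to the fitted `poly` and `linear_regression` objects above to label the entries of `linear_regression.coef_`.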