Polynomial Regression#
Polynomial regression is an extension of linear regression in which the dependent variable is modelled as a degree-\(n\) polynomial of the independent variables. This captures non-linear behaviour that a purely linear model cannot, but it comes at the risk of overfitting when the polynomial degree \(n\) is too large. In polynomial regression, the matrix \(\mathbf{X}\) is expanded to include polynomial terms built from the independent variables (the features of the data), including cross-terms between different features.
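For example, with two features \(a\) and \(b\), a degree-2 polynomial regression fits a model of the form

$$y = \beta_0 + \beta_1 a + \beta_2 b + \beta_3 a^2 + \beta_4 ab + \beta_5 b^2,$$

which is still linear in the parameters \(\beta\), so it can be estimated with exactly the same matrix machinery as linear regression.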
A Quadratic Example#
Let’s consider an example where the degree \(n\) is 2. Again, we will use the student performance dataset to see if the additional parameters can improve the mean-squared error.
import pandas as pd
from sklearn.model_selection import train_test_split
data = pd.read_csv('../data/student-performance.csv')
data['Encoded EA'] = [1 if x == 'Yes' else 0 for x in data['Extracurricular Activities']]
train, test = train_test_split(data, test_size=0.2, random_state=42)
X = train.drop(['Performance Index', 'Extracurricular Activities'], axis=1)
The mathematics of polynomial regression is the same matrix equation discussed for linear regression; however, now there are many more parameters, \(\beta\), being estimated. The student performance data has five features that we are training against.
X.shape
(8000, 5)
However, if we add all the cross- and self-polynomial terms, there will be many more “features” to train against.
If a linear regression has features \(a\) and \(b\), the degree-2 expansion would have features \(a\), \(b\), \(a^2\), \(ab\) and \(b^2\); with more original features, or a higher degree, you can see how the number of features can balloon, as the short sketch below makes explicit.
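As a rough check on how quickly this grows, here is a small sketch that simply counts the degree-2 terms produced from \(d\) original features (with no constant column); the helper name `n_quadratic_features` is just illustrative.

```python
from math import comb

def n_quadratic_features(d):
    """Number of degree-2 polynomial features built from d original features (no bias column)."""
    return d + d + comb(d, 2)  # linear terms + squared terms + cross-terms

print(n_quadratic_features(2))  # a, b, a^2, b^2, ab -> 5
print(n_quadratic_features(5))  # the five student-performance features -> 20
```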
There is a `sklearn.preprocessing` class, `PolynomialFeatures`, that performs this extension of the matrix \(\mathbf{X}\); we set `include_bias` to `False`, as the scikit-learn implementation of `LinearRegression` handles the intercept term for us.
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_ = poly.fit_transform(X)
y = train['Performance Index']
X_.shape
(8000, 20)
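If you are curious which terms were generated, the fitted `PolynomialFeatures` transformer can, in recent versions of scikit-learn, report the name of each expanded column; a quick sketch, passing the original column names:

```python
poly.get_feature_names_out(X.columns)
```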
This new feature set can be put into the standard `LinearRegression` class, as the matrix equations are the same.
from sklearn.linear_model import LinearRegression
linear_regression = LinearRegression()
linear_regression.fit(X_, y)
LinearRegression()
We can then check the model against the test dataset and compute the mean-squared error.
from sklearn.metrics import mean_squared_error
X_test = test.drop(['Performance Index', 'Extracurricular Activities'], axis=1).values
X_test_ = poly.transform(X_test)
y_test = test['Performance Index'].values
mean_squared_error(y_test, linear_regression.predict(X_test_))
4.0806431061515545
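For a like-for-like comparison, a minimal sketch that refits a plain linear model on the un-expanded features from above and computes its test MSE in the same way:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Baseline: fit on the original five features, without polynomial expansion
baseline = LinearRegression()
baseline.fit(X.values, y)  # .values so the train and test matrices match (X_test was built with .values)

# Compare this value against the polynomial model's MSE above
print(mean_squared_error(y_test, baseline.predict(X_test)))
```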
There is an improvement in the MSE; however, it is not substantial. This indicates that the linear model was probably sufficient to describe the trends. We can see this by investigating the coefficients \(\beta\).
linear_regression.coef_
array([ 2.78106210e+00, 1.02971732e+00, 2.40177829e-01, 1.84813326e-01,
2.57332248e-01, 2.85528073e-03, 2.73505146e-05, 3.47997954e-03,
2.54364914e-04, 3.51859945e-02, -1.05936488e-04, 5.44727734e-04,
-2.88919698e-04, -7.76943024e-04, 1.21638891e-02, 5.15124216e-03,
-1.53319234e-03, -6.85617338e-04, -3.32331961e-03, 2.57332248e-01])
We should observe that the zeroth to fourth values in the `coef_` array are more significant than the other terms. These larger values are associated with the linear terms, while the smaller values belong to the polynomial terms, which suggests the polynomial terms are not important for modelling the data.
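To make that inspection easier, here is a sketch that pairs each coefficient with the expanded feature it multiplies (reusing the `poly` transformer and fitted model from above; `coef_by_feature` is just an illustrative name):

```python
import pandas as pd

# Label each fitted coefficient with the name of the expanded feature it multiplies
coef_by_feature = pd.Series(linear_regression.coef_,
                            index=poly.get_feature_names_out(X.columns))

# Sorting by absolute magnitude puts the dominant (linear) terms first
coef_by_feature.reindex(coef_by_feature.abs().sort_values(ascending=False).index)
```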