Fit Nonlinear Data with a Linear Model!

Fitting nonlinear data with a linear model is a technique called Polynomial Regression. The intuition is that adding powers of the input as extra features gives the model more degrees of freedom to fit the data, while the model itself stays linear in its coefficients.

First, we generate the data (note that y is a quadratic function of X):

import numpy as np

m = 100  # number of samples
X = 9 * np.random.rand(m, 1) - 7  # uniform values in [-7, 2)
y = X**2 + 3*X + 5 + np.random.randn(m, 1)  # quadratic plus Gaussian noise

The linear regression model (without Polynomial features):

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X, y)

Adding polynomial features (X, X**2):

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

The first five samples from X:

>>> X[:5]
array([[-0.63502308],
       [-6.87887923],
       [-4.63090189],
       [ 0.23522634],
       [-5.11050991]])

The first five samples from X_poly:

>>> X_poly[:5]
array([[-0.63502308,  0.40325431],
       [-6.87887923, 47.31897949],
       [-4.63090189, 21.4452523 ],
       [ 0.23522634,  0.05533143],
       [-5.11050991, 26.11731159]])

The linear regression model (with Polynomial features):

reg.fit(X_poly, y)
reg.intercept_, reg.coef_  # approx. 4.84 and [3.04, 1.01]

The fitted intercept and coefficients (4.84, 3.04, 1.01) are almost identical to the true values used to generate y (5, 3, 1).
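
To see how much the extra feature helps, one option is to compare the R^2 score of both fits on the same data. The sketch below regenerates the data with a seeded generator (the seed 0 is an arbitrary assumption, chosen only for repeatability), so its numbers will differ slightly from the ones quoted above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
m = 100
X = 9 * rng.random((m, 1)) - 7
y = X**2 + 3 * X + 5 + rng.standard_normal((m, 1))

# Straight-line fit on the raw feature
linear_r2 = LinearRegression().fit(X, y).score(X, y)

# Same estimator, fitted on the expanded features [X, X**2]
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
poly_reg = LinearRegression().fit(X_poly, y)
poly_r2 = poly_reg.score(X_poly, y)

print(f"R^2 without polynomial features: {linear_r2:.3f}")
print(f"R^2 with polynomial features:    {poly_r2:.3f}")
print(poly_reg.intercept_, poly_reg.coef_)
```

Both scores are computed on the training data, so they show goodness of fit only, not how well either model generalizes.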

This trick has many applications in machine learning (such as Support Vector Machines). However, high-degree polynomial features can cause overfitting. One solution is to use grid search to pick the best degree for the polynomial feature function.
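
As a rough illustration of that idea, the sketch below wraps the two steps in a scikit-learn Pipeline and lets GridSearchCV choose the degree by cross-validation. It is not necessarily what the linked notebook does: the candidate degrees, the 5-fold split, the scoring metric, and the seed are all assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
m = 100
X = 9 * rng.random((m, 1)) - 7
y = (X**2 + 3 * X + 5 + rng.standard_normal((m, 1))).ravel()

# Feature expansion and regression chained into a single estimator
pipeline = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("reg", LinearRegression()),
])

# Candidate degrees (an assumption for this sketch)
param_grid = {"poly__degree": [1, 2, 3, 4, 5, 6]}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,  # 5-fold cross-validation
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_degree = search.best_params_["poly__degree"]
print("Best degree:", best_degree)
print("Cross-validated MSE:", -search.best_score_)
```

Because each candidate is scored on held-out folds, a degree that only memorizes the training noise should score worse than one that matches the underlying curve.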

Bonus: The grid search implementation is in the link below:

https://github.com/booletic/medium/blob/main/poly.ipynb