Overfitting is a common problem in (not only) machine learning where a model fits the training data so closely (sometimes perfectly) that its performance on new/test data suffers. It is often said that the model is “learning” the noise, although that is just a hand-wavy way of saying it fits the training data too closely. In this tutorial, we'll explore the concept of overfitting using polynomial (linear) regression and provide a mathematical explanation of why choosing a higher-order polynomial decreases the error on the training set.
Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial. While polynomial regression can fit a wide range of curvatures, it is prone to overfitting, especially with higher-degree polynomials.
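The "(linear)" in "polynomial (linear) regression" deserves a quick aside: a degree-n polynomial fit is still linear regression, just on the derived features x^n, ..., x, 1. As a minimal sketch of that equivalence (not part of the script below; the toy x and y values are made up purely for illustration), the coefficients np.polyfit would return can also be obtained with an explicit Vandermonde design matrix and ordinary least squares:
import numpy as np
# Toy data, roughly quadratic in x (made-up values, for illustration only)
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = np.array([1.1, 1.9, 4.0, 7.2, 12.8])
degree = 2
# Design (Vandermonde) matrix with columns [x^2, x^1, x^0],
# matching np.polyfit's highest-power-first coefficient order
X = np.vander(x, degree + 1)
# Ordinary least squares: minimize ||X @ coeffs - y||^2
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)                    # ~[a, b, c] for y ≈ a*x^2 + b*x + c
print(np.polyfit(x, y, degree))  # essentially the same answer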
Let's demonstrate this with an example and delve into the mathematical reasoning behind it.
# %% Overfitting Example
# This script demonstrates the concept of overfitting using polynomial regression.
# It generates a quadratic dataset with added noise and fits linear, quadratic, and cubic polynomials to a small subset of the points (the training set).
# The fitted curves are then evaluated on the entire dataset to illustrate overfitting.
import numpy as np
import matplotlib.pyplot as plt
# %% Generate Dataset
# Define the quadratic function coefficients
a = 10
b = 2
c = 1
# Generate random x values and calculate the corresponding y values with added noise
x = np.sort(np.random.rand(100))
y = a * x**2 + b * x + c + 0.2 * np.random.randn(x.shape[0])
# Plot the generated dataset
plt.figure()
plt.plot(x, y, 'bo')
plt.grid(True)
plt.show()
# %% Select Training Points
# Select a small subset of the data (every 25th point, 4 points in total) to use as the training set
x_train = x[::25]
y_train = y[::25]
# %% Fit Polynomials
# Fit linear, quadratic, and cubic polynomials to the training points
p1 = np.polyfit(x_train, y_train, 1)
p2 = np.polyfit(x_train, y_train, 2)
p3 = np.polyfit(x_train, y_train, 3)
# %% Generate Plotting Points
# Generate evenly spaced x values for plotting the fitted curves
x_plot = np.linspace(x.min(), x.max(), 100)
# Evaluate the fitted curves at the x_plot values
y_plot_p1 = np.polyval(p1, x_plot)
y_plot_p2 = np.polyval(p2, x_plot)
y_plot_p3 = np.polyval(p3, x_plot)
# %% Plot Fitted Curves and Training Points
# Plot the data points, fitted curves, and training points
plt.figure()
plt.plot(x, y, 'bo', markersize=6, label='Data Points')
plt.plot(x_plot, y_plot_p1, 'r-', linewidth=2, label='Linear Fit')
plt.plot(x_plot, y_plot_p2, 'g-', linewidth=2, label='Quadratic Fit')
plt.plot(x_plot, y_plot_p3, 'k-', linewidth=2, label='Cubic Fit')
plt.plot(x_train, y_train, 'ro', markersize=8, label='Training Points')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Overfitting Example')
plt.grid(True)
plt.legend()
plt.show()
# %% Calculate Mean Squared Error (MSE)
# Calculate the MSE for each fitted curve on the training points
mse_p1_train = np.mean((y_train - np.polyval(p1, x_train))**2)
mse_p2_train = np.mean((y_train - np.polyval(p2, x_train))**2)
mse_p3_train = np.mean((y_train - np.polyval(p3, x_train))**2)
# Calculate the MSE for each fitted curve on the entire dataset, which serves as the test set here
mse_p1_test = np.mean((y - np.polyval(p1, x))**2)
mse_p2_test = np.mean((y - np.polyval(p2, x))**2)
mse_p3_test = np.mean((y - np.polyval(p3, x))**2)
# %% Display MSE Values
# Display the MSE values for the training points and the test set
print('Mean Squared Error (MSE) on Training Points:')
print(f'Linear Fit: {mse_p1_train:.4f}')
print(f'Quadratic Fit: {mse_p2_train:.4f}')
print(f'Cubic Fit: {mse_p3_train:.4f}')
print('Mean Squared Error (MSE) on Test Set:')
print(f'Linear Fit: {mse_p1_test:.4f}')
print(f'Quadratic Fit: {mse_p2_test:.4f}')
print(f'Cubic Fit: {mse_p3_test:.4f}')
We fit linear, quadratic, and cubic polynomials to the training points using numpy.polyfit(). This function returns the coefficients of the fitted polynomials. We then generate evenly spaced x values with numpy.linspace() and evaluate the fitted curves at those values using numpy.polyval(). Finally, we plot the data points, the fitted curves, and the training points with matplotlib. This allows us to visually observe how well each polynomial fits the data.
Figure: Overfitting example. The red circles are the 4 training data points, which are perfectly described (fit) by the cubic polynomial.
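As a side note (not part of the script above): numpy's documentation recommends the newer numpy.polynomial API over np.polyfit/np.polyval for new code. A minimal sketch of the equivalent fit-and-evaluate step for the cubic, reusing x_train, y_train, and x_plot from the script, might look like this:
from numpy.polynomial import Polynomial
# Fit a cubic to the same 4 training points; Polynomial.fit scales x internally
# for numerical stability, and convert() maps the coefficients back to the plain basis
p3_alt = Polynomial.fit(x_train, y_train, deg=3)
y_plot_p3_alt = p3_alt(x_plot)    # evaluate by calling the object
print(p3_alt.convert().coef)      # coefficients, lowest power first
Running the original script prints the following MSE values: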
Mean Squared Error (MSE) on Training Points:
Linear Fit: 0.3749
Quadratic Fit: 0.0102
Cubic Fit: 0.0000
Mean Squared Error (MSE) on Test Set:
Linear Fit: 0.8831
Quadratic Fit: 0.0384
Cubic Fit: 0.1675
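These numbers capture the core of the argument. The training MSE can only decrease as the polynomial degree grows, because the models are nested: a degree-d polynomial can always reproduce the best degree-(d-1) fit by setting its leading coefficient to zero, so the best degree-d fit is never worse on the training points. Once the number of coefficients (d+1) reaches the number of training points, the polynomial interpolates them exactly and the training MSE drops to (numerically) zero, which is exactly what happens for the cubic on our 4 training points. The test MSE, by contrast, is lowest for the quadratic (the true model) and rises again for the cubic. As a small extra experiment (not part of the script above; it reuses x and y from the script but with a larger, hypothetical training subset), the same trend can be made visible across a range of degrees:
# %% Training vs. Test MSE as the Degree Grows (extra experiment)
# The training MSE is non-increasing in the degree (nested models), while the
# MSE on the full dataset typically starts to rise once the fit chases the noise.
x_train_big = x[::10]   # 10 training points
y_train_big = y[::10]
for deg in range(1, 7):
    p = np.polyfit(x_train_big, y_train_big, deg)
    mse_train = np.mean((y_train_big - np.polyval(p, x_train_big))**2)
    mse_all = np.mean((y - np.polyval(p, x))**2)
    print(f'Degree {deg}: train MSE = {mse_train:.4f}, full-data MSE = {mse_all:.4f}')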