Overfitting is a common problem in (not only) machine learning where a model fits the training data so closely (sometimes perfectly) that its performance on new/test data suffers. It is often said that the model is “learning” the noise, although that is just a hand-wavy way of saying it fits the training data too closely. In this tutorial, we'll explore the concept of overfitting using polynomial (linear) regression and provide a mathematical explanation of why choosing a higher-order polynomial decreases the error on the training set.
Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial. While polynomial regression can fit a wide range of curvatures, it is prone to overfitting, especially with higher-degree polynomials.
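The "(linear)" in "polynomial (linear) regression" deserves a quick aside: a degree-n polynomial fit is still linear regression, just on the derived features x^n, ..., x, 1. As a minimal sketch of that equivalence (not part of the script below; the toy x and y values are made up purely for illustration), the coefficients np.polyfit would return can also be obtained with an explicit Vandermonde design matrix and ordinary least squares:
import numpy as np
# Toy data, roughly quadratic in x (made-up values, for illustration only)
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = np.array([1.1, 1.9, 4.0, 7.2, 12.8])
degree = 2
# Design (Vandermonde) matrix with columns [x^2, x^1, x^0],
# matching np.polyfit's highest-power-first coefficient order
X = np.vander(x, degree + 1)
# Ordinary least squares: minimize ||X @ coeffs - y||^2
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)                    # ~[a, b, c] for y ≈ a*x^2 + b*x + c
print(np.polyfit(x, y, degree))  # essentially the same answer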
Let's demonstrate this with an example and delve into the mathematical reasoning behind it.
# %% Overfitting Example
# This script demonstrates the concept of overfitting using polynomial regression.
# It generates a quadratic dataset with added noise and fits linear, quadratic, and cubic polynomials to a small subset of the points (the training set).
# The fitted curves are then evaluated on the entire dataset to illustrate overfitting.
import numpy as np
import matplotlib.pyplot as plt
# %% Generate Dataset
# Define the quadratic function coefficients
a = 10
b = 2
c = 1
# Generate random x values and calculate the corresponding y values with added noise
x = np.sort(np.random.rand(100))
y = a * x**2 + b * x + c + 0.2 * np.random.randn(x.shape[0])
# Plot the generated dataset
plt.figure()
plt.plot(x, y, 'bo')
plt.grid(True)
plt.show()
# %% Select Training Points
# Select a small subset of the data (every 25th point, 4 points in total) to use as the training set
x_train = x[::25]
y_train = y[::25]
# %% Fit Polynomials
# Fit linear, quadratic, and cubic polynomials to the training points
p1 = np.polyfit(x_train, y_train, 1)
p2 = np.polyfit(x_train, y_train, 2)
p3 = np.polyfit(x_train, y_train, 3)
# %% Generate Plotting Points
# Generate evenly spaced x values for plotting the fitted curves
x_plot = np.linspace(x.min(), x.max(), 100)
# Evaluate the fitted curves at the x_plot values
y_plot_p1 = np.polyval(p1, x_plot)
y_plot_p2 = np.polyval(p2, x_plot)
y_plot_p3 = np.polyval(p3, x_plot)
# %% Plot Fitted Curves and Training Points
# Plot the data points, fitted curves, and training points
plt.figure()
plt.plot(x, y, 'bo', markersize=6, label='Data Points')
plt.plot(x_plot, y_plot_p1, 'r-', linewidth=2, label='Linear Fit')
plt.plot(x_plot, y_plot_p2, 'g-', linewidth=2, label='Quadratic Fit')
plt.plot(x_plot, y_plot_p3, 'k-', linewidth=2, label='Cubic Fit')
plt.plot(x_train, y_train, 'ro', markersize=8, label='Training Points')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Overfitting Example')
plt.grid(True)
plt.legend()
plt.show()
# %% Calculate Mean Squared Error (MSE)
# Calculate the MSE for each fitted curve on the training points
mse_p1_train = np.mean((y_train - np.polyval(p1, x_train))**2)
mse_p2_train = np.mean((y_train - np.polyval(p2, x_train))**2)
mse_p3_train = np.mean((y_train - np.polyval(p3, x_train))**2)
# Calculate the MSE for each fitted curve on the entire dataset, which serves as the test set here
mse_p1_test = np.mean((y - np.polyval(p1, x))**2)
mse_p2_test = np.mean((y - np.polyval(p2, x))**2)
mse_p3_test = np.mean((y - np.polyval(p3, x))**2)
# %% Display MSE Values
# Display the MSE values for the training points and the test set
print('Mean Squared Error (MSE) on Training Points:')
print(f'Linear Fit: {mse_p1_train:.4f}')
print(f'Quadratic Fit: {mse_p2_train:.4f}')
print(f'Cubic Fit: {mse_p3_train:.4f}')
print('Mean Squared Error (MSE) on Test Set:')
print(f'Linear Fit: {mse_p1_test:.4f}')
print(f'Quadratic Fit: {mse_p2_test:.4f}')
print(f'Cubic Fit: {mse_p3_test:.4f}')
We fit linear, quadratic, and cubic polynomials to the training points using numpy.polyfit(). This function returns the coefficients of the fitted polynomials. We then generate evenly spaced x values with numpy.linspace() and evaluate the fitted curves at those values using numpy.polyval(). Finally, we plot the data points, the fitted curves, and the training points with matplotlib. This allows us to visually observe how well each polynomial fits the data.
Figure: Overfitting example. The red circles are the 4 training data points, which are perfectly described (fit) by the cubic polynomial.
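As a side note (not part of the script above): numpy's documentation recommends the newer numpy.polynomial API over np.polyfit/np.polyval for new code. A minimal sketch of the equivalent fit-and-evaluate step for the cubic, reusing x_train, y_train, and x_plot from the script, might look like this:
from numpy.polynomial import Polynomial
# Fit a cubic to the same 4 training points; Polynomial.fit scales x internally
# for numerical stability, and convert() maps the coefficients back to the plain basis
p3_alt = Polynomial.fit(x_train, y_train, deg=3)
y_plot_p3_alt = p3_alt(x_plot)    # evaluate by calling the object
print(p3_alt.convert().coef)      # coefficients, lowest power first
Running the original script prints the following MSE values: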
Mean Squared Error (MSE) on Training Points:
Linear Fit: 0.3749
Quadratic Fit: 0.0102
Cubic Fit: 0.0000
Mean Squared Error (MSE) on Test Set:
Linear Fit: 0.8831
Quadratic Fit: 0.0384
Cubic Fit: 0.1675
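These numbers capture the core of the argument. The training MSE can only decrease as the polynomial degree grows, because the models are nested: a degree-d polynomial can always reproduce the best degree-(d-1) fit by setting its leading coefficient to zero, so the best degree-d fit is never worse on the training points. Once the number of coefficients (d+1) reaches the number of training points, the polynomial interpolates them exactly and the training MSE drops to (numerically) zero, which is exactly what happens for the cubic on our 4 training points. The test MSE, by contrast, is lowest for the quadratic (the true model) and rises again for the cubic. As a small extra experiment (not part of the script above; it reuses x and y from the script but with a larger, hypothetical training subset), the same trend can be made visible across a range of degrees:
# %% Training vs. Test MSE as the Degree Grows (extra experiment)
# The training MSE is non-increasing in the degree (nested models), while the
# MSE on the full dataset typically starts to rise once the fit chases the noise.
x_train_big = x[::10]   # 10 training points
y_train_big = y[::10]
for deg in range(1, 7):
    p = np.polyfit(x_train_big, y_train_big, deg)
    mse_train = np.mean((y_train_big - np.polyval(p, x_train_big))**2)
    mse_all = np.mean((y - np.polyval(p, x))**2)
    print(f'Degree {deg}: train MSE = {mse_train:.4f}, full-data MSE = {mse_all:.4f}')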