6. Understanding Overfitting and Underfitting in Machine Learning

In the world of machine learning, building an effective model means finding a delicate balance. One of the most common pitfalls developers encounter is a problem known as overfitting, which can seriously harm a model's performance.

In this article, we'll explore what overfitting is, how it contrasts with underfitting, and how to recognize when your model might be suffering from either issue.

Underfitting: Too Simple to Be Useful

When a model is too simple to capture the underlying pattern in the data, it's said to underfit. The technical term for this is high bias. The model's assumptions are too rigid—such as assuming a strictly linear relationship when the real world is more complex.

It doesn't perform well on the training set and, as expected, does poorly on new data too.

To understand this better, let's consider a more relatable example: predicting student test scores based on the number of hours they study. Suppose we plot study time (input feature x) against test scores (target value y).

A simple approach is to use linear regression to fit a straight line to this data. However, in real life, the relationship isn't perfectly linear—students may hit a point of diminishing returns where additional study hours don't significantly improve their scores.

If our model assumes a strictly linear relationship, it may fail to capture this plateauing effect, resulting in a poor fit. The model is too simplistic to capture the actual trend, and as a result, it underfits the data.

Here's the graph showing how a linear regression model (w₁·x + b) underfits the actual data trend when predicting student test scores based on hours studied.



The red dashed line represents the linear model, which fails to capture the plateau effect seen in the actual data.

Now, what if we fit a slightly more complex model, like a quadratic function (w₁·x + w₂·x² + b)? This type of model includes not only x (hours studied) but also x² as a feature.

Suddenly, the curve starts to capture the plateauing effect of test scores as study time increases. While this model doesn't perfectly match every data point, it does a much better job of following the overall trend.

Most importantly, it's more likely to generalize well to new students whose data wasn't part of the training set.



It fits the trend more accurately than the linear model, showing less underfitting.
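To make this concrete, here is a minimal Python sketch of the comparison, assuming NumPy is available. The plateau-shaped "hours studied vs. test score" numbers are made up for illustration; they are not from a real dataset.

import numpy as np

# Hypothetical, illustrative data: scores rise quickly at first, then plateau.
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
scores = np.array([52, 60, 67, 73, 78, 82, 85, 87, 88, 89], dtype=float)

# Degree-1 fit (w1*x + b): prone to underfitting this plateau-shaped trend.
linear = np.polynomial.Polynomial.fit(hours, scores, deg=1)
# Degree-2 fit (w1*x + w2*x^2 + b): can bend to follow the plateau.
quadratic = np.polynomial.Polynomial.fit(hours, scores, deg=2)

for name, model in [("linear", linear), ("quadratic", quadratic)]:
    mse = np.mean((scores - model(hours)) ** 2)
    print(f"{name:>9} fit, training MSE: {mse:.2f}")
# The quadratic model tracks the flattening trend far more closely, which is
# what "less underfitting" means here.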

What Is Overfitting?

Overfitting occurs when a machine learning model fits the training data too well—so well, in fact, that it captures noise and random fluctuations rather than the underlying patterns.

This leads to poor performance on new, unseen data because the model essentially "memorized" the training data rather than learning generalizable insights.

If we use a high-degree polynomial (e.g., 20th-order) or an interpolation method that passes through all data points exactly, it results in a visibly erratic and wiggly curve.

At first glance, this might seem perfect: the training error is zero! Look closer, though, and the curve's predictions stop making sense.



The model predicts a lower score at 8.5 hours of study than at 8 hours. This contradicts expectations and real-world intuition.

This is a classic case of overfitting—the model has high variance. It's too sensitive to the specific data points in the training set. Even small changes in the data could result in drastically different curves.

If two engineers trained this model on slightly different student data, they could end up with models that make very different predictions. That's not robustness—it's unreliability.
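The sketch below illustrates that sensitivity under an assumed, synthetic plateau-shaped trend: the same high-degree polynomial is fit to two slightly different noisy samples of the data, and the two fits are compared at the same input.

import numpy as np

hours = np.linspace(1, 10, 12)
true_trend = 90 - 45 * np.exp(-0.4 * hours)   # assumed plateau-shaped trend

def fit_high_degree(noise_seed, degree=9):
    """Fit a high-degree polynomial to one noisy sample of the same data."""
    noise = np.random.default_rng(noise_seed).normal(0, 2.0, size=hours.size)
    return np.polynomial.Polynomial.fit(hours, true_trend + noise, deg=degree)

model_a = fit_high_degree(noise_seed=1)
model_b = fit_high_degree(noise_seed=2)

x_query = 8.5
print(f"model A predicts {model_a(x_query):.1f} at {x_query} hours")
print(f"model B predicts {model_b(x_query):.1f} at {x_query} hours")
# Trained on nearly identical data, the two overfit curves can still disagree
# noticeably at the same input: that instability is what "high variance" means.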

Just Right

Suppose we use a fourth-order polynomial with features like x, x², x³, and x⁴. This model doesn't perfectly match every training example—but it captures the overall trend well. More importantly, it generalizes effectively to new, unseen data.



This represents the "Just Right" model in machine learning—not too simple (which would underfit), and not too complex (which would overfit), but balanced and well-suited to the data.

In machine learning, we often use the terms high bias and high variance to describe underfitting and overfitting, respectively. The goal is to minimize both bias and variance, aiming for a model that captures the true pattern of the data and generalizes well.
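One way to see this trade-off numerically is to hold out some data and compare training and validation error across model complexities. The sketch below does exactly that on synthetic data (the generating curve and the noise level are assumptions made for illustration), comparing degree-1, degree-4, and degree-15 polynomial fits.

import numpy as np

rng = np.random.default_rng(42)
hours = rng.uniform(1, 10, size=40)
scores = 90 - 45 * np.exp(-0.4 * hours) + rng.normal(0, 2.0, size=hours.size)

# Hold out some examples to estimate how well each model generalizes.
train_x, val_x = hours[:30], hours[30:]
train_y, val_y = scores[:30], scores[30:]

for degree in (1, 4, 15):
    model = np.polynomial.Polynomial.fit(train_x, train_y, deg=degree)
    train_mse = np.mean((train_y - model(train_x)) ** 2)
    val_mse = np.mean((val_y - model(val_x)) ** 2)
    print(f"degree {degree:>2}: train MSE {train_mse:6.2f}, validation MSE {val_mse:6.2f}")
# Typically the degree-1 model is poor everywhere (high bias), the degree-15
# model has low training error but worse validation error (high variance),
# and the degree-4 model is the balanced, "just right" choice.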

A Classification Example

Let's consider a classification task where you're predicting whether a financial transaction is fraudulent or not, using features like transaction amount, time of day, and user location.

A simple logistic regression model might draw a straight-line boundary to separate fraudulent and non-fraudulent transactions. While it performs reasonably, it misses more complex patterns that can indicate fraud, such as the combination of a large amount and an unusual time. This is underfitting (high bias).

By adding more features or interactions, like transaction amount and time together, the model becomes more flexible and can draw a more complex boundary, maybe a curve. It performs better and generalizes well, even if it doesn't perfectly classify all transactions.

But if you use a very complex model with too many features or a high-degree polynomial, the decision boundary becomes overly complicated, fitting each individual transaction exactly. The model may perform perfectly on the training data but fail to detect fraud in new, unseen transactions. This is overfitting (high variance).
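Here is a rough sketch of this classification example, assuming scikit-learn is available. The "transactions" are synthetic and purely illustrative: fraud is made likelier when a large amount coincides with an unusual hour, a pattern a straight-line boundary struggles to capture but interaction features can.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
n = 4000
amount = rng.exponential(scale=100.0, size=n)   # transaction amount
hour = rng.uniform(0, 24, size=n)               # time of day
# Fraud mostly when a large amount coincides with an unusual (night) hour,
# plus a little label noise.
y = (amount > 100) & (hour < 6)
y = np.where(rng.random(n) < 0.02, ~y, y)
X = np.column_stack([amount, hour])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Straight-line boundary in (amount, hour): tends to underfit this pattern.
linear_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# Quadratic and interaction features let the boundary curve around the pattern.
flexible_clf = make_pipeline(StandardScaler(),
                             PolynomialFeatures(degree=2, include_bias=False),
                             LogisticRegression(max_iter=1000))

for name, clf in [("linear features", linear_clf), ("interaction features", flexible_clf)]:
    clf.fit(X_train, y_train)
    print(f"{name:>20}: test accuracy {clf.score(X_test, y_test):.3f}")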



How to Address Overfitting?

There are three strategies to reduce overfitting:

1) Collect More Training Data: The most effective tool to combat overfitting is increasing the amount of training data. If you collect scores from more students, especially those who studied for more varied hours, your model will generalize better.

2) Use Fewer Features: Suppose you were using not just hours of study but also unrelated features like favorite color or seating position. Removing these irrelevant features helps the model focus on meaningful patterns and reduces overfitting.

3) Apply Regularization: Even if you keep a polynomial model, regularization penalizes large coefficients. This makes the curve smoother and less extreme, improving generalization. It's like saying: "Use all the features, but don't rely on any one of them too much." A quick sketch of this idea follows below.
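Here, assuming scikit-learn is available, the same degree-12 polynomial model is fit to synthetic study-time data twice, once without a penalty and once with an L2 (ridge) penalty, so you can see the effect of regularization on generalization. The data-generating curve is an assumption for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1)
hours = rng.uniform(1, 10, size=30)
scores = 90 - 45 * np.exp(-0.4 * hours) + rng.normal(0, 2.0, size=hours.size)
X = hours.reshape(-1, 1)

# Evaluate against the (assumed) noise-free trend on fresh inputs.
val_hours = rng.uniform(1, 10, size=200)
val_trend = 90 - 45 * np.exp(-0.4 * val_hours)
X_val = val_hours.reshape(-1, 1)

def degree_12_model(regressor):
    return make_pipeline(PolynomialFeatures(degree=12), StandardScaler(), regressor)

for name, reg in [("no regularization", LinearRegression()),
                  ("L2 (ridge) regularization", Ridge(alpha=1.0))]:
    model = degree_12_model(reg).fit(X, scores)
    val_mse = np.mean((model.predict(X_val) - val_trend) ** 2)
    print(f"{name:>25}: validation MSE {val_mse:.2f}")
# Both models use the same degree-12 features; the penalty on large
# coefficients usually keeps the regularized curve much closer to the trend.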

Apply Regularization

In the previous discussion, we explored how regularization helps reduce overfitting by encouraging the parameter values (such as w₁, w₂, ..., wₙ) to remain small. In this section, we'll build on that intuition and formalize how regularization modifies the cost function in a learning algorithm to achieve that goal.

Why Regularization?

Recall the example from earlier: you fit a quadratic function (w₁·x + w₂·x² + b) to student data (hours of study vs. score) and it performed well. But fitting a higher-order polynomial, such as a 20th-degree curve, resulted in overfitting.

That curve clung tightly to the training points but failed to generalize to new data. Now, imagine you could force some of the less useful weights, say w₃ and w₄, to be very small, nearly zero. What would happen?

You'd essentially eliminate the contribution of the cubic and quartic terms, resulting in a simpler model, much like a quadratic. This smoother model would better capture the underlying trend without overreacting to each data point.

Regularized Cost Function

To formalize this idea, let’s revisit the standard cost function used in linear regression:

J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right)^2

To apply regularization, we add a penalty for large parameter values:

J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2

λ is the regularization parameter (you choose this),
m is the number of training examples,
n is the number of features,
w_j are the model weights (excluding the bias b).

Why λ/2m? Dividing by 2m keeps the regularization term on the same scale as the data-fit term, and it makes the choice of λ less sensitive to the number of training examples.

Why not penalize b? Regularizing the single bias term makes a negligible difference in practice, so it is conventionally left out.

This new cost function now balances two goals:

1) Minimize prediction error: Fit the training data well.
2) Minimize model complexity: Keep the weights small to reduce overfitting.

The value of λ controls this trade-off:

λ = 0: Regularization is off; the model may overfit.
Large λ: The weights shrink toward zero, and the model underfits (e.g., a nearly flat line).
Moderate λ: Encourages simpler models that generalize better.
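Putting this into code, below is a minimal NumPy sketch of the regularized cost function defined above, for linear regression. The variable names (X, y, w, b, lam) and the toy numbers are illustrative choices, not fixed conventions.

import numpy as np

def regularized_cost(X, y, w, b, lam):
    """Squared-error cost plus an L2 penalty on the weights (b is not penalized)."""
    m = X.shape[0]
    predictions = X @ w + b                        # f(x^(i)) for every example
    squared_error = np.sum((predictions - y) ** 2) / (2 * m)
    l2_penalty = lam * np.sum(w ** 2) / (2 * m)    # (lambda / 2m) * sum_j w_j^2
    return squared_error + l2_penalty

# Tiny usage example with made-up numbers:
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 5.0]])
y = np.array([3.0, 5.0, 8.0])
w = np.array([0.5, 1.0])
b = 0.1
print(regularized_cost(X, y, w, b, lam=0.0))   # data-fit term only
print(regularized_cost(X, y, w, b, lam=10.0))  # penalty pushes the cost up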

Gradient Descent in Regularized Linear Regression

In this section, we'll dive into how gradient descent can be adapted to work with regularized linear regression. Regularization is a powerful technique used to reduce overfitting, especially when dealing with models that have many features.

Since we've already learned how standard linear regression works with gradient descent, we are now just one step away from mastering its regularized form.

Recall the standard cost function for linear regression:

J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right)^2

To prevent overfitting, we add a regularization term:

J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2

Here, λ is the regularization parameter controlling the penalty for large weights. Only the weights w_j are regularized, not the bias term b.

Gradient descent updates the parameters w and b using the derivatives of the cost function, and we repeat these updates until convergence:

repeat until convergence {

w_j := w_j - \alpha \, \frac{\partial}{\partial w_j} J(w, b)
b := b - \alpha \, \frac{\partial}{\partial b} J(w, b)

}

Expanding the derivatives of the regularized cost function, the updates become:

repeat until convergence {

w_j := w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j \right]

b := b - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right)

}

The update for b is the same as in unregularized linear regression, because b is not penalized.
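The loop below is a from-scratch NumPy sketch of these regularized updates. The learning rate, the value of λ, and the synthetic data are illustrative assumptions.

import numpy as np

def gradient_descent_regularized(X, y, alpha=0.05, lam=0.1, iterations=5000):
    """Batch gradient descent for L2-regularized linear regression."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(iterations):
        error = X @ w + b - y                          # f(x^(i)) - y^(i)
        grad_w = (X.T @ error) / m + (lam / m) * w     # regularized gradient for w
        grad_b = np.sum(error) / m                     # gradient for b (no penalty)
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

# Usage on small synthetic data with a known linear relationship:
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(50, 2))
y = X @ np.array([2.0, -1.0]) + 3.0 + rng.normal(0, 0.1, size=50)
w, b = gradient_descent_regularized(X, y)
print("learned w:", np.round(w, 2), "learned b:", round(b, 2))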

Regularized Logistic Regression

As you've seen in previous articles, logistic regression works by applying a sigmoid function to a linear combination of features.

However, just like linear regression, logistic regression is also prone to overfitting—especially when you're working with high-degree polynomial features or a large number of features in general.

For example, when fitting a high-order polynomial, the decision boundary can become overly complex, tightly hugging the training data. While this may result in low training error, it typically leads to poor generalization to new, unseen data.

To combat overfitting, we introduce regularization into the logistic regression cost function. Here’s the standard cost function for logistic regression:

J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left( f(x^{(i)}) \right) + \left( 1 - y^{(i)} \right) \log\left( 1 - f(x^{(i)}) \right) \right]

To regularize it, we simply add a penalty term to discourage large weights:

J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left( f(x^{(i)}) \right) + \left( 1 - y^{(i)} \right) \log\left( 1 - f(x^{(i)}) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2

The beauty of this approach is that the gradient descent update rules remain almost identical to those for regularized linear regression.

Without Regularization:
The weight and bias updates are:

repeat until convergence {

w_j := w_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
b := b - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right)

}

With Regularization:
The weight update gains an additional term:

w_j := w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j \right]

The update for b remains unchanged, because we do not regularize the bias term.
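To close the loop, here is a NumPy sketch that mirrors these equations: the regularized cross-entropy cost and a single gradient step, applied to a tiny synthetic dataset. All names and values are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_cost(X, y, w, b, lam):
    """Cross-entropy cost with an L2 penalty on the weights (not on b)."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)
    eps = 1e-12                                    # guards against log(0)
    cross_entropy = -np.mean(y * np.log(f + eps) + (1 - y) * np.log(1 - f + eps))
    return cross_entropy + lam * np.sum(w ** 2) / (2 * m)

def gradient_step(X, y, w, b, alpha, lam):
    """One update of w and b; only w gets the extra (lambda / m) * w_j term."""
    m = X.shape[0]
    error = sigmoid(X @ w + b) - y
    w_new = w - alpha * ((X.T @ error) / m + (lam / m) * w)
    b_new = b - alpha * np.sum(error) / m
    return w_new, b_new

# Tiny usage example on synthetic, linearly separable data:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w, b = np.zeros(2), 0.0
for _ in range(2000):
    w, b = gradient_step(X, y, w, b, alpha=0.1, lam=0.5)
print("regularized cost after training:", round(regularized_logistic_cost(X, y, w, b, lam=0.5), 3))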