5. Supervised learning: Cost function and Gradient descent for Logistic Regression

In this article, the main focus is on understanding how to choose a suitable cost function for logistic regression. The cost function plays a crucial role in measuring how well a given set of parameters—typically denoted as w and b—fits the training data.

By evaluating how well these parameters perform across the training set, the cost function gives us a method to iteratively improve them, generally using an optimization algorithm like gradient descent.

Why Squared Error Doesn't Work for Logistic Regression

Squared error (used in linear regression) produces a non-convex cost function when used with logistic regression. A non-convex function can have many local minima, which makes gradient descent unreliable.

In linear regression, the model is:

f(x) = w·x + b

And the squared error cost function is:

J(w, b) = (1/(2m)) * ∑_{i=1}^{m} (f(x^(i)) − y^(i))^2

This function is convex, meaning it has a nice bowl shape. So, gradient descent works well—it gradually steps toward the global minimum, no matter where it starts.

In logistic regression, the prediction is:

f(x) = ŷ = 1 / (1 + e^(−(w·x + b)))

This function is non-linear and always gives a value between 0 and 1 (interpreted as a probability of fraud in our case).
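To make this concrete, here is a minimal sketch of the sigmoid in Python (using NumPy; the function name and the example value are just for illustration):

```python
import numpy as np

def sigmoid(z):
    """Squash any real-valued score into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

# A linear score z = w·x + b of 2.0 maps to a probability of about 0.88.
print(sigmoid(2.0))
```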

If we plug this nonlinear f(x) into the same squared error formula:

J(w, b) = (1/(2m)) * ∑_{i=1}^{m} (1 / (1 + e^(−(w·x^(i) + b))) − y^(i))^2

...then the resulting cost function is no longer convex.



When using squared error as a cost function in logistic regression, the shape of the cost surface becomes non-convex—this means that instead of a smooth bowl shape with a single lowest point (the global minimum), the surface has multiple dips or valleys (local minima).

Gradient descent works by moving in the direction that decreases the cost function the most. If the surface is non-convex, gradient descent can:

- Get stuck in a local minimum (a dip that's not the lowest possible point),
- Fail to reach the best parameters that result in optimal predictions.

Cost Function for Logistic Regression

Let: ๐‘“(๐‘ฅ(๐‘–)) = prediction for the ๐‘–๐‘กโ„Ž training example (i.e., probability it is fraudulent)
๐‘ฆ(๐‘–) = true label (1 if fraud, 0 otherwise)
๐‘š = number of training examples

Logistic Loss Function for a Single Example:

If ๐‘ฆ=1: Loss = −log⁡(๐‘“(๐‘ฅ))
If ๐‘ฆ=0: Loss = −log⁡(1−๐‘“(๐‘ฅ))

Loss is a number that tells you how wrong your model's prediction is for a single training example. Loss is for one example. Cost is the average loss over all examples. The goal of training is to minimize loss, and therefore minimize the cost.
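As a rough sketch (in Python with NumPy; the function name logistic_loss is our own), the two cases can be written directly:

```python
import numpy as np

def logistic_loss(f, y):
    """Loss for one example: -log(f) when y == 1, -log(1 - f) when y == 0."""
    if y == 1:
        return -np.log(f)
    return -np.log(1.0 - f)

# Confident and correct -> small loss; confident and wrong -> large loss.
print(logistic_loss(0.9, 1))  # ~0.105
print(logistic_loss(0.9, 0))  # ~2.303
```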

Here's the graph showing both log(x) (in blue) and -log(x) (in red):


The loss function commonly used for logistic regression is binary cross-entropy (or log loss), which measures how far the predicted probability is from the actual class label (0 or 1).

Here's the graph of the loss function −log(f) for y = 1 (true label 1) in logistic regression:


As you can see, the loss decreases as the predicted probability f gets closer to 1, and increases sharply as f approaches 0.

The curve shows that the model's loss is minimal when it is confident and correct (predicting probabilities close to 1), and grows significantly when the model is far from the correct prediction.

Here's the graph of the loss function −log(1 − f) when y = 0:


As the predicted probability f → 0, the loss goes to 0 (a perfect prediction). As f → 1, the loss increases sharply: the model is confidently wrong.

This curve reflects how logistic regression penalizes incorrect high-confidence predictions when the true label is 0.

The simplified loss function for a single training example is:

Loss (๐‘“(w,b)(x(i)),๐‘ฆ(i)) = − [ ๐‘ฆ(i) ⋅ log ⁡ (๐‘“(w,b)(x(i))) - ( 1 − ๐‘ฆ(i) ) ⋅ log ⁡ (1 − ๐‘“(w,b)(x(i))) ]

Why This Works:

When ๐‘ฆ = 1:
Loss = − log(๐‘“) (Just like before!)

When ๐‘ฆ = 0:
Loss = − log (1−๐‘“) (Also same as before!)

So this unified expression handles both cases in one line — super useful for coding and implementing things like gradient descent!
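For instance, a one-line version in Python might look like the sketch below (the helper name is our own; it assumes f stays strictly between 0 and 1 so both logs are defined):

```python
import numpy as np

def logistic_loss_unified(f, y):
    """Single expression that covers both y = 1 and y = 0."""
    return -(y * np.log(f) + (1 - y) * np.log(1 - f))

# Matches the piecewise definition:
print(logistic_loss_unified(0.9, 1))  # ~0.105, same as -log(0.9)
print(logistic_loss_unified(0.9, 0))  # ~2.303, same as -log(0.1)
```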

Now, let's say you have m training examples. The cost function (average loss across all examples) is:

J(w, b) = (1/m) * ∑_{i=1}^{m} L(f_{w,b}(x^(i)), y^(i))

or

J(w, b) = −(1/m) * ∑_{i=1}^{m} [ y^(i) · log(f_{w,b}(x^(i))) + (1 − y^(i)) · log(1 − f_{w,b}(x^(i))) ]
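A possible vectorized implementation of this cost (a sketch in Python/NumPy, with X as an m×n feature matrix and y as an array of 0/1 labels; the names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(X, y, w, b):
    """Average binary cross-entropy over all m training examples."""
    f = sigmoid(X @ w + b)                               # predicted probabilities
    loss = -(y * np.log(f) + (1 - y) * np.log(1 - f))    # per-example loss
    return loss.mean()                                   # J(w, b)
```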

Gradient Descent for Logistic Regression

In this section, we'll dive into how to implement logistic regression by optimizing its parameters using gradient descent.

To fit a logistic regression model, we aim to find values for the parameters w (weights) and b (bias) that minimize the cost function J(w, b). This cost function quantifies how well the model's predictions align with the actual labels in the training data.

To minimize this cost function, we apply gradient descent, an optimization algorithm that iteratively updates the model parameters in the direction that reduces the cost.

The logistic regression model:

f_{w,b}(x) = ŷ = 1 / (1 + e^(−(w·x + b)))

Once the model has been trained, we can use it to assess new data—for example, a new transaction that includes features like the transaction amount, location, time, and device used. The model can then predict whether the transaction is fraudulent or legitimate by estimating the probability that the label y=1 (fraud).
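As an illustration (all numbers, feature meanings, and the 0.5 threshold below are made up), scoring a new transaction with already-trained parameters could look like this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical trained parameters and one new, already-scaled transaction
# with features: amount, location score, time of day, device score.
w = np.array([0.8, -0.3, 1.2, 0.5])
b = -0.7
x_new = np.array([0.9, 0.1, 0.4, 0.6])

p_fraud = sigmoid(np.dot(w, x_new) + b)   # estimated P(y = 1 | x)
print(f"P(fraud) = {p_fraud:.3f}")
print("fraud" if p_fraud >= 0.5 else "legitimate")
```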

Here's how gradient descent works in this context. We update the parameters w and b using the following rule:

repeat{ ๐‘ค๐‘— := ๐‘ค๐‘— − ๐›ผ ​ ∂๐ฝ/∂๐‘ค๐‘— ​ ๐ฝ (w,b)
๐‘ := ๐‘ − ๐›ผ ​ ∂๐ฝ/∂๐‘ ​ ๐ฝ (w,b) }

By applying the rules of calculus and working through the logistic regression cost function, the derivatives ∂J(w, b)/∂w_j and ∂J(w, b)/∂b evaluate to:

∂๐ฝ/∂๐‘ค๐‘— ​ ๐ฝ (w,b) = (1/m) * ∑i=1m ​ (๐‘“(w,b)(x(i)) - ๐‘ฆ(i)) ​ xj(i)
∂๐ฝ/∂๐‘ ​ ๐ฝ (w,b) = (1/m) * ∑i=1m ​ (๐‘“(w,b)(x(i)) - ๐‘ฆ(i))

As a quick reminder: when updating the parameters w_j and b, we don't update them one at a time while computing the gradients. Instead, we:
- First compute all the necessary gradient values (i.e., the right-hand side of the update rules),
- Then simultaneously update all the parameters using those values.

Let's plug the gradients we derived earlier into the gradient descent update rules. The update formulas become:

repeat {
  w_j := w_j − α · [ (1/m) * ∑_{i=1}^{m} (f_{w,b}(x^(i)) − y^(i)) · x_j^(i) ]
  b := b − α · [ (1/m) * ∑_{i=1}^{m} (f_{w,b}(x^(i)) − y^(i)) ]
}
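Putting the update rules together, a bare-bones training loop might look like the following sketch (the learning rate, iteration count, and zero initialization are arbitrary choices for illustration):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Fit w and b by repeatedly applying the update rules above."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(num_iters):
        f = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        error = f - y
        dj_dw = (X.T @ error) / m
        dj_db = error.mean()
        # Compute both gradients first, then update w and b simultaneously.
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b
```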

These updates form the core of gradient descent for logistic regression.

These equations look really similar to the ones we used for linear regression. Are logistic regression and linear regression actually the same?

The answer is no. While the gradient descent update equations look similar, the fundamental difference lies in the function f(x):
- In linear regression, f(x) = w·x + b
- In logistic regression, f(x) = 1 / (1 + e^(−(w·x + b)))


The sigmoid function squashes the output to lie between 0 and 1, turning it into a probability estimate—perfect for classification tasks like fraud detection. That change makes a huge difference in the behavior and purpose of the model.

So while the mechanics of gradient descent might feel familiar, logistic regression is fundamentally different from linear regression.

In linear regression, we talked about how to monitor gradient descent to ensure it converges—i.e., that the cost function is decreasing with each iteration. The same principle applies here. You can:

- Track the value of the cost function J(w, b) over iterations
- Plot a learning curve
- Stop training once the cost function stabilizes or drops below a desired threshold

This helps confirm that your model is learning correctly.
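One way to monitor convergence (a sketch only; the tolerance value and helper names are arbitrary) is to record J(w, b) every iteration and stop once it stabilizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(X, y, w, b):
    f = sigmoid(X @ w + b)
    return -(y * np.log(f) + (1 - y) * np.log(1 - f)).mean()

def train_with_monitoring(X, y, alpha=0.1, max_iters=10000, tol=1e-7):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    history = []                              # learning curve: J per iteration
    for _ in range(max_iters):
        f = sigmoid(X @ w + b)
        error = f - y
        w = w - alpha * (X.T @ error) / m
        b = b - alpha * error.mean()
        history.append(compute_cost(X, y, w, b))
        if len(history) > 1 and abs(history[-2] - history[-1]) < tol:
            break                             # cost has stabilized
    return w, b, history                      # plot history to see the curve
```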

Another trick we discussed during linear regression is feature scaling—and yes, it's just as valuable here. Feature scaling ensures that all features (e.g., transaction amount, transaction time) lie within a similar range, typically between -1 and 1.

This helps gradient descent converge faster and more reliably, as the optimization landscape becomes smoother and easier to navigate.
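For example, mean normalization brings each feature into roughly the [-1, 1] range (the raw values below are invented for illustration):

```python
import numpy as np

# Hypothetical raw features: transaction amount (dollars) and hour of day.
X = np.array([[1200.0, 23.0],
              [  15.0,  9.0],
              [ 560.0, 14.0]])

# Mean normalization: subtract the column mean, divide by the column range.
X_scaled = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)
```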
