Gradient Descent

**Gradient Descent** is one of the most common algorithms used to optimize machine learning models. It is typically used to adjust model parameters to minimize the value of a loss function (or cost function), so that the model fits the training data better.
 

Introduction to Gradient Descent

Gradient descent is an optimization algorithm used to minimize a cost function during the training process of machine learning models.
Gradient Descent is a crucial component in the training process of many machine learning models, regardless of whether they are supervised or unsupervised. It is used to adjust the model parameters iteratively, reducing the difference between predicted and actual outcomes or optimizing some other objective function.
The algorithm operates by calculating the gradient of the cost function, which indicates the direction and magnitude of steepest ascent. However, since the objective is to minimize the cost function, gradient descent moves in the opposite direction of the gradient, known as the negative gradient direction.
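This idea can be sketched numerically with a toy one-dimensional function (not tied to any particular model): the gradient points uphill, so stepping against it decreases the cost.

```python
# Toy function f(x) = x^2; its gradient at x is 2x.
def f(x):
    return x ** 2

def grad_f(x):
    return 2 * x

x = 3.0
step = 0.1
x_new = x - step * grad_f(x)  # step in the negative gradient direction

print(f(x), f(x_new))  # the cost decreases: 9.0 -> 5.76
```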

Cost Function

The cost function represents the discrepancy between the predicted output of the model and the actual output. The goal of gradient descent is to find the set of parameters that minimizes this discrepancy and improves the model’s performance.
After making a hypothesis with initial parameters, we calculate the cost function. Then, with the goal of reducing it, we modify the parameters by applying the gradient descent algorithm to the given data.
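As a concrete sketch, here is one common choice of cost function, the mean squared error, computed for a hypothetical set of predictions (the numbers are illustrative only):

```python
import numpy as np

# Mean squared error: the average squared gap between predictions and targets.
def mse(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

y_true = np.array([1.0, 2.0, 3.0])   # actual outputs
y_pred = np.array([1.5, 2.0, 2.5])   # outputs from the initial hypothesis

print(mse(y_pred, y_true))  # (0.25 + 0.0 + 0.25) / 3 ≈ 0.1667
```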

How Does Gradient Descent Work?

In machine learning, gradient descent is used to minimize a "cost function" (which measures how well, or how poorly, our model is performing with its current parameters). Let's understand how it works:
  1. Start with an Initial Guess: Begin with some initial values for the model's parameters (like weights in a neural network).
  2. Evaluate the Cost Function: Calculate how well the model is doing with the current parameters. This is done by evaluating the cost function, which tells us the error or loss.
  3. Compute the Gradient: Compute the gradient, which tells us the direction in which the cost function increases the most. Essentially, it tells us which direction to move our parameters to reduce the cost the fastest.
  4. Update the Parameters: Move the parameters a small step in the opposite direction of the gradient, just like taking a step in the direction that slopes downward most steeply. The step size is controlled by the "learning rate."
  5. Repeat: Keep computing the gradient and updating the parameters until the cost function is minimized (or until it stops changing significantly).
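The five steps above can be sketched in code for a single parameter w and a toy cost function (chosen purely for illustration):

```python
# Toy example: minimize cost(w) = (w - 4)^2, whose minimum is at w = 4.
def cost(w):
    return (w - 4) ** 2

def gradient(w):
    return 2 * (w - 4)

w = 0.0                      # 1. start with an initial guess
learning_rate = 0.1
for _ in range(100):         # 5. repeat
    loss = cost(w)           # 2. evaluate the cost function
    g = gradient(w)          # 3. compute the gradient
    w -= learning_rate * g   # 4. step opposite the gradient

print(round(w, 4))  # w has converged very close to 4
```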

Real-world Example

Imagine you have a plant, and you want to determine the best amount of water and sunlight it needs to grow the tallest. You start with a guess, give the plant a certain amount of water and sunlight, and observe its growth (this is like evaluating the cost function).
  1. Observation: You notice that the plant's growth is not optimal.
  2. Adjust: You tweak the amounts of water and sunlight slightly and see if the plant grows taller.
  3. Direction: If giving it more water improves growth but more sunlight doesn't, you realize you should increase water but not sunlight (this is like computing the gradient).
  4. Repeat: You continue adjusting the water and sunlight levels, each time improving the plant's growth until you find the optimal combination.
In this analogy:
  • The plant’s growth is the performance of your model.
  • Water and sunlight are the model’s parameters.
  • Observing the plant’s growth is like evaluating the cost function.
  • Tweaking the amounts based on growth is like computing the gradient and updating the parameters.

Alpha – The Learning Rate

The learning rate is a key hyperparameter in gradient descent that controls the size of each parameter update. Choosing a suitable learning rate is critical: one that is too large may prevent convergence, while one that is too small may make training slow or leave it stuck in a local minimum.
Learning rate (also referred to as step size or alpha) is the size of the steps taken to reach the minimum. It is typically a small value, evaluated and adjusted based on the behavior of the cost function. A high learning rate produces larger steps but risks overshooting the minimum. Conversely, a low learning rate takes small steps; while this offers more precision, the larger number of iterations hurts overall efficiency, since reaching the minimum requires more time and computation.
 
  a) Learning rate is optimal: the model converges to the minimum.
  b) Learning rate is too small: it takes more iterations, but the model still converges to the minimum.
  c) Learning rate is higher than the optimal value: the updates overshoot but still converge (1/C < η < 2/C).
  d) Learning rate is very large: the updates overshoot and diverge, moving away from the minimum; performance worsens as learning proceeds.
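These four regimes can be reproduced on the toy cost f(w) = w² (for which C = 2, so the overshoot-but-converge window is 0.5 < η < 1). This is only an illustrative sketch, not a general recipe:

```python
def final_w(eta, steps=50, w0=1.0):
    # Gradient descent on f(w) = w^2, whose gradient is 2w.
    w = w0
    for _ in range(steps):
        w -= eta * 2 * w
    return w

print(abs(final_w(0.4)))   # a) near-optimal: converges quickly to 0
print(abs(final_w(0.01)))  # b) too small: moves toward 0, but slowly
print(abs(final_w(0.75)))  # c) 1/C < eta < 2/C: overshoots (w flips sign each step) yet converges
print(abs(final_w(1.1)))   # d) eta > 2/C: diverges, |w| grows every step
```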

Types of Gradient Descent

There are three types of gradient descent learning algorithms: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.
Batch gradient descent
Batch gradient descent sums the error for each point in the training set and updates the model only after all training examples have been evaluated. This process is referred to as a training epoch.
While this batching provides computational efficiency, it can still have a long processing time for large training datasets, as it needs to hold all of the data in memory. Batch gradient descent usually produces a stable error gradient and convergence, but that convergence point isn't always the most ideal, finding a local minimum rather than the global one.

Stochastic gradient descent
Stochastic gradient descent (SGD) processes one training example per update: it evaluates each example in the dataset and updates the parameters one example at a time. Since only one training example needs to be held in memory at a time, storage is easy. While these frequent updates offer more detail and speed, they can reduce computational efficiency compared to batch gradient descent. The frequent updates also produce noisy gradients, but this noise can be helpful for escaping a local minimum and finding the global one.
Mini-batch gradient descent
Mini-batch gradient descent combines concepts from both batch gradient descent and stochastic gradient descent. It splits the training dataset into small batches and performs an update on each batch. This approach strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent.
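A rough sketch of all three variants, assuming a simple linear model with a squared-error cost; the only difference between them is how many examples feed each update (the full dataset gives batch gradient descent, a batch size of 1 gives SGD). The data and hyperparameters here are illustrative:

```python
import numpy as np

def gd(X, y, batch_size, lr=0.2, epochs=200, seed=0):
    """Linear regression trained by gradient descent with a chosen batch size."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))          # shuffle once per epoch
        for start in range(0, len(X), batch_size):
            batch = order[start:start + batch_size]
            err = X[batch] @ w - y[batch]
            w -= lr * 2 * X[batch].T @ err / len(batch)  # MSE gradient step
    return w

rng = np.random.default_rng(42)
X = rng.random((100, 2))
y = X @ np.array([2.0, -1.0])  # noiseless targets with known true weights

print(gd(X, y, batch_size=len(X)))  # batch gradient descent
print(gd(X, y, batch_size=1))       # stochastic gradient descent
print(gd(X, y, batch_size=16))      # mini-batch gradient descent
```

On this noiseless problem all three recover weights close to [2, -1]; with real, noisy data the smaller batch sizes would show the noisier convergence described above.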
     

Python Implementation

```python
import numpy as np
import matplotlib.pyplot as plt

# Generate sample data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Add a bias term to the input features
X_b = np.c_[np.ones((100, 1)), X]

# Initialize parameters
theta = np.random.randn(2, 1)

# Set hyperparameters
learning_rate = 0.01
n_iterations = 1000

# Gradient Descent implementation
for iteration in range(n_iterations):
    # Calculate predictions
    y_pred = X_b.dot(theta)
    # Calculate gradients
    gradients = 2 / len(X) * X_b.T.dot(y_pred - y)
    # Update parameters
    theta = theta - learning_rate * gradients

# Print the final parameters
print("Final Theta:", theta)

# Plot the data and the linear regression line
plt.scatter(X, y, label='Data')
plt.plot(X, X_b.dot(theta), color='red', label='Linear Regression')
plt.xlabel('House Size')
plt.ylabel('House Price')
plt.legend()
plt.show()
```
In this example:
• We generate synthetic data representing house sizes (feature) and prices (target).
• We add a bias term to the input features to account for the intercept in the linear regression equation.
• We initialize random parameters (weights) and use gradient descent to optimize them by minimizing the mean squared error cost function.
• Finally, we visualize the original data and the linear regression line fitted by the gradient descent algorithm.
This example demonstrates how gradient descent can be applied to find the optimal parameters for a linear regression model, allowing you to make predictions on house prices based on their sizes.
     

Advantages and Disadvantages

Advantages:
• Simple to understand and implement.
• Efficient for large datasets (especially with the Mini-Batch and Stochastic variants).
Disadvantages:
• Sensitive to the choice of learning rate.
• Can get stuck in local minima for non-convex functions.
• Requires careful tuning of hyperparameters like learning rate and batch size.

Frequently Asked Questions

Q1. What is gradient descent in linear regression?
A. Gradient descent is an optimization algorithm used to minimize the cost function in linear regression. It iteratively updates the model's parameters by computing the partial derivatives of the cost function with respect to each parameter and adjusting them in the opposite direction of the gradient.
Q2. Which ML algorithms use gradient descent?
A. Several machine learning algorithms use gradient descent, including linear regression, logistic regression, neural networks, and support vector machines. These algorithms use gradient descent to optimize their respective cost functions and improve their performance on the training data.
Q3. What is gradient descent and backpropagation?
A. Gradient descent and backpropagation are two algorithms commonly used together in training neural networks. Backpropagation calculates the gradient of the cost function with respect to each weight by propagating errors backwards through the network, and gradient descent then uses those gradients to update the weights.
Q4. What is gradient descent in simple terms?
A. Gradient descent is an optimization algorithm used to find the minimum of a function by iteratively adjusting the parameters in the opposite direction of the gradient. It is commonly used in machine learning to optimize the parameters of models and improve their performance on a given task.
Q5. What are the two classic types of gradient descent?
A. The two classic types are batch gradient descent and stochastic gradient descent. Batch gradient descent updates the model's parameters using the entire training set in each iteration, while stochastic gradient descent updates the parameters using only one training sample at a time. Mini-batch gradient descent, the third variant described above, combines the two.