I am an experienced software engineer working in artificial intelligence and machine learning. Are you also learning/interested in learning? Learn with me! I’m sharing my learning journals along the way.
Disclosure: I’ve already learned basic machine learning, but I’m starting this tutorial log over because I need a refresher 😅.
In my previous tutorial, I covered model parameters and cost functions. Here’s a quick summary:
- Parameters of a model: Variables that can be modified during training to improve the model. For example, in the model y = wx + b, w and b are parameters, with an input attribute, x.
- Cost function, J: Indicates the accuracy of a model’s predictions against sample data. A smaller J indicates closer predicted values (ŷ) to actual values (y), indicating better parameter choices and an improved model.
- Calculation of J: J is calculated over all training examples and varies with the choice of parameter values. Thus, in a model with parameters w and b, the cost function is written as J(w, b).
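To make the recap concrete, here is a minimal sketch of the cost function J(w, b) for the model y = wx + b, using the mean-squared-error form with a 1/(2m) scaling convention. The toy data and that scaling choice are illustrative assumptions, not from the original post.

```python
# Mean squared error cost J(w, b) for the simple linear model y = wx + b.
# The 1/(2m) scaling and the sample data are illustrative assumptions.

def cost(w, b, xs, ys):
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Toy data generated from y = 2x + 1, so w = 2, b = 1 fits perfectly.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

print(cost(2.0, 1.0, xs, ys))  # 0.0 — predictions match the data exactly
print(cost(1.0, 0.0, xs, ys))  # a larger J — worse parameter choices
```

Notice that the same data gives a different J for each (w, b) pair, which is exactly why we write the cost as J(w, b).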
When we train a model, our goal is to discover a function that best fits our example data. We achieve this by adjusting the parameter values to have the lowest possible cost, J.
Let us first consider the most obvious way of finding optimal parameter values.
We could test every possible parameter value. At first glance, this solution may seem to work. However, there are two main pitfalls. First, it is impossible to exhaustively test all possible parameter values, since the parameters are real numbers: they have infinite range and decimal depth! Second, even testing a coarse grid of values would be incredibly inefficient and time-consuming.
We need a more systematic and innovative solution.
Enter gradient descent. This algorithm involves iteratively testing different parameters and calculating the cost at each step. A “step” is synonymous with iteration — each represents a move closer to optimal parameter values. Gradient descent is similar to our naive solution in that it tests different parameter values over many steps. The critical difference lies in how we choose the next set of parameter values: in gradient descent, we choose them based on gradient calculations, making the process far more directed and efficient.
Here’s how it works at a high level:
- Starting point: We start with specific parameter values, usually chosen randomly or through an educated guess.
- Downward direction: We determine the direction of movement by calculating the gradient.
- Update parameters: We adjust the parameter values, using the gradient, to decrease J.
- Repetition: This process continues until we meet certain stopping conditions. Each repetition is a step.
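The steps above can be sketched as a short loop. This is a minimal illustration for a one-parameter model y = w·x (no bias term, to keep it short); the data, learning rate, and fixed step count are my own assumptions for the example.

```python
# A minimal gradient-descent loop for the one-parameter model y = w * x.
# Data, learning rate, and step count are illustrative assumptions.

def grad(w, xs, ys):
    # dJ/dw for J(w) = (1/2m) * sum((w*x - y)^2)
    m = len(xs)
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / m

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]    # generated from y = 2x, so the optimum is w = 2

w = 0.0                 # starting point: an arbitrary guess
alpha = 0.1             # learning rate
for step in range(100): # fixed number of steps as the stopping condition
    w -= alpha * grad(w, xs, ys)

print(round(w, 4))      # w converges to 2.0
```

Each pass through the loop is one “step”: compute the gradient at the current w, move downhill, repeat.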
Here’s what gradient descent might look like on a cost plot for a one-parameter model. In this diagram, we use gradient descent to take steps that bring us closer to the minimum.
Gradient descent can handle any number of parameters (it just gets more and more complicated to visualize!). In general, since we have a cost function, we can use gradient descent to find values for parameters that minimize the cost, J.
To really understand gradient descent, let’s dig into the details.
Remember high school math… What does the derivative help us calculate? The slope! The gradient in this context gives us the slope of the cost function. Knowing this slope enables us to determine the direction of steepest descent.
This idea made more sense when I tried to visualize it. Let’s say I plot a graph of the cost J at different parameter values. Here’s an example of what a cost chart might look like with two parameters:
Imagine standing on this slope and looking for the fastest way down. In which direction should I move? Down the slope! Ideally, I would be moving in the steepest direction from my current position. This idea is the essence of gradient descent!
A critical aspect of this process is the size of the step we take down the slope. This is controlled by a key factor known as the learning rate, α. The learning rate is a positive number that dictates the magnitude of each parameter change.
Now it’s time for math.
Let us introduce the mathematical formulas central to gradient descent. For each parameter, we calculate a new value using its gradient. The learning rate, α, plays a vital role here, determining the size of the step we take in the direction of steepest descent.
The formula for updating each parameter is unique, ensuring specific and effective adjustments. For example, the updates to parameters w₁, w₂, etc., in a multi-parameter model are:

w₁ := w₁ − α · ∂J/∂w₁
w₂ := w₂ − α · ∂J/∂w₂
Why do we use the partial derivative?
In gradient descent, using partial derivatives instead of full derivatives is a deliberate choice. In multi-parameter models, each parameter uniquely affects the model output. The partial derivative isolates the effect of changing a single parameter, holding all others constant. For example, adjusting w₁ affects the cost J independently of any changes to w₂.
Therefore, the gradient descent update for each parameter uses its own partial derivative, allowing precise, per-parameter adjustments: wⱼ := wⱼ − α · ∂J/∂wⱼ for each parameter wⱼ.
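Here is a sketch of these per-parameter updates for a model y = w₁x₁ + w₂x₂ + b, where each parameter is updated with its own partial derivative. The data, learning rate, and iteration count are illustrative assumptions.

```python
# Simultaneous update of several parameters, each using its own partial
# derivative, for the model y = w1*x1 + w2*x2 + b.
# The data and hyperparameters are illustrative assumptions.

def predict(w1, w2, b, x1, x2):
    return w1 * x1 + w2 * x2 + b

def gradients(w1, w2, b, data):
    # Partial derivatives of J = (1/2m) * sum(error^2) with respect to
    # each parameter, holding the others constant.
    m = len(data)
    dw1 = dw2 = db = 0.0
    for x1, x2, y in data:
        err = predict(w1, w2, b, x1, x2) - y
        dw1 += err * x1
        dw2 += err * x2
        db += err
    return dw1 / m, dw2 / m, db / m

# Toy data generated from y = 2*x1 + 3*x2 (so b = 0 at the optimum).
data = [(1.0, 2.0, 8.0), (2.0, 1.0, 7.0), (3.0, 3.0, 15.0)]
w1 = w2 = b = 0.0
alpha = 0.05
for _ in range(5000):
    dw1, dw2, db = gradients(w1, w2, b, data)
    # Update all parameters simultaneously, from the same gradient.
    w1 -= alpha * dw1
    w2 -= alpha * dw2
    b -= alpha * db

print(round(w1, 3), round(w2, 3))  # close to the true values w1 = 2, w2 = 3
```

Note that all partial derivatives are computed from the *current* parameter values before any parameter is updated; updating them one at a time from stale gradients is a common bug.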
As mentioned, the learning rate, α, is a key element that dictates the step size in our journey to the minimum of the cost function. Therefore, choosing a reasonable learning rate is critical. Here’s how choosing a bad learning rate can negatively affect gradient descent and, in some cases, stop it from working at all:
- Overshooting with big steps: A large α makes our steps very large. We risk overshooting the minimum and landing on the opposite slope. We can then keep bouncing from slope to slope without ever settling at the minimum. Here’s what overshooting might look like:
- Usually, we can tell this is happening if the gradient swings back and forth (e.g., between positive and negative) on each iteration. Sometimes we get lucky and still land close to the minimum.
- Divergence: In extreme cases, these large steps don’t just cause bouncing; they escalate, moving us ever farther from the minimum and causing the algorithm to diverge completely:
- Inefficiency with tiny steps: A small learning rate can slow the journey considerably, resulting in little progress toward the minimum on each step. Gradient descent will still work, but it will be extremely inefficient.
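All three failure modes above can be seen on the simplest possible cost function. This sketch uses J(w) = w², whose gradient is 2w and whose minimum is at w = 0; the specific rates are illustrative assumptions.

```python
# Effect of the learning rate on the one-parameter cost J(w) = w^2,
# whose gradient is 2w and whose minimum is at w = 0.
# The specific rates below are illustrative assumptions.

def run(alpha, steps=20, w=1.0):
    for _ in range(steps):
        w -= alpha * 2 * w   # gradient of w^2 is 2w
    return w

print(run(0.1))    # shrinks toward 0 each step — healthy convergence
print(run(1.1))    # each step overshoots and grows — divergence
print(run(0.001))  # barely moves in 20 steps — inefficient
```

With α = 1.1 each update multiplies w by (1 − 2α) = −1.2, so the iterate flips sign and grows — exactly the bouncing-then-diverging behavior described above.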
Tips for optimizing your learning rate
To optimize your learning rate:
- Experiment with increments (e.g., 0.001, 0.01, 0.1), looking for a rate that consistently reduces the cost.
- Start with a very small α to ensure consistent cost reduction, then increase gradually.
- Aim for the highest learning rate that still guarantees consistent and rapid cost reduction.
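The tips above amount to a simple sweep: try increasing rates and keep the largest one whose cost still drops at every step. A sketch on the toy cost J(w) = w², with the candidate rates and step budget as illustrative assumptions:

```python
# A simple learning-rate sweep on the cost J(w) = w^2: keep the largest
# rate whose cost strictly decreases at every step.
# Candidates and the step budget are illustrative assumptions.

def cost_history(alpha, steps=10, w=1.0):
    history = [w * w]
    for _ in range(steps):
        w -= alpha * 2 * w       # gradient of w^2 is 2w
        history.append(w * w)
    return history

best = None
for alpha in [0.001, 0.01, 0.1, 1.0]:
    h = cost_history(alpha)
    # Keep the rate only if the cost decreased on every step.
    if all(b < a for a, b in zip(h, h[1:])):
        best = alpha

print(best)  # 0.1 — the largest rate that still reduces cost every step
```

Here α = 1.0 is rejected because the iterate just bounces between w = 1 and w = −1, leaving the cost flat — the swing-back-and-forth symptom from the previous section.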
Determining the right moment to stop gradient descent is a strategic decision. We can choose from different stopping conditions. Here are some examples:
- Convergence: We stop when the parameter changes are negligible, indicating proximity to the minimum.
- Fixed number of steps: We set a predetermined number of iterations for practical reasons, especially when computational resources are a constraint.
- Threshold-based: We stop when the reduction in cost J falls below a specified threshold, indicating diminishing returns from further iterations.
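The convergence condition is easy to express in code: stop once the parameter change per step becomes negligible, with a fixed step budget as a fallback. This sketch uses J(w) = (w − 3)², and the tolerance and learning rate are illustrative assumptions.

```python
# Stopping gradient descent on convergence: halt when the parameter
# change falls below a small tolerance, with a fixed step budget as a
# fallback. Cost is J(w) = (w - 3)^2; tol and alpha are assumptions.

def minimize(alpha=0.1, tol=1e-8, max_steps=10_000):
    w = 0.0
    for step in range(max_steps):
        new_w = w - alpha * 2 * (w - 3)   # gradient of (w - 3)^2
        if abs(new_w - w) < tol:          # convergence: negligible change
            return new_w, step + 1
        w = new_w
    return w, max_steps                   # fallback: fixed number of steps

w, steps = minimize()
print(round(w, 6), steps)  # w lands at the minimum, w = 3, well before max_steps
```

In practice the conditions are often combined, as here: converge if you can, but never run past the step budget.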
- Gradient descent optimizes model parameters by iteratively adjusting them based on gradient calculations, aiming to minimize the cost function, J.
- The algorithm involves determining the direction of steepest descent using partial derivatives and updating each parameter with a step whose size is scaled by the learning rate (α). The process repeats until it meets certain stopping conditions, such as convergence or a fixed number of iterations.
- The choice of learning rate is crucial as it affects the effectiveness of achieving minimum cost.
Gradient descent is a highly versatile algorithm that is applicable to a wide range of machine learning methods, including complex models such as deep learning.
I’m taking Andrew Ng’s machine learning specialization and these learning logs contain some of what I learned from it. It’s a great course. I highly recommend it!