Momentum Matters

Momentum in Gradient Descent

[Interactive figure: gradient descent with momentum, step-size α = 0.02, momentum β = 0.99]

We often think of Momentum as a means of dampening oscillations and speeding up the iterations, leading to faster convergence. But it has other interesting behavior. It allows a larger range of step-sizes to be used, and creates its own oscillations. What is going on?

Here’s a popular story about momentum: gradient descent is a man walking down a hill. He follows the steepest path downwards; his progress is slow, but steady. Momentum is a heavy ball rolling down the same hill. The added inertia acts both as a smoother and an accelerator, dampening oscillations and causing us to barrel through narrow valleys, small humps and local minima.

This standard story isn’t wrong, but it fails to explain many important behaviors of momentum. In fact, momentum can be understood far more precisely if we study it on the right model.

One nice model is the convex quadratic. This model is rich enough to reproduce momentum’s local dynamics in real problems, and yet simple enough to be understood in closed form. This balance gives us powerful traction for understanding this algorithm.

Gradient Descent

Gradient descent has many virtues, but speed is not one of them. It is simple — when optimizing a smooth function f, we make a small step in the direction of the negative gradient:

w^{k+1} = w^k − α∇f(w^k).

For a small enough step-size, gradient descent makes a monotonic improvement at every iteration. It always converges, albeit to a local minimum. And under a few weak curvature conditions it can even get there at an exponential rate.
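As a concrete sketch of this update (the quadratic, step-size, and variable names below are illustrative choices, not taken from the text), here is gradient descent on an ill-conditioned convex quadratic f(w) = ½ wᵀAw, whose gradient is Aw:

```python
import numpy as np

# Illustrative convex quadratic: curvature differs by 100x between directions.
A = np.diag([1.0, 100.0])

def grad_f(w):
    return A @ w

alpha = 0.01                     # step-size; must satisfy alpha < 2 / (largest eigenvalue)
w = np.array([1.0, 1.0])
for _ in range(1000):
    w = w - alpha * grad_f(w)    # w^{k+1} = w^k - alpha * grad f(w^k)

# w shrinks toward the minimizer at the origin, but progress along the
# low-curvature direction (eigenvalue 1) is slow: its error decays only
# by a factor of 0.99 per step.
```

Note how the step-size is pinned down by the steepest direction while the shallow direction sets the pace of convergence — this is exactly the pathological-curvature tension discussed below.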

Momentum

But the exponential decrease, though appealing in theory, can often be infuriatingly small. Things often begin quite well — with an impressive, almost immediate decrease in the loss. But as the iterations progress, things start to slow down. You start to get a nagging feeling you’re not making as much progress as you should be. What has gone wrong?

The problem could be the optimizer’s old nemesis, pathological curvature. Pathological curvature is, simply put, a region of f that isn’t scaled properly. Such landscapes are often described as valleys, trenches, canals and ravines. The iterates either jump between the valley walls or approach the optimum in small, timid steps; progress along certain directions grinds to a halt. In these unfortunate regions, gradient descent fumbles.

Momentum proposes the following tweak to gradient descent. We give gradient descent a short-term memory:

z^{k+1} = βz^k + ∇f(w^k)
w^{k+1} = w^k − αz^{k+1}.

The first equation accumulates a decaying sum of past gradients; the second steps along that accumulated direction instead of the raw gradient.

Conclusion

Momentum is often misunderstood as simply a way to speed up gradient descent. But it has much deeper implications for the behavior of the algorithm. By introducing a short-term memory, momentum can help the algorithm navigate pathological curvature and achieve faster convergence.

FAQs

What is momentum in gradient descent?
Momentum is a modification to the gradient descent algorithm that introduces a short-term memory. It helps the algorithm navigate pathological curvature and achieve faster convergence.

How does momentum work?
Momentum works by keeping a running, exponentially decaying sum of past gradients, and taking each step along that accumulated direction rather than along the current gradient alone. Gradients that consistently point the same way reinforce each other, while oscillating components partially cancel.
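The "decaying sum" reading can be checked directly: unrolling z^{k+1} = βz^k + ∇f(w^k) from z^0 = 0 gives z^{k+1} = Σ_{i=0..k} β^i ∇f(w^{k−i}). A quick numeric check, using arbitrary made-up "gradients":

```python
import numpy as np

beta = 0.9
grads = [np.array([1.0, -2.0]), np.array([0.5, 0.5]), np.array([-1.0, 3.0])]

# Recursive form: z^{k+1} = beta * z^k + g^k
z = np.zeros(2)
for g in grads:
    z = beta * z + g

# Unrolled form: a geometric sum over past gradients, newest weighted most
unrolled = sum(beta**i * g for i, g in enumerate(reversed(grads)))
```

Both forms agree, which is why β near 1 means a long memory and β = 0 recovers plain gradient descent.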

What are the benefits of momentum?
The benefits of momentum include faster convergence on ill-conditioned problems, damped oscillations across narrow valleys, and tolerance of a larger range of step-sizes than plain gradient descent.

Are there any drawbacks to using momentum?
Yes, one drawback is that momentum adds a second hyperparameter (β) to tune, making the algorithm more sensitive to the choice of hyperparameters. A poorly chosen combination of step-size and momentum can also introduce oscillations of its own, causing the iterates to overshoot the minimum.

How do I choose the right step-size and momentum parameters?
The choice of step-size and momentum parameters depends on the specific optimization problem and the desired performance of the algorithm. A common approach is to start from typical defaults (such as β = 0.9) and tune both parameters together, for example by grid search against a validation objective.

Can I use momentum with other optimization algorithms?
Yes. Momentum is routinely combined with stochastic gradient descent, and momentum-like accumulation of past gradients is built into adaptive methods such as Adam. The specific implementation and hyperparameters may need to be adjusted depending on the algorithm being used.
