13.3 Why does gradient descent work?
Young man, in mathematics you don’t understand things. You just get used to them. — John von Neumann
In the practice of machine learning, we use gradient descent so much that we get used to it. We hardly ever question why it works.
What’s usually told is the mountain-climbing analogy: to find the peak (or the bottom) of a bumpy terrain, one has to look at the direction of the steepest ascent (or descent), and take a step in that direction. This direction is described by the gradient, and the iterative process of finding local extrema by following the gradient is called gradient ascent/descent. (Ascent for finding peaks, descent for finding valleys.)
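To make the iterative process concrete, here is a minimal sketch of gradient descent in one dimension. (The function names, the example objective f(x) = x², and the step size 0.1 are illustrative choices, not taken from the text.)

```python
def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient to approach a local minimum."""
    x = x0
    for _ in range(n_steps):
        # Moving opposite to the gradient is the "steepest descent" direction.
        x = x - learning_rate * grad(x)
    return x

# Minimize f(x) = x^2, whose derivative is f'(x) = 2x.
minimum = gradient_descent(lambda x: 2 * x, x0=5.0)
```

With these settings each step shrinks x by a constant factor, so the iterates converge to the true minimizer at x = 0; gradient ascent is the same loop with the sign of the update flipped.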
However, this is not a mathematically precise explanation. It leaves several questions unanswered, and based on our mountain-climbing intuition alone, it’s not even clear whether the algorithm works.
Without a precise understanding of gradient descent, we are practically flying...