15.3 The curse of dimensionality
First, let’s talk about optimization. If all else fails, optimizing a single-variable function f : [a, b] → ℝ can be as simple as partitioning [a, b] into a grid of n points, evaluating the function at each point, and picking out the smallest (or largest) value.
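Here is a minimal sketch of such a one-dimensional grid search, assuming NumPy and a vectorized objective f (the function name and test objective below are illustrative, not from the text):

```python
import numpy as np

def grid_search_minimum(f, a, b, n=1000):
    """Approximate the minimizer of f on [a, b] by evaluating it on an n-point grid."""
    grid = np.linspace(a, b, n)      # partition [a, b] into n equally spaced points
    values = f(grid)                 # evaluate the function at every grid point
    return grid[np.argmin(values)]   # return the grid point with the smallest value

# Example: f(x) = (x - 1)^2 attains its minimum on [-3, 3] at x = 1
x_min = grid_search_minimum(lambda x: (x - 1) ** 2, -3, 3)
print(x_min)   # ≈ 1.0
```

The accuracy is limited only by the grid resolution n, which is exactly what stops working once the number of variables grows.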
We cannot do this in higher dimensions. To see why, consider ResNet18, the famous convolutional network architecture. It has precisely 11,689,512 parameters, so training it amounts to optimizing a function of a whopping 11,689,512 variables. If we were to construct a grid with just two points along every dimension, we would have 2^11,689,512 points at which to evaluate the function, since each additional dimension doubles the count. For comparison, the number of atoms in the observable universe is around 10^82, a number dwarfed by the size of our grid. Grid search is thus practically impossible here. We are forced to devise clever algorithms that can tackle the size and complexity of high-dimensional spaces.
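To get a feel for the numbers, here is a quick back-of-the-envelope check. The parameter count is the one quoted above; the digit count is computed via logarithms so that the astronomically large integer is never materialized:

```python
import math

n_params = 11_689_512    # number of parameters in ResNet18
points_per_dim = 2       # just two grid points along each dimension

# Number of decimal digits of 2**n_params, computed without building the integer itself
digits = math.floor(n_params * math.log10(points_per_dim)) + 1
print(digits)            # roughly 3.5 million digits, versus the 83 digits of 10^82
```

Even writing down the size of the grid takes millions of digits, while the atom count of the observable universe fits in under a hundred.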
Another...