Hyperparameter Optimization

One of the biggest drawbacks of deep neural networks is that they have many hyperparameters that must be tuned for the network to perform well. In each of the earlier chapters, we've encountered, but not covered, the challenge of hyperparameter estimation. Hyperparameter optimization is a very big topic; it is, for the most part, an unsolved problem, and while we can't cover it fully in this book, I think it still deserves its own chapter.

In this chapter, I'm going to offer what I believe is practical advice for choosing hyperparameters. To be sure, this chapter may be somewhat opinionated and biased, because it comes from my own experience. I hope that experience is useful, and that it also leads you to investigate the topic further.

We will cover the following topics in this chapter:

  • Should network architecture be considered a hyperparameter?
  • Which hyperparameters should we optimize?
  • Hyperparameter optimization strategies

Should network architecture be considered a hyperparameter?

In building even the simplest network, we have to make all sorts of choices about network architecture. Should we use 1 hidden layer or 1,000? How many neurons should each layer contain? Should they all use the relu activation function or tanh? Should we use dropout on every hidden layer, or just the first? There are many choices we have to make in designing a network architecture.
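
To make these choices concrete, here is a minimal sketch, using hypothetical argument names and a placeholder input size, of how each architecture decision becomes an explicit knob when we build a small Keras model:

    from keras.models import Sequential
    from keras.layers import Dense, Dropout

    def build_network(n_hidden_layers=2, n_neurons=64, activation='relu',
                      use_dropout=True, dropout_rate=0.5):
        # Every argument above is an architecture decision we have to make;
        # input_dim=10 and the sigmoid output are placeholders for a small
        # binary classification problem.
        model = Sequential()
        model.add(Dense(n_neurons, activation=activation, input_dim=10))
        if use_dropout:
            model.add(Dropout(dropout_rate))
        for _ in range(n_hidden_layers - 1):
            model.add(Dense(n_neurons, activation=activation))
            if use_dropout:
                model.add(Dropout(dropout_rate))
        model.add(Dense(1, activation='sigmoid'))
        model.compile(optimizer='adam', loss='binary_crossentropy')
        return model

Writing the architecture as a parameterized function like this makes it at least possible to search over it, even if, as discussed next, an exhaustive search is rarely feasible.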

In the most typical case, we search exhaustively for optimal values of each hyperparameter. It's not so easy to search exhaustively over network architectures, though. In practice, we probably don't have the time or computational power to do so. We rarely see researchers searching for the optimal architecture through exhaustive search, because the number of choices is so vast and because there is more than one correct answer...

Which hyperparameters should we optimize?

Even if you were to follow my advice above and settle on a good-enough architecture, you can and should still search for ideal hyperparameters within that architecture. Some of the hyperparameters we might want to search include the following (see the sketch below):

  • Our choice of optimizer. Thus far, I've been using Adam, but an rmsprop optimizer or a well-tuned SGD may do better.
  • Each of these optimizers has a set of hyperparameters that we might tune, such as learning rate, momentum, and decay.
  • Network weight initialization.
  • Neuron activation.
  • Regularization parameters such as dropout probability or the regularization parameter used in l2 regularization.
  • Batch size.

As implied above, this is not an exhaustive list. There are most certainly more options you could try, including introducing variable numbers of neurons in each hidden layer,...
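
To make the list above concrete, here is a minimal sketch of what such a search space might look like, with hypothetical value grids chosen only for illustration; counting the combinations also shows why a full grid search is rarely affordable:

    # Candidate values for each hyperparameter (illustrative only).
    search_space = {
        'optimizer': ['adam', 'rmsprop', 'sgd'],
        'learning_rate': [0.1, 0.01, 0.001, 0.0001],
        'weight_init': ['glorot_uniform', 'he_normal', 'lecun_uniform'],
        'activation': ['relu', 'tanh', 'elu'],
        'dropout_rate': [0.0, 0.2, 0.5],
        'batch_size': [32, 64, 128],
    }

    # Multiplying the grid sizes gives 3 * 4 * 3 * 3 * 3 * 3 = 972 models to
    # train; at even an hour per model, that is more than a month of
    # sequential compute.
    n_combinations = 1
    for values in search_space.values():
        n_combinations *= len(values)
    print(n_combinations)  # 972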

Hyperparameter optimization strategies

At this point in the chapter, we've suggested that it is, for the most part, computationally impractical, if not impossible, to try every combination of hyperparameters we might want to explore. Deep neural networks can certainly take a long time to train. While you can parallelize the search and throw computational resources at the problem, your greatest limiter in searching for hyperparameters is likely to remain time.

If time is our greatest constraint, and we can't reasonably explore every possibility in the time we have, then we need a strategy that extracts the most utility from the time we do have.

In the remainder of this section, I'll cover some common strategies for hyperparameter optimization and then I'll show you how to optimize hyperparameters in Keras with two of my favorite methods...
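
As a preview of what that wiring can look like, here is a minimal sketch of one common strategy, random search, using the scikit-learn wrapper that ships with Keras 2.x. The model, the synthetic data, and the value grids below are placeholders for illustration, not necessarily the exact methods used later in the chapter:

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense, Dropout
    from keras.optimizers import Adam
    from keras.wrappers.scikit_learn import KerasClassifier
    from sklearn.model_selection import RandomizedSearchCV

    def build_model(hidden_units=32, dropout_rate=0.2, learning_rate=0.001):
        # A small placeholder binary classifier; each argument is a
        # hyperparameter the search will vary.
        model = Sequential()
        model.add(Dense(hidden_units, activation='relu', input_dim=20))
        model.add(Dropout(dropout_rate))
        model.add(Dense(1, activation='sigmoid'))
        model.compile(optimizer=Adam(lr=learning_rate),
                      loss='binary_crossentropy', metrics=['accuracy'])
        return model

    # Random data stands in for a real dataset.
    X = np.random.rand(1000, 20)
    y = np.random.randint(2, size=1000)

    param_distributions = {
        'hidden_units': [16, 32, 64, 128],
        'dropout_rate': [0.0, 0.2, 0.5],
        'learning_rate': [0.01, 0.001, 0.0001],
        'batch_size': [32, 64, 128],
    }

    # n_iter caps how many random combinations get trained, which is exactly
    # how we trade search thoroughness for wall-clock time.
    search = RandomizedSearchCV(
        KerasClassifier(build_fn=build_model, epochs=5, verbose=0),
        param_distributions, n_iter=10, cv=3)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)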

Summary

Hyperparameter optimization is an important step in getting the very best from our deep neural networks. Finding the best way to search for hyperparameters is an open and active area of machine learning research. While you most certainly can apply the state of the art to your own deep learning problem, you will need to weigh the complexity of implementation against the search runtime in your decision.

There are decisions related to network architecture that most certainly can be searched exhaustively, but a set of heuristics and best practices, like the ones I offered above, might get you close enough, or at least reduce the number of parameters you need to search.

Ultimately, hyperparameter search is an economics problem, and the first part of any hyperparameter search should be a consideration of your budget of computation time, and of personal time, for isolating the best hyperparameter...
