Determining the right activation function

The purpose of an activation function is to introduce non-linearity into a neural network. Non-linearity helps a neural network to learn more complex patterns. We will discuss some important activation functions, and their respective DL4J implementations.

The following are the activation functions that we will consider:

Tanh
Sigmoid
ReLU (short for Rectified Linear Unit)
Leaky ReLU
Softmax

In this recipe, we will walk through the key steps to decide the right activation functions for a neural network.

How to do it...

Choose an activation function according to the network layers: We need to know the activation functions to be used for the input/hidden layers and output layers. Use ReLU for input/hidden layers preferably.
Choose the right activation function to handle data impurities: Inspect the data that you feed to the neural network. Do you have inputs with a majority of negative values observing dead neurons? Choose the appropriate activation functions accordingly. Use Leaky ReLU if dead neurons are observed during training.
Choose the right activation function to handle overfitting: Observe the evaluation metrics and their variation for each training period. Understand gradient behavior and how well your model performs on new unseen data.
Choose the right activation function as per the expected output of your use case: Examine the desired outcome of your network as a first step. For example, the SOFTMAX function can be used when you need to measure the probability of the occurrence of the output class. It is used in the output layer. For any input/hidden layers, ReLU is what you need for most cases. If you're not sure about what to use, just start experimenting with ReLU; if that doesn't improve your expectations, then try other activation functions.

How it works...

For step 1, ReLU is most commonly used because of its non-linear behavior. The output layer activation function depends on the expected output behavior. Step 4 targets this too.

For step 2, Leaky ReLU is an improved version of ReLU and is used to avoid the zero gradient problem. However, you might observe a performance drop. We use Leaky ReLU if dead neurons are observed during training. Dead neurons are referred to as neurons with a zero gradient for all possible inputs, which makes them useless for training.

For step 3, the tanh and sigmoid activation functions are similar and are used in feed-forward networks. If you use these activation functions, then make sure you add regularization to network layers to avoid the vanishing gradient problem. These are generally used for classifier problems.

There's more...

The ReLU activation function is non-linear, hence, the backpropagation of errors can easily be performed. Backpropagation is the backbone of neural networks. This is the learning algorithm that computes gradient descent with respect to weights across neurons. The following are ReLU variations currently supported in DL4J:

ReLU: The standard ReLU activation function:

public static final Activation RELU

ReLU6: ReLU activation, which is capped at 6, where 6 is an arbitrary choice:

public static final Activation RELU6

RReLU: The randomized ReLU activation function:

public static final Activation RRELU

ThresholdedReLU: Threshold ReLU:

public static final Activation THRESHOLDEDRELU

There are a few more implementations, such as SeLU (short for the Scaled Exponential Linear Unit), which is similar to the ReLU activation function but has a slope for negative values.