mechanism. This nonlinearity enables the artificial neural network to learn a complex nonlinear mapping from input to output signals. Without a nonlinear activation function, the network is a linear system, however many layers it has, and its information processing capability is very limited. Mathematically, even a neural network with a single hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden neurons and a suitable nonlinear activation function.
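The claim that a network without a nonlinear activation is just a linear system can be checked numerically. The following is a minimal sketch, assuming NumPy and hypothetical layer sizes and random weights chosen only for illustration: two stacked layers with no activation collapse into one equivalent layer, whereas inserting tanh between them breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: hypothetical layer sizes (3 inputs, 4 hidden units, 2 outputs).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)


def two_linear_layers(x):
    """Two stacked layers with no activation function in between."""
    return W2 @ (W1 @ x + b1) + b2


# The same mapping written as a single layer: W2 (W1 x + b1) + b2.
W_equiv = W2 @ W1
b_equiv = W2 @ b1 + b2


def one_linear_layer(x):
    """Single layer that reproduces the two stacked linear layers."""
    return W_equiv @ x + b_equiv


def two_layers_with_tanh(x):
    """Same weights, but with a nonlinearity between the layers."""
    return W2 @ np.tanh(W1 @ x + b1) + b2


x = rng.normal(size=3)
print("linear stack equals one layer:",
      np.allclose(two_linear_layers(x), one_linear_layer(x)))
print("tanh stack equals one layer:  ",
      np.allclose(two_layers_with_tanh(x), one_linear_layer(x)))
```

Adding more activation-free layers only multiplies more matrices together, so depth alone adds no expressive power.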
For an activation function to perform satisfactorily, it should satisfy the following conditions: (i) differentiability, which is necessary for the gradient descent method to be used to optimize a network, and (ii) monotonicity, which is biologically motivated, since a neuron is in either an inhibitory or an excitatory state. Only when the activation function is monotonic can a single-hidden-layer network be optimized as a convex problem.
Generally speaking, the activation function is of great importance since it delivers a single number, via a ‘soft’ thresholding operation, as the final result of the information processing performed by the neuron. Several commonly used activation functions are described in the following subsections.
Sigmoid
Sigmoid is a commonly used activation function, as shown in figure 3.5 (Han and Moraga 1995). The Sigmoid function sets the output value in a range between 0 and 1, where 0 represents not activated at all, and 1 represents fully activated. This binary nature simulates the two states of a biological neuron, where an action potential is transmitted only if the accumulated stimulation strength is above a threshold. A larger output value means a stronger response of the neuron. The mathematical expression is as follows:
$$\sigma(x) = \frac{1}{1 + e^{-x}}. \tag{3.3}$$
The sigmoid function was very popular in the past since it has a clear interpretation as the firing rate of a neuron. However, the sigmoid nonlinearity is rarely used now, because it has the following major drawback: when the activation saturates at either tail, near 0 or 1, the gradient in these regions is almost zero, which is undesirable for optimization of the network parameters. During the backpropagation-based training process (see section 3.1.7 for details on backpropagation), the nearly zero gradient of a neuron is multiplied with other factors according to the chain rule to compute an overall gradient for the parameter update. Such a diminishing gradient effectively ‘kills’ the overall gradient, and almost no information flows through the neuron to its weights and recursively to its data. In addition, we need to initialize the weights of sigmoid neurons carefully to avoid saturation. For example, when the weights of some neurons are initially set too large, these neurons become saturated and hardly learn. Also, sigmoid outputs are not zero-centered. The output is always greater than 0, which makes the input values to the next layer all positive. Then, during backpropagation, the gradients with respect to the weights of a given neuron in the next layer all share the same sign, so the elements of the weight matrix change in a biased direction, compromising the training efficiency. Finally, the sigmoid function involves the exponential operation, which is computationally demanding.
Figure 3.5. The sigmoid function is differentiable and monotonic, with the range (0, 1) and the whole real axis as the domain.
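The saturation problem can be made concrete with a few numbers. The sketch below, assuming NumPy, evaluates equation (3.3) and its derivative σ(x)(1 − σ(x)) at a handful of illustrative inputs; the function names and the ten-layer example are hypothetical and serve only to show how quickly the chain-rule product shrinks.

```python
import numpy as np


def sigmoid(x):
    """Sigmoid activation, equation (3.3)."""
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The gradient peaks at x = 0 and decays rapidly toward both tails.
for x in (-10.0, -5.0, 0.0, 5.0, 10.0):
    print(f"x = {x:6.1f}  sigmoid = {sigmoid(x):.6f}  gradient = {sigmoid_grad(x):.6f}")

# Assumption: a hypothetical chain of 10 sigmoid neurons, each at its
# best case x = 0. The chain rule multiplies the local gradients, so
# even the best case shrinks the product geometrically.
best_case_product = sigmoid_grad(0.0) ** 10
print("product of 10 best-case gradients:", best_case_product)
```

Since the derivative never exceeds 0.25, every sigmoid neuron on a backpropagation path scales the gradient down by at least a factor of four.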
Tanh
Tanh is also a commonly used nonlinear activation function, as shown in figure 3.6 (Fan 2000). Although the sigmoid function has a direct biological interpretation, it turns out that sigmoid leads to a diminishing gradient, undesirable for training a neural network. Like sigmoid, the tanh function is also ‘s’-shaped, but its output range is (−1, 1). Thus, negative inputs to tanh are mapped to negative outputs, and only a zero input is mapped to zero. These properties make it better than sigmoid. The tanh function is defined as follows:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}. \tag{3.4}$$
Figure 3.6. The tanh function is similar to sigmoid in shape but has the symmetric range (−1, 1).
Tanh is nonlinear and squashes a real-valued number to the range (−1, 1). Unlike sigmoid, the output of tanh is zero-centered and symmetric about the origin. Therefore, tanh is often preferred over sigmoid in practice.
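The zero-centering property can be illustrated numerically. The following minimal sketch, assuming NumPy and an illustrative grid of inputs symmetric about zero, compares the mean outputs of the two functions and also checks the standard identity tanh(x) = 2σ(2x) − 1, which shows that tanh is a rescaled and shifted sigmoid.

```python
import numpy as np


def sigmoid(x):
    """Sigmoid activation, equation (3.3)."""
    return 1.0 / (1.0 + np.exp(-x))


# Assumption: an illustrative input grid, symmetric about zero.
x = np.linspace(-5.0, 5.0, 101)

tanh_mean = np.tanh(x).mean()
sigmoid_mean = sigmoid(x).mean()
print("mean tanh output:   ", tanh_mean)     # close to 0
print("mean sigmoid output:", sigmoid_mean)  # close to 0.5

# tanh is a sigmoid stretched to (-1, 1) and shifted to pass through 0.
identity_holds = np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)
print("tanh(x) == 2*sigmoid(2x) - 1:", identity_holds)
```

The identity also makes clear that tanh still saturates at both tails, so it shares the diminishing-gradient behavior of sigmoid for inputs of large magnitude.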
ReLU
Currently, the rectified linear unit (ReLU) function has become very popular, and is shown in figure 3.7. Unlike sigmoid and tanh, ReLU outputs 0 if its input is less than 0; otherwise, it just reproduces the input. The mechanism of ReLU is more like that of the biological neurons in the visual cortex. ReLU allows some neurons to output zero while the rest of the neurons respond positively, often giving a sparse response that alleviates overfitting and simplifies computation. In the brain, some specialized neurons respond at a high frequency only when there is a suitable stimulus signal. Otherwise, the response frequency of the neuron is no more than 1 Hz, which is just like being processed by a half-wave rectifier. The formula of ReLU is as follows:
$$\mathrm{ReLU}(x) = \max(0, x). \tag{3.5}$$
Figure 3.7. The ReLU function, which is equal to zero for a negative input, and otherwise reproduces the input.
As shown in figure 3.7, the ReLU activation is easy to calculate, since it simply thresholds the input value at zero. There are several merits of the ReLU function: (i) there is no saturation zone for a positive stimulation, and hence no diminishing gradient there; (ii) there is no exponential operation, so the calculation is very efficient; and (iii) in the network training process, the convergence speed of ReLU is much faster than that of sigmoid/tanh. On the other hand, the ReLU function is not perfect. The output of ReLU is not always informative, which affects the efficiency of the network training process. Specifically, the ReLU output is always zero when x<0, and so is its gradient. As a result, the related network parameters receive a zero gradient and cannot be updated, leading to the phenomenon of ‘dead neurons’.
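The ‘dead neuron’ effect follows directly from the gradient of equation (3.5). The sketch below, assuming NumPy, implements ReLU and its gradient and then evaluates a single neuron whose weights, bias and inputs are hypothetical values chosen so that the pre-activation is negative for every sample.

```python
import numpy as np


def relu(x):
    """ReLU activation, equation (3.5)."""
    return np.maximum(0.0, x)


def relu_grad(x):
    """Gradient of ReLU: 1 for x > 0, 0 otherwise (0 is used at x = 0 by convention)."""
    return (x > 0).astype(float)


x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print("ReLU:    ", relu(x))
print("gradient:", relu_grad(x))

# Assumption: a hypothetical neuron with a large negative bias, so its
# pre-activation is negative for every sample in this small batch.
w, b = np.array([0.5, -0.2]), -10.0
inputs = np.array([[1.0, 2.0], [3.0, -1.0], [0.0, 4.0]])
pre_activation = inputs @ w + b

# Chain rule: d ReLU(w.x + b) / dw = relu_grad(w.x + b) * x, per sample.
grad_w = relu_grad(pre_activation)[:, None] * inputs
print("pre-activations:", pre_activation)
print("weight gradients per sample:\n", grad_w)
```

Because every per-sample gradient is zero, no gradient-descent step can move these weights, and the neuron stays inactive unless the inputs change.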
Leaky ReLU
The leaky ReLU function is an improved ReLU, as shown in figure 3.8 (Maas and Hannun 2013). In order to save ‘dead neurons’ when the input is less than 0, leaky ReLU responds to a negative input in such a way that the negative input is greatly attenuated but the information on the negative input is still kept. Compared to ReLU, leaky ReLU is written as follows:
$$\text{Leaky ReLU}(x) = \begin{cases} \alpha x, & x < 0 \\ x, & x \geqslant 0, \end{cases} \tag{3.6}$$
Figure 3.8. The leaky ReLU function, which greatly attenuates a negative input but still records the negative information (the slope for the negative input is set to α=0.1).
where α is a small positive constant, which is usually set to 0.1. Leaky ReLU gives all negative inputs a small positive slope, so the gradient is never exactly zero and the information on negative inputs is not lost, which mitigates the ‘dead neuron’ problem of ReLU.
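Equation (3.6) and its gradient can be sketched in a few lines. The example below assumes NumPy and uses the slope α = 0.1 quoted in the text; the function names are hypothetical.

```python
import numpy as np


def leaky_relu(x, alpha=0.1):
    """Leaky ReLU, equation (3.6): alpha * x for x < 0, x otherwise."""
    return np.where(x < 0, alpha * x, x)


def leaky_relu_grad(x, alpha=0.1):
    """Gradient of leaky ReLU: alpha for x < 0, 1 otherwise."""
    return np.where(x < 0, alpha, 1.0)


x = np.array([-20.0, -2.0, 0.0, 3.0])
print("leaky ReLU:", leaky_relu(x))
print("gradient:  ", leaky_relu_grad(x))
```

Unlike ReLU, the gradient for a negative input equals α rather than zero, so a neuron with a negative pre-activation still receives a small update.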
ELU
The exponential linear unit (ELU) is also a variant of ReLU, as shown in figure 3.9 (Clevert et al 2015). When the input is less than 0, the ELU is expressed in an exponential form, and the output saturation is controlled by the parameter α to ensure a smooth transition from the deactivated to the activated state. Compared to ReLU, ELU has negative values that push the mean output closer to zero. Shifting the mean toward zero helps speed up learning because of a reduced bias. The ELU function is defined as follows:
$$\mathrm{ELU}(x) = \begin{cases} x, & x > 0 \\ \alpha\left(e^{x} - 1\right), & x \leqslant 0. \end{cases} \tag{3.7}$$
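A minimal sketch of equation (3.7) follows, assuming NumPy. The value α = 1 is an assumption made for illustration, since the text does not fix it, and the input grid is likewise hypothetical. The example shows the saturation at −α for very negative inputs and compares the mean output with that of ReLU.

```python
import numpy as np


def elu(x, alpha=1.0):
    """ELU, equation (3.7). Assumption: alpha = 1.0 is an illustrative default."""
    x = np.asarray(x, dtype=float)
    # np.where evaluates both branches, so the exponent is clamped to
    # non-positive values to avoid overflow for large positive inputs.
    negative_branch = alpha * (np.exp(np.minimum(x, 0.0)) - 1.0)
    return np.where(x > 0, x, negative_branch)


def relu(x):
    """ReLU, equation (3.5), for comparison."""
    return np.maximum(0.0, x)


print("ELU(-100):", elu(-100.0))  # saturates near -alpha
print("ELU(0):   ", elu(0.0))
print("ELU(2):   ", elu(2.0))

# Assumption: an illustrative input grid, symmetric about zero.
x = np.linspace(-3.0, 3.0, 61)
elu_mean = elu(x).mean()
relu_mean = relu(x).mean()
print("mean ELU output: ", elu_mean)
print("mean ReLU output:", relu_mean)
```

On inputs centered at zero, the negative branch pulls the mean ELU output below the mean ReLU output and therefore closer to zero, which is the mean-shifting effect described above.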
Figure 3.9. ELU function,