All About Activation Functions In Neural Networks

Ananda Hange
11 min read · Mar 2, 2021

Neural network activation functions are a crucial component of deep learning. Activation functions determine the output of a deep learning model, its accuracy, and also the computational efficiency of training a model — which can make or break a large-scale neural network. Activation functions also have a major effect on the neural network’s ability to converge and the convergence speed, or in some cases, activation functions might prevent neural networks from converging in the first place.

A Simple Artificial Neuron:

Deep learning models usually consist of many neurons stacked in layers. Let’s consider a single neuron for simplicity.

Single-layer Perceptron

The operations performed by a neuron are basically multiplication and summation, which are linear and produce an intermediate output. After this, an activation function is applied to produce the final output of the neuron.

Without an activation function, the above is just a linear function that maps inputs to outputs.

The neuron can then only approximate linear functions. As a result, the model cannot recognize complex patterns in the data.
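As a minimal sketch of these operations (the input values, weights, bias, and the choice of tanh as the activation are illustrative assumptions, not values from the article):

```python
import numpy as np

def neuron(x, w, b, activation):
    """Multiply inputs by weights, sum them with the bias, then apply the activation."""
    z = np.dot(w, x) + b       # linear part: multiplication and summation
    return activation(z)       # non-linear part: the activation function

x = np.array([0.5, -1.0, 2.0])   # example inputs (assumed)
w = np.array([0.4, 0.3, -0.2])   # example weights (assumed)
b = 0.1                          # example bias (assumed)

print(np.dot(w, x) + b)          # output without an activation: purely linear
print(neuron(x, w, b, np.tanh))  # final output after a non-linear activation
```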

Why are activation functions needed?

In order for neural networks to approximate non-linear or complex functions, there has to be a way to add a non-linear property to the computation of results.

In a neural network, numeric data points, called inputs, are fed into the neurons in the input layer. Each input to a neuron has a weight; multiplying the inputs by their weights and summing them gives the output of the neuron, which is transferred to the next layer.

The activation function is a mathematical “gate” in between the input feeding the current neuron and its output going to the next layer. It can be as simple as a step function that turns the neuron output on and off, depending on a rule or threshold. Or it can be a transformation that maps the input signals into output signals that are needed for the neural network to function.

Increasingly, neural networks use non-linear activation functions, which can help the network learn complex data, compute and learn almost any function representing a question, and provide accurate predictions.

A. Can any non-linear function be used as an activation function?

No. Before a function can be considered a good candidate for deep learning models, it should have the following properties:

  1. Non-linear: This is required to introduce non-linearity in the model.
  2. Monotonic: A function that is either entirely non-increasing or non-decreasing.
  3. Differentiable: Deep learning algorithms update their weights via an algorithm called backpropagation. This algorithm only works when the activation function is differentiable, i.e., its derivative can be calculated.

B. Desirable features of an activation function:

1. Vanishing Gradient problem:

This problem occurs when we train very deep neural networks (hundreds or thousands of layers). Neural networks are trained using gradient descent. Gradient descent includes a backward propagation step, which essentially applies the chain rule to compute how much each weight should change in order to reduce the loss after every epoch. Consider a 4-layer neural network with 4 neurons in the input layer, 4 neurons in each hidden layer, and 1 neuron in the output layer.

4-layer Neural Network

Input layer

The neurons, colored in purple, represent the input data. These can be as simple as scalars or more complex like vectors or multidimensional matrices.

Hidden layers

The final values at the hidden neurons, colored in green, are computed using z^l — the weighted inputs in layer l, and a^l — the activations in layer l. For layers 2 and 3 the equations are:

  • l = 2: z² = W²x + b², a² = f(z²)
  • l = 3: z³ = W³a² + b³, a³ = f(z³)

Here f is the activation function, W² and W³ are the weights in layers 2 and 3, while b² and b³ are the biases in those layers.
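A minimal NumPy sketch of this forward pass for the 4-4-4-1 network above, assuming sigmoid activations, randomly initialized weights, and a linear output layer (all assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Layer sizes follow the figure: 4 inputs, two hidden layers of 4 neurons, 1 output neuron.
W2, b2 = rng.standard_normal((4, 4)), np.zeros(4)  # weights and biases of layer 2
W3, b3 = rng.standard_normal((4, 4)), np.zeros(4)  # weights and biases of layer 3
W4, b4 = rng.standard_normal((1, 4)), np.zeros(1)  # weights and biases of the output layer

x = rng.standard_normal(4)   # one input example

z2 = W2 @ x + b2             # z^2 = W^2 x + b^2
a2 = sigmoid(z2)             # a^2 = f(z^2)
z3 = W3 @ a2 + b3            # z^3 = W^3 a^2 + b^3
a3 = sigmoid(z3)             # a^3 = f(z^3)
y_hat = W4 @ a3 + b4         # output layer (assumed linear here)

print(y_hat)
```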

During backpropagation, we compute gradients in order to update the weights W.

Backpropagation and Gradient calculation

The weight w²₂₂ connects the input x₂ and z²₂, so computing its gradient requires applying the chain rule back through a³₂ and z³₂, and then through a²₂ and z²₂.
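As a sketch, the chain rule along this one path (assuming the cost C is computed from the network output ŷ, which layer 4 produces from a³; the full gradient sums such terms over all paths) is:

$$\frac{\partial C}{\partial w^{2}_{22}} = \frac{\partial C}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial a^{3}_{2}} \cdot \frac{\partial a^{3}_{2}}{\partial z^{3}_{2}} \cdot \frac{\partial z^{3}_{2}}{\partial a^{2}_{2}} \cdot \frac{\partial a^{2}_{2}}{\partial z^{2}_{2}} \cdot \frac{\partial z^{2}_{2}}{\partial w^{2}_{22}}$$

Here ∂a³₂/∂z³₂ and ∂a²₂/∂z²₂ are derivatives of the activation function, so the deeper the network, the more of these derivative factors get multiplied together.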

We multiply several gradients (derivatives) together, and after this multiplication we are likely to end up with very small values.

In other words, the gradients of the earlier layers tend to vanish because of the depth of the network and because the activation function pushes the gradient values towards zero. This is called the vanishing gradient problem. So we want our activation function not to push the gradient towards zero.

2. Zero-Centered:

The output of the activation function should be symmetric around zero so that the gradients do not get pushed in a particular direction.

3. Computational Expense:

Activation functions are applied after every layer and need to be calculated millions of times in deep networks. Hence, they should be computationally inexpensive to calculate.

C. What is a Neural Network Activation Function?

Activation functions are mathematical equations that determine the output of a neural network. The function is attached to each neuron in the network, and determines whether it should be activated (“fired”) or not, based on whether each neuron’s input is relevant for the model’s prediction. Activation functions also help normalize the output of each neuron to a range between 0 and 1 or between -1 and 1.

An additional aspect of activation functions is that they must be computationally efficient because they are calculated across thousands or even millions of neurons for each data sample. Modern neural networks use a technique called backpropagation to train the model, which places an increased computational strain on the activation function, and its derivative function.

D. Types of Activation Functions

1. Binary Step Function:

A binary step function is a threshold-based activation function. If the input value is above a certain threshold, the neuron is activated and sends exactly the same signal to the next layer; if it is below the threshold, the neuron is not activated and outputs zero.

Binary Step Function Plot
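A minimal sketch of a binary step function (the threshold of 0 is an assumption for illustration):

```python
import numpy as np

def binary_step(x, threshold=0.0):
    """Output 1 when the input reaches the threshold, 0 otherwise."""
    return np.where(x >= threshold, 1.0, 0.0)

print(binary_step(np.array([-2.0, -0.1, 0.0, 0.3, 5.0])))  # [0. 0. 1. 1. 1.]
```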

The problem with a step function is that it does not allow multi-value outputs — for example, it cannot support classifying the inputs into one of several categories.

2. Linear Activation Function:

A linear activation function takes the form: A = cx

Linear Function Plot

It takes the inputs, multiplied by the weights for each neuron, and creates an output signal proportional to the input. In one sense, a linear function is better than a step function because it allows multiple outputs, not just yes and no.

However, a linear activation function has two major problems:

a. Not possible to use backpropagation (gradient descent) to train the model — the derivative of the function is a constant and has no relation to the input, X. So it’s not possible to go back and understand which weights in the input neurons can provide a better prediction.

b. All layers of the neural network collapse into one — with linear activation functions, no matter how many layers in the neural network, the last layer will be a linear function of the first layer (because a linear combination of linear functions is still a linear function). So a linear activation function turns the neural network into just one layer.
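A quick numerical sketch of the collapse argument (random illustrative matrices, biases omitted for brevity): two stacked linear layers behave exactly like a single linear layer whose weight matrix is the product of the two.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((4, 3))   # first "linear layer"
W2 = rng.standard_normal((2, 4))   # second "linear layer"
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x)         # pass x through layer 1, then layer 2
one_layer = (W2 @ W1) @ x          # a single equivalent linear layer
print(np.allclose(two_layers, one_layer))  # True
```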

A neural network with a linear activation function is simply a linear regression model. It has limited power and a limited ability to handle complex, varying input data.

E. Non-Linear Activation Functions

Modern neural network models use non-linear activation functions. They allow the model to create complex mappings between the network’s inputs and outputs, which are essential for learning and modeling complex data, such as images, video, audio, and data sets that are non-linear or have high dimensionality.

Almost any process imaginable can be represented as a functional computation in a neural network, provided that the activation function is non-linear.

Non-linear functions address the problems of a linear activation function:

  1. They allow backpropagation because they have a derivative function that is related to the inputs.
  2. They allow the “stacking” of multiple layers of neurons to create a deep neural network. Multiple hidden layers of neurons are needed to learn complex data sets with high levels of accuracy.

Common Nonlinear Activation Functions and How to Choose an Activation Function

1. Sigmoid / Logistic:

The logistic function is used in logistic regression as the squashing function to squash any real-valued input, including outlier points, into the limited range [0, 1].

Sigmoid Function : f(x) = 1/(1+exp(-x))

Derivative of sigmoid: df(x)=f(x)*(1-f(x))

Code:
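A minimal sketch that computes and plots the sigmoid and its derivative (the use of NumPy and matplotlib here is an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)           # df(x) = f(x) * (1 - f(x))

x = np.linspace(-10, 10, 200)
plt.plot(x, sigmoid(x), label="sigmoid")
plt.plot(x, d_sigmoid(x), label="derivative")
plt.legend()
plt.title("Sigmoid Function and Its Derivative")
plt.show()
```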

Sigmoid Function and Its Derivative

Advantage:

  • Smooth gradient, preventing “jumps” in output values.
  • Output values bound between 0 and 1, normalizing the output of each neuron.
  • Clear predictions — For X above 2 or below -2, tends to bring the Y value (the prediction) to the edge of the curve, very close to 1 or 0. This enables clear predictions.

Disadvantages:

  • Vanishing gradient — for very high or very low values of X, there is almost no change to the prediction, causing a vanishing gradient problem. This can result in the network refusing to learn further, or being too slow to reach an accurate prediction.
  • Outputs not zero centered.
  • Computationally expensive.

When to use Sigmoid:

(i) If you want an output value between 0 and 1, use sigmoid at the output layer neurons only.

(ii) When you are doing a binary classification problem, use sigmoid at the output layer.

Otherwise, sigmoid is not preferred.

2. TanH / Hyperbolic Tangent:

The tanh function is just another possible function that can be used as a nonlinear activation function between layers of a neural network. It actually shares a few things in common with the sigmoid activation function. They both look very similar. But while a sigmoid function will map input values to be between 0 and 1, Tanh will map values to be between -1 and 1.

Plot for Sigmoid vs. tanh

Like the sigmoid function, one of the interesting properties of the tanh function is that its derivative can be expressed in terms of the function itself. Below is the formula for the tanh function along with the formula for its derivative:

tanh Function : f(x) = (exp(x) - exp(-x))/(exp(x) + exp(-x))

Derivative of tanh : df(x) = 1 - f(x)²

Code:
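A minimal sketch of the tanh function and its derivative, in the same spirit as the sigmoid snippet (NumPy and matplotlib assumed):

```python
import numpy as np
import matplotlib.pyplot as plt

def tanh(x):
    return np.tanh(x)               # (exp(x) - exp(-x)) / (exp(x) + exp(-x))

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2    # derivative: 1 - tanh(x)^2

x = np.linspace(-10, 10, 200)
plt.plot(x, tanh(x), label="tanh")
plt.plot(x, d_tanh(x), label="derivative")
plt.legend()
plt.title("tanh Function and Its Derivative")
plt.show()
```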

tanh function and its derivative plot

Advantages

  • Zero centered — making it easier to model inputs that have strongly negative, neutral, and strongly positive values. Tanh is usually used in the hidden layers of a neural network since its values lie between -1 and 1, so the mean of the hidden-layer activations comes out to be 0 or very close to it, which helps center the data by bringing the mean close to 0. This makes learning for the next layer much easier.
  • Otherwise like the Sigmoid function.

Disadvantages

  • Like the Sigmoid function

3. ReLU (Rectified Linear Unit)

ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At any given time only a few neurons are activated, making the network sparse and therefore efficient and easy to compute.

  • Equation: f(x) = max(0, x). It gives an output of x if x is positive and 0 otherwise.
  • Derivative: f’(x) = 1 if x > 0, 0 if x < 0, and undefined at x = 0.
  • Value range: [0, ∞)

It avoids and rectifies the vanishing gradient problem to some extent. Almost all deep learning Models use ReLU nowadays.

Code:
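A minimal sketch of ReLU and its derivative (NumPy and matplotlib assumed; the derivative at x = 0 is set to 0 by convention):

```python
import numpy as np
import matplotlib.pyplot as plt

def relu(x):
    return np.maximum(0.0, x)            # f(x) = max(0, x)

def d_relu(x):
    return np.where(x > 0, 1.0, 0.0)     # 1 for x > 0, 0 otherwise

x = np.linspace(-10, 10, 200)
plt.plot(x, relu(x), label="ReLU")
plt.plot(x, d_relu(x), label="derivative")
plt.legend()
plt.title("ReLU Function and Its Derivative")
plt.show()
```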

ReLU Function and Its Derivative Plot

Advantages

  • Computationally efficient — allows the network to converge very quickly
  • Non-linear — although it looks like a linear function, ReLU has a derivative function and allows for backpropagation

Disadvantages

  • The Dying ReLU problem — when inputs approach zero or are negative, the gradient of the function becomes zero, so the network cannot perform backpropagation and cannot learn.

4. Leaky ReLU :

To fix the dying ReLU problem, another modification of ReLU was introduced, called Leaky ReLU.

The difference in ReLU (left) and Leaky ReLU (right)

Leaky ReLU is defined as f(x) = x for x > 0 and f(x) = a*x otherwise, where a is a small constant. The leak helps to increase the range of the ReLU function; usually, the value of a is around 0.01. When a is chosen randomly rather than fixed, the function is called Randomized ReLU.

Therefore the range of Leaky ReLU is (-infinity, infinity).

Code :
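A minimal sketch of Leaky ReLU and its derivative with a = 0.01 (NumPy and matplotlib assumed):

```python
import numpy as np
import matplotlib.pyplot as plt

def leaky_relu(x, a=0.01):
    return np.where(x > 0, x, a * x)     # f(x) = x for x > 0, a*x otherwise

def d_leaky_relu(x, a=0.01):
    return np.where(x > 0, 1.0, a)       # derivative: 1 for x > 0, a otherwise

x = np.linspace(-10, 10, 200)
plt.plot(x, leaky_relu(x), label="Leaky ReLU")
plt.plot(x, d_leaky_relu(x), label="derivative")
plt.legend()
plt.title("Leaky ReLU Function and Its Derivative")
plt.show()
```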

Leaky ReLU Function and Its Derivative Plot

Advantages

  • Prevents dying ReLU problem — this variation of ReLU has a small positive slope in the negative area, so it does enable backpropagation, even for negative input values
  • Otherwise like ReLU

Disadvantages

  • Results not consistent — leaky ReLU does not provide consistent predictions for negative input values.

5. Softmax :

The softmax function is an activation function that turns numbers into probabilities that sum to one. The softmax function outputs a vector that represents the probability distributions of a list of outcomes. It is also a core element used in deep learning classification tasks.

The softmax function turns logits such as [2.0, 1.0, 0.1] into probabilities such as [0.7, 0.2, 0.1], and the probabilities sum to 1. Logits are the raw scores output by the last layer of a neural network, before any activation takes place. To understand the softmax function, we must therefore look at the output of the (n-1)th layer.
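A minimal sketch of softmax applied to the logits from the example above (subtracting the maximum logit before exponentiating is a standard numerical-stability trick, not something described in the article):

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)      # shift for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)          # roughly [0.66, 0.24, 0.10]
print(probs.sum())    # 1.0
```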

Advantages

  • The softmax function is used when we have multiple classes.
  • It is useful for finding the class that has the maximum probability.
  • The Softmax function is ideally used in the output layer, where we are actually trying to attain the probabilities to define the class of each input.
  • It ranges from 0 to 1.

F. Which activation function to use?

So far, we have seen different types of activation functions along with their advantages and disadvantages. Now the question arises: which activation function should I use for my neural network?

It would be incredibly difficult to recommend an activation function that works for all use cases. There are many considerations — how difficult it is to compute the derivative (if it is differentiable at all!), how quickly a network with your chosen AF converges, how smooth it is, whether it satisfies the conditions of the universal approximation theorem, whether it preserves normalization, and so on. You may or may not care about some or any of those.

  • Sigmoid functions and their combinations generally work better in the case of binary classification problems.
  • Sigmoid and tanh functions are sometimes avoided due to the vanishing gradient problem.
  • Tanh is avoided most of the time due to dead neuron problems.
  • ReLU activation function is widely used and is the default choice as it yields better results.
  • If we encounter a case of dead neurons in our networks the leaky ReLU function is the best choice.
  • ReLU function should only be used in the hidden layers.
  • The output layer can use a linear activation function in the case of regression problems.
  • For multiclass classification, the Softmax activation function is most preferable.

Hope this article serves the purpose of giving an idea about activation functions: why they are needed, when to use them, and which one to use for a given problem statement.

Happy Learning!!!

Reference :

  1. https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/
  2. https://towardsdatascience.com/everything-you-need-to-know-about-activation-functions-in-deep-learning-models-84ba9f82c253#:~:text=Simply%20put%2C%20an%20activation%20function,fired%20to%20the%20next%20neuron.
