12 Activation Functions That You May Want To Consider — Part-1

Saurav Pawar · Feb 27, 2022

Before digging into activation functions, let’s first understand the answers to these three questions:

· What are neural networks?

· What is an activation function?

· Why is it needed?

What are neural networks?

Neural networks are simply networks of interconnected neurons, where each neuron is characterized by its weights, bias, and activation function.

Elements of a neural network:

Input layer:

The sole purpose of this layer is to receive the information (features) from the outside world and pass it to the network. No computation is performed in this layer; the information is passed straight on to the next layer (the hidden layer).

Hidden layer:

As the name suggests, the nodes of this layer are not exposed. It sits between the input layer and the output layer, carries out all the necessary computations, and passes the results on to the next layer (the output layer). A neural network typically has at least one hidden layer.

Output layer:

Finally, this layer gathers the information learned through the hidden layers and delivers it as the final result.

A typical neural network therefore looks like a chain: the input layer feeds into one or more hidden layers, which in turn feed into the output layer.
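To make this structure concrete, here is a minimal NumPy sketch of a forward pass through a tiny network; the layer sizes, random weights, and the use of tanh in the hidden layer are arbitrary choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(3,))                        # input layer: 3 features from the outside world
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # hidden layer: 4 neurons
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)    # output layer: 1 neuron

hidden = np.tanh(W1 @ x + b1)    # hidden layer does the computation (activation discussed below)
output = W2 @ hidden + b2        # output layer delivers the final result
print(output)
```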

What is an activation function?

An activation function decides whether a neuron/node should be activated, i.e. whether its information should be passed on, and it typically does that by checking whether the value is above a certain threshold.

Why is it needed?

  • The most important purpose of an activation function is to add non-linearity to our neural network. Activation functions introduce an additional step at each layer during forward propagation, so it is natural to question whether they are really needed. Let's understand this through an example!

Consider a neural network without an activation function. In that case, every neuron just performs a linear transformation of its inputs using weights and biases (W*x + b). This linear transformation isn't very useful on its own, because it is of degree 1 in x; the whole network then behaves like any other linear classifier, which is not enough to recognize the complex patterns we encounter in computer vision or natural language processing. Activation functions are used precisely so the network can learn such non-linear patterns (the short code sketch at the end of this section makes this concrete).

  • Apart from adding non-linearity to our neural network, activation functions also help keep the output of each neuron restricted to a certain range, according to our requirements.

For example:

Before the activation function is applied, each neuron computes the linear transformation W*x + b, where W is the weight, x is the input and b is the bias. If this value is not restricted, its magnitude can grow very large, especially in deep neural networks (those with more than one hidden layer), which can lead to computational issues.
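As promised above, here is a minimal NumPy sketch (with made-up shapes and random weights) showing that stacking layers without an activation function collapses into a single linear layer:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(5,))

# Two "layers" with no activation function: each is just W*x + b.
W1, b1 = rng.normal(size=(4, 5)), rng.normal(size=(4,))
W2, b2 = rng.normal(size=(3, 4)), rng.normal(size=(3,))

two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer: W = W2 @ W1, b = W2 @ b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True: the stack collapsed into one linear layer
```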

Now that we are ready with all the required knowledge to learn about activation functions, let’s get started!

Binary step function:

It is a threshold-based activation function: if the input is able to cross a particular threshold, the neuron is activated; otherwise it is deactivated (its output is not passed to the next layer).

Range: {0, 1}

Mathematical representation:

f(x) = 0 for x < 0
f(x) = 1 for x ≥ 0

Graph: a step that jumps from 0 to 1 at the threshold (x = 0).

Advantages:

  • A good choice for binary classification.
  • Simple to understand.

Disadvantages:

  • Can’t be used for multi-class classification.
  • The gradient (derivative) of this function is zero everywhere (and undefined at the threshold), which blocks the backpropagation process.
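Here is a minimal NumPy sketch of the binary step function described above; the threshold of 0 and the sample inputs are arbitrary choices:

```python
import numpy as np

def binary_step(x, threshold=0.0):
    # 1 if the input crosses the threshold, else 0
    return np.where(x >= threshold, 1.0, 0.0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(binary_step(x))  # [0. 0. 1. 1. 1.]
# The derivative is 0 everywhere (undefined at the threshold), so no gradient flows back.
```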

Linear activation function:

This function is also known as the identity function. Here the activation is simply proportional to the input.

Range: [-∞,∞]

Mathematical representation:

f(x) = x (or, more generally, f(x) = a*x for a constant a)

Graph: a straight line through the origin.

Advantages:

  • The output of this function is not confined to any range.
  • Simple to understand.

Disadvantages:

  • The gradient of this function is a constant, so it carries no information about the input and backpropagation cannot update the weights in a meaningful way.
  • The neural network therefore won't really learn anything, as the error term is not being improved.
  • Because its nature is linear, it doesn't matter how many hidden layers you add: in the end they all get squashed into a single layer. In other words, a linear activation function reduces the whole neural network to just one layer.
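A minimal NumPy sketch of the linear (identity-style) activation and its constant gradient; the slope a and the sample inputs are arbitrary:

```python
import numpy as np

def linear(x, a=1.0):
    # identity-style activation: the output is proportional to the input
    return a * x

def linear_grad(x, a=1.0):
    # the derivative is the constant a, regardless of the input
    return np.full_like(x, a)

x = np.array([-3.0, 0.0, 2.5])
print(linear(x))       # [-3.   0.   2.5]
print(linear_grad(x))  # [1. 1. 1.]; no information about the input flows back
```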

Sigmoid/Logistic activation function:

This function takes an input and squashes it to a value between 0 and 1. The more positive the input, the closer the output is to 1; similarly, the more negative the input, the closer the output is to 0.

Range: [0,1]

Mathematical representation:

f(x) = 1 / (1 + e^(-x))

Graph: an S-shaped curve rising smoothly from 0 towards 1.

Advantages:

  • A great choice when we have to predict the result as a probability, since probabilities range between 0 and 1.
  • The function is differentiable and provides a smooth gradient (no sudden breaks or jumps).
  • Gradient calculation is simple: the derivative can be written as sigmoid(x) * (1 - sigmoid(x)).

Disadvantages:

  • It suffers from the vanishing gradient problem: for strongly positive or negative inputs the gradient becomes very close to zero during backpropagation, making it difficult for the weights to get updated, so convergence becomes very slow. If the gradient effectively becomes zero, no learning happens at all.
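A minimal NumPy sketch of the sigmoid and its gradient, illustrating how the gradient shrinks towards zero for large positive or negative inputs (the sample inputs are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # simple closed-form gradient

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))       # outputs squashed into (0, 1)
print(sigmoid_grad(x))  # near 0 for large |x|: the vanishing gradient problem
```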

Tanh (hyperbolic tangent) function:

This activation function is similar to the sigmoid activation function; the major difference is that its range is [-1, 1] instead of [0, 1]. It also has an S-shaped curve, but it is zero-centered, since its outputs are spread symmetrically around 0.

Range: [-1,1]

Mathematical representation:

f(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Graph: an S-shaped curve rising from -1 to 1, centered at 0.

Advantages:

  • It is differentiable.
  • It maps negative inputs to strongly negative values, inputs near zero to values near zero, and positive inputs to strongly positive values.
  • Because its range is [-1, 1], the outputs can take different signs, which makes it easier to decide which values to emphasize and which to ignore.
  • It is a zero-centered activation function.

Disadvantages:

  • This function also suffers from the vanishing gradient problem, similar to the sigmoid activation function.
  • It is computationally expensive, as it involves exponentials.

Note: the tanh nonlinearity is generally preferred over the sigmoid nonlinearity because of its zero-centered nature.
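A minimal NumPy sketch of tanh and its gradient (the sample inputs are arbitrary); note that the outputs are zero-centered but the gradient still saturates at the extremes:

```python
import numpy as np

def tanh(x):
    return np.tanh(x)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2  # the gradient also saturates for large |x|

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(tanh(x))       # zero-centered outputs in (-1, 1)
print(tanh_grad(x))  # close to 0 at the extremes, just like sigmoid
```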

ReLU function:

ReLU stands for Rectified Linear Unit. It is the most commonly used activation function, appearing in most convolutional neural networks and deep learning models.

The special thing about this function is that it maps all negative inputs to zero and passes any positive value through unchanged (like a linear function).

Range: [0,∞]

Mathematical expression:

f(x) = max(0, x)

Graph: flat at 0 for negative inputs, then the line y = x for positive inputs.

Advantages:

  • It doesn't activate all the neurons at once (negative inputs produce an output of zero), which makes it computationally efficient compared to other activation functions such as sigmoid and tanh.
  • Its most important property is that it does not saturate for positive inputs, which speeds up the convergence of gradient descent.
  • For positive inputs the gradient is a constant 1, so it largely avoids the vanishing gradient problem.

Disadvantages:

  • It is piecewise linear in nature.
  • It sometimes faces the dying ReLU problem, where neurons that keep receiving negative inputs always output zero and stop learning.
  • It is generally used only in the hidden layers of a neural network.
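A minimal NumPy sketch of ReLU and its gradient (the sample inputs are arbitrary):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 for positive inputs, 0 for negative inputs (no saturation on the positive side)
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # negatives clipped to 0, positives passed through unchanged
print(relu_grad(x))  # the zero gradient for negatives is what causes "dying ReLU"
```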

Leaky ReLU function:

This is an improved version of the ReLU function, created to address the dying ReLU problem. Instead of outputting zero for negative inputs, it has a small positive slope in the negative region, controlled by an alpha value that is typically a small constant (values from about 0.01 to 0.3 are common).

Range: [-∞,∞]

Mathematical expression:

f(x) = x for x > 0
f(x) = alpha*x for x ≤ 0 (where alpha is the small negative-side slope)

Graph: a line with a small slope (alpha) for negative inputs and slope 1 for positive inputs.

Advantages:

  • It mitigates the dying ReLU problem, since the gradient is small but non-zero for negative inputs.
  • Like ReLU, it largely avoids the vanishing gradient problem.
  • It can speed up training, as the mean activation is closer to 0.
  • It keeps all the other advantages of the ReLU activation function.

Disadvantages:

  • Its outputs are unbounded on the positive side, so it can still contribute to the exploding gradient problem.
  • Like ReLU, it is piecewise linear, so its derivative is just a constant on each side of zero.
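A minimal NumPy sketch of Leaky ReLU and its gradient, using alpha = 0.01 as an example value (both alpha and the sample inputs are arbitrary choices):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # small slope alpha for negative inputs instead of a hard zero
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)  # the gradient never becomes exactly zero

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(x))       # negatives scaled down rather than zeroed out
print(leaky_relu_grad(x))  # non-zero everywhere, so neurons keep learning
```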

That's all for this article, guys!

Here is the next part of the series: Part 2

Let me know if you have any questions!

Bye!
