SiLU (Sigmoid Linear Unit) activation function
Easy:
Imagine you’re playing a game where you have to guess how many candies are in a jar. The game is really hard, and sometimes you guess too high, and sometimes too low. Now, imagine if there was a special rule that made it easier for you to guess. This rule looks at each guess you make and decides how much of it to count: sensible guesses are kept almost as they are, while wild guesses get toned down.
The SiLU activation function is like that special rule for a computer’s “brain” when it’s trying to learn from pictures or words. For every number inside the brain, it decides how much of that number to let through, which makes it easier for the brain to learn from the pictures or words it sees or hears.
Just like how the special rule helps you guess better in the game, the SiLU activation function helps the brain learn better from the pictures or words it sees or hears. It’s like a magic trick that makes learning easier and more fun!
Another easy example:
Imagine you have a special machine that takes numbers as input and gives you different numbers as output. Now, this SiLU thing is a special rule or formula we use inside this machine.
Here’s how it works:
The Input and Output Connection: When you put a number into this machine, let’s call it x, it does two things: First, it looks at that number and makes a decision about how big or small it is. Then, it does a little math with that number.
Making Decisions: It’s like asking the machine, “Hey, are you a big number or a small one?” If the number is big and positive, the machine says, “Okay, let’s keep almost all of it.” If the number is small or negative, the machine says, “Alright, let’s shrink it down toward zero.”
Doing Math: After making the decision, the machine takes the number and does some simple math with it: it multiplies the number by a special value between 0 and 1 that comes from that decision.
The Output: Finally, the machine gives you a new number as the output. This new number depends on the original number you put in, but it’s been changed a little according to the decision the machine made and the math it did.
So, in simple terms, SiLU is like having a friend who looks at a number, decides if it’s big or small, does a little math with it, and then gives you back a new number. It’s a special way for our machine to change numbers while following some rules.
Moderate:
The SiLU (Sigmoid-weighted Linear Unit) activation function is a relatively new contender in the world of neural network activation functions. It offers some advantages over more traditional options like ReLU (Rectified Linear Unit).
Here’s a breakdown of what SiLU is and how it works:
Function:
The SiLU activation function is calculated by multiplying the input value (x) by the sigmoid of that same input value. Mathematically, it’s written as:
silu(x) = x * sigmoid(x)
The sigmoid function itself squashes any real number to a value between 0 and 1. So, SiLU essentially takes the input, scales it by a value between 0 and 1 based on the sigmoid function’s output for that input, and returns the result.
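To make the formula concrete, here is a minimal NumPy sketch (the function names and example inputs are just illustrative, not taken from any particular library):

```python
import numpy as np

def sigmoid(x):
    # Logistic function: maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    # SiLU: the input scaled by its own sigmoid.
    return x * sigmoid(x)

# Large negative inputs are squashed toward 0,
# large positive inputs pass through almost unchanged.
x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(silu(x))  # approx. [-0.0335, -0.2689, 0.0, 0.7311, 4.9665]
```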
Properties:
Smooth and Non-monotonic: Unlike ReLU, which has a sharp kink at zero, SiLU’s curve is smooth because of the sigmoid influence. Additionally, SiLU is non-monotonic: for negative inputs it dips slightly below zero, reaching a minimum near x ≈ -1.28 before increasing again.
Bounded Below: While unbounded above, SiLU’s output never drops below about -0.28, the value of that global minimum.
Self-stabilizing: SiLU has a “soft floor” effect. Its derivative is zero at the global minimum (x ≈ -1.28), which acts as a kind of implicit regularizer, discouraging weights from growing too large during training; the quick numerical check after this list confirms these numbers.
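These figures are easy to verify. The short sketch below scans a dense grid of inputs and locates SiLU’s global minimum numerically (plain NumPy, no calculus or autograd involved):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))  # same as x * sigmoid(x)

# Scan a dense grid of inputs to locate the global minimum numerically.
x = np.linspace(-10.0, 10.0, 2_000_001)
y = silu(x)
i = np.argmin(y)
print(f"minimum value ~ {y[i]:.4f} at x ~ {x[i]:.4f}")
# Expected output (approximately): minimum value ~ -0.2784 at x ~ -1.2785
```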
Benefits:
Addresses “Dying ReLU” Problem: ReLU neurons can become permanently inactive (always outputting zero) if their inputs stay negative, for example after a large weight update. SiLU’s smooth, non-zero response for moderately negative inputs helps avoid this problem.
Can work well with Batch Normalization: The self-stabilizing property can help with training in conjunction with Batch Normalization.
Overall, SiLU is a promising activation function that offers advantages over ReLU in terms of smoothness and preventing dead neurons. It’s worth considering for your next neural network project!
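If you want to try it, PyTorch ships SiLU as the built-in nn.SiLU module. Here is a minimal sketch of dropping it into a small network (the layer sizes are arbitrary placeholders):

```python
import torch
import torch.nn as nn

# A tiny feed-forward block using the built-in SiLU activation.
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.SiLU(),          # equivalent to x * torch.sigmoid(x)
    nn.Linear(64, 10),
)

x = torch.randn(32, 128)   # a batch of 32 random feature vectors
out = model(x)
print(out.shape)           # torch.Size([32, 10])
```

torch.nn.functional.silu is the functional equivalent if you prefer not to use the module form.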
Hard:
SiLU, also known as Swish, is a self-gated activation function. It was proposed by Elfwing et al. in “Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning” (2017) and independently popularized as Swish by Ramachandran et al. in “Searching for Activation Functions” (2017). It is a smooth, continuous, and differentiable function that combines the advantages of the ReLU and sigmoid activation functions.
The mathematical formula for the SiLU activation function is:
f(x) = x * sigmoid(x) = x * (1 / (1 + e^(-x)))
Here, x is the input to the activation function, and sigmoid(x) is the sigmoid function, which is defined as:
sigmoid(x) = 1 / (1 + e^(-x))
The SiLU function applies the sigmoid function element-wise to the input, and then multiplies the result by the original input. This self-gating mechanism allows the function to adaptively scale its inputs based on their activation levels.
In comparison to ReLU, which is piecewise linear with a constant gradient of 1 for x > 0 and exactly 0 for x < 0, SiLU increases more gradually in the positive region. This continuous slope provides a more nuanced non-linearity, which can be beneficial for certain machine learning tasks. It also alleviates the “dying ReLU” problem to some extent, where ReLU units can get stuck in a state where their output is always 0, leading to vanishing gradients during training.
Unlike the sigmoid function, which saturates at both extremes (outputting values close to 0 and 1, with near-zero gradients, for very negative and very positive inputs, respectively), SiLU’s gradient approaches 1 for large positive inputs rather than vanishing. This property helps the function avoid the vanishing gradient problem often encountered when using sigmoid or tanh activations, particularly in deep neural networks.
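To see this concretely, the derivative of SiLU follows from the product rule: silu'(x) = sigmoid(x) * (1 + x * (1 - sigmoid(x))). The small sketch below compares it with the sigmoid’s own derivative at a few positive inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu_grad(x):
    # d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (2.0, 5.0, 10.0):
    print(f"x={x:>4}:  silu'={silu_grad(x):.4f}   sigmoid'={sigmoid_grad(x):.6f}")
# For large positive x, silu' approaches 1 while sigmoid' shrinks toward 0.
```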
Compared to other commonly used activation functions such as ReLU or sigmoid, the SiLU activation function has several advantages. Firstly, it introduces smoothness in the output, which can help improve gradient flow during backpropagation and reduce the vanishing gradient problem. Secondly, the SiLU activation function has been shown to have better performance than traditional activation functions on certain tasks, particularly in deep residual networks. This is because the SiLU activation function allows for more expressive representations and enables smoother optimization landscapes.
However, one potential downside of the SiLU activation function is its computational complexity compared to simpler activation functions like ReLU. Specifically, computing the exponential term in the denominator requires additional computation resources, which could potentially slow down training times. Nonetheless, with the increasing availability of powerful hardware and optimized libraries, this tradeoff may become less significant over time.
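If you want a rough feel for that overhead, a throwaway micro-benchmark like the one below works; absolute timings depend entirely on your hardware and library versions, so treat it as a sketch rather than a definitive measurement.

```python
import timeit
import numpy as np

x = np.random.randn(1_000_000).astype(np.float32)

def relu(v):
    return np.maximum(v, 0.0)

def silu(v):
    return v / (1.0 + np.exp(-v))  # v * sigmoid(v)

# ReLU is a single comparison per element; SiLU additionally needs an exp
# and a division, which is where the extra cost comes from.
print("relu:", timeit.timeit(lambda: relu(x), number=100))
print("silu:", timeit.timeit(lambda: silu(x), number=100))
```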
SiLU has been shown to improve the performance of various deep learning models, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), in computer vision, natural language processing, and other domains. It has become a popular activation function in recent years, with many researchers exploring its potential benefits over traditional activation functions.
A few books on deep learning that I am reading: