Comprehensive Guide on Softmax function
What is the Softmax function?
The Softmax function is defined as follows:

$$\mathrm{softmax}(\boldsymbol{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Where:

$z_i$ is the $i$-th value in the vector $\boldsymbol{z}$
$K$ is the dimension of the vector $\boldsymbol{z}$
The Softmax function is a practical function that turns a vector of real numbers into probabilities that sum up to one.
Example of using the Softmax function
Consider the following input vector:

$$\boldsymbol{z} = (2, 1, 0.1)$$

Using the formula for Softmax:

$$\mathrm{softmax}(\boldsymbol{z})_1 = \frac{e^{2}}{e^{2} + e^{1} + e^{0.1}} \approx 0.659$$

$$\mathrm{softmax}(\boldsymbol{z})_2 = \frac{e^{1}}{e^{2} + e^{1} + e^{0.1}} \approx 0.242$$

$$\mathrm{softmax}(\boldsymbol{z})_3 = \frac{e^{0.1}}{e^{2} + e^{1} + e^{0.1}} \approx 0.099$$

Therefore, we have that:

$$\mathrm{softmax}(\boldsymbol{z}) \approx (0.659, 0.242, 0.099)$$
Note the following:

the entries of the output sum to 1, which means you can interpret them as probabilities.
the raw inputs to the Softmax function are sometimes referred to as logits.
Application to neural networks
When modelling with neural networks, we often run into the Softmax function. Suppose we want to build a neural network that classifies an image as a cat, a dog or a bird. In such a case, we often use the Softmax function as the activation function for the final layer. If, for a particular image, the final layer outputs the probabilities (0.7, 0.2, 0.1), then these are saying that the model is:
70% sure the image is a cat
20% sure the image is a dog
10% sure the image is a bird
If you only need the predicted class and not the probabilities, then the Softmax function is not strictly necessary: since the exponential function is strictly increasing, the class with the largest raw score is also the class with the largest Softmax probability.
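To make this concrete, here is a minimal sketch (the raw scores below are made up for the cat/dog/bird example) showing that the argmax of the raw scores and the argmax of the Softmax probabilities pick the same class:

import numpy as np

scores = np.array([2.0, 1.0, 0.1])             # hypothetical raw scores for cat, dog, bird
probs = np.exp(scores) / np.sum(np.exp(scores))

np.argmax(scores)   # 0 (cat)
np.argmax(probs)    # 0 (cat), the same predicted class either way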
Comparison with Sigmoid function
Both the Softmax and sigmoid functions map inputs to a range of 0 to 1. However, the difference is that the outputs of the sigmoid do not sum to one as probabilities should, because the sigmoid is applied to each value independently.
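As a quick illustration, here is a small sketch (with a hand-written element-wise sigmoid, since none is defined in this guide) comparing the two on the same vector:

import numpy as np

z = np.array([2.0, 1.0, 0.1])

sigmoid_out = 1 / (1 + np.exp(-z))            # element-wise, each value lies in (0, 1)
softmax_out = np.exp(z) / np.sum(np.exp(z))   # also in (0, 1), but normalised

np.sum(sigmoid_out)   # roughly 2.14, does not sum to one
np.sum(softmax_out)   # 1.0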
Implementing Softmax function using Python's NumPy
Basic implementation
We can easily implement the Softmax function as described by the equation above:

import numpy as np

def softmax(x):
    """ x: 1D NumPy array of inputs """
    return np.exp(x) / np.sum(np.exp(x))
Let's use this function to compute the Softmax of the vector $\boldsymbol{z} = (2, 1, 0.1)$ from earlier:
softmax([2, 1, 0.1])
array([0.65900114, 0.24243297, 0.09856589])
Notice how the output is identical to what we calculated by hand.
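As an additional sanity check, we can confirm that the output entries sum to one (this reuses the softmax function defined above):

np.sum(softmax([2, 1, 0.1]))   # 1.0 (up to floating-point rounding)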
Optimised implementation
Our basic implementation of the Softmax function is based directly on the definition given above. The problem with this implementation is that exponential functions grow extremely quickly. For instance, computing $e^{100}$ gives:
np.exp(100)
2.6881171418161356e+43
Notice how even a relatively small input of 100 already produces an astronomically large number. For a larger input such as 800, the result can no longer be represented at all:

np.exp(800)
inf
This happens because computers represent numerical values using a fixed number of bytes (e.g. 8 bytes). The caveat is that extremely small or large numbers cannot be represented simply because there aren't enough bytes. If a number is so large that it cannot be represented using a fixed number of bytes, then NumPy will return inf.
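We can query this limit directly: NumPy stores these values as 64-bit (8-byte) floats, and np.finfo reports the largest number that format can hold. Since $e^{800}$ is roughly $10^{347}$, it falls well outside this range:

import numpy as np

np.finfo(np.float64).max   # 1.7976931348623157e+308, the largest representable 64-bit float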
This limitation of our basic implementation means that large inputs will fail:
softmax([800, 500, 600])
array([nan,  0.,  0.])
Here, nan stands for not-a-number, which means the value could not be computed: the numerator and denominator both overflow to inf, and inf divided by inf is undefined. For this reason, the basic implementation is never used in practice.
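We can reproduce the source of the nan directly: for the input [800, 500, 600], the numerator $e^{800}$ and the denominator both overflow to inf, and dividing them is undefined:

import numpy as np

np.exp(800) / (np.exp(800) + np.exp(500) + np.exp(600))   # nan, because inf / inf is undefined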
The way to overcome this limitation is to reformulate the Softmax function like so:

$$\mathrm{softmax}(\boldsymbol{z})_i = \frac{e^{C}\,e^{z_i}}{\sum_{j=1}^{K} e^{C}\,e^{z_j}} = \frac{e^{z_i + C}}{\sum_{j=1}^{K} e^{z_j + C}}$$

Note that all we have done is multiplied the numerator and denominator by some scalar constant $e^{C}$, which leaves the value of the fraction unchanged. Let's now understand why this helps: by the law of exponents, $e^{C}\,e^{z_i} = e^{z_i + C}$, so instead of exponentiating the raw values $z_i$ we exponentiate the shifted values $z_i + C$, and we are free to pick $C$ so that these shifted values are small.
Now, what is a good value of $C$? The standard choice is the negative of the maximum entry of $\boldsymbol{z}$, that is, $C = -\max(\boldsymbol{z})$. For instance, consider the following input vector:

$$\boldsymbol{z} = (800, 500, 600)$$

The negative of the maximum of $\boldsymbol{z}$ is $-800$. Adding this constant to every entry gives $(0, -300, -200)$. Notice how we now avoid exponentiating large positive numbers: the largest shifted value is $0$, and $e^{0} = 1$, while the very negative values simply underflow towards zero instead of overflowing to inf.
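To see this numerically, we can exponentiate the shifted values directly; the largest result is exactly 1 and nothing overflows:

import numpy as np

np.exp([0, -300, -200])
# array([1.00000000e+000, 5.14820022e-131, 1.38389653e-087])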
The implementation of the numerically stable Softmax function is as follows:
def softmax(x):
    """ x: 1D NumPy array of inputs """
    x = np.asarray(x, dtype=float)   # accept lists as well as arrays, and avoid mutating the input
    c = -np.max(x)                   # shift by the negative of the maximum entry
    x = x + c
    return np.exp(x) / np.sum(np.exp(x))
Now, we can use the function like so:
softmax([800, 500, 600])
array([1.00000000e+000, 5.14820022e-131, 1.38389653e-087])
Notice how we do not have any nan values this time.
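In practice, neural networks typically apply Softmax to a whole batch of score vectors at once. Below is a minimal sketch of how the same stabilising trick extends to a 2D array with one row per example (the function name softmax_2d and the sample values are just for illustration):

import numpy as np

def softmax_2d(x):
    """ x: 2D NumPy array of shape (batch, classes) """
    x = np.asarray(x, dtype=float)
    x = x - np.max(x, axis=1, keepdims=True)   # shift each row by its own maximum
    e = np.exp(x)
    return e / np.sum(e, axis=1, keepdims=True)

softmax_2d([[800, 500, 600],
            [2, 1, 0.1]])
# each row of the result sums to one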