Comprehensive Guide on Softmax function

Last updated: Jul 1, 2022
Tags: Machine Learning, Python

What is the Softmax function?

The Softmax function is defined as follows:

$$\begin{equation}\label{eq:yHWjjyQou5VFcDGhpZV} y_i=\frac{e^{x_i}}{\sum^N_{j=1}e^{x_j}} \end{equation}$$


  • $x_i$ is the $i$-th value in the vector $\boldsymbol{x}$

  • $N$ is the dimension of the vector $\boldsymbol{x}$

The Softmax function turns a vector of real numbers into a vector of probabilities, that is, values between 0 and 1 that sum to one.

Example of using the Softmax function

Consider the following input vector:

$$\begin{equation}\label{eq:ei1XptgAwaMk77sju1d} \boldsymbol{x}=\begin{pmatrix} 2\\ 1\\ 0.1\\ \end{pmatrix} \end{equation}$$

Using the formula for Softmax \eqref{eq:yHWjjyQou5VFcDGhpZV} gives us:

$$\begin{align*} y_1&=\frac{e^2}{e^2+e^{1}+e^{0.1}}\approx0.66\\ y_2&=\frac{e^1}{e^2+e^{1}+e^{0.1}}\approx0.24\\ y_3&=\frac{e^{0.1}}{e^2+e^{1}+e^{0.1}}\approx0.10\\ \end{align*}$$

Therefore, we have that:

$$\boldsymbol{y}=\begin{pmatrix} 0.66\\ 0.24\\ 0.10\\ \end{pmatrix}$$

Note the following:

  • the entries of the output sum to $1$, which means you can interpret them as probabilities.

  • the input to the Softmax function $\boldsymbol{x}$ is sometimes referred to as the logits.

Application to neural networks

When modelling with neural networks, we often run into the Softmax function. Suppose we want to build a neural network that classifies whether an image shows a cat, a dog or a bird. In such a case, we often use the Softmax function as the activation function for the final layer. Suppose, for example, that the final layer outputs the probabilities $(0.7, 0.2, 0.1)$. These output probabilities say that the model is:

  • 70% sure the image is a cat

  • 20% sure the image is a dog

  • 10% sure the image is a bird

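If you only need the predicted class and not the probabilities, then the Softmax function is not necessary: the exponential is strictly increasing, so the largest input always yields the largest output, and the predicted class can be read directly from the inputs to the Softmax.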
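
To illustrate why the Softmax can be skipped when only the predicted class matters, here is a minimal sketch. The scores and the label order (cat, dog, bird) are hypothetical values chosen for illustration, not outputs of a real model.

```python
import numpy as np

def softmax(x):
    """Plain softmax for a 1D array of modest-sized inputs."""
    e = np.exp(np.asarray(x, dtype=float))
    return e / e.sum()

# Hypothetical raw scores from the final layer, in the order (cat, dog, bird)
scores = np.array([2.0, 1.0, 0.1])
labels = ["cat", "dog", "bird"]

# The index of the largest raw score ...
pred_from_scores = labels[int(np.argmax(scores))]
# ... is the same as the index of the largest probability
pred_from_probs = labels[int(np.argmax(softmax(scores)))]

print(pred_from_scores, pred_from_probs)
```

Both predictions agree because the Softmax preserves the ordering of its inputs.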

Comparison with Sigmoid function

Both the Softmax and sigmoid functions map inputs to values between 0 and 1. However, the difference is that the outputs of the sigmoid, which is applied to each entry independently, do not generally sum to one as the probabilities of a single distribution should.
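
The difference can be checked numerically. The following is a small sketch that applies both functions to the same vector; the helper names `sigmoid` and `softmax` are defined here for illustration only.

```python
import numpy as np

def sigmoid(x):
    """Element-wise sigmoid: each entry is squashed independently."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

def softmax(x):
    """Plain softmax: entries are normalised jointly."""
    e = np.exp(np.asarray(x, dtype=float))
    return e / e.sum()

x = np.array([2.0, 1.0, 0.1])
sig = sigmoid(x)
soft = softmax(x)

print(sig, sig.sum())    # every entry lies in (0, 1), but the sum exceeds 1
print(soft, soft.sum())  # every entry lies in (0, 1), and the sum is 1
```

For this input the sigmoid outputs add up to roughly 2.14, whereas the Softmax outputs add up to one.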

Implementing Softmax function using Python's NumPy

Basic implementation

We can easily implement the Softmax function as described by equation \eqref{eq:yHWjjyQou5VFcDGhpZV} using NumPy like so:

import numpy as np

def softmax(x):
    """ x: 1D NumPy array of inputs """
    return np.exp(x) / np.sum(np.exp(x))

Let's use this function to compute the Softmax of vector \eqref{eq:ei1XptgAwaMk77sju1d}:

softmax([2, 1, 0.1])
array([0.65900114, 0.24243297, 0.09856589])

Notice how the output matches what we calculated by hand, up to rounding.

Optimised implementation

Our basic implementation of the Softmax function is based directly on the definition of the Softmax function as described by \eqref{eq:yHWjjyQou5VFcDGhpZV}:

$$y_i=\frac{e^{x_i}}{\sum^N_{j=1}e^{x_j}}$$

The problem with this implementation is that the exponential function $e^x$ quickly becomes large as the value of $x$ increases. For instance, $\exp(100)\approx2.69\times10^{43}$.

Notice how even a modest input of $x=100$ results in an extremely large number. In fact, if we try $\exp(800)$, the value is so large that it cannot be represented as a floating-point number.


This happens because computers represent numerical values using a fixed number of bytes (e.g. 8 bytes for NumPy's default float64 type). The caveat is that numbers of extremely small or large magnitude cannot be represented simply because there aren't enough bits. If a number is too large to be represented using this fixed number of bytes, then NumPy will return inf.
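
The overflow can be reproduced with a short sketch. It assumes NumPy's default 8-byte float64 type; `np.errstate` is used only to silence the overflow warning that NumPy would otherwise emit.

```python
import numpy as np

# exp(100) is large but still representable as a float64
print(np.exp(100.0))

# The largest finite float64 value, roughly 1.8e308
print(np.finfo(np.float64).max)

# exp(800) exceeds that limit, so NumPy returns inf
with np.errstate(over="ignore"):
    big = np.exp(800.0)
print(big)
```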

This limitation of our basic implementation means that large inputs will fail:

softmax([800, 500, 600])
array([nan,  0.,  0.])

Here, nan stands for not-a-number. It arises because $\exp(800)$ overflows to inf, which makes both the first numerator and the denominator inf, and inf divided by inf is undefined. For this reason, the basic implementation is rarely used in practice.
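
To see where the failure comes from, the basic computation can be broken into its intermediate steps. This is a sketch for illustration; the variable names are not part of the original implementation, and `np.errstate` only suppresses the warnings.

```python
import numpy as np

x = np.array([800.0, 500.0, 600.0])

with np.errstate(over="ignore", invalid="ignore"):
    numerators = np.exp(x)             # the first entry overflows to inf
    denominator = numerators.sum()     # inf plus finite values is inf
    result = numerators / denominator  # inf / inf is nan, finite / inf is 0

print(numerators)
print(denominator)
print(result)
```

Only the entry whose exponential overflowed becomes nan; the remaining entries are divided by inf and collapse to 0.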

The way to overcome this limitation is to reformulate the Softmax function like so:

$$\begin{equation}\label{eq:u0SjbfEiloxYNtGxR2o} \begin{aligned}[b] y_i&=\frac{\exp(x_i)}{\sum^N_{j=1}\exp(x_j)}\\ &=\frac{C\cdot\exp(x_i)}{C\cdot\sum^N_{j=1}\exp(x_j)}\\ &=\frac{\exp(\ln(C))\cdot\exp(x_i)}{\exp(\ln(C))\cdot\sum^N_{j=1}\exp(x_j)}\\ &=\frac{\exp(\ln(C)+x_i)}{\sum^N_{j=1}\exp(\ln(C)+x_j)}\\ &=\frac{\exp(C'+x_i)}{\sum^N_{j=1}\exp(C'+x_j)}\\ \end{aligned} \end{equation}$$

Note that all we have done is multiply the numerator and denominator by some positive scalar constant $C$ and then write $C'=\ln(C)$, and hence \eqref{eq:u0SjbfEiloxYNtGxR2o} is equivalent to the original equation of the Softmax function \eqref{eq:yHWjjyQou5VFcDGhpZV}.
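
This equivalence can be checked numerically. The sketch below adds a few arbitrarily chosen constants to a small input vector and compares the results; the values of `c_prime` are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Plain softmax, safe here because the inputs are small."""
    e = np.exp(np.asarray(x, dtype=float))
    return e / e.sum()

x = np.array([2.0, 1.0, 0.1])
reference = softmax(x)

# Adding the same constant to every entry should leave the output unchanged
for c_prime in (-5.0, 0.0, 3.0):
    shifted = softmax(x + c_prime)
    print(c_prime, np.allclose(shifted, reference))
```

Each comparison prints True, since shifting every input by the same amount does not change the Softmax output beyond floating-point rounding.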

Let's now understand why \eqref{eq:u0SjbfEiloxYNtGxR2o} is better for numerical computation. $C'$ can be any constant value, so we can choose $C'$ such that every exponent $C'+x_i$ is small. This is how we can avoid large uncomputable numbers.

Now, what is a good value of $C'$? If our goal is to keep every exponent $C'+x_i$ from becoming large, we could set $C'$ to be the negative of the maximum of our input vector $\boldsymbol{x}$. The largest exponent is then exactly $0$, and all other exponents are negative.

For instance, consider the following input vector:

$$ \boldsymbol{x}=\begin{pmatrix} 800\\500\\600\\ \end{pmatrix} $$

The negative of the maximum of $\boldsymbol{x}$ is:

$$\begin{align*} C'&=-\max(\boldsymbol{x})\\ &=-800 \end{align*}$$

From \eqref{eq:u0SjbfEiloxYNtGxR2o} we know that the first output $y_1$ can be computed as:

$$\begin{align*} y_1=\frac{\exp(-800+800)}{\exp(-800+800)+\exp(-800+500)+\exp(-800+600)}= \frac{\exp(0)}{\exp(0)+\exp(-300)+\exp(-200)} \end{align*}$$

Notice how we now avoid $\exp(800)$, and our exponents are much smaller!

The implementation of \eqref{eq:u0SjbfEiloxYNtGxR2o} in NumPy is as follows:

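def softmax(x):
    """ x: 1D NumPy array (or list) of inputs """
    x = np.asarray(x, dtype=float)
    # Shift by the negative maximum so that the largest exponent is 0.
    # A new array is created, so the caller's input is not modified in place.
    c = -np.max(x)
    exps = np.exp(x + c)
    return exps / np.sum(exps)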

Now, we can use the function like so:

softmax([800, 500, 600])
array([1.00000000e+000, 5.14820022e-131, 1.38389653e-087])

Notice how we do not have any nan this time.
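
As a sanity check, the shifted version can be compared against the direct definition on small inputs and then tried on large ones. This is a sketch under the assumption of 1D float inputs; the name `stable_softmax` is introduced here for illustration.

```python
import numpy as np

def stable_softmax(x):
    """Softmax with the maximum subtracted before exponentiation."""
    x = np.asarray(x, dtype=float)
    exps = np.exp(x - np.max(x))  # largest exponent is 0, so no overflow
    return exps / exps.sum()

# On small inputs, the result agrees with the direct definition
small = np.array([2.0, 1.0, 0.1])
naive = np.exp(small) / np.exp(small).sum()
print(np.allclose(stable_softmax(small), naive))

# On large inputs, the result is finite and sums to one
large = np.array([800.0, 500.0, 600.0])
probs = stable_softmax(large)
print(probs, probs.sum())

# Very negative shifted exponents underflow to 0, which is harmless
extreme = np.array([0.0, -5000.0])
print(stable_softmax(extreme))
```

Underflow in the small terms only rounds negligible probabilities down to zero, whereas overflow in the basic version destroyed the whole result.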

Published by Isshin Inada