Gentle but Comprehensive Guide to Naive Bayes

Last updated: Aug 12, 2023
Tags: Machine Learning, Python

Naive Bayes is a simple but powerful machine learning model that is often used for classification tasks. As the name suggests, the model is based on Bayes' theorem, and it is widely used in the field of natural language processing.

Motivating example

Simple dataset

Consider the following dataset:



















Here, color is a categorical variable with 3 levels (Brown, Black and White), whereas size is a binary categorical variable (small and large). Suppose we are given that the animal is black in color and small in size. Our goal is to build a binary classifier using Naive Bayes that can predict whether an animal is a cat or a dog given it is black and small.

Computing posterior probabilities

The primary objective of Naive Bayes is to compute the posterior probabilities of every class given some feature values. In this case, the posterior probabilities we want to compute are as follows:

$$\begin{align*} \mathbb{P}(\text{Cat}|\text{Black and Small})\\ \mathbb{P}(\text{Dog}|\text{Black and Small}) \end{align*}$$

Once we compute these posterior probabilities, we simply need to compare the two values and choose the class (cat or dog) with the larger posterior probability.

Let us begin by computing the first posterior probability:

$$\begin{align} \mathbb{P}(\text{Cat}|\text{Black and Small})&\propto\mathbb{P}(\text{Black and Small}|\text{Cat})\cdot\mathbb{P}(\text{Cat}) \end{align}$$

Understanding why the proportional sign is used

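In step (1), we have used Bayes' theorem, which follows from the definition of conditional probability:

$$\mathbb{P}(A|B)=\frac{\mathbb{P}(B|A)\cdot\mathbb{P}(A)}{\mathbb{P}(B)}$$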

In this case, the conditional probability formula translates to the following:

$$\begin{align*} \mathbb{P}(\text{Cat}|\text{Black and Small})&=\frac{\mathbb{P}(\text{Black and Small}|\text{Cat})\cdot\mathbb{P}(\text{Cat})}{\mathbb{P}(\text{Black and Small})} \end{align*}$$
$$\begin{align*} \mathbb{P}(\text{Dog}|\text{Black and Small})&=\frac{\mathbb{P}(\text{Black and Small}|\text{Dog})\cdot\mathbb{P}(\text{Dog})}{\mathbb{P}(\text{Black and Small})} \end{align*}$$

Notice how the denominators for both the posterior probabilities are the same. To make predictions, we simply care about which posterior probability is greater - and so we can simply eliminate the common denominator $\mathbb{P}(\text{Black and Small})$ here. This means that we can compute only the numerator instead for comparison purposes. Since we drop the denominators, the equality no longer holds so we must resort to using the proportional sign like so:

$$\mathbb{P}(\text{Cat}|\text{Black and Small})\propto\mathbb{P}(\text{Black and Small}|\text{Cat})\cdot\mathbb{P}(\text{Cat})$$
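
To see why dropping the shared denominator is harmless, the following minimal sketch normalizes a pair of unnormalized scores and checks that the winning class does not change. The score values are illustrative placeholders rather than values from the dataset, and `normalize` is a hypothetical helper name.

```python
def normalize(scores):
    """Divide each unnormalized score by the sum of all scores."""
    total = sum(scores.values())
    return {label: value / total for label, value in scores.items()}

# Illustrative unnormalized scores, i.e. likelihood * prior for each class
scores = {"Cat": 0.3, "Dog": 0.1}
posteriors = normalize(scores)

# The class with the largest score is also the class with the largest posterior
best_by_score = max(scores, key=scores.get)
best_by_posterior = max(posteriors, key=posteriors.get)
print(posteriors, best_by_score, best_by_posterior)
```

When the classes cover every possibility, the sum of the numerators equals the shared denominator, which is why dividing by the sum recovers proper probabilities if they are ever needed.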

Using the independence assumption of features

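Next, we make use of the independence assumption, which states that the features are independent of one another given the class:

$$\mathbb{P}(A\text{ and }B|C)=\mathbb{P}(A|C)\cdot\mathbb{P}(B|C)$$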

For our case, this translates to the following:

$$\begin{align} \mathbb{P}(\text{Cat}|\text{Black and Small})&\propto{\color{blue}\mathbb{P}(\text{Black and Small}|\text{Cat})}\cdot\mathbb{P}(\text{Cat})\\ &={\color{blue}\mathbb{P}(\text{Black}|\text{Cat})\cdot\mathbb{P}(\text{Small}|\text{Cat})}\cdot\mathbb{P}(\text{Cat})\\ \end{align}$$

This is what is meant by the naive assumption of Naive Bayes. In this case, we are assuming that given the animal is a cat, the two events (Black and Small) are independent.

Laplace smoothing

The next step is to compute the individual probabilities. We can easily compute $\mathbb{P}(\text{Cat})$ using simple frequency counts:

$$\mathbb{P}(\text{Cat})=\frac{N_{\text{Cat}}}{N}=\frac{3}{5}$$

Where:

  • $N_{\text{Cat}}$ is the number of data items where class=Cat.

  • $N$ is the total number of data items in the dataset.

Next, we need to compute the two conditional probabilities. One intuitive way of computing the conditional probabilities would be to use simple frequency counts once again:

$$\mathbb{P}(\text{Black}|\text{Cat})=\frac{N_{\text{Black|Cat}}}{N_{\text{Cat}}}$$

Where:

  • $N_{\text{Black|Cat}}$ is the number of data items where color=Black given class=Cat.

  • $N_\text{Cat}$ is the number of data items where class=Cat.

However, this does not work well in practice. To understand why, suppose we did not have a black cat in our dataset, that is, $N_{\text{Black|Cat}}=0$. This would mean that the resulting posterior probability would be reduced to $0$ like so:

$$\begin{align*} \mathbb{P}(\text{Cat}|\text{Black and Small})&\propto\mathbb{P}(\text{Black and Small}|\text{Cat})\cdot\mathbb{P}(\text{Cat})\\ &={\color{blue}\mathbb{P}(\text{Black}|\text{Cat})}\cdot\mathbb{P}(\text{Small}|\text{Cat})\cdot\mathbb{P}(\text{Cat})\\ &={\color{blue}(0)}\cdot\mathbb{P}(\text{Small}|\text{Cat})\cdot\mathbb{P}(\text{Cat})\\ &=0\\ \end{align*}$$

This is undesirable because we lose substantial information. For instance, we have information about $\mathbb{P}(\text{Small|Cat})$, but this information becomes meaningless since the posterior probability would end up being zero anyway. To ensure that the absence of a feature value does not reduce the posterior probability to zero, we use a technique called Laplace smoothing. This involves adding a small value to the numerator and the denominator like so:

$$\mathbb{P}(\text{Black}|\text{Cat})=\frac{N_{\text{Black|Cat}}+\alpha}{N_{\text{Cat}}+\alpha\cdot{K_{\text{color}}}}$$

Where:

  • $\alpha$ is called a smoothing parameter (usually just set to $1$)

  • $K_{\text{color}}$ is the number of distinct values that the feature color can take. For a categorical feature, $K$ is the number of levels of the feature ($K_{\text{color}}=3$ in this case because we have three levels for this feature: brown, black and white).

With this approach, the numerator would never be equal to 0 since the minimum value of the numerator would be $\alpha$. $\mathbb{P}(\text{Black}|\text{Cat})$ would therefore never equal zero, which then means that the posterior probability $\mathbb{P}(\text{Cat}|\text{Black and Small})$ would never be reduced to zero.
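
The smoothing formula can be sketched as a small helper. This is a minimal illustration, assuming a hypothetical function name `smoothed_probability` and illustrative color counts for three cats; it is not taken from any library.

```python
def smoothed_probability(count, class_count, num_levels, alpha=1.0):
    """Laplace-smoothed estimate of P(feature value | class)."""
    return (count + alpha) / (class_count + alpha * num_levels)

# Hypothetical case from the text: no black cats among 3 cats, color has 3 levels
p_black_given_cat = smoothed_probability(count=0, class_count=3, num_levels=3)
print(p_black_given_cat)  # (0 + 1) / (3 + 1 * 3) = 1/6, which is not zero

# Illustrative counts per color for the 3 cats (assumed, not from the dataset)
counts_per_color = {"Brown": 2, "Black": 0, "White": 1}
total = sum(smoothed_probability(c, 3, 3) for c in counts_per_color.values())
print(total)  # the smoothed estimates over all levels still sum to one
```

Adding $\alpha\cdot K$ to the denominator is what keeps the smoothed estimates a valid probability distribution over the $K$ levels.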

For this example problem, let us use the conventional value of $\alpha$, that is, $\alpha=1$:

$$\begin{align*} \mathbb{P}(\text{Cat}|\text{Black and Small})&\propto\mathbb{P}(\text{Black and Small}|\text{Cat})\cdot\mathbb{P}(\text{Cat})\\ &=\mathbb{P}(\text{Black}|\text{Cat})\cdot\mathbb{P}(\text{Small}|\text{Cat})\cdot\mathbb{P}(\text{Cat})\\ &=\color{blue}\frac{N_{Black|Cat}+\alpha}{N_{\text{Cat}}+\alpha\cdot{K_{\text{color}}}}\cdot\frac{N_{Small|Cat}+\alpha}{N_{\text{Cat}}+\alpha\cdot{K_{\text{size}}}}\cdot\frac{3}{5}\\ &=\frac{1+1}{3+1\cdot3}\cdot\frac{2+1}{3+1\cdot2}\cdot\frac{3}{5}\\ &=0.12 \end{align*}$$

Computing posterior probability of the other level

In the previous section, we have computed the posterior probability $\mathbb{P}(\text{Cat}|\text{Black and Small})$. We now need to compute the posterior probability for the other levels of the target class, which in this case is $\mathbb{P}(\text{Dog}|\text{Black and Small})$:

$$\begin{align*} \mathbb{P}(\text{Dog}|\text{Black and Small})&\propto\mathbb{P}(\text{Black and Small}|\text{Dog})\cdot\mathbb{P}(\text{Dog})\\ &=\mathbb{P}(\text{Black}|\text{Dog})\cdot\mathbb{P}(\text{Small}|\text{Dog})\cdot\mathbb{P}(\text{Dog})\\ &=\frac{N_{\text{Black|Dog}}+\alpha}{N_{\text{Dog}}+\alpha\cdot{K}_{\text{color}}}\cdot\frac{N_{\text{Small|Dog}}+\alpha}{N_{\text{Dog}}+\alpha\cdot{K}_{\text{size}}}\cdot\frac{2}{5}\\ &=\frac{1+1}{2+1\cdot3}\cdot\frac{0+1}{2+1\cdot2}\cdot\frac{2}{5}\\ &=0.04 \end{align*}$$

Making predictions

We now compare the computed posterior probabilities:

$$\begin{align*} \mathbb{P}(\text{Cat}|\text{Black and Small})>\mathbb{P}(\text{Dog}|\text{Black and Small}) \end{align*}$$

In words, the probability that the animal is a cat is greater than the probability that it is a dog given it is black in color and small in size. Therefore, we predict that the animal is a cat in this case.
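
The whole categorical procedure can be sketched in a few lines of Python. The counts below are the ones that appear in the derivation ($N=5$, $N_{\text{Cat}}=3$, $N_{\text{Dog}}=2$, and so on), and each class prior is taken as $N_{\text{class}}/N$, so the prior for Dog is $2/5$. The variable and function names are illustrative assumptions.

```python
def smoothed_probability(count, class_count, num_levels, alpha=1.0):
    """Laplace-smoothed estimate of P(feature value | class)."""
    return (count + alpha) / (class_count + alpha * num_levels)

N = 5                                   # total number of data items
class_counts = {"Cat": 3, "Dog": 2}     # N_Cat and N_Dog
value_counts = {                        # e.g. N_Black|Cat = 1, N_Small|Dog = 0
    "Cat": {"Black": 1, "Small": 2},
    "Dog": {"Black": 1, "Small": 0},
}
num_levels = {"Black": 3, "Small": 2}   # K_color = 3, K_size = 2

scores = {}
for label, n_class in class_counts.items():
    score = n_class / N                 # prior P(class)
    for value, k in num_levels.items():
        # multiply in the smoothed conditional probability of each observed value
        score *= smoothed_probability(value_counts[label][value], n_class, k)
    scores[label] = score

prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The class with the larger unnormalized score is the prediction, which for these counts is Cat.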

Dealing with continuous variables

Consider the following simple dataset with one continuous feature:











Given weight=8, we want to predict whether the animal is a cat or a dog using the Naive Bayes model. The core logic behind the algorithm is exactly the same as the categorical case - we aim to compute the following posterior probabilities:

$$\begin{align*} \mathbb{P}(\text{Cat}|\text{Weight}=8)\\ \mathbb{P}(\text{Dog}|\text{Weight}=8) \end{align*}$$

Again, we make use of Bayes' theorem and drop the common denominator like so:

$$\begin{align} p(\text{Cat}|\text{Weight}=8)&\propto {\color{blue}p(\text{Weight}=8|\text{Cat})} \cdot{\color{red}\mathbb{P}(\text{Cat})}\\ \end{align}$$

$\color{red}\mathbb{P}(\text{Cat})$ can be computed easily using simple frequency counts:

$${\color{red}\mathbb{P}(\text{Cat})}=\frac{N_{\text{Cat}}}{N}=\frac{2}{4}$$

For brevity, let $X$ denote the random variable representing the weight of the animal. To be able to compute ${\color{blue}p(\text{X}=8|\text{Cat})}$, we need to assume that $X$ follows some probability distribution, typically the normal distribution, like so:

$$X|\text{Cat}\sim\mathcal{N}\left(\mu_{X|\text{Cat}},\sigma^2_{X|\text{Cat}}\right)$$

This mathematical notation tells us that the random variable $X$ (weight of the animal), given that the animal is a cat, follows a normal distribution with some mean $\mu_{X|\text{Cat}}$ and variance $\sigma^2_{X|\text{Cat}}$. However, we do not actually know $\mu_{X|\text{Cat}}$ and $\sigma^2_{X|\text{Cat}}$, and so we need to estimate these parameter values based on our dataset. As we know from elementary statistics, the mean $\mu_{X|\text{Cat}}$ can be estimated using the sample mean $\bar{X}_\text{Cat}$, and the variance $\sigma^2_{X|\text{Cat}}$ using the sample variance ${S}^2_\text{X|Cat}$ like so:

$$\begin{align*} \hat\mu_{\text{X|Cat}}&=\bar{X}_{\text{Cat}}\\ \hat\sigma^2_{\text{X|Cat}}&=S^2_{\text{X|Cat}} \end{align*}$$

Here, the hat symbol on top of the parameter values means that they are merely estimates, instead of the actual values.

Given target=Cat, the sample mean and variance can be easily calculated like so:

$$\begin{align*} \bar{X}_{\text{Cat}}&=\frac{5+10}{2}=7.5\\ S^2_{\text{X|Cat}}&=\frac{1}{2-1}\Big[(5-7.5)^2+(10-7.5)^2\Big]=12.5 \end{align*}$$

Now, we can compute the likelihood, that is, the value of the estimated normal density at $x=8$, like so:

$$\begin{align*} {\color{blue}p(\text{X}=8|\text{Cat})} &=\frac{1}{\hat\sigma_{\text{X|Cat}}\sqrt{2\pi}}\mathrm{exp}\left(-\frac{(x-\hat\mu_{\text{X|Cat}})^2}{2\hat\sigma^2_{\text{X|Cat}}}\right)\\ &=\frac{1}{\sqrt{12.5}\cdot\sqrt{2\pi}}\mathrm{exp}\left(-\frac{(8-7.5)^2}{2\cdot12.5}\right)\\ &=0.112 \end{align*}$$

We now have everything we need to compute the posterior probability:

$$\begin{align*} p(\text{Cat}|\text{X}=8)&\propto {\color{blue}p(\text{X}=8|\text{Cat})} \cdot{\color{red}\mathbb{P}(\text{Cat})}\\ &={\color{blue}\frac{1}{\hat\sigma_{\text{X|Cat}}\sqrt{2\pi}}\mathrm{exp}\left(-\frac{(x-\hat\mu_{\text{X|Cat}})^2}{2\hat\sigma^2_{\text{X|Cat}}}\right)}\cdot\color{red}{\frac{N_{\text{Cat}}}{N}}\\ &={\color{blue}\frac{1}{\sqrt{12.5}\sqrt{2\pi}}\mathrm{exp}\left(-\frac{(8-7.5)^2}{2\cdot12.5}\right)}\cdot{{\color{red}\frac{2}{4}}}\\ &=0.0559\\ \end{align*}$$
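
The calculation for the cat class can be reproduced with Python's standard library. This is a minimal sketch: the two cat weights (5 and 10) and the total of 4 data items are taken from the formulas above, while `normal_pdf` is a hypothetical helper name.

```python
import math
import statistics

cat_weights = [5, 10]   # weights of the two cats used in the sample mean above
n_total = 4             # total number of data items

mean_cat = statistics.mean(cat_weights)      # sample mean
var_cat = statistics.variance(cat_weights)   # sample variance (divides by n - 1)

def normal_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

likelihood_cat = normal_pdf(8, mean_cat, var_cat)           # p(X=8 | Cat)
score_cat = likelihood_cat * len(cat_weights) / n_total     # times the prior P(Cat)
print(round(likelihood_cat, 3), round(score_cat, 4))        # 0.112 0.0559
```

Note that `statistics.variance` divides by $n-1$, matching the sample variance formula used here, whereas `statistics.pvariance` would divide by $n$.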

We then compute the posterior probability for the other class, $p(\text{Dog}|X=8)$, in the same way:

$$\begin{align*} p(\text{Dog}|\text{X}=8)&\propto {\color{blue}p(\text{X}=8|\text{Dog})} \cdot{\color{red}\mathbb{P}(\text{Dog})}\\ &={\color{blue}\frac{1}{\hat\sigma_{\text{X|Dog}}\sqrt{2\pi}}\mathrm{exp}\left(-\frac{(x-\hat\mu_{\text{X|Dog}})^2}{2\hat\sigma^2_{\text{X|Dog}}}\right)}\cdot\color{red}{\frac{N_{\text{Dog}}}{N}}\\ &={\color{blue}\frac{1}{\sqrt{6.25}\sqrt{2\pi}}\mathrm{exp}\left(-\frac{(8-17.5)^2}{2\cdot6.25}\right)}\cdot{{\color{red}\frac{2}{4}}}\\ &=0.0000584\\ \end{align*}$$

Comparing the two posterior probabilities:

$$p(\text{Cat}|X=8)\gt p(\text{Dog}|X=8)$$

We therefore predict that the animal is a cat.
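
The comparison can be sketched as follows, plugging in the estimated means, variances and priors exactly as they appear in the two derivations above. The function name `gaussian_score` and the dictionary layout are illustrative assumptions.

```python
import math

def gaussian_score(x, mean, variance, prior):
    """Unnormalized posterior: normal density at x multiplied by the class prior."""
    density = math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)
    return density * prior

# (estimated mean, estimated variance, prior) per class, as used in the text
params = {
    "Cat": (7.5, 12.5, 2 / 4),
    "Dog": (17.5, 6.25, 2 / 4),
}

scores = {label: gaussian_score(8, *p) for label, p in params.items()}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

A weight of 8 lies close to the estimated cat mean and far from the estimated dog mean, so the cat score dominates by several orders of magnitude.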

Application of Naive Bayes in NLP

Suppose we are trying to build a binary classifier that can differentiate positive and negative reviews about a certain movie. Our training data consists of the following four labelled data items:

| Review | Label |
| --- | --- |
| This movie is great. | positive |
| Love the movie. Love the actors. | positive |
| Bad movie. Do not watch. | negative |
| Bad actors. | negative |

Suppose we are now presented with the review "Great Actors". Our goal is to predict whether this review is positive or negative. As discussed earlier, we would need to calculate the following posterior probabilities and compare them:

$$\begin{align*} \mathbb{P}(\text{pos}|{\color{green}\text{`great actors`}})\\ \mathbb{P}(\text{neg}|{\color{green}\text{`great actors`}}) \end{align*}$$

Let us start by computing $\mathbb{P}(\text{pos}|{\color{green}\text{`great actors`}})$ first. Using Bayes' theorem and dropping the common denominator, we have that:

$$\begin{align*} \mathbb{P}(\text{pos}|{\color{green}\text{`great actors`}}) \propto{\mathbb{P}({\color{green}\text{`great actors`}}|\text{pos})\cdot\mathbb{P}(\text{pos})} \end{align*}$$

We can easily compute $\mathbb{P}(\text{pos})$ using simple frequency counts:

$$\mathbb{P}(\text{pos})=\frac{N_{\text{pos}}}{N}=\frac{2}{4}$$

Where:

  • $N_{\text{pos}}$ is the number of data items with class=positive.

  • $N$ is the total number of data items in the dataset.

Next, to compute the likelihood $\mathbb{P}({\color{green}\text{`great actors`}}|\text{pos})$, we resort to the independence assumption of the Naive Bayes model:

$$\begin{align*} \mathbb{P}({\color{green}\text{`great actors`}}|\text{pos})= \mathbb{P}({\color{green}\text{`great`}}|\text{pos})\cdot \mathbb{P}({\color{green}\text{`actors`}}|\text{pos}) \end{align*}$$

Let us compute $\mathbb{P}({\color{green}\text{`great`}}|\text{pos})$ first:

$$\mathbb{P}({\color{green}\text{`great`}}|\text{pos})= \frac{freq({\color{green}\text{`great`}}|\text{pos})}{freq(\text{all words}|\text{pos})}$$

Here, the numerator is the number of times the word "great" appears in positive reviews (just once), whereas the denominator is the total number of words in positive reviews, where every occurrence of a repeated word is counted (10 words):

$$\begin{align*} \mathbb{P}({\color{green}\text{`great`}}|\text{pos})= \frac{freq({\color{green}\text{`great`}}|\text{pos})}{freq(\text{all words}|\text{pos})}= \frac{1}{10} \end{align*}$$

However, once again, this does not work well in practice. If we consider the case when the word "great" does not appear in positive reviews, then the resulting posterior probability ends up being zero once again, that is:

$$\begin{align*} \mathbb{P}(\text{pos}|{\color{green}\text{`great actors`}}) &\propto{\mathbb{P}({\color{green}\text{`great actors`}}|\text{pos})\cdot\mathbb{P}(\text{pos})}\\ &=\mathbb{P}({\color{green}\text{`great`}}|\text{pos})\cdot \mathbb{P}({\color{green}\text{`actors`}}|\text{pos})\cdot\mathbb{P}(\text{pos})\\ &=(0)\cdot \mathbb{P}({\color{green}\text{`actors`}}|\text{pos})\cdot\mathbb{P}(\text{pos})\\ &=0 \end{align*}$$

To avoid this, we must perform Laplace smoothing like so:

$$\begin{align*} \mathbb{P}({\color{green}\text{`great`}}|\text{pos})= \frac{freq({\color{green}\text{`great`}}|\text{pos})+\alpha}{freq(\text{all words}|\text{pos})+\alpha\cdot{K}} \end{align*}$$

Where:

  • $\alpha$ is the smoothing parameter (typically set to $1$)

  • $K$ is the number of unique words in the dataset

One way of thinking about why $K$ is interpreted as such is to think of the review feature as a categorical variable with $K$ levels.

Since there are 11 unique words in our dataset, we have $K=11$. Let us use $\alpha=1$:

$$\begin{align*} \mathbb{P}({\color{green}\text{`great`}}|\text{pos})= \frac{1+1}{10+1\cdot{11}}= \frac{2}{21} \end{align*}$$
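
The word counts used here can be checked with a short script. This is a minimal sketch: the pos/neg labels are inferred from the counts quoted in the text (10 words in positive reviews, 7 in negative ones), and the lowercase, letters-only tokenizer is an assumption.

```python
from collections import Counter
import re

reviews = [
    ("This movie is great.", "pos"),
    ("Love the movie. Love the actors.", "pos"),
    ("Bad movie. Do not watch.", "neg"),
    ("Bad actors.", "neg"),
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

# Count how often each word appears in each class
word_counts = {"pos": Counter(), "neg": Counter()}
for text, label in reviews:
    word_counts[label].update(tokenize(text))

vocabulary = set(word_counts["pos"]) | set(word_counts["neg"])
K = len(vocabulary)                              # number of unique words
total_pos = sum(word_counts["pos"].values())     # words in positive reviews
total_neg = sum(word_counts["neg"].values())     # words in negative reviews

alpha = 1
p_great_pos = (word_counts["pos"]["great"] + alpha) / (total_pos + alpha * K)
print(K, total_pos, total_neg, p_great_pos)
```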

In the same way, we can compute $\mathbb{P}({\color{green}\text{`actors`}}|\text{pos})$ like so:

$$\begin{align*} \mathbb{P}({\color{green}\text{`actors`}}|\text{pos})= \frac{freq({\color{green}\text{`actors`}}|\text{pos})+\alpha}{freq(\text{all words}|\text{pos})+\alpha\cdot{K}}= \frac{1+1}{10+1\cdot{11}}=\frac{2}{21} \end{align*}$$

We now have everything we need to compute the posterior probability:

$$\begin{align*} \mathbb{P}(\text{pos}|{\color{green}\text{`great actors`}}) &\propto{\mathbb{P}({\color{green}\text{`great actors`}}|\text{pos})\cdot\mathbb{P}(\text{pos})}\\ &={\mathbb{P}({\color{green}\text{`great`}}|\text{pos})\cdot{\mathbb{P}({\color{green}\text{`actors`}}|\text{pos})}\cdot\mathbb{P}(\text{pos})}\\ &=\frac{2}{21}\cdot\frac{2}{21}\cdot\frac{2}{4}\\ &\approx4.535\times10^{-3} \end{align*}$$

We use the same logic to compute the posterior probability for the other target level:

$$\begin{align*} \mathbb{P}(\text{neg}|{\color{green}\text{`great actors`}}) &\propto{\mathbb{P}({\color{green}\text{`great actors`}}|\text{neg})\cdot\mathbb{P}(\text{neg})}\\ &={\mathbb{P}({\color{green}\text{`great`}}|\text{neg})\cdot{\mathbb{P}({\color{green}\text{`actors`}}|\text{neg})}\cdot\mathbb{P}(\text{neg})}\\ &=\frac{freq({\color{green}\text{`great`}}|\text{neg})+\alpha}{freq(\text{all words}|\text{neg})+\alpha\cdot{K}} \cdot\frac{freq({\color{green}\text{`actors`}}|\text{neg})+\alpha}{freq(\text{all words}|\text{neg})+\alpha\cdot{K}}\cdot\mathbb{P}(\text{neg})\\ &=\frac{0+1}{7+1\cdot11}\cdot\frac{1+1}{7+1\cdot11}\cdot\frac{2}{4}\\ &\approx3.086\times10^{-3} \end{align*}$$

Comparing the two posterior probabilities, we have the result that:

$$\mathbb{P}(\text{pos}|{\color{green}\text{`great actors`}}) \gt \mathbb{P}(\text{neg}|{\color{green}\text{`great actors`}})$$

We therefore predict that the review "great actors" is positive.
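
Putting the steps together, the text example can be sketched as a tiny from-scratch classifier. The review labels are inferred from the word counts quoted above, and the tokenizer, the `score` function and the variable names are illustrative assumptions rather than a reference implementation.

```python
from collections import Counter
import re

training = {
    "pos": ["This movie is great.", "Love the movie. Love the actors."],
    "neg": ["Bad movie. Do not watch.", "Bad actors."],
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

# Word frequencies per class, vocabulary size K and number of training reviews N
counts = {
    label: Counter(word for doc in docs for word in tokenize(doc))
    for label, docs in training.items()
}
vocab_size = len(set().union(*counts.values()))
n_docs = sum(len(docs) for docs in training.values())

def score(review, label, alpha=1.0):
    """Unnormalized posterior: prior times the smoothed probability of each word."""
    total_words = sum(counts[label].values())
    result = len(training[label]) / n_docs
    for word in tokenize(review):
        result *= (counts[label][word] + alpha) / (total_words + alpha * vocab_size)
    return result

scores = {label: score("Great Actors", label) for label in training}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

For longer reviews, multiplying many small probabilities can underflow, so practical implementations usually sum log-probabilities instead; the plain product is kept here to mirror the hand calculation.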

Implementing Naive Bayes using Python's Scikit-learn

In this section, the goal is to implement Naive Bayes using Python's Scikit-learn to solve a classification problem. In particular, the problem is to classify the type of iris given 4 of its numerical characteristics: sepal length, sepal width, petal length and petal width. The target consists of 3 different types of Iris.

We first import the relevant libraries:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

We then retrieve our dataset as a Pandas DataFrame:

bunch_iris = datasets.load_iris()
# Construct a DataFrame from the Bunch object: four feature columns plus the target
data = pd.DataFrame(data=np.c_[bunch_iris['data'], bunch_iris['target']],
                    columns=bunch_iris['feature_names'] + ['target'])
data.head()  # show the first five rows

sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
0 5.1 3.5 1.4 0.2 0.0
1 4.9 3.0 1.4 0.2 0.0
2 4.7 3.2 1.3 0.2 0.0
3 4.6 3.1 1.5 0.2 0.0
4 5.0 3.6 1.4 0.2 0.0

We then split our DataFrame into training and testing sets:

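# Break into X (features) and y (target)
# Note: iloc[:, 1:4] selects columns 1 to 3 (sepal width, petal length, petal width),
# so sepal length is left out; use data.iloc[:, :4] to include all four features
X = data.iloc[:, 1:4]
y = data.iloc[:, -1]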

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2000)
print("Number of rows of X_train:", X_train.shape[0])
print("Number of rows of y_train:", y_train.shape[0])
print("Number of rows of X_test:", X_test.shape[0])
print("Number of rows of y_test:", y_test.shape[0])
Number of rows of X_train: 120
Number of rows of y_train: 120
Number of rows of X_test: 30
Number of rows of y_test: 30

We then build our model, perform predictions using the testing set, and then print out the classification report:

# Fit the Gaussian Naive Bayes model on the training set
model = GaussianNB().fit(X_train, y_train)
y_test_predicted = model.predict(X_test)
print(classification_report(y_test, y_test_predicted))
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00         8
         1.0       0.88      0.70      0.78        10
         2.0       0.79      0.92      0.85        12

    accuracy                           0.87        30
   macro avg       0.89      0.87      0.87        30
weighted avg       0.87      0.87      0.86        30

Here, since we are dealing with features that are continuous, we need to use the Gaussian Naive Bayes.

To learn more about what the performance keywords (e.g. precision, recall) mean, visit this guide. We can see that the overall accuracy for our model is 0.87. We can extract this value like so:

print("Accuracy is:", accuracy_score(y_test, y_test_predicted))
Accuracy is: 0.8666666666666667

We can also see the confusion matrix like so:

print(confusion_matrix(y_test, y_test_predicted))
[[ 8 0 0]
[ 0 7 3]
[ 0 1 11]]

To interpret the confusion matrix, visit this guide.
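
As a quick cross-check, the accuracy, precision and recall in the classification report can be recomputed by hand from the confusion matrix printed above. This is a minimal sketch in plain Python, relying on the scikit-learn convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
confusion = [
    [8, 0, 0],
    [0, 7, 3],
    [0, 1, 11],
]
n_classes = len(confusion)
total = sum(sum(row) for row in confusion)

# Accuracy: correctly classified items (the diagonal) over all items
accuracy = sum(confusion[i][i] for i in range(n_classes)) / total

# Recall: diagonal entry over its row sum (all items truly in that class)
recall = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]

# Precision: diagonal entry over its column sum (all items predicted as that class)
precision = [
    confusion[i][i] / sum(row[i] for row in confusion) for i in range(n_classes)
]

print(accuracy)    # 26/30, reported as 0.87
print(precision)   # class 1.0: 7/8 = 0.875, reported as 0.88
print(recall)      # class 1.0: 7/10 = 0.70
```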

Published by Isshin Inada