Comprehensive Guide on Metrics of Recommendation Systems
Unlike the metrics for regression and classification problems, the metrics used to evaluate recommendation systems are less well known and less agreed upon in the literature. In this guide, we will cover four key metrics that shed light on the characteristics of a recommendation system.
Intra-list similarity (ILS)
The intra-list similarity (ILS) is an average measure of how similar the recommended items are to their seed items. The formula to compute the intra-list similarity is as follows:

$$\text{ILS}=\frac{1}{n}\sum\text{sim}(P_i,P_j)$$

Where:

$n$ is the number of recommendations made.

$\text{sim}(P_i,P_j)$ is the similarity score between product $P_i$ and $P_j$ where $i\ne{j}$, and the sum runs over every pair of seed product $P_i$ and recommended product $P_j$.
Note the following properties of intra-list similarity:
this metric ranges between $0$ and $1$ since $\text{sim}(P_i,P_j)$ is usually normalized to fall between $0$ and $1$.
a high ILS score suggests that the recommended products are similar to their seed products.
a low ILS score indicates that the recommended products are dissimilar to their seed products.
For example, consider the following table of recommendation similarities:
| | $P_1$ | $P_2$ | $P_3$ | $P_4$ |
|---|---|---|---|---|
| $P_1$ | 1 | 0.2 | 0.4 | 0.1 |
| $P_2$ | 0.2 | 1 | 0.5 | 0.4 |
| $P_3$ | 0.4 | 0.5 | 1 | 0.5 |
| $P_4$ | 0.1 | 0.3 | 0.5 | 1 |
Here, we see that the similarity score between $P_1$ and $P_2$ is $0.2$.
Suppose that, for each seed product, we pick the two most similar products as its recommendations:
$P_1$ recommends $P_3$ (0.4) and $P_2$ (0.2).
$P_2$ recommends $P_3$ (0.5) and $P_4$ (0.3).
$P_3$ recommends $P_2$ (0.5) and $P_4$ (0.5).
$P_4$ recommends $P_3$ (0.5) and $P_2$ (0.4).
In total, we are making $8$ recommendations ($n=8$). The intra-list similarity of our recommendation system for this case is:

$$\text{ILS}=\frac{0.4+0.2+0.5+0.3+0.5+0.5+0.5+0.4}{8}=\frac{3.3}{8}=0.4125$$
This means that the average similarity between all pairs of items in the recommendation list is $0.4125$.
Since we are only considering the top $2$ recommendations for each seed product, we sometimes include this detail in the metric name, as in $\text{ILS}@2$.
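As a concrete sketch, the $\text{ILS}@2$ computation above can be reproduced in Python. The function name `ils_at_k` and the use of NumPy are illustrative choices, not part of any standard library; the code reads each column of the similarity table as the candidate scores for that seed product, matching the worked example:

```python
import numpy as np

# Pairwise similarity table from the worked example (rows/columns: P1..P4)
sim = np.array([
    [1.0, 0.2, 0.4, 0.1],
    [0.2, 1.0, 0.5, 0.4],
    [0.4, 0.5, 1.0, 0.5],
    [0.1, 0.3, 0.5, 1.0],
])

def ils_at_k(sim, k):
    """Average similarity between each seed item and its top-k most
    similar items (the item itself is excluded)."""
    scores = []
    for i in range(sim.shape[0]):
        # Similarities of all other products to seed i (column i of the table)
        others = np.delete(sim[:, i], i)
        scores.extend(np.sort(others)[-k:])  # keep the k most similar
    return float(np.mean(scores))

print(round(ils_at_k(sim, k=2), 4))  # 0.4125
```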
Diversity
Diversity is the opposite of intra-list similarity (ILS), that is, diversity is an average measure of how dissimilar the recommended items are to their seed items. The formula to compute diversity is:

$$\text{Diversity}=1-\text{ILS}=1-\frac{1}{n}\sum\text{sim}(P_i,P_j)$$
Where:
$n$ is the number of recommended items.
$\text{sim}(P_i,P_j)$ is the similarity score between items $P_i$ and $P_j$ where $i\ne{j}$.
A high diversity score indicates that the recommendations are very different from their seed products. For instance, a system that recommends horror movies for a romance movie would have high diversity.
To demonstrate, let's use the same simple example used when explaining ILS. Recall that the ILS score for our recommendation system was $0.4125$. This means that the diversity is:

$$\text{Diversity}=1-0.4125=0.5875$$
This means that, on average, the dissimilarity score for seed items and their recommendations is $0.5875$.
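Continuing the sketch, diversity follows directly as the complement of ILS. The helper name below is hypothetical, and the formula assumes similarities are normalized to $[0,1]$:

```python
def diversity_from_ils(ils_score):
    # Diversity is defined as the complement of intra-list similarity,
    # assuming similarity scores are normalized to fall between 0 and 1
    return 1.0 - ils_score

# ILS value from the worked example
print(round(diversity_from_ils(0.4125), 4))  # 0.5875
```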
Coverage
Coverage is a measure of the percentage of items that are recommended. Intuitively, coverage is a measure of how well the recommendation system is able to cover the full range of items available. The formula for coverage is as follows:

$$\text{Coverage}=\frac{\text{number of unique items recommended}}{\text{total number of items}}\times100\%$$
Note the following:
a recommendation system with high coverage recommends most items.
a recommendation system with low coverage recommends only a select few items.
coverage can also be interpreted as a measure of how diverse the recommended products are.
To demonstrate, suppose we have $5$ products $P_1$, $P_2$, $P_3$, $P_4$ and $P_5$. Consider a recommendation system that recommends two different products for each product:
$P_1$ recommends $P_2$ and $P_5$.
$P_2$ recommends $P_1$ and $P_5$.
$P_3$ recommends $P_1$ and $P_5$.
$P_4$ recommends $P_1$ and $P_2$.
$P_5$ recommends $P_1$ and $P_2$.
Here, we see that only $3$ products are getting recommended ($P_1$, $P_2$ and $P_5$) out of a total of $5$ products. The coverage of our recommendation system is:

$$\text{Coverage}=\frac{3}{5}\times100\%=60\%$$
This means that $60\%$ of the items are getting recommended, that is, $40\%$ of the products never appear in the recommendations.
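The coverage computation above can be sketched in Python as follows. The dictionary layout is just one way to represent the recommendations, not a prescribed format:

```python
# Recommendations from the worked example: seed product -> recommended products
recommendations = {
    "P1": ["P2", "P5"],
    "P2": ["P1", "P5"],
    "P3": ["P1", "P5"],
    "P4": ["P1", "P2"],
    "P5": ["P1", "P2"],
}
catalog_size = 5  # total number of products

# Collect the distinct products that appear in any recommendation list
recommended = {item for recs in recommendations.values() for item in recs}
coverage = len(recommended) / catalog_size
print(coverage)  # 0.6
```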
Novelty
Novelty is a recommendation system metric that measures how surprising or unique the recommended items are to the user. It is a measure of how different the recommended items are from what the user has seen before. One common way of computing novelty is to average how unfamiliar each recommended item is:

$$\text{Novelty}=\frac{1}{n}\sum_{i=1}^{n}\big(1-\text{popularity}(M_i)\big)$$

Where:

$n$ is the number of recommended items.

$\text{popularity}(M_i)$ is the fraction of users who are familiar with item $M_i$.

To demonstrate how to apply this formula, suppose we have a total of $100$ users and the following number of ratings given to four movies:
movie 1 has 80 ratings.
movie 2 has 70 ratings.
movie 3 has 20 ratings.
movie 4 has 10 ratings.
Here, note that we don't care about what the rating is (e.g. a 1-star rating and a 5-star rating are treated the same). This is because popularity in this context refers to how well known the movie is - a movie with lots of negative ratings is still considered popular in the sense that many people know about it.
Here, we have used the number of ratings to infer a movie's popularity but we could also use other information such as the number of views and sales revenue instead.
Using the data available, we can infer the popularity of each movie by taking the proportion of users who rated it:

$$\text{popularity}(M_i)=\frac{\text{number of users who rated }M_i}{\text{total number of users}}$$

For example, the popularity of movie one ($M_1$) is:

$$\text{popularity}(M_1)=\frac{80}{100}=0.8$$
Let's summarize the popularity scores of the movies:
| Movie | Popularity score |
|---|---|
| $M_1$ | 0.8 |
| $M_2$ | 0.7 |
| $M_3$ | 0.2 |
| $M_4$ | 0.1 |
Now, suppose our recommendation system recommends movies $M_1$ and $M_2$ to a user. Taking novelty as the average unfamiliarity $(1-\text{popularity})$ of the recommended items, the novelty score of our recommendation system is:

$$\text{Novelty}=\frac{(1-0.8)+(1-0.7)}{2}=\frac{0.2+0.3}{2}=0.25$$
Note the following:
a system with a high novelty score is recommending lesser-known movies.

a system with a low novelty score is recommending popular movies.
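A minimal sketch of the novelty calculation above, assuming the average-unfamiliarity formulation (the function and variable names are illustrative):

```python
n_users = 100
ratings_count = {"M1": 80, "M2": 70, "M3": 20, "M4": 10}

# Popularity: fraction of users who rated (i.e. know about) each movie
popularity = {m: c / n_users for m, c in ratings_count.items()}

def novelty(recommended, popularity):
    # Average unfamiliarity (1 - popularity) of the recommended items
    return sum(1 - popularity[m] for m in recommended) / len(recommended)

print(round(novelty(["M1", "M2"], popularity), 4))  # 0.25
```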
Novelty is an important metric for recommendation systems because it encourages the system to recommend items that are different from what the user has seen before, which can lead to a more engaging and diverse user experience.