Guide on using the LightFM package in Python
Suppose we wanted to build a movie recommendation system based on the following dataset:
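Throughout this guide, we assume the following imports (a minimal set covering pandas, NumPy, SciPy sparse matrices, scikit-learn's cosine similarity, and the relevant LightFM modules):

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
from lightfm import LightFM
from lightfm.data import Dataset
from lightfm.cross_validation import random_train_test_split
from lightfm.evaluation import precision_at_k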
df_data = pd.read_csv("./data.csv", index_col=0)
print(len(df_data)) # 10987720 rows
df_data
   userID  itemID  rating occupation   genre
0     196     242     3.0     writer  Comedy
1     196     242     3.0     writer  Comedy
2     196     242     3.0     writer  Comedy
3     196     242     3.0     writer  Comedy
4     196     242     3.0     writer  Comedy
Note the following:
- itemID refers to the movie ID.
- rating is a numeric value that ranges from 0 to 5.
- occupation is the profession of the user. A user can only have a single occupation.
- genre refers to the type of the movie. An item can have multiple genres (e.g. "Comedy|Musical").
Our goal is to build a movie recommendation system based on the following features:
- user-item interaction data - rating.
- user attribute data - occupation.
- item attribute data - genre.
Data preparation
Preparing LightFM Dataset
The first step is to prepare the dataset (Dataset) that we will feed into our LightFM model. There are two things we must supply:
- a list of unique user occupations.
- a list of unique movie genres.
The list of unique occupations can be obtained like so:
list_str_occupations_unique = list(df_data["occupation"].drop_duplicates())
print(len(list_str_occupations_unique)) # 21
list_str_occupations_unique
['writer', 'marketing', 'student', 'other', ... ]
The list of unique movie genres can be obtained like so:
series_genre_of_movies = df_data["genre"].str.split("|")
list_str_movie_genre_unique = list(set(np.concatenate(series_genre_of_movies).ravel()))
print(len(list_str_movie_genre_unique)) # 18
list_str_movie_genre_unique
['War', 'Animation', 'Sci-Fi', 'Comedy', 'Film-Noir', ... ]
Here, we are using the Series' str.split() method to obtain a list of genres for each row (e.g. "Comedy|Musical" becomes ["Comedy", "Musical"]).
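As a quick illustration of the split:

pd.Series(["Comedy|Musical"]).str.split("|")
# 0    [Comedy, Musical]
# dtype: object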
We can now build the LightFM dataset like so:
dataset = Dataset()
dataset.fit(users=df_data["userID"],
            items=df_data["itemID"],
            item_features=list_str_movie_genre_unique,
            user_features=list_str_occupations_unique)
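As a quick sanity check, we can confirm that the Dataset has registered the expected number of users and items using its interactions_shape() method (the counts below follow from the data described above):

num_users, num_items = dataset.interactions_shape()
print(num_users, num_items) # 943 352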
Preparing user features
Next, we must build the users' features. The input format expected by LightFM is as follows:
[(user_id_1, ['feature1']), (user_id_2, ['feature2']), ...]
We can obtain this input format like so:
df_data_with_unique_user_ids = df_data.drop_duplicates("userID")
list_user_features = [(x, [y]) for x, y in zip(df_data_with_unique_user_ids["userID"],
                                               df_data_with_unique_user_ids["occupation"])]
# print(len(list_user_features)) # 943
list_user_features
[(196, ['writer']), (63, ['marketing']), (226, ['student']), (154, ['student']), ... ]
We then pass this into the Dataset's build_user_features() method:
sm_user_features = dataset.build_user_features(list_user_features)
sm_user_features
<943x964 sparse matrix of type '<class 'numpy.float32'>' with 1886 stored elements in Compressed Sparse Row format>
Here, the shape of sm_user_features is 943 by 964. This is because there are 943 unique users and each userID is treated as a user attribute. We also have 21 unique occupations, which means there are a total of 943 + 21 = 964 user attributes.
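As a quick check of this structure: every row of sm_user_features should have exactly two non-zero entries, one for the user's own ID and one for their occupation, which matches the 1886 (943 × 2) stored elements reported above:

print(sm_user_features.shape) # (943, 964)
print(sm_user_features.getnnz(axis=1)[:5]) # [2 2 2 2 2]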
Preparing item features
Similarly, we build the items' features:
# Keep one row per movie so that each itemID is paired with its own genres
df_movies_uniq = df_data.drop_duplicates("itemID")
list_item_features = [(x, y) for x, y in zip(df_movies_uniq["itemID"],
                                             df_movies_uniq["genre"].str.split("|"))]
print(f"Length of item features: {len(list_item_features)}") # 352
list_item_features
[(242, ['Comedy']), (257, ['Action', 'Adventure', 'Comedy', 'Sci-Fi']), (111, ['Comedy', 'Romance']), (25, ['Comedy']), (382, ['Comedy', 'Drama']), ... ]
We then pass this into the Dataset's build_item_features() method:
sm_item_features = dataset.build_item_features(list_item_features)
sm_item_features
<352x370 sparse matrix of type '<class 'numpy.float32'>' with 1097 stored elements in Compressed Sparse Row format>
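Here, the shape of sm_item_features is 352 by 370 for the same reason as before: each of the 352 movie IDs is treated as an item attribute, plus the 18 unique genres (352 + 18 = 370). The 1097 stored elements correspond to one identity entry per movie plus one entry per assigned genre. We can confirm the shape directly using the Dataset's item_features_shape() method:

print(dataset.item_features_shape()) # (352, 370)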
Preparing interaction data
Now that we've prepared the user and item data, we can move on to preparing the interaction data by using the Dataset's build_interactions() method:
sm_interactions, sm_weights = dataset.build_interactions(df_data[["userID","itemID","rating"]].values)
sm_interactions
<943x352 sparse matrix of type '<class 'numpy.int32'>' with 10987720 stored elements in COOrdinate format>
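Note that build_interactions() returns two sparse matrices: sm_interactions, which records which user-item pairs were observed, and sm_weights, which holds the corresponding rating values. We won't use the weights in the basic fit below, but they can optionally be passed to LightFM via the sample_weight argument of fit() (see the sketch after the model-fitting step).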
Internal mapping of IDs
Instead of using the original IDs of the users and items in our dataset, LightFM internally assigns a new consecutive non-negative integer ID to each user and item. We can see the mapping like so:
user_id_map, user_feature_map, item_id_map, feature_item_map = dataset.mapping()
The user_id_map is:
user_id_map
{196: 0, 63: 1, 226: 2, ...
The user_feature_map is:
user_feature_map
{196: 0, 63: 1, ..., 'healthcare': 962, 'marketing': 963}
Remember, LightFM also treats the ID of every user as a feature - this is why we see the ID of our users included in the user_feature_map. We will later use these mappings to perform predictions.
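As a small illustration (using the mappings shown above), here is how to translate between original and internal IDs, and how to build an inverse mapping for decoding predictions later:

print(user_id_map[196]) # 0
print(item_id_map[242]) # 0
# Build the inverse mapping (internal ID -> original movie ID),
# which is handy for turning predictions back into movie IDs:
inv_item_id_map = {v: k for k, v in item_id_map.items()}
print(inv_item_id_map[0]) # 242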
Evaluating performance
Since we want to evaluate the performance of our LightFM model, we will use the library's random_train_test_split() function:
sm_train_interactions, sm_test_interactions = random_train_test_split(sm_interactions,
                                                                      test_percentage=0.2,
                                                                      random_state=42)
print(f"Shape of train interactions: {sm_train_interactions.shape}")
print(f"Shape of test interactions: {sm_test_interactions.shape}")
Shape of train interactions: (943, 352)
Shape of test interactions: (943, 352)
It's finally time to fit the LightFM model:
LEARNING_RATE = 0.25
NO_EPOCHS = 20
NO_COMPONENTS = 20 # Number of latent factors
ITEM_ALPHA = 1e-6 # Regularization factor for item features
USER_ALPHA = 1e-6 # Regularization factor for user features
model = LightFM(loss="warp",
                no_components=NO_COMPONENTS,
                learning_rate=LEARNING_RATE,
                item_alpha=ITEM_ALPHA,
                user_alpha=USER_ALPHA,
                random_state=42)
model.fit(interactions=sm_train_interactions,
          user_features=sm_user_features,
          item_features=sm_item_features,
          epochs=NO_EPOCHS)
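As hinted earlier, the ratings stored in sm_weights can optionally be used during training. Here is a minimal sketch (not the setup used for the results below): split the weights with the same random_state so they line up with the training interactions, then pass them via fit()'s sample_weight argument (supported for the "warp" loss, though not for "warp-kos"):

# Split the weights exactly like the interactions (same random_state)
sm_train_weights, _ = random_train_test_split(sm_weights,
                                              test_percentage=0.2,
                                              random_state=42)
model_weighted = LightFM(loss="warp",
                         no_components=NO_COMPONENTS,
                         learning_rate=LEARNING_RATE,
                         item_alpha=ITEM_ALPHA,
                         user_alpha=USER_ALPHA,
                         random_state=42)
model_weighted.fit(interactions=sm_train_interactions,
                   sample_weight=sm_train_weights,
                   user_features=sm_user_features,
                   item_features=sm_item_features,
                   epochs=NO_EPOCHS)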
It took me roughly 5 minutes to train the (unweighted) model above on my M1 Mac. Let's now evaluate the performance (precision@k) of our model using our testing data:
np_arr_prec = precision_at_k(model,
                             test_interactions=sm_test_interactions,
                             user_features=sm_user_features,
                             item_features=sm_item_features)
print(len(np_arr_prec)) # 943
np_arr_prec[:10]
array([0.3, 0.3, 0.1, 0.3, 0.1, 0.5, 0.1, 0.9, 0.7, 0. ], dtype=float32)
Here, we have the precision@k value for every user. We can compute the mean precision@k across all users like so:
np_arr_prec.mean()
0.38462353
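Note that precision_at_k() defaults to k=10. We can set k explicitly, and optionally pass the training interactions so that items a user already interacted with during training are excluded from the ranking:

np_arr_prec_5 = precision_at_k(model,
                               test_interactions=sm_test_interactions,
                               train_interactions=sm_train_interactions,
                               k=5,
                               user_features=sm_user_features,
                               item_features=sm_item_features)
print(np_arr_prec_5.mean())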
User-to-item recommendation
Suppose we wanted to recommend 5 movies to a particular user with ID 63. We can use the predict() method of our model like so:
user_id = 63
list_scores = model.predict(user_id_map[user_id], list(item_id_map.values()))
print(len(list_scores)) # 352
list_scores[:5]
array([-89.237495, -40.246788, -49.907825, -54.08462 , -99.96051 ], dtype=float32)
Note the following:
- we convert user_id to the internal user ID used by LightFM using the user_id_map dictionary that we obtained earlier from dataset.mapping().
- we supply a list of movie IDs that we want to obtain a recommendation score for. Since we are interested in finding the top k movie recommendations for this user, we compute the recommendation score for every movie.
- the item_id_map is a dictionary that maps the original movie IDs to the non-negative consecutive integers used by LightFM internally:

item_id_map

{242: 0, 257: 1, 111: 2, ...
We convert the list of recommendation scores into a Series such that we can assign the original movie IDs as the index:
series_scores = pd.Series(list_scores)
series_scores.index = item_id_map.keys()
print(len(list_scores)) # 352
series_scores[:5]
242   -89.237495
257   -40.246788
111   -49.907825
25    -54.084621
382   -99.960510
dtype: float32
Finally, we sort the scores in descending order to obtain the top movie recommendations for the user:
series_scores.sort_values(ascending=False, inplace=True)
series_scores[:5]
222    124.631828
1       21.564163
15     -17.863367
288    -31.532494
258    -34.471897
dtype: float32
Here, we see that the movie with ID 222 is the top recommended movie for this particular user. Note that the magnitude of the scores does not carry any meaning - they are used for ranking purposes only.
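We can wrap these steps into a small helper. This is a sketch rather than part of the original walkthrough; note that it also passes the user and item feature matrices to predict(), so that the occupation and genre features contribute to the scores just as they did during training:

def recommend_top_n(model, user_id, n=5):
    # Score every movie for this user, using the same features as training
    scores = model.predict(user_id_map[user_id],
                           list(item_id_map.values()),
                           user_features=sm_user_features,
                           item_features=sm_item_features)
    series = pd.Series(scores, index=item_id_map.keys())
    return series.sort_values(ascending=False).head(n)

recommend_top_n(model, 63, n=5)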
Item-to-item recommendation
To obtain the vector embedding of each movie, we can use the get_item_representations() method:
_, np_item_embeddings = model.get_item_representations(features=sm_item_features)
print(np_item_embeddings.shape) # (352, 20)
np_item_embeddings[:2]
array([[-0.79904926, -0.85671186, 0.42553982, 0.994905 , -0.06102959, -0.41155615, -0.64710814, 0.38948753, -0.16504961, -0.24440393, -0.46848 , -0.00726059, -0.5730575 , -0.12569594, -0.84235895, 0.9981231 , -0.36846963, 0.0336417 , 0.1883249 , 0.7433187 ], [ 1.3654532 , 1.4837279 , 1.0903912 , -0.14545436, -1.0986278 , 0.08251551, -2.776378 , -0.20987356, 1.8015835 , 2.2055554 , -0.22924855, -3.5627067 , -0.35516343, -0.79560184, -1.4587665 , 1.6426092 , 1.2299991 , 0.26629227, 0.877507 , -0.35510343]], dtype=float32)
Note the following:
- for every movie, we get a vector representation that encodes the characteristics of the movie.
- the shape is 352 by 20 because there are 352 movies in total and we set the latent vector size (NO_COMPONENTS) to 20 during model fitting.
We compute the cosine similarity for each pair of movies:
np_item_similarities = cosine_similarity(sparse.csr_matrix(np_item_embeddings))
print(np_item_similarities.shape) # (352, 352)
np_item_similarities[:2]
array([[ 9.99999940e-01, 9.83655304e-02, 8.39596242e-03, -1.05853856e-01, 4.35901970e-01, -1.35404930e-01, -7.39989057e-03, 5.28273523e-01, -5.74964844e-02, ...
It's good practice to convert this NumPy array into a DataFrame so that we can assign the movie IDs to the rows and columns:
df_item_similarities = pd.DataFrame(np_item_similarities)
df_item_similarities.columns = item_id_map.keys()
df_item_similarities.index = item_id_map.keys()
df_item_similarities
          242       257       111        25       382       202       153       286        66       845  ...      1181
242  1.000000  0.098366  0.008396 -0.105854  0.435902 -0.135405 -0.007400  0.528274 -0.057496  0.247484  ...  0.113449
257  0.098366  1.000000  0.079299  0.134674 -0.236781 -0.060198  0.160672 -0.037351 -0.040263  0.233311  ...  0.121223
111  0.008396  0.079299  1.000000  0.555100  0.156462  0.340305  0.059644  0.220777  0.389159  0.734334  ...  0.113434
25  -0.500537 -0.442127 -0.427205 -0.404253 -0.395989 -0.377560 -0.364232 -0.344025 -0.341550 -0.338591  ...  1.000000
382  0.435902 -0.236781  0.156462  0.271259  1.000000  0.204918  0.178467  0.311732  0.174588  0.147333  ...  0.076015
Remember, the item_id_map is a dictionary obtained from dataset.mapping() earlier that maps the original movie IDs to the non-negative consecutive integers used by LightFM internally:
item_id_map
{242: 0, 257: 1, 111: 2, ...
To get the top 5 recommendations for a particular movie, say the movie with ID 1049:
int_movie_id = 1049
series_rec_movie_ids = df_item_similarities.loc[int_movie_id, :]
series_top_5_rec_movie_ids = series_rec_movie_ids.sort_values(ascending=False).head(6)
series_top_5_rec_movie_ids
1049    1.000000
832     0.786510
1047    0.745040
756     0.741784
930     0.729942
407     0.729872
Name: 1049, dtype: float32
As expected, the first value is the movie itself, which receives a perfect score. We see that the movie with ID 832 is the most recommended (similar) movie!
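Again, we can package this lookup as a small helper (a sketch; it simply sorts one row of the similarity DataFrame and drops the item itself):

def top_similar(df_similarities, some_id, k=5):
    # Sort one row of the pairwise-similarity DataFrame and skip the
    # first entry, which is the item itself (similarity of 1.0)
    return df_similarities.loc[some_id, :].sort_values(ascending=False).iloc[1:k+1]

top_similar(df_item_similarities, 1049, k=5)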
Users' vector embeddings
Not only has LightFM learned the vector embedding of each movie, it has also learned one for each user. To obtain the users' vector embeddings, use the get_user_representations() method:
_, np_user_embeddings = model.get_user_representations(features=sm_user_features)
print(np_user_embeddings.shape) # (943, 20)
np_user_embeddings[:2]
array([[-0.917974 , -0.19056752, 0.804319 , 1.2637781 , -1.1266037 , -1.5947983 , 0.4530285 , 0.78684676, 1.6366841 , -0.55212635, -0.48068386, -1.92954 , -2.35468 , 0.41447783, -1.2052028 , -0.7124101 , -0.64502287, -1.2504478 , 1.0433927 , 0.9680061 ], [-1.6857581 , 0.64886266, -0.39518487, -2.0046077 , 2.8225665 , -0.59971595, -0.90769523, 0.50067604, 1.2015885 , 0.06845629, 0.0112996 , -1.1589563 , -0.62648225, 0.6836995 , -0.53056884, 2.109985 , -0.28536165, -1.1948149 , -0.03558412, 2.0381184 ]], dtype=float32)
To find out how similar two users are, we can do exactly what we did for item-to-item recommendations, that is, compute the pairwise cosine similarity and sort the values in descending order:
np_user_similarities = cosine_similarity(sparse.csr_matrix(np_user_embeddings))
print(np_user_similarities.shape) # (943, 943)
df_user_similarities = pd.DataFrame(np_user_similarities)
df_user_similarities.columns = user_id_map.keys()
df_user_similarities.index = user_id_map.keys()
int_user_id = 196
df_user_similarities.loc[int_user_id, :].sort_values(ascending=False)[:6]
196    1.000000
485    0.668303
888    0.643664
306    0.595293
482    0.588159
336    0.586000
Name: 196, dtype: float32
Here, we see that the user with ID 485 is the most similar to user 196.
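The same top_similar helper from the item-to-item section works here as well, e.g. top_similar(df_user_similarities, 196, k=5).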