Guide on using LightFM package in Python

Last updated: Aug 10, 2023
Tags: Machine Learning

Suppose we wanted to build a movie recommendation system based on the following dataset:

import numpy as np
import pandas as pd

df_data = pd.read_csv("./data.csv", index_col=0)
print(len(df_data)) # 10987720 rows
df_data
userID itemID rating occupation genre
0 196 242 3.0 writer Comedy
1 196 242 3.0 writer Comedy
2 196 242 3.0 writer Comedy
3 196 242 3.0 writer Comedy
4 196 242 3.0 writer Comedy

Note the following:

  • itemID refers to the movie ID.

  • rating is a numeric value that ranges from 0 to 5.

  • occupation is the profession of the user. A user can only have a single occupation.

  • genre refers to the type of the movie. An item can have multiple genres (e.g. "Comedy|Musical").

Our goal is to build a movie recommendation system based on the following features:

  • user-item interaction data - rating.

  • user attribute data - occupation.

  • item attribute data - genre.

Data preparation

Preparing LightFM Dataset

The first step is to prepare the dataset (Dataset) that we will feed into our LightFM model. There are two things we must supply:

  • a list of unique user occupations.

  • a list of unique movie genres.

The list of unique occupations can be obtained like so:

list_str_occupations_unique = list(df_data["occupation"].drop_duplicates())
print(len(list_str_occupations_unique)) # 21
list_str_occupations_unique
['writer', 'marketing', 'student', 'other', ... ]

The list of unique movie genres can be obtained like so:

series_genre_of_movies = df_data["genre"].str.split("|")
list_str_movie_genre_unique = list(set(np.concatenate(series_genre_of_movies).ravel()))
print(len(list_str_movie_genre_unique)) # 18
list_str_movie_genre_unique
['War', 'Animation', 'Sci-Fi', 'Comedy', 'Film-Noir', ... ]

Here, we use the Series' str.split(-) method to turn each genre string into a list of genres (e.g. "Comedy|Musical" becomes ["Comedy","Musical"]). We then flatten these lists and keep each genre once.
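
As a rough alternative sketch, assuming the same "|"-separated genre format, the unique genres can also be collected with pandas' explode(-) method, which avoids concatenating one list per row. The toy DataFrame df_toy is hypothetical and only stands in for df_data:

```python
import pandas as pd

# Hypothetical toy data standing in for df_data (not the real dataset)
df_toy = pd.DataFrame({
    "itemID": [242, 257, 111],
    "genre": ["Comedy", "Action|Comedy|Sci-Fi", "Comedy|Romance"],
})

# Split each genre string into a list, then flatten to one genre per row
series_genres_flat = df_toy["genre"].str.split("|").explode()

# Keep each genre once, sorted for a reproducible order
list_genres_unique = sorted(series_genres_flat.unique())
print(list_genres_unique)  # ['Action', 'Comedy', 'Romance', 'Sci-Fi']
```

Sorting is optional; it only makes the order of the feature list repeatable between runs, which a plain set does not guarantee.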

We can now build the LightFM dataset like so:

from lightfm.data import Dataset

dataset = Dataset()
dataset.fit(users=df_data["userID"],
            items=df_data["itemID"],
            item_features=list_str_movie_genre_unique,
            user_features=list_str_occupations_unique)

Preparing user features

Next, we must build the users' features. The input format expected by LightFM is as follows:

[(user_id_1, ['feature1']), (user_id_2, ['feature2']), ...]

We can obtain this input format like so:

# Keep one row per user, since each user has a single occupation
df_data_with_unique_user_ids = df_data.drop_duplicates("userID")
list_user_features = [(x, [y]) for x, y in zip(df_data_with_unique_user_ids["userID"], df_data_with_unique_user_ids["occupation"])]
print(len(list_user_features)) # 943
list_user_features
[(196, ['writer']), (63, ['marketing']), (226, ['student']), (154, ['student']), ... ]

We then pass this list into the Dataset's build_user_features(-) method:

sm_user_features = dataset.build_user_features(list_user_features)
sm_user_features
<943x964 sparse matrix of type '<class 'numpy.float32'>'
with 1886 stored elements in Compressed Sparse Row format>

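Here, the shape of sm_user_features is 943 by 964. This is because there are 943 unique users and each userID is treated as a user attribute. We also have 21 unique occupations, which gives a total of 943 + 21 = 964 user attributes. Each user has exactly two attributes (their own ID and their occupation), which is why there are 943 × 2 = 1886 stored elements.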
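
To make the 943 + 21 = 964 column count concrete, here is a minimal sketch that rebuilds the same layout by hand with SciPy for a hypothetical toy case of 3 users and 2 occupations. It only imitates the structure described above (one identity column per user followed by one column per occupation) and assumes, as build_user_features(-) does by default, that each row is normalized to sum to 1. It is not LightFM's actual implementation:

```python
import numpy as np
from scipy import sparse

# Hypothetical toy data: 3 users, 2 distinct occupations
list_user_ids = [196, 63, 226]
list_occupations = ["writer", "marketing", "writer"]
list_occupations_unique = ["writer", "marketing"]

n_users = len(list_user_ids)
n_occupations = len(list_occupations_unique)

# Identity block: each user gets a feature column of their own
sm_identity = sparse.identity(n_users, format="csr", dtype=np.float32)

# Occupation block: a single 1 per row, in the user's occupation column
rows = np.arange(n_users)
cols = [list_occupations_unique.index(o) for o in list_occupations]
sm_occupation = sparse.csr_matrix(
    (np.ones(n_users, dtype=np.float32), (rows, cols)),
    shape=(n_users, n_occupations),
)

# Stack side by side, then scale every row so that it sums to 1
sm_features = sparse.hstack([sm_identity, sm_occupation], format="csr")
row_sums = np.asarray(sm_features.sum(axis=1)).ravel()
sm_features = sparse.diags(1.0 / row_sums) @ sm_features

print(sm_features.shape)  # (3, 5): 3 users + 2 occupations
print(sm_features.nnz)    # 6: two stored values per user
```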

Preparing item features

Similarly, we build the items' features:

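# Keep one row per movie so that each itemID is paired with its own genres
df_movies_uniq = df_data.drop_duplicates("itemID")
list_item_features = [(x, y) for x, y in zip(df_movies_uniq["itemID"], df_movies_uniq["genre"].str.split("|"))]
print(f"Length of item features: {len(list_item_features)}") # 352
list_item_features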
[(242, ['Comedy']),
(257, ['Action', 'Adventure', 'Comedy', 'Sci-Fi']),
(111, ['Comedy', 'Romance']),
(25, ['Comedy']),
(382, ['Comedy', 'Drama']), ... ]

We then pass this into LightFM's build_item_features(-) method:

sm_item_features = dataset.build_item_features(list_item_features)
sm_item_features
<352x370 sparse matrix of type '<class 'numpy.float32'>'
with 1097 stored elements in Compressed Sparse Row format>

Preparing interaction data

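Now that we've prepared the user and item data, we can prepare the interaction data using the build_interactions(-) method. It returns two sparse matrices of the same shape: sm_interactions, which marks the observed user-item pairs, and sm_weights, which holds the corresponding ratings: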

sm_interactions, sm_weights = dataset.build_interactions(df_data[["userID","itemID","rating"]].values)
sm_interactions
<943x352 sparse matrix of type '<class 'numpy.int32'>'
with 10987720 stored elements in COOrdinate format>
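
To illustrate the difference between the two matrices returned by build_interactions(-), the sketch below imitates them with SciPy on hypothetical toy triples. The dictionaries dict_user_index and dict_item_index are stand-ins for LightFM's internal ID mapping and are assumptions made for illustration only:

```python
import numpy as np
from scipy import sparse

# Hypothetical (userID, itemID, rating) triples
list_triples = [(196, 242, 3.0), (63, 242, 4.0), (196, 257, 5.0)]

# Stand-ins for the internal ID mappings (original ID -> row/column index)
dict_user_index = {196: 0, 63: 1}
dict_item_index = {242: 0, 257: 1}

rows = [dict_user_index[u] for u, _, _ in list_triples]
cols = [dict_item_index[i] for _, i, _ in list_triples]
ratings = [r for _, _, r in list_triples]
shape = (len(dict_user_index), len(dict_item_index))

# Interactions: 1 wherever a user rated an item
sm_toy_interactions = sparse.coo_matrix(
    (np.ones(len(list_triples), dtype=np.int32), (rows, cols)), shape=shape
)
# Weights: the rating itself at the same positions
sm_toy_weights = sparse.coo_matrix(
    (np.array(ratings, dtype=np.float32), (rows, cols)), shape=shape
)

print(sm_toy_interactions.toarray())  # rows: [1, 1] and [1, 0]
print(sm_toy_weights.toarray())       # rows: [3, 5] and [4, 0]
```

This guide trains on sm_interactions only. If the rating values should influence training, sm_weights could be supplied through the sample_weight argument of the model's fit(-) method.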

Internal mapping of IDs

Instead of using the original IDs of the users and items in our dataset, LightFM internally assigns a new consecutive non-negative integer ID to each user and item. We can see the mapping like so:

user_id_map, user_feature_map, item_id_map, item_feature_map = dataset.mapping()

The user_id_map is:

user_id_map
{196: 0,
63: 1,
226: 2, ...

The user_feature_map is:

user_feature_map
{196: 0,
63: 1,
...,
'healthcare': 962,
'marketing': 963}

Remember, LightFM also treats the ID of every user as a feature - this is why we see the ID of our users included in the user_feature_map. We will later use these mappings to perform predictions.
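
The mapping above is consistent with numbering the IDs in order of first appearance. The sketch below reproduces that idea in plain Python on hypothetical toy IDs, and also builds the reverse lookup, which is handy for translating internal indices back to original IDs. The names dict_user_to_index and dict_index_to_user are introduced here purely for illustration:

```python
import pandas as pd

# Hypothetical toy column of user IDs with repeats
series_user_ids = pd.Series([196, 196, 63, 226, 63])

# Number each distinct ID in order of first appearance
dict_user_to_index = {}
for user_id in series_user_ids.tolist():
    dict_user_to_index.setdefault(user_id, len(dict_user_to_index))

# Reverse lookup: internal index -> original ID
dict_index_to_user = {idx: uid for uid, idx in dict_user_to_index.items()}

print(dict_user_to_index)  # {196: 0, 63: 1, 226: 2}
print(dict_index_to_user)  # {0: 196, 1: 63, 2: 226}
```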

Evaluating performance

To evaluate the performance of our LightFM model, we first split the interaction data into training and testing sets using the library's random_train_test_split(-) method:

from lightfm.cross_validation import random_train_test_split

sm_train_interactions, sm_test_interactions = random_train_test_split(sm_interactions, test_percentage=0.2, random_state=42)
print(f"Shape of train interactions: {sm_train_interactions.shape}")
print(f"Shape of test interactions: {sm_test_interactions.shape}")
Shape of train interactions: (943, 352)
Shape of test interactions: (943, 352)
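
The two shapes are identical because the split assigns individual interactions, not users or items, to either side. A minimal sketch of the idea on a hypothetical toy matrix, using only NumPy and SciPy (this mimics the behavior and is not the library's code):

```python
import numpy as np
from scipy import sparse

# Hypothetical toy interaction matrix: 3 users x 4 items, 10 interactions
rng = np.random.RandomState(42)
sm_toy = sparse.coo_matrix(np.array([
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
], dtype=np.int32))

# Shuffle the positions of the stored interactions
idx_shuffled = rng.permutation(sm_toy.nnz)
n_test = int(0.2 * sm_toy.nnz)
idx_test, idx_train = idx_shuffled[:n_test], idx_shuffled[n_test:]

def take(sm, idx):
    # Keep only the selected interactions; the shape stays unchanged
    return sparse.coo_matrix(
        (sm.data[idx], (sm.row[idx], sm.col[idx])), shape=sm.shape
    )

sm_toy_train, sm_toy_test = take(sm_toy, idx_train), take(sm_toy, idx_test)
print(sm_toy_train.shape, sm_toy_test.shape)  # (3, 4) (3, 4)
print(sm_toy_train.nnz, sm_toy_test.nnz)      # 8 2
```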

It's finally time to fit the LightFM model:

from lightfm import LightFM

LEARNING_RATE = 0.25
NO_EPOCHS = 20
NO_COMPONENTS = 20 # Dimensionality of the latent embeddings
ITEM_ALPHA = 1e-6 # L2 regularization penalty on item features
USER_ALPHA = 1e-6 # L2 regularization penalty on user features

model = LightFM(loss="warp",
                no_components=NO_COMPONENTS,
                learning_rate=LEARNING_RATE,
                item_alpha=ITEM_ALPHA,
                user_alpha=USER_ALPHA,
                random_state=42)

model.fit(interactions=sm_train_interactions,
          user_features=sm_user_features,
          item_features=sm_item_features,
          epochs=NO_EPOCHS)

It took roughly 5 minutes to train the model on my M1 Mac. Let's now evaluate the performance (precision@k, where k defaults to 10) of our model using our testing data:

from lightfm.evaluation import precision_at_k

np_arr_prec = precision_at_k(model,
                             test_interactions=sm_test_interactions,
                             user_features=sm_user_features,
                             item_features=sm_item_features)

print(len(np_arr_prec)) # 943
np_arr_prec[:10]
array([0.3, 0.3, 0.1, 0.3, 0.1, 0.5, 0.1, 0.9, 0.7, 0. ], dtype=float32)

Here, we have the precision@k value for every user. We can compute the mean precision@k across all users like so:

np_arr_prec.mean()
0.38462353
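
For intuition, precision@k for a single user is the fraction of the top k ranked items that appear in that user's test interactions. Below is a small NumPy sketch with hypothetical scores and test items; k = 3 is used to keep the example short, whereas precision_at_k(-) defaults to k = 10:

```python
import numpy as np

# Hypothetical scores for 6 items for one user (higher = ranked earlier)
np_scores = np.array([0.1, 2.5, -1.0, 1.7, 0.3, 0.9])

# Hypothetical internal item indices the user interacted with in the test set
set_test_items = {1, 4, 5}

k = 3
# Indices of the k highest scoring items, best first
np_top_k = np.argsort(-np_scores)[:k]

# Precision@k: hits among the top k, divided by k
n_hits = sum(int(i) in set_test_items for i in np_top_k)
precision = n_hits / k

print(np_top_k)   # [1 3 5]
print(precision)  # 2 hits out of 3, roughly 0.667
```

With k = 10, each per-user value is a multiple of 0.1, which matches the array printed above.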

User-to-Item recommendation

Suppose we wanted to recommend 5 movies to a particular user with ID 63. We can use the predict method of our model like so:

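user_id = 63
# Note: no feature matrices are passed here, so occupation and genre do not enter these scores.
# To include them, also pass user_features=sm_user_features and item_features=sm_item_features.
list_scores = model.predict(user_id_map[user_id], list(item_id_map.values()))
print(len(list_scores)) # 352
list_scores[:5]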
array([-89.237495, -40.246788, -49.907825, -54.08462 , -99.96051 ], dtype=float32)

Note the following:

  • we convert user_id to the internal user ID used by LightFM with the user_id_map dictionary that we obtained earlier from dataset.mapping().

  • we supply a list of movie IDs that we want to obtain a recommendation score for. Since we are interested in finding the top k movie recommendations for this user, we need to compute the recommendation score for every movie.

  • the item_id_map is a dictionary that maps the original movie IDs to the non-negative consecutive integers used by LightFM internally:

    item_id_map
    {242: 0,
    257: 1,
    111: 2, ...

We convert the list of recommendation scores into a Series such that we can assign the original movie IDs as the index:

series_scores = pd.Series(list_scores, index=list(item_id_map.keys()))
print(len(series_scores)) # 352
series_scores[:5]
242 -89.237495
257 -40.246788
111 -49.907825
25 -54.084621
382 -99.960510
dtype: float32

Finally, we sort the scores in descending order to obtain the top movie recommendations for the user:

series_scores.sort_values(ascending=False, inplace=True)
series_scores[:5]
222 124.631828
1 21.564163
15 -17.863367
288 -31.532494
258 -34.471897
dtype: float32

Here, we see that the movie with ID 222 is the top recommended movie for this particular user. Note that the magnitude of the scores does not carry any meaning; the scores are used for ranking only.
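
As a compact variant, assuming a Series of scores indexed by original movie IDs like series_scores, pandas' nlargest(-) method returns the top k entries without sorting the Series in place. The scores below are made up for illustration:

```python
import pandas as pd

# Hypothetical scores indexed by original movie ID (values are made up)
series_toy_scores = pd.Series(
    [-89.2, -40.2, 12.5, -54.1, 3.3],
    index=[242, 257, 111, 25, 382],
)

# Top 2 movies by score, best first; the original Series is left unchanged
series_top_2 = series_toy_scores.nlargest(2)
print(series_top_2.index.tolist())  # [111, 382]
```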

Item-to-item recommendation

To obtain the vector embedding of each movie, we can use the get_item_representations(-) method:

_, np_item_embeddings = model.get_item_representations(features=sm_item_features)
print(np_item_embeddings.shape) # (352, 20)
np_item_embeddings[:2]
array([[-0.79904926, -0.85671186, 0.42553982, 0.994905 , -0.06102959,
-0.41155615, -0.64710814, 0.38948753, -0.16504961, -0.24440393,
-0.46848 , -0.00726059, -0.5730575 , -0.12569594, -0.84235895,
0.9981231 , -0.36846963, 0.0336417 , 0.1883249 , 0.7433187 ],
[ 1.3654532 , 1.4837279 , 1.0903912 , -0.14545436, -1.0986278 ,
0.08251551, -2.776378 , -0.20987356, 1.8015835 , 2.2055554 ,
-0.22924855, -3.5627067 , -0.35516343, -0.79560184, -1.4587665 ,
1.6426092 , 1.2299991 , 0.26629227, 0.877507 , -0.35510343]],
dtype=float32)

Note the following:

  • for every movie, we get a vector representation that encodes the characteristics of the movie.

  • the shape is 352 by 20 because there are 352 movies in total and we set the latent vector size (NO_COMPONENTS) to 20 during model fitting.

We compute the cosine similarity for each pair of movies:

from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

np_item_similarities = cosine_similarity(sparse.csr_matrix(np_item_embeddings))
print(np_item_similarities.shape) # (352, 352)
np_item_similarities[:2]
array([[ 9.99999940e-01, 9.83655304e-02, 8.39596242e-03,
-1.05853856e-01, 4.35901970e-01, -1.35404930e-01,
-7.39989057e-03, 5.28273523e-01, -5.74964844e-02, ...
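
Cosine similarity is the dot product of two vectors after each has been scaled to unit length, so a movie compared with itself always scores 1. A minimal NumPy sketch on hypothetical 2-dimensional embeddings:

```python
import numpy as np

# Hypothetical embeddings: 3 items, 2 latent dimensions
np_toy_embeddings = np.array([
    [1.0, 0.0],
    [2.0, 0.0],   # same direction as item 0, different length
    [0.0, 3.0],   # orthogonal to items 0 and 1
])

# Scale each row to unit length
np_norms = np.linalg.norm(np_toy_embeddings, axis=1, keepdims=True)
np_unit = np_toy_embeddings / np_norms

# Pairwise cosine similarities via one matrix product
np_toy_similarities = np_unit @ np_unit.T
print(np_toy_similarities)
# Items 0 and 1 have similarity 1, and both have similarity 0 with item 2
```

Because only the direction of a vector matters, items 0 and 1 are treated as identical even though their embeddings differ in length.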

It's good practice to convert this NumPy array into a DataFrame so that we can label the rows and columns with the original movie IDs:

df_item_similarities = pd.DataFrame(np_item_similarities)
df_item_similarities.columns = item_id_map.keys()
df_item_similarities.index = item_id_map.keys()
df_item_similarities
242 257 111 25 382 202 153 286 66 845 ... 1181
242 1.000000 0.098366 0.008396 -0.105854 0.435902 -0.135405 -0.007400 0.528274 -0.057496 0.247484 ... 0.113449
257 0.098366 1.000000 0.079299 0.134674 -0.236781 -0.060198 0.160672 -0.037351 -0.040263 0.233311 ... 0.121223
111 0.008396 0.079299 1.000000 0.555100 0.156462 0.340305 0.059644 0.220777 0.389159 0.734334 ... 0.113434
25 -0.500537 -0.442127 -0.427205 -0.404253 -0.395989 -0.377560 -0.364232 -0.344025 -0.341550 -0.338591 ... 1.000000
382 0.435902 -0.236781 0.156462 0.271259 1.000000 0.204918 0.178467 0.311732 0.174588 0.147333 ... 0.076015

Remember, the item_id_map is a dictionary obtained by dataset.mapping() from earlier that maps the original movie IDs to the non-negative consecutive integers used by LightFM internally:

item_id_map
{242: 0,
257: 1,
111: 2, ...

To get the top 5 recommendations for a particular movie, say the movie with ID 1049, we take the 6 highest similarities, since the first one is the movie itself:

int_movie_id = 1049
series_rec_movie_ids = df_item_similarities.loc[int_movie_id,:]
series_top_5_rec_movie_ids = series_rec_movie_ids.sort_values(ascending=False).head(6)
series_top_5_rec_movie_ids
1049 1.000000
832 0.786510
1047 0.745040
756 0.741784
930 0.729942
407 0.729872
Name: 1049, dtype: float32

As expected, the first value is the movie itself, which receives a perfect score. We see that the movie with ID 832 is the most recommended (similar) movie!
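
Because the movie itself always comes first, one option is to drop it before taking the top k. A sketch on a hypothetical 3-by-3 similarity DataFrame (the values and the name df_toy_similarities are made up for illustration):

```python
import pandas as pd

# Hypothetical similarity matrix indexed by original movie IDs
list_ids = [1049, 832, 1047]
df_toy_similarities = pd.DataFrame(
    [[1.00, 0.79, 0.75],
     [0.79, 1.00, 0.40],
     [0.75, 0.40, 1.00]],
    index=list_ids, columns=list_ids,
)

int_toy_movie_id = 1049
# Drop the movie's own entry, then keep the k most similar movies
series_similar = (
    df_toy_similarities.loc[int_toy_movie_id]
    .drop(int_toy_movie_id)
    .nlargest(2)
)
print(series_similar.index.tolist())  # [832, 1047]
```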

Users' vector embeddings

Not only has LightFM learned a vector embedding for each movie, it has also learned one for each user. To obtain the users' vector embeddings, use the get_user_representations(-) method:

_, np_user_embeddings = model.get_user_representations(features=sm_user_features)
print(np_user_embeddings.shape) # (943,20)
np_user_embeddings[:2]
array([[-0.917974 , -0.19056752, 0.804319 , 1.2637781 , -1.1266037 ,
-1.5947983 , 0.4530285 , 0.78684676, 1.6366841 , -0.55212635,
-0.48068386, -1.92954 , -2.35468 , 0.41447783, -1.2052028 ,
-0.7124101 , -0.64502287, -1.2504478 , 1.0433927 , 0.9680061 ],
[-1.6857581 , 0.64886266, -0.39518487, -2.0046077 , 2.8225665 ,
-0.59971595, -0.90769523, 0.50067604, 1.2015885 , 0.06845629,
0.0112996 , -1.1589563 , -0.62648225, 0.6836995 , -0.53056884,
2.109985 , -0.28536165, -1.1948149 , -0.03558412, 2.0381184 ]],
dtype=float32)
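
These representations are also what the recommendation scores are built from. In LightFM's model, the score for a (user, item) pair is the dot product of the user and item representations plus the user and item biases (the first value returned by get_user_representations(-) and get_item_representations(-), which this guide discards as _). A NumPy sketch with hypothetical values:

```python
import numpy as np

# Hypothetical representations: 2 users, 3 items, 2 latent dimensions
np_toy_user_emb = np.array([[1.0, 2.0], [0.5, -1.0]])
np_toy_user_bias = np.array([0.1, -0.2])
np_toy_item_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
np_toy_item_bias = np.array([0.0, 0.5, -1.0])

# Score matrix: dot products plus both bias terms (broadcast over rows/columns)
np_toy_scores = (
    np_toy_user_emb @ np_toy_item_emb.T
    + np_toy_user_bias[:, None]
    + np_toy_item_bias[None, :]
)
print(np_toy_scores)
# Approximately [[1.1, 2.6, 2.1], [0.3, -0.7, -1.7]]
```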

To find out how similar two users are, we can do exactly what we did for item-to-item recommendations, that is, compute the pairwise cosine similarity and sort the values in descending order:

np_user_similarities = cosine_similarity(sparse.csr_matrix(np_user_embeddings))
print(np_user_similarities.shape) # (943, 943)

df_user_similarities = pd.DataFrame(np_user_similarities)
df_user_similarities.columns = user_id_map.keys()
df_user_similarities.index = user_id_map.keys()

int_user_id = 196
df_user_similarities.loc[int_user_id,:].sort_values(ascending=False)[:6]
196 1.000000
485 0.668303
888 0.643664
306 0.595293
482 0.588159
336 0.586000
Name: 196, dtype: float32

Here, we see that the user with ID 485 is the most similar to the user with ID 196.

Published by Isshin Inada