Guide on using the LightFM package in Python
Suppose we wanted to build a movie recommendation system based on the following dataset:
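Throughout this guide, we assume the following imports (a minimal set covering pandas, NumPy, SciPy sparse matrices, scikit-learn's cosine similarity, and the relevant LightFM modules):

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
from lightfm import LightFM
from lightfm.data import Dataset
from lightfm.cross_validation import random_train_test_split
from lightfm.evaluation import precision_at_k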
df_data = pd.read_csv("./data.csv", index_col=0)
print(len(df_data)) # 10987720 rows
df_data
   userID  itemID  rating occupation   genre
0     196     242     3.0     writer  Comedy
1     196     242     3.0     writer  Comedy
2     196     242     3.0     writer  Comedy
3     196     242     3.0     writer  Comedy
4     196     242     3.0     writer  Comedy
Note the following:
- itemID refers to the movie ID.
- rating is a numeric value that ranges from 0 to 5.
- occupation is the profession of the user. A user can only have a single occupation.
- genre refers to the type of the movie. An item can have multiple genres (e.g. "Comedy|Musical").
Our goal is to build a movie recommendation system based on the following features:
- user-item interaction data - rating.
- user attribute data - occupation.
- item attribute data - genre.
Data preparation
Preparing LightFM Dataset
The first step is to prepare the dataset (Dataset) that we will feed into our LightFM model. There are two things we must supply:
- a list of unique user occupations.
- a list of unique movie genres.
The list of unique occupations can be obtained like so:
list_str_occupations_unique = list(df_data["occupation"].drop_duplicates())
print(len(list_str_occupations_unique)) # 21
list_str_occupations_unique
['writer', 'marketing', 'student', 'other', ... ]
The list of unique movie genres can be obtained like so:
series_genre_of_movies = df_data["genre"].str.split("|")
list_str_movie_genre_unique = list(set(np.concatenate(series_genre_of_movies).ravel()))
print(len(list_str_movie_genre_unique)) # 18
list_str_movie_genre_unique
['War', 'Animation', 'Sci-Fi', 'Comedy', 'Film-Noir', ... ]
Here, we are using the Series' str.split() method to obtain a list of genres for each row (e.g. "Comedy|Musical" becomes ["Comedy", "Musical"]).
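As a quick illustration of the split:

pd.Series(["Comedy|Musical"]).str.split("|")
# 0    [Comedy, Musical]
# dtype: object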
We can now build the LightFM dataset like so:
dataset = Dataset()
dataset.fit(users=df_data["userID"],
            items=df_data["itemID"],
            item_features=list_str_movie_genre_unique,
            user_features=list_str_occupations_unique)
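As a quick sanity check, we can confirm that the Dataset has registered the expected number of users and items using its interactions_shape() method (the counts below follow from the data described above):

num_users, num_items = dataset.interactions_shape()
print(num_users, num_items) # 943 352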
Preparing user features
Next, we must build the users' features. The input format expected by LightFM is as follows:
[(user_id_1, ['feature1']), (user_id_2, ['feature2']), ...]
We can obtain this input format like so:
df_data_with_unique_user_ids = df_data.drop_duplicates("userID")
list_user_features = [(x, [y]) for x, y in zip(df_data_with_unique_user_ids["userID"],
                                               df_data_with_unique_user_ids["occupation"])]
# print(len(list_user_features)) # 943
list_user_features
[(196, ['writer']), (63, ['marketing']), (226, ['student']), (154, ['student']), ... ]
We then pass this into the Dataset's build_user_features() method:
sm_user_features = dataset.build_user_features(list_user_features)
sm_user_features
<943x964 sparse matrix of type '<class 'numpy.float32'>' with 1886 stored elements in Compressed Sparse Row format>
Here, the shape of sm_user_features is 943 by 964. This is because there are 943 unique users and each userID is treated as a user attribute. We also have 21 unique occupations, which means there are a total of 943 + 21 = 964 user attributes.
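As a quick check of this structure: every row of sm_user_features should have exactly two non-zero entries, one for the user's own ID and one for their occupation, which matches the 1886 (943 × 2) stored elements reported above:

print(sm_user_features.shape) # (943, 964)
print(sm_user_features.getnnz(axis=1)[:5]) # [2 2 2 2 2]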
Preparing item features
Similarly, we build the items' features:
# Keep one row per movie so that each itemID is paired with its own genres
df_movies_uniq = df_data.drop_duplicates("itemID")
list_item_features = [(x, y) for x, y in zip(df_movies_uniq["itemID"],
                                             df_movies_uniq["genre"].str.split("|"))]
print(f"Length of item features: {len(list_item_features)}") # 352
list_item_features
[(242, ['Comedy']), (257, ['Action', 'Adventure', 'Comedy', 'Sci-Fi']), (111, ['Comedy', 'Romance']), (25, ['Comedy']), (382, ['Comedy', 'Drama']), ... ]
We then pass this into the Dataset's build_item_features() method:
sm_item_features = dataset.build_item_features(list_item_features)
sm_item_features
<352x370 sparse matrix of type '<class 'numpy.float32'>' with 1097 stored elements in Compressed Sparse Row format>
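Here, the shape of sm_item_features is 352 by 370 for the same reason as before: each of the 352 movie IDs is treated as an item attribute, plus the 18 unique genres (352 + 18 = 370). The 1097 stored elements correspond to one identity entry per movie plus one entry per assigned genre. We can confirm the shape directly using the Dataset's item_features_shape() method:

print(dataset.item_features_shape()) # (352, 370)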
Preparing interaction data
Now that we've prepared the user and item data, we can move on to preparing the interaction data by using the Dataset's build_interactions() method:
sm_interactions, sm_weights = dataset.build_interactions(df_data[["userID","itemID","rating"]].values)
sm_interactions
<943x352 sparse matrix of type '<class 'numpy.int32'>' with 10987720 stored elements in COOrdinate format>
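Note that build_interactions() returns two sparse matrices: sm_interactions, which records which user-item pairs were observed, and sm_weights, which holds the corresponding rating values. We won't use the weights in the basic fit below, but they can optionally be passed to LightFM via the sample_weight argument of fit() (see the sketch after the model-fitting step).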
Internal mapping of IDs
Instead of using the original IDs of the users and items in our dataset, LightFM internally assigns a new consecutive non-negative integer ID to each user and item. We can see the mapping like so:
user_id_map, user_feature_map, item_id_map, feature_item_map = dataset.mapping()
The user_id_map is:
user_id_map
{196: 0, 63: 1, 226: 2, ...
The user_feature_map is:
user_feature_map
{196: 0, 63: 1, ..., 'healthcare': 962, 'marketing': 963}
Remember, LightFM also treats the ID of every user as a feature - this is why we see the ID of our users included in the user_feature_map. We will later use these mappings to perform predictions.
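As a small illustration (using the mappings shown above), here is how to translate between original and internal IDs, and how to build an inverse mapping for decoding predictions later:

print(user_id_map[196]) # 0
print(item_id_map[242]) # 0
# Build the inverse mapping (internal ID -> original movie ID),
# which is handy for turning predictions back into movie IDs:
inv_item_id_map = {v: k for k, v in item_id_map.items()}
print(inv_item_id_map[0]) # 242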
Evaluating performance
Since we want to evaluate the performance of our LightFM model, we will use the library's random_train_test_split() function:
sm_train_interactions, sm_test_interactions = random_train_test_split(sm_interactions,
                                                                      test_percentage=0.2,
                                                                      random_state=42)
print(f"Shape of train interactions: {sm_train_interactions.shape}")
print(f"Shape of test interactions: {sm_test_interactions.shape}")
Shape of train interactions: (943, 352)
Shape of test interactions: (943, 352)
It's finally time to fit the LightFM model:
LEARNING_RATE = 0.25
NO_EPOCHS = 20
NO_COMPONENTS = 20 # Number of latent factors
ITEM_ALPHA = 1e-6 # Regularization factor for item features
USER_ALPHA = 1e-6 # Regularization factor for user features
model = LightFM(loss="warp",
                no_components=NO_COMPONENTS,
                learning_rate=LEARNING_RATE,
                item_alpha=ITEM_ALPHA,
                user_alpha=USER_ALPHA,
                random_state=42)
model.fit(interactions=sm_train_interactions,
          user_features=sm_user_features,
          item_features=sm_item_features,
          epochs=NO_EPOCHS)
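As hinted earlier, the ratings stored in sm_weights can optionally be used during training. Here is a minimal sketch (not the setup used for the results below): split the weights with the same random_state so they line up with the training interactions, then pass them via fit()'s sample_weight argument (supported for the "warp" loss, though not for "warp-kos"):

# Split the weights exactly like the interactions (same random_state)
sm_train_weights, _ = random_train_test_split(sm_weights,
                                              test_percentage=0.2,
                                              random_state=42)
model_weighted = LightFM(loss="warp",
                         no_components=NO_COMPONENTS,
                         learning_rate=LEARNING_RATE,
                         item_alpha=ITEM_ALPHA,
                         user_alpha=USER_ALPHA,
                         random_state=42)
model_weighted.fit(interactions=sm_train_interactions,
                   sample_weight=sm_train_weights,
                   user_features=sm_user_features,
                   item_features=sm_item_features,
                   epochs=NO_EPOCHS)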
It took me roughly 5 minutes to train the (unweighted) model above on my M1 Mac. Let's now evaluate the performance (precision@k) of our model using our testing data:
np_arr_prec = precision_at_k(model,
                             test_interactions=sm_test_interactions,
                             user_features=sm_user_features,
                             item_features=sm_item_features)
print(len(np_arr_prec)) # 943
np_arr_prec[:10]
array([0.3, 0.3, 0.1, 0.3, 0.1, 0.5, 0.1, 0.9, 0.7, 0. ], dtype=float32)
Here, we have the precision@k value for every user. We can compute the mean precision@k across all users like so:
np_arr_prec.mean()
0.38462353
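Note that precision_at_k() defaults to k=10. We can set k explicitly, and optionally pass the training interactions so that items a user already interacted with during training are excluded from the ranking:

np_arr_prec_5 = precision_at_k(model,
                               test_interactions=sm_test_interactions,
                               train_interactions=sm_train_interactions,
                               k=5,
                               user_features=sm_user_features,
                               item_features=sm_item_features)
print(np_arr_prec_5.mean())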
User-to-item recommendation
Suppose we wanted to recommend 5 movies to a particular user with ID 63. We can use the predict() method of our model like so:
user_id = 63
list_scores = model.predict(user_id_map[user_id], list(item_id_map.values()))
print(len(list_scores)) # 352
list_scores[:5]
array([-89.237495, -40.246788, -49.907825, -54.08462 , -99.96051 ], dtype=float32)
Note the following:
- we convert user_id to the internal user ID used by LightFM using the user_id_map dictionary that we obtained earlier from dataset.mapping().
- we supply a list of movie IDs that we want to obtain a recommendation score for. Since we are interested in finding the top k movie recommendations for this user, we compute the recommendation score for every movie.
- the item_id_map is a dictionary that maps the original movie IDs to the non-negative consecutive integers used by LightFM internally:

item_id_map

{242: 0, 257: 1, 111: 2, ...
We convert the list of recommendation scores into a Series such that we can assign the original movie IDs as the index:
series_scores = pd.Series(list_scores)
series_scores.index = item_id_map.keys()
print(len(list_scores)) # 352
series_scores[:5]
242   -89.237495
257   -40.246788
111   -49.907825
25    -54.084621
382   -99.960510
dtype: float32
Finally, we sort the scores in descending order to obtain the top movie recommendations for the user:
series_scores.sort_values(ascending=False, inplace=True)
series_scores[:5]
222    124.631828
1       21.564163
15     -17.863367
288    -31.532494
258    -34.471897
dtype: float32
Here, we see that the movie with ID 222 is the top recommended movie for this particular user. Note that the magnitude of the scores does not carry any meaning - they are used for ranking purposes only.
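We can wrap these steps into a small helper. This is a sketch rather than part of the original walkthrough; note that it also passes the user and item feature matrices to predict(), so that the occupation and genre features contribute to the scores just as they did during training:

def recommend_top_n(model, user_id, n=5):
    # Score every movie for this user, using the same features as training
    scores = model.predict(user_id_map[user_id],
                           list(item_id_map.values()),
                           user_features=sm_user_features,
                           item_features=sm_item_features)
    series = pd.Series(scores, index=item_id_map.keys())
    return series.sort_values(ascending=False).head(n)

recommend_top_n(model, 63, n=5)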
Item-to-item recommendation
To obtain the vector embedding of each movie, we can use the get_item_representations() method:
_, np_item_embeddings = model.get_item_representations(features=sm_item_features)
print(np_item_embeddings.shape) # (352, 20)
np_item_embeddings[:2]
array([[-0.79904926, -0.85671186, 0.42553982, 0.994905 , -0.06102959, -0.41155615, -0.64710814, 0.38948753, -0.16504961, -0.24440393, -0.46848 , -0.00726059, -0.5730575 , -0.12569594, -0.84235895, 0.9981231 , -0.36846963, 0.0336417 , 0.1883249 , 0.7433187 ], [ 1.3654532 , 1.4837279 , 1.0903912 , -0.14545436, -1.0986278 , 0.08251551, -2.776378 , -0.20987356, 1.8015835 , 2.2055554 , -0.22924855, -3.5627067 , -0.35516343, -0.79560184, -1.4587665 , 1.6426092 , 1.2299991 , 0.26629227, 0.877507 , -0.35510343]], dtype=float32)
Note the following:
- for every movie, we get a vector representation that encodes the characteristics of the movie.
- the shape is 352 by 20 because there are 352 movies in total and we set the latent vector size (NO_COMPONENTS) to 20 during model fitting.
We compute the cosine similarity for each pair of movies:
np_item_similarities = cosine_similarity(sparse.csr_matrix(np_item_embeddings))
print(np_item_similarities.shape) # (352, 352)
np_item_similarities[:2]
array([[ 9.99999940e-01, 9.83655304e-02, 8.39596242e-03, -1.05853856e-01, 4.35901970e-01, -1.35404930e-01, -7.39989057e-03, 5.28273523e-01, -5.74964844e-02, ...
It's good practice to convert this NumPy array into a DataFrame so that we can assign the movie IDs to the rows and columns:
df_item_similarities = pd.DataFrame(np_item_similarities)
df_item_similarities.columns = item_id_map.keys()
df_item_similarities.index = item_id_map.keys()
df_item_similarities
          242       257       111        25       382       202       153       286        66       845  ...      1181
242  1.000000  0.098366  0.008396 -0.105854  0.435902 -0.135405 -0.007400  0.528274 -0.057496  0.247484  ...  0.113449
257  0.098366  1.000000  0.079299  0.134674 -0.236781 -0.060198  0.160672 -0.037351 -0.040263  0.233311  ...  0.121223
111  0.008396  0.079299  1.000000  0.555100  0.156462  0.340305  0.059644  0.220777  0.389159  0.734334  ...  0.113434
25  -0.500537 -0.442127 -0.427205 -0.404253 -0.395989 -0.377560 -0.364232 -0.344025 -0.341550 -0.338591  ...  1.000000
382  0.435902 -0.236781  0.156462  0.271259  1.000000  0.204918  0.178467  0.311732  0.174588  0.147333  ...  0.076015
Remember, the item_id_map is a dictionary obtained from dataset.mapping() earlier that maps the original movie IDs to the non-negative consecutive integers used by LightFM internally:
item_id_map
{242: 0, 257: 1, 111: 2, ...
To get the top 5 recommendations for a particular movie, say the movie with ID 1049:
int_movie_id = 1049
series_rec_movie_ids = df_item_similarities.loc[int_movie_id, :]
series_top_5_rec_movie_ids = series_rec_movie_ids.sort_values(ascending=False).head(6)
series_top_5_rec_movie_ids
1049    1.000000
832     0.786510
1047    0.745040
756     0.741784
930     0.729942
407     0.729872
Name: 1049, dtype: float32
As expected, the first value is the movie itself, which receives a perfect score. We see that the movie with ID 832 is the most recommended (similar) movie!
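Again, we can package this lookup as a small helper (a sketch; it simply sorts one row of the similarity DataFrame and drops the item itself):

def top_similar(df_similarities, some_id, k=5):
    # Sort one row of the pairwise-similarity DataFrame and skip the
    # first entry, which is the item itself (similarity of 1.0)
    return df_similarities.loc[some_id, :].sort_values(ascending=False).iloc[1:k+1]

top_similar(df_item_similarities, 1049, k=5)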
Users' vector embeddings
Not only has LightFM learned the vector embedding of each movie, it has also learned one for each user. To obtain the users' vector embeddings, use the get_user_representations() method:
_, np_user_embeddings = model.get_user_representations(features=sm_user_features)
print(np_user_embeddings.shape) # (943, 20)
np_user_embeddings[:2]
array([[-0.917974 , -0.19056752, 0.804319 , 1.2637781 , -1.1266037 , -1.5947983 , 0.4530285 , 0.78684676, 1.6366841 , -0.55212635, -0.48068386, -1.92954 , -2.35468 , 0.41447783, -1.2052028 , -0.7124101 , -0.64502287, -1.2504478 , 1.0433927 , 0.9680061 ], [-1.6857581 , 0.64886266, -0.39518487, -2.0046077 , 2.8225665 , -0.59971595, -0.90769523, 0.50067604, 1.2015885 , 0.06845629, 0.0112996 , -1.1589563 , -0.62648225, 0.6836995 , -0.53056884, 2.109985 , -0.28536165, -1.1948149 , -0.03558412, 2.0381184 ]], dtype=float32)
To find out how similar two users are, we can do exactly what we did for item-to-item recommendations, that is, compute the pairwise cosine similarity and sort the values in descending order:
np_user_similarities = cosine_similarity(sparse.csr_matrix(np_user_embeddings))
print(np_user_similarities.shape) # (943, 943)
df_user_similarities = pd.DataFrame(np_user_similarities)
df_user_similarities.columns = user_id_map.keys()
df_user_similarities.index = user_id_map.keys()
int_user_id = 196
df_user_similarities.loc[int_user_id, :].sort_values(ascending=False)[:6]
196    1.000000
485    0.668303
888    0.643664
306    0.595293
482    0.588159
336    0.586000
Name: 196, dtype: float32
Here, we see that the user with ID 485 is the most similar to user 196.
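The same top_similar helper from the item-to-item section works here as well, e.g. top_similar(df_user_similarities, 196, k=5).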