Recommendation Systems

Basics

A recommendation system starts from two distinct data sets that refer to different kinds of entities (e.g., consumers and products, users and movies, etc.). The objective of the model is to transform both into vectors (embeddings) that live in the same vector space; that is, the distance between a vector from “base A” and a vector from “base B” has some meaning.

Note: This cannot be achieved with the tabular models (SelfSupervised or Supervised) by training one model per database, because models that are not trained together give no guarantee that the distance between their vectors carries any meaning beyond the purely mathematical.
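
To make the idea concrete, here is a minimal sketch, outside of JAI and with made-up numbers, of why a shared vector space matters: once a user vector and movie vectors live in the same space, a plain Euclidean distance already ranks movies for that user. In practice the embeddings come out of the trained model, not from us.

import numpy as np

# Hypothetical embeddings that already live in the same vector space.
# In practice these come from the trained recommendation model.
user_vec = np.array([0.1, 0.9, -0.3])            # one user ("base A")
movie_vecs = np.array([[0.2, 0.8, -0.2],         # three movies ("base B")
                       [-0.7, 0.1, 0.5],
                       [0.0, 1.0, -0.4]])

# Smaller distance = stronger affinity between this user and that movie.
distances = np.linalg.norm(movie_vecs - user_vec, axis=1)
ranking = np.argsort(distances)  # movie indices, closest first
print(distances, ranking)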

The two data sets are connected by a third, “link” base. (In the MovieLens case, user and movie data are related by the rating each user gave to each movie.) Two cases are worth distinguishing:

  • with label: the link base has a numeric label that indicates how close each pair of records is (e.g., a rating).

  • without label: the link base only records which pairs are related, with no explicit proximity value (see the sketch right after this list).
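
As a minimal sketch of both cases, with illustrative values in the MovieLens format (only the labeled variant is used in this notebook):

import pandas as pd

# "With label": each user-movie pair carries a numeric proximity value (the rating).
link_with_label = pd.DataFrame(
    {"userId": [1, 1, 2], "movieId": [296, 306, 296], "rating": [5.0, 3.5, 4.0]}
)

# "Without label": only the fact that a pair is related is recorded.
link_no_label = link_with_label[["userId", "movieId"]]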

Therefore, only two sets of embedding vectors are generated, since the link base is used only for training (it determines which relations are close and which are not). To keep track of all operations, training creates three databases:

  • the link base generates a “type 1” database (“RecommendationSystem”). It holds no vectors and no model, so the consumption operations (similar, recommendation or prediction) cannot be used on it.

  • the twin bases each generate a “type 2” database (“Recommendation”). They are produced by training the RecommendationSystem base, so they cannot be created directly from a setup, and they support the similar and recommendation consumption operations.

[1]:
import pandas as pd

# Let's take a sample here
SAMPLE_SIZE = 10000

# Import the MovieLens 25M dataset
# https://grouplens.org/datasets/movielens/
# Don't forget to change the file path
raw_movies = pd.read_csv("data/ml-25m/movies.csv")
raw_movies.head()
[1]:
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
[2]:
# If you wish to use the full dataset, just remove the nrows argument
raw_ratings = pd.read_csv("data/ml-25m/ratings.csv", nrows=SAMPLE_SIZE)
raw_ratings.head()
[2]:
userId movieId rating timestamp
0 1 296 5.0 1147880044
1 1 306 3.5 1147868817
2 1 307 5.0 1147868828
3 1 665 5.0 1147878820
4 1 899 3.5 1147868510
[3]:
from jai.utilities import split

# Keep only the movies that appear in the sampled ratings
movies = raw_movies.loc[raw_movies["movieId"].isin(raw_ratings.loc[:, "movieId"].to_numpy())].copy()

# Here we preprocess the data to be inserted into jai
# We split the genres and title columns to be processed as text databases
# Furthermore, the genres column contains multiple values for each movie, so we specify a separator.

bases, movies = split(movies, {"genres": "|", "title": None})
movies.index = movies["movieId"]

movies.head()
[3]:
movieId id_genres id_title
movieId
1 1 [0, 1, 2, 3, 4] 0
2 2 [0, 2, 4] 1
3 3 [3, 5] 2
5 5 [3] 3
6 6 [6, 7, 8] 4
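
If you want to check what split produced before sending it to JAI, the returned text bases can be previewed like any other pandas object (this assumes bases is a dict of pandas objects keyed by the split columns, as the fit calls below suggest; not executed here):

# Preview the text bases produced by split (assumed to be pandas objects).
print(bases["title"].head())
print(bases["genres"].head())
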
[4]:
# We drop the timestamp column because we won't use it in this example
ratings = raw_ratings.drop("timestamp", axis=1)
# Then we create the user dataframe from the ratings dataframe.
users = pd.DataFrame(
    {"userId": ratings["userId"].unique().astype(str)}, index=ratings["userId"].unique()
)
users.head()
[4]:
userId
1 1
2 2
3 3
4 4
5 5
[5]:
from jai import Trainer

# Create a movie titles collection in Jai
trainer = Trainer(name="movie_titles")
trainer.set_parameters(db_type="Text")
q_title = trainer.fit(bases["title"], overwrite=True)

Recognized fit arguments:
- db_type: Text
Insert Data: 100%|██████████| 1/1 [00:00<00:00,  2.67it/s]

Recognized fit arguments:
- db_type: Text
JAI is working: 100%|██████████|12/12 [00:25]
[6]:
# Create a movie genres collection in Jai
trainer = Trainer(name="genre")
trainer.set_parameters(db_type="Text")
q_genre = trainer.fit(bases["genres"], overwrite=True)

Recognized fit arguments:
- db_type: Text
Insert Data: 100%|██████████| 1/1 [00:00<00:00,  4.87it/s]

Recognized fit arguments:
- db_type: Text
JAI is working: 100%|██████████|12/12 [00:03]
[7]:
# Create the recommendation system
trainer = Trainer(name="ratings")
trainer.set_parameters(
    db_type="RecommendationSystem",
    # we explicitly declare userId as a category, because it is numeric data that should be treated as categorical.
    features={"userId": {"name": "userId", "dtype": "category"}},
    pretrained_bases=[
        # we must declare the tower databases as parents even though they don't exist yet
        {"id_name": "movieId", "db_parent": "movies"},
        {"id_name": "userId", "db_parent": "users"},
        # the text collections we just preprocessed are declared as parents too.
        {"id_name": "id_title", "db_parent": "movie_titles"},
        {"id_name": "id_genres", "db_parent": "genre"},
    ],
    label={"label_name": "rating"},
)

Recognized fit arguments:
- db_type: RecommendationSystem
- features:
  * userId:
    - name: userId
    - dtype: category
- pretrained_bases:
  * id_name: movieId
    db_parent: movies
  * id_name: userId
    db_parent: users
  * id_name: id_title
    db_parent: movie_titles
  * id_name: id_genres
    db_parent: genre
- label:
  * label_name: rating
[8]:
# Let's check the parameters
trainer.fit_parameters
[8]:
{'db_type': 'RecommendationSystem',
 'hyperparams': {'check_val_every_n_epoch': 1,
  'gradient_clip_val': 0.0,
  'gradient_clip_algorithm': 'norm',
  'min_epochs': 15,
  'max_epochs': 500,
  'patience': 10,
  'min_delta': 1e-05,
  'random_seed': 42,
  'split': {'type': 'random', 'split_column': '', 'test_size': 0.2, 'gap': 0},
  'swa_parameters': {'swa_lrs': None,
   'swa_epoch_start': 0.8,
   'annealing_epochs': 10,
   'annealing_strategy': 'cos'},
  'pruning_method': 'l1_unstructured',
  'pruning_amount': 0,
  'batch_size': 512,
  'learning_rate': 0.001,
  'base_left': '',
  'base_right': '',
  'model': {'encoder_layer': '2LM',
   'hidden_latent_dim': 64,
   'dropout_rate': 0.1,
   'momentum': 0.1,
   'normalize': False}},
 'features': {'userId': {'name': 'userId',
   'dtype': 'category',
   'embedding_dim': 32,
   'ncats': 0,
   'fill_value': '_other',
   'min_freq': 1}},
 'pretrained_bases': [{'db_parent': 'movies',
   'id_name': 'movieId',
   'embedding_dim': 128,
   'aggregation_method': 'sum'},
  {'db_parent': 'users',
   'id_name': 'userId',
   'embedding_dim': 128,
   'aggregation_method': 'sum'},
  {'db_parent': 'movie_titles',
   'id_name': 'id_title',
   'embedding_dim': 128,
   'aggregation_method': 'sum'},
  {'db_parent': 'genre',
   'id_name': 'id_genres',
   'embedding_dim': 128,
   'aggregation_method': 'sum'}],
 'label': {'label_name': 'rating', 'label_scaler': 'Standard'}}
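
All values above are defaults filled in by JAI. If you need different ones (say, a smaller max_epochs or batch_size), they can presumably be passed back through set_parameters before fitting; the sketch below assumes set_parameters accepts a hyperparams dict mirroring the structure shown above (check the JAI docs for the exact argument) and is not executed in this notebook:

# Assumption: set_parameters accepts a `hyperparams` dict mirroring the structure
# shown in `trainer.fit_parameters` above; verify the exact argument in the JAI docs.
# Not executed here -- the defaults above are kept for the actual fit.
trainer.set_parameters(
    db_type="RecommendationSystem",
    hyperparams={"max_epochs": 100, "batch_size": 256},
    features={"userId": {"name": "userId", "dtype": "category"}},
    pretrained_bases=[
        {"id_name": "movieId", "db_parent": "movies"},
        {"id_name": "userId", "db_parent": "users"},
        {"id_name": "id_title", "db_parent": "movie_titles"},
        {"id_name": "id_genres", "db_parent": "genre"},
    ],
    label={"label_name": "rating"},
)
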
[9]:
# Creating the databases
queries = trainer.fit(
    {"users": users, "movies": movies, "main": ratings},
    overwrite=True,
)
Insert Data: 100%|██████████| 1/1 [00:00<00:00,  8.34it/s]
Insert Data: 100%|██████████| 1/1 [00:00<00:00,  5.30it/s]
Insert Data: 100%|██████████| 1/1 [00:00<00:00,  4.23it/s]

Recognized fit arguments:
- db_type: RecommendationSystem
- features:
  * userId:
    - name: userId
    - dtype: category
- pretrained_bases:
  * id_name: movieId
    db_parent: movies
  * id_name: userId
    db_parent: users
  * id_name: id_title
    db_parent: movie_titles
  * id_name: id_genres
    db_parent: genre
- label:
  * label_name: rating
JAI is working: 100%|██████████|24/24 [00:16]

Setup Report:

Best model at epoch: 04 val_loss: 0.87
[10]:
# Consuming the recommendation database
# since we're querying the movies collection, the returned ids are movie ids
# We input user information to get recommended movies
user_ids = [1]
r = queries["movies"].recommendation(user_ids, orient='flat')

# merge the results with the original movie data
dfr = pd.DataFrame(r)
dfr = dfr.merge(raw_movies,
                how="left",
                left_on="id",
                right_on="movieId")
dfr
Recommendation: 100%|██████████| 1/1 [00:00<00:00,  3.18it/s]
[10]:
query_id id distance movieId title genres
0 1 5349 1.246595 5349 Spider-Man (2002) Action|Adventure|Sci-Fi|Thriller
1 1 480 1.255526 480 Jurassic Park (1993) Action|Adventure|Sci-Fi|Thriller
2 1 6333 1.258553 6333 X2: X-Men United (2003) Action|Adventure|Sci-Fi|Thriller
3 1 3703 1.266036 3703 Road Warrior, The (Mad Max 2) (1981) Action|Adventure|Sci-Fi|Thriller
4 1 849 1.266751 849 Escape from L.A. (1996) Action|Adventure|Sci-Fi|Thriller
[11]:
# It is also possible to make the recommendation the other way around
# We input movies information to get recommended users
movie_ids = [508]
r = queries["users"].recommendation(movie_ids, orient='flat')
Recommendation: 100%|██████████| 1/1 [00:00<00:00,  5.10it/s]
[12]:
# Using the similar operation on the database
# since we're querying the users collection, the returned ids are user ids
s = queries["users"].similar(user_ids, orient='flat')
pd.DataFrame(s)
Similar: 100%|██████████| 1/1 [00:00<00:00,  7.13it/s]
[12]:
query_id id distance
0 1 1 0.000000
1 1 44 0.768691
2 1 27 0.831203
3 1 21 0.856622
4 1 25 0.931707
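
The movies side of the twin output supports the same similar operation. As a final sketch using the same Query API as above (movieId 480, Jurassic Park, appears in the sampled data), the movies closest to a given movie in the learned space could be retrieved like this:

# Find movies closest to a given movie in the learned embedding space.
s_movies = queries["movies"].similar([480], orient="flat")
pd.DataFrame(s_movies).merge(raw_movies, how="left", left_on="id", right_on="movieId")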