Self-aggregation on MovieLens#

MovieLens is a well-known movie dataset used for both explicit and implicit recommender systems. Its main table, “ratings”, can be viewed as a log of transactions, comprising only 4 columns: userId, movieId, rating and timestamp. MovieLens also provides a contextual table, “movies”, with the columns movieId, title and genres, to enable content-based feature extraction.

From the perspective of machine-learning pipelines, one challenge is to transform the transaction log into features that can be fed to supervised learning.

In this notebook, we only deal with the main table, “ratings”. Our objective is not to achieve state-of-the-art performance on this explicit-rating regression task, but rather to illustrate how to perform feature engineering in a simple way with AggJoiner and AggTarget. Note that our performance still exceeds the baseline of predicting the mean rating per movie.

The benefit of AggJoiner and AggTarget is that they readily provide a full pipeline, from the original tables to the prediction, that can be cross-validated or applied to new data to serve predictions. At the end of this example, we showcase hyper-parameter optimization over the whole pipeline.

The data#

We begin by loading the ratings table from MovieLens. Note that we use the light version of the dataset (100k rows).

import pandas as pd

from skrub.datasets import fetch_movielens


ratings = fetch_movielens(dataset_id="ratings")
ratings = ratings.X.sort_values("timestamp").reset_index(drop=True)
ratings["timestamp"] = pd.to_datetime(ratings["timestamp"], unit="s")

X = ratings[["userId", "movieId", "timestamp"]]
y = ratings["rating"]
X.shape, y.shape
((100836, 3), (100836,))
   userId  movieId           timestamp
0     429       22 1996-03-29 18:36:55
1     429      150 1996-03-29 18:36:55
2     429      161 1996-03-29 18:36:55
3     429      165 1996-03-29 18:36:55
4     429      218 1996-03-29 18:36:55


Encoding the timestamp with a TableVectorizer#

Our first step is to extract features from the timestamp, using the TableVectorizer. By default, it applies the DatetimeEncoder to datetime columns and leaves numerical columns untouched.
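
The encoding step is sketched below, using the same configuration as the TableVectorizer that appears in the pipeline representation further down (a DatetimeEncoder with the day of the week added); exact parameter names may vary between skrub versions.

from skrub import DatetimeEncoder, TableVectorizer

table_vectorizer = TableVectorizer(
    datetime_transformer=DatetimeEncoder(add_day_of_the_week=True),
)

# Extract year, month, day, hour, total seconds and day of the week
# from the "timestamp" column.
X_date_encoded = table_vectorizer.fit_transform(X)
X_date_encoded.head()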

   userId  movieId  timestamp_year  timestamp_month  timestamp_day  timestamp_hour  timestamp_total_seconds  timestamp_day_of_week
0   429.0     22.0          1996.0              3.0           29.0            18.0              828124615.0                    4.0
1   429.0    150.0          1996.0              3.0           29.0            18.0              828124615.0                    4.0
2   429.0    161.0          1996.0              3.0           29.0            18.0              828124615.0                    4.0
3   429.0    165.0          1996.0              3.0           29.0            18.0              828124615.0                    4.0
4   429.0    218.0          1996.0              3.0           29.0            18.0              828124615.0                    4.0


We can now make a couple of plots and gain some insight into our dataset.

from matplotlib import pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")


def make_barplot(x, y, title):
    norm = plt.Normalize(y.min(), y.max())
    cmap = plt.get_cmap("magma")

    # Assign x to hue with legend=False and pass the palette as a list,
    # as recommended by recent seaborn versions.
    sns.barplot(x=x, y=y, hue=x, palette=list(cmap(norm(y))), legend=False)
    plt.title(title)
    plt.xticks(rotation=30)
    plt.ylabel(None)
    plt.tight_layout()


# 0 is Monday, 6 is Sunday

daily_volume = X_date_encoded["timestamp_day_of_week"].value_counts().sort_index()

make_barplot(
    x=daily_volume.index,
    y=daily_volume.values,
    title="Daily volume of ratings",
)
Daily volume of ratings

We also display the distribution of our target y.

rating_count = y.value_counts().sort_index()

make_barplot(
    x=rating_count.index,
    y=rating_count.values,
    title="Distribution of ratings given to movies",
)
Distribution of ratings given to movies

AggTarget: aggregate y, then join#

We have just extracted datetime features from timestamps.

Let’s now expand the target y by aggregating it before joining it back onto the main table. The biggest risk of doing this target expansion yourself, with a series of dataframe operations, is to end up leaking the target.

To avoid this, the AggTarget transformer aggregates the target y and joins it onto the main table without any risk of leakage. Note that to aggregate then join on the features X instead, you need to use AggJoiner.

You can also think of it as a generalization of the TargetEncoder, which encodes categorical features based on the target.
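
For instance (a small sketch reusing the AggTarget API shown just below), aggregating y with the "mean" operation per userId attaches to each row the average rating given by that user, much like a TargetEncoder applied to the userId column:

from skrub import AggTarget

# Mean rating per user, joined back onto each row of X.
agg_target_mean = AggTarget(main_key="userId", suffix="_user", operation="mean")
agg_target_mean.fit_transform(X, y).head()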

We first focus on aggregating the target by user; just below, we also aggregate it by movie. Here, we compute a histogram of the target with 3 bins, before joining it back onto the initial table.

These features answer questions like “How many times has this user given a bad, medium or good rating to movies?”.

from skrub import AggTarget


agg_target_user = AggTarget(
    main_key="userId",
    suffix="_user",
    operation="hist(3)",
)
X_transformed = agg_target_user.fit_transform(X, y)

X_transformed.shape
(100836, 6)
   userId  movieId           timestamp  rating_(0.499, 2.0]_user  rating_(2.0, 3.5]_user  rating_(3.5, 5.0]_user
0     429       22 1996-03-29 18:36:55                         2                      14                      42
1     429      150 1996-03-29 18:36:55                         2                      14                      42
2     429      161 1996-03-29 18:36:55                         2                      14                      42
3     429      165 1996-03-29 18:36:55                         2                      14                      42
4     429      218 1996-03-29 18:36:55                         2                      14                      42


Similarly, we now aggregate and join on movieId instead of userId.

These features answer questions like “How many times has this movie received a bad, medium or good rating from users?”.

agg_target_movie = AggTarget(
    main_key="movieId",
    suffix="_movie",
    operation="hist(3)",
)
X_transformed = agg_target_movie.fit_transform(X, y)
X_transformed.shape
(100836, 6)
   userId  movieId           timestamp  rating_(0.499, 2.0]_movie  rating_(2.0, 3.5]_movie  rating_(3.5, 5.0]_movie
0     429       22 1996-03-29 18:36:55                          4                       24                        8
1     429      150 1996-03-29 18:36:55                         13                       61                      127
2     429      161 1996-03-29 18:36:55                          8                       38                       57
3     429      165 1996-03-29 18:36:55                         14                       64                       66
4     429      218 1996-03-29 18:36:55                          3                        4                        7


Chaining everything together in a pipeline#

To perform cross-validation and enable hyper-parameter tuning, we gather all elements into a scikit-learn Pipeline by using make_pipeline, and define a scikit-learn HistGradientBoostingRegressor.

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    table_vectorizer,
    agg_target_user,
    agg_target_movie,
    HistGradientBoostingRegressor(learning_rate=0.1, max_depth=4, max_iter=40),
)

pipeline
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(datetime_transformer=DatetimeEncoder(add_day_of_the_week=True))),
                ('aggtarget-1',
                 AggTarget(main_key='userId', operation='hist(3)',
                           suffix='_user')),
                ('aggtarget-2',
                 AggTarget(main_key='movieId', operation='hist(3)',
                           suffix='_movie')),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor(max_depth=4, max_iter=40))])


Hyper-parameters tuning and cross validation#

We can finally create our hyper-parameter search space and run a GridSearchCV. We choose TimeSeriesSplit as the cross-validation splitter to prevent leakage, since our data are timestamped logs.

Note that you need the names of the pipeline steps to assign them hyper-parameters in the search.

You can look up the names of the pipeline steps via the Pipeline’s named_steps attribute:
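
list(pipeline.named_steps)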

['tablevectorizer', 'aggtarget-1', 'aggtarget-2', 'histgradientboostingregressor']

Alternatively, you can build the scikit-learn Pipeline directly to name your transformers yourself: Pipeline([("agg_target_user", agg_target_user), ...])
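
A minimal sketch with the transformers defined above (the step names here are arbitrary choices, not imposed by skrub or scikit-learn):

from sklearn.pipeline import Pipeline

named_pipeline = Pipeline(
    [
        ("table_vectorizer", table_vectorizer),
        ("agg_target_user", agg_target_user),
        ("agg_target_movie", agg_target_movie),
        ("regressor", HistGradientBoostingRegressor(learning_rate=0.1, max_depth=4, max_iter=40)),
    ]
)
# Hyper-parameters would then be addressed as, e.g., "agg_target_movie__operation".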

We now perform the grid search over the AggTarget transformers to find the operation maximizing our validation score.

from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

operations = ["mean", "hist(3)", "hist(5)", "hist(7)", "value_counts"]
param_grid = [
    {
        f"aggtarget-2__operation": [op],
    }
    for op in operations
]

cv = GridSearchCV(pipeline, param_grid, cv=TimeSeriesSplit(n_splits=10))
cv.fit(X, y)

results = pd.DataFrame(cv.cv_results_)

cols = [f"split{idx}_test_score" for idx in range(10)]
results = results.set_index("param_aggtarget-2__operation")[cols].T
results
param_aggtarget-2__operation      mean   hist(3)   hist(5)   hist(7)  value_counts
split0_test_score             0.017906  0.023572  0.007540  0.007540      0.007540
split1_test_score             0.033002  0.042840  0.080098  0.061375      0.066677
split2_test_score             0.060051  0.078515  0.093596  0.099191      0.098471
split3_test_score             0.038842  0.064836  0.056812  0.073599      0.077640
split4_test_score             0.142220  0.128170  0.145539  0.142744      0.146731
split5_test_score             0.103772  0.114905  0.109094  0.109256      0.115901
split6_test_score             0.079807  0.095965  0.106291  0.107266      0.107440
split7_test_score             0.073432  0.072592  0.048308  0.067931      0.085349
split8_test_score             0.109640  0.112492  0.117007  0.124763      0.129425
split9_test_score             0.115726  0.158776  0.169859  0.162961      0.181152


The score used in this regression task is the R2. Remember that the R2 measures performance relative to a naive baseline that always predicts the mean value of y_test: the R2 is 0 when y_pred = y_true.mean(), and it is upper bounded by 1, reached when y_pred = y_true.
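
As a quick sanity check of these two extremes (a toy example, not part of the MovieLens data):

import numpy as np
from sklearn.metrics import r2_score

y_toy = np.array([0.5, 3.0, 4.5, 5.0])

# Always predicting the mean gives an R2 of exactly 0.
print(r2_score(y_toy, np.full_like(y_toy, y_toy.mean())))

# Perfect predictions give an R2 of 1.
print(r2_score(y_toy, y_toy))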

To get a better sense of the predictive performance of our simple pipeline, we also compute the average rating of each movie in the training set, and use this average to predict the ratings in the test set.

from sklearn.metrics import r2_score


def baseline_r2(X, y, train_idx, test_idx):
    """Compute the average rating for all movies in the train set,
    and map these averages to the test set as a prediction.

    If a movie in the test set is not present in the training set,
    we simply predict the global average rating of the training set.
    """
    X_train, y_train = X.iloc[train_idx].copy(), y.iloc[train_idx]
    X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]

    X_train["y"] = y_train

    movie_avg_rating = X_train.groupby("movieId")["y"].mean().to_frame().reset_index()

    y_pred = X_test.merge(movie_avg_rating, on="movieId", how="left")["y"]
    # Fall back to the global average rating of the training set for unseen movies.
    y_pred = y_pred.fillna(y_train.mean())

    return r2_score(y_true=y_test, y_pred=y_pred)


all_baseline_r2 = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=10).split(X, y):
    all_baseline_r2.append(baseline_r2(X, y, train_idx, test_idx))

results.insert(0, "naive mean estimator", all_baseline_r2)

# we only keep the last 5 of the 10 splits,
# because the train set is rather small on the earliest splits
sns.boxplot(results.tail(5), palette="magma")
plt.ylabel("R2 score")
plt.title("Hyper parameters grid-search results")
plt.tight_layout()
Hyper parameters grid-search results

The naive estimator performs worse than our pipeline, which means that the features we extracted added some predictive power.

It seems that using "value_counts" as the aggregation operation for AggTarget yields better performance than using the mean (which is equivalent to using the TargetEncoder).

Here, performance increases with the number of bins used to encode the target: computing the mean yields a single statistic, histograms yield a density over a reduced set of bins, and "value_counts" yields an exhaustive histogram over all possible rating values (here 10 distinct values, from 0.5 to 5).
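
We can check the distinct rating values directly on the target (a quick check reusing the y defined above):

# The ratings take 10 distinct values, from 0.5 to 5.0 in steps of 0.5.
sorted(y.unique())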

Total running time of the script: (0 minutes 23.726 seconds)
