Handling datetime features with the DatetimeEncoder

In this example, we illustrate how to better integrate datetime features in machine learning models with the DatetimeEncoder.

This encoder breaks down each datetime feature into relevant numerical features, such as the month, the day of the week, the hour of the day, and so on.

It is used by default in the TableVectorizer.

A problem with relevant datetime features

We will use a dataset of bike sharing demand in 2011 and 2012. In this setting, we want to predict the number of bike rentals, based on the date, time and weather conditions.

from pprint import pprint

import pandas as pd

data = pd.read_csv(
    "https://raw.githubusercontent.com/skrub-data/datasets/master"
    "/data/bike-sharing-dataset.csv"
)
# Extract our input data (X) and the target column (y)
y = data["cnt"]
X = data[["date", "holiday", "temp", "hum", "windspeed", "weathersit"]]

X
                      date  holiday  temp   hum  windspeed  weathersit
0      2011-01-01 00:00:00        0  0.24  0.81     0.0000           1
1      2011-01-01 01:00:00        0  0.22  0.80     0.0000           1
2      2011-01-01 02:00:00        0  0.22  0.80     0.0000           1
3      2011-01-01 03:00:00        0  0.24  0.75     0.0000           1
4      2011-01-01 04:00:00        0  0.24  0.75     0.0000           1
...                    ...      ...   ...   ...        ...         ...
17374  2012-12-31 19:00:00        0  0.26  0.60     0.1642           2
17375  2012-12-31 20:00:00        0  0.26  0.60     0.1642           2
17376  2012-12-31 21:00:00        0  0.26  0.60     0.1642           1
17377  2012-12-31 22:00:00        0  0.26  0.56     0.1343           1
17378  2012-12-31 23:00:00        0  0.26  0.65     0.1343           1

[17379 rows x 6 columns]



y
0         16
1         40
2         32
3         13
4          1
        ...
17374    119
17375     89
17376     90
17377     61
17378     49
Name: cnt, Length: 17379, dtype: int64

We convert the dataframe’s "date" column using ToDatetime.

from skrub import ToDatetime

date = ToDatetime().fit_transform(X["date"])

print("original dtype:", X["date"].dtypes, "\n\nconverted dtype:", date.dtypes)
original dtype: object

converted dtype: datetime64[ns]

Encoding the features

We now encode this column with a DatetimeEncoder.

Here we keep the DatetimeEncoder's default settings: it extracts the year, month, day and hour, plus the total number of seconds since Epoch. Nothing finer than hours is extracted, because minutes, seconds and lower units carry no useful information in this dataset. The day of the week is not extracted by default; we will add it later.

from skrub import DatetimeEncoder

date_enc = DatetimeEncoder().fit_transform(date)

print(date, "\n\nHas been encoded as:\n\n", date_enc)
0       2011-01-01 00:00:00
1       2011-01-01 01:00:00
2       2011-01-01 02:00:00
3       2011-01-01 03:00:00
4       2011-01-01 04:00:00
                ...
17374   2012-12-31 19:00:00
17375   2012-12-31 20:00:00
17376   2012-12-31 21:00:00
17377   2012-12-31 22:00:00
17378   2012-12-31 23:00:00
Name: date, Length: 17379, dtype: datetime64[ns]

Has been encoded as:

        date_year  date_month  date_day  date_hour  date_total_seconds
0         2011.0         1.0       1.0        0.0        1.293840e+09
1         2011.0         1.0       1.0        1.0        1.293844e+09
2         2011.0         1.0       1.0        2.0        1.293847e+09
3         2011.0         1.0       1.0        3.0        1.293851e+09
4         2011.0         1.0       1.0        4.0        1.293854e+09
...          ...         ...       ...        ...                 ...
17374     2012.0        12.0      31.0       19.0        1.356980e+09
17375     2012.0        12.0      31.0       20.0        1.356984e+09
17376     2012.0        12.0      31.0       21.0        1.356988e+09
17377     2012.0        12.0      31.0       22.0        1.356991e+09
17378     2012.0        12.0      31.0       23.0        1.356995e+09

[17379 rows x 5 columns]

We see that the encoder works as expected: the column has been replaced by features extracting the year, month, day, hour, and total seconds since Epoch.
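
The granularity is adjustable. As an illustrative sketch (resolution and add_weekday are DatetimeEncoder parameters; this variant is not part of the original script):

# Hypothetical variant: stop at the day level and add the day of the week
date_enc_day = DatetimeEncoder(resolution="day", add_weekday=True).fit_transform(date)
print(date_enc_day.columns.tolist())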

One-liner with the TableVectorizer

As mentioned earlier, the TableVectorizer makes use of the DatetimeEncoder by default. Note that X["date"] still contains strings; it is automatically converted to a datetime column inside the TableVectorizer.

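The feature names below can be obtained with a one-liner along these lines (a sketch reusing the pprint import from above; TableVectorizer.get_feature_names_out is part of its fitted API):

from skrub import TableVectorizer

table_vec = TableVectorizer().fit(X)
pprint(table_vec.get_feature_names_out())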
array(['date_year', 'date_month', 'date_day', 'date_hour',
       'date_total_seconds', 'holiday', 'temp', 'hum', 'windspeed',
       'weathersit'], dtype='<U18')

If we want to customize the DatetimeEncoder inside the TableVectorizer, we can replace its default parameter with a new, custom instance.

Here, for example, we want it to extract the day of the week:

# use the ``datetime`` argument to customize how datetimes are handled
table_vec_weekday = TableVectorizer(datetime=DatetimeEncoder(add_weekday=True)).fit(X)
pprint(table_vec_weekday.get_feature_names_out())
array(['date_year', 'date_month', 'date_day', 'date_hour',
       'date_total_seconds', 'date_weekday', 'holiday', 'temp', 'hum',
       'windspeed', 'weathersit'], dtype='<U18')

Inspecting the TableVectorizer further, we can check that the DatetimeEncoder is used on the correct column(s).

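A sketch of one way to perform this inspection, assuming the fitted transformers_ mapping exposed by the TableVectorizer, which maps each column name to the transformer applied to it:

# Inspect which transformer was fitted on each column
pprint(table_vec_weekday.transformers_)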
{'date': DatetimeEncoder(add_weekday=True),
 'holiday': PassThrough(),
 'hum': PassThrough(),
 'temp': PassThrough(),
 'weathersit': PassThrough(),
 'windspeed': PassThrough()}

Prediction with datetime features

For prediction tasks, we recommend using the TableVectorizer inside a pipeline, combined with a model that can use the features extracted by the DatetimeEncoder. Here we’ll use a HistGradientBoostingRegressor as our learner.

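A sketch of such a pipeline, reusing the vectorizers defined above (the learner's hyperparameters are left at their scikit-learn defaults, which may differ from the original script):

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline

# One pipeline with the default DatetimeEncoder, one with the weekday added
pipeline = make_pipeline(table_vec, HistGradientBoostingRegressor())
pipeline_weekday = make_pipeline(table_vec_weekday, HistGradientBoostingRegressor())
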
Evaluating the model

When using date and time features, we often want to predict the future. In this case, we have to be careful when evaluating our model: standard cross-validation splits do not respect time ordering, so the model could be trained on samples that come after the ones it is evaluated on.

Instead, we can use the TimeSeriesSplit, which ensures that each test set comes after its training set in time.

from sklearn.model_selection import TimeSeriesSplit, cross_val_score

cross_val_score(
    pipeline,
    X,
    y,
    scoring="neg_mean_squared_error",
    cv=TimeSeriesSplit(n_splits=5),
)
array([ -6664.06809097,  -6520.55512687, -21139.48809623, -13442.23050796,
       -14260.1956296 ])

Plotting the prediction

The mean squared error is hard to interpret on its own, so we visually compare the predictions of our models with the actual values. To do so, we split the dataset into a train and a test set: we use the 2011 data to predict what happened in 2012.

import matplotlib.dates as mdates
import matplotlib.pyplot as plt

mask_train = X["date"] < "2012-01-01"
X_train, X_test = X.loc[mask_train], X.loc[~mask_train]
y_train, y_test = y.loc[mask_train], y.loc[~mask_train]

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

pipeline_weekday.fit(X_train, y_train)
y_pred_weekday = pipeline_weekday.predict(X_test)

fig, ax = plt.subplots(figsize=(12, 3))
fig.suptitle("Predictions with tree models")
ax.plot(
    X.tail(96)["date"],
    y.tail(96).values,
    "x-",
    alpha=0.2,
    label="Actual demand",
    color="black",
)
ax.plot(
    X_test.tail(96)["date"],
    y_pred[-96:],
    "x-",
    label="DatetimeEncoder() + HGBR prediction",
)
ax.plot(
    X_test.tail(96)["date"],
    y_pred_weekday[-96:],
    "x-",
    label="DatetimeEncoder(add_weekday=True) + HGBR prediction",
)

ax.tick_params(axis="x", labelsize=7, labelrotation=75)
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
_ = ax.legend()
plt.tight_layout()
plt.show()
[Figure: Predictions with tree models]

As we can see, adding the weekday yields better predictions on our test set.

Feature importances

Using the DatetimeEncoder allows us to better understand how the date impacts bike sharing demand. To this end, we can compute the importance of the features created by the DatetimeEncoder with the permutation_importance() function, which randomly shuffles the values of one feature at a time and measures how much the model's predictions degrade as a result.

from sklearn.inspection import permutation_importance

# We don't pass the full pipeline here: permutation importance is computed
# on the transformed features, so that the importance of the features
# created by the DatetimeEncoder can be measured individually
X_test_transform = pipeline[:-1].transform(X_test)

result = permutation_importance(
    pipeline[-1], X_test_transform, y_test, n_repeats=10, random_state=0
)

result = pd.DataFrame(
    dict(
        feature_names=X_test_transform.columns,
        std=result.importances_std,
        importances=result.importances_mean,
    )
).sort_values("importances", ascending=True)

result.plot.barh(
    y="importances",
    x="feature_names",
    title="Feature Importances",
    xerr="std",
    figsize=(12, 9),
)
plt.tight_layout()
plt.show()
[Figure: Feature Importances]

We can see that the hour of the day, the temperature and the humidity are the most important features, which seems reasonable.

Conclusion

In this example, we saw how to use the DatetimeEncoder to create features from a datetime column. Also check out the TableVectorizer, which automatically recognizes and transforms datetime columns by default.
