Handling datetime features with the DatetimeEncoder

In this example, we illustrate how to use the DatetimeEncoder to better integrate datetime features into machine learning models.

This encoder breaks down each datetime feature into relevant numerical features, such as the month, the day of the week, the hour of the day, and so on.

It is used by default in the TableVectorizer.
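
For a quick feel for what it produces, here is a minimal standalone sketch on a toy datetime column (the column name and dates are made up for this illustration):

import pandas as pd
from skrub import DatetimeEncoder

toy = pd.DataFrame(
    {"when": pd.to_datetime(["2019-05-07 01:00", "2019-06-20 23:00"])}
)
enc = DatetimeEncoder()
# One numerical column per extracted feature (year, month, day, hour, ...)
enc.fit_transform(toy)
enc.get_feature_names_out()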

A problem with relevant datetime features

We will use a dataset of air quality measurements in different cities. In this setting, we want to predict the NO2 air concentration, based on the location, date and time of measurement.

from pprint import pprint
import pandas as pd

data = pd.read_csv(
    "https://raw.githubusercontent.com/pandas-dev/pandas"
    "/main/doc/data/air_quality_no2_long.csv"
).sort_values("date.utc")
# Extract our input data (X) and the target column (y)
y = data["value"]
X = data[["city", "date.utc"]]

X
city date.utc
2067 London 2019-05-07 01:00:00+00:00
1098 Antwerpen 2019-05-07 01:00:00+00:00
1003 Paris 2019-05-07 01:00:00+00:00
1002 Paris 2019-05-07 02:00:00+00:00
2066 London 2019-05-07 02:00:00+00:00
... ... ...
4 Paris 2019-06-20 20:00:00+00:00
3 Paris 2019-06-20 21:00:00+00:00
2 Paris 2019-06-20 22:00:00+00:00
1 Paris 2019-06-20 23:00:00+00:00
0 Paris 2019-06-21 00:00:00+00:00

2068 rows × 2 columns



We convert the dataframe's date columns using to_datetime(). Notice that we don't need to specify which columns to convert: they are detected automatically.

from skrub import to_datetime

X = to_datetime(X)
X.dtypes
city                     object
date.utc    datetime64[ns, UTC]
dtype: object

Encoding the features

We construct a ColumnTransformer in which we encode the city names with a OneHotEncoder and the date with a DatetimeEncoder.

When instantiating the DatetimeEncoder, we specify that we want to extract the day of the week, and that we don't want to extract anything finer than minutes: seconds and smaller units are unlikely to carry useful signal here.

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from skrub import DatetimeEncoder

encoder = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["city"]),
    (DatetimeEncoder(add_day_of_the_week=True, resolution="minute"), ["date.utc"]),
    remainder="drop",
)

X_enc = encoder.fit_transform(X)
pprint(encoder.get_feature_names_out())
array(['onehotencoder__city_Antwerpen', 'onehotencoder__city_London',
       'onehotencoder__city_Paris', 'datetimeencoder__date.utc_year',
       'datetimeencoder__date.utc_month', 'datetimeencoder__date.utc_day',
       'datetimeencoder__date.utc_hour',
       'datetimeencoder__date.utc_minute',
       'datetimeencoder__date.utc_total_seconds',
       'datetimeencoder__date.utc_day_of_week'], dtype=object)

We see that the encoder works as expected: the "date.utc" column has been replaced by features extracting the year, month, day, hour, minute, day of the week, and total seconds since Epoch.

One-liner with the TableVectorizer

As mentioned earlier, the TableVectorizer makes use of the DatetimeEncoder by default.

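Let's fit one with its default parameters and look at the generated feature names; a minimal sketch of this step:

from skrub import TableVectorizer

table_vec = TableVectorizer().fit(X)
pprint(table_vec.get_feature_names_out())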
array(['date.utc_year', 'date.utc_month', 'date.utc_day', 'date.utc_hour',
       'date.utc_total_seconds', 'city_Antwerpen', 'city_London',
       'city_Paris'], dtype=object)

If we want to customize the DatetimeEncoder inside the TableVectorizer, we can replace its default datetime transformer with a new, custom instance. Here, for example, we want it to also extract the day of the week:

table_vec = TableVectorizer(
    datetime_transformer=DatetimeEncoder(add_day_of_the_week=True),
).fit(X)
pprint(table_vec.get_feature_names_out())
array(['date.utc_year', 'date.utc_month', 'date.utc_day', 'date.utc_hour',
       'date.utc_total_seconds', 'date.utc_day_of_week', 'city_Antwerpen',
       'city_London', 'city_Paris'], dtype=object)

Inspecting the TableVectorizer further, we can check that the DatetimeEncoder is used on the correct column(s).

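A sketch of that inspection, assuming the fitted transformers are exposed through a transformers_ attribute, as in scikit-learn's ColumnTransformer:

# List the (name, transformer, columns) triplets chosen during fitting
pprint(table_vec.transformers_)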
[('datetime', DatetimeEncoder(add_day_of_the_week=True), ['date.utc']),
 ('low_cardinality',
  OneHotEncoder(drop='if_binary', handle_unknown='ignore', sparse_output=False),
  ['city'])]

Prediction with datetime features

For prediction tasks, we recommend using the TableVectorizer inside a pipeline, combined with a model that can make use of the features extracted by the DatetimeEncoder. Here we'll use a HistGradientBoostingRegressor as our learner.

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(table_vec, HistGradientBoostingRegressor())

Evaluating the model

When working with date and time features, we often care about predicting the future. In that case, we have to be careful when evaluating our model, because standard cross-validation settings do not respect time ordering.

Instead, we can use TimeSeriesSplit, which ensures that the test set always comes after the training set in time.
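
To illustrate, here is a small standalone sketch of how TimeSeriesSplit lays out its folds: each test fold comes strictly after its training indices.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# On 10 ordered samples, every test fold follows its training fold in time
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(np.arange(10)):
    print("train:", train_idx, "test:", test_idx)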

from sklearn.model_selection import TimeSeriesSplit, cross_val_score

cross_val_score(
    pipeline,
    X,
    y,
    scoring="neg_mean_squared_error",
    cv=TimeSeriesSplit(n_splits=5),
)
array([-108.89923208, -189.50744903, -181.15543923, -143.86698948,
       -207.90131453])
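
These scores are negated mean squared errors. To read them in the target's own units, we can take the square root of their negation; a sketch, assuming we re-run the same cross-validation and keep the scores:

import numpy as np

scores = cross_val_score(
    pipeline,
    X,
    y,
    scoring="neg_mean_squared_error",
    cv=TimeSeriesSplit(n_splits=5),
)
# Root mean squared error per fold, in the units of the NO2 measurements
print(np.sqrt(-scores))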

Plotting the prediction

The mean squared error is not easy to interpret on its own, so we visually compare our model's predictions with the actual values.

import numpy as np
import matplotlib.pyplot as plt

mask_train = X["date.utc"] < "2019-06-01"
X_train, X_test = X.loc[mask_train], X.loc[~mask_train]
y_train, y_test = y.loc[mask_train], y.loc[~mask_train]

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

all_cities = X_test["city"].unique()

fig, axes = plt.subplots(nrows=len(all_cities), ncols=1, figsize=(12, 9))
for ax, city in zip(axes, all_cities):
    mask_prediction = X_test["city"] == city
    date_prediction = X_test.loc[mask_prediction]["date.utc"]
    y_prediction = y_pred[mask_prediction]

    mask_reference = X["city"] == city
    date_reference = X.loc[mask_reference]["date.utc"]
    y_reference = y[mask_reference]

    ax.plot(date_reference, y_reference, label="Actual")
    ax.plot(date_prediction, y_prediction, label="Predicted")

    ax.set(
        ylabel="NO2",
        title=city,
    )
    ax.legend()

fig.subplots_adjust(hspace=0.5)
plt.show()
(Figure: actual vs. predicted NO2 concentration over time, one panel per city: Paris, London, Antwerpen.)

Let's zoom in on a few days:

mask_zoom_reference = (X["date.utc"] >= "2019-06-01") & (X["date.utc"] < "2019-06-04")
mask_zoom_prediction = (X_test["date.utc"] >= "2019-06-01") & (
    X_test["date.utc"] < "2019-06-04"
)

all_cities = ["Paris", "London"]
fig, axes = plt.subplots(nrows=len(all_cities), ncols=1, figsize=(12, 9))
for ax, city in zip(axes, all_cities):
    mask_prediction = (X_test["city"] == city) & mask_zoom_prediction
    date_prediction = X_test.loc[mask_prediction]["date.utc"]
    y_prediction = y_pred[mask_prediction]

    mask_reference = (X["city"] == city) & mask_zoom_reference
    date_reference = X.loc[mask_reference]["date.utc"]
    y_reference = y[mask_reference]

    ax.plot(date_reference, y_reference, label="Actual")
    ax.plot(date_prediction, y_prediction, label="Predicted")

    ax.set(
        ylabel="NO2",
        title=city,
    )
    ax.legend()

plt.show()
(Figure: actual vs. predicted NO2 concentration over a few days in June, for Paris and London.)

Feature importance

Using the DatetimeEncoder allows us to better understand how the date impacts the NO2 concentration. To this end, we can compute the importance of the features created by the DatetimeEncoder using the permutation_importance() function, which shuffles each feature in turn and measures how much the model's score degrades.

from sklearn.inspection import permutation_importance

table_vec = TableVectorizer(
    datetime_transformer=DatetimeEncoder(add_day_of_the_week=True),
)

# In this case, we don't use a pipeline, because we want to compute the
# importance of the features created by the DatetimeEncoder
X_transform = table_vec.fit_transform(X)
feature_names = table_vec.get_feature_names_out()

model = HistGradientBoostingRegressor().fit(X_transform, y)
result = permutation_importance(model, X_transform, y, n_repeats=10, random_state=0)

result = pd.DataFrame(
    dict(
        feature_names=feature_names,
        std=result.importances_std,
        importances=result.importances_mean,
    )
).sort_values("importances", ascending=False)

result.plot.barh(
    y="importances", x="feature_names", title="Feature Importances", figsize=(12, 9)
)
plt.tight_layout()
(Figure: permutation feature importances, as a horizontal bar plot.)

We can see that the total seconds since Epoch and the hour of the day are the most important features, which seems reasonable.

Conclusion

In this example, we saw how to use the DatetimeEncoder to create features from a date column. Also check out the TableVectorizer, which automatically recognizes and transforms datetime columns by default.
