Handling datetime features with the DatetimeEncoder

In this example, we illustrate how to better integrate datetime features in machine learning models with the DatetimeEncoder.

This encoder breaks down datetime features into relevant numerical features, such as the month, the day of the week, the hour of the day, etc. It is used by default in the TableVectorizer.
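To make this concrete, here is a minimal sketch of the encoder applied on its own to a made-up datetime column (the column name and values are invented for illustration):

import pandas as pd

from skrub import DatetimeEncoder

# A made-up datetime column, just to show what the encoder extracts.
toy = pd.DataFrame({"when": pd.to_datetime(["2019-06-01 08:30", "2019-06-02 14:00"])})
enc = DatetimeEncoder(add_day_of_the_week=True)
print(enc.fit_transform(toy))  # one row per input, one column per extracted feature
print(enc.get_feature_names_out())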
A problem with relevant datetime features
We will use a dataset of air quality measurements in different cities. In this setting, we want to predict the NO2 air concentration, based on the location, date and time of measurement.
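The loading of the dataset is not shown here. As a stand-in, the snippet below builds a small synthetic table with the same layout (a "city" column, a "date.utc" column stored as strings, and a target y); it only serves to make the following code runnable, and the outputs shown below come from the real data, not from this stand-in.

import numpy as np
import pandas as pd

# Hypothetical stand-in for the air-quality data: random cities and NO2 values.
rng = np.random.default_rng(0)
dates = pd.date_range("2019-01-01", "2019-06-30", freq="H", tz="UTC")
X = pd.DataFrame(
    {
        "city": rng.choice(["Paris", "London", "Antwerpen"], size=len(dates)),
        "date.utc": dates.astype(str),  # strings, converted to datetimes below
    }
)
y = pd.Series(rng.uniform(10, 60, size=len(dates)), name="NO2")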
We convert the dataframe date columns using to_datetime(). Notice how we don’t need to specify the columns to convert.
from skrub import to_datetime
X = to_datetime(X)
X.dtypes
city object
date.utc datetime64[ns, UTC]
dtype: object
Encoding the features
We will construct a ColumnTransformer in which we encode the city names with a OneHotEncoder, and the date with a DatetimeEncoder.

When instantiating the DatetimeEncoder, we specify that we want to extract the day of the week, and that we don’t want to extract anything finer than minutes: seconds and finer units are probably unimportant here.
from pprint import pprint

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

from skrub import DatetimeEncoder

encoder = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["city"]),
    (DatetimeEncoder(add_day_of_the_week=True, resolution="minute"), ["date.utc"]),
    remainder="drop",
)

X_enc = encoder.fit_transform(X)
pprint(encoder.get_feature_names_out())
array(['onehotencoder__city_Antwerpen', 'onehotencoder__city_London',
'onehotencoder__city_Paris', 'datetimeencoder__date.utc_year',
'datetimeencoder__date.utc_month', 'datetimeencoder__date.utc_day',
'datetimeencoder__date.utc_hour',
'datetimeencoder__date.utc_minute',
'datetimeencoder__date.utc_total_seconds',
'datetimeencoder__date.utc_day_of_week'], dtype=object)
We see that the encoder is working as expected: the "date.utc" column has been replaced by features extracting the year, month, day, hour, minute, day of the week, and total seconds since Epoch.
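To look at the encoded output together with these names, we can wrap it in a DataFrame (a small sketch; depending on the scikit-learn version, the OneHotEncoder part may make the result sparse, hence the conversion):

import pandas as pd
from scipy import sparse

# Densify if the ColumnTransformer returned a sparse matrix, then label columns.
X_dense = X_enc.toarray() if sparse.issparse(X_enc) else X_enc
pd.DataFrame(X_dense, columns=encoder.get_feature_names_out()).head()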
One-liner with the TableVectorizer

As mentioned earlier, the TableVectorizer makes use of the DatetimeEncoder by default.
from skrub import TableVectorizer
table_vec = TableVectorizer().fit(X)
pprint(table_vec.get_feature_names_out())
array(['date.utc_year', 'date.utc_month', 'date.utc_day', 'date.utc_hour',
'date.utc_total_seconds', 'city_Antwerpen', 'city_London',
'city_Paris'], dtype=object)
If we want to customize the DatetimeEncoder inside the TableVectorizer, we can replace its default parameter with a new, custom instance. Here, for example, we want it to also extract the day of the week:
table_vec = TableVectorizer(
    datetime_transformer=DatetimeEncoder(add_day_of_the_week=True),
).fit(X)
pprint(table_vec.get_feature_names_out())
array(['date.utc_year', 'date.utc_month', 'date.utc_day', 'date.utc_hour',
'date.utc_total_seconds', 'date.utc_day_of_week', 'city_Antwerpen',
'city_London', 'city_Paris'], dtype=object)
Inspecting the TableVectorizer further, we can check that the DatetimeEncoder is used on the correct column(s).
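One way to do this, assuming the fitted TableVectorizer exposes a ColumnTransformer-style transformers_ attribute (which is what produced the listing below):

pprint(table_vec.transformers_)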
[('datetime', DatetimeEncoder(add_day_of_the_week=True), ['date.utc']),
('low_card_cat',
OneHotEncoder(drop='if_binary', handle_unknown='ignore', sparse_output=False),
['city'])]
Prediction with datetime features
For prediction tasks, we recommend using the TableVectorizer inside a pipeline, combined with a model that can use the features extracted by the DatetimeEncoder. Here, we’ll use a HistGradientBoostingRegressor as our learner.
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(table_vec, HistGradientBoostingRegressor())
Evaluating the model
When using date and time features, we often care about predicting the future. In this case, we have to be careful when evaluating our model, because standard cross-validation settings do not respect time ordering. Instead, we can use the TimeSeriesSplit, which ensures that the test set is always in the future.
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

cross_val_score(
    pipeline,
    X,
    y,
    scoring="neg_mean_squared_error",
    cv=TimeSeriesSplit(n_splits=5),
)
array([-108.89923208, -189.50744903, -181.15543923, -143.86698948,
-207.90131453])
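These scores are negated mean squared errors. To read them on the original NO2 scale, we can take the square root of their opposite (a small sketch reusing the code above):

import numpy as np

scores = cross_val_score(
    pipeline,
    X,
    y,
    scoring="neg_mean_squared_error",
    cv=TimeSeriesSplit(n_splits=5),
)
print(np.sqrt(-scores))  # RMSE per split, in the units of the target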
Plotting the prediction
The mean squared error is not obvious to interpret, so we visually compare our model’s predictions with the actual values.
import numpy as np
import matplotlib.pyplot as plt

mask_train = X["date.utc"] < "2019-06-01"
X_train, X_test = X.loc[mask_train], X.loc[~mask_train]
y_train, y_test = y.loc[mask_train], y.loc[~mask_train]

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

all_cities = X_test["city"].unique()

fig, axes = plt.subplots(nrows=len(all_cities), ncols=1, figsize=(12, 9))
for ax, city in zip(axes, all_cities):
    mask_prediction = X_test["city"] == city
    date_prediction = X_test.loc[mask_prediction]["date.utc"]
    y_prediction = y_pred[mask_prediction]

    mask_reference = X["city"] == city
    date_reference = X.loc[mask_reference]["date.utc"]
    y_reference = y[mask_reference]

    ax.plot(date_reference, y_reference, label="Actual")
    ax.plot(date_prediction, y_prediction, label="Predicted")

    ax.set(
        ylabel="NO2",
        title=city,
    )
    ax.legend()

fig.subplots_adjust(hspace=0.5)
plt.show()

Let’s zoom in on a few days:
mask_zoom_reference = (X["date.utc"] >= "2019-06-01") & (X["date.utc"] < "2019-06-04")
mask_zoom_prediction = (X_test["date.utc"] >= "2019-06-01") & (
    X_test["date.utc"] < "2019-06-04"
)

all_cities = ["Paris", "London"]

fig, axes = plt.subplots(nrows=len(all_cities), ncols=1, figsize=(12, 9))
for ax, city in zip(axes, all_cities):
    mask_prediction = (X_test["city"] == city) & mask_zoom_prediction
    date_prediction = X_test.loc[mask_prediction]["date.utc"]
    y_prediction = y_pred[mask_prediction]

    mask_reference = (X["city"] == city) & mask_zoom_reference
    date_reference = X.loc[mask_reference]["date.utc"]
    y_reference = y[mask_reference]

    ax.plot(date_reference, y_reference, label="Actual")
    ax.plot(date_prediction, y_prediction, label="Predicted")

    ax.set(
        ylabel="NO2",
        title=city,
    )
    ax.legend()

plt.show()

Feature importance
Using the DatetimeEncoder allows us to better understand how the date impacts the NO2 concentration. To this end, we can compute the importance of the features created by the DatetimeEncoder, using the permutation_importance() function, which shuffles a feature and measures how much the model’s predictions change.
import pandas as pd

from sklearn.inspection import permutation_importance

table_vec = TableVectorizer(
    datetime_transformer=DatetimeEncoder(add_day_of_the_week=True),
)

# In this case, we don't use a pipeline, because we want to compute the
# importance of the features created by the DatetimeEncoder.
X_transform = table_vec.fit_transform(X)
feature_names = table_vec.get_feature_names_out()

model = HistGradientBoostingRegressor().fit(X_transform, y)
result = permutation_importance(model, X_transform, y, n_repeats=10, random_state=0)

result = pd.DataFrame(
    dict(
        feature_names=feature_names,
        std=result.importances_std,
        importances=result.importances_mean,
    )
).sort_values("importances", ascending=False)

result.plot.barh(
    y="importances", x="feature_names", title="Feature Importances", figsize=(12, 9)
)
plt.tight_layout()

We can see that the total seconds since Epoch and the hour of the day are the most important features, which seems reasonable.
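We can double-check this by printing the top of the importance table built above:

# Highest mean permutation importances first.
print(result.head(3))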
Conclusion
In this example, we saw how to use the DatetimeEncoder to create features from a date column.

Also check out the TableVectorizer, which automatically recognizes and transforms datetime columns by default.