10 Encoding datetime features with `DatetimeEncoder`

10.1 Introduction

Datetime features matter in many data analysis and machine learning tasks because they often carry information about temporal patterns and trends. For instance, features such as the day of the week, the time of day, or the season can improve predictive models.

However, working with datetime data can be difficult due to the variety of formats in which dates and times are represented. Typical formats include "%Y-%m-%d", "%d/%m/%Y", and "%d %B %Y", among others. Correct parsing of these and more exotic formats is essential to avoid errors and ensure accurate feature extraction.
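As a small illustration of why the format matters, the sketch below parses the same date written in the three formats listed above with the standard library's datetime.strptime, then shows how an ambiguous string yields two different dates depending on the assumed format. The sample strings are made up for this example, and "%B" assumes an English locale.

```python
from datetime import datetime

# The same date written in three formats (illustrative sample strings).
samples = {
    "%Y-%m-%d": "2023-01-03",
    "%d/%m/%Y": "03/01/2023",
    "%d %B %Y": "03 January 2023",  # %B is locale-dependent (English assumed)
}

# Each string is parsed with its matching format.
parsed = {fmt: datetime.strptime(text, fmt) for fmt, text in samples.items()}
for fmt, value in parsed.items():
    print(fmt, "->", value.date())

# An ambiguous string: day-first and month-first readings disagree.
ambiguous = "03/01/2023"
day_first = datetime.strptime(ambiguous, "%d/%m/%Y")    # 3 January 2023
month_first = datetime.strptime(ambiguous, "%m/%d/%Y")  # 1 March 2023
print(day_first.date(), "vs", month_first.date())
```

Both readings of the ambiguous string are valid dates, so no error is raised; only knowing the intended format tells which one is correct.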

In this section, we cover how skrub helps with datetimes through to_datetime, ToDatetime, and the DatetimeEncoder.

10.2 Converting datetime strings to datetime objects

Often, the first step when working with datetimes is converting them from a string representation to a proper datetime object. This is beneficial because datetime objects give access to datetime-specific operations and make it possible to access the individual parts of the datetime.

Skrub provides different objects to deal with the conversion problem.

ToDatetime is a single-column transformer that tries to convert the given column to datetime, either by relying on a user-provided format or by guessing common formats. Since this transformer must be applied to single columns (rather than dataframes), it is typically better to use it in conjunction with ApplyToCols. Additionally, the allow_reject parameter of ApplyToCols should be set to True to avoid raising exceptions for non-datetime columns:

from skrub import ApplyToCols, ToDatetime

import pandas as pd

data = {
    "dates": [
        "2023-01-03",
        "2023-02-15",
        "2023-03-27",
        "2023-04-10",
    ]
}
df = pd.DataFrame(data)

df_enc = ApplyToCols(ToDatetime(), allow_reject=True).fit_transform(df)
df_enc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   dates   4 non-null      datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 164.0 bytes

to_datetime is a function that works similarly to pd.to_datetime and to the ApplyToCols example shown above.

Warning

to_datetime is a stateless function, so it should not be used in a pipeline, because it does not guarantee consistency between fit_transform and successive transform. ApplyToCols(ToDatetime(), allow_reject=True) is a better solution for pipelines.
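To make the consistency issue concrete, here is a toy sketch written with the standard library only. The names guess_format and FormatLearner, and the list of candidate formats, are hypothetical and do not reflect skrub's implementation; the point is only that a format guessed again on every call can differ from the one guessed on the training data.

```python
from datetime import datetime

# Hypothetical candidate formats, tried in order.
CANDIDATE_FORMATS = ["%m/%d/%Y", "%d/%m/%Y"]


def guess_format(values):
    """Return the first candidate format that parses every value."""
    for fmt in CANDIDATE_FORMATS:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            continue
        return fmt
    raise ValueError("no candidate format matches the data")


class FormatLearner:
    """Hypothetical stateful parser: learns the format once, then reuses it."""

    def fit(self, values):
        self.format_ = guess_format(values)
        return self

    def transform(self, values):
        return [datetime.strptime(value, self.format_) for value in values]


train = ["13/02/2023", "01/03/2023"]  # "13/02" can only be day-first
new_batch = ["01/03/2023"]            # ambiguous when seen alone

# Stateless: the format is guessed again from the new batch only.
stateless = [datetime.strptime(v, guess_format(new_batch)) for v in new_batch]

# Stateful: the format learned on the training data is reused.
stateful = FormatLearner().fit(train).transform(new_batch)

print(stateless[0].date(), "vs", stateful[0].date())
```

The stateless call reads the new value as 3 January, while the stateful parser reads it as 1 March, in agreement with the training data.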

Finally, the standard Cleaner can be used for parsing datetimes, as it uses ToDatetime under the hood and accepts a datetime_format parameter. As the Cleaner is a transformer, it guarantees consistency between fit_transform and transform.

10.3 Encoding datetime features

Datetimes cannot be used “as-is” for training ML models, and must instead be converted to numerical features. Typically, this is done by “splitting” the datetime parts (year, month, day etc.) into separate columns, so that each column contains only one number.

Additional features may also be of interest, such as the number of seconds since epoch (which increases monotonically and gives an indication of the order of entries), whether a date is a weekday or weekend, or the day of the year.

To achieve this with standard dataframe libraries, the code looks like this:

df_enc["year"] = df_enc["dates"].dt.year
df_enc["month"] = df_enc["dates"].dt.month
df_enc["day"] = df_enc["dates"].dt.day
df_enc["weekday"] = df_enc["dates"].dt.weekday
df_enc["day_of_year"] = df_enc["dates"].dt.day_of_year
df_enc["total_seconds"] = (
    df_enc["dates"] - pd.Timestamp("1970-01-01")
) // pd.Timedelta(seconds=1)

df_enc

	dates	year	month	day	weekday	day_of_year	total_seconds
0	2023-01-03	2023	1	3	1	3	1672704000
1	2023-02-15	2023	2	15	2	46	1676419200
2	2023-03-27	2023	3	27	0	86	1679875200
3	2023-04-10	2023	4	10	0	100	1681084800
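The weekend indicator mentioned earlier is not part of the snippet above. A minimal pandas sketch could derive it from the weekday, as shown below; the sample dates and the column name is_weekend are chosen for this example only.

```python
import pandas as pd

# Illustrative dates: a Tuesday, a Saturday and a Sunday.
dates = pd.Series(pd.to_datetime(["2023-01-03", "2023-01-07", "2023-01-08"]))

features = pd.DataFrame(
    {
        "dates": dates,
        # pandas numbers weekdays from 0 (Monday) to 6 (Sunday)
        "weekday": dates.dt.weekday,
        # Saturday (5) and Sunday (6) are flagged as weekend
        "is_weekend": dates.dt.weekday >= 5,
    }
)
print(features)
```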

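Skrub’s DatetimeEncoder makes it possible to add the same features with a simpler interface. As the DatetimeEncoder is a single-column transformer, we again use ApplyToCols. Note that the DatetimeEncoder numbers weekdays from 1 (Monday) to 7 (Sunday), whereas pandas' dt.weekday starts at 0 (Monday), so the weekday values below are shifted by one compared to the manual example.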

The DatetimeEncoder includes various parameters to add more features to the transformed dataframe:

- add_total_seconds adds the number of seconds since Epoch (1970-01-01)
- add_weekday adds the day in the week (to highlight weekends, for example)
- add_day_of_year adds the day in year of the datetime

from skrub import DatetimeEncoder

df_enc = ApplyToCols(ToDatetime(), allow_reject=True).fit_transform(df)

de = DatetimeEncoder(add_total_seconds=True, add_weekday=True, add_day_of_year=True)

df_enc = ApplyToCols(de, cols="dates").fit_transform(df_enc)
df_enc

	dates_year	dates_month	dates_day	dates_total_seconds	dates_weekday	dates_day_of_year
0	2023.0	1.0	3.0	1.672704e+09	2.0	3.0
1	2023.0	2.0	15.0	1.676419e+09	3.0	46.0
2	2023.0	3.0	27.0	1.679875e+09	1.0	86.0
3	2023.0	4.0	10.0	1.681085e+09	1.0	100.0

10.4 Periodic features

Periodic features are useful for training machine learning models because they capture the cyclical nature of certain data patterns. For example, features such as hours in a day or days in a week often exhibit periodic behavior. By encoding these features periodically, models can better understand and predict patterns that repeat over time, such as daily traffic trends, or seasonal variations. This ensures that the model treats the start and end of a cycle as close neighbors, improving its ability to generalize and make accurate predictions.
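The "close neighbors" property can be checked numerically. The sketch below maps hours of the day onto a circle with sine and cosine and compares distances; the helper circular_encode is a hypothetical name introduced for this illustration.

```python
import math


def circular_encode(value, period):
    """Map a periodic value to a point (sin, cos) on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


h0 = circular_encode(0, 24)
h1 = circular_encode(1, 24)
h12 = circular_encode(12, 24)
h23 = circular_encode(23, 24)

# As raw numbers, 23 and 0 are 23 units apart; on the circle they are
# as close as 0 and 1, while 0 and 12 sit on opposite sides.
print("23h to 0h :", round(math.dist(h23, h0), 2))  # about 0.26
print("0h to 1h  :", round(math.dist(h0, h1), 2))   # about 0.26
print("0h to 12h :", round(math.dist(h0, h12), 2))  # about 2.0
```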

This can be done manually with dataframe libraries. For example, circular encoding (also known as trigonometric or sin/cos encoding) can be implemented with pandas as follows:

import numpy as np 

df_enc = ApplyToCols(ToDatetime(), allow_reject=True).fit_transform(df)

df_enc["day_of_year"] = df_enc["dates"].dt.day_of_year
df_enc["day_of_year_sin"] = np.sin(2 * np.pi * df_enc["day_of_year"] / 365)
df_enc["day_of_year_cos"] = np.cos(2 * np.pi * df_enc["day_of_year"] / 365)

df_enc["weekday"] = df_enc["dates"].dt.weekday
df_enc["weekday_sin"] = np.sin(2 * np.pi * df_enc["weekday"] / 7)
df_enc["weekday_cos"] = np.cos(2 * np.pi * df_enc["weekday"] / 7)

df_enc

	dates	day_of_year	day_of_year_sin	day_of_year_cos	weekday	weekday_sin	weekday_cos
0	2023-01-03	3	0.051620	0.998667	1	0.781831	0.623490
1	2023-02-15	46	0.711657	0.702527	2	0.974928	-0.222521
2	2023-03-27	86	0.995919	0.090252	0	0.000000	1.000000
3	2023-04-10	100	0.988678	-0.150055	0	0.000000	1.000000

Alternatively, the DatetimeEncoder can add periodic features using either circular or spline encoding through the periodic_encoding parameter:

de = DatetimeEncoder(periodic_encoding="circular")

df_enc = ApplyToCols(de, cols="dates").fit_transform(df_enc)
df_enc

	dates_year	dates_total_seconds	dates_month_circular_0	dates_month_circular_1	dates_day_circular_0	dates_day_circular_1	day_of_year	day_of_year_sin	day_of_year_cos	weekday	weekday_sin	weekday_cos
0	2023.0	1.672704e+09	0.500000	8.660254e-01	5.877853e-01	0.809017	3	0.051620	0.998667	1	0.781831	0.623490
1	2023.0	1.676419e+09	0.866025	5.000000e-01	1.224647e-16	-1.000000	46	0.711657	0.702527	2	0.974928	-0.222521
2	2023.0	1.679875e+09	1.000000	6.123234e-17	-5.877853e-01	0.809017	86	0.995919	0.090252	0	0.000000	1.000000
3	2023.0	1.681085e+09	0.866025	-5.000000e-01	8.660254e-01	-0.500000	100	0.988678	-0.150055	0	0.000000	1.000000

10.5 Conclusions

In this chapter, we explored the importance and challenges of working with datetime features. We covered how to convert string representations of dates to datetime objects using skrub’s ToDatetime transformer and the Cleaner, both of which can be integrated into pipelines for robust preprocessing.

We also discussed the need to encode datetime features into numerical representations suitable for machine learning models. The DatetimeEncoder provides a convenient way to extract useful components such as year, month, day, weekday, day of year, and total seconds since epoch. Additionally, we saw how periodic (circular) encoding can be used to capture cyclical patterns in time-based data.

In the next chapter, we will cover the final type of columns: categorical/string columns.

11 Exercise

Path to the exercise: content/exercises/06_feat_eng_datetimes.ipynb

Use one of the methods explained so far (Cleaner/ApplyToCols) to convert the date column of the provided dataframe to datetime dtype, then extract the following features:

- All parts of the datetime
- The number of seconds from epoch
- The day in the week
- The day of the year

Hint: use the format "%d %B %Y" for the datetime.

import pandas as pd

data = {
    "admission_dates": [
        "03 January 2023",
        "15 February 2023",
        "27 March 2023",
        "10 April 2023",
    ],
    "patient_ids": [101, 102, 103, 104],
    "age": [25, 34, 45, 52],
    "outcome": ["Recovered", "Under Treatment", "Recovered", "Deceased"],
}
df = pd.DataFrame(data)
print(df)

    admission_dates  patient_ids  age          outcome
0   03 January 2023          101   25        Recovered
1  15 February 2023          102   34  Under Treatment
2     27 March 2023          103   45        Recovered
3     10 April 2023          104   52         Deceased

# Write your solution here
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
#

# Solution with ApplyToCols and ToDatetime
from skrub import ApplyToCols, ToDatetime, DatetimeEncoder
from sklearn.pipeline import make_pipeline
import skrub.selectors as s

to_datetime_encoder = ApplyToCols(ToDatetime(format="%d %B %Y"), cols="admission_dates")

datetime_encoder = ApplyToCols(
    DatetimeEncoder(add_total_seconds=True, add_weekday=True, add_day_of_year=True),
    cols=s.any_date(),
)

encoder = make_pipeline(to_datetime_encoder, datetime_encoder)
encoder.fit_transform(df)

	admission_dates_year	admission_dates_month	admission_dates_day	admission_dates_total_seconds	admission_dates_weekday	admission_dates_day_of_year	patient_ids	age	outcome
0	2023.0	1.0	3.0	1.672704e+09	2.0	3.0	101	25	Recovered
1	2023.0	2.0	15.0	1.676419e+09	3.0	46.0	102	34	Under Treatment
2	2023.0	3.0	27.0	1.679875e+09	1.0	86.0	103	45	Recovered
3	2023.0	4.0	10.0	1.681085e+09	1.0	100.0	104	52	Deceased

# Solution with Cleaner
from skrub import ApplyToCols, Cleaner, DatetimeEncoder
from sklearn.pipeline import make_pipeline
import skrub.selectors as s

datetime_encoder = ApplyToCols(
    DatetimeEncoder(add_total_seconds=True, add_weekday=True, add_day_of_year=True),
    cols=s.any_date(),
)

encoder = make_pipeline(Cleaner(datetime_format="%d %B %Y"), datetime_encoder)
encoder.fit_transform(df)

	admission_dates_year	admission_dates_month	admission_dates_day	admission_dates_total_seconds	admission_dates_weekday	admission_dates_day_of_year	patient_ids	age	outcome
0	2023.0	1.0	3.0	1.672704e+09	2.0	3.0	101	25	Recovered
1	2023.0	2.0	15.0	1.676419e+09	3.0	46.0	102	34	Under Treatment
2	2023.0	3.0	27.0	1.679875e+09	1.0	86.0	103	45	Recovered
3	2023.0	4.0	10.0	1.681085e+09	1.0	100.0	104	52	Deceased

Modify the script so that the DatetimeEncoder adds periodic encoding with sine and cosine (aka circular encoding):

# Write your solution here
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
#

Now modify the script above to add spline features (periodic_encoding="spline").

# Solution
from skrub import ApplyToCols, Cleaner, DatetimeEncoder
from sklearn.pipeline import make_pipeline
import skrub.selectors as s

datetime_encoder = ApplyToCols(
    DatetimeEncoder(
        periodic_encoding="spline",
        add_total_seconds=True,
        add_weekday=True,
        add_day_of_year=True,
    ),
    cols=s.any_date(),
)

encoder = make_pipeline(Cleaner(datetime_format="%d %B %Y"), datetime_encoder)
encoder.fit_transform(df)

	admission_dates_year	admission_dates_total_seconds	admission_dates_day_of_year	admission_dates_month_spline_01	admission_dates_month_spline_02	admission_dates_month_spline_03	admission_dates_month_spline_04	admission_dates_month_spline_05	admission_dates_month_spline_06	...	admission_dates_weekday_spline_1	admission_dates_weekday_spline_2	admission_dates_weekday_spline_3	admission_dates_weekday_spline_4	admission_dates_weekday_spline_5	patient_ids	age	outcome
0	2023.0	1.672704e+09	3.0	0.166667	0.666667	0.166667	0.000000	0.000000	0.000000	...	0.000000	0.166667	0.666667	0.166667	0.000000	101	25	Recovered
1	2023.0	1.676419e+09	46.0	0.000000	0.166667	0.666667	0.166667	0.000000	0.000000	...	0.000000	0.000000	0.166667	0.666667	0.166667	102	34	Under Treatment
2	2023.0	1.679875e+09	86.0	0.000000	0.000000	0.166667	0.666667	0.166667	0.000000	...	0.166667	0.666667	0.166667	0.000000	0.000000	103	45	Recovered
3	2023.0	1.681085e+09	100.0	0.000000	0.000000	0.000000	0.166667	0.666667	0.166667	...	0.166667	0.666667	0.166667	0.000000	0.000000	104	52	Deceased

4 rows × 29 columns