10  Encoding datetime features with DatetimeEncoder

10.1 Introduction

Datetime features matter in many data analysis and machine learning tasks because they often carry information about temporal patterns and trends. For instance, the day of the week, the time of day, or the season can all be informative features for predictive modeling.

However, working with datetime data can be difficult because dates and times are represented in many different formats. Typical formats include "%Y-%m-%d", "%d/%m/%Y", and "%d %B %Y", among others. Parsing these and more exotic formats correctly is essential to avoid errors and ensure accurate feature extraction.
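To make the format problem concrete, here is a minimal sketch that uses plain pandas (not skrub) to parse the same calendar day written in the three formats listed above. The sample strings are made up for illustration. The last lines show how an incorrect format can silently produce a different date instead of raising an error.

```python
import pandas as pd

# The same calendar day written in the three formats mentioned above
# (illustrative strings, not taken from a real dataset).
samples = {
    "%Y-%m-%d": "2023-01-03",
    "%d/%m/%Y": "03/01/2023",
    "%d %B %Y": "03 January 2023",
}

# Parse each string with its matching format.
parsed = {fmt: pd.to_datetime(text, format=fmt) for fmt, text in samples.items()}
for fmt, ts in parsed.items():
    print(fmt, "->", ts.date())

# Reading a day-first string with a month-first format does not fail:
# it returns a different, wrong date.
wrong = pd.to_datetime("03/01/2023", format="%m/%d/%Y")
print(wrong.date())  # 2023-03-01 instead of 2023-01-03
```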

This section covers how skrub helps with datetimes through to_datetime, ToDatetime, and the DatetimeEncoder.

10.2 Converting datetime strings to datetime objects

Often, the first step in working with datetimes is converting them from a string representation to a proper datetime object. This gives access to datetime-specific functionality and to the individual parts of the datetime.

Skrub provides several tools for this conversion.

ToDatetime is a single-column transformer that tries to convert the given column to datetime, either by relying on a user-provided format or by guessing common formats. Since this transformer must be applied to single columns (rather than dataframes), it is typically used together with ApplyToCols. Additionally, the allow_reject parameter of ApplyToCols should be set to True to avoid raising exceptions for non-datetime columns:

from skrub import ApplyToCols, ToDatetime

import pandas as pd

data = {
    "dates": [
        "2023-01-03",
        "2023-02-15",
        "2023-03-27",
        "2023-04-10",
    ]
}
df = pd.DataFrame(data)

df_enc = ApplyToCols(ToDatetime(), allow_reject=True).fit_transform(df)
df_enc.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   dates   4 non-null      datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 164.0 bytes

The to_datetime function works similarly to pd.to_datetime and to the ApplyToCols example shown above.

Warning

to_datetime is a stateless function, so it should not be used in a pipeline: it does not guarantee consistency between fit_transform and subsequent calls to transform. ApplyToCols(ToDatetime(), allow_reject=True) is a better solution for pipelines.

Finally, the standard Cleaner can be used for parsing datetimes, as it uses ToDatetime under the hood and accepts a datetime_format parameter. As the Cleaner is a transformer, it guarantees consistency between fit_transform and transform.

10.3 Encoding datetime features

Datetimes cannot be used “as-is” for training ML models, and must instead be converted to numerical features. Typically, this is done by “splitting” the datetime parts (year, month, day etc.) into separate columns, so that each column contains only one number.

Additional features may also be of interest, such as the number of seconds since epoch (which increases monotonically and gives an indication of the order of entries), whether a date is a weekday or weekend, or the day of the year.

To achieve this with standard dataframe libraries, the code looks like this:

df_enc["year"] = df_enc["dates"].dt.year
df_enc["month"] = df_enc["dates"].dt.month
df_enc["day"] = df_enc["dates"].dt.day
df_enc["weekday"] = df_enc["dates"].dt.weekday
df_enc["day_of_year"] = df_enc["dates"].dt.day_of_year
df_enc["total_seconds"] = (
    df_enc["dates"] - pd.Timestamp("1970-01-01")
) // pd.Timedelta(seconds=1)

df_enc
        dates  year  month  day  weekday  day_of_year  total_seconds
0  2023-01-03  2023      1    3        1            3     1672704000
1  2023-02-15  2023      2   15        2           46     1676419200
2  2023-03-27  2023      3   27        0           86     1679875200
3  2023-04-10  2023      4   10        0          100     1681084800
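The weekend indicator mentioned earlier is not part of the code above. A minimal pandas sketch, using a few made-up dates, could look like this:

```python
import pandas as pd

# Illustrative dates: a Tuesday, a Saturday, and a Sunday.
dates = pd.Series(
    pd.to_datetime(["2023-01-03", "2023-01-07", "2023-01-08"]), name="dates"
)

# dt.weekday counts from Monday=0, so Saturday and Sunday are 5 and 6.
is_weekend = dates.dt.weekday >= 5
print(is_weekend.tolist())  # [False, True, True]
```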

Skrub’s DatetimeEncoder adds the same features through a simpler interface. As the DatetimeEncoder is a single-column transformer, we again use ApplyToCols.

The DatetimeEncoder includes various parameters to add more features to the transformed dataframe:

  • add_total_seconds adds the number of seconds since the epoch (1970-01-01)
  • add_weekday adds the day of the week (to highlight weekends, for example)
  • add_day_of_year adds the day of the year

In the output below, the encoder numbers weekdays starting from 1 for Monday, so its values are one higher than those of pandas' dt.weekday, which starts at 0. The extracted features are also returned as floats.

from skrub import DatetimeEncoder

df_enc = ApplyToCols(ToDatetime(), allow_reject=True).fit_transform(df)

de = DatetimeEncoder(add_total_seconds=True, add_weekday=True, add_day_of_year=True)

df_enc = ApplyToCols(de, cols="dates").fit_transform(df_enc)
df_enc
   dates_year  dates_month  dates_day  dates_total_seconds  dates_weekday  dates_day_of_year
0      2023.0          1.0        3.0         1.672704e+09            2.0                3.0
1      2023.0          2.0       15.0         1.676419e+09            3.0               46.0
2      2023.0          3.0       27.0         1.679875e+09            1.0               86.0
3      2023.0          4.0       10.0         1.681085e+09            1.0              100.0

10.4 Periodic features

Periodic features are useful for training machine learning models because they capture the cyclical nature of certain data patterns. For example, features such as the hour of the day or the day of the week often exhibit periodic behavior. By encoding these features periodically, models can better understand and predict patterns that repeat over time, such as daily traffic trends or seasonal variations. A periodic encoding makes the model treat the start and end of a cycle as close neighbors, which improves its ability to generalize and make accurate predictions.
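The "close neighbors" property can be checked numerically. The sketch below uses a hypothetical helper, circular_encode, written with NumPy only, and compares distances between hours of the day before and after a sin/cos encoding.

```python
import numpy as np


def circular_encode(values, period):
    """Map values onto the unit circle: one sin and one cos column."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])


hours = np.array([0, 12, 23])
encoded = circular_encode(hours, period=24)

# As raw numbers, 23:00 and 00:00 are 23 units apart.
raw_gap = abs(hours[2] - hours[0])

# After encoding, 23:00 is close to 00:00, while noon is as far away as possible.
dist_23_0 = np.linalg.norm(encoded[2] - encoded[0])
dist_12_0 = np.linalg.norm(encoded[1] - encoded[0])

print(raw_gap)              # 23
print(round(dist_23_0, 3))  # 0.261
print(round(dist_12_0, 3))  # 2.0
```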

This can be done manually with dataframe libraries. For example, circular encoding (also known as trigonometric or sin/cos encoding) can be implemented with pandas like so:

import numpy as np

df_enc = ApplyToCols(ToDatetime(), allow_reject=True).fit_transform(df)

df_enc["day_of_year"] = df_enc["dates"].dt.day_of_year
df_enc["day_of_year_sin"] = np.sin(2 * np.pi * df_enc["day_of_year"] / 365)
df_enc["day_of_year_cos"] = np.cos(2 * np.pi * df_enc["day_of_year"] / 365)

df_enc["weekday"] = df_enc["dates"].dt.weekday
df_enc["weekday_sin"] = np.sin(2 * np.pi * df_enc["weekday"] / 7)
df_enc["weekday_cos"] = np.cos(2 * np.pi * df_enc["weekday"] / 7)

df_enc
        dates  day_of_year  day_of_year_sin  day_of_year_cos  weekday  weekday_sin  weekday_cos
0  2023-01-03            3         0.051620         0.998667        1     0.781831     0.623490
1  2023-02-15           46         0.711657         0.702527        2     0.974928    -0.222521
2  2023-03-27           86         0.995919         0.090252        0     0.000000     1.000000
3  2023-04-10          100         0.988678        -0.150055        0     0.000000     1.000000

Alternatively, the DatetimeEncoder can add periodic features using either circular or spline encoding through the periodic_encoding parameter:

de = DatetimeEncoder(periodic_encoding="circular")

df_enc = ApplyToCols(de, cols="dates").fit_transform(df_enc)
df_enc
dates_year dates_total_seconds dates_month_circular_0 dates_month_circular_1 dates_day_circular_0 dates_day_circular_1 day_of_year day_of_year_sin day_of_year_cos weekday weekday_sin weekday_cos
0 2023.0 1.672704e+09 0.500000 8.660254e-01 5.877853e-01 0.809017 3 0.051620 0.998667 1 0.781831 0.623490
1 2023.0 1.676419e+09 0.866025 5.000000e-01 1.224647e-16 -1.000000 46 0.711657 0.702527 2 0.974928 -0.222521
2 2023.0 1.679875e+09 1.000000 6.123234e-17 -5.877853e-01 0.809017 86 0.995919 0.090252 0 0.000000 1.000000
3 2023.0 1.681085e+09 0.866025 -5.000000e-01 8.660254e-01 -0.500000 100 0.988678 -0.150055 0 0.000000 1.000000

10.5 Conclusions

In this chapter, we explored the importance and challenges of working with datetime features. We covered how to convert string representations of dates to datetime objects using skrub’s ToDatetime transformer and the Cleaner, both of which can be integrated into pipelines for robust preprocessing.

We also discussed the need to encode datetime features into numerical representations suitable for machine learning models. The DatetimeEncoder provides a convenient way to extract useful components such as year, month, day, weekday, day of year, and total seconds since epoch. Additionally, we saw how periodic (circular) encoding can be used to capture cyclical patterns in time-based data.

In the next chapter, we will cover the final type of columns: categorical/string columns.

11 Exercise

Path to the exercise: content/exercises/06_feat_eng_datetimes.ipynb

Use one of the methods explained so far (Cleaner/ApplyToCols) to convert the provided dataframe to datetime dtype, then extract the following features:

  • All parts of the datetime
  • The number of seconds from epoch
  • The day in the week
  • The day of the year

Hint: use the format "%d %B %Y" for the datetime.

import pandas as pd

data = {
    "admission_dates": [
        "03 January 2023",
        "15 February 2023",
        "27 March 2023",
        "10 April 2023",
    ],
    "patient_ids": [101, 102, 103, 104],
    "age": [25, 34, 45, 52],
    "outcome": ["Recovered", "Under Treatment", "Recovered", "Deceased"],
}
df = pd.DataFrame(data)
print(df)
    admission_dates  patient_ids  age          outcome
0   03 January 2023          101   25        Recovered
1  15 February 2023          102   34  Under Treatment
2     27 March 2023          103   45        Recovered
3     10 April 2023          104   52         Deceased
# Write your solution here
# Solution with ApplyToCols and ToDatetime
from skrub import ApplyToCols, ToDatetime, DatetimeEncoder
from sklearn.pipeline import make_pipeline
import skrub.selectors as s

to_datetime_encoder = ApplyToCols(ToDatetime(format="%d %B %Y"), cols="admission_dates")

datetime_encoder = ApplyToCols(
    DatetimeEncoder(add_total_seconds=True, add_weekday=True, add_day_of_year=True),
    cols=s.any_date(),
)

encoder = make_pipeline(to_datetime_encoder, datetime_encoder)
encoder.fit_transform(df)
admission_dates_year admission_dates_month admission_dates_day admission_dates_total_seconds admission_dates_weekday admission_dates_day_of_year patient_ids age outcome
0 2023.0 1.0 3.0 1.672704e+09 2.0 3.0 101 25 Recovered
1 2023.0 2.0 15.0 1.676419e+09 3.0 46.0 102 34 Under Treatment
2 2023.0 3.0 27.0 1.679875e+09 1.0 86.0 103 45 Recovered
3 2023.0 4.0 10.0 1.681085e+09 1.0 100.0 104 52 Deceased
# Solution with Cleaner
from skrub import ApplyToCols, Cleaner, DatetimeEncoder
from sklearn.pipeline import make_pipeline
import skrub.selectors as s

datetime_encoder = ApplyToCols(
    DatetimeEncoder(add_total_seconds=True, add_weekday=True, add_day_of_year=True),
    cols=s.any_date(),
)

encoder = make_pipeline(Cleaner(datetime_format="%d %B %Y"), datetime_encoder)
encoder.fit_transform(df)
admission_dates_year admission_dates_month admission_dates_day admission_dates_total_seconds admission_dates_weekday admission_dates_day_of_year patient_ids age outcome
0 2023.0 1.0 3.0 1.672704e+09 2.0 3.0 101 25 Recovered
1 2023.0 2.0 15.0 1.676419e+09 3.0 46.0 102 34 Under Treatment
2 2023.0 3.0 27.0 1.679875e+09 1.0 86.0 103 45 Recovered
3 2023.0 4.0 10.0 1.681085e+09 1.0 100.0 104 52 Deceased

Modify the script so that the DatetimeEncoder adds periodic encoding with sine and cosine (aka circular encoding):

# Write your solution here

Now modify the script above to add spline features (periodic_encoding="spline").

# Solution
from skrub import ApplyToCols, Cleaner, DatetimeEncoder
from sklearn.pipeline import make_pipeline
import skrub.selectors as s

datetime_encoder = ApplyToCols(
    DatetimeEncoder(
        periodic_encoding="spline",
        add_total_seconds=True,
        add_weekday=True,
        add_day_of_year=True,
    ),
    cols=s.any_date(),
)

encoder = make_pipeline(Cleaner(datetime_format="%d %B %Y"), datetime_encoder)
encoder.fit_transform(df)
admission_dates_year admission_dates_total_seconds admission_dates_day_of_year admission_dates_month_spline_00 admission_dates_month_spline_01 admission_dates_month_spline_02 admission_dates_month_spline_03 admission_dates_month_spline_04 admission_dates_month_spline_05 admission_dates_month_spline_06 ... admission_dates_weekday_spline_0 admission_dates_weekday_spline_1 admission_dates_weekday_spline_2 admission_dates_weekday_spline_3 admission_dates_weekday_spline_4 admission_dates_weekday_spline_5 admission_dates_weekday_spline_6 patient_ids age outcome
0 2023.0 1.672704e+09 3.0 0.0 0.166667 0.666667 0.166667 0.000000 0.000000 0.000000 ... 0.0 0.000000 0.166667 0.666667 0.166667 0.000000 0.0 101 25 Recovered
1 2023.0 1.676419e+09 46.0 0.0 0.000000 0.166667 0.666667 0.166667 0.000000 0.000000 ... 0.0 0.000000 0.000000 0.166667 0.666667 0.166667 0.0 102 34 Under Treatment
2 2023.0 1.679875e+09 86.0 0.0 0.000000 0.000000 0.166667 0.666667 0.166667 0.000000 ... 0.0 0.166667 0.666667 0.166667 0.000000 0.000000 0.0 103 45 Recovered
3 2023.0 1.681085e+09 100.0 0.0 0.000000 0.000000 0.000000 0.166667 0.666667 0.166667 ... 0.0 0.166667 0.666667 0.166667 0.000000 0.000000 0.0 104 52 Deceased

4 rows × 29 columns