Datetime features matter in many data analysis and machine learning tasks because they often carry information about temporal patterns and trends. For instance, features such as the day of the week, the time of day, or the season can improve predictive models.
However, working with datetime data can be difficult due to the variety of formats in which dates and times are represented. Typical formats include "%Y-%m-%d", "%d/%m/%Y", and "%d %B %Y", among others. Correct parsing of these and more exotic formats is essential to avoid errors and ensure accurate feature extraction.
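To make the ambiguity concrete, here is a minimal sketch using Python's standard `datetime.strptime`. The sample strings are invented for illustration; the point is only that the same text yields different dates under different formats, and that a mismatched format raises an error.

```python
from datetime import datetime

# The same string is a valid date under two different formats.
raw = "03/04/2023"
day_first = datetime.strptime(raw, "%d/%m/%Y")    # read as 3 April 2023
month_first = datetime.strptime(raw, "%m/%d/%Y")  # read as 4 March 2023
print(day_first.date(), month_first.date())

# A string that spells out the month name needs the %B directive.
long_form = datetime.strptime("27 March 2023", "%d %B %Y")
print(long_form.date())

# A format that does not match the string raises a ValueError.
try:
    datetime.strptime("2023-03-27", "%d/%m/%Y")
except ValueError as err:
    print("parsing failed:", err)
```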
This section covers how skrub helps with datetimes through to_datetime, ToDatetime, and the DatetimeEncoder.
10.2 Converting datetime strings to datetime objects
Often, the first step when working with datetimes is converting them from a string representation to a proper datetime object. Doing so gives access to datetime-specific operations and to the individual parts of the datetime.
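As a small illustration of what the conversion buys, the sketch below uses plain pandas (not skrub) on two made-up date strings. Once parsed, the column exposes the `.dt` accessor and supports date arithmetic, neither of which is available on the raw strings.

```python
import pandas as pd

# Dates stored as plain strings: no datetime operations are available yet.
raw = pd.Series(["2023-01-03", "2023-02-15"], name="dates")

# After parsing, the .dt accessor exposes the parts of each datetime.
parsed = pd.to_datetime(raw, format="%Y-%m-%d")
print(parsed.dt.year.tolist())
print(parsed.dt.month.tolist())
print(parsed.dt.day_name().tolist())

# Arithmetic between datetimes also becomes possible.
gap_days = (parsed.iloc[1] - parsed.iloc[0]).days
print(gap_days)
```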
Skrub provides different objects to deal with the conversion problem.
ToDatetime is a single-column transformer that tries to convert the given column to datetime, either by relying on a user-provided format or by guessing common formats. Since this transformer must be applied to single columns (rather than dataframes), it is typically better to use it in conjunction with ApplyToCols. Additionally, the allow_reject parameter of ApplyToCols should be set to True to avoid raising exceptions for non-datetime columns:
from skrub import ApplyToCols, ToDatetime
import pandas as pd

data = {
    "dates": [
        "2023-01-03",
        "2023-02-15",
        "2023-03-27",
        "2023-04-10",
    ]
}
df = pd.DataFrame(data)
df_enc = ApplyToCols(ToDatetime(), allow_reject=True).fit_transform(df)
df_enc.info()
The to_datetime function works similarly to pd.to_datetime, or to the ApplyToCols example shown above.
Warning
to_datetime is a stateless function, so it should not be used in a pipeline, because it does not guarantee consistency between fit_transform and successive transform. ApplyToCols(ToDatetime(), allow_reject=True) is a better solution for pipelines.
Finally, the standard Cleaner can also parse datetimes: it uses ToDatetime under the hood and accepts a datetime_format parameter. As the Cleaner is a transformer, it guarantees consistency between fit_transform and transform.
10.3 Encoding datetime features
Datetimes cannot be used “as-is” for training ML models, and must instead be converted to numerical features. Typically, this is done by “splitting” the datetime parts (year, month, day etc.) into separate columns, so that each column contains only one number.
Additional features may also be of interest, such as the number of seconds since epoch (which increases monotonically and gives an indication of the order of entries), whether a date is a weekday or weekend, or the day of the year.
To achieve this with standard dataframe libraries, the code looks like this:
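A minimal pandas sketch of this manual approach, assuming an already-parsed column named `dates`. The output column names are illustrative choices, not a fixed convention.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "dates": pd.to_datetime(
            ["2023-01-03", "2023-02-15", "2023-03-27", "2023-04-10"]
        )
    }
)
dates = df["dates"]

features = pd.DataFrame(
    {
        # Basic parts of the datetime, one number per column.
        "year": dates.dt.year,
        "month": dates.dt.month,
        "day": dates.dt.day,
        # Seconds elapsed since the Unix epoch (1970-01-01).
        "total_seconds": (dates - pd.Timestamp("1970-01-01")).dt.total_seconds(),
        # pandas counts weekdays from Monday=0 to Sunday=6.
        "weekday": dates.dt.weekday,
        "is_weekend": dates.dt.weekday >= 5,
        "day_of_year": dates.dt.day_of_year,
    }
)
print(features)
```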
Skrub’s DatetimeEncoder makes it possible to add the same features with a simpler interface. As the DatetimeEncoder is a single-column transformer, we again use ApplyToCols.
The DatetimeEncoder includes various parameters to add more features to the transformed dataframe:
- add_total_seconds adds the number of seconds since Epoch (1970-01-01)
- add_weekday adds the day in the week (to highlight weekends, for example)
- add_day_of_year adds the day in year of the datetime
Periodic features are useful for training machine learning models because they capture the cyclical nature of certain data patterns. For example, features such as hours in a day or days in a week often exhibit periodic behavior. By encoding these features periodically, models can better understand and predict patterns that repeat over time, such as daily traffic trends, or seasonal variations. This ensures that the model treats the start and end of a cycle as close neighbors, improving its ability to generalize and make accurate predictions.
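The "close neighbors" argument can be checked numerically. The sketch below uses only the standard library; the helper name `circular` is hypothetical. It maps an hour onto the unit circle and compares distances before and after encoding.

```python
import math


def circular(value, period):
    """Map a periodic value to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * value / period
    return (math.sin(angle), math.cos(angle))


# As raw numbers, 23:00 and 00:00 look far apart.
raw_gap = abs(23 - 0)

# On the circle they are neighbors, while 12:00 sits opposite 00:00.
late_gap = math.dist(circular(23, 24), circular(0, 24))
noon_gap = math.dist(circular(12, 24), circular(0, 24))

print(raw_gap, round(late_gap, 3), round(noon_gap, 3))
```

With the raw values a model sees a gap of 23 between the last and first hour of the day; with the encoded values that gap is smaller than the gap between midnight and noon.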
This can be done manually with dataframe libraries. For example, circular encoding (a.k.a., trigonometric or sin/cos encoding) can be implemented with Pandas like so:
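One possible version, assuming a parsed `dates` column and fixed periods of 365 days for the year and 7 days for the week (leap years are ignored in this sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "dates": pd.to_datetime(
            ["2023-01-03", "2023-02-15", "2023-03-27", "2023-04-10"]
        )
    }
)

# Day of the year, encoded as a point on a circle with period 365.
df["day_of_year"] = df["dates"].dt.day_of_year
df["day_of_year_sin"] = np.sin(2 * np.pi * df["day_of_year"] / 365)
df["day_of_year_cos"] = np.cos(2 * np.pi * df["day_of_year"] / 365)

# Day of the week (Monday=0), encoded with period 7.
df["weekday"] = df["dates"].dt.weekday
df["weekday_sin"] = np.sin(2 * np.pi * df["weekday"] / 7)
df["weekday_cos"] = np.cos(2 * np.pi * df["weekday"] / 7)

print(df)
```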
Alternatively, the DatetimeEncoder can add periodic features using either circular or spline encoding through the periodic_encoding parameter:
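from skrub import ApplyToCols, DatetimeEncoder

# df_enc is the dataframe from the previous steps, with a parsed "dates" column
de = DatetimeEncoder(periodic_encoding="circular")
df_enc = ApplyToCols(de, cols="dates").fit_transform(df_enc)
df_enc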
   dates_year  dates_total_seconds  dates_month_circular_0  dates_month_circular_1  dates_day_circular_0  dates_day_circular_1
0      2023.0         1.672704e+09                0.500000            8.660254e-01          5.877853e-01              0.809017
1      2023.0         1.676419e+09                0.866025            5.000000e-01          1.224647e-16             -1.000000
2      2023.0         1.679875e+09                1.000000            6.123234e-17         -5.877853e-01              0.809017
3      2023.0         1.681085e+09                0.866025           -5.000000e-01          8.660254e-01             -0.500000

   day_of_year  day_of_year_sin  day_of_year_cos  weekday  weekday_sin  weekday_cos
0            3         0.051620         0.998667        1     0.781831     0.623490
1           46         0.711657         0.702527        2     0.974928    -0.222521
2           86         0.995919         0.090252        0     0.000000     1.000000
3          100         0.988678        -0.150055        0     0.000000     1.000000
10.5 Conclusions
In this chapter, we explored the importance and challenges of working with datetime features. We covered how to convert string representations of dates to datetime objects using skrub’s ToDatetime transformer and the Cleaner, both of which can be integrated into pipelines for robust preprocessing.
We also discussed the need to encode datetime features into numerical representations suitable for machine learning models. The DatetimeEncoder provides a convenient way to extract useful components such as year, month, day, weekday, day of year, and total seconds since epoch. Additionally, we saw how periodic (circular) encoding can be used to capture cyclical patterns in time-based data.
In the next chapter, we will cover the final type of columns: categorical/string columns.
11 Exercise
Path to the exercise: content/exercises/06_feat_eng_datetimes.ipynb
Use one of the methods explained so far (Cleaner/ApplyToCols) to convert the admission_dates column of the provided dataframe to datetime dtype, then extract the following features:
All parts of the datetime
The number of seconds from epoch
The day in the week
The day of the year
Hint: use the format "%d %B %Y" for the datetime.
import pandas as pd

data = {
    "admission_dates": [
        "03 January 2023",
        "15 February 2023",
        "27 March 2023",
        "10 April 2023",
    ],
    "patient_ids": [101, 102, 103, 104],
    "age": [25, 34, 45, 52],
    "outcome": ["Recovered", "Under Treatment", "Recovered", "Deceased"],
}
df = pd.DataFrame(data)
print(df)
    admission_dates  patient_ids  age          outcome
0   03 January 2023          101   25        Recovered
1  15 February 2023          102   34  Under Treatment
2     27 March 2023          103   45        Recovered
3     10 April 2023          104   52         Deceased