Handling datetimes: parsing from strings and encoding as numbers#

Depending on the input data, timestamps and dates can cause issues, or require specific parsing. For example, reading input data stored in csv format results in datetime columns that are treated as strings.

In such cases, parsing columns that contain timestamps or dates so that they are treated as datetime objects allows to make use of advanced functionalities available in the standard Python library, Pandas and Polars.

Skrub provides objects that help with parsing such data (ToDatetime), as well as the DatetimeEncoder, a datetime-specific encoder that feature engineers datetime columns.

Parsing Datetime Strings with ToDatetime#

Skrub provides helpers to parse datetime string columns automatically:

  • The ToDatetime transformer learns a mapping between columns and their formats. It then applies this mapping during the transform step.

  • The to_datetime() function applies the ToDatetime transformer to all columns in the dataframe, and tries to parse them as datetimes. The format can be inferred or user-specified with the format argument.

>>> import pandas as pd
>>> s = pd.Series(["2024-05-05T13:17:52", None, "2024-05-07T13:17:52"], name="when")
>>> s
0    2024-05-05T13:17:52
1                   None
2    2024-05-07T13:17:52
Name: when, dtype: object
>>> from skrub import ToDatetime
>>> to_dt = ToDatetime()
>>> to_dt.fit_transform(s)
0   2024-05-05 13:17:52
1                   NaT
2   2024-05-07 13:17:52
Name: when, dtype: datetime64[...]

The attributes format_, output_dtype_, output_time_zone_ record information about the conversion result.

>>> to_dt.format_
'%Y-%m-%dT%H:%M:%S'
>>> to_dt.output_dtype_
dtype('<M8[...]')
>>> to_dt.output_time_zone_ is None
True

Once ToDatetime was successfully fitted, transform will always try to parse datetimes with the same format and output the same dtype. Entries that fail to be converted result in a null value:

>>> s = pd.Series(["2024-05-05T13:17:52", None, "2024-05-07T13:17:52"], name="when")
>>> to_dt = ToDatetime().fit(s)
>>> to_dt.transform(s)
0   2024-05-05 13:17:52
1                   NaT
2   2024-05-07 13:17:52
Name: when, dtype: datetime64[...]
>>> s = pd.Series(["05/05/2024", None, "07/05/2024"], name="when")
>>> to_dt.transform(s)
0   NaT
1   NaT
2   NaT
Name: when, dtype: datetime64[...]

Dealing with Time zones#

During fit, parsing strings that contain fixed offsets results in datetimes in UTC. Mixed offsets are supported and will all be converted to UTC.

>>> s = pd.Series(["2020-01-01T04:00:00+02:00", "2020-01-01T04:00:00+03:00"])
>>> to_dt.fit_transform(s)
0   2020-01-01 02:00:00+00:00
1   2020-01-01 01:00:00+00:00
dtype: datetime64[..., UTC]
>>> to_dt.format_
'%Y-%m-%dT%H:%M:%S%z'
>>> to_dt.output_time_zone_
'UTC'

Strings with no timezone indication result in naive datetimes:

>>> s = pd.Series(["2020-01-01T04:00:00", "2020-01-01T04:00:00"])
>>> to_dt.fit_transform(s)
0   2020-01-01 04:00:00
1   2020-01-01 04:00:00
dtype: datetime64[...]
>>> to_dt.output_time_zone_ is None
True

During transform, outputs are cast to the same dtype that was found during fit. This includes the timezone, which is converted if necessary.

>>> s_paris = pd.to_datetime(
...     pd.Series(["2024-05-07T14:24:49", "2024-05-06T14:24:49"])
... ).dt.tz_localize("Europe/Paris")
>>> s_paris
0   2024-05-07 14:24:49+02:00
1   2024-05-06 14:24:49+02:00
dtype: datetime64[..., Europe/Paris]
>>> to_dt = ToDatetime().fit(s_paris)
>>> to_dt.output_dtype_
datetime64[..., Europe/Paris]

Here our converter is set to output datetimes with nanosecond resolution, localized in “Europe/Paris”.

We may have a column in a different timezone:

>>> s_london = s_paris.dt.tz_convert("Europe/London")
>>> s_london
0   2024-05-07 13:24:49+01:00
1   2024-05-06 13:24:49+01:00
dtype: datetime64[..., Europe/London]

Here the timezone is “Europe/London” and the times are offset by 1 hour. During transform datetimes will be converted to the original dtype and the “Europe/Paris” timezone:

>>> to_dt.transform(s_london)
0   2024-05-07 14:24:49+02:00
1   2024-05-06 14:24:49+02:00
dtype: datetime64[..., Europe/Paris]

Moreover, we may have to transform a timezone-naive column whereas the transformer was fitted on a timezone-aware column. Note that this is somewhat a corner case unlikely to happen in practice if the inputs to fit and transform come from the same dataframe.

In this case, we make the arbitrary choice to assume that the timezone-naive datetimes are in UTC.

>>> s_naive = s_paris.dt.tz_convert(None)
>>> to_dt.transform(s_naive)
0   2024-05-07 14:24:49+02:00
1   2024-05-06 14:24:49+02:00
dtype: datetime64[..., Europe/Paris]

Conversely, a transformer fitted on a timezone-naive column can convert timezone-aware columns. Here also, we assume the naive datetimes were in UTC.

>>> to_dt = ToDatetime().fit(s_naive)
>>> to_dt.transform(s_london)
0   2024-05-07 12:24:49
1   2024-05-06 12:24:49
dtype: datetime64[...]

Caveats when dealing with month first/day first conventions#

When parsing strings in one of the formats above, ToDatetime tries to guess if the month comes first (USA convention) or the day (rest of the world) from the data.

>>> s = pd.Series(["05/23/2024"])
>>> to_dt.fit_transform(s)
0   2024-05-23
dtype: datetime64[...]
>>> to_dt.format_
'%m/%d/%Y'

Here we could infer '%m/%d/%Y' because there is no 23rd month in a year. Similarly,

>>> s = pd.Series(["23/05/2024"])
>>> to_dt.fit_transform(s)
0   2024-05-23
dtype: datetime64[...]
>>> to_dt.format_
'%d/%m/%Y'

In the case where it cannot be inferred, the USA convention is used:

>>> s = pd.Series(["03/05/2024"])
>>> to_dt.fit_transform(s)
0   2024-03-05
dtype: datetime64[...]
>>> to_dt.format_
'%m/%d/%Y'

Encoding and Feature Engineering with DatetimeEncoder#

Once datetime columns have been parsed, they can be encoded as numeric features with the DatetimeEncoder, by extracting temporal features (year, month, day, hour, etc.). No timezone conversion is done; the timezone in the feature is retained. The DatetimeEncoder rejects non-datetime columns, so it should only be applied after conversion using ToDatetime. If the input column is timezone aware, the extracted features will be in the column’s timezone; this is normally the case when the datetime column has been encoded with ToDatetime.

>>> import pandas as pd
>>> login = pd.to_datetime(
...     pd.Series(
...         ["2024-05-13T12:05:36", None, "2024-05-15T13:46:02"], name="login")
... )
>>> login
0   2024-05-13 12:05:36
1                   NaT
2   2024-05-15 13:46:02
Name: login, dtype: datetime64[...]
>>> from skrub import DatetimeEncoder
>>> DatetimeEncoder().fit_transform(login)
login_year  login_month  login_day  login_hour  login_total_seconds
0      2024.0          5.0       13.0        12.0         1.715602e+09
1         NaN          NaN        NaN         NaN                  NaN
2      2024.0          5.0       15.0        13.0         1.715781e+09

Additionally, the DatetimeEncoder can include the following features:

  • Number of seconds from epoch (add_total_seconds, True by default)

  • Day of the week (add_weekday)

  • Day of the year (add_day_of_year)

Periodic encoding is supported through trigonometric (circular) and spline encoding: set the periodic_encoding parameter to circular or spline.

Periodic encoding of datetime features

Example of periodic encoding of datetime features using circular and spline methods.#

Note that if periodic_encoding is set, the respective features are removed to reduce redundancy:

>>> encoder = DatetimeEncoder()
>>> encoder.fit_transform(login).columns
Index(['login_year', 'login_month', 'login_day', 'login_hour',
       'login_total_seconds'],
      dtype='object')
>>> from sklearn.pipeline import make_pipeline
>>> encoder = make_pipeline(ToDatetime(), DatetimeEncoder(periodic_encoding="circular"))
>>> encoder.fit_transform(login).columns
Index(['login_year', 'login_total_seconds', 'login_month_circular_0',
       'login_month_circular_1', 'login_day_circular_0',
       'login_day_circular_1', 'login_hour_circular_0',
       'login_hour_circular_1'],
      dtype='object')

The DatetimeEncoder uses hardcoded values for generating periodic features. The period of each feature is:

  • month: 12 (month in year)

  • day: 30 (day in month)

  • hour: 24 (hour in day)

  • weekday: 7 (day in week)

Additionally, we specify the number of splines for each feature to avoid generating too many features:

  • month: 12

  • day: 4

  • hour: 12

  • weekday: 7

All extracted features are provided as float32 columns.