Handling datetimes: parsing from strings and encoding as numbers#
Depending on the input data, timestamps and dates can cause issues, or require
specific parsing. For example, reading input data stored in csv
format results
in datetime columns that are treated as strings.
In such cases, parsing columns that contain timestamps or dates so that they are treated as datetime objects allows to make use of advanced functionalities available in the standard Python library, Pandas and Polars.
Skrub provides objects that help with parsing such data (ToDatetime
), as well
as the DatetimeEncoder
, a datetime-specific encoder that feature engineers
datetime columns.
Parsing Datetime Strings with ToDatetime
#
Skrub provides helpers to parse datetime string columns automatically:
The
ToDatetime
transformer learns a mapping between columns and their formats. It then applies this mapping during the transform step.The
to_datetime()
function applies theToDatetime
transformer to all columns in the dataframe, and tries to parse them as datetimes. The format can be inferred or user-specified with theformat
argument.
>>> import pandas as pd
>>> s = pd.Series(["2024-05-05T13:17:52", None, "2024-05-07T13:17:52"], name="when")
>>> s
0 2024-05-05T13:17:52
1 None
2 2024-05-07T13:17:52
Name: when, dtype: object
>>> from skrub import ToDatetime
>>> to_dt = ToDatetime()
>>> to_dt.fit_transform(s)
0 2024-05-05 13:17:52
1 NaT
2 2024-05-07 13:17:52
Name: when, dtype: datetime64[...]
The attributes format_
, output_dtype_
, output_time_zone_
record information about the conversion result.
>>> to_dt.format_
'%Y-%m-%dT%H:%M:%S'
>>> to_dt.output_dtype_
dtype('<M8[...]')
>>> to_dt.output_time_zone_ is None
True
Once ToDatetime
was successfully fitted, transform
will always try to
parse datetimes with the same format and output the same dtype
. Entries that
fail to be converted result in a null value:
>>> s = pd.Series(["2024-05-05T13:17:52", None, "2024-05-07T13:17:52"], name="when")
>>> to_dt = ToDatetime().fit(s)
>>> to_dt.transform(s)
0 2024-05-05 13:17:52
1 NaT
2 2024-05-07 13:17:52
Name: when, dtype: datetime64[...]
>>> s = pd.Series(["05/05/2024", None, "07/05/2024"], name="when")
>>> to_dt.transform(s)
0 NaT
1 NaT
2 NaT
Name: when, dtype: datetime64[...]
Dealing with Time zones#
During fit
, parsing strings that contain fixed offsets results in datetimes
in UTC. Mixed offsets are supported and will all be converted to UTC.
>>> s = pd.Series(["2020-01-01T04:00:00+02:00", "2020-01-01T04:00:00+03:00"])
>>> to_dt.fit_transform(s)
0 2020-01-01 02:00:00+00:00
1 2020-01-01 01:00:00+00:00
dtype: datetime64[..., UTC]
>>> to_dt.format_
'%Y-%m-%dT%H:%M:%S%z'
>>> to_dt.output_time_zone_
'UTC'
Strings with no timezone indication result in naive datetimes:
>>> s = pd.Series(["2020-01-01T04:00:00", "2020-01-01T04:00:00"])
>>> to_dt.fit_transform(s)
0 2020-01-01 04:00:00
1 2020-01-01 04:00:00
dtype: datetime64[...]
>>> to_dt.output_time_zone_ is None
True
During transform
, outputs are cast to the same dtype
that was found
during fit
. This includes the timezone, which is converted if necessary.
>>> s_paris = pd.to_datetime(
... pd.Series(["2024-05-07T14:24:49", "2024-05-06T14:24:49"])
... ).dt.tz_localize("Europe/Paris")
>>> s_paris
0 2024-05-07 14:24:49+02:00
1 2024-05-06 14:24:49+02:00
dtype: datetime64[..., Europe/Paris]
>>> to_dt = ToDatetime().fit(s_paris)
>>> to_dt.output_dtype_
datetime64[..., Europe/Paris]
Here our converter is set to output datetimes with nanosecond resolution, localized in “Europe/Paris”.
We may have a column in a different timezone:
>>> s_london = s_paris.dt.tz_convert("Europe/London")
>>> s_london
0 2024-05-07 13:24:49+01:00
1 2024-05-06 13:24:49+01:00
dtype: datetime64[..., Europe/London]
Here the timezone is “Europe/London” and the times are offset by 1 hour. During
transform
datetimes will be converted to the original dtype and the
“Europe/Paris” timezone:
>>> to_dt.transform(s_london)
0 2024-05-07 14:24:49+02:00
1 2024-05-06 14:24:49+02:00
dtype: datetime64[..., Europe/Paris]
Moreover, we may have to transform a timezone-naive column whereas the
transformer was fitted on a timezone-aware column. Note that this is somewhat a
corner case unlikely to happen in practice if the inputs to fit
and
transform
come from the same dataframe.
In this case, we make the arbitrary choice to assume that the timezone-naive datetimes are in UTC.
>>> s_naive = s_paris.dt.tz_convert(None)
>>> to_dt.transform(s_naive)
0 2024-05-07 14:24:49+02:00
1 2024-05-06 14:24:49+02:00
dtype: datetime64[..., Europe/Paris]
Conversely, a transformer fitted on a timezone-naive column can convert timezone-aware columns. Here also, we assume the naive datetimes were in UTC.
>>> to_dt = ToDatetime().fit(s_naive)
>>> to_dt.transform(s_london)
0 2024-05-07 12:24:49
1 2024-05-06 12:24:49
dtype: datetime64[...]
Caveats when dealing with month first/day first conventions#
When parsing strings in one of the formats above, ToDatetime
tries to guess
if the month comes first (USA convention) or the day (rest of the world) from
the data.
>>> s = pd.Series(["05/23/2024"])
>>> to_dt.fit_transform(s)
0 2024-05-23
dtype: datetime64[...]
>>> to_dt.format_
'%m/%d/%Y'
Here we could infer '%m/%d/%Y'
because there is no 23rd month in a year.
Similarly,
>>> s = pd.Series(["23/05/2024"])
>>> to_dt.fit_transform(s)
0 2024-05-23
dtype: datetime64[...]
>>> to_dt.format_
'%d/%m/%Y'
In the case where it cannot be inferred, the USA convention is used:
>>> s = pd.Series(["03/05/2024"])
>>> to_dt.fit_transform(s)
0 2024-03-05
dtype: datetime64[...]
>>> to_dt.format_
'%m/%d/%Y'
Encoding and Feature Engineering with DatetimeEncoder
#
Once datetime columns have been parsed, they can be encoded as numeric features with
the DatetimeEncoder
, by extracting temporal features (year, month, day,
hour, etc.). No timezone conversion is done; the timezone
in the feature is retained. The DatetimeEncoder
rejects non-datetime columns,
so it should only be applied after conversion using ToDatetime
.
If the input column is timezone aware, the extracted features will be in the column’s
timezone; this is normally the case when the datetime column has been encoded with ToDatetime
.
>>> import pandas as pd
>>> login = pd.to_datetime(
... pd.Series(
... ["2024-05-13T12:05:36", None, "2024-05-15T13:46:02"], name="login")
... )
>>> login
0 2024-05-13 12:05:36
1 NaT
2 2024-05-15 13:46:02
Name: login, dtype: datetime64[...]
>>> from skrub import DatetimeEncoder
>>> DatetimeEncoder().fit_transform(login)
login_year login_month login_day login_hour login_total_seconds
0 2024.0 5.0 13.0 12.0 1.715602e+09
1 NaN NaN NaN NaN NaN
2 2024.0 5.0 15.0 13.0 1.715781e+09
Additionally, the DatetimeEncoder
can include the following features:
Number of seconds from epoch (
add_total_seconds
,True
by default)Day of the week (
add_weekday
)Day of the year (
add_day_of_year
)
Periodic encoding is supported through trigonometric (circular) and spline
encoding: set the periodic_encoding
parameter to circular
or spline
.

Example of periodic encoding of datetime features using circular and spline methods.#
Note that if periodic_encoding
is set, the respective features are removed
to reduce redundancy:
>>> encoder = DatetimeEncoder()
>>> encoder.fit_transform(login).columns
Index(['login_year', 'login_month', 'login_day', 'login_hour',
'login_total_seconds'],
dtype='object')
>>> from sklearn.pipeline import make_pipeline
>>> encoder = make_pipeline(ToDatetime(), DatetimeEncoder(periodic_encoding="circular"))
>>> encoder.fit_transform(login).columns
Index(['login_year', 'login_total_seconds', 'login_month_circular_0',
'login_month_circular_1', 'login_day_circular_0',
'login_day_circular_1', 'login_hour_circular_0',
'login_hour_circular_1'],
dtype='object')
The DatetimeEncoder
uses hardcoded values for generating periodic features.
The period of each feature is:
month
: 12 (month in year)day
: 30 (day in month)hour
: 24 (hour in day)weekday
: 7 (day in week)
Additionally, we specify the number of splines for each feature to avoid generating too many features:
month
: 12day
: 4hour
: 12weekday
: 7
All extracted features are provided as float32
columns.