.. |ToDatetime| replace:: :class:`~skrub.ToDatetime`
.. |to_datetime| replace:: :func:`~skrub.to_datetime`
.. |DatetimeEncoder| replace:: :class:`~skrub.DatetimeEncoder`

.. _user_guide_feature_engineering_datetimes:

Handling datetimes: parsing from strings and encoding as numbers
================================================================

Depending on the input data, timestamps and dates can cause issues or require
specific parsing. For example, reading data stored in ``csv`` format yields
datetime columns that are treated as strings. Parsing such columns into datetime
objects makes it possible to use the advanced functionality available in the
Python standard library, Pandas and Polars.

Skrub provides objects that help with parsing such data (|ToDatetime|), as well
as the |DatetimeEncoder|, a datetime-specific encoder that engineers features
from datetime columns.

Parsing Datetime Strings with |ToDatetime|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Skrub provides helpers to parse datetime string columns automatically:

- The |ToDatetime| transformer learns a mapping between columns and their
  formats. It then applies this mapping during the transform step.
- The |to_datetime| function applies the |ToDatetime| transformer to all columns
  in the dataframe, and tries to parse them as datetimes. The format can be
  inferred or user-specified with the ``format`` argument.

>>> import pandas as pd
>>> s = pd.Series(["2024-05-05T13:17:52", None, "2024-05-07T13:17:52"], name="when")
>>> s
0    2024-05-05T13:17:52
1                   None
2    2024-05-07T13:17:52
Name: when, dtype: object

>>> from skrub import ToDatetime
>>> to_dt = ToDatetime()
>>> to_dt.fit_transform(s)
0   2024-05-05 13:17:52
1                   NaT
2   2024-05-07 13:17:52
Name: when, dtype: datetime64[...]

The attributes ``format_``, ``output_dtype_`` and ``output_time_zone_`` record
information about the conversion result.


>>> to_dt.format_
'%Y-%m-%dT%H:%M:%S'
>>> to_dt.output_dtype_
dtype('<M8[...]')
>>> to_dt.output_time_zone_ is None
True

Once |ToDatetime| has been successfully fitted, ``transform`` always tries to
parse datetimes with the same format and outputs the same ``dtype``. Entries that
fail to be converted result in a null value:

>>> s = pd.Series(["2024-05-05T13:17:52", None, "2024-05-07T13:17:52"], name="when")
>>> to_dt = ToDatetime().fit(s)
>>> to_dt.transform(s)
0   2024-05-05 13:17:52
1                   NaT
2   2024-05-07 13:17:52
Name: when, dtype: datetime64[...]

>>> s = pd.Series(["05/05/2024", None, "07/05/2024"], name="when")
>>> to_dt.transform(s)
0   NaT
1   NaT
2   NaT
Name: when, dtype: datetime64[...]

Dealing with Time zones
^^^^^^^^^^^^^^^^^^^^^^^

During ``fit``, parsing strings that contain fixed offsets results in datetimes
in UTC. Mixed offsets are supported and are all converted to UTC.

>>> s = pd.Series(["2020-01-01T04:00:00+02:00", "2020-01-01T04:00:00+03:00"])
>>> to_dt.fit_transform(s)
0   2020-01-01 02:00:00+00:00
1   2020-01-01 01:00:00+00:00
dtype: datetime64[..., UTC]
>>> to_dt.format_
'%Y-%m-%dT%H:%M:%S%z'
>>> to_dt.output_time_zone_
'UTC'

Strings with no timezone indication result in naive datetimes:

>>> s = pd.Series(["2020-01-01T04:00:00", "2020-01-01T04:00:00"])
>>> to_dt.fit_transform(s)
0   2020-01-01 04:00:00
1   2020-01-01 04:00:00
dtype: datetime64[...]
>>> to_dt.output_time_zone_ is None
True

During ``transform``, outputs are cast to the same ``dtype`` that was found
during ``fit``. This includes the timezone, which is converted if necessary.

>>> s_paris = pd.to_datetime(
...     pd.Series(["2024-05-07T14:24:49", "2024-05-06T14:24:49"])
... ).dt.tz_localize("Europe/Paris")
>>> s_paris
0   2024-05-07 14:24:49+02:00
1   2024-05-06 14:24:49+02:00
dtype: datetime64[..., Europe/Paris]
>>> to_dt = ToDatetime().fit(s_paris)
>>> to_dt.output_dtype_
datetime64[..., Europe/Paris]

Here our converter is set to output datetimes with nanosecond resolution,
localized in "Europe/Paris".
We may have a column in a different timezone:

>>> s_london = s_paris.dt.tz_convert("Europe/London")
>>> s_london
0   2024-05-07 13:24:49+01:00
1   2024-05-06 13:24:49+01:00
dtype: datetime64[..., Europe/London]

Here the timezone is "Europe/London" and the times are offset by 1 hour. During
``transform``, datetimes are converted to the original dtype and the
"Europe/Paris" timezone:

>>> to_dt.transform(s_london)
0   2024-05-07 14:24:49+02:00
1   2024-05-06 14:24:49+02:00
dtype: datetime64[..., Europe/Paris]

We may also have to transform a timezone-naive column with a transformer that
was fitted on a timezone-aware column. This is a corner case that is unlikely to
happen in practice if the inputs to ``fit`` and ``transform`` come from the same
dataframe. In this case, we make the arbitrary choice to assume that the
timezone-naive datetimes are in UTC.

>>> s_naive = s_paris.dt.tz_convert(None)
>>> to_dt.transform(s_naive)
0   2024-05-07 14:24:49+02:00
1   2024-05-06 14:24:49+02:00
dtype: datetime64[..., Europe/Paris]

Conversely, a transformer fitted on a timezone-naive column can convert
timezone-aware columns. Here too, the naive datetimes are assumed to be in UTC.

>>> to_dt = ToDatetime().fit(s_naive)
>>> to_dt.transform(s_london)
0   2024-05-07 12:24:49
1   2024-05-06 12:24:49
dtype: datetime64[...]

Caveats when dealing with month first/day first conventions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When parsing date strings in which the day and the month could be swapped,
|ToDatetime| tries to guess from the data whether the month comes first (USA
convention) or the day (rest of the world).

>>> s = pd.Series(["05/23/2024"])
>>> to_dt.fit_transform(s)
0   2024-05-23
dtype: datetime64[...]
>>> to_dt.format_
'%m/%d/%Y'

Here we could infer ``'%m/%d/%Y'`` because there is no 23rd month in a year.
Similarly,

>>> s = pd.Series(["23/05/2024"])
>>> to_dt.fit_transform(s)
0   2024-05-23
dtype: datetime64[...]
>>> to_dt.format_
'%d/%m/%Y'

When the order cannot be inferred, the USA convention is used:

>>> s = pd.Series(["03/05/2024"])
>>> to_dt.fit_transform(s)
0   2024-03-05
dtype: datetime64[...]
>>> to_dt.format_
'%m/%d/%Y'

.. _user_guide_datetime_encoder:

Encoding and Feature Engineering with |DatetimeEncoder|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once datetime columns have been parsed, they can be encoded as numeric features
with the |DatetimeEncoder|, which extracts temporal features (year, month, day,
hour, etc.). The |DatetimeEncoder| rejects non-datetime columns, so it should
only be applied after conversion using |ToDatetime|.

No timezone conversion is done: if the input column is timezone-aware, the
extracted features are expressed in the column's timezone. This is normally the
case when the datetime column has been parsed with |ToDatetime|.

>>> import pandas as pd
>>> login = pd.to_datetime(
...     pd.Series(
...         ["2024-05-13T12:05:36", None, "2024-05-15T13:46:02"], name="login")
... )
>>> login
0   2024-05-13 12:05:36
1                   NaT
2   2024-05-15 13:46:02
Name: login, dtype: datetime64[...]

>>> from skrub import DatetimeEncoder
>>> DatetimeEncoder().fit_transform(login)
   login_year  login_month  login_day  login_hour  login_total_seconds
0      2024.0          5.0       13.0        12.0         1.715602e+09
1         NaN          NaN        NaN         NaN                  NaN
2      2024.0          5.0       15.0        13.0         1.715781e+09

Additionally, the |DatetimeEncoder| can include the following features:

- Number of seconds from epoch (``add_total_seconds``, ``True`` by default)
- Day of the week (``add_weekday``)
- Day of the year (``add_day_of_year``)

Periodic encoding is supported through trigonometric (circular) and spline
encoding: set the ``periodic_encoding`` parameter to ``"circular"`` or
``"spline"``.

.. figure:: /_static/periodic_features.png
   :alt: Periodic encoding of datetime features
   :align: center
   :width: 70%

   Example of periodic encoding of datetime features using circular and spline methods.

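
To make the idea of circular encoding concrete, the sketch below maps an hour of
the day to a sine/cosine pair with NumPy. It is an illustration only, not
skrub's implementation: the helper name ``circular_encode``, the period of 24
for hours and the order of the two output columns are assumptions made for this
example.

```python
import numpy as np


def circular_encode(values, period):
    """Map periodic values to points on the unit circle (hypothetical helper)."""
    # Convert each value to an angle covering one full turn per period.
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    # Two columns: sine and cosine of the angle (column order is an assumption).
    return np.column_stack([np.sin(angle), np.cos(angle)])


hours = np.array([0, 6, 12, 23])
encoded = circular_encode(hours, period=24)

# Euclidean distance between encoded hours: 23h is close to 0h, 12h is far.
gap_23_0 = np.linalg.norm(encoded[3] - encoded[0])
gap_12_0 = np.linalg.norm(encoded[2] - encoded[0])
```

With this representation, 23:00 and 00:00 end up close together, which a plain
integer hour feature cannot express.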

Note that if ``periodic_encoding`` is set, the features that are encoded
periodically are removed in their raw form to reduce redundancy:

>>> encoder = DatetimeEncoder()
>>> encoder.fit_transform(login).columns
Index(['login_year', 'login_month', 'login_day', 'login_hour',
       'login_total_seconds'],
      dtype='object')

>>> from sklearn.pipeline import make_pipeline
>>> encoder = make_pipeline(ToDatetime(), DatetimeEncoder(periodic_encoding="circular"))
>>> encoder.fit_transform(login).columns
Index(['login_year', 'login_total_seconds', 'login_month_circular_0',
       'login_month_circular_1', 'login_day_circular_0',
       'login_day_circular_1', 'login_hour_circular_0',
       'login_hour_circular_1'],
      dtype='object')

The |DatetimeEncoder| uses hardcoded values for generating periodic features.
The period of each feature is:

- ``month``: 12 (month in year)
- ``day``: 30 (day in month)
- ``hour``: 24 (hour in day)
- ``weekday``: 7 (day in week)

The number of splines for each feature is also fixed, to avoid generating too
many features:

- ``month``: 12
- ``day``: 4
- ``hour``: 12
- ``weekday``: 7

All extracted features are provided as ``float32`` columns.