DatetimeEncoder#

class skrub.DatetimeEncoder(resolution='hour', add_weekday=False, add_total_seconds=True, add_day_of_year=False, periodic_encoding=None)[source]#

Extract temporal features such as month, day of the week, … from a datetime column.

Note

DatetimeEncoder is a type of single-column transformer. Unlike most scikit-learn estimators, its fit, transform and fit_transform methods expect a single column (a pandas or polars Series) rather than a full dataframe. To apply this transformer to one or more columns in a dataframe, use it as a parameter in a skrub.TableVectorizer or sklearn.compose.ColumnTransformer. In the ColumnTransformer, pass a single column: make_column_transformer((DatetimeEncoder(), 'col_name_1'), (DatetimeEncoder(), 'col_name_2')) instead of make_column_transformer((DatetimeEncoder(), ['col_name_1', 'col_name_2'])).

The DatetimeEncoder converts datetime features to numerical features that can be used by learners. It separates each datetime in its parts (year, month, day, etc.), and adds new features based on the datetime (weekday, seconds from epoch, day of year). Circular or spline-based periodic features may also be included.

Parameters:

resolutionstr or None, default=”hour”: If a string, extract up to this resolution. Must be “year”, “month”, “day”, “hour”, “minute”, “second”, “microsecond”, or “nanosecond”. For example, resolution="day" generates the features “year”, “month”, and “day” only. If the input column contains dates with no time information, time features (“hour”, “minute”, … ) are never extracted. If None, the features listed above are not extracted (but day of the week and total seconds may still be extracted, see below).
add_weekdaybool, default=False: Extract the day of the week as a numerical feature from 1 (Monday) to 7 (Sunday).
add_total_secondsbool, default=True: Add the total number of seconds since the Unix epoch (00:00:00 UTC on 1 January 1970).
add_day_of_yearbool, default=False: Add the day of year (ordinal day) as an integer in the range 1 to 365 (or 366 in the case of leap years). January 1st will be day 1, December 31st will be day 365 on non-leap years.
periodic_encoding‘circular’, ‘spline’, or None, default=None: Add periodic features with different granularities. Add periodic features using either trigonometric (circular) or spline encoding.

Attributes:

extracted_features_list of strings: The features that are extracted, a subset of [“year”, …, “nanosecond”, “weekday”, “total_seconds”]. If periodic_encoding is set to either circular or spline, the extracted periodic features will also be added. Given a feature named ``date, new features will be named date_year_circular_0, date_year_circular_1 etc.

See also

ToDatetime: Convert strings to datetimes.

Notes

All extracted features are provided as float32 columns.

No timezone conversion is performed: if the input column is timezone aware, the extracted features will be in the column’s timezone.

An input column that does not have a Date or Datetime dtype will be rejected by raising a RejectColumn exception. See ToDatetime for converting strings to proper datetimes. Note: the TableVectorizer only sends datetime columns to its datetime_encoder. Therefore it is always safe to use a DatetimeEncoder as the TableVectorizer’s datetime_encoder parameter.

The DatetimeEncoder uses hardcoded values for generating periodic features. The period of each feature is:

month: 12 (month in year)
day: 30 (day in month)
hour: 24 (hour in day)
weekday: 7 (day in week)

Additionally, we specify the number of splines for each feature to avoid generating too many features:

month: 12
day: 4
hour: 12
weekday: 7

Examples

>>> import pandas as pd

>>> login = pd.to_datetime(
...     pd.Series(
...         ["2024-05-13T12:05:36", None, "2024-05-15T13:46:02"], name="login")
... )
>>> login
0   2024-05-13 12:05:36
1                   NaT
2   2024-05-15 13:46:02
Name: login, dtype: datetime64[...]
>>> from skrub import DatetimeEncoder

>>> DatetimeEncoder().fit_transform(login)
   login_year  login_month  login_day  login_hour  login_total_seconds
0      2024.0          5.0       13.0        12.0         1.715602e+09
1         NaN          NaN        NaN         NaN                  NaN
2      2024.0          5.0       15.0        13.0         1.715781e+09

We can ask for a finer resolution:

>>> DatetimeEncoder(resolution='second', add_total_seconds=False).fit_transform(
...     login
... )
   login_year  login_month  login_day  login_hour  login_minute  login_second
0      2024.0          5.0       13.0        12.0           5.0          36.0
1         NaN          NaN        NaN         NaN           NaN           NaN
2      2024.0          5.0       15.0        13.0          46.0           2.0

We can also ask for the day of the week. The week starts at 1 on Monday and ends at 7 on Sunday. This is consistent with the ISO week date system (https://en.wikipedia.org/wiki/ISO_week_date), the standard library datetime.isoweekday() and polars weekday, but not with pandas day_of_week, which counts days from 0.

>>> login.dt.strftime('%A = %w')
0       Monday = 1
1              NaN
2    Wednesday = 3
Name: login, dtype: object
>>> login.dt.day_of_week
0    0.0
1    NaN
2    2.0
Name: login, dtype: float64
>>> DatetimeEncoder(add_weekday=True, add_total_seconds=False).fit_transform(login)
   login_year  login_month  login_day  login_hour  login_weekday
0      2024.0          5.0       13.0        12.0            1.0
1         NaN          NaN        NaN         NaN            NaN
2      2024.0          5.0       15.0        13.0            3.0

When a column contains only dates without time information, the time features are discarded, regardless of resolution.

>>> birthday = pd.to_datetime(
...     pd.Series(['2024-04-14', '2024-05-15'], name='birthday')
... )
>>> encoder = DatetimeEncoder(resolution='second')
>>> encoder.fit_transform(birthday)
   birthday_year  birthday_month  birthday_day  birthday_total_seconds
0         2024.0             4.0          14.0            1.713053e+09
1         2024.0             5.0          15.0            1.715731e+09
>>> encoder.extracted_features_
['year', 'month', 'day', 'total_seconds']
>>> encoder.all_outputs_
['birthday_year', 'birthday_month', 'birthday_day', 'birthday_total_seconds']

(The number of seconds since Epoch can still be extracted but not “hour”, “minute”, etc.)

Non-datetime columns are rejected by raising a RejectColumn exception.

>>> s = pd.Series(['2024-04-14', '2024-05-15'], name='birthday')
>>> s
0    2024-04-14
1    2024-05-15
Name: birthday, dtype: object
>>> DatetimeEncoder().fit_transform(s)
Traceback (most recent call last):
    ...
skrub._apply_to_cols.RejectColumn: Column 'birthday' does not have Date or Datetime dtype.

ToDatetime: can be used for converting strings to datetimes.

>>> from skrub import ToDatetime
>>> from sklearn.pipeline import make_pipeline
>>> make_pipeline(ToDatetime(), DatetimeEncoder()).fit_transform(s)
   birthday_year  birthday_month  birthday_day  birthday_total_seconds
0         2024.0             4.0          14.0            1.713053e+09
1         2024.0             5.0          15.0            1.715731e+09

Time zones

If the input column has a time zone, the extracted features are in this timezone.

>>> login = pd.to_datetime(
...     pd.Series(
...         ["2024-05-13T12:05:36", None, "2024-05-15T13:46:02"], name="login")
... ).dt.tz_localize('Europe/Paris')
>>> encoder = DatetimeEncoder()
>>> encoder.fit_transform(login)['login_hour']
0    12.0
1     NaN
2    13.0
Name: login_hour, dtype: float32

No special care is taken to convert inputs to transform to the same time zone as the column the encoder was fitted on. The features are always in the time zone of the input.

>>> login_sp = login.dt.tz_convert('America/Sao_Paulo')
>>> login_sp
0   2024-05-13 07:05:36-03:00
1                         NaT
2   2024-05-15 08:46:02-03:00
Name: login, dtype: datetime64[..., America/Sao_Paulo]
>>> encoder.transform(login_sp)['login_hour']
0    7.0
1    NaN
2    8.0
Name: login_hour, dtype: float32

To ensure datetime columns are in a consistent timezones, use ToDatetime.

>>> encoder = make_pipeline(ToDatetime(), DatetimeEncoder())
>>> encoder.fit_transform(login)['login_hour']
0    12.0
1     NaN
2    13.0
Name: login_hour, dtype: float32
>>> encoder.transform(login_sp)['login_hour']
0    12.0
1     NaN
2    13.0
Name: login_hour, dtype: float32

Here we can see the input to transform has been converted back to the timezone used during fit and that we get the same result for “hour”.

The DatetimeEncoder can also create new features based on either trigonometric functions or splines by setting periodic_encoder="circular" or periodic_encoder="spline" respectively. (https://scikit-learn.org/stable/auto_examples/applications/plot_cyclical_feature_engineering.html).

>>> encoder = make_pipeline(ToDatetime(), DatetimeEncoder(periodic_encoding="circular"))
>>> encoder.fit_transform(login)
   login_year  ...  login_hour_circular_1
0      2024.0  ...              -1.000000
1         NaN  ...                    NaN
2      2024.0  ...              -0.965926

Added features can be explored using DatetimeEncoder.all_outputs_:

>>> encoder[-1].all_outputs_
['login_year', 'login_total_seconds', 'login_month_circular_0', 'login_month_circular_1',
    'login_day_circular_0', 'login_day_circular_1', 'login_hour_circular_0', 'login_hour_circular_1']

Methods

`fit`(column[, y])	Fit the transformer.
`fit_transform`(column[, y])	Fit the encoder and transform a column.
`get_feature_names_out`([input_features])	Return a list of features generated by the transformer.
`get_params`([deep])	Get parameters for this estimator.
`set_params`(**params)	Set the parameters of this estimator.
`set_transform_request`(*[, column])	Request metadata passed to the `transform` method.
`transform`(column)	Transform a column.

fit(column, y=None)[source]#

Fit the transformer.

Subclasses should implement fit_transform and transform.

Parameters:

columna pandas or polars Series: Unlike most scikit-learn transformers, single-column transformers transform a single column, not a whole dataframe.
ycolumn or dataframe: Prediction targets.

Returns:

self: The fitted transformer.

fit_transform(column, y=None)[source]#

Fit the encoder and transform a column.

Parameters:

columnpandas or polars Series with dtype Date or Datetime: The input to transform.
yNone: Ignored.

Returns:

transformedDataFrame: The extracted features.

get_feature_names_out(input_features=None)[source]#

Return a list of features generated by the transformer.

Each feature has format {input_name}_{n_component} where input_name is the name of the input column, or a default name for the encoder, and n_component is the idx of the specific feature.

Parameters:

input_featuresNone: The input features. Ignored, only here for compatibility.

Returns:

list of str: The list of feature names.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:

deepbool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

paramsdict: Parameter names mapped to their values.

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**paramsdict: Estimator parameters.

Returns:

selfestimator instance: Estimator instance.

set_transform_request(*, column='$UNCHANGED$')[source]#

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

columnstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for column parameter in transform.

Returns:

selfobject: The updated object.

transform(column)[source]#

Transform a column.

Parameters:

columnpandas or polars Series with dtype Date or Datetime: The input to transform.

Returns:

transformedDataFrame: The extracted features.

Gallery examples#

Encoding: from a dataframe to a numerical matrix for machine learning

Handling datetime features with the DatetimeEncoder

SquashingScaler: Robust numerical preprocessing for neural networks

DatetimeEncoder#

Gallery examples#

This Page