skrub.ToDatetime

Usage examples at the bottom of this page.

class skrub.ToDatetime(format=None)

Parse datetimes represented as strings and return Datetime columns.

Note

ToDatetime is a type of single-column transformer. Unlike most scikit-learn estimators, its fit, transform and fit_transform methods expect a single column (a pandas or polars Series) rather than a full dataframe. To apply this transformer to one or more columns in a dataframe, use it as a parameter in a skrub.TableVectorizer or sklearn.compose.ColumnTransformer. In the ColumnTransformer, pass a single column: make_column_transformer((ToDatetime(), 'col_name_1'), (ToDatetime(), 'col_name_2')) instead of make_column_transformer((ToDatetime(), ['col_name_1', 'col_name_2'])).

An input column is converted to a column with dtype Datetime if possible, and rejected by raising a RejectColumn exception otherwise. Only Date, Datetime, String, and pandas object columns are handled; other dtypes are rejected with RejectColumn.

Once a column is accepted, outputs of transform always have the same Datetime dtype (including resolution and time zone). Once the transformer is fitted, entries that fail to be converted during subsequent calls to transform are replaced with nulls.

Parameters:
format : str or None, optional, default=None

Format to use for parsing dates that are stored as strings, e.g. "%Y-%m-%dT%H:%M:%S". If not specified, the format is inferred from the data when possible. For dates written like 01/02/2003, it is usually possible to infer from the data whether the month comes first (USA convention) or the day comes first, i.e. "%m/%d/%Y" vs "%d/%m/%Y". In the unlikely event that all the sampled dates fall before the 13th day of the month, so that both conventions are plausible, the USA convention (month first) is chosen.
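The format string uses the usual strptime directives. As a small illustration that relies only on Python's standard library (it does not call skrub), the snippet below parses one timestamp with an explicit format and shows why a date such as 01/02/2003 is ambiguous: both conventions parse it, but to different days.

```python
from datetime import datetime

# Standard strptime directives: %Y year, %m month, %d day,
# %H hour, %M minute, %S second.
fmt = "%Y-%m-%dT%H:%M:%S"
parsed = datetime.strptime("2024-05-05T13:17:52", fmt)

# The same string is valid under both conventions but denotes different dates.
month_first = datetime.strptime("01/02/2003", "%m/%d/%Y")  # January 2nd
day_first = datetime.strptime("01/02/2003", "%d/%m/%Y")  # February 1st
```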

Examples

>>> import pandas as pd
>>> s = pd.Series(["2024-05-05T13:17:52", None, "2024-05-07T13:17:52"], name="when")
>>> s
0    2024-05-05T13:17:52
1                   None
2    2024-05-07T13:17:52
Name: when, dtype: object
>>> from skrub import ToDatetime
>>> to_dt = ToDatetime()
>>> to_dt.fit_transform(s)
0   2024-05-05 13:17:52
1                   NaT
2   2024-05-07 13:17:52
Name: when, dtype: datetime64[ns]

The attributes format_, output_dtype_, output_time_zone_ record information about the conversion result.

>>> to_dt.format_
'%Y-%m-%dT%H:%M:%S'
>>> to_dt.output_dtype_
dtype('<M8[ns]')
>>> to_dt.output_time_zone_ is None
True

If we provide the datetime format, it is used and columns that do not conform to it are rejected.

>>> ToDatetime(format="%Y-%m-%dT%H:%M:%S").fit_transform(s)
0   2024-05-05 13:17:52
1                   NaT
2   2024-05-07 13:17:52
Name: when, dtype: datetime64[ns]
>>> ToDatetime(format="%d/%m/%Y").fit_transform(s)
Traceback (most recent call last):
    ...
skrub._on_each_column.RejectColumn: Failed to convert column 'when' to datetimes using the format '%d/%m/%Y'.

Columns that already have Datetime dtype are not modified (but they are accepted); for those columns the provided format, if any, is ignored.

>>> s = pd.to_datetime(s).dt.tz_localize("Europe/Paris")
>>> s
0   2024-05-05 13:17:52+02:00
1                         NaT
2   2024-05-07 13:17:52+02:00
Name: when, dtype: datetime64[ns, Europe/Paris]
>>> to_dt.fit_transform(s) is s
True

In that case the format_ is None.

>>> to_dt.format_ is None
True
>>> to_dt.output_dtype_
datetime64[ns, Europe/Paris]
>>> to_dt.output_time_zone_
'Europe/Paris'

Columns that have a different dtype than strings, pandas objects, or datetimes are rejected.

>>> s = pd.Series([2020, 2021, 2022], name="year")
>>> to_dt.fit_transform(s)
Traceback (most recent call last):
    ...
skrub._on_each_column.RejectColumn: Column 'year' does not contain strings.

String columns that do not appear to contain datetimes or for some other reason fail to be converted are also rejected.

>>> s = pd.Series(["2024-05-07T13:36:27", "yesterday"], name="when")
>>> to_dt.fit_transform(s)
Traceback (most recent call last):
    ...
skrub._on_each_column.RejectColumn: Could not find a datetime format for column 'when'.

Once ToDatetime has been successfully fitted, transform always tries to parse datetimes with the same format and outputs the same dtype. Entries that fail to be converted result in a null value:

>>> s = pd.Series(["2024-05-05T13:17:52", None, "2024-05-07T13:17:52"], name="when")
>>> to_dt = ToDatetime().fit(s)
>>> to_dt.transform(s)
0   2024-05-05 13:17:52
1                   NaT
2   2024-05-07 13:17:52
Name: when, dtype: datetime64[ns]
>>> s = pd.Series(["05/05/2024", None, "07/05/2024"], name="when")
>>> to_dt.transform(s)
0   NaT
1   NaT
2   NaT
Name: when, dtype: datetime64[ns]
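The difference between fit (a non-conforming column is rejected) and transform (non-conforming entries become nulls) can be sketched with the standard library alone. The class below, FixedFormatParser, is a hypothetical stand-in written for this illustration; it is not skrub's implementation and it raises a plain ValueError where skrub raises RejectColumn.

```python
from datetime import datetime


class FixedFormatParser:
    """Hypothetical stand-in for the fit/transform contract described above."""

    def __init__(self, fmt):
        self.fmt = fmt

    def _parse(self, value):
        # Nulls stay null; entries that do not match the format become None.
        if value is None:
            return None
        try:
            return datetime.strptime(value, self.fmt)
        except ValueError:
            return None

    def fit(self, values):
        # At fit time, any non-null entry that fails to parse rejects the column.
        for value in values:
            if value is not None and self._parse(value) is None:
                raise ValueError(f"column does not match format {self.fmt!r}")
        return self

    def transform(self, values):
        # After fitting, failures are replaced with nulls instead of raising.
        return [self._parse(value) for value in values]


parser = FixedFormatParser("%Y-%m-%dT%H:%M:%S").fit(
    ["2024-05-05T13:17:52", None, "2024-05-07T13:17:52"]
)
same_format = parser.transform(["2024-05-05T13:17:52", None])
other_format = parser.transform(["05/05/2024", None, "07/05/2024"])
```

Here same_format keeps the parsed value and the null, while every entry of other_format is None because none of them matches the fitted format.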

Time zones

During fit, parsing strings that contain fixed offsets results in datetimes in UTC. Mixed offsets are supported and will all be converted to UTC.

>>> s = pd.Series(["2020-01-01T04:00:00+02:00", "2020-01-01T04:00:00+03:00"])
>>> to_dt.fit_transform(s)
0   2020-01-01 02:00:00+00:00
1   2020-01-01 01:00:00+00:00
dtype: datetime64[ns, UTC]
>>> to_dt.format_
'%Y-%m-%dT%H:%M:%S%z'
>>> to_dt.output_time_zone_
'UTC'
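For comparison, the same conversion of mixed fixed offsets to UTC can be reproduced with the standard library. This is only a sketch of the arithmetic involved, not of how skrub performs it.

```python
from datetime import datetime, timezone

fmt = "%Y-%m-%dT%H:%M:%S%z"
raw = ["2020-01-01T04:00:00+02:00", "2020-01-01T04:00:00+03:00"]

# Each string is parsed with its own fixed offset, then converted to UTC.
as_utc = [datetime.strptime(value, fmt).astimezone(timezone.utc) for value in raw]
iso = [dt.isoformat(sep=" ") for dt in as_utc]
```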

Strings with no timezone indication result in naive datetimes:

>>> s = pd.Series(["2020-01-01T04:00:00", "2020-01-01T04:00:00"])
>>> to_dt.fit_transform(s)
0   2020-01-01 04:00:00
1   2020-01-01 04:00:00
dtype: datetime64[ns]
>>> to_dt.output_time_zone_ is None
True

During transform, outputs are cast to the same dtype that was found during fit. This includes the timezone, which is converted if necessary.

>>> s_paris = pd.to_datetime(
...     pd.Series(["2024-05-07T14:24:49", "2024-05-06T14:24:49"])
... ).dt.tz_localize("Europe/Paris")
>>> s_paris
0   2024-05-07 14:24:49+02:00
1   2024-05-06 14:24:49+02:00
dtype: datetime64[ns, Europe/Paris]
>>> to_dt = ToDatetime().fit(s_paris)
>>> to_dt.output_dtype_
datetime64[ns, Europe/Paris]

Here our converter is set to output datetimes with nanosecond resolution, localized in “Europe/Paris”.

We may have a column in a different timezone:

>>> s_london = s_paris.dt.tz_convert("Europe/London")
>>> s_london
0   2024-05-07 13:24:49+01:00
1   2024-05-06 13:24:49+01:00
dtype: datetime64[ns, Europe/London]

Here the timezone is “Europe/London” and the times are offset by 1 hour. During transform datetimes will be converted to the original dtype and the “Europe/Paris” timezone:

>>> to_dt.transform(s_london)
0   2024-05-07 14:24:49+02:00
1   2024-05-06 14:24:49+02:00
dtype: datetime64[ns, Europe/Paris]

Moreover, we may have to transform a timezone-naive column whereas the transformer was fitted on a timezone-aware column. Note that this is something of a corner case, unlikely to happen in practice if the inputs to fit and transform come from the same dataframe.

>>> s_naive = s_paris.dt.tz_convert(None)
>>> s_naive
0   2024-05-07 12:24:49
1   2024-05-06 12:24:49
dtype: datetime64[ns]

In this case, we make the arbitrary choice to assume that the timezone-naive datetimes are in UTC.

>>> to_dt.transform(s_naive)
0   2024-05-07 14:24:49+02:00
1   2024-05-06 14:24:49+02:00
dtype: datetime64[ns, Europe/Paris]

Conversely, a transformer fitted on a timezone-naive column can convert timezone-aware columns. Here also, we assume the naive datetimes were in UTC.

>>> to_dt = ToDatetime().fit(s_naive)
>>> to_dt.transform(s_london)
0   2024-05-07 12:24:49
1   2024-05-06 12:24:49
dtype: datetime64[ns]
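The "naive means UTC" rule used in both corner cases amounts to the following standard-library arithmetic. This is a sketch under one stated assumption: a fixed +02:00 offset stands in for Europe/Paris, which is only valid for summer dates such as the ones above.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed +02:00 offset stands in for Europe/Paris in May (CEST).
paris_summer = timezone(timedelta(hours=2))

naive = datetime(2024, 5, 7, 12, 24, 49)

# Naive -> aware: treat the naive value as UTC, then convert to the target zone.
aware = naive.replace(tzinfo=timezone.utc).astimezone(paris_summer)

# Aware -> naive: convert to UTC, then drop the time zone information.
back_to_naive = aware.astimezone(timezone.utc).replace(tzinfo=None)
```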

``%d/%m/%Y`` vs ``%m/%d/%Y``

When parsing strings in one of the formats above, ToDatetime tries to guess if the month comes first (USA convention) or the day (rest of the world) from the data.

>>> s = pd.Series(["05/23/2024"])
>>> to_dt.fit_transform(s)
0   2024-05-23
dtype: datetime64[ns]
>>> to_dt.format_
'%m/%d/%Y'

Here we could infer '%m/%d/%Y' because there are not 23 months in a year. Similarly,

>>> s = pd.Series(["23/05/2024"])
>>> to_dt.fit_transform(s)
0   2024-05-23
dtype: datetime64[ns]
>>> to_dt.format_
'%d/%m/%Y'

When the convention cannot be inferred, the USA convention is used:

>>> s = pd.Series(["03/05/2024"])
>>> to_dt.fit_transform(s)
0   2024-03-05
dtype: datetime64[ns]
>>> to_dt.format_
'%m/%d/%Y'

If the days are randomly distributed and the fitting data is large enough, it is unlikely that every day would be 12 or lower, so the inferred format should usually be correct. To be sure, specify the format in the constructor.
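The tie-breaking rule can be sketched as "try month-first, then day-first, and keep the first format that parses every value". The helpers below (matches, guess_format, CANDIDATE_FORMATS) are hypothetical names introduced for this illustration and use only the standard library; skrub's actual inference may work differently.

```python
from datetime import datetime

# Candidate order encodes the tie-break: month-first is tried first.
CANDIDATE_FORMATS = ("%m/%d/%Y", "%d/%m/%Y")


def matches(values, fmt):
    """Return True if every non-null value parses with ``fmt``."""
    for value in values:
        if value is None:
            continue
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            return False
    return True


def guess_format(values):
    """Return the first candidate format that fits all values, else None."""
    for fmt in CANDIDATE_FORMATS:
        if matches(values, fmt):
            return fmt
    return None
```

With this sketch, a single value above 12 in the first position is enough to rule out month-first for the whole column, which is why larger samples make a wrong guess less likely.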

Attributes:
format_ : str or None

Detected format. If the transformer was fitted on a column that already had a Datetime dtype, format_ is None. Otherwise it is the format that was detected when parsing the string column. If the parameter format was provided, it is the only one that the transformer attempts to use, so in that case format_ is either None or equal to format.

output_dtype_ : data type

The output dtype, which includes information about the time resolution and time zone.

output_time_zone_ : str or None

The time zone of the transformed column. If the output is time zone naive it is None; otherwise it is the name of the time zone such as UTC or Europe/Paris.

Methods

fit(column[, y])

Fit the transformer.

fit_transform(column[, y])

Fit the encoder and transform a column.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_fit_request(*[, column])

Request metadata passed to the fit method.

set_params(**params)

Set the parameters of this estimator.

set_transform_request(*[, column])

Request metadata passed to the transform method.

transform(column)

Transform a column.

fit(column, y=None)

Fit the transformer.

Subclasses should implement fit_transform and transform.

Parameters:
column : a pandas or polars Series

Unlike most scikit-learn transformers, single-column transformers transform a single column, not a whole dataframe.

y : column or dataframe

Prediction targets.

Returns:
self

The fitted transformer.

fit_transform(column, y=None)

Fit the encoder and transform a column.

Parameters:
column : pandas or polars Series

The input to transform.

y : None

Ignored.

Returns:
transformed : pandas or polars Series

The input transformed to Datetime.

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_fit_request(*, column='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
column : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for column parameter in fit.

Returns:
self : object

The updated object.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

set_transform_request(*, column='$UNCHANGED$')

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
column : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for column parameter in transform.

Returns:
self : object

The updated object.

transform(column)

Transform a column.

Parameters:
column : pandas or polars Series

The input to transform.

Returns:
transformed : pandas or polars Series

The input transformed to Datetime.