ToDatetime
Usage examples at the bottom of this page.
- class skrub.ToDatetime(format=None)
Parse datetimes represented as strings and return Datetime columns.
Note
ToDatetime is a type of single-column transformer. Unlike most scikit-learn estimators, its fit, transform and fit_transform methods expect a single column (a pandas or polars Series) rather than a full dataframe. To apply this transformer to one or more columns in a dataframe, use it as a parameter in a skrub.TableVectorizer or sklearn.compose.ColumnTransformer. In the ColumnTransformer, pass a single column: make_column_transformer((ToDatetime(), 'col_name_1'), (ToDatetime(), 'col_name_2')) instead of make_column_transformer((ToDatetime(), ['col_name_1', 'col_name_2'])).
An input column is converted to a column with dtype Datetime if possible, and rejected by raising a RejectColumn exception otherwise. Only Date, Datetime, String, and pandas object columns are handled; other dtypes are rejected with RejectColumn.
Once a column is accepted, outputs of transform always have the same Datetime dtype (including resolution and time zone). Once the transformer is fitted, entries that fail to be converted during subsequent calls to transform are replaced with nulls.
- Parameters:
- format str or None, optional, default=None
Format to use for parsing dates that are stored as strings, e.g. "%Y-%m-%dT%H:%M:%S". If not specified, the format is inferred from the data when possible. When doing so, for dates presented as 01/02/2003, it is usually possible to infer from the data whether the month comes first (USA convention) or the day comes first, i.e. "%m/%d/%Y" vs "%d/%m/%Y". In the unlikely event that all the sampled dates fall before the 13th day of the month and both conventions are plausible, the USA convention (month first) is chosen.
- Attributes:
- format_ str or None
Detected format. If the transformer was fitted on a column that already had a Datetime dtype, format_ is None. Otherwise it is the format that was detected when parsing the string column. If the parameter format was provided, it is the only one that the transformer attempts to use, so in that case format_ is either None or equal to format.
- output_dtype_ data type
The output dtype, which includes information about the time resolution and time zone.
- output_time_zone_ str or None
The time zone of the transformed column. If the output is time zone naive it is None; otherwise it is the name of the time zone, such as UTC or Europe/Paris.
Examples
>>> import pandas as pd
>>> s = pd.Series(["2024-05-05T13:17:52", None, "2024-05-07T13:17:52"], name="when")
>>> s
0    2024-05-05T13:17:52
1                   None
2    2024-05-07T13:17:52
Name: when, dtype: object
>>> from skrub._to_datetime import ToDatetime
>>> to_dt = ToDatetime()
>>> to_dt.fit_transform(s)
0   2024-05-05 13:17:52
1                   NaT
2   2024-05-07 13:17:52
Name: when, dtype: datetime64[ns]
The attributes format_, output_dtype_, output_time_zone_ record information about the conversion result.
>>> to_dt.format_
'%Y-%m-%dT%H:%M:%S'
>>> to_dt.output_dtype_
dtype('<M8[ns]')
>>> to_dt.output_time_zone_ is None
True
If we provide the datetime format, it is used and columns that do not conform to it are rejected.
>>> ToDatetime(format="%Y-%m-%dT%H:%M:%S").fit_transform(s)
0   2024-05-05 13:17:52
1                   NaT
2   2024-05-07 13:17:52
Name: when, dtype: datetime64[ns]
>>> ToDatetime(format="%d/%m/%Y").fit_transform(s)
Traceback (most recent call last):
    ...
skrub._on_each_column.RejectColumn: Failed to convert column 'when' to datetimes using the format '%d/%m/%Y'.
Columns that already have Datetime dtype are not modified (but they are accepted); for those columns the provided format, if any, is ignored.
>>> s = pd.to_datetime(s).dt.tz_localize("Europe/Paris")
>>> s
0   2024-05-05 13:17:52+02:00
1                         NaT
2   2024-05-07 13:17:52+02:00
Name: when, dtype: datetime64[ns, Europe/Paris]
>>> to_dt.fit_transform(s) is s
True
In that case the format_ is None.
>>> to_dt.format_ is None
True
>>> to_dt.output_dtype_
datetime64[ns, Europe/Paris]
>>> to_dt.output_time_zone_
'Europe/Paris'
Columns that have a different dtype than strings, pandas objects, or datetimes are rejected.
>>> s = pd.Series([2020, 2021, 2022], name="year")
>>> to_dt.fit_transform(s)
Traceback (most recent call last):
    ...
skrub._on_each_column.RejectColumn: Column 'year' does not contain strings.
String columns that do not appear to contain datetimes or for some other reason fail to be converted are also rejected.
>>> s = pd.Series(["2024-05-07T13:36:27", "yesterday"], name="when")
>>> to_dt.fit_transform(s)
Traceback (most recent call last):
    ...
skrub._on_each_column.RejectColumn: Could not find a datetime format for column 'when'.
Once ToDatetime has been successfully fitted, transform will always try to parse datetimes with the same format and output the same dtype. Entries that fail to be converted result in a null value:
>>> s = pd.Series(["2024-05-05T13:17:52", None, "2024-05-07T13:17:52"], name="when")
>>> to_dt = ToDatetime().fit(s)
>>> to_dt.transform(s)
0   2024-05-05 13:17:52
1                   NaT
2   2024-05-07 13:17:52
Name: when, dtype: datetime64[ns]
>>> s = pd.Series(["05/05/2024", None, "07/05/2024"], name="when")
>>> to_dt.transform(s)
0   NaT
1   NaT
2   NaT
Name: when, dtype: datetime64[ns]
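The fit-then-transform behaviour above can be pictured with the standard library alone. This is a minimal sketch, not skrub's implementation: parse_with_fixed_format is a hypothetical helper, and None stands in for the null values (NaT) that transform produces.

```python
from datetime import datetime


def parse_with_fixed_format(values, fmt):
    """Parse each value with fmt; missing or unparseable entries become None.

    Hypothetical helper for illustration only (not part of skrub).
    """
    parsed = []
    for value in values:
        try:
            parsed.append(datetime.strptime(value, fmt))
        except (TypeError, ValueError):
            # TypeError: the entry is not a string (e.g. None).
            # ValueError: the string does not match the fixed format.
            parsed.append(None)
    return parsed


# The format is decided once (as during fit) and reused afterwards.
fmt = "%Y-%m-%dT%H:%M:%S"
matching = parse_with_fixed_format(["2024-05-05T13:17:52", None], fmt)
mismatched = parse_with_fixed_format(["05/05/2024", None, "07/05/2024"], fmt)
print(matching)
print(mismatched)  # [None, None, None]
```

Keeping the format fixed is what guarantees a stable output type: a later column written in another convention is never silently reinterpreted, it simply yields nulls.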
Time zones
During fit, parsing strings that contain fixed offsets results in datetimes in UTC. Mixed offsets are supported and will all be converted to UTC.
>>> s = pd.Series(["2020-01-01T04:00:00+02:00", "2020-01-01T04:00:00+03:00"])
>>> to_dt.fit_transform(s)
0   2020-01-01 02:00:00+00:00
1   2020-01-01 01:00:00+00:00
dtype: datetime64[ns, UTC]
>>> to_dt.format_
'%Y-%m-%dT%H:%M:%S%z'
>>> to_dt.output_time_zone_
'UTC'
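For intuition, the same conversion of mixed fixed offsets to UTC can be reproduced with the standard library's datetime module. This is only a sketch of the arithmetic under the format shown above, not how ToDatetime is implemented.

```python
from datetime import datetime, timezone

# Same format string as the detected format_ in the example above.
FORMAT = "%Y-%m-%dT%H:%M:%S%z"

raw = ["2020-01-01T04:00:00+02:00", "2020-01-01T04:00:00+03:00"]

# Each string carries its own offset; astimezone() moves all of them to UTC.
as_utc = [datetime.strptime(value, FORMAT).astimezone(timezone.utc) for value in raw]

for dt in as_utc:
    print(dt.isoformat())
# 2020-01-01T02:00:00+00:00
# 2020-01-01T01:00:00+00:00
```

Converting everything to UTC is what lets a single column hold values that were written with different offsets.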
Strings with no timezone indication result in naive datetimes:
>>> s = pd.Series(["2020-01-01T04:00:00", "2020-01-01T04:00:00"])
>>> to_dt.fit_transform(s)
0   2020-01-01 04:00:00
1   2020-01-01 04:00:00
dtype: datetime64[ns]
>>> to_dt.output_time_zone_ is None
True
During transform, outputs are cast to the same dtype that was found during fit. This includes the timezone, which is converted if necessary.
>>> s_paris = pd.to_datetime(
...     pd.Series(["2024-05-07T14:24:49", "2024-05-06T14:24:49"])
... ).dt.tz_localize("Europe/Paris")
>>> s_paris
0   2024-05-07 14:24:49+02:00
1   2024-05-06 14:24:49+02:00
dtype: datetime64[ns, Europe/Paris]
>>> to_dt = ToDatetime().fit(s_paris)
>>> to_dt.output_dtype_
datetime64[ns, Europe/Paris]
Here our converter is set to output datetimes with nanosecond resolution, localized in “Europe/Paris”.
We may have a column in a different timezone:
>>> s_london = s_paris.dt.tz_convert("Europe/London")
>>> s_london
0   2024-05-07 13:24:49+01:00
1   2024-05-06 13:24:49+01:00
dtype: datetime64[ns, Europe/London]
Here the timezone is “Europe/London” and the times are offset by 1 hour. During transform datetimes will be converted to the original dtype and the “Europe/Paris” timezone:
>>> to_dt.transform(s_london)
0   2024-05-07 14:24:49+02:00
1   2024-05-06 14:24:49+02:00
dtype: datetime64[ns, Europe/Paris]
Moreover, we may have to transform a timezone-naive column whereas the transformer was fitted on a timezone-aware column. Note that this is somewhat of a corner case, unlikely to happen in practice if the inputs to fit and transform come from the same dataframe.
>>> s_naive = s_paris.dt.tz_convert(None)
>>> s_naive
0   2024-05-07 12:24:49
1   2024-05-06 12:24:49
dtype: datetime64[ns]
In this case, we make the arbitrary choice to assume that the timezone-naive datetimes are in UTC.
>>> to_dt.transform(s_naive)
0   2024-05-07 14:24:49+02:00
1   2024-05-06 14:24:49+02:00
dtype: datetime64[ns, Europe/Paris]
Conversely, a transformer fitted on a timezone-naive column can convert timezone-aware columns. Here also, we assume the naive datetimes were in UTC.
>>> to_dt = ToDatetime().fit(s_naive)
>>> to_dt.transform(s_london)
0   2024-05-07 12:24:49
1   2024-05-06 12:24:49
dtype: datetime64[ns]
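The two conventions above (naive inputs are treated as UTC, and aware inputs are converted to UTC before the time zone is dropped) can be sketched in plain Python. The helpers to_aware and to_naive are hypothetical names used for illustration, and fixed offsets stand in for Europe/Paris and Europe/London; that substitution is an assumption that only holds for these May dates (summer time).

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets replace the named zones so that the sketch does
# not need a time zone database. Valid for these May (summer time) dates only.
PARIS_SUMMER = timezone(timedelta(hours=2))
LONDON_SUMMER = timezone(timedelta(hours=1))


def to_aware(dt, target_tz):
    """Convert dt to target_tz, treating a naive dt as UTC."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(target_tz)


def to_naive(dt):
    """Return a naive datetime, expressing an aware dt in UTC first."""
    if dt.tzinfo is not None:
        dt = dt.astimezone(timezone.utc)
    return dt.replace(tzinfo=None)


naive = datetime(2024, 5, 7, 12, 24, 49)
london = datetime(2024, 5, 7, 13, 24, 49, tzinfo=LONDON_SUMMER)

print(to_aware(naive, PARIS_SUMMER).isoformat())   # 2024-05-07T14:24:49+02:00
print(to_aware(london, PARIS_SUMMER).isoformat())  # 2024-05-07T14:24:49+02:00
print(to_naive(london).isoformat())                # 2024-05-07T12:24:49
```

Both directions agree with the outputs shown above: the instant in time is preserved, and only its representation changes.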
%d/%m/%Y vs %m/%d/%Y
When parsing strings in one of the formats above, ToDatetime tries to guess if the month comes first (USA convention) or the day (rest of the world) from the data.
>>> s = pd.Series(["05/23/2024"])
>>> to_dt.fit_transform(s)
0   2024-05-23
dtype: datetime64[ns]
>>> to_dt.format_
'%m/%d/%Y'
Here we could infer '%m/%d/%Y' because there are not 23 months in a year. Similarly,
>>> s = pd.Series(["23/05/2024"])
>>> to_dt.fit_transform(s)
0   2024-05-23
dtype: datetime64[ns]
>>> to_dt.format_
'%d/%m/%Y'
In the case it cannot be inferred, the USA convention is used:
>>> s = pd.Series(["03/05/2024"])
>>> to_dt.fit_transform(s)
0   2024-03-05
dtype: datetime64[ns]
>>> to_dt.format_
'%m/%d/%Y'
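One rough way to picture this rule, using only the standard library: try the month-first format on every value and fall back to day-first if any value fails to parse. The function guess_day_month_format is a hypothetical helper written for this illustration; skrub's actual inference may differ (for instance, it may inspect only a sample of the column).

```python
from datetime import datetime


def guess_day_month_format(values):
    """Return "%m/%d/%Y" or "%d/%m/%Y", or None if neither fits all values.

    Hypothetical helper for illustration only (not part of skrub).
    """
    # Month-first is listed first, so it wins when both formats are plausible.
    candidates = ["%m/%d/%Y", "%d/%m/%Y"]
    for fmt in candidates:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # at least one value does not fit this format
        return fmt  # every value parsed with this format
    return None


print(guess_day_month_format(["05/23/2024"]))  # %m/%d/%Y
print(guess_day_month_format(["23/05/2024"]))  # %d/%m/%Y
print(guess_day_month_format(["03/05/2024"]))  # %m/%d/%Y
```

A single value with a day above 12 is enough to settle the question for the whole column, which is why ambiguity becomes rare as the data grows.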
If the days are randomly distributed and the fitting data is large enough, it is unlikely that all days would fall below 13, so the inferred format should usually be correct. To be sure, one can specify the format in the constructor.
Methods
- fit(column[, y]): Fit the transformer.
- fit_transform(column[, y]): Fit the transformer and transform a column.
- get_metadata_routing(): Get metadata routing of this object.
- get_params([deep]): Get parameters for this estimator.
- set_fit_request(*[, column]): Request metadata passed to the fit method.
- set_params(**params): Set the parameters of this estimator.
- set_transform_request(*[, column]): Request metadata passed to the transform method.
- transform(column): Transform a column.
- fit(column, y=None)
Fit the transformer.
Subclasses should implement fit_transform and transform.
- Parameters:
- column a pandas or polars Series
Unlike most scikit-learn transformers, single-column transformers transform a single column, not a whole dataframe.
- y column or dataframe
Prediction targets.
- Returns:
- self
The fitted transformer.
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
- routing MetadataRequest
A MetadataRequest encapsulating routing information.
- set_fit_request(*, column='$UNCHANGED$')
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params dict
Estimator parameters.
- Returns:
- self estimator instance
Estimator instance.
- set_transform_request(*, column='$UNCHANGED$')
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.