ToFloat

class skrub.ToFloat

Convert a column to 32-bit floating-point numbers.

Note

ToFloat is a type of single-column transformer. Unlike most scikit-learn estimators, its fit, transform and fit_transform methods expect a single column (a pandas or polars Series) rather than a full dataframe. To apply this transformer to one or more columns in a dataframe, use it as a parameter in a skrub.ApplyToCols or a skrub.TableVectorizer.

To apply to all columns:

    ApplyToCols(ToFloat())

To apply to selected columns:

    ApplyToCols(ToFloat(), cols=['col_name_1', 'col_name_2'])
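For instance, a minimal sketch of the ApplyToCols usage (the column names 'price' and 'qty' are invented for illustration, and the output shown assumes ApplyToCols returns a pandas dataframe with the selected columns converted to float32, as described on this page):

>>> import pandas as pd
>>> from skrub import ApplyToCols
>>> from skrub._to_float import ToFloat
>>> df = pd.DataFrame({'price': ['1.5', '2.0'], 'qty': ['3', None]})
>>> converted = ApplyToCols(ToFloat(), cols=['price', 'qty']).fit_transform(df)
>>> list(converted.dtypes)
[dtype('float32'), dtype('float32')]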

No conversion is attempted if the column has a datetime or categorical dtype; a RejectColumn exception is raised.

Otherwise, we attempt to convert the column to float32. If the conversion fails the column is rejected (a RejectColumn exception is raised).

For pandas, the output always has the numpy dtype np.float32, never a nullable extension dtype such as pd.Float32Dtype or pd.Float64Dtype. We do this conversion because most scikit-learn estimators cannot handle those extension dtypes correctly yet, especially in the presence of missing values (represented by pd.NA in such columns).

During transform, entries for which conversion fails are replaced by null values.

Examples

>>> import pandas as pd
>>> from skrub._to_float import ToFloat

A column that does not contain floats is converted if possible:

>>> s = pd.Series(['1.1', None, '3.3'], name='x')
>>> s
0     1.1
1    ...
2     3.3
Name: x, dtype: ...
>>> s[0]
'1.1'
>>> to_float = ToFloat()
>>> float_s = to_float.fit_transform(s)
>>> float_s
0    1.1
1    NaN
2    3.3
Name: x, dtype: float32
>>> float_s[0]
np.float32(1.1)

Note that a column such as the one in the example above may easily occur as the output of CleanNullStrings.

A numeric column will also be converted to floats:

>>> s = pd.Series([1, 2, 3])
>>> s
0    1
1    2
2    3
dtype: int64
>>> to_float.fit_transform(s)
0    1.0
1    2.0
2    3.0
dtype: float32

Boolean columns are treated as numbers:

>>> s = pd.Series([True, False], name='b')
>>> s
0     True
1    False
Name: b, dtype: bool
>>> to_float.fit_transform(s)
0    1.0
1    0.0
Name: b, dtype: float32
>>> s = pd.Series([True, None], name='b', dtype='boolean')
>>> s
0    True
1    <NA>
Name: b, dtype: boolean
>>> to_float.fit_transform(s)
0    1.0
1    NaN
Name: b, dtype: float32
>>> s = pd.Series([True, None], name='b')
>>> s
0    True
1    None
Name: b, dtype: object
>>> to_float.fit_transform(s)
0    1.0
1    NaN
Name: b, dtype: float32

float64 columns are converted to float32:

>>> s = pd.Series([1.1, 2.2])
>>> s
0    1.1
1    2.2
dtype: float64
>>> to_float.fit_transform(s)
0    1.1
1    2.2
dtype: float32

The pandas extension dtypes Float64 and Float32 are cast to np.float32. We do this because most scikit-learn estimators cannot handle these extension dtypes correctly yet, especially in the presence of missing values (represented by pd.NA in such columns).

>>> s = pd.Series([1.1, 2.2, None], dtype='Float32')
>>> s
0     1.1
1     2.2
2    <NA>
dtype: Float32
>>> to_float.fit_transform(s)
0    1.1
1    2.2
2    NaN
dtype: float32

Notice that pd.NA has been replaced by np.nan.

Columns that cannot be cast to numbers are rejected:

>>> s = pd.Series(['1.1', '2.2', 'hello'], name='x')
>>> to_float.fit_transform(s)
Traceback (most recent call last):
    ...
skrub._apply_to_cols.RejectColumn: Could not convert column 'x' to numbers.

Once a column has been accepted, all calls to transform will result in the same output dtype. Values that fail to be converted become null values.

>>> to_float = ToFloat().fit(pd.Series([1, 2]))
>>> to_float.transform(pd.Series(['3.3', 'hello']))
0    3.3
1    NaN
dtype: float32

Categorical and datetime columns are always rejected:

>>> s = pd.Series(['1.1', '2.2'], dtype='category', name='s')
>>> s
0    1.1
1    2.2
Name: s, dtype: category
Categories (2, ...): ['1.1', '2.2']
>>> to_float.fit_transform(s)
Traceback (most recent call last):
    ...
skrub._apply_to_cols.RejectColumn: Refusing to cast column 's' with dtype 'category' to numbers.
>>> to_float.fit_transform(pd.to_datetime(pd.Series(['2024-05-13'], name='s')))
Traceback (most recent call last):
    ...
skrub._apply_to_cols.RejectColumn: Refusing to cast column 's' with dtype 'datetime64[...]' to numbers.

float32 columns are passed through:

>>> s = pd.Series([1.1, None], dtype='float32')
>>> to_float.fit_transform(s) is s
True
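ToFloat also accepts polars Series (see the method signatures below). A minimal sketch, assuming polars is installed; based on the class description the output dtype would be expected to be polars Float32, so the doctest is marked as skipped rather than asserted:

>>> import polars as pl                           # doctest: +SKIP
>>> s = pl.Series('x', ['1.1', None, '3.3'])      # doctest: +SKIP
>>> ToFloat().fit_transform(s).dtype              # doctest: +SKIP
Float32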

Methods

fit(column[, y])
    Fit the transformer.

fit_transform(column[, y])
    Fit the encoder and transform a column.

get_feature_names_out([input_features])
    Get the output feature names.

get_params([deep])
    Get parameters for this estimator.

set_params(**params)
    Set the parameters of this estimator.

set_transform_request(*[, column])
    Configure whether metadata should be requested to be passed to the transform method.

transform(column)
    Transform a column.

fit(column, y=None, **kwargs)

Fit the transformer.

This default implementation simply calls fit_transform() and returns self.

Subclasses should implement fit_transform and transform.

Parameters:
column : a pandas or polars Series

Unlike most scikit-learn transformers, single-column transformers transform a single column, not a whole dataframe.

y : column or dataframe

Prediction targets.

**kwargs

Extra named arguments are passed to self.fit_transform().

Returns:
self

The fitted transformer.

fit_transform(column, y=None)

Fit the encoder and transform a column.

Parameters:
column : pandas or polars Series

The input to transform.

y : None

Ignored.

Returns:
transformed : pandas or polars Series

The input transformed to Float32.

get_feature_names_out(input_features=None)

Get the output feature names.

Parameters:
input_features : array_like of str, default=None

Input feature names. Ignored.

Returns:
all_outputs_

The names of the output features.
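For instance, a sketch of the expected behavior, assuming that for a single-column transformer the output feature name is simply the name of the fitted column and that all_outputs_ is a plain list:

>>> ToFloat().fit(pd.Series([1.0, 2.0], name='x')).get_feature_names_out()
['x']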

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.
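ToFloat itself exposes no constructor parameters, so a short sketch of what these methods would plausibly return; the nested <component>__<parameter> form is mostly useful on wrappers such as skrub.ApplyToCols, whose cols parameter appears in the note at the top of this page:

>>> ToFloat().get_params()
{}
>>> from skrub import ApplyToCols
>>> wrapper = ApplyToCols(ToFloat(), cols=['x'])
>>> wrapper.set_params(cols=['x', 'y']).get_params()['cols']
['x', 'y']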

set_transform_request(*, column='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
column : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for column parameter in transform.

Returns:
self : object

The updated object.

transform(column)

Transform a column.

Parameters:
column : pandas or polars Series

The input to transform.

Returns:
transformed : pandas or polars Series

The input transformed to Float32.