ToFloat
- class skrub.ToFloat
Convert a column to 32-bit floating-point numbers.
Note
ToFloat is a type of single-column transformer. Unlike most scikit-learn estimators, its fit, transform and fit_transform methods expect a single column (a pandas or polars Series) rather than a full dataframe. To apply this transformer to one or more columns in a dataframe, use it as a parameter in a skrub.ApplyToCols or a skrub.TableVectorizer.
To apply to all columns: ApplyToCols(ToFloat())
To apply to selected columns: ApplyToCols(ToFloat(), cols=['col_name_1', 'col_name_2'])

No conversion is attempted if the column has a datetime or categorical dtype; a RejectColumn exception is raised.
Otherwise, we attempt to convert the column to float32. If the conversion fails, the column is rejected (a RejectColumn exception is raised).
For pandas, the output is always np.float32, not the extension dtype pd.Float64Dtype. We do this conversion because most scikit-learn estimators cannot handle those dtypes correctly yet, especially in the presence of missing values (represented by pd.NA in such columns).
During transform, entries for which conversion fails are replaced by null values.

Examples
>>> import pandas as pd
>>> from skrub._to_float import ToFloat
A column that does not contain floats is converted if possible:
>>> s = pd.Series(['1.1', None, '3.3'], name='x')
>>> s
0    1.1
1    ...
2    3.3
Name: x, dtype: ...
>>> s[0]
'1.1'
>>> to_float = ToFloat()
>>> float_s = to_float.fit_transform(s)
>>> float_s
0    1.1
1    NaN
2    3.3
Name: x, dtype: float32
>>> float_s[0]
np.float32(1.1)
Note that a column such as the example above may easily occur as the output of CleanNullStrings.
A numeric column will also be converted to floats:
>>> s = pd.Series([1, 2, 3])
>>> s
0    1
1    2
2    3
dtype: int64
>>> to_float.fit_transform(s)
0    1.0
1    2.0
2    3.0
dtype: float32
Boolean columns are treated as numbers:
>>> s = pd.Series([True, False], name='b')
>>> s
0     True
1    False
Name: b, dtype: bool
>>> to_float.fit_transform(s)
0    1.0
1    0.0
Name: b, dtype: float32
>>> s = pd.Series([True, None], name='b', dtype='boolean')
>>> s
0    True
1    <NA>
Name: b, dtype: boolean
>>> to_float.fit_transform(s)
0    1.0
1    NaN
Name: b, dtype: float32
>>> s = pd.Series([True, None], name='b')
>>> s
0    True
1    None
Name: b, dtype: object
>>> to_float.fit_transform(s)
0    1.0
1    NaN
Name: b, dtype: float32
float64 columns are converted to float32:
>>> s = pd.Series([1.1, 2.2])
>>> s
0    1.1
1    2.2
dtype: float64
>>> to_float.fit_transform(s)
0    1.1
1    2.2
dtype: float32
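Narrowing float64 to float32 keeps the printed values looking the same but drops precision. The following is a minimal NumPy-only sketch of that effect; it does not call ToFloat, and the variable names are purely illustrative.

```python
import numpy as np

x64 = np.float64(1.1)
x32 = np.float32(x64)  # narrowing cast of a single value, as an illustration

# The comparison promotes x32 back to float64, exposing the rounding error.
exact_match = bool(x32 == x64)
close_match = bool(np.isclose(x32, x64))

# Approximate number of decimal digits a float32 can represent.
digits = np.finfo(np.float32).precision

# 2**24 + 1 has no float32 representation and rounds to 2**24.
big = np.float32(16_777_217)

print(exact_match, close_match, digits, int(big))
# False True 6 16777216
```

If downstream code compares converted values with float64 constants, a tolerance-based comparison such as np.isclose is likely the safer choice.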
Float64Dtype and Float32Dtype are cast to np.float32. We do this because most scikit-learn estimators cannot handle pd.Float32Dtype correctly yet, especially in the presence of missing values (represented by pd.NA in such columns).
>>> s = pd.Series([1.1, 2.2, None], dtype='Float32')
>>> s
0     1.1
1     2.2
2    <NA>
dtype: Float32
>>> to_float.fit_transform(s)
0    1.1
1    2.2
2    NaN
dtype: float32
Notice that pd.NA has been replaced by np.nan.
Columns that cannot be cast to numbers are rejected:
>>> s = pd.Series(['1.1', '2.2', 'hello'], name='x')
>>> to_float.fit_transform(s)
Traceback (most recent call last):
    ...
skrub._apply_to_cols.RejectColumn: Could not convert column 'x' to numbers.
Once a column has been accepted, all calls to transform will result in the same output dtype. Values that fail to be converted become null values.
>>> to_float = ToFloat().fit(pd.Series([1, 2]))
>>> to_float.transform(pd.Series(['3.3', 'hello']))
0    3.3
1    NaN
dtype: float32
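The contrast between fitting (an unparsable value rejects the column) and transforming (an unparsable value becomes null) can be approximated with plain pandas. The sketch below is only an analogy built on pd.to_numeric; the helper names strict_to_float32 and lenient_to_float32 are hypothetical, and this is not how ToFloat is implemented.

```python
import pandas as pd


def strict_to_float32(column):
    """Fit-like behavior (assumed analogy): raise if any value is not numeric."""
    return pd.to_numeric(column, errors="raise").astype("float32")


def lenient_to_float32(column):
    """Transform-like behavior (assumed analogy): bad values become NaN."""
    return pd.to_numeric(column, errors="coerce").astype("float32")


bad = pd.Series(["3.3", "hello"], dtype=object)

try:
    strict_to_float32(bad)
    rejected = False
except (ValueError, TypeError):
    rejected = True

out = lenient_to_float32(bad)

print(rejected)             # True
print(out.dtype)            # float32
print(out.isna().tolist())  # [False, True]
```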
Categorical and datetime columns are always rejected:
>>> s = pd.Series(['1.1', '2.2'], dtype='category', name='s')
>>> s
0    1.1
1    2.2
Name: s, dtype: category
Categories (2, ...): ['1.1', '2.2']
>>> to_float.fit_transform(s)
Traceback (most recent call last):
    ...
skrub._apply_to_cols.RejectColumn: Refusing to cast column 's' with dtype 'category' to numbers.
>>> to_float.fit_transform(pd.to_datetime(pd.Series(['2024-05-13'], name='s')))
Traceback (most recent call last):
    ...
skrub._apply_to_cols.RejectColumn: Refusing to cast column 's' with dtype 'datetime64[...]' to numbers.
float32 columns are passed through:
>>> s = pd.Series([1.1, None], dtype='float32')
>>> to_float.fit_transform(s) is s
True
Methods
fit(column[, y]): Fit the transformer.
fit_transform(column[, y]): Fit the encoder and transform a column.
get_feature_names_out([input_features]): Get the output feature names.
get_params([deep]): Get parameters for this estimator.
set_params(**params): Set the parameters of this estimator.
set_transform_request(*[, column]): Configure whether metadata should be requested to be passed to the transform method.
transform(column): Transform a column.
- fit(column, y=None, **kwargs)
Fit the transformer.
This default implementation simply calls fit_transform() and returns self.
Subclasses should implement fit_transform and transform.
- Parameters:
- column : a pandas or polars Series
Unlike most scikit-learn transformers, single-column transformers transform a single column, not a whole dataframe.
- y : column or dataframe
Prediction targets.
- **kwargs
Extra named arguments are passed to self.fit_transform().
- Returns:
- self
The fitted transformer.
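The delegation described for fit (call fit_transform, discard the result, return self) can be sketched in plain Python. The classes below are hypothetical stand-ins rather than skrub's actual base class, and they use plain lists in place of a Series to stay dependency-free.

```python
class SingleColumnSketch:
    """Hypothetical stand-in for a single-column transformer base class."""

    def fit(self, column, y=None, **kwargs):
        # Default fit: delegate to fit_transform, discard the output, return self.
        self.fit_transform(column, y=y, **kwargs)
        return self


class ToUpperSketch(SingleColumnSketch):
    """Hypothetical subclass that implements only fit_transform and transform."""

    def fit_transform(self, column, y=None):
        self.n_seen_ = len(column)  # fitted state is set here, not in fit
        return self.transform(column)

    def transform(self, column):
        return [value.upper() for value in column]


t = ToUpperSketch()
returned = t.fit(["a", "b"])

print(returned is t, t.n_seen_, t.transform(["c"]))
# True 2 ['C']
```

With this layout a subclass writes its fitting logic once, in fit_transform, and fit comes for free.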
- get_feature_names_out(input_features=None)
Get the output feature names.
- Parameters:
- input_features : array_like of str, default=None
Input feature names. Ignored.
- Returns:
- all_outputs_
The names of the output features.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
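The <component>__<parameter> form is easiest to see on a nested object. Below is a small sketch using a scikit-learn Pipeline; the step names "scale" and "clf" are arbitrary choices for this illustration and are unrelated to ToFloat.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# <component>__<parameter>: step name, double underscore, parameter name.
returned = pipe.set_params(scale__with_mean=False, clf__C=0.5)

print(returned is pipe)
print(pipe.named_steps["scale"].with_mean, pipe.named_steps["clf"].C)
# True
# False 0.5
```

Because set_params returns the estimator itself, calls can be chained or used inline when building a larger pipeline.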
- set_transform_request(*, column='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the transform method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.