DropUninformative

class skrub.DropUninformative(drop_if_constant=False, drop_if_unique=False, drop_null_fraction=1.0)

Drop a column if it is found to be uninformative according to one or more criteria.

Note

DropUninformative is a type of single-column transformer. Unlike most scikit-learn estimators, its fit, transform and fit_transform methods expect a single column (a pandas or polars Series) rather than a full dataframe. To apply this transformer to one or more columns in a dataframe, use it as a parameter in a skrub.TableVectorizer or sklearn.compose.ColumnTransformer. In the ColumnTransformer, pass a single column: make_column_transformer((DropUninformative(), 'col_name_1'), (DropUninformative(), 'col_name_2')) instead of make_column_transformer((DropUninformative(), ['col_name_1', 'col_name_2'])).

Columns are considered “uninformative” if the fraction of missing values is larger than a threshold, if they contain only one unique value, or if all values are unique.

Parameters:

drop_if_constant (bool, default=False): If True, drop the column if it contains only one unique value. Missing values count as one additional distinct value.
drop_if_unique (bool, default=False): If True, drop the column if all values are distinct. Missing values count as one additional distinct value. Numeric columns are never dropped. Note that this may drop columns that contain free-form text.
drop_null_fraction (float or None, default=1.0): Drop the column if its fraction of missing values is larger than this threshold. With the default of 1.0, the column is dropped only if all its values are missing. If None, keep the column even if all its values are missing.

Notes

A column is considered to be “uninformative” if one or more of the following issues are found:

The fraction of missing values is larger than drop_null_fraction (by default, all values must be null for the column to be dropped).
The column includes only one unique value (the column is constant). Missing values are considered a separate value.
The number of unique values in the column is equal to the length of the column, i.e., all values are unique. This is only considered for non-numeric columns. Missing values are considered a separate value. Note that this may lead to dropping columns that contain free-flowing text.
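
The three criteria above can be restated in plain pandas. The function below is an illustrative sketch only, not skrub's implementation: the name `is_uninformative` is hypothetical, it follows the wording of the list literally (missing values counted as one extra distinct value), and it assumes that a threshold of 1.0 means "drop only when every value is null". Edge cases, such as unique columns that also contain nulls, may be handled differently by the library.

```python
import pandas as pd


def is_uninformative(column, drop_if_constant=False, drop_if_unique=False,
                     drop_null_fraction=1.0):
    """Illustrative restatement of the criteria; not skrub's implementation."""
    n = len(column)
    if n == 0:
        return False

    # Criterion 1: too many missing values.
    null_count = int(column.isna().sum())
    if drop_null_fraction is not None:
        # Assumption: an entirely null column is dropped even at threshold 1.0.
        if null_count == n or null_count / n > drop_null_fraction:
            return True

    # Missing values count as one additional distinct value (dropna=False).
    n_distinct = column.nunique(dropna=False)

    # Criterion 2: the column is constant.
    if drop_if_constant and n_distinct == 1:
        return True

    # Criterion 3: all values are distinct (non-numeric columns only).
    is_numeric = pd.api.types.is_numeric_dtype(column)
    if drop_if_unique and not is_numeric and n_distinct == n:
        return True

    return False


all_null = is_uninformative(pd.Series([None, None, None]))
constant = is_uninformative(pd.Series(["const"] * 3), drop_if_constant=True)
unique_text = is_uninformative(pd.Series(["A", "B", "C"]), drop_if_unique=True)
unique_numbers = is_uninformative(pd.Series([1, 2, 3]), drop_if_unique=True)
print(all_null, constant, unique_text, unique_numbers)
```

The last call shows why numeric columns are exempt from the uniqueness check: continuous measurements are often all distinct without being uninformative.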

Examples

>>> from skrub import DropUninformative
>>> import pandas as pd
>>> df = pd.DataFrame({"col1": [None, None, None]})

By default, only null columns are dropped:

>>> du = DropUninformative()
>>> du.fit_transform(df["col1"])
[]

It is also possible to drop constant columns, or specify a lower null fraction threshold:

>>> df = pd.DataFrame({"col1": [1, 2, None], "col2": ["const", "const", "const"]})
>>> du = DropUninformative(drop_if_constant=True, drop_null_fraction=0.1)
>>> du.fit_transform(df["col1"])
[]
>>> du.fit_transform(df["col2"])
[]

Finally, it is possible to set drop_if_unique to True in order to drop string columns that contain all distinct values:

>>> df = pd.DataFrame({"col1": ["A", "B", "C"]})
>>> du = DropUninformative(drop_if_unique=True)
>>> du.fit_transform(df["col1"])
[]

Methods

`fit`(column[, y])	Fit the transformer.
`fit_transform`(column[, y])	Fit the transformer and transform a column.
`get_feature_names_out`()	Return a list of features generated by the transformer.
`get_params`([deep])	Get parameters for this estimator.
`set_params`(**params)	Set the parameters of this estimator.
`set_transform_request`(*[, column])	Request metadata passed to the `transform` method.
`transform`(column)	Transform a column.

fit(column, y=None)

Fit the transformer.

Subclasses should implement fit_transform and transform.

Parameters:

column (pandas or polars Series): Unlike most scikit-learn transformers, single-column transformers transform a single column, not a whole dataframe.
y (column or dataframe): Prediction targets.

Returns:

self: The fitted transformer.

fit_transform(column, y=None)

Fit the transformer and transform a column.

Parameters:

column (pandas or polars Series): The input column to check.
y (None): Ignored.

Returns:

column: The input column, or an empty list if the column is chosen to be dropped.

get_feature_names_out()

Return a list of features generated by the transformer.

Each feature has the format {input_name}_{n_component}, where input_name is the name of the input column, or a default name for the encoder, and n_component is the index of the specific feature.

Returns:

list of str: The list of feature names.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True): If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict): Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict): Estimator parameters.

Returns:

self (estimator instance): Estimator instance.

set_transform_request(*, column='$UNCHANGED$')

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). See the User Guide for how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

column (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED): Metadata routing for the column parameter in transform.

Returns:

self (object): The updated object.

transform(column)

Transform a column.

Parameters:

column (pandas or polars Series): The input column to check.

Returns:

column: The input column, or an empty list if the column is chosen to be dropped.
