DropUninformative

class skrub.DropUninformative(drop_if_constant=False, drop_if_unique=False, drop_null_fraction=1.0)

Drop a column if it is found to be uninformative according to one or more criteria.

Note

DropUninformative is a type of single-column transformer. Unlike most scikit-learn estimators, its fit, transform and fit_transform methods expect a single column (a pandas or polars Series) rather than a full dataframe. To apply this transformer to one or more columns in a dataframe, use it as a parameter in a skrub.TableVectorizer or sklearn.compose.ColumnTransformer. In the ColumnTransformer, pass a single column: make_column_transformer((DropUninformative(), 'col_name_1'), (DropUninformative(), 'col_name_2')) instead of make_column_transformer((DropUninformative(), ['col_name_1', 'col_name_2'])).

A column is considered to be “uninformative” if one or more of the following issues are found:

  • The fraction of missing values is larger than the threshold set by drop_null_fraction (by default, all values must be null for the column to be dropped).

  • The column includes only one unique value (the column is constant). Missing values are considered a separate value.

  • The number of unique values in the column is equal to the length of the column, i.e., all values are unique. This is only considered for non-numeric columns. Missing values are considered a separate value. Note that this may lead to dropping columns that contain free-flowing text.
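As a rough illustration, the three criteria above can be approximated with plain pandas. This is a minimal sketch, not skrub's implementation: the helper name is_uninformative is hypothetical, and treating a threshold of 1.0 as "all values are missing" is an assumption based on the default behavior described above.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def is_uninformative(column, drop_if_constant=False, drop_if_unique=False,
                     drop_null_fraction=1.0):
    """Hypothetical helper: return True if `column` meets any enabled criterion."""
    n_rows = len(column)
    if n_rows == 0:
        return False

    # Criterion 1: fraction of missing values above the threshold.
    # Assumption: a threshold of 1.0 means "drop only if everything is missing".
    if drop_null_fraction is not None:
        null_fraction = float(column.isna().mean())
        if drop_null_fraction == 1.0:
            if null_fraction == 1.0:
                return True
        elif null_fraction > drop_null_fraction:
            return True

    # Missing values are counted as one additional distinct value.
    n_distinct = column.nunique(dropna=False)

    # Criterion 2: only one distinct value (constant column).
    if drop_if_constant and n_distinct == 1:
        return True

    # Criterion 3: all values distinct, checked for non-numeric columns only.
    if drop_if_unique and not is_numeric_dtype(column) and n_distinct == n_rows:
        return True

    return False


ids = pd.Series(["A", "B", "C"])
print(is_uninformative(ids))                       # no optional check enabled
print(is_uninformative(ids, drop_if_unique=True))  # all values are distinct
```

Because missing values count as a distinct value in this sketch, a column such as ["x", "x", None] has two distinct values and is not treated as constant.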

Parameters:
drop_if_constant : bool, default=False

If True, drop the column if it contains only one unique value. Missing values count as one additional distinct value.

drop_if_unique : bool, default=False

If True, drop the column if all values are distinct. Missing values count as one additional distinct value. Numeric columns are never dropped by this criterion. This may lead to dropping columns that contain free-flowing text.

drop_null_fraction : float or None, default=1.0

Drop the column if its fraction of missing values is larger than this threshold. With the default of 1.0, the column is dropped only if all its values are missing. If None, keep the column even if all its values are missing.

Examples

>>> from skrub import DropUninformative
>>> import pandas as pd
>>> df = pd.DataFrame({"col1": [None, None, None]})

By default, only columns whose values are all null are dropped:

>>> du = DropUninformative()
>>> du.fit_transform(df["col1"])
[]

It is also possible to drop constant columns, or specify a lower null fraction threshold:

>>> df = pd.DataFrame({"col1": [1, 2, None], "col2": ["const", "const", "const"]})
>>> du = DropUninformative(drop_if_constant=True, drop_null_fraction=0.1)
>>> du.fit_transform(df["col1"])
[]
>>> du.fit_transform(df["col2"])
[]

Finally, it is possible to set drop_if_unique to True in order to drop string columns that contain all distinct values:

>>> df = pd.DataFrame({"col1": ["A", "B", "C"]})
>>> du = DropUninformative(drop_if_unique=True)
>>> du.fit_transform(df["col1"])
[]

Methods

fit(column[, y])

Fit the transformer.

fit_transform(column[, y])

Fit the transformer and transform a column.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

set_transform_request(*[, column])

Request metadata passed to the transform method.

transform(column)

Transform a column.

fit(column, y=None)

Fit the transformer.

Subclasses should implement fit_transform and transform.

Parameters:
column : a pandas or polars Series

Unlike most scikit-learn transformers, single-column transformers transform a single column, not a whole dataframe.

y : column or dataframe

Prediction targets.

Returns:
self

The fitted transformer.

fit_transform(column, y=None)

Fit the transformer and transform a column.

Parameters:
column : pandas or polars Series

The input column to check.

y : None

Ignored.

Returns:
column

The input column, or an empty list if the column is chosen to be dropped.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
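The <component>__<parameter> form can be illustrated with a generic scikit-learn Pipeline. This is a small sketch rather than anything specific to DropUninformative: the step names "scale" and "clf" and the choice of StandardScaler and LogisticRegression are assumptions made only for the example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A nested object: each step is addressed by the name given here.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Update parameters of individual steps with <component>__<parameter>.
pipe.set_params(clf__C=0.5, scale__with_mean=False)

# get_params exposes the same nested names.
params = pipe.get_params()
print(params["clf__C"], params["scale__with_mean"])
```

The same call pattern applies to a simple estimator, where the parameter names are used directly without a component prefix.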

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

set_transform_request(*, column='$UNCHANGED$')

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). See the scikit-learn User Guide for how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
column : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for the column parameter in transform.

Returns:
self : object

The updated object.

transform(column)

Transform a column.

Parameters:
column : pandas or polars Series

The input column to check.

Returns:
column

The input column, or an empty list if the column is chosen to be dropped.