DropUninformative
- class skrub.DropUninformative(drop_if_constant=False, drop_if_unique=False, drop_null_fraction=1.0)
Drop a column if it is found to be uninformative according to one or more criteria.
Note
DropUninformative is a type of single-column transformer. Unlike most scikit-learn estimators, its fit, transform and fit_transform methods expect a single column (a pandas or polars Series) rather than a full dataframe. To apply this transformer to one or more columns in a dataframe, use it as a parameter in a skrub.TableVectorizer or sklearn.compose.ColumnTransformer. In the ColumnTransformer, pass a single column: make_column_transformer((DropUninformative(), 'col_name_1'), (DropUninformative(), 'col_name_2')) instead of make_column_transformer((DropUninformative(), ['col_name_1', 'col_name_2'])).
A column is considered to be “uninformative” if one or more of the following issues are found:
- The fraction of missing values is larger than a certain fraction (by default, all values must be null for the column to be dropped).
- The column includes only one unique value (the column is constant). Missing values are considered a separate value.
- The number of unique values in the column is equal to the length of the column, i.e., all values are unique. This is only considered for non-numeric columns. Missing values are considered a separate value. Note that this may lead to dropping columns that contain free-flowing text.
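The three criteria above can be sketched in plain Python. This is an illustrative reading of the description, not skrub's implementation: the helper names `is_null` and `is_uninformative` are hypothetical, the column is modelled as a plain list, and the handling of the default threshold (drop only when every value is missing) and of empty columns are assumptions.

```python
import math


def is_null(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def is_uninformative(values, drop_if_constant=False, drop_if_unique=False,
                     drop_null_fraction=1.0):
    """Return True if the list `values` meets any enabled criterion."""
    n = len(values)
    if n == 0:
        # Assumption: an empty column is kept, which also avoids dividing by zero.
        return False

    n_null = sum(is_null(v) for v in values)
    non_null = [v for v in values if not is_null(v)]

    # Criterion 1: too many missing values. A fully null column is dropped
    # for any threshold, which covers the default of 1.0.
    if drop_null_fraction is not None:
        fraction = n_null / n
        if fraction >= 1.0 or fraction > drop_null_fraction:
            return True

    # Missing values count as one additional distinct value.
    n_distinct = len(set(non_null)) + (1 if n_null else 0)

    # Criterion 2: the column is constant.
    if drop_if_constant and n_distinct == 1:
        return True

    # Criterion 3: all values are distinct, checked for non-numeric columns only.
    numeric = bool(non_null) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    if drop_if_unique and not numeric and n_distinct == n:
        return True

    return False
```

With the defaults, only a fully null list such as `[None, None, None]` is flagged; the other two criteria apply only when their flags are set to True.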
- Parameters:
- drop_if_constant : bool, default=False
If True, drop the column if it contains only one unique value. Missing values count as one additional distinct value.
- drop_if_unique : bool, default=False
If True, drop the column if all values are distinct. Missing values count as one additional distinct value. Numeric columns are never dropped. This may lead to dropping columns that contain free-flowing text.
- drop_null_fraction : float or None, default=1.0
Drop columns with a fraction of missing values larger than this threshold. If None, keep the column even if all its values are missing.
Examples
>>> from skrub import DropUninformative
>>> import pandas as pd
>>> df = pd.DataFrame({"col1": [None, None, None]})
By default, only null columns are dropped:
>>> du = DropUninformative()
>>> du.fit_transform(df["col1"])
[]
It is also possible to drop constant columns, or specify a lower null fraction threshold:
>>> df = pd.DataFrame({"col1": [1, 2, None], "col2": ["const", "const", "const"]})
>>> du = DropUninformative(drop_if_constant=True, drop_null_fraction=0.1)
>>> du.fit_transform(df["col1"])
[]
>>> du.fit_transform(df["col2"])
[]
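To see why both columns in the example above are dropped, the arithmetic can be reproduced with plain Python lists. This is a sketch of the checks as described on this page, not a call into skrub; the variable names are illustrative.

```python
col1 = [1, 2, None]
col2 = ["const", "const", "const"]
threshold = 0.1  # the drop_null_fraction value used in the example

# col1: one of three values is missing, so the null fraction is 1/3.
null_fraction = sum(v is None for v in col1) / len(col1)
col1_dropped = null_fraction > threshold

# col2: a single distinct value, so the column is constant.
n_distinct = len(set(col2))
col2_dropped = n_distinct == 1

print(round(null_fraction, 3), col1_dropped)
print(n_distinct, col2_dropped)
```

Here `col1` is dropped because its null fraction exceeds 0.1, and `col2` is dropped because `drop_if_constant=True` and it holds a single distinct value.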
Finally, it is possible to set drop_if_unique to True in order to drop string columns that contain all distinct values:
>>> df = pd.DataFrame({"col1": ["A", "B", "C"]})
>>> du = DropUninformative(drop_if_unique=True)
>>> du.fit_transform(df["col1"])
[]
Methods
- fit(column[, y]): Fit the transformer.
- fit_transform(column[, y]): Fit the transformer and transform a column.
- get_params([deep]): Get parameters for this estimator.
- set_params(**params): Set the parameters of this estimator.
- set_transform_request(*[, column]): Request metadata passed to the transform method.
- transform(column): Transform a column.
- fit(column, y=None)
Fit the transformer.
Subclasses should implement fit_transform and transform.
- Parameters:
- column : a pandas or polars Series
Unlike most scikit-learn transformers, single-column transformers transform a single column, not a whole dataframe.
- y : column or dataframe
Prediction targets.
- Returns:
- self
The fitted transformer.
- fit_transform(column, y=None)
Fit the transformer and transform a column.
- Parameters:
- column : pandas or polars Series
The input column to check.
- y : None
Ignored.
- Returns:
- column
The input column, or an empty list if the column is chosen to be dropped.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
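The <component>__<parameter> naming convention can be illustrated with a small standalone helper. This is a sketch of how such a key decomposes, not scikit-learn's code; `split_nested_param` and the example component name are hypothetical.

```python
def split_nested_param(key):
    """Split '<component>__<parameter>' into (component, parameter).

    Keys without a double underscore address the estimator itself,
    so the component is returned as None.
    """
    component, separator, parameter = key.partition("__")
    if not separator:
        return None, key
    return component, parameter


# A key aimed at a step inside a nested object (step name is illustrative).
nested = split_nested_param("dropuninformative__drop_if_constant")

# A key aimed at a standalone estimator.
plain = split_nested_param("drop_null_fraction")
```

On a standalone DropUninformative the plain parameter name is used directly, for example set_params(drop_if_constant=True).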
- set_transform_request(*, column='$UNCHANGED$')
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.