Cleaner#

class skrub.Cleaner(drop_null_fraction=1.0, drop_if_constant=False, drop_if_unique=False, datetime_format=None, numeric_dtype=None, cast_to_str=False, n_jobs=1)[source]#

Column-wise consistency checks and sanitization of dtypes, null values and dates.

The Cleaner performs some consistency checks and basic preprocessing such as detecting null values represented as strings (e.g. 'N/A'), parsing dates, and removing uninformative columns. See the “Notes” section for a full list.

Parameters:
drop_null_fraction : float or None, default=1.0

Fraction of null values above which the column is dropped. If drop_null_fraction is set to 1.0, the column is dropped only if it contains nothing but nulls or NaNs (the default behavior). If drop_null_fraction is a number in [0.0, 1.0), the column is dropped if its fraction of nulls is strictly larger than drop_null_fraction. If drop_null_fraction is None, this selection is disabled: no columns are dropped based on the number of null values they contain.

drop_if_constant : bool, default=False

If set to True, drop columns that contain a single unique value. Note that missing values count as one additional distinct value.

drop_if_unique : bool, default=False

If set to True, drop columns in which all values are distinct, i.e., the number of unique values equals the number of rows. Numeric columns are never dropped.

datetime_format : str, default=None

The format to use when parsing dates. If None, the format is inferred.

numeric_dtype : “float32” or None, default=None

If set to “float32”, convert columns containing numerical information to the np.float32 dtype using the ToFloat transformer. If None, numeric columns are left unchanged.

cast_to_str : bool, default=False

If True, apply the ToStr transformer to non-numeric, non-categorical, and non-datetime columns, converting them to strings. If False, this step is skipped and such columns retain their original dtype (e.g., lists, structs).

n_jobs : int, default=1

Number of jobs to run in parallel. None means 1 unless in a joblib parallel_backend context. -1 means using all processors.

Attributes:
all_processing_steps_ : dict

Maps the name of each column to a list of all the processing steps that were applied to it.

all_outputs_ : list of str

Column names of the output of transform.

See also

TableVectorizer

Process columns of a dataframe and convert them to a numeric (vectorized) representation.

ToFloat

Convert numeric columns to np.float32, to have consistent numeric types and representation of missing values. More informative columns (e.g., categorical or datetime) are not converted.

ApplyToCols

Apply a given transformer separately to each column in a selection of columns. Useful to complement the default heuristics of the Cleaner.

ApplyToFrame

Apply a given transformer jointly to all columns in a selection of columns. Useful to complement the default heuristics of the Cleaner.

DropUninformative

Drop columns that are considered uninformative, e.g., containing only null values or a single unique value.

Notes

The Cleaner performs the following set of transformations on each column:

  • CleanNullStrings(): replace strings used to represent missing values with NA markers.

  • DropUninformative: drop the column if it is considered to be “uninformative”. A column is considered to be “uninformative” if it contains only missing values (drop_null_fraction), only a constant value (drop_if_constant), or if all values are distinct (drop_if_unique). By default, the Cleaner keeps all columns, unless they contain only missing values. Note that setting drop_if_unique to True may lead to dropping columns that contain text.

  • ToDatetime(): parse datetimes represented as strings and return them as actual datetimes with the correct dtype. If datetime_format is provided, it is forwarded to ToDatetime(). Otherwise, the format is inferred.

  • CleanCategories(): process categorical columns depending on the dataframe library (Pandas or Polars) to force consistent typing and avoid issues downstream.

  • ToStr(): convert columns to strings unless they are numerical, categorical, or datetime. This step is controlled by the cast_to_str parameter. When cast_to_str=False (the default), string conversion is skipped; when cast_to_str=True, it is applied.

If numeric_dtype is set to “float32”, the Cleaner will also convert numeric columns to this dtype, including numbers represented as strings, ensuring a consistent representation of numbers and missing values. This can be useful if the Cleaner is used as a preprocessing step in a skrub pipeline.

Examples

>>> from skrub import Cleaner
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'A': ['one', 'two', 'two', 'three'],
...     'B': ['02/02/2024', '23/02/2024', '12/03/2024', '13/03/2024'],
...     'C': ['1.5', 'N/A', '12.2', 'N/A'],
...     'D': [1.5, 2.0, 2.5, 3.0],
... })
>>> df
       A           B     C    D
0    one  02/02/2024   1.5  1.5
1    two  23/02/2024   N/A  2.0
2    two  12/03/2024  12.2  2.5
3  three  13/03/2024   N/A  3.0
>>> df.dtypes
A       ...
B       ...
C       ...
D   float64
dtype: object

The Cleaner will parse datetime columns and convert null markers to representations suited to each column’s dtype (e.g., np.nan for numeric columns).

>>> cleaner = Cleaner()
>>> cleaner.fit_transform(df)
       A          B     C    D
0    one 2024-02-02   1.5  1.5
1    two 2024-02-23   None  2.0
2    two 2024-03-12  12.2  2.5
3  three 2024-03-13   None  3.0
>>> cleaner.fit_transform(df).dtypes
A               ...
B    datetime64[ns]
C               ...
D           float64
dtype: object

We can inspect all the processing steps that were applied to a given column:

>>> cleaner.all_processing_steps_['A']
[CleanNullStrings(), DropUninformative()]
>>> cleaner.all_processing_steps_['B']
[CleanNullStrings(), DropUninformative(), ToDatetime()]
>>> cleaner.all_processing_steps_['C']
[CleanNullStrings(), DropUninformative()]
>>> cleaner.all_processing_steps_['D']
[DropUninformative()]

Methods

fit(X[, y])

Fit transformer.

fit_transform(X[, y])

Fit transformer and transform dataframe.

get_feature_names_out([input_features])

Return the column names of the output of transform as a list of strings.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform dataframe.

fit(X, y=None)[source]#

Fit transformer.

Parameters:
X : dataframe of shape (n_samples, n_features)

Input data to transform.

y : array_like of shape (n_samples,) or (n_samples, n_outputs) or None, default=None

Target values for supervised learning (None for unsupervised transformations).

Returns:
self : Cleaner

The fitted estimator.

fit_transform(X, y=None)[source]#

Fit transformer and transform dataframe.

Parameters:
X : dataframe of shape (n_samples, n_features)

Input data to transform.

y : array_like of shape (n_samples,) or (n_samples, n_outputs) or None, default=None

Target values for supervised learning (None for unsupervised transformations).

Returns:
dataframe

The transformed input.

get_feature_names_out(input_features=None)[source]#

Return the column names of the output of transform as a list of strings.

Parameters:
input_features : array_like of str or None, default=None

Ignored.

Returns:
list of strings

The column names.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_output(*, transform=None)[source]#

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:
transform : {“default”, “pandas”, “polars”}, default=None

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • “polars”: Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:
self : estimator instance

Estimator instance.

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

transform(X)[source]#

Transform dataframe.

Parameters:
X : dataframe of shape (n_samples, n_features)

Input data to transform.

Returns:
dataframe

The transformed input.