Cleaner#
- class skrub.Cleaner(drop_null_fraction=1.0, drop_if_constant=False, drop_if_unique=False, n_jobs=1)[source]#
Column-wise consistency checks and sanitization, e.g., of null values or dates.
The Cleaner performs some consistency checks and basic preprocessing, such as detecting null values represented as strings (e.g. 'N/A') or parsing dates. See the “Notes” section for a full list.
- Parameters:
- drop_null_fraction : float or None, default=1.0
Fraction of null values above which the column is dropped. If drop_null_fraction is set to 1.0, the column is dropped if it contains only nulls or NaNs (this is the default behavior). If drop_null_fraction is a number in [0.0, 1.0), the column is dropped if the fraction of nulls is strictly larger than drop_null_fraction. If drop_null_fraction is None, this selection is disabled: no columns are dropped based on the number of null values they contain.
- drop_if_constant : bool, default=False
If set to True, drop columns that contain a single unique value. Note that missing values are counted as one additional distinct value.
- drop_if_unique : bool, default=False
If set to True, drop columns that contain only unique values, i.e., the number of unique values is equal to the number of rows in the column. Numeric columns are never dropped. The sketch after this parameter list shows how the three drop_* parameters interact.
- n_jobs : int, default=1
Number of jobs to run in parallel. None means 1 unless in a joblib parallel_backend context. -1 means using all processors.
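A minimal sketch of how the three drop_* parameters interact; the frame and column names below are illustrative, not from the skrub documentation:

>>> import pandas as pd
>>> from skrub import Cleaner
>>> df = pd.DataFrame({
...     'all_null': [None, None, None],  # dropped by drop_null_fraction=1.0 (default)
...     'constant': ['x', 'x', 'x'],     # dropped only when drop_if_constant=True
...     'id': ['a', 'b', 'c'],           # all values distinct: dropped when drop_if_unique=True
...     'value': [1.0, 2.0, 3.0],        # numeric, so never dropped by drop_if_unique
... })
>>> cleaner = Cleaner(drop_if_constant=True, drop_if_unique=True)
>>> list(cleaner.fit_transform(df).columns)  # expected: ['value']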
- Attributes:
- all_processing_steps_ : dict
Maps the name of each column to a list of all the processing steps that were applied to it.
See also
TableVectorizer
Process columns of a dataframe and convert them to a numeric (vectorized) representation.
Notes
The Cleaner performs the following set of transformations on each column (a standalone usage sketch follows this list):
- CleanNullStrings(): replace strings used to represent null values with actual null values.
- DropUninformative(): drop the column if it contains too many null values, if it contains only one distinct value, or if all values are distinct (the latter two checks are off by default).
- ToDatetime(): parse datetimes represented as strings and return them as actual datetimes with the correct dtype.
- CleanCategories(): process categorical columns depending on the dataframe library (pandas or Polars) to force consistent typing and avoid issues downstream.
- ToStr(): convert columns to strings, unless they are numeric, categorical, or datetime.
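Some of these steps are also usable on their own; for instance, a short sketch applying ToDatetime to a single column (assuming it is exposed at the top level, as in recent skrub versions; the other steps may only be available internally):

>>> import pandas as pd
>>> from skrub import ToDatetime
>>> s = pd.Series(['02/02/2024', '23/02/2024', '12/03/2024'], name='when')
>>> ToDatetime().fit_transform(s)  # expected: a datetime64 column with the parsed dates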
The Cleaner object should only be used for preliminary sanitization of the data, because it does not perform any transformations on numeric columns. On the other hand, the TableVectorizer converts numeric columns to float32 and ensures that null values are represented with NaNs, which can be handled correctly by downstream scikit-learn estimators (a short comparison sketch follows the examples below).
Examples
>>> from skrub import Cleaner
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'A': ['one', 'two', 'two', 'three'],
...     'B': ['02/02/2024', '23/02/2024', '12/03/2024', '13/03/2024'],
...     'C': ['1.5', 'N/A', '12.2', 'N/A'],
...     'D': [1.5, 2.0, 2.5, 3.0],
... })
>>> df
       A           B     C    D
0    one  02/02/2024   1.5  1.5
1    two  23/02/2024   N/A  2.0
2    two  12/03/2024  12.2  2.5
3  three  13/03/2024   N/A  3.0
>>> df.dtypes
A     object
B     object
C     object
D    float64
dtype: object
The Cleaner parses datetime columns and converts null values to a representation suited to the column's dtype (e.g., np.nan for numeric columns).
>>> cleaner = Cleaner()
>>> cleaner.fit_transform(df)
       A          B     C    D
0    one 2024-02-02   1.5  1.5
1    two 2024-02-23   NaN  2.0
2    two 2024-03-12  12.2  2.5
3  three 2024-03-13   NaN  3.0
>>> cleaner.fit_transform(df).dtypes
A            object
B    datetime64[ns]
C            object
D           float64
dtype: object
We can inspect all the processing steps that were applied to a given column:
>>> cleaner.all_processing_steps_['A']
[CleanNullStrings(), DropUninformative(), ToStr()]
>>> cleaner.all_processing_steps_['B']
[CleanNullStrings(), DropUninformative(), ToDatetime()]
>>> cleaner.all_processing_steps_['C']
[CleanNullStrings(), DropUninformative(), ToStr()]
>>> cleaner.all_processing_steps_['D']
[DropUninformative()]
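For comparison, a short sketch of running TableVectorizer on the same frame; the exact output columns depend on its default encoders:

>>> from skrub import TableVectorizer
>>> vectorized = TableVectorizer().fit_transform(df)
>>> vectorized.dtypes  # numeric columns cast to float32; other columns encoded to numeric features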
Methods

fit(X[, y])    Fit transformer.
fit_transform(X[, y])    Fit transformer and transform dataframe.
get_params([deep])    Get parameters for this estimator.
set_output(*[, transform])    Set output container.
set_params(**params)    Set the parameters of this estimator.
transform(X)    Transform dataframe.
- fit(X, y=None)[source]#
Fit transformer.
- Parameters:
- X : dataframe of shape (n_samples, n_features)
Input data to transform.
- y : array_like of shape (n_samples,) or (n_samples, n_outputs), or None, default=None
Target values for supervised learning (None for unsupervised transformations).
- Returns:
- self : Cleaner
The fitted estimator.
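A minimal sketch of fitting on one dataframe and applying the learned steps to another; the frames below are illustrative:

>>> import pandas as pd
>>> from skrub import Cleaner
>>> df_train = pd.DataFrame({'B': ['02/02/2024', '23/02/2024'], 'C': ['1.5', 'N/A']})
>>> df_test = pd.DataFrame({'B': ['12/03/2024'], 'C': ['12.2']})
>>> cleaner = Cleaner()
>>> cleaner.fit(df_train)  # learn the per-column processing steps
Cleaner()
>>> cleaned = cleaner.transform(df_test)  # apply the same steps to unseen data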
- fit_transform(X, y=None)[source]#
Fit transformer and transform dataframe.
- Parameters:
- X : dataframe of shape (n_samples, n_features)
Input data to transform.
- y : array_like of shape (n_samples,) or (n_samples, n_outputs), or None, default=None
Target values for supervised learning (None for unsupervised transformations).
- Returns:
- dataframe
The transformed input.
- set_output(*, transform=None)[source]#
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
- transform : {“default”, “pandas”, “polars”}, default=None
Configure output of transform and fit_transform.
- “default”: Default output format of a transformer
- “pandas”: DataFrame output
- “polars”: Polars output
- None: Transform configuration is unchanged
Added in version 1.4: “polars” option was added.
- Returns:
- selfestimator instance
Estimator instance.
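A brief sketch, assuming polars is installed and that Cleaner follows the scikit-learn set_output API as documented here:

>>> from skrub import Cleaner
>>> cleaner = Cleaner().set_output(transform="polars")
>>> # transform and fit_transform now return polars DataFrames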
- set_params(**params)[source]#
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- selfestimator instance
Estimator instance.
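A short sketch of updating a parameter after construction:

>>> from skrub import Cleaner
>>> cleaner = Cleaner()
>>> cleaner.set_params(drop_if_constant=True)
Cleaner(drop_if_constant=True)
>>> cleaner.get_params()['drop_if_constant']
True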