Cleaner
- class skrub.Cleaner(drop_null_fraction=1.0, drop_if_constant=False, drop_if_unique=False, datetime_format=None, null_strings=None, parse_numbers=False, cast_to_float32=False, cast_to_str=False, n_jobs=1, numeric_dtype=None)
Column-wise consistency checks and sanitization of dtypes, null values and dates.
The Cleaner performs some consistency checks and basic preprocessing such as detecting null values represented as strings (e.g. 'N/A'), parsing dates, and removing uninformative columns. See the “Notes” section for a full list.
- Parameters:
- drop_null_fraction
float or None, default=1.0 Fraction of null values above which the column is dropped. If drop_null_fraction is set to 1.0, the column is dropped if it contains only nulls or NaNs (this is the default behavior). If drop_null_fraction is a number in [0.0, 1.0), the column is dropped if the fraction of nulls is strictly larger than drop_null_fraction. If drop_null_fraction is None, this selection is disabled: no columns are dropped based on the number of null values they contain.
- drop_if_constant
bool, default=False If set to true, drop columns that contain a single unique value. Note that missing values are considered as one additional distinct value.
- drop_if_unique
bool, default=False If set to true, drop columns that contain only unique values, i.e., the number of unique values is equal to the number of rows in the column. Numeric columns are never dropped.
Deprecated since version 0.9.0.
This functionality can drop informative columns and is unlikely to be of use in practice. It is therefore deprecated and will be removed in a future version.
- datetime_format
str, default=None The format to use when parsing dates. If None, the format is inferred.
- parse_numbers
bool, default=False Whether to parse strings that represent numeric values.
False: no numeric parsing is attempted.
True: apply ToFloat to string columns. String columns whose non-missing values can all be parsed as numbers are converted to float32.
- cast_to_float32
bool, default=False Whether to cast numeric columns to float32. If set to True, numeric columns are converted to float32.
- cast_to_str
bool, default=False If True, apply the ToStr transformer to non-numeric, non-categorical, and non-datetime columns, converting them to strings. If False, this step is skipped and such columns retain their original dtype (e.g., lists, structs).
- numeric_dtype
“float32” or None, default=None If set to “float32”, this parameter has the same effect as cast_to_float32=True and parse_numbers=True: it parses numeric strings and casts numeric columns to float32.
Deprecated since version 0.9.0: Use cast_to_float32=True with parse_numbers=True instead.
- null_strings
str or sequence of str, default=None Additional strings to consider as null values, beyond the default list.
- n_jobs
int, default=1 Number of jobs to run in parallel. None means 1 unless in a joblib parallel_backend context. -1 means using all processors.
- Attributes:
See also
- TableVectorizer
Process columns of a dataframe and convert them to a numeric (vectorized) representation.
- ToFloat
Convert numeric columns to np.float32, to have consistent numeric types and representation of missing values. More informative columns (e.g., categorical or datetime) are not converted.
- ApplyToCols
Apply a given transformer to each column in a selection of columns. Useful to complement the default heuristics of the Cleaner.
- DropUninformative
Drop columns that are considered uninformative, e.g., containing only null values or a single unique value.
Notes
The Cleaner performs the following set of transformations on each column:
- CleanNullStrings(): replace strings used to represent missing values with NA markers.
- DropUninformative: drop the column if it is considered to be “uninformative”. A column is considered to be “uninformative” if it contains only missing values (drop_null_fraction) or only a constant value (drop_if_constant). By default, the Cleaner keeps all columns, unless they contain only missing values.
- ToDatetime: parse datetimes represented as strings and return them as actual datetimes with the correct dtype. If datetime_format is provided, it is forwarded to ToDatetime. Otherwise, the format is inferred.
- ToFloat: if parse_numbers=True, apply ToFloat on string columns, converting strings whose non-missing values can all be parsed as numbers to float32; if cast_to_float32=True, apply ToFloat on numeric columns to cast them to float32.
- CleanCategories(): process categorical columns depending on the dataframe library (Pandas or Polars) to force consistent typing and avoid issues downstream.
- ToStr(): convert columns to strings unless they are numerical, categorical, or datetime. This step is controlled by the cast_to_str parameter. When cast_to_str=False (default), string conversion is skipped. When cast_to_str=True, string conversion is applied.
Example:
>>> import pandas as pd
>>> from skrub import Cleaner
>>> df = pd.DataFrame({"num_str": ["1", "2"], "num": [1, 2], "f": [1.0, 2.0]})
>>> Cleaner(parse_numbers=False).fit_transform(df).dtypes
num_str        ...
num            ...
f          float64
dtype: object
>>> cleaner = Cleaner(parse_numbers=True, cast_to_float32=True)
>>> cleaner.fit_transform(df).dtypes
num_str    float32
num            ...
f          float32
dtype: object
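The DropUninformative rule described above can be approximated in plain pandas. The sketch below is only an illustration of the documented thresholds, not skrub's implementation; the helper name is_uninformative is hypothetical.

```python
import pandas as pd


def is_uninformative(col, drop_null_fraction=1.0, drop_if_constant=False):
    """Hypothetical helper mirroring the documented rule (not skrub's code)."""
    if drop_null_fraction is not None:
        null_fraction = col.isna().mean()
        if drop_null_fraction == 1.0:
            # Default: drop only columns made up entirely of nulls.
            if null_fraction == 1.0:
                return True
        elif null_fraction > drop_null_fraction:
            # Strict inequality: a fraction equal to the threshold is kept.
            return True
    if drop_if_constant:
        # Missing values count as one additional distinct value.
        return col.nunique(dropna=False) == 1
    return False


df = pd.DataFrame({
    "all_null": [None, None, None, None],
    "half_null": [1.0, None, 2.0, None],
    "constant": ["x", "x", "x", "x"],
    "constant_with_null": ["x", "x", None, "x"],
})

default_drops = [c for c in df if is_uninformative(df[c])]
strict_drops = [c for c in df if is_uninformative(df[c], drop_null_fraction=0.4)]
constant_drops = [c for c in df if is_uninformative(df[c], drop_if_constant=True)]
print(default_drops)   # ['all_null']
print(strict_drops)    # ['all_null', 'half_null']
print(constant_drops)  # ['all_null', 'constant']
```

Note that constant_with_null is kept even with drop_if_constant=True, because the missing entry counts as a second distinct value.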
Examples
>>> from skrub import Cleaner
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'A': ['one', 'two', 'two', 'three'],
...     'B': ['02/02/2024', '23/02/2024', '12/03/2024', '13/03/2024'],
...     'C': ['1.5', 'N/A', '12.2', 'N/A'],
...     'D': [1.5, 2.0, 2.5, 3.0],
... })
>>> df
       A           B     C    D
0    one  02/02/2024   1.5  1.5
1    two  23/02/2024   N/A  2.0
2    two  12/03/2024  12.2  2.5
3  three  13/03/2024   N/A  3.0
>>> df.dtypes
A        ...
B        ...
C        ...
D    float64
dtype: object
The Cleaner will parse datetime columns and convert null values to the representation appropriate for each column's dtype (e.g., np.nan for numerical columns).
>>> cleaner = Cleaner()
>>> cleaner.fit_transform(df)
       A          B     C    D
0    one 2024-02-02   1.5  1.5
1    two 2024-02-23   ...  2.0
2    two 2024-03-12  12.2  2.5
3  three 2024-03-13   ...  3.0
>>> cleaner.fit_transform(df).dtypes
A               ...
B    datetime64[ns]
C               ...
D           float64
dtype: object
Columns can be excluded from processing by combining the Cleaner with ApplyToCols. For example, to exclude the datetime column from processing and keep it as a string, we can do:
>>> from skrub import ApplyToCols
>>> import skrub.selectors as s
>>> ApplyToCols(Cleaner(), s.all() - 'B').fit_transform(df)
            B      A     C    D
0  02/02/2024    one   1.5  1.5
1  23/02/2024    two   ...  2.0
2  12/03/2024    two  12.2  2.5
3  13/03/2024  three   ...  3.0
We can inspect all the processing steps that were applied to a given column:
>>> cleaner.all_processing_steps_['A']
[CleanNullStrings(), DropUninformative()]
>>> cleaner.all_processing_steps_['B']
[CleanNullStrings(), DropUninformative(), ToDatetime()]
>>> cleaner.all_processing_steps_['C']
[CleanNullStrings(), DropUninformative()]
>>> cleaner.all_processing_steps_['D']
[DropUninformative()]
Methods
fit(X[, y]): Fit transformer.
fit_transform(X[, y]): Fit transformer and transform dataframe.
get_feature_names_out([input_features]): Return the column names of the output of transform as a list of strings.
get_params([deep]): Get parameters for this estimator.
set_output(*[, transform]): Set output container.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform dataframe.
- fit(X, y=None)
Fit transformer.
- Parameters:
- X
dataframe of shape (n_samples, n_features) Input data to transform.
- y
array_like of shape (n_samples,) or (n_samples, n_outputs) or None, default=None Target values for supervised learning (None for unsupervised transformations).
- Returns:
- self
Cleaner The fitted estimator.
- fit_transform(X, y=None)
Fit transformer and transform dataframe.
- Parameters:
- X
dataframe of shape (n_samples, n_features) Input data to transform.
- y
array_like of shape (n_samples,) or (n_samples, n_outputs) or None, default=None Target values for supervised learning (None for unsupervised transformations).
- Returns:
- dataframe
The transformed input.
- get_feature_names_out(input_features=None)
Return the column names of the output of transform as a list of strings.
- Parameters:
- input_features
array_like of str or None, default=None Ignored.
- Returns:
list of strings The column names.
- set_output(*, transform=None)
Set output container.
See “Introducing the set_output API” for an example on how to use the API.
- Parameters:
- transform
{“default”, “pandas”, “polars”}, default=None Configure output of transform and fit_transform.
“default”: Default output format of a transformer
“pandas”: DataFrame output
“polars”: Polars output
None: Transform configuration is unchanged
Added in version 1.4: “polars” option was added.
- Returns:
- self
estimator instance Estimator instance.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params
dict Estimator parameters.
- Returns:
- self
estimator instance Estimator instance.