Cleaner: sanitizing a dataframe
Very often, the first steps in preparing a dataframe for further use involve understanding the datatypes in the data and changing them into a more suitable format (e.g., from string to number or datetime).
The Cleaner aids with this by running the following set of transformations on each column:
Clean null strings: Replace strings typically used to represent missing values with a null value suitable for the column under consideration.
DropUninformative: Drop the column if it is considered “uninformative.” A column is considered “uninformative” if it contains only missing values (drop_null_fraction), only a constant value (drop_if_constant), or if all values are distinct (drop_if_unique). By default, the Cleaner keeps all columns unless they contain only missing values. Refer to Removing unneeded columns with DropUninformative and Cleaner for more detail on this operation.
Note
Setting drop_if_unique to True may lead to dropping columns that contain text or IDs. Numeric columns are never dropped by drop_if_unique.
ToDatetime: Parse datetimes represented as strings and return them as actual datetimes with the correct dtype. If datetime_format is provided, it is forwarded to ToDatetime. Otherwise, the format is guessed according to common datetime formats.
Clean categories: If the dtype of the column is detected as “Categorical”, process it based on the dataframe library (Pandas or Polars) to ensure consistent typing and avoid downstream issues.
Convert to strings: Convert columns to strings unless they have a more informative dtype, such as numeric, categorical, or datetime.
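The three “uninformative” criteria listed above can be illustrated with plain pandas. This is only a sketch of the idea, not skrub's implementation: the helper `uninformative_reasons` and the column names are hypothetical, and the numeric exemption simply mirrors the note on drop_if_unique.

```python
import pandas as pd

df = pd.DataFrame({
    "all_missing": [None, None, None],
    "constant": ["a", "a", "a"],
    "user_id": ["u1", "u2", "u3"],
    "score": [0.5, 0.7, 0.5],
})


def uninformative_reasons(column):
    """Hypothetical helper: list which of the three criteria a column meets."""
    non_null = column.dropna()
    if non_null.empty:
        return ["only missing values"]
    reasons = []
    # A single distinct value (counting nulls as a value) means the column is constant.
    if column.nunique(dropna=False) == 1:
        reasons.append("constant")
    # Numeric columns are exempt from the uniqueness check, as in the note above.
    is_numeric = pd.api.types.is_numeric_dtype(column)
    if not is_numeric and len(non_null) == len(column) and non_null.is_unique:
        reasons.append("all values distinct")
    return reasons


report = {name: uninformative_reasons(df[name]) for name in df.columns}
for name, reasons in report.items():
    print(name, reasons)
```

Here "user_id" is flagged as all-distinct even though it may be a meaningful key, which is why enabling drop_if_unique deserves some care.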
If numeric_dtype is set to float32, the Cleaner will also convert numeric columns to np.float32 dtype, ensuring a consistent representation of numbers and missing values. This can be useful if the Cleaner is used as a preprocessing step at the beginning of an ML pipeline.
The Cleaner is a scikit-learn compatible transformer:
>>> from skrub import Cleaner
>>> import pandas as pd
>>> df = pd.DataFrame({
... "id": [1, 2, 3],
... "all_missing": ["", "", ""],
... "date": ["2024-05-05", "2024-05-06", "2024-05-07"],
... })
>>> df_clean = Cleaner().fit_transform(df)
>>> df_clean
id date
0 1 2024-05-05
1 2 2024-05-06
2 3 2024-05-07
>>> df_clean.dtypes
id int64
date datetime64[ns]
dtype: object
Note that the "all_missing" column has been dropped, and that the "date" column has been correctly parsed as a datetime column.
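Format guessing works well for unambiguous strings such as ISO dates, but some strings can be read in more than one way. The sketch below uses plain pandas `to_datetime` rather than skrub to show why an explicit format, such as the one passed through datetime_format, can matter. The sample values are made up for illustration.

```python
import pandas as pd

raw = pd.Series(["05/06/2024", "07/06/2024"])

# Day-first reading: 5 June and 7 June 2024.
day_first = pd.to_datetime(raw, format="%d/%m/%Y")

# Month-first reading of the same strings: 6 May and 6 July 2024.
month_first = pd.to_datetime(raw, format="%m/%d/%Y")

print(day_first.dt.month.tolist())    # [6, 6]
print(month_first.dt.month.tolist())  # [5, 7]
```

Both calls succeed, so the wrong choice would go unnoticed unless the format is stated explicitly.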
Converting numeric dtypes to float32 with the Cleaner
By default, when the Cleaner encounters numeric dtypes (e.g., int8, float64), it leaves them as-is. In some cases, it may be beneficial to have the same numeric dtype for all numeric columns to guarantee compatibility between values.
The Cleaner allows conversion of numeric features to float32 by setting the numeric_dtype parameter:
>>> from skrub import Cleaner
>>> cleaner = Cleaner(numeric_dtype="float32")
>>> import pandas as pd
>>> df = pd.DataFrame({
... "id": [1, 2, 3],
... })
>>> df.dtypes
id int64
dtype: object
>>> df_cleaned = cleaner.fit_transform(df)
>>> df_cleaned.dtypes
id float32
dtype: object
Setting the dtype to float32 reduces RAM footprint for most use cases and ensures that all missing values have the same representation. This also ensures compatibility with scikit-learn transformers.
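The memory and missing-value claims can be checked with pandas and NumPy alone. The sketch below does not call the Cleaner; it applies the same target dtype by hand with `astype`. It also shows one caveat worth keeping in mind: float32 represents integers exactly only up to 2**24, so very large integer identifiers may lose precision.

```python
import numpy as np
import pandas as pd

# 1,000 int64 values take 8 bytes each; float32 takes 4 bytes each.
ints = pd.Series(np.arange(1_000, dtype="int64"))
floats = ints.astype("float32")
bytes_int64 = ints.memory_usage(index=False)
bytes_float32 = floats.memory_usage(index=False)
print(bytes_int64, bytes_float32)  # 8000 4000

# A nullable integer column stores missing values as pd.NA;
# after conversion they become NaN in a plain float32 column.
nullable = pd.Series([1, None, 3], dtype="Int64")
as_float32 = nullable.astype("float32")
print(as_float32.isna().tolist())  # [False, True, False]

# Above 2**24, neighbouring integers can map to the same float32 value.
largest_exact = 2**24
collides = np.float32(largest_exact + 1) == np.float32(largest_exact)
print(collides)  # True
```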