Cleaner: sanitizing a dataframe#
Very often, the first steps in preparing a dataframe for further use involve understanding the datatypes in the data and changing them into a more suitable format (e.g., from string to number or datetime).
The Cleaner aids with this by running the following set of transformations on
each column:
Clean null strings: Replace strings typically used to represent missing values with a null value suitable for the column under consideration.
DropUninformative: Drop the column if it is considered “uninformative.” A column is considered “uninformative” if it contains only missing values (drop_null_fraction), only a constant value (drop_if_constant), or if all values are distinct (drop_if_unique). By default, theCleanerkeeps all columns unless they contain only missing values. Refer to Removing unneeded columns with DropUninformative and Cleaner for more detail on this operation.
Note
Setting drop_if_unique to True may lead to dropping columns
that contain text or IDs. Numeric columns are never dropped by drop_if_unique.
ToDatetime: Parse datetimes represented as strings and return them as actual datetimes with the correct dtype. Ifdatetime_formatis provided, it is forwarded toToDatetime. Otherwise, the format is guessed according to common datetime formats.Clean categories: If the dtype of the column is detected as “Categorical”, process it based on the dataframe library (Pandas or Polars) to ensure consistent typing and avoid downstream issues.
Convert to strings: Convert columns to strings unless they have a more informative dtype, such as numeric, categorical, or datetime.
If numeric_dtype is set to float32, the Cleaner will also convert
numeric columns to np.float32 dtype, ensuring a consistent representation
of numbers and missing values. This can be useful if the Cleaner
is used as a preprocessing step at the beginning of an ML pipeline.
The Cleaner is a scikit-learn compatible transformer:
>>> from skrub import Cleaner
>>> import pandas as pd
>>> df = pd.DataFrame({
... "id": [1, 2, 3],
... "all_missing": ["", "", ""],
... "date": ["2024-05-05", "2024-05-06", "2024-05-07"],
... })
>>> df_clean = Cleaner().fit_transform(df)
>>> df_clean
id date
0 1 2024-05-05
1 2 2024-05-06
2 3 2024-05-07
>>> df_clean.dtypes
id int64
date datetime64[ns]
dtype: object
Note that the "all_missing" column has been dropped, and that the "date"
column has been correctly parsed as a datetime column.
Converting numeric dtypes to float32 with the Cleaner#
By default, when the Cleaner encounters numeric dtypes (e.g., int8,
float64), it leaves them as-is. In some cases, it may be beneficial to have
the same numeric dtype for all numeric columns to guarantee compatibility between
values.
The Cleaner allows conversion of numeric features to float32 by setting
the numeric_dtype parameter:
>>> from skrub import Cleaner
>>> cleaner = Cleaner(numeric_dtype="float32")
>>> import pandas as pd
>>> df = pd.DataFrame({
... "id": [1, 2, 3],
... })
>>> df.dtypes
id int64
dtype: object
>>> df_cleaned = cleaner.fit_transform(df)
>>> df_cleaned.dtypes
id float32
dtype: object
Setting the dtype to float32 reduces RAM footprint for most use cases and
ensures that all missing values have the same representation. This also ensures
compatibility with scikit-learn transformers.