Preprocessing data with the skrub Cleaner

Introduction

In this chapter, we show how to quickly preprocess and sanitize data using skrub’s Cleaner.

Using the skrub Cleaner

The Cleaner is intended to be a first step in preparing tabular data for analysis or modeling, and can handle a variety of common data cleaning tasks automatically. It is designed to work out-of-the-box with minimal configuration, although it is also possible to customize its behavior if needed.

Given a dataframe, the Cleaner applies a sequence of transformers to each of its columns.

Consider this example dataframe:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "numerical_1": [1, 2, 3, 4, 5],
        "numerical_2": [10.5, 20.3, None, 40.1, 50.2],
        "string_column": ["apple", "?", "banana", "cherry", "?"],
        "datetime_column": [
            "03 Jan 2020",
            "04 Jan 2020",
            "05 Jan 2020",
            "06 Jan 2020",
            "07 Jan 2020",
        ],
        "all_none": [None, None, None, None, None],
    }
)
df
   numerical_1  numerical_2 string_column datetime_column all_none
0            1         10.5         apple     03 Jan 2020     None
1            2         20.3             ?     04 Jan 2020     None
2            3          NaN        banana     05 Jan 2020     None
3            4         40.1        cherry     06 Jan 2020     None
4            5         50.2             ?     07 Jan 2020     None

This dataframe has columns of mixed types, with some missing values denoted as None and others as "?". The datetime column uses a non-standard format and has therefore been parsed as a string column. Finally, one of the columns is completely empty.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   numerical_1      5 non-null      int64  
 1   numerical_2      4 non-null      float64
 2   string_column    5 non-null      object 
 3   datetime_column  5 non-null      object 
 4   all_none         0 non-null      object 
dtypes: float64(1), int64(1), object(3)
memory usage: 332.0+ bytes

Using plain pandas

Cleaning this dataset using plain pandas may require writing code like this:

# Parse the datetime strings with a specific format
df['datetime_column'] = pd.to_datetime(df['datetime_column'], format='%d %b %Y')

# Drop columns with only a single unique value
df_clean = df.loc[:, df.nunique(dropna=True) > 1]

# Replace "?" with np.nan in all string columns
df_clean = df_clean.replace("?", np.nan)

# Function to drop columns with only missing values or empty strings
def drop_empty_columns(df):
    # Drop columns with only missing values
    df_clean = df.dropna(axis=1, how='all')
    # Drop columns with only empty strings
    empty_string_cols = df_clean.columns[df_clean.eq('').all()]
    df_clean = df_clean.drop(columns=empty_string_cols)
    return df_clean

# Apply the function to the DataFrame
df_clean = drop_empty_columns(df_clean)
df_clean
   numerical_1  numerical_2 string_column datetime_column
0            1         10.5         apple      2020-01-03
1            2         20.3           NaN      2020-01-04
2            3          NaN        banana      2020-01-05
3            4         40.1        cherry      2020-01-06
4            5         50.2           NaN      2020-01-07

The alternative: skrub.Cleaner

By default, the Cleaner applies a set of transformations that covers many common sanitization needs:

from skrub import Cleaner
df_clean = Cleaner().fit_transform(df)
df_clean
   numerical_1  numerical_2 string_column datetime_column
0            1         10.5         apple      2020-01-03
1            2         20.3          None      2020-01-04
2            3          NaN        banana      2020-01-05
3            4         40.1        cherry      2020-01-06
4            5         50.2          None      2020-01-07

We can see that the cleaned version of the dataframe now marks missing values correctly, and that the datetime column has been parsed into a proper datetime dtype:

df_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   numerical_1      5 non-null      int64         
 1   numerical_2      4 non-null      float64       
 2   string_column    3 non-null      object        
 3   datetime_column  5 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 292.0+ bytes
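The defaults can be customized. In recent skrub versions, the Cleaner accepts the same column-dropping options as the DropUninformative transformer described in the next section, plus a datetime_format parameter; treat the exact names below as assumptions to verify against help(Cleaner) in your version:

from skrub import Cleaner

# A sketch of a customized Cleaner. The parameter names mirror the
# DropUninformative options described below; check help(Cleaner) for
# the exact signature in your skrub version.
cleaner = Cleaner(
    drop_if_constant=True,       # also drop columns with a single repeated value
    drop_null_fraction=0.5,      # drop columns with more than 50% missing values
    datetime_format="%d %b %Y",  # parse datetimes with an explicit format
)
df_clean = cleaner.fit_transform(df)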

Under the hood: DropUninformative

When the Cleaner is fitted on a dataframe, it checks whether the dataframe includes uninformative columns, that is, columns that bring no useful information for training an ML model and should therefore be dropped.

This is done by the DropUninformative transformer, a standalone transformer that the Cleaner leverages to sanitize data. DropUninformative marks a column as “uninformative” if it satisfies one of these conditions:

  • The fraction of missing values is larger than the threshold provided by the user with drop_null_fraction.
    • By default, this threshold is 1.0, i.e., only columns that contain only missing values are dropped.
    • Setting the threshold to None will disable this check and therefore retain empty columns.
  • It contains only one value, and no missing values.
    • This is controlled by the drop_if_constant flag, which is False by default.
  • All values in the column are distinct.
    • This may be the case if the column contains UIDs, but it can also happen when the column contains text.
    • This check is off by default and can be turned on by setting drop_if_unique to True.

For example, consider this dataframe, which contains an identifier column, a mostly-missing column, and a constant column:

import pandas as pd
import numpy as np

df2 = pd.DataFrame({
    "id": [101, 102, 103, 104, 105],
    "age": [34, 28, 45, 52, 39],
    "salary": [70000, None, 85000, None, None], 
    "department": ["HR", "Finance", "IT", "HR", "Finance"],
    "constant_col": ["active"] * 5,  # single constant value
})
df2
    id  age   salary department constant_col
0  101   34  70000.0         HR       active
1  102   28      NaN    Finance       active
2  103   45  85000.0         IT       active
3  104   52      NaN         HR       active
4  105   39      NaN    Finance       active

We can drop the uninformative columns by applying DropUninformative to every column with ApplyToCols:

from skrub import ApplyToCols, DropUninformative

drop = ApplyToCols(DropUninformative(drop_if_constant=True, drop_null_fraction=0.5))
drop.fit_transform(df2)
    id  age department
0  101   34         HR
1  102   28    Finance
2  103   45         IT
3  104   52         HR
4  105   39    Finance
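
The id column was kept even though all of its values are distinct, because drop_if_unique is False by default. Enabling it drops identifier-like columns as well; note, as an assumption to verify on your skrub version, that the all-distinct check may also catch the age column here, since its five values happen to be distinct:

# Additionally drop columns in which every value is distinct (e.g. IDs).
# Caution: on this 5-row sample "age" is also all-distinct, so it may be
# dropped too, depending on how your skrub version treats numeric columns.
drop = ApplyToCols(
    DropUninformative(
        drop_if_constant=True, drop_null_fraction=0.5, drop_if_unique=True
    )
)
drop.fit_transform(df2)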

Conclusion

In this chapter we have covered how the skrub Cleaner helps sanitize data by implementing a number of common transformations needed to ensure that the data entering a pipeline are consistent and can be used as intended by ML models.

  • The Cleaner object automates common data cleaning steps, making it easy to prepare tabular data for analysis or modeling with minimal configuration.
  • Cleaner leverages transformers like DropUninformative to automatically remove columns that are empty, constant, or made entirely of distinct values (such as IDs), based on user-configurable thresholds.
  • The DropUninformative transformer can also be used directly for fine-grained control over which columns to drop, according to missing value fraction, constant values, or uniqueness.
  • skrub objects follow the scikit-learn transformer API and work seamlessly with pandas DataFrames, streamlining the preprocessing workflow and reducing the need for manual data cleaning code (see the sketch after this list).
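
Since the Cleaner is a scikit-learn compatible transformer, it can serve as the first step of a modeling pipeline. A minimal sketch, where TableVectorizer and the choice of estimator are illustrative assumptions rather than part of this chapter:

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from skrub import Cleaner, TableVectorizer

# Hypothetical end-to-end pipeline: Cleaner sanitizes the raw table,
# TableVectorizer encodes the cleaned columns as numeric features, and
# the regressor (an illustrative choice) is fitted on the result.
model = make_pipeline(Cleaner(), TableVectorizer(), HistGradientBoostingRegressor())
# model.fit(X, y)  # X: raw dataframe of features, y: prediction target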

In the next chapter we will see how skrub helps with applying this and other transformations to specific columns in the data.

Exercise

Path to the exercise: ../notebooks/01_ex_explore_clean.html