Preprocessing data with the skrub Cleaner

Introduction

Once a table has been explored, the typical next step is addressing any problems it might have.

This can be described broadly as “cleaning” or “sanitizing” the data.

An example dataframe

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "numerical_1": [1, 2, 3, 4, 5],
        "numerical_2": [10.5, 20.3, None, 40.1, 50.2],
        "string_column": ["apple", "?", "banana", "cherry", "?"],
        "datetime_column": [
            "03 Jan 2020",
            "04 Jan 2020",
            "05 Jan 2020",
            "06 Jan 2020",
            "07 Jan 2020",
        ],
        "all_none": [None, None, None, None, None],
    }
)
df
numerical_1 numerical_2 string_column datetime_column all_none
0 1 10.5 apple 03 Jan 2020 None
1 2 20.3 ? 04 Jan 2020 None
2 3 NaN banana 05 Jan 2020 None
3 4 40.1 cherry 06 Jan 2020 None
4 5 50.2 ? 07 Jan 2020 None

There are some issues!

  • Null values in various columns
  • "?" used as null value marker
  • The datetime column has an unusual format
  • There is a column that is fully empty

Cleaning with pandas

First, let’s look at the datatypes detected by pandas:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   numerical_1      5 non-null      int64  
 1   numerical_2      4 non-null      float64
 2   string_column    5 non-null      object 
 3   datetime_column  5 non-null      object 
 4   all_none         0 non-null      object 
dtypes: float64(1), int64(1), object(3)
memory usage: 332.0+ bytes

The "?" null markers were not detected, and the datetime column was not parsed as such.

Cleaning with pandas

# Parse the datetime strings with a specific format
df['datetime_column'] = pd.to_datetime(df['datetime_column'], format='%d %b %Y')

# Drop columns with only a single unique value
df_clean = df.loc[:, df.nunique(dropna=True) > 1]

# Replace "?" with np.nan in all string columns
df_clean = df_clean.replace("?", np.nan)

Cleaning with pandas

# Function to drop columns with only missing values or empty strings
def drop_empty_columns(df):
    # Drop columns with only missing values
    df_clean = df.dropna(axis=1, how='all')
    # Drop columns with only empty strings
    empty_string_cols = df_clean.columns[df_clean.eq('').all()]
    df_clean = df_clean.drop(columns=empty_string_cols)
    return df_clean

# Apply the function to the DataFrame
df_clean = drop_empty_columns(df_clean)
df_clean
numerical_1 numerical_2 string_column datetime_column
0 1 10.5 apple 2020-01-03
1 2 20.3 NaN 2020-01-04
2 3 NaN banana 2020-01-05
3 4 40.1 cherry 2020-01-06
4 5 50.2 NaN 2020-01-07

The alternative: skrub.Cleaner

By default, the Cleaner applies various transformations that can sanitize many common use cases:

from skrub import Cleaner
df_clean = Cleaner().fit_transform(df)
df_clean
numerical_1 numerical_2 string_column datetime_column
0 1 10.5 apple 2020-01-03
1 2 20.3 None 2020-01-04
2 3 NaN banana 2020-01-05
3 4 40.1 cherry 2020-01-06
4 5 50.2 None 2020-01-07

The alternative: skrub.Cleaner

The cleaned version of the dataframe now marks missing values correctly, and the datetime column has been parsed:

df_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   numerical_1      5 non-null      int64         
 1   numerical_2      4 non-null      float64       
 2   string_column    3 non-null      object        
 3   datetime_column  5 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 292.0+ bytes

What does the Cleaner do in practice?

  1. It replaces strings used to represent null values with None
  2. It removes uninformative columns
  3. It tries to parse datetime columns
  4. It converts columns to string, unless they have an informative type
  5. It parses strings like "1234.6" to numbers
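The first and last of these steps can be approximated in plain pandas (a rough sketch for illustration, not the Cleaner's actual implementation; the sentinel strings used here are assumptions):

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame(
    {
        "amount": ["1234.6", "99.5", "N/A"],  # numbers stored as strings
        "label": ["a", "?", "b"],             # "?" used as a null marker
    }
)

# Step 1 (sketch): replace common null sentinels with proper nulls
raw = raw.replace(["?", "N/A"], np.nan)

# Step 5 (sketch): parse numeric-looking strings into numbers
raw["amount"] = pd.to_numeric(raw["amount"])

print(raw.dtypes)  # "amount" is now float64, "label" stays object
```

The Cleaner applies such conversions automatically and only when a column qualifies, so no per-column calls are needed.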

Note on intended usage

The Cleaner is intended to be used early in preprocessing, so it avoids altering the input dataframe too much.

The assumption is that complex transformations are done in later steps by the user.

This is different from what is done by the TableVectorizer.

Under the hood: DropUninformative

The Cleaner uses heuristics to detect columns that are unlikely to bring information useful to an ML model.

This is done by the DropUninformative transformer, which marks a column as “uninformative” if:

  • Its fraction of null values is at or above a certain threshold (1.0 by default, i.e. only fully-null columns are dropped)
  • It contains only one value (and no missing values)
  • All values in a column are distinct

An example for DropUninformative

import pandas as pd
import numpy as np

df2 = pd.DataFrame({
    "id": [101, 102, 103, 104, 105],
    "age": [34, 28, 45, 52, 39],
    "salary": [70000, None, 85000, None, None], 
    "department": ["HR", "Finance", "IT", "HR", "Finance"],
    "constant_col": ["active"] * 5,  # single constant value
})
df2
id age salary department constant_col
0 101 34 70000.0 HR active
1 102 28 NaN Finance active
2 103 45 85000.0 IT active
3 104 52 NaN HR active
4 105 39 NaN Finance active

An example for DropUninformative

from skrub import ApplyToCols, DropUninformative

drop = ApplyToCols(DropUninformative(drop_if_constant=True, drop_null_fraction=0.5))
drop.fit_transform(df2)
id age department
0 101 34 HR
1 102 28 Finance
2 103 45 IT
3 104 52 HR
4 105 39 Finance
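The result above can be reproduced with a plain-pandas sketch of the three heuristics (a hypothetical re-implementation for illustration; the parameter names mirror the example above, but the defaults and exact rules of the real transformer are assumptions):

```python
import pandas as pd

def is_uninformative(
    col: pd.Series,
    drop_null_fraction: float = 1.0,
    drop_if_constant: bool = False,
    drop_if_unique: bool = False,
) -> bool:
    """Rough sketch of the DropUninformative heuristics."""
    # Too many nulls: fraction at or above the threshold
    if col.isna().mean() >= drop_null_fraction:
        return True
    # A single constant value and no missing values
    if drop_if_constant and col.notna().all() and col.nunique() == 1:
        return True
    # Every value distinct (e.g. an identifier column)
    if drop_if_unique and col.dropna().nunique() == len(col):
        return True
    return False

df2 = pd.DataFrame(
    {
        "id": [101, 102, 103, 104, 105],
        "age": [34, 28, 45, 52, 39],
        "salary": [70000, None, 85000, None, None],
        "department": ["HR", "Finance", "IT", "HR", "Finance"],
        "constant_col": ["active"] * 5,
    }
)
kept = [
    c for c in df2.columns
    if not is_uninformative(df2[c], drop_null_fraction=0.5, drop_if_constant=True)
]
print(kept)  # ['id', 'age', 'department'], matching the output above
```

Note that "id" survives even though all its values are distinct: dropping unique-valued columns is a separate, opt-in criterion, and it is not enabled here.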

What we have seen in this chapter

  • The Cleaner automates common data cleaning steps.
  • It parses numbers and datetimes, and replaces null sentinels with actual null values.
  • It detects and possibly removes uninformative columns.
  • It can be applied directly to any dataframe.

Time for the quiz!