Preprocessing data with the skrub Cleaner

Introduction

Once a table has been explored, the typical next step is addressing any problems it might have.

This can be described broadly as “cleaning” or “sanitizing” the data.

An example dataframe

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "numerical_1": [1, 2, 3, 4, 5],
        "numerical_2": [10.5, 20.3, None, 40.1, 50.2],
        "string_column": ["apple", "?", "banana", "cherry", "?"],
        "datetime_column": [
            "03 Jan 2020",
            "04 Jan 2020",
            "05 Jan 2020",
            "06 Jan 2020",
            "07 Jan 2020",
        ],
        "all_none": [None, None, None, None, None],
    }
)
df
numerical_1 numerical_2 string_column datetime_column all_none
0 1 10.5 apple 03 Jan 2020 None
1 2 20.3 ? 04 Jan 2020 None
2 3 NaN banana 05 Jan 2020 None
3 4 40.1 cherry 06 Jan 2020 None
4 5 50.2 ? 07 Jan 2020 None

There are some issues!

  • Null values in various columns
  • "?" used as null value marker
  • The datetime column has an unusual format
  • There is a column that is fully empty

Cleaning with pandas

First, let’s look at the datatypes detected by pandas:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   numerical_1      5 non-null      int64  
 1   numerical_2      4 non-null      float64
 2   string_column    5 non-null      object 
 3   datetime_column  5 non-null      object 
 4   all_none         0 non-null      object 
dtypes: float64(1), int64(1), object(3)
memory usage: 332.0+ bytes

The "?" null markers were not detected, and the datetime column was not parsed as such.

Cleaning with pandas

# Parse the datetime strings with a specific format
df['datetime_column'] = pd.to_datetime(df['datetime_column'], format='%d %b %Y')

# Drop columns with only a single unique value
df_clean = df.loc[:, df.nunique(dropna=True) > 1]

# Replace "?" with np.nan in all string columns
df_clean = df_clean.replace("?", np.nan)

Cleaning with pandas

# Function to drop columns with only missing values or empty strings
def drop_empty_columns(df):
    # Drop columns with only missing values
    df_clean = df.dropna(axis=1, how='all')
    # Drop columns with only empty strings
    empty_string_cols = df_clean.columns[df_clean.eq('').all()]
    df_clean = df_clean.drop(columns=empty_string_cols)
    return df_clean

# Apply the function to the DataFrame
df_clean = drop_empty_columns(df_clean)
df_clean
numerical_1 numerical_2 string_column datetime_column
0 1 10.5 apple 2020-01-03
1 2 20.3 NaN 2020-01-04
2 3 NaN banana 2020-01-05
3 4 40.1 cherry 2020-01-06
4 5 50.2 NaN 2020-01-07

The alternative: skrub.Cleaner

By default, the Cleaner applies various transformations that can sanitize many common use cases:

from skrub import Cleaner
df_clean = Cleaner().fit_transform(df)
df_clean
numerical_1 numerical_2 string_column datetime_column
0 1 10.5 apple 2020-01-03
1 2 20.3 None 2020-01-04
2 3 NaN banana 2020-01-05
3 4 40.1 cherry 2020-01-06
4 5 50.2 None 2020-01-07

The alternative: skrub.Cleaner

The cleaned version of the dataframe now marks missing values correctly, and the datetime column has been parsed:

df_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   numerical_1      5 non-null      int64         
 1   numerical_2      4 non-null      float64       
 2   string_column    3 non-null      object        
 3   datetime_column  5 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 292.0+ bytes

What does the Cleaner do in practice?

  1. It replaces strings used to represent null values with None
  2. It removes uninformative columns
  3. It tries to parse datetime columns
  4. It converts columns to string, unless they have an informative type
  5. It parses strings like "1234.6" to numbers
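The first and last of these steps can be approximated in plain pandas (a rough sketch for illustration, not the Cleaner's actual implementation; the sentinel strings used here are assumptions):

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame(
    {
        "amount": ["1234.6", "99.5", "N/A"],  # numbers stored as strings
        "label": ["a", "?", "b"],             # "?" used as a null marker
    }
)

# Step 1 (sketch): replace common null sentinels with proper nulls
raw = raw.replace(["?", "N/A"], np.nan)

# Step 5 (sketch): parse numeric-looking strings into numbers
raw["amount"] = pd.to_numeric(raw["amount"])

print(raw.dtypes)  # "amount" is now float64, "label" stays object
```

The Cleaner applies such conversions automatically and only when a column qualifies, so no per-column calls are needed.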

Note on intended usage

The Cleaner is intended to be used early in preprocessing, so it avoids altering the input dataframe too much.

The assumption is that complex transformations are done in later steps by the user.

This is different from what is done by the TableVectorizer.

Under the hood: DropUninformative

The Cleaner uses heuristics to detect columns that are unlikely to bring information useful to an ML model.

This is done by the DropUninformative transformer, which marks a column as “uninformative” if:

  • Its fraction of null values is at or above a certain threshold (1.0 by default, i.e. only fully-null columns are dropped)
  • It contains only one value (and no missing values)
  • All values in a column are distinct

An example for DropUninformative

import pandas as pd
import numpy as np

df2 = pd.DataFrame({
    "id": [101, 102, 103, 104, 105],
    "age": [34, 28, 45, 52, 39],
    "salary": [70000, None, 85000, None, None], 
    "department": ["HR", "Finance", "IT", "HR", "Finance"],
    "constant_col": ["active"] * 5,  # single constant value
})
df2
id age salary department constant_col
0 101 34 70000.0 HR active
1 102 28 NaN Finance active
2 103 45 85000.0 IT active
3 104 52 NaN HR active
4 105 39 NaN Finance active

An example for DropUninformative

from skrub import ApplyToCols, DropUninformative

drop = ApplyToCols(DropUninformative(drop_if_constant=True, drop_null_fraction=0.5))
drop.fit_transform(df2)
id age department
0 101 34 HR
1 102 28 Finance
2 103 45 IT
3 104 52 HR
4 105 39 Finance
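The result above can be reproduced with a plain-pandas sketch of the three heuristics (a hypothetical re-implementation for illustration; the parameter names mirror the example above, but the defaults and exact rules of the real transformer are assumptions):

```python
import pandas as pd

def is_uninformative(
    col: pd.Series,
    drop_null_fraction: float = 1.0,
    drop_if_constant: bool = False,
    drop_if_unique: bool = False,
) -> bool:
    """Rough sketch of the DropUninformative heuristics."""
    # Too many nulls: fraction at or above the threshold
    if col.isna().mean() >= drop_null_fraction:
        return True
    # A single constant value and no missing values
    if drop_if_constant and col.notna().all() and col.nunique() == 1:
        return True
    # Every value distinct (e.g. an identifier column)
    if drop_if_unique and col.dropna().nunique() == len(col):
        return True
    return False

df2 = pd.DataFrame(
    {
        "id": [101, 102, 103, 104, 105],
        "age": [34, 28, 45, 52, 39],
        "salary": [70000, None, 85000, None, None],
        "department": ["HR", "Finance", "IT", "HR", "Finance"],
        "constant_col": ["active"] * 5,
    }
)
kept = [
    c for c in df2.columns
    if not is_uninformative(df2[c], drop_null_fraction=0.5, drop_if_constant=True)
]
print(kept)  # ['id', 'age', 'department'], matching the output above
```

Note that "id" survives even though all its values are distinct: dropping unique-valued columns is a separate, opt-in criterion, and it is not enabled here.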

What we have seen in this chapter

  • The Cleaner automates common data cleaning steps.
  • It parses numbers and datetimes, and replaces null sentinels with actual null values.
  • It detects and possibly removes uninformative columns.
  • It can be applied directly to any dataframe.

Time for the quiz!