4  Preprocessing data with the skrub Cleaner

4.1 Introduction

In this chapter, we show how to quickly preprocess and sanitize data using skrub’s Cleaner.

We first load the wine dataset from the local repository. This dataset is a downsampled version of the OpenML dataset (id=42074).

from skrub import TableReport
import pandas as pd

wine = pd.read_csv("../data/wine/data.csv")

We can explore it using the TableReport:

TableReport(wine)

We can notice that a few columns contain a sizable fraction of missing values (“region_2” and “designation”). If we want to remove these columns programmatically using pandas, we have to write something like this:

wine.loc[:, wine.isnull().mean() <= 0.3]
country description points price province region_1 variety winery
0 New Zealand An obvious, unsubtle Chardonnay that flashes p... 85 12.0 Gisborne NaN Chardonnay Brancott
1 Italy Coffee bean, leather and tobacco tones are sur... 87 NaN Piedmont Roero Nebbiolo Deltetto
2 Portugal Bottled in June 2009, nearly four years after ... 91 25.0 Douro NaN Portuguese Red Mário Braga
3 Chile Dusty, mild red fruit aromas bring hints of fl... 84 10.0 Colchagua Valley NaN Merlot Santa Carolina
4 US The most massive, dense and ageworthy of all t... 92 56.0 Washington Red Mountain Cabernet Sauvignon Sparkman
... ... ... ... ... ... ... ... ...
9995 Spain Unusual in that this cava hails from the Riber... 85 16.0 Catalonia Cava Sparkling Blend Finca Torremilanos
9996 US In this newest vintage of Oriana, the Riesling... 90 24.0 Washington Columbia Valley (WA) White Blend Brian Carter Cellars
9997 Italy Founded in 1918 by one of the grandfathers and... 95 NaN Piedmont Barolo Nebbiolo Cantina Bartolo Mascarello
9998 Italy Perticaia delivers a gorgeous Trebbiano with r... 89 NaN Central Italy Umbria Trebbiano Perticaia
9999 France Floral wine, light in structure with some fres... 87 NaN Bordeaux Pessac-Léognan Bordeaux-style Red Blend Château Les Carmes Haut-Brion

10000 rows × 8 columns

It may also be beneficial to convert numerical features to float32, to reduce memory usage and computational cost:

wine.astype({col: "float32" for col in wine.select_dtypes(include="number").columns})
country description designation points price province region_1 region_2 variety winery
0 New Zealand An obvious, unsubtle Chardonnay that flashes p... Unoaked 85.0 12.0 Gisborne NaN NaN Chardonnay Brancott
1 Italy Coffee bean, leather and tobacco tones are sur... Braja Riserva 87.0 NaN Piedmont Roero NaN Nebbiolo Deltetto
2 Portugal Bottled in June 2009, nearly four years after ... Quinta do Mourão Rio Bom Colheita 91.0 25.0 Douro NaN NaN Portuguese Red Mário Braga
3 Chile Dusty, mild red fruit aromas bring hints of fl... Reserva 84.0 10.0 Colchagua Valley NaN NaN Merlot Santa Carolina
4 US The most massive, dense and ageworthy of all t... Kingpin 92.0 56.0 Washington Red Mountain Columbia Valley Cabernet Sauvignon Sparkman
... ... ... ... ... ... ... ... ... ... ...
9995 Spain Unusual in that this cava hails from the Riber... Peñalba López Brut Nature 85.0 16.0 Catalonia Cava NaN Sparkling Blend Finca Torremilanos
9996 US In this newest vintage of Oriana, the Riesling... Oriana White 90.0 24.0 Washington Columbia Valley (WA) Columbia Valley White Blend Brian Carter Cellars
9997 Italy Founded in 1918 by one of the grandfathers and... NaN 95.0 NaN Piedmont Barolo NaN Nebbiolo Cantina Bartolo Mascarello
9998 Italy Perticaia delivers a gorgeous Trebbiano with r... NaN 89.0 NaN Central Italy Umbria NaN Trebbiano Perticaia
9999 France Floral wine, light in structure with some fres... NaN 87.0 NaN Bordeaux Pessac-Léognan NaN Bordeaux-style Red Blend Château Les Carmes Haut-Brion

10000 rows × 10 columns

These operations are common to most data preparation workflows (although the exact parameters and requirements vary by project), so writing the code that addresses them quickly becomes repetitive.

A simpler way of dealing with this preliminary preparation is to use the skrub Cleaner.

4.2 Using the skrub Cleaner

The Cleaner is intended to be a first step in preparing tabular data for analysis or modeling, and can handle a variety of common data cleaning tasks automatically. It is designed to work out-of-the-box with minimal configuration, although it is also possible to customize its behavior if needed.

Given a dataframe, the Cleaner applies a sequence of transformers to each column.

Consider this example dataframe:

df = pd.DataFrame(
    {
        "numerical_1": [1, 2, 3, 4, 5],
        "numerical_2": [10.5, 20.3, None, 40.1, 50.2],
        "string_column": ["apple", "?", "banana", "cherry", "?"],
        "datetime_column": [
            "03 Jan 2020",
            "04 Jan 2020",
            "05 Jan 2020",
            "06 Jan 2020",
            "07 Jan 2020",
        ],
        "all_none": [None, None, None, None, None],
    }
)
df
numerical_1 numerical_2 string_column datetime_column all_none
0 1 10.5 apple 03 Jan 2020 None
1 2 20.3 ? 04 Jan 2020 None
2 3 NaN banana 05 Jan 2020 None
3 4 40.1 cherry 06 Jan 2020 None
4 5 50.2 ? 07 Jan 2020 None

This dataframe has columns of mixed types, with some missing values denoted as None and others as "?". The datetime column uses a non-standard format and has been parsed as a string column. Finally, one of the columns is completely empty.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   numerical_1      5 non-null      int64  
 1   numerical_2      4 non-null      float64
 2   string_column    5 non-null      object 
 3   datetime_column  5 non-null      object 
 4   all_none         0 non-null      object 
dtypes: float64(1), int64(1), object(3)
memory usage: 332.0+ bytes

4.2.1 Using plain pandas

Cleaning this dataset using plain pandas may require writing code like this:

# Parse the datetime strings with a specific format
df['datetime_column'] = pd.to_datetime(df['datetime_column'], format='%d %b %Y')

# Drop columns with at most one unique (non-null) value
df_clean = df.loc[:, df.nunique(dropna=True) > 1]

# Function to drop columns with only missing values or empty strings
def drop_empty_columns(df):
    # Drop columns with only missing values
    df_clean = df.dropna(axis=1, how='all')
    # Drop columns with only empty strings
    empty_string_cols = df_clean.columns[df_clean.eq('').all()]
    df_clean = df_clean.drop(columns=empty_string_cols)
    return df_clean

# Apply the function to the DataFrame
df_clean = drop_empty_columns(df_clean)
df_clean
numerical_1 numerical_2 string_column datetime_column
0 1 10.5 apple 2020-01-03
1 2 20.3 ? 2020-01-04
2 3 NaN banana 2020-01-05
3 4 40.1 cherry 2020-01-06
4 5 50.2 ? 2020-01-07

4.2.2 The alternative: skrub.Cleaner

By default, the Cleaner applies various transformations that can sanitize many common use cases:

from skrub import Cleaner
df_clean = Cleaner().fit_transform(df)
df_clean
numerical_1 numerical_2 string_column datetime_column
0 1 10.5 apple 2020-01-03
1 2 20.3 None 2020-01-04
2 3 NaN banana 2020-01-05
3 4 40.1 cherry 2020-01-06
4 5 50.2 None 2020-01-07

We can see that the cleaned version of the dataframe now marks missing values correctly, and that the datetime column has been parsed into a proper datetime dtype:

df_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   numerical_1      5 non-null      int64         
 1   numerical_2      4 non-null      float64       
 2   string_column    3 non-null      object        
 3   datetime_column  5 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 292.0+ bytes

4.2.3 Cleaning steps performed by the Cleaner

In more detail, the Cleaner executes the following steps in order (a small sketch illustrating a few of them follows the list):

  1. It replaces common strings used to represent missing values (e.g., NULL, ?) with NA markers.
  2. It uses the DropUninformative transformer to decide whether a column is “uninformative”, that is, unlikely to provide information useful for training an ML model. For example, empty columns are uninformative.
  3. It tries to parse datetime columns using common formats, or a user-provided datetime_format.
  4. It processes categorical columns to ensure consistent typing depending on the dataframe library in use.
  5. It converts columns to string, unless they have a data type that carries more information, such as numerical, datetime, and categorical columns.
  6. It automatically parses numbers written as strings (“1.324”) to convert them to actual numerical columns.
  7. Finally, it can convert numerical columns to np.float32 dtype if called with the parameter numeric_dtype="float32". This ensures a consistent representation of numbers and missing values, and helps reduce the memory footprint.
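
To make a few of these steps concrete, here is a minimal sketch based on the behavior described above; the toy column names and values are invented for illustration:

import pandas as pd
from skrub import Cleaner

toy = pd.DataFrame(
    {
        # numbers stored as strings, with "?" marking a missing value (steps 1 and 6)
        "price_as_text": ["1.5", "2.0", "?", "4.25"],
        # dates stored as strings, to be parsed as datetimes (step 3)
        "date_as_text": ["2021-01-03", "2021-01-04", "2021-01-05", "2021-01-06"],
    }
)

# numeric_dtype="float32" additionally converts numerical columns to float32 (step 7)
toy_clean = Cleaner(numeric_dtype="float32").fit_transform(toy)
# price_as_text should now be a float32 column with a missing value,
# and date_as_text a datetime column
toy_clean.dtypes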

We can look back at the “wine” dataframe and clean it with a suitably configured Cleaner:

cleaner = Cleaner(drop_null_fraction=0.3, numeric_dtype="float32")

cleaner.fit_transform(wine)
country description points price province region_1 variety winery
0 New Zealand An obvious, unsubtle Chardonnay that flashes p... 85.0 12.0 Gisborne NaN Chardonnay Brancott
1 Italy Coffee bean, leather and tobacco tones are sur... 87.0 NaN Piedmont Roero Nebbiolo Deltetto
2 Portugal Bottled in June 2009, nearly four years after ... 91.0 25.0 Douro NaN Portuguese Red Mário Braga
3 Chile Dusty, mild red fruit aromas bring hints of fl... 84.0 10.0 Colchagua Valley NaN Merlot Santa Carolina
4 US The most massive, dense and ageworthy of all t... 92.0 56.0 Washington Red Mountain Cabernet Sauvignon Sparkman
... ... ... ... ... ... ... ... ...
9995 Spain Unusual in that this cava hails from the Riber... 85.0 16.0 Catalonia Cava Sparkling Blend Finca Torremilanos
9996 US In this newest vintage of Oriana, the Riesling... 90.0 24.0 Washington Columbia Valley (WA) White Blend Brian Carter Cellars
9997 Italy Founded in 1918 by one of the grandfathers and... 95.0 NaN Piedmont Barolo Nebbiolo Cantina Bartolo Mascarello
9998 Italy Perticaia delivers a gorgeous Trebbiano with r... 89.0 NaN Central Italy Umbria Trebbiano Perticaia
9999 France Floral wine, light in structure with some fres... 87.0 NaN Bordeaux Pessac-Léognan Bordeaux-style Red Blend Château Les Carmes Haut-Brion

10000 rows × 8 columns

4.3 Under the hood: DropUninformative

When the Cleaner is fitted on a dataframe, it checks whether the dataframe includes uninformative columns, that is, columns that do not provide useful information for training an ML model and should therefore be dropped.

This is done by the DropUninformative transformer, a standalone transformer that the Cleaner leverages to sanitize data. DropUninformative marks a column as “uninformative” if it satisfies one of the following conditions (a small example follows the list):

  • The fraction of missing values is larger than the threshold provided by the user with drop_null_fraction.
    • By default, this threshold is 1.0, i.e., only columns that contain only missing values are dropped.
    • Setting the threshold to None will disable this check and therefore retain empty columns.
  • It contains only one value, and no missing values.
    • This is controlled by the drop_if_constant flag, which is False by default.
  • All values in the column are distinct.
    • This may be the case if the column contains UIDs, but it can also happen when the column contains text.
    • This check is off by default and can be turned on by setting drop_if_unique to True.
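
To see two of these checks in action, here is a minimal sketch using the Cleaner parameters drop_null_fraction and drop_if_constant, which control the corresponding DropUninformative checks; the toy dataframe is invented for illustration:

import pandas as pd
from skrub import Cleaner

toy = pd.DataFrame(
    {
        "all_missing": [None, None, None, None],       # fraction of nulls is 1.0
        "constant": ["same", "same", "same", "same"],  # a single repeated value
        "informative": [1, 2, 3, 2],
    }
)

# "all_missing" exceeds the 0.5 null fraction threshold and is dropped;
# "constant" is dropped because drop_if_constant=True;
# only "informative" should remain in the output
Cleaner(drop_null_fraction=0.5, drop_if_constant=True).fit_transform(toy)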

4.4 Conclusion

In this chapter we have covered how the skrub Cleaner helps sanitize data by implementing a number of common transformations needed to ensure that the data flowing through a pipeline are consistent and can be used as intended by ML models.

In the next chapter we will see how skrub helps with applying this and other transformations to specific columns in the data.

5 Exercise: clean a dataframe using the Cleaner

Path to the exercise: content/exercises/02_cleaning_data.ipynb

Load the given dataframe.

import pandas as pd
df = pd.read_csv("../data/cleaner_data.csv")

Use the TableReport to answer the following questions:

  • Are there constant columns?
  • Are there datetime columns? If so, were they parsed correctly?
  • What is the dtype of the numerical features?

from skrub import TableReport
TableReport(df)

Then, use the Cleaner to sanitize the data so that:

  • Constant columns are removed
  • Datetimes are parsed properly (hint: use "%d-%b-%Y" as the datetime format)
  • All columns with more than 50% missing values are removed
  • Numerical features are converted to float32

from skrub import Cleaner

# Write your answer here
# 
# 
# 
# 
# 
# 
# 
# 
# solution
from skrub import Cleaner

cleaner = Cleaner(
    drop_if_constant=True,
    drop_null_fraction=0.5,
    numeric_dtype="float32",
    datetime_format="%d-%b-%Y",
)

# Apply the cleaner
df_cleaned = cleaner.fit_transform(df)

# Display the cleaned dataframe
TableReport(df_cleaned)

We can inspect which columns were dropped and what transformations were applied:

print(f"Original shape: {df.shape}")
print(f"Cleaned shape: {df_cleaned.shape}")
print(
    f"\nColumns dropped: {[col for col in df.columns if col not in cleaner.all_outputs_]}"
)
Original shape: (10000, 12)
Cleaned shape: (10000, 10)

Columns dropped: ['with_nulls_1', 'contract_type_1']