4  Preprocessing data with the skrub Cleaner

4.1 Introduction

In this chapter, we show how to quickly preprocess and sanitize data using skrub’s Cleaner.

We first load the wine dataset from the local repository. This dataset is a downsampled version of the OpenML dataset (id=42074).

from skrub import TableReport
import pandas as pd

wine = pd.read_csv("../data/wine/data.csv")

We can explore it using the TableReport:

TableReport(wine)

We can notice that a few columns contain a sizable fraction of missing values (“region_2” and “designation”). If we want to remove these columns programmatically using pandas, we have to write something like this:

wine.loc[:, wine.isnull().mean() <= 0.3]
country description points price province region_1 variety winery
0 New Zealand An obvious, unsubtle Chardonnay that flashes p... 85 12.0 Gisborne NaN Chardonnay Brancott
1 Italy Coffee bean, leather and tobacco tones are sur... 87 NaN Piedmont Roero Nebbiolo Deltetto
2 Portugal Bottled in June 2009, nearly four years after ... 91 25.0 Douro NaN Portuguese Red Mário Braga
3 Chile Dusty, mild red fruit aromas bring hints of fl... 84 10.0 Colchagua Valley NaN Merlot Santa Carolina
4 US The most massive, dense and ageworthy of all t... 92 56.0 Washington Red Mountain Cabernet Sauvignon Sparkman
... ... ... ... ... ... ... ... ...
9995 Spain Unusual in that this cava hails from the Riber... 85 16.0 Catalonia Cava Sparkling Blend Finca Torremilanos
9996 US In this newest vintage of Oriana, the Riesling... 90 24.0 Washington Columbia Valley (WA) White Blend Brian Carter Cellars
9997 Italy Founded in 1918 by one of the grandfathers and... 95 NaN Piedmont Barolo Nebbiolo Cantina Bartolo Mascarello
9998 Italy Perticaia delivers a gorgeous Trebbiano with r... 89 NaN Central Italy Umbria Trebbiano Perticaia
9999 France Floral wine, light in structure with some fres... 87 NaN Bordeaux Pessac-Léognan Bordeaux-style Red Blend Château Les Carmes Haut-Brion

10000 rows × 8 columns

It may also be beneficial to convert numerical features to float32, to reduce memory usage and computational cost:

wine.astype({col: "float32" for col in wine.select_dtypes(include="number").columns})
country description designation points price province region_1 region_2 variety winery
0 New Zealand An obvious, unsubtle Chardonnay that flashes p... Unoaked 85.0 12.0 Gisborne NaN NaN Chardonnay Brancott
1 Italy Coffee bean, leather and tobacco tones are sur... Braja Riserva 87.0 NaN Piedmont Roero NaN Nebbiolo Deltetto
2 Portugal Bottled in June 2009, nearly four years after ... Quinta do Mourão Rio Bom Colheita 91.0 25.0 Douro NaN NaN Portuguese Red Mário Braga
3 Chile Dusty, mild red fruit aromas bring hints of fl... Reserva 84.0 10.0 Colchagua Valley NaN NaN Merlot Santa Carolina
4 US The most massive, dense and ageworthy of all t... Kingpin 92.0 56.0 Washington Red Mountain Columbia Valley Cabernet Sauvignon Sparkman
... ... ... ... ... ... ... ... ... ... ...
9995 Spain Unusual in that this cava hails from the Riber... Peñalba López Brut Nature 85.0 16.0 Catalonia Cava NaN Sparkling Blend Finca Torremilanos
9996 US In this newest vintage of Oriana, the Riesling... Oriana White 90.0 24.0 Washington Columbia Valley (WA) Columbia Valley White Blend Brian Carter Cellars
9997 Italy Founded in 1918 by one of the grandfathers and... NaN 95.0 NaN Piedmont Barolo NaN Nebbiolo Cantina Bartolo Mascarello
9998 Italy Perticaia delivers a gorgeous Trebbiano with r... NaN 89.0 NaN Central Italy Umbria NaN Trebbiano Perticaia
9999 France Floral wine, light in structure with some fres... NaN 87.0 NaN Bordeaux Pessac-Léognan NaN Bordeaux-style Red Blend Château Les Carmes Haut-Brion

10000 rows × 10 columns

These operations are common to most data preparation workflows (although the exact parameters and requirements vary by project), so writing the code that addresses them quickly becomes repetitive.

A simpler way of dealing with this preliminary preparation is to use the skrub Cleaner.

4.2 Using the skrub Cleaner

The Cleaner is intended to be a first step in preparing tabular data for analysis or modeling, and can handle a variety of common data cleaning tasks automatically. It is designed to work out-of-the-box with minimal configuration, although it is also possible to customize its behavior if needed.

Given a dataframe, the Cleaner applies a sequence of transformers to each column.

Consider this example dataframe:

df = pd.DataFrame(
    {
        "numerical_1": [1, 2, 3, 4, 5],
        "numerical_2": [10.5, 20.3, None, 40.1, 50.2],
        "string_column": ["apple", "?", "banana", "cherry", "?"],
        "datetime_column": [
            "03 Jan 2020",
            "04 Jan 2020",
            "05 Jan 2020",
            "06 Jan 2020",
            "07 Jan 2020",
        ],
        "all_none": [None, None, None, None, None],
    }
)
df
numerical_1 numerical_2 string_column datetime_column all_none
0 1 10.5 apple 03 Jan 2020 None
1 2 20.3 ? 04 Jan 2020 None
2 3 NaN banana 05 Jan 2020 None
3 4 40.1 cherry 06 Jan 2020 None
4 5 50.2 ? 07 Jan 2020 None

This dataframe has columns of mixed types, with some missing values denoted as None and others as "?". The datetime column uses a non-standard format and has been parsed as a string column. Finally, one of the columns is completely empty.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   numerical_1      5 non-null      int64  
 1   numerical_2      4 non-null      float64
 2   string_column    5 non-null      object 
 3   datetime_column  5 non-null      object 
 4   all_none         0 non-null      object 
dtypes: float64(1), int64(1), object(3)
memory usage: 332.0+ bytes

4.2.1 Using plain pandas

Cleaning this dataset using plain pandas may require writing code like this:

# Parse the datetime strings with a specific format
df['datetime_column'] = pd.to_datetime(df['datetime_column'], format='%d %b %Y')

# Drop columns with at most one unique (non-null) value
df_clean = df.loc[:, df.nunique(dropna=True) > 1]

# Function to drop columns with only missing values or empty strings
def drop_empty_columns(df):
    # Drop columns with only missing values
    df_clean = df.dropna(axis=1, how='all')
    # Drop columns with only empty strings
    empty_string_cols = df_clean.columns[df_clean.eq('').all()]
    df_clean = df_clean.drop(columns=empty_string_cols)
    return df_clean

# Apply the function to the DataFrame
df_clean = drop_empty_columns(df_clean)
df_clean
numerical_1 numerical_2 string_column datetime_column
0 1 10.5 apple 2020-01-03
1 2 20.3 ? 2020-01-04
2 3 NaN banana 2020-01-05
3 4 40.1 cherry 2020-01-06
4 5 50.2 ? 2020-01-07

4.2.2 The alternative: skrub.Cleaner

By default, the Cleaner applies various transformations that can sanitize many common use cases:

from skrub import Cleaner
df_clean = Cleaner().fit_transform(df)
df_clean
numerical_1 numerical_2 string_column datetime_column
0 1 10.5 apple 2020-01-03
1 2 20.3 None 2020-01-04
2 3 NaN banana 2020-01-05
3 4 40.1 cherry 2020-01-06
4 5 50.2 None 2020-01-07

We can see that the cleaned version of the dataframe now marks missing values correctly, and that the datetime column has been parsed into a proper datetime dtype:

df_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   numerical_1      5 non-null      int64         
 1   numerical_2      4 non-null      float64       
 2   string_column    3 non-null      object        
 3   datetime_column  5 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 292.0+ bytes

4.2.3 Cleaning steps performed by the Cleaner

In more detail, the Cleaner executes the following steps in order (a small sketch illustrating a few of them follows the list):

  1. It replaces common strings used to represent missing values (e.g., NULL, ?) with NA markers.
  2. It uses the DropUninformative transformer to decide whether a column is “uninformative”, that is, unlikely to provide information useful for training an ML model. For example, empty columns are uninformative.
  3. It tries to parse datetime columns using common formats, or a user-provided datetime_format.
  4. It processes categorical columns to ensure consistent typing depending on the dataframe library in use.
  5. It converts columns to string, unless they have a data type that carries more information, such as numerical, datetime, and categorical columns.
  6. It automatically parses numbers written as strings (“1.324”) to convert them to actual numerical columns.
  7. Finally, it can convert numerical columns to np.float32 dtype if called with the parameter numeric_dtype="float32". This ensures a consistent representation of numbers and missing values, and helps reduce the memory footprint.
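
To make a few of these steps concrete, here is a minimal sketch based on the behavior described above; the toy column names and values are invented for illustration:

import pandas as pd
from skrub import Cleaner

toy = pd.DataFrame(
    {
        # numbers stored as strings, with "?" marking a missing value (steps 1 and 6)
        "price_as_text": ["1.5", "2.0", "?", "4.25"],
        # dates stored as strings, to be parsed as datetimes (step 3)
        "date_as_text": ["2021-01-03", "2021-01-04", "2021-01-05", "2021-01-06"],
    }
)

# numeric_dtype="float32" additionally converts numerical columns to float32 (step 7)
toy_clean = Cleaner(numeric_dtype="float32").fit_transform(toy)
# price_as_text should now be a float32 column with a missing value,
# and date_as_text a datetime column
toy_clean.dtypes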

We can look back at the “wine” dataframe and clean it with a suitably configured Cleaner:

cleaner = Cleaner(drop_null_fraction=0.3, numeric_dtype="float32")

cleaner.fit_transform(wine)
country description points price province region_1 variety winery
0 New Zealand An obvious, unsubtle Chardonnay that flashes p... 85.0 12.0 Gisborne NaN Chardonnay Brancott
1 Italy Coffee bean, leather and tobacco tones are sur... 87.0 NaN Piedmont Roero Nebbiolo Deltetto
2 Portugal Bottled in June 2009, nearly four years after ... 91.0 25.0 Douro NaN Portuguese Red Mário Braga
3 Chile Dusty, mild red fruit aromas bring hints of fl... 84.0 10.0 Colchagua Valley NaN Merlot Santa Carolina
4 US The most massive, dense and ageworthy of all t... 92.0 56.0 Washington Red Mountain Cabernet Sauvignon Sparkman
... ... ... ... ... ... ... ... ...
9995 Spain Unusual in that this cava hails from the Riber... 85.0 16.0 Catalonia Cava Sparkling Blend Finca Torremilanos
9996 US In this newest vintage of Oriana, the Riesling... 90.0 24.0 Washington Columbia Valley (WA) White Blend Brian Carter Cellars
9997 Italy Founded in 1918 by one of the grandfathers and... 95.0 NaN Piedmont Barolo Nebbiolo Cantina Bartolo Mascarello
9998 Italy Perticaia delivers a gorgeous Trebbiano with r... 89.0 NaN Central Italy Umbria Trebbiano Perticaia
9999 France Floral wine, light in structure with some fres... 87.0 NaN Bordeaux Pessac-Léognan Bordeaux-style Red Blend Château Les Carmes Haut-Brion

10000 rows × 8 columns

4.3 Under the hood: DropUninformative

When the Cleaner is fitted on a dataframe, it checks whether the dataframe includes uninformative columns, that is, columns that do not provide useful information for training an ML model and should therefore be dropped.

This is done by the DropUninformative transformer, a standalone transformer that the Cleaner leverages to sanitize data. DropUninformative marks a column as “uninformative” if it satisfies one of the following conditions (a small example follows the list):

  • The fraction of missing values is larger than the threshold provided by the user with drop_null_fraction.
    • By default, this threshold is 1.0, i.e., only columns that contain only missing values are dropped.
    • Setting the threshold to None will disable this check and therefore retain empty columns.
  • It contains only one value, and no missing values.
    • This is controlled by the drop_if_constant flag, which is False by default.
  • All values in the column are distinct.
    • This may be the case if the column contains UIDs, but it can also happen when the column contains text.
    • This check is off by default and can be turned on by setting drop_if_unique to True.
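
To see two of these checks in action, here is a minimal sketch using the Cleaner parameters drop_null_fraction and drop_if_constant, which control the corresponding DropUninformative checks; the toy dataframe is invented for illustration:

import pandas as pd
from skrub import Cleaner

toy = pd.DataFrame(
    {
        "all_missing": [None, None, None, None],       # fraction of nulls is 1.0
        "constant": ["same", "same", "same", "same"],  # a single repeated value
        "informative": [1, 2, 3, 2],
    }
)

# "all_missing" exceeds the 0.5 null fraction threshold and is dropped;
# "constant" is dropped because drop_if_constant=True;
# only "informative" should remain in the output
Cleaner(drop_null_fraction=0.5, drop_if_constant=True).fit_transform(toy)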

4.4 Conclusion

In this chapter we have covered how the skrub Cleaner helps sanitize data by implementing a number of common transformations needed to ensure that the data flowing through a pipeline are consistent and can be used as intended by ML models.

In the next chapter we will see how skrub helps with applying this and other transformations to specific columns in the data.

5 Exercise: clean a dataframe using the Cleaner

Path to the exercise: content/exercises/02_cleaning_data.ipynb

Load the given dataframe.

import pandas as pd
df = pd.read_csv("../data/cleaner_data.csv")

Use the TableReport to answer the following questions:

  • Are there constant columns?
  • Are there datetime columns? If so, were they parsed correctly?
  • What is the dtype of the numerical features?

from skrub import TableReport
TableReport(df)

Then, use the Cleaner to sanitize the data so that:

  • Constant columns are removed
  • Datetimes are parsed properly (hint: use "%d-%b-%Y" as the datetime format)
  • All columns with more than 50% missing values are removed
  • Numerical features are converted to float32

from skrub import Cleaner

# Write your answer here
# 
# 
# 
# 
# 
# 
# 
# 
# solution
from skrub import Cleaner

cleaner = Cleaner(
    drop_if_constant=True,
    drop_null_fraction=0.5,
    numeric_dtype="float32",
    datetime_format="%d-%b-%Y",
)

# Apply the cleaner
df_cleaned = cleaner.fit_transform(df)

# Display the cleaned dataframe
TableReport(df_cleaned)

We can inspect which columns were dropped and what transformations were applied:

print(f"Original shape: {df.shape}")
print(f"Cleaned shape: {df_cleaned.shape}")
print(
    f"\nColumns dropped: {[col for col in df.columns if col not in cleaner.all_outputs_]}"
)
Original shape: (10000, 12)
Cleaned shape: (10000, 10)

Columns dropped: ['with_nulls_1', 'contract_type_1']