[Interactive TableReport preview of the wine dataframe: the first and last five of its 10,000 rows, with columns country, description, designation, points, price, province, region_1, region_2, variety and winery.]
| Column | Column name | dtype | Is sorted | Null values | Unique values | Mean | Std | Min | Median | Max |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | country | ObjectDType | False | 0 (0.0%) | 31 (0.3%) | | | | | |
| 1 | description | ObjectDType | False | 0 (0.0%) | 9692 (96.9%) | | | | | |
| 2 | designation | ObjectDType | False | 3085 (30.9%) | 4914 (49.1%) | | | | | |
| 3 | points | Int64DType | False | 0 (0.0%) | 21 (0.2%) | 87.8 | 3.19 | 80 | 88 | 100 |
| 4 | price | Float64DType | False | 897 (9.0%) | 177 (1.8%) | 32.7 | 34.9 | 4.00 | 24.0 | 800. |
| 5 | province | ObjectDType | False | 0 (0.0%) | 251 (2.5%) | | | | | |
| 6 | region_1 | ObjectDType | False | 1675 (16.8%) | 692 (6.9%) | | | | | |
| 7 | region_2 | ObjectDType | False | 5980 (59.8%) | 18 (0.2%) | | | | | |
| 8 | variety | ObjectDType | False | 0 (0.0%) | 310 (3.1%) | | | | | |
| 9 | winery | ObjectDType | False | 0 (0.0%) | 5224 (52.2%) | | | | | |
Several columns contain a sizable fraction of missing values, most notably “region_2” (59.8%) and “designation” (30.9%). To remove such columns programmatically with pandas, we can keep only the columns whose fraction of missing values is at most 30%:
wine.loc[:, wine.isnull().mean() <= 0.3]
| | country | description | points | price | province | region_1 | variety | winery |
|---|---|---|---|---|---|---|---|---|
| 0 | New Zealand | An obvious, unsubtle Chardonnay that flashes p... | 85 | 12.0 | Gisborne | NaN | Chardonnay | Brancott |
| 1 | Italy | Coffee bean, leather and tobacco tones are sur... | 87 | NaN | Piedmont | Roero | Nebbiolo | Deltetto |
| 2 | Portugal | Bottled in June 2009, nearly four years after ... | 91 | 25.0 | Douro | NaN | Portuguese Red | Mário Braga |
| 3 | Chile | Dusty, mild red fruit aromas bring hints of fl... | 84 | 10.0 | Colchagua Valley | NaN | Merlot | Santa Carolina |
| 4 | US | The most massive, dense and ageworthy of all t... | 92 | 56.0 | Washington | Red Mountain | Cabernet Sauvignon | Sparkman |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | Spain | Unusual in that this cava hails from the Riber... | 85 | 16.0 | Catalonia | Cava | Sparkling Blend | Finca Torremilanos |
| 9996 | US | In this newest vintage of Oriana, the Riesling... | 90 | 24.0 | Washington | Columbia Valley (WA) | White Blend | Brian Carter Cellars |
| 9997 | Italy | Founded in 1918 by one of the grandfathers and... | 95 | NaN | Piedmont | Barolo | Nebbiolo | Cantina Bartolo Mascarello |
| 9998 | Italy | Perticaia delivers a gorgeous Trebbiano with r... | 89 | NaN | Central Italy | Umbria | Trebbiano | Perticaia |
| 9999 | France | Floral wine, light in structure with some fres... | 87 | NaN | Bordeaux | Pessac-Léognan | Bordeaux-style Red Blend | Château Les Carmes Haut-Brion |

10000 rows × 8 columns
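To see why this one-liner works, it can help to look at the intermediate values. The sketch below uses a small made-up dataframe (`toy`, not the wine data): `isnull().mean()` gives the fraction of missing values per column, and comparing it with the threshold yields a boolean mask over the columns.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the column mask
toy = pd.DataFrame(
    {
        "mostly_full": [1.0, 2.0, 3.0, None],  # 25% missing
        "half_empty": ["a", None, None, "b"],  # 50% missing
        "complete": [10, 20, 30, 40],          # 0% missing
    }
)

# Fraction of missing values in each column
null_fraction = toy.isnull().mean()

# Keep only the columns whose null fraction is at most 30%
kept = toy.loc[:, null_fraction <= 0.3]

print(null_fraction)
print(list(kept.columns))
```

Here only `half_empty` exceeds the threshold, so it is the only column left out of `kept`.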
It may also be beneficial to convert numerical features to float32, to reduce the computational cost:
wine.astype({col: "float32" for col in wine.select_dtypes(include="number").columns})
| | country | description | designation | points | price | province | region_1 | region_2 | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | New Zealand | An obvious, unsubtle Chardonnay that flashes p... | Unoaked | 85.0 | 12.0 | Gisborne | NaN | NaN | Chardonnay | Brancott |
| 1 | Italy | Coffee bean, leather and tobacco tones are sur... | Braja Riserva | 87.0 | NaN | Piedmont | Roero | NaN | Nebbiolo | Deltetto |
| 2 | Portugal | Bottled in June 2009, nearly four years after ... | Quinta do Mourão Rio Bom Colheita | 91.0 | 25.0 | Douro | NaN | NaN | Portuguese Red | Mário Braga |
| 3 | Chile | Dusty, mild red fruit aromas bring hints of fl... | Reserva | 84.0 | 10.0 | Colchagua Valley | NaN | NaN | Merlot | Santa Carolina |
| 4 | US | The most massive, dense and ageworthy of all t... | Kingpin | 92.0 | 56.0 | Washington | Red Mountain | Columbia Valley | Cabernet Sauvignon | Sparkman |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | Spain | Unusual in that this cava hails from the Riber... | Peñalba López Brut Nature | 85.0 | 16.0 | Catalonia | Cava | NaN | Sparkling Blend | Finca Torremilanos |
| 9996 | US | In this newest vintage of Oriana, the Riesling... | Oriana White | 90.0 | 24.0 | Washington | Columbia Valley (WA) | Columbia Valley | White Blend | Brian Carter Cellars |
| 9997 | Italy | Founded in 1918 by one of the grandfathers and... | NaN | 95.0 | NaN | Piedmont | Barolo | NaN | Nebbiolo | Cantina Bartolo Mascarello |
| 9998 | Italy | Perticaia delivers a gorgeous Trebbiano with r... | NaN | 89.0 | NaN | Central Italy | Umbria | NaN | Trebbiano | Perticaia |
| 9999 | France | Floral wine, light in structure with some fres... | NaN | 87.0 | NaN | Bordeaux | Pessac-Léognan | NaN | Bordeaux-style Red Blend | Château Les Carmes Haut-Brion |

10000 rows × 10 columns
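The effect of the conversion on memory can be checked directly. The following is a minimal sketch on synthetic data (the `toy` dataframe and its values are made up for illustration), comparing the size of the numeric columns before and after casting to float32:

```python
import numpy as np
import pandas as pd

# Synthetic numeric data: one int64 column and one float64 column
toy = pd.DataFrame(
    {
        "points": np.arange(1000, dtype="int64"),
        "price": np.linspace(4.0, 800.0, 1000),
    }
)

# Same conversion as above, applied to every numeric column
numeric_cols = toy.select_dtypes(include="number").columns
toy32 = toy.astype({col: "float32" for col in numeric_cols})

# Bytes used by the column data (index excluded)
bytes_before = toy.memory_usage(index=False).sum()
bytes_after = toy32.memory_usage(index=False).sum()

print(bytes_before, bytes_after)
```

Both source dtypes use 8 bytes per value and float32 uses 4, so the numeric columns take half the space. The trade-off is precision: float32 represents integers exactly only up to 2**24, so large identifiers or counts should not be converted this way.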
Operations like these come up in most projects (although the parameters and requirements vary), so writing the code for them quickly becomes repetitive.
A simpler way to handle this preliminary preparation is to use the skrub Cleaner.
4.2 Using the skrub Cleaner
The Cleaner is intended to be a first step in preparing tabular data for analysis or modeling, and can handle a variety of common data cleaning tasks automatically. It is designed to work out-of-the-box with minimal configuration, although it is also possible to customize its behavior if needed.
Given a dataframe, the Cleaner applies a sequence of transformers to each column.
Consider this example dataframe:
df = pd.DataFrame(
    {
        "numerical_1": [1, 2, 3, 4, 5],
        "numerical_2": [10.5, 20.3, None, 40.1, 50.2],
        "string_column": ["apple", "?", "banana", "cherry", "?"],
        "datetime_column": [
            "03 Jan 2020",
            "04 Jan 2020",
            "05 Jan 2020",
            "06 Jan 2020",
            "07 Jan 2020",
        ],
        "all_none": [None, None, None, None, None],
    }
)
df
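| | numerical_1 | numerical_2 | string_column | datetime_column | all_none |
|---|---|---|---|---|---|
| 0 | 1 | 10.5 | apple | 03 Jan 2020 | None |
| 1 | 2 | 20.3 | ? | 04 Jan 2020 | None |
| 2 | 3 | NaN | banana | 05 Jan 2020 | None |
| 3 | 4 | 40.1 | cherry | 06 Jan 2020 | None |
| 4 | 5 | 50.2 | ? | 07 Jan 2020 | None |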
This dataframe has columns of mixed types, and its missing values are denoted sometimes by None and sometimes by "?". The datetime column has a non-standard format and has been parsed as a string column. Finally, one of the columns is completely empty.
Cleaning this dataset using plain pandas may require writing code like this:
# Parse the datetime strings with a specific format
df['datetime_column'] = pd.to_datetime(df['datetime_column'], format='%d %b %Y')

# Drop columns with only a single unique value
df_clean = df.loc[:, df.nunique(dropna=True) > 1]

# Function to drop columns with only missing values or empty strings
def drop_empty_columns(df):
    # Drop columns with only missing values
    df_clean = df.dropna(axis=1, how='all')
    # Drop columns with only empty strings
    empty_string_cols = df_clean.columns[df_clean.eq('').all()]
    df_clean = df_clean.drop(columns=empty_string_cols)
    return df_clean

# Apply the function to the DataFrame
df_clean = drop_empty_columns(df_clean)
df_clean
| | numerical_1 | numerical_2 | string_column | datetime_column |
|---|---|---|---|---|
| 0 | 1 | 10.5 | apple | 2020-01-03 |
| 1 | 2 | 20.3 | ? | 2020-01-04 |
| 2 | 3 | NaN | banana | 2020-01-05 |
| 3 | 4 | 40.1 | cherry | 2020-01-06 |
| 4 | 5 | 50.2 | ? | 2020-01-07 |
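Note that the pandas version above still leaves the "?" placeholders in the string column. Handling them takes yet another step. A minimal sketch, assuming a hand-picked list of placeholder strings (the `MISSING_MARKERS` list below is illustrative and is not the list skrub uses):

```python
import numpy as np
import pandas as pd

# Assumed set of placeholder strings; adapt it to the data at hand
MISSING_MARKERS = ["?", "NULL", "N/A", ""]

toy = pd.DataFrame({"string_column": ["apple", "?", "banana", "cherry", "?"]})

# Replace every placeholder with a proper missing-value marker
toy_clean = toy.replace(MISSING_MARKERS, np.nan)

print(toy_clean["string_column"].isna().sum())
```

Every such step adds another piece of project-specific code to maintain, which is the motivation for the approach described next.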
4.2.2 The alternative: skrub.Cleaner
By default, the Cleaner applies a set of transformations that cover many common cleaning needs:
from skrub import Cleaner

df_clean = Cleaner().fit_transform(df)
df_clean
| | numerical_1 | numerical_2 | string_column | datetime_column |
|---|---|---|---|---|
| 0 | 1 | 10.5 | apple | 2020-01-03 |
| 1 | 2 | 20.3 | None | 2020-01-04 |
| 2 | 3 | NaN | banana | 2020-01-05 |
| 3 | 4 | 40.1 | cherry | 2020-01-06 |
| 4 | 5 | 50.2 | None | 2020-01-07 |
We can see that the cleaned dataframe now marks missing values correctly, and that the datetime column has been parsed as a datetime.
In more detail, the Cleaner executes the following steps in order:
It replaces common strings used to represent missing values (e.g., NULL, ?) with NA markers.
It uses the DropUninformative transformer to decide whether a column is “uninformative”, that is, unlikely to bring information useful for training an ML model. For example, empty columns are uninformative.
It tries to parse datetime columns using common formats, or a user-provided datetime_format.
It processes categorical columns to ensure consistent typing depending on the dataframe library in use.
It converts columns to string, unless they have a data type that carries more information, such as numerical, datetime, and categorical columns.
It automatically parses numbers written as strings (“1.324”) to convert them to actual numerical columns.
Finally, it can convert numerical columns to np.float32 dtype if called with the parameter numeric_dtype="float32". This ensures a consistent representation of numbers and missing values, and helps reduce the memory footprint.
We can look back at the “wine” dataframe and clean it with a suitably configured Cleaner:
When the Cleaner is fitted on a dataframe, it checks whether the dataframe includes uninformative columns, that is, columns that do not bring useful information for training an ML model and should therefore be dropped.
This is done by the DropUninformative transformer, a standalone transformer that the Cleaner relies on to sanitize data. DropUninformative marks a column as “uninformative” if it satisfies one of these conditions:
The fraction of missing values is larger than the threshold provided by the user with drop_null_fraction.
By default, this threshold is 1.0, i.e., only columns that contain only missing values are dropped.
Setting the threshold to None will disable this check and therefore retain empty columns.
It contains only one value, and no missing values.
This is controlled by the drop_if_constant flag, which is False by default.
All values in the column are distinct.
This may be the case if the column contains UIDs, but it can also happen when the column contains text.
This check is off by default and can be turned on by setting drop_if_unique to True.
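The three conditions can be paraphrased in plain pandas. The helper below (`is_uninformative`, a hypothetical name) is only a sketch of the rules as described above, with the same parameter names and defaults; it is not skrub's implementation, which may treat edge cases differently.

```python
import pandas as pd


def is_uninformative(
    column: pd.Series,
    drop_null_fraction=1.0,
    drop_if_constant=False,
    drop_if_unique=False,
) -> bool:
    """Hypothetical helper mirroring the three rules described above."""
    n_rows = len(column)
    if n_rows == 0:
        return False
    null_fraction = column.isna().mean()

    # Rule 1: too many missing values (None disables the check)
    if drop_null_fraction is not None:
        if drop_null_fraction == 1.0:
            if null_fraction == 1.0:
                return True
        elif null_fraction > drop_null_fraction:
            return True

    # Rule 2: a single value and no missing values
    if drop_if_constant and null_fraction == 0 and column.nunique() == 1:
        return True

    # Rule 3: every value is distinct
    if drop_if_unique and column.nunique(dropna=False) == n_rows:
        return True

    return False


# Made-up data exercising each rule
toy = pd.DataFrame(
    {
        "all_none": [None, None, None, None],
        "constant": ["full-time"] * 4,
        "user_id": ["a1", "b2", "c3", "d4"],
        "city": ["Paris", "Rome", "Paris", None],
    }
)

default_drop = [c for c in toy.columns if is_uninformative(toy[c])]
strict_drop = [
    c
    for c in toy.columns
    if is_uninformative(
        toy[c], drop_null_fraction=0.5, drop_if_constant=True, drop_if_unique=True
    )
]
print(default_drop)
print(strict_drop)
```

With the defaults only the empty column is flagged. Enabling the two optional checks also flags the constant column and the identifier-like column, while `city` is kept because it is neither constant, nor all-distinct, nor mostly missing.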
4.4 Conclusion
In this chapter we have covered how the skrub Cleaner helps with sanitizing data by implementing a number of common transformations that ensure the data used by the pipeline are consistent and can be used as intended by ML models.
In the next chapter we will see how skrub helps with applying this and other transformations to specific columns in the data.
5 Exercise: clean a dataframe using the Cleaner
Path to the exercise: content/exercises/02_cleaning_data.ipynb
Load the given dataframe.
import pandas as pd

df = pd.read_csv("../data/cleaner_data.csv")
Use the TableReport to answer the following questions:
Are there constant columns?
Are there datetime columns? If so, were they parsed correctly?
| Column | Column name | dtype | Is sorted | Null values | Unique values | Mean | Std | Min | Median | Max |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | num_1 | Int64DType | False | 0 (0.0%) | 101 (1.0%) | 50.5 | 29.4 | 0 | 50 | 100 |
| 1 | num_2 | Float64DType | False | 0 (0.0%) | 10000 (100.0%) | 498. | 288. | 0.0306 | 493. | 1.00e+03 |
| 2 | num_3 | Float64DType | False | 0 (0.0%) | 10000 (100.0%) | 50.1 | 15.3 | -4.48 | 50.2 | 114. |
| 3 | first_name | ObjectDType | False | 984 (9.8%) | 45 (0.4%) | | | | | |
| 4 | last_name | ObjectDType | False | 0 (0.0%) | 41 (0.4%) | | | | | |
| 5 | city | ObjectDType | False | 1398 (14.0%) | 40 (0.4%) | | | | | |
| 6 | country | ObjectDType | False | 0 (0.0%) | 33 (0.3%) | | | | | |
| 7 | department | ObjectDType | False | 0 (0.0%) | 15 (0.1%) | | | | | |
| 8 | with_nulls_1 | ObjectDType | False | 7496 (75.0%) | 40 (0.4%) | | | | | |
| 9 | contract_type_1 | ObjectDType | True | 0 (0.0%) | 1 (< 0.1%) | | | | | |
| 10 | date_1 | ObjectDType | False | 0 (0.0%) | 1460 (14.6%) | | | | | |
| 11 | date_2 | ObjectDType | False | 0 (0.0%) | 1461 (14.6%) | | | | | |
Then, use the Cleaner to sanitize the data so that:
Constant columns are removed
Datetimes are parsed properly (hint: use "%d-%b-%Y" as the datetime format)
All columns with more than 50% missing values are removed
Numerical features are converted to float32
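To get a feel for what the hinted format matches, the standard library can parse a single value with the same directives. The sample string below is made up for illustration and is not taken from the exercise data:

```python
from datetime import datetime

# Hypothetical sample value in day, abbreviated month, year form
raw = "03-Jan-2020"

# %d: zero-padded day, %b: abbreviated month name, %Y: four-digit year
parsed = datetime.strptime(raw, "%d-%b-%Y")

print(parsed.date().isoformat())
```

Note that `%b` is locale-dependent: this sketch assumes English month abbreviations.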
from skrub import Cleaner

# Write your answer here
#
#
#
#
#
#
#
#
# solution
from skrub import Cleaner, TableReport

cleaner = Cleaner(
    drop_if_constant=True,
    drop_null_fraction=0.5,
    numeric_dtype="float32",
    datetime_format="%d-%b-%Y",
)

# Apply the cleaner
df_cleaned = cleaner.fit_transform(df)

# Display the cleaned dataframe
TableReport(df_cleaned)
| Column | Column name | dtype | Is sorted | Null values | Unique values | Mean | Std | Min | Median | Max |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | num_1 | Float32DType | False | 0 (0.0%) | 101 (1.0%) | 50.5 | 29.4 | 0.00 | 50.0 | 100. |
| 1 | num_2 | Float32DType | False | 0 (0.0%) | 10000 (100.0%) | 498. | 288. | 0.0306 | 493. | 1.00e+03 |
| 2 | num_3 | Float32DType | False | 0 (0.0%) | 9997 (100.0%) | 50.1 | 15.3 | -4.48 | 50.2 | 114. |
| 3 | first_name | ObjectDType | False | 984 (9.8%) | 45 (0.4%) | | | | | |
| 4 | last_name | ObjectDType | False | 0 (0.0%) | 41 (0.4%) | | | | | |
| 5 | city | ObjectDType | False | 1398 (14.0%) | 40 (0.4%) | | | | | |
| 6 | country | ObjectDType | False | 0 (0.0%) | 33 (0.3%) | | | | | |
| 7 | department | ObjectDType | False | 0 (0.0%) | 15 (0.1%) | | | | | |
| 8 | date_1 | DateTime64DType | False | 0 (0.0%) | 1460 (14.6%) | | | 2020-01-01T00:00:00 | | 2024-01-01T00:00:00 |
| 9 | date_2 | DateTime64DType | False | 0 (0.0%) | 1461 (14.6%) | | | 2020-01-01T00:00:00 | | 2024-01-01T00:00:00 |
We can inspect which columns were dropped and what transformations were applied:
print(f"Original shape: {df.shape}")
print(f"Cleaned shape: {df_cleaned.shape}")
print(f"\nColumns dropped: {[col for col in df.columns if col not in cleaner.all_outputs_]}")
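The same kind of inspection can be done with plain pandas by comparing the dataframe before and after cleaning. The sketch below uses made-up `before` and `after` dataframes, with a simple pandas step standing in for the Cleaner, to list both the dropped columns and the columns whose dtype changed:

```python
import pandas as pd

# Hypothetical input, and a stand-in for the output of a cleaning step
before = pd.DataFrame({"a": [1, 2], "b": [None, None], "c": ["x", "y"]})
after = before.dropna(axis=1, how="all").astype({"a": "float32"})

# Columns present before cleaning but absent afterwards
dropped = [col for col in before.columns if col not in after.columns]

# Columns that were kept but whose dtype changed
dtype_changes = {
    col: (before[col].dtype, after[col].dtype)
    for col in after.columns
    if before[col].dtype != after[col].dtype
}

print(dropped)
print(dtype_changes)
```

This works with the output of any transformer that returns a dataframe, since it does not rely on fitted attributes.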