2 Exploring dataframes with skrub

2.1 Introduction

In this chapter, we will show how we use the skrub TableReport to explore tabular data. We will use the Adult Census dataset as our example table, and perform some exploratory analysis to learn about the characteristics of the data.

First, let’s import the necessary libraries and load the dataset.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import fetch_openml

# Load the Adult Census dataset
data =  pd.read_csv("../data/adult_census/data.csv")
target =  pd.read_csv("../data/adult_census/target.csv")

Now that we have a dataframe we can work with, here is a list of features of the data we would like to find out:

The size of the dataset.
The data types and names of the columns.
The distribution of values in the columns.
Whether null values are present, in what measure and where.
Discrete/categorical features, and their cardinality.
Columns strongly correlated with each other.

2.2 Exploring data with Pandas tools

Let’s first explore the data using Pandas only.

We can get an idea of the content of the table by printing the first few lines, which gives an idea of the datatypes and the columns we are dealing with.

data.head(5)

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country
0	25	Private	226802	11th	7	Never-married	Machine-op-inspct	Own-child	Black	Male	0	40	United-States
1	38	Private	89814	HS-grad	9	Married-civ-spouse	Farming-fishing	Husband	White	Male	0	50	United-States
2	28	Local-gov	336951	Assoc-acdm	12	Married-civ-spouse	Protective-serv	Husband	White	Male	0	40	United-States
3	44	Private	160323	Some-college	10	Married-civ-spouse	Machine-op-inspct	Husband	Black	Male	7688	40	United-States
4	18	NaN	103497	Some-college	10	Never-married	NaN	Own-child	White	Female	0	30	United-States

If we want to have a simpler view of the datatypes in the dataframe, we must use data.info():

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       46043 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      46033 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  47985 non-null  object
dtypes: int64(6), object(8)
memory usage: 5.2+ MB

With .info() we can find out the shape of the dataframe (the number of rows and columns), the datatype and the number of non-null values for each column.

We can also get a richer summary of the data with the .describe() method:

data.describe(include="all")

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country
count	48842.000000	46043	4.884200e+04	48842	48842.000000	48842	46033	48842	48842	48842	48842.000000	48842.000000	48842.000000	47985
unique	NaN	8	NaN	16	NaN	7	14	6	5	2	NaN	NaN	NaN	41
top	NaN	Private	NaN	HS-grad	NaN	Married-civ-spouse	Prof-specialty	Husband	White	Male	NaN	NaN	NaN	United-States
freq	NaN	33906	NaN	15784	NaN	22379	6172	19716	41762	32650	NaN	NaN	NaN	43832
mean	38.643585	NaN	1.896641e+05	NaN	10.078089	NaN	NaN	NaN	NaN	NaN	1079.067626	87.502314	40.422382	NaN
std	13.710510	NaN	1.056040e+05	NaN	2.570973	NaN	NaN	NaN	NaN	NaN	7452.019058	403.004552	12.391444	NaN
min	17.000000	NaN	1.228500e+04	NaN	1.000000	NaN	NaN	NaN	NaN	NaN	0.000000	0.000000	1.000000	NaN
25%	28.000000	NaN	1.175505e+05	NaN	9.000000	NaN	NaN	NaN	NaN	NaN	0.000000	0.000000	40.000000	NaN
50%	37.000000	NaN	1.781445e+05	NaN	10.000000	NaN	NaN	NaN	NaN	NaN	0.000000	0.000000	40.000000	NaN
75%	48.000000	NaN	2.376420e+05	NaN	12.000000	NaN	NaN	NaN	NaN	NaN	0.000000	0.000000	45.000000	NaN
max	90.000000	NaN	1.490400e+06	NaN	16.000000	NaN	NaN	NaN	NaN	NaN	99999.000000	4356.000000	99.000000	NaN

This gives us useful information about all the features in the dataset. Among others, we can find the number of unique values in each column, various statistics for the numerical columns and the number of null values.

2.3 Exploring data with the `TableReport`

Now, let’s create a TableReport to explore the dataset.

from skrub import TableReport
TableReport(data, verbose=0)

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country
0	25	Private	226,802	11th	7	Never-married	Machine-op-inspct	Own-child	Black	Male	0	0	40	United-States
1	38	Private	89,814	HS-grad	9	Married-civ-spouse	Farming-fishing	Husband	White	Male	0	0	50	United-States
2	28	Local-gov	336,951	Assoc-acdm	12	Married-civ-spouse	Protective-serv	Husband	White	Male	0	0	40	United-States
3	44	Private	160,323	Some-college	10	Married-civ-spouse	Machine-op-inspct	Husband	Black	Male	7,688	0	40	United-States
4	18		103,497	Some-college	10	Never-married		Own-child	White	Female	0	0	30	United-States

48,837	27	Private	257,302	Assoc-acdm	12	Married-civ-spouse	Tech-support	Wife	White	Female	0	0	38	United-States
48,838	40	Private	154,374	HS-grad	9	Married-civ-spouse	Machine-op-inspct	Husband	White	Male	0	0	40	United-States
48,839	58	Private	151,910	HS-grad	9	Widowed	Adm-clerical	Unmarried	White	Female	0	0	40	United-States
48,840	22	Private	201,490	HS-grad	9	Never-married	Adm-clerical	Own-child	White	Male	0	0	20	United-States
48,841	52	Self-emp-inc	287,927	HS-grad	9	Married-civ-spouse	Exec-managerial	Wife	White	Female	15,024	0	40	United-States

Column	Column name	dtype	Is sorted	Null values	Unique values	Mean	Std	Min	Median	Max
0	age	Int64DType	False	0 (0.0%)	74 (0.2%)	38.6	13.7	17	37	90
1	workclass	ObjectDType	False	2799 (5.7%)	8 (< 0.1%)
2	fnlwgt	Int64DType	False	0 (0.0%)	28523 (58.4%)	1.90e+05	1.06e+05	12,285	178,142	1,490,400
3	education	ObjectDType	False	0 (0.0%)	16 (< 0.1%)
4	education-num	Int64DType	False	0 (0.0%)	16 (< 0.1%)	10.1	2.57	1	10	16
5	marital-status	ObjectDType	False	0 (0.0%)	7 (< 0.1%)
6	occupation	ObjectDType	False	2809 (5.8%)	14 (< 0.1%)
7	relationship	ObjectDType	False	0 (0.0%)	6 (< 0.1%)
8	race	ObjectDType	False	0 (0.0%)	5 (< 0.1%)
9	sex	ObjectDType	False	0 (0.0%)	2 (< 0.1%)
10	capital-gain	Int64DType	False	0 (0.0%)	123 (0.3%)	1.08e+03	7.45e+03	0	0	99,999
11	capital-loss	Int64DType	False	0 (0.0%)	99 (0.2%)	87.5	403.	0	0	4,356
12	hours-per-week	Int64DType	False	0 (0.0%)	96 (0.2%)	40.4	12.4	1	40	99
13	native-country	ObjectDType	False	857 (1.8%)	41 (< 0.1%)

Column 1	Column 2	Cramér's V	Pearson's Correlation
education	education-num	1.00
relationship	sex	0.657
marital-status	relationship	0.489
marital-status	sex	0.483
occupation	sex	0.419
workclass	occupation	0.377
race	native-country	0.317
age	marital-status	0.294
age	relationship	0.279
sex	hours-per-week	0.265
education	occupation	0.211
education-num	occupation	0.211
occupation	relationship	0.177
workclass	sex	0.171
relationship	hours-per-week	0.169
age	hours-per-week	0.160	0.0538
occupation	hours-per-week	0.156
age	sex	0.151
marital-status	occupation	0.135
marital-status	hours-per-week	0.134

Please enable javascript

The skrub table reports need javascript to display correctly. If you are displaying a report in a Jupyter notebook and you see this message, you may need to re-execute the cell or to trust the notebook (button on the top right or "File > Trust notebook").

2.3.1 Default view of the TableReport

The TableReport gives us a comprehensive overview of the dataset. The default view shows all the columns in the dataset, and allows to select and copy the content of the cells shown in the preview.

The TableReport is intended to show a preview of the data, so it does not contain all the rows in the dataset, rather it shows only the first and last few rows by default. Similarly, it stores only the top 10 most frequent values for each column, if column distributions are plotted.

2.3.2 The “Stats” tab

TableReport(data, open_tab="stats")

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country
0	25	Private	226,802	11th	7	Never-married	Machine-op-inspct	Own-child	Black	Male	0	0	40	United-States
1	38	Private	89,814	HS-grad	9	Married-civ-spouse	Farming-fishing	Husband	White	Male	0	0	50	United-States
2	28	Local-gov	336,951	Assoc-acdm	12	Married-civ-spouse	Protective-serv	Husband	White	Male	0	0	40	United-States
3	44	Private	160,323	Some-college	10	Married-civ-spouse	Machine-op-inspct	Husband	Black	Male	7,688	0	40	United-States
4	18		103,497	Some-college	10	Never-married		Own-child	White	Female	0	0	30	United-States

48,837	27	Private	257,302	Assoc-acdm	12	Married-civ-spouse	Tech-support	Wife	White	Female	0	0	38	United-States
48,838	40	Private	154,374	HS-grad	9	Married-civ-spouse	Machine-op-inspct	Husband	White	Male	0	0	40	United-States
48,839	58	Private	151,910	HS-grad	9	Widowed	Adm-clerical	Unmarried	White	Female	0	0	40	United-States
48,840	22	Private	201,490	HS-grad	9	Never-married	Adm-clerical	Own-child	White	Male	0	0	20	United-States
48,841	52	Self-emp-inc	287,927	HS-grad	9	Married-civ-spouse	Exec-managerial	Wife	White	Female	15,024	0	40	United-States

Column	Column name	dtype	Is sorted	Null values	Unique values	Mean	Std	Min	Median	Max
0	age	Int64DType	False	0 (0.0%)	74 (0.2%)	38.6	13.7	17	37	90
1	workclass	ObjectDType	False	2799 (5.7%)	8 (< 0.1%)
2	fnlwgt	Int64DType	False	0 (0.0%)	28523 (58.4%)	1.90e+05	1.06e+05	12,285	178,142	1,490,400
3	education	ObjectDType	False	0 (0.0%)	16 (< 0.1%)
4	education-num	Int64DType	False	0 (0.0%)	16 (< 0.1%)	10.1	2.57	1	10	16
5	marital-status	ObjectDType	False	0 (0.0%)	7 (< 0.1%)
6	occupation	ObjectDType	False	2809 (5.8%)	14 (< 0.1%)
7	relationship	ObjectDType	False	0 (0.0%)	6 (< 0.1%)
8	race	ObjectDType	False	0 (0.0%)	5 (< 0.1%)
9	sex	ObjectDType	False	0 (0.0%)	2 (< 0.1%)
10	capital-gain	Int64DType	False	0 (0.0%)	123 (0.3%)	1.08e+03	7.45e+03	0	0	99,999
11	capital-loss	Int64DType	False	0 (0.0%)	99 (0.2%)	87.5	403.	0	0	4,356
12	hours-per-week	Int64DType	False	0 (0.0%)	96 (0.2%)	40.4	12.4	1	40	99
13	native-country	ObjectDType	False	857 (1.8%)	41 (< 0.1%)

Column 1	Column 2	Cramér's V	Pearson's Correlation
education	education-num	1.00
relationship	sex	0.657
marital-status	relationship	0.489
marital-status	sex	0.483
occupation	sex	0.419
workclass	occupation	0.377
race	native-country	0.317
age	marital-status	0.294
age	relationship	0.279
sex	hours-per-week	0.265
education	occupation	0.211
education-num	occupation	0.211
occupation	relationship	0.177
workclass	sex	0.171
relationship	hours-per-week	0.169
age	hours-per-week	0.160	0.0538
occupation	hours-per-week	0.156
age	sex	0.151
marital-status	occupation	0.135
marital-status	hours-per-week	0.134

Please enable javascript

The “Stats” tab provides a variety of descriptive statistics for each column in the dataset. This includes:

The column name
The detected data type of the column
Whether the column is sorted or not
The number of null values in the column, as well as the percentage
The number of unique values in the column

For numerical columns, additional statistics are provided:

Mean
Standard deviation
Minimum and maximum values
Median

Stat columns can also be sorted, for example to quickly identify which columns contain the most nulls, or have the largest cardinality (number of unique values).

Filters

Pre-made column filters are also available, allowing to select columns by dtype or other characteristics. Filters are shared across tabs.

2.3.3 The “Distributions” tab

TableReport(data, open_tab="distributions")

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country
0	25	Private	226,802	11th	7	Never-married	Machine-op-inspct	Own-child	Black	Male	0	0	40	United-States
1	38	Private	89,814	HS-grad	9	Married-civ-spouse	Farming-fishing	Husband	White	Male	0	0	50	United-States
2	28	Local-gov	336,951	Assoc-acdm	12	Married-civ-spouse	Protective-serv	Husband	White	Male	0	0	40	United-States
3	44	Private	160,323	Some-college	10	Married-civ-spouse	Machine-op-inspct	Husband	Black	Male	7,688	0	40	United-States
4	18		103,497	Some-college	10	Never-married		Own-child	White	Female	0	0	30	United-States

48,837	27	Private	257,302	Assoc-acdm	12	Married-civ-spouse	Tech-support	Wife	White	Female	0	0	38	United-States
48,838	40	Private	154,374	HS-grad	9	Married-civ-spouse	Machine-op-inspct	Husband	White	Male	0	0	40	United-States
48,839	58	Private	151,910	HS-grad	9	Widowed	Adm-clerical	Unmarried	White	Female	0	0	40	United-States
48,840	22	Private	201,490	HS-grad	9	Never-married	Adm-clerical	Own-child	White	Male	0	0	20	United-States
48,841	52	Self-emp-inc	287,927	HS-grad	9	Married-civ-spouse	Exec-managerial	Wife	White	Female	15,024	0	40	United-States

Column	Column name	dtype	Is sorted	Null values	Unique values	Mean	Std	Min	Median	Max
0	age	Int64DType	False	0 (0.0%)	74 (0.2%)	38.6	13.7	17	37	90
1	workclass	ObjectDType	False	2799 (5.7%)	8 (< 0.1%)
2	fnlwgt	Int64DType	False	0 (0.0%)	28523 (58.4%)	1.90e+05	1.06e+05	12,285	178,142	1,490,400
3	education	ObjectDType	False	0 (0.0%)	16 (< 0.1%)
4	education-num	Int64DType	False	0 (0.0%)	16 (< 0.1%)	10.1	2.57	1	10	16
5	marital-status	ObjectDType	False	0 (0.0%)	7 (< 0.1%)
6	occupation	ObjectDType	False	2809 (5.8%)	14 (< 0.1%)
7	relationship	ObjectDType	False	0 (0.0%)	6 (< 0.1%)
8	race	ObjectDType	False	0 (0.0%)	5 (< 0.1%)
9	sex	ObjectDType	False	0 (0.0%)	2 (< 0.1%)
10	capital-gain	Int64DType	False	0 (0.0%)	123 (0.3%)	1.08e+03	7.45e+03	0	0	99,999
11	capital-loss	Int64DType	False	0 (0.0%)	99 (0.2%)	87.5	403.	0	0	4,356
12	hours-per-week	Int64DType	False	0 (0.0%)	96 (0.2%)	40.4	12.4	1	40	99
13	native-country	ObjectDType	False	857 (1.8%)	41 (< 0.1%)

Column 1	Column 2	Cramér's V	Pearson's Correlation
education	education-num	1.00
relationship	sex	0.657
marital-status	relationship	0.489
marital-status	sex	0.483
occupation	sex	0.419
workclass	occupation	0.377
race	native-country	0.317
age	marital-status	0.294
age	relationship	0.279
sex	hours-per-week	0.265
education	occupation	0.211
education-num	occupation	0.211
occupation	relationship	0.177
workclass	sex	0.171
relationship	hours-per-week	0.169
age	hours-per-week	0.160	0.0538
occupation	hours-per-week	0.156
age	sex	0.151
marital-status	occupation	0.135
marital-status	hours-per-week	0.134

Please enable javascript

The “Distributions” tab provides visualizations of the distributions of values in each column. This includes histograms for numerical columns and bar plots for categorical columns.

The “Distributions” tab helps with detecting potential issues in the data, such as:

Skewed distributions
Outliers
Unexpected value frequencies

For example, in this dataset we can see that some columns are heavily skewed, such as “workclass”, “race”, and “native-country”: this is important information to keep track of, because these columns may require special handling during data preprocessing or modeling.

Additionally, the “Distributions” tab allows to select columns manually, so that they can be added to a script and selected for further analysis or modeling.

Outlier detection

The TableReport detects outliers using a simple interquartile test, marking as outliers all values that are beyond the IQR. This is a simple heuristic, and should not be treated as perfect. If your problem requires reliable outlier detection, you should not rely exclusively on what the TableReport shows.

2.3.4 The “Associations” tab

TableReport(data, open_tab="associations")

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country
0	25	Private	226,802	11th	7	Never-married	Machine-op-inspct	Own-child	Black	Male	0	0	40	United-States
1	38	Private	89,814	HS-grad	9	Married-civ-spouse	Farming-fishing	Husband	White	Male	0	0	50	United-States
2	28	Local-gov	336,951	Assoc-acdm	12	Married-civ-spouse	Protective-serv	Husband	White	Male	0	0	40	United-States
3	44	Private	160,323	Some-college	10	Married-civ-spouse	Machine-op-inspct	Husband	Black	Male	7,688	0	40	United-States
4	18		103,497	Some-college	10	Never-married		Own-child	White	Female	0	0	30	United-States

48,837	27	Private	257,302	Assoc-acdm	12	Married-civ-spouse	Tech-support	Wife	White	Female	0	0	38	United-States
48,838	40	Private	154,374	HS-grad	9	Married-civ-spouse	Machine-op-inspct	Husband	White	Male	0	0	40	United-States
48,839	58	Private	151,910	HS-grad	9	Widowed	Adm-clerical	Unmarried	White	Female	0	0	40	United-States
48,840	22	Private	201,490	HS-grad	9	Never-married	Adm-clerical	Own-child	White	Male	0	0	20	United-States
48,841	52	Self-emp-inc	287,927	HS-grad	9	Married-civ-spouse	Exec-managerial	Wife	White	Female	15,024	0	40	United-States

Column	Column name	dtype	Is sorted	Null values	Unique values	Mean	Std	Min	Median	Max
0	age	Int64DType	False	0 (0.0%)	74 (0.2%)	38.6	13.7	17	37	90
1	workclass	ObjectDType	False	2799 (5.7%)	8 (< 0.1%)
2	fnlwgt	Int64DType	False	0 (0.0%)	28523 (58.4%)	1.90e+05	1.06e+05	12,285	178,142	1,490,400
3	education	ObjectDType	False	0 (0.0%)	16 (< 0.1%)
4	education-num	Int64DType	False	0 (0.0%)	16 (< 0.1%)	10.1	2.57	1	10	16
5	marital-status	ObjectDType	False	0 (0.0%)	7 (< 0.1%)
6	occupation	ObjectDType	False	2809 (5.8%)	14 (< 0.1%)
7	relationship	ObjectDType	False	0 (0.0%)	6 (< 0.1%)
8	race	ObjectDType	False	0 (0.0%)	5 (< 0.1%)
9	sex	ObjectDType	False	0 (0.0%)	2 (< 0.1%)
10	capital-gain	Int64DType	False	0 (0.0%)	123 (0.3%)	1.08e+03	7.45e+03	0	0	99,999
11	capital-loss	Int64DType	False	0 (0.0%)	99 (0.2%)	87.5	403.	0	0	4,356
12	hours-per-week	Int64DType	False	0 (0.0%)	96 (0.2%)	40.4	12.4	1	40	99
13	native-country	ObjectDType	False	857 (1.8%)	41 (< 0.1%)

Column 1	Column 2	Cramér's V	Pearson's Correlation
education	education-num	1.00
relationship	sex	0.657
marital-status	relationship	0.489
marital-status	sex	0.483
occupation	sex	0.419
workclass	occupation	0.377
race	native-country	0.317
age	marital-status	0.294
age	relationship	0.279
sex	hours-per-week	0.265
education	occupation	0.211
education-num	occupation	0.211
occupation	relationship	0.177
workclass	sex	0.171
relationship	hours-per-week	0.169
age	hours-per-week	0.160	0.0538
occupation	hours-per-week	0.156
age	sex	0.151
marital-status	occupation	0.135
marital-status	hours-per-week	0.134

Please enable javascript

The “Associations” tab provides insights into the relationships between different columns in the dataset. It shows Pearson’s correlation coefficient for numerical columns, as well as Cramér’s V for all columns.

While this is a somewhat rough measure of association, it can help identify potential relationships worth exploring further during the analysis, and highlights highly correlated columns: depending on the modeling technique used, these may need to be handled specially to avoid issues with multicollinearity.

In this example, we can see that “education-num” and “education” have perfect correlation, which means that one of the two columns can be dropped without losing information.

2.4 Exploring the target variable

Besides dataframes, the TableReport handles series and mono- and bi-dimensional numpy arrays.

So, let’s take a closer look at the target variable, which indicates whether an individual’s income exceeds $50K per year. We can create a separate TableReport for the target variable to explore its distribution:

TableReport(target)

	class
0	<=50K
1	<=50K
2	>50K
3	>50K
4	<=50K

48,837	<=50K
48,838	>50K
48,839	<=50K
48,840	<=50K
48,841	>50K

Column	Column name	dtype	Is sorted	Null values	Unique values	Mean	Std	Min	Median	Max
0	class	ObjectDType	False	0 (0.0%)	2 (< 0.1%)

Please enable javascript

2.5 Configuring and saving the `TableReport`

The TableReport can be saved on disk as an HTML.

TableReport(data).write_html("report.html")

Then, the report can be opened using any internet browser, with no need to run a Jupyter notebok or a python interactive console.

It is possible to configure various parameters using the skrub global config. For example, it is possible to replace the default Pandas or Polars dataframe display with the TableReport by using patch_display (and unpatch_display):

from skrub import patch_display, unpatch_display

# replace the default pandas repr 
patch_display()
data

Processing column   1 / 14Processing column   2 / 14Processing column   3 / 14Processing column   4 / 14Processing column   5 / 14Processing column   6 / 14Processing column   7 / 14Processing column   8 / 14Processing column   9 / 14Processing column  10 / 14Processing column  11 / 14Processing column  12 / 14Processing column  13 / 14Processing column  14 / 14

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country
0	25	Private	226,802	11th	7	Never-married	Machine-op-inspct	Own-child	Black	Male	0	0	40	United-States
1	38	Private	89,814	HS-grad	9	Married-civ-spouse	Farming-fishing	Husband	White	Male	0	0	50	United-States
2	28	Local-gov	336,951	Assoc-acdm	12	Married-civ-spouse	Protective-serv	Husband	White	Male	0	0	40	United-States
3	44	Private	160,323	Some-college	10	Married-civ-spouse	Machine-op-inspct	Husband	Black	Male	7,688	0	40	United-States
4	18		103,497	Some-college	10	Never-married		Own-child	White	Female	0	0	30	United-States

48,837	27	Private	257,302	Assoc-acdm	12	Married-civ-spouse	Tech-support	Wife	White	Female	0	0	38	United-States
48,838	40	Private	154,374	HS-grad	9	Married-civ-spouse	Machine-op-inspct	Husband	White	Male	0	0	40	United-States
48,839	58	Private	151,910	HS-grad	9	Widowed	Adm-clerical	Unmarried	White	Female	0	0	40	United-States
48,840	22	Private	201,490	HS-grad	9	Never-married	Adm-clerical	Own-child	White	Male	0	0	20	United-States
48,841	52	Self-emp-inc	287,927	HS-grad	9	Married-civ-spouse	Exec-managerial	Wife	White	Female	15,024	0	40	United-States

Column	Column name	dtype	Is sorted	Null values	Unique values	Mean	Std	Min	Median	Max
0	age	Int64DType	False	0 (0.0%)	74 (0.2%)	38.6	13.7	17	37	90
1	workclass	ObjectDType	False	2799 (5.7%)	8 (< 0.1%)
2	fnlwgt	Int64DType	False	0 (0.0%)	28523 (58.4%)	1.90e+05	1.06e+05	12,285	178,142	1,490,400
3	education	ObjectDType	False	0 (0.0%)	16 (< 0.1%)
4	education-num	Int64DType	False	0 (0.0%)	16 (< 0.1%)	10.1	2.57	1	10	16
5	marital-status	ObjectDType	False	0 (0.0%)	7 (< 0.1%)
6	occupation	ObjectDType	False	2809 (5.8%)	14 (< 0.1%)
7	relationship	ObjectDType	False	0 (0.0%)	6 (< 0.1%)
8	race	ObjectDType	False	0 (0.0%)	5 (< 0.1%)
9	sex	ObjectDType	False	0 (0.0%)	2 (< 0.1%)
10	capital-gain	Int64DType	False	0 (0.0%)	123 (0.3%)	1.08e+03	7.45e+03	0	0	99,999
11	capital-loss	Int64DType	False	0 (0.0%)	99 (0.2%)	87.5	403.	0	0	4,356
12	hours-per-week	Int64DType	False	0 (0.0%)	96 (0.2%)	40.4	12.4	1	40	99
13	native-country	ObjectDType	False	857 (1.8%)	41 (< 0.1%)

Column 1	Column 2	Cramér's V	Pearson's Correlation
education	education-num	1.00
relationship	sex	0.657
marital-status	relationship	0.489
marital-status	sex	0.483
occupation	sex	0.419
workclass	occupation	0.377
race	native-country	0.317
age	marital-status	0.294
age	relationship	0.279
sex	hours-per-week	0.265
education	occupation	0.211
education-num	occupation	0.211
occupation	relationship	0.177
workclass	sex	0.171
relationship	hours-per-week	0.169
age	hours-per-week	0.160	0.0538
occupation	hours-per-week	0.156
age	sex	0.151
marital-status	occupation	0.135
marital-status	hours-per-week	0.134

Please enable javascript

To disable, use unpatch_display:

unpatch_display()
data

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country
0	25	Private	226802	11th	7	Never-married	Machine-op-inspct	Own-child	Black	Male	0	0	40	United-States
1	38	Private	89814	HS-grad	9	Married-civ-spouse	Farming-fishing	Husband	White	Male	0	0	50	United-States
2	28	Local-gov	336951	Assoc-acdm	12	Married-civ-spouse	Protective-serv	Husband	White	Male	0	0	40	United-States
3	44	Private	160323	Some-college	10	Married-civ-spouse	Machine-op-inspct	Husband	Black	Male	7688	0	40	United-States
4	18	NaN	103497	Some-college	10	Never-married	NaN	Own-child	White	Female	0	0	30	United-States
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
48837	27	Private	257302	Assoc-acdm	12	Married-civ-spouse	Tech-support	Wife	White	Female	0	0	38	United-States
48838	40	Private	154374	HS-grad	9	Married-civ-spouse	Machine-op-inspct	Husband	White	Male	0	0	40	United-States
48839	58	Private	151910	HS-grad	9	Widowed	Adm-clerical	Unmarried	White	Female	0	0	40	United-States
48840	22	Private	201490	HS-grad	9	Never-married	Adm-clerical	Own-child	White	Male	0	0	20	United-States
48841	52	Self-emp-inc	287927	HS-grad	9	Married-civ-spouse	Exec-managerial	Wife	White	Female	15024	0	40	United-States

48842 rows × 14 columns

More detail on the skrub configuration is reported in the User Guide.

2.6 Working with big tables

Plotting and measuring the column correlations are expensive operations and may take a long time, so when the dataframe under study is large it may be more convenient to skip them for quicker development.

The max_plot_columns and max_association_columns parameters allow to set a threshold on the number of columns: the TableReport will skip the respective task if the number of colums in the dataframe is larger than the threshold:

When the number of columns is too large, an information message is shown in the respective tab instead of the plots or correlations.

TableReport(
    data, max_association_columns=3, max_plot_columns=3, open_tab="distributions"
)

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country
0	25	Private	226,802	11th	7	Never-married	Machine-op-inspct	Own-child	Black	Male	0	0	40	United-States
1	38	Private	89,814	HS-grad	9	Married-civ-spouse	Farming-fishing	Husband	White	Male	0	0	50	United-States
2	28	Local-gov	336,951	Assoc-acdm	12	Married-civ-spouse	Protective-serv	Husband	White	Male	0	0	40	United-States
3	44	Private	160,323	Some-college	10	Married-civ-spouse	Machine-op-inspct	Husband	Black	Male	7,688	0	40	United-States
4	18		103,497	Some-college	10	Never-married		Own-child	White	Female	0	0	30	United-States

48,837	27	Private	257,302	Assoc-acdm	12	Married-civ-spouse	Tech-support	Wife	White	Female	0	0	38	United-States
48,838	40	Private	154,374	HS-grad	9	Married-civ-spouse	Machine-op-inspct	Husband	White	Male	0	0	40	United-States
48,839	58	Private	151,910	HS-grad	9	Widowed	Adm-clerical	Unmarried	White	Female	0	0	40	United-States
48,840	22	Private	201,490	HS-grad	9	Never-married	Adm-clerical	Own-child	White	Male	0	0	20	United-States
48,841	52	Self-emp-inc	287,927	HS-grad	9	Married-civ-spouse	Exec-managerial	Wife	White	Female	15,024	0	40	United-States

Column	Column name	dtype	Is sorted	Null values	Unique values	Mean	Std	Min	Median	Max
0	age	Int64DType	False	0 (0.0%)	74 (0.2%)	38.6	13.7	17	37	90
1	workclass	ObjectDType	False	2799 (5.7%)	8 (< 0.1%)
2	fnlwgt	Int64DType	False	0 (0.0%)	28523 (58.4%)	1.90e+05	1.06e+05	12,285	178,142	1,490,400
3	education	ObjectDType	False	0 (0.0%)	16 (< 0.1%)
4	education-num	Int64DType	False	0 (0.0%)	16 (< 0.1%)	10.1	2.57	1	10	16
5	marital-status	ObjectDType	False	0 (0.0%)	7 (< 0.1%)
6	occupation	ObjectDType	False	2809 (5.8%)	14 (< 0.1%)
7	relationship	ObjectDType	False	0 (0.0%)	6 (< 0.1%)
8	race	ObjectDType	False	0 (0.0%)	5 (< 0.1%)
9	sex	ObjectDType	False	0 (0.0%)	2 (< 0.1%)
10	capital-gain	Int64DType	False	0 (0.0%)	123 (0.3%)	1.08e+03	7.45e+03	0	0	99,999
11	capital-loss	Int64DType	False	0 (0.0%)	99 (0.2%)	87.5	403.	0	0	4,356
12	hours-per-week	Int64DType	False	0 (0.0%)	96 (0.2%)	40.4	12.4	1	40	99
13	native-country	ObjectDType	False	857 (1.8%)	41 (< 0.1%)

Please enable javascript

2.7 Conclusions

In this chapter we have learned how the TableReport can be used to speed up data exploration, allowing us to find possible criticalities in the data.

We covered:

Creating and configuring a TableReport for fast, interactive data exploration
Exploring column statistics, value distributions, and associations visually
Detecting nulls, outliers, and highly correlated columns at a glance
Filtering columns by type or characteristics using built-in filters
Saving and sharing interactive reports as standalone HTML files
Adjusting TableReport settings for large datasets to optimize performance

In the next chapter, we will find out how to address some of the possible problems using the skrub Cleaner.

2.1 Introduction

2.2 Exploring data with Pandas tools

2.3 Exploring data with the TableReport

age

workclass

fnlwgt

education

education-num

marital-status

occupation

relationship

race

sex

capital-gain

capital-loss

hours-per-week

native-country

age

workclass

fnlwgt

education

education-num

marital-status

occupation

relationship

race

sex

capital-gain

capital-loss

hours-per-week

native-country

Please enable javascript

2.3.1 Default view of the TableReport

2.3.2 The “Stats” tab

age

workclass

fnlwgt

education

education-num

marital-status

occupation

relationship

race

sex

capital-gain

capital-loss

hours-per-week

native-country

age

workclass

fnlwgt

education

education-num

marital-status

occupation

relationship

race

sex

capital-gain

capital-loss

hours-per-week

native-country

Please enable javascript

2.3.3 The “Distributions” tab

age

workclass

fnlwgt

education

education-num

marital-status

occupation

relationship

race

sex

capital-gain

capital-loss

hours-per-week

native-country

age

workclass

2.3 Exploring data with the `TableReport`

2.5 Configuring and saving the `TableReport`