Exploring dataframes with skrub

Introduction

In this chapter, we will show how we use the skrub TableReport to explore tabular data. We will use the Adult Census dataset as our example table, and perform some exploratory analysis to learn about the characteristics of the data.

First, let’s import the necessary libraries and load the dataset.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import fetch_openml

# Load the Adult Census dataset
data =  pd.read_csv("../data/adult_census/data.csv")
target =  pd.read_csv("../data/adult_census/target.csv")

Now that we have a dataframe we can work with, here is a list of features of the data we would like to find out:

The size of the dataset.
The data types and names of the columns.
The distribution of values in the columns.
Whether null values are present, in what measure and where.
Discrete/categorical features, and their cardinality.
Columns strongly correlated with each other.

Exploring data with Pandas tools

Let’s first explore the data using Pandas only.

We can get an idea of the content of the table by printing the first few lines, which gives an idea of the datatypes and the columns we are dealing with.

data.head(5)

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country
0	25	Private	226802	11th	7	Never-married	Machine-op-inspct	Own-child	Black	Male	0	40	United-States
1	38	Private	89814	HS-grad	9	Married-civ-spouse	Farming-fishing	Husband	White	Male	0	50	United-States
2	28	Local-gov	336951	Assoc-acdm	12	Married-civ-spouse	Protective-serv	Husband	White	Male	0	40	United-States
3	44	Private	160323	Some-college	10	Married-civ-spouse	Machine-op-inspct	Husband	Black	Male	7688	40	United-States
4	18	NaN	103497	Some-college	10	Never-married	NaN	Own-child	White	Female	0	30	United-States

If we want to have a simpler view of the datatypes in the dataframe, we must use data.info():

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       46043 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      46033 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  47985 non-null  object
dtypes: int64(6), object(8)
memory usage: 5.2+ MB

With .info() we can find out the shape of the dataframe (the number of rows and columns), the datatype and the number of non-null values for each column.

We can also get a richer summary of the data with the .describe() method:

data.describe(include="all")

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country
count	48842.000000	46043	4.884200e+04	48842	48842.000000	48842	46033	48842	48842	48842	48842.000000	48842.000000	48842.000000	47985
unique	NaN	8	NaN	16	NaN	7	14	6	5	2	NaN	NaN	NaN	41
top	NaN	Private	NaN	HS-grad	NaN	Married-civ-spouse	Prof-specialty	Husband	White	Male	NaN	NaN	NaN	United-States
freq	NaN	33906	NaN	15784	NaN	22379	6172	19716	41762	32650	NaN	NaN	NaN	43832
mean	38.643585	NaN	1.896641e+05	NaN	10.078089	NaN	NaN	NaN	NaN	NaN	1079.067626	87.502314	40.422382	NaN
std	13.710510	NaN	1.056040e+05	NaN	2.570973	NaN	NaN	NaN	NaN	NaN	7452.019058	403.004552	12.391444	NaN
min	17.000000	NaN	1.228500e+04	NaN	1.000000	NaN	NaN	NaN	NaN	NaN	0.000000	0.000000	1.000000	NaN
25%	28.000000	NaN	1.175505e+05	NaN	9.000000	NaN	NaN	NaN	NaN	NaN	0.000000	0.000000	40.000000	NaN
50%	37.000000	NaN	1.781445e+05	NaN	10.000000	NaN	NaN	NaN	NaN	NaN	0.000000	0.000000	40.000000	NaN
75%	48.000000	NaN	2.376420e+05	NaN	12.000000	NaN	NaN	NaN	NaN	NaN	0.000000	0.000000	45.000000	NaN
max	90.000000	NaN	1.490400e+06	NaN	16.000000	NaN	NaN	NaN	NaN	NaN	99999.000000	4356.000000	99.000000	NaN

This gives us useful information about all the features in the dataset. Among others, we can find the number of unique values in each column, various statistics for the numerical columns and the number of null values.

Exploring data with the `TableReport`

Now, let’s create a TableReport to explore the dataset.

from skrub import TableReport
TableReport(data, verbose=0)

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country
0	25	Private	226,802	11th	7	Never-married	Machine-op-inspct	Own-child	Black	Male	0	0	40	United-States
1	38	Private	89,814	HS-grad	9	Married-civ-spouse	Farming-fishing	Husband	White	Male	0	0	50	United-States
2	28	Local-gov	336,951	Assoc-acdm	12	Married-civ-spouse	Protective-serv	Husband	White	Male	0	0	40	United-States
3	44	Private	160,323	Some-college	10	Married-civ-spouse	Machine-op-inspct	Husband	Black	Male	7,688	0	40	United-States
4	18		103,497	Some-college	10	Never-married		Own-child	White	Female	0	0	30	United-States

48,837	27	Private	257,302	Assoc-acdm	12	Married-civ-spouse	Tech-support	Wife	White	Female	0	0	38	United-States
48,838	40	Private	154,374	HS-grad	9	Married-civ-spouse	Machine-op-inspct	Husband	White	Male	0	0	40	United-States
48,839	58	Private	151,910	HS-grad	9	Widowed	Adm-clerical	Unmarried	White	Female	0	0	40	United-States
48,840	22	Private	201,490	HS-grad	9	Never-married	Adm-clerical	Own-child	White	Male	0	0	20	United-States
48,841	52	Self-emp-inc	287,927	HS-grad	9	Married-civ-spouse	Exec-managerial	Wife	White	Female	15,024	0	40	United-States

Column	Column name	dtype	Is sorted	Null values	Unique values	Mean	Std	Min	Median	Max
0	age	Int64DType	False	0 (0.0%)	74 (0.2%)	38.6	13.7	17	37	90
1	workclass	ObjectDType	False	2799 (5.7%)	8 (< 0.1%)
2	fnlwgt	Int64DType	False	0 (0.0%)	28523 (58.4%)	1.90e+05	1.06e+05	12,285	178,142	1,490,400
3	education	ObjectDType	False	0 (0.0%)	16 (< 0.1%)
4	education-num	Int64DType	False	0 (0.0%)	16 (< 0.1%)	10.1	2.57	1	10	16
5	marital-status	ObjectDType	False	0 (0.0%)	7 (< 0.1%)
6	occupation	ObjectDType	False	2809 (5.8%)	14 (< 0.1%)
7	relationship	ObjectDType	False	0 (0.0%)	6 (< 0.1%)
8	race	ObjectDType	False	0 (0.0%)	5 (< 0.1%)
9	sex	ObjectDType	False	0 (0.0%)	2 (< 0.1%)
10	capital-gain	Int64DType	False	0 (0.0%)	123 (0.3%)	1.08e+03	7.45e+03	0	0	99,999
11	capital-loss	Int64DType	False	0 (0.0%)	99 (0.2%)	87.5	403.	0	0	4,356
12	hours-per-week	Int64DType	False	0 (0.0%)	96 (0.2%)	40.4	12.4	1	40	99
13	native-country	ObjectDType	False	857 (1.8%)	41 (< 0.1%)

Column 1	Column 2	Cramér's V	Pearson's Correlation
education	education-num	1.00
relationship	sex	0.657
marital-status	relationship	0.489
marital-status	sex	0.483
occupation	sex	0.419
workclass	occupation	0.377
race	native-country	0.317
age	marital-status	0.294
age	relationship	0.279
sex	hours-per-week	0.265
education	occupation	0.211
education-num	occupation	0.211
occupation	relationship	0.177
workclass	sex	0.171
relationship	hours-per-week	0.169
age	hours-per-week	0.160	0.0538
occupation	hours-per-week	0.156
age	sex	0.151
marital-status	occupation	0.135
marital-status	hours-per-week	0.134

Please enable javascript

The skrub table reports need javascript to display correctly. If you are displaying a report in a Jupyter notebook and you see this message, you may need to re-execute the cell or to trust the notebook (button on the top right or "File > Trust notebook").

The “Stats” tab

TableReport(data, open_tab="stats")

Processing column   1 / 14Processing column   2 / 14Processing column   3 / 14Processing column   4 / 14Processing column   5 / 14Processing column   6 / 14Processing column   7 / 14Processing column   8 / 14Processing column   9 / 14Processing column  10 / 14Processing column  11 / 14Processing column  12 / 14Processing column  13 / 14Processing column  14 / 14

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country
0	25	Private	226,802	11th	7	Never-married	Machine-op-inspct	Own-child	Black	Male	0	0	40	United-States
1	38	Private	89,814	HS-grad	9	Married-civ-spouse	Farming-fishing	Husband	White	Male	0	0	50	United-States
2	28	Local-gov	336,951	Assoc-acdm	12	Married-civ-spouse	Protective-serv	Husband	White	Male	0	0	40	United-States
3	44	Private	160,323	Some-college	10	Married-civ-spouse	Machine-op-inspct	Husband	Black	Male	7,688	0	40	United-States
4	18		103,497	Some-college	10	Never-married		Own-child	White	Female	0	0	30	United-States

48,837	27	Private	257,302	Assoc-acdm	12	Married-civ-spouse	Tech-support	Wife	White	Female	0	0	38	United-States
48,838	40	Private	154,374	HS-grad	9	Married-civ-spouse	Machine-op-inspct	Husband	White	Male	0	0	40	United-States
48,839	58	Private	151,910	HS-grad	9	Widowed	Adm-clerical	Unmarried	White	Female	0	0	40	United-States
48,840	22	Private	201,490	HS-grad	9	Never-married	Adm-clerical	Own-child	White	Male	0	0	20	United-States
48,841	52	Self-emp-inc	287,927	HS-grad	9	Married-civ-spouse	Exec-managerial	Wife	White	Female	15,024	0	40	United-States

Column	Column name	dtype	Is sorted	Null values	Unique values	Mean	Std	Min	Median	Max
0	age	Int64DType	False	0 (0.0%)	74 (0.2%)	38.6	13.7	17	37	90
1	workclass	ObjectDType	False	2799 (5.7%)	8 (< 0.1%)
2	fnlwgt	Int64DType	False	0 (0.0%)	28523 (58.4%)	1.90e+05	1.06e+05	12,285	178,142	1,490,400
3	education	ObjectDType	False	0 (0.0%)	16 (< 0.1%)
4	education-num	Int64DType	False	0 (0.0%)	16 (< 0.1%)	10.1	2.57	1	10	16
5	marital-status	ObjectDType	False	0 (0.0%)	7 (< 0.1%)
6	occupation	ObjectDType	False	2809 (5.8%)	14 (< 0.1%)
7	relationship	ObjectDType	False	0 (0.0%)	6 (< 0.1%)
8	race	ObjectDType	False	0 (0.0%)	5 (< 0.1%)
9	sex	ObjectDType	False	0 (0.0%)	2 (< 0.1%)
10	capital-gain	Int64DType	False	0 (0.0%)	123 (0.3%)	1.08e+03	7.45e+03	0	0	99,999
11	capital-loss	Int64DType	False	0 (0.0%)	99 (0.2%)	87.5	403.	0	0	4,356
12	hours-per-week	Int64DType	False	0 (0.0%)	96 (0.2%)	40.4	12.4	1	40	99
13	native-country	ObjectDType	False	857 (1.8%)	41 (< 0.1%)

Column 1	Column 2	Cramér's V	Pearson's Correlation
education	education-num	1.00
relationship	sex	0.657
marital-status	relationship	0.489
marital-status	sex	0.483
occupation	sex	0.419
workclass	occupation	0.377
race	native-country	0.317
age	marital-status	0.294
age	relationship	0.279
sex	hours-per-week	0.265
education	occupation	0.211
education-num	occupation	0.211
occupation	relationship	0.177
workclass	sex	0.171
relationship	hours-per-week	0.169
age	hours-per-week	0.160	0.0538
occupation	hours-per-week	0.156
age	sex	0.151
marital-status	occupation	0.135
marital-status	hours-per-week	0.134

Please enable javascript

Filters

Pre-made column filters are also available, allowing to select columns by dtype or other characteristics. Filters are shared across tabs.

The “Distributions” tab

TableReport(data, open_tab="distributions")

Processing column   1 / 14Processing column   2 / 14Processing column   3 / 14Processing column   4 / 14Processing column   5 / 14Processing column   6 / 14Processing column   7 / 14Processing column   8 / 14Processing column   9 / 14Processing column  10 / 14Processing column  11 / 14Processing column  12 / 14Processing column  13 / 14Processing column  14 / 14

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country
0	25	Private	226,802	11th	7	Never-married	Machine-op-inspct	Own-child	Black	Male	0	0	40	United-States
1	38	Private	89,814	HS-grad	9	Married-civ-spouse	Farming-fishing	Husband	White	Male	0	0	50	United-States
2	28	Local-gov	336,951	Assoc-acdm	12	Married-civ-spouse	Protective-serv	Husband	White	Male	0	0	40	United-States
3	44	Private	160,323	Some-college	10	Married-civ-spouse	Machine-op-inspct	Husband	Black	Male	7,688	0	40	United-States
4	18		103,497	Some-college	10	Never-married		Own-child	White	Female	0	0	30	United-States

48,837	27	Private	257,302	Assoc-acdm	12	Married-civ-spouse	Tech-support	Wife	White	Female	0	0	38	United-States
48,838	40	Private	154,374	HS-grad	9	Married-civ-spouse	Machine-op-inspct	Husband	White	Male	0	0	40	United-States
48,839	58	Private	151,910	HS-grad	9	Widowed	Adm-clerical	Unmarried	White	Female	0	0	40	United-States
48,840	22	Private	201,490	HS-grad	9	Never-married	Adm-clerical	Own-child	White	Male	0	0	20	United-States
48,841	52	Self-emp-inc	287,927	HS-grad	9	Married-civ-spouse	Exec-managerial	Wife	White	Female	15,024	0	40	United-States

Column	Column name	dtype	Is sorted	Null values	Unique values	Mean	Std	Min	Median	Max
0	age	Int64DType	False	0 (0.0%)	74 (0.2%)	38.6	13.7	17	37	90
1	workclass	ObjectDType	False	2799 (5.7%)	8 (< 0.1%)
2	fnlwgt	Int64DType	False	0 (0.0%)	28523 (58.4%)	1.90e+05	1.06e+05	12,285	178,142	1,490,400
3	education	ObjectDType	False	0 (0.0%)	16 (< 0.1%)
4	education-num	Int64DType	False	0 (0.0%)	16 (< 0.1%)	10.1	2.57	1	10	16
5	marital-status	ObjectDType	False	0 (0.0%)	7 (< 0.1%)
6	occupation	ObjectDType	False	2809 (5.8%)	14 (< 0.1%)
7	relationship	ObjectDType	False	0 (0.0%)	6 (< 0.1%)
8	race	ObjectDType	False	0 (0.0%)	5 (< 0.1%)
9	sex	ObjectDType	False	0 (0.0%)	2 (< 0.1%)
10	capital-gain	Int64DType	False	0 (0.0%)	123 (0.3%)	1.08e+03	7.45e+03	0	0	99,999
11	capital-loss	Int64DType	False	0 (0.0%)	99 (0.2%)	87.5	403.	0	0	4,356
12	hours-per-week	Int64DType	False	0 (0.0%)	96 (0.2%)	40.4	12.4	1	40	99
13	native-country	ObjectDType	False	857 (1.8%)	41 (< 0.1%)

Column 1	Column 2	Cramér's V	Pearson's Correlation
education	education-num	1.00
relationship	sex	0.657
marital-status	relationship	0.489
marital-status	sex	0.483
occupation	sex	0.419
workclass	occupation	0.377
race	native-country	0.317
age	marital-status	0.294
age	relationship	0.279
sex	hours-per-week	0.265
education	occupation	0.211
education-num	occupation	0.211
occupation	relationship	0.177
workclass	sex	0.171
relationship	hours-per-week	0.169
age	hours-per-week	0.160	0.0538
occupation	hours-per-week	0.156
age	sex	0.151
marital-status	occupation	0.135
marital-status	hours-per-week	0.134

Please enable javascript

The “Associations” tab

TableReport(data, open_tab="associations")

Processing column   1 / 14Processing column   2 / 14Processing column   3 / 14Processing column   4 / 14Processing column   5 / 14Processing column   6 / 14Processing column   7 / 14Processing column   8 / 14Processing column   9 / 14Processing column  10 / 14Processing column  11 / 14Processing column  12 / 14Processing column  13 / 14Processing column  14 / 14

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country
0	25	Private	226,802	11th	7	Never-married	Machine-op-inspct	Own-child	Black	Male	0	0	40	United-States
1	38	Private	89,814	HS-grad	9	Married-civ-spouse	Farming-fishing	Husband	White	Male	0	0	50	United-States
2	28	Local-gov	336,951	Assoc-acdm	12	Married-civ-spouse	Protective-serv	Husband	White	Male	0	0	40	United-States
3	44	Private	160,323	Some-college	10	Married-civ-spouse	Machine-op-inspct	Husband	Black	Male	7,688	0	40	United-States
4	18		103,497	Some-college	10	Never-married		Own-child	White	Female	0	0	30	United-States

48,837	27	Private	257,302	Assoc-acdm	12	Married-civ-spouse	Tech-support	Wife	White	Female	0	0	38	United-States
48,838	40	Private	154,374	HS-grad	9	Married-civ-spouse	Machine-op-inspct	Husband	White	Male	0	0	40	United-States
48,839	58	Private	151,910	HS-grad	9	Widowed	Adm-clerical	Unmarried	White	Female	0	0	40	United-States
48,840	22	Private	201,490	HS-grad	9	Never-married	Adm-clerical	Own-child	White	Male	0	0	20	United-States
48,841	52	Self-emp-inc	287,927	HS-grad	9	Married-civ-spouse	Exec-managerial	Wife	White	Female	15,024	0	40	United-States

Column	Column name	dtype	Is sorted	Null values	Unique values	Mean	Std	Min	Median	Max
0	age	Int64DType	False	0 (0.0%)	74 (0.2%)	38.6	13.7	17	37	90
1	workclass	ObjectDType	False	2799 (5.7%)	8 (< 0.1%)
2	fnlwgt	Int64DType	False	0 (0.0%)	28523 (58.4%)	1.90e+05	1.06e+05	12,285	178,142	1,490,400
3	education	ObjectDType	False	0 (0.0%)	16 (< 0.1%)
4	education-num	Int64DType	False	0 (0.0%)	16 (< 0.1%)	10.1	2.57	1	10	16
5	marital-status	ObjectDType	False	0 (0.0%)	7 (< 0.1%)
6	occupation	ObjectDType	False	2809 (5.8%)	14 (< 0.1%)
7	relationship	ObjectDType	False	0 (0.0%)	6 (< 0.1%)
8	race	ObjectDType	False	0 (0.0%)	5 (< 0.1%)
9	sex	ObjectDType	False	0 (0.0%)	2 (< 0.1%)
10	capital-gain	Int64DType	False	0 (0.0%)	123 (0.3%)	1.08e+03	7.45e+03	0	0	99,999
11	capital-loss	Int64DType	False	0 (0.0%)	99 (0.2%)	87.5	403.	0	0	4,356
12	hours-per-week	Int64DType	False	0 (0.0%)	96 (0.2%)	40.4	12.4	1	40	99
13	native-country	ObjectDType	False	857 (1.8%)	41 (< 0.1%)

Column 1	Column 2	Cramér's V	Pearson's Correlation
education	education-num	1.00
relationship	sex	0.657
marital-status	relationship	0.489
marital-status	sex	0.483
occupation	sex	0.419
workclass	occupation	0.377
race	native-country	0.317
age	marital-status	0.294
age	relationship	0.279
sex	hours-per-week	0.265
education	occupation	0.211
education-num	occupation	0.211
occupation	relationship	0.177
workclass	sex	0.171
relationship	hours-per-week	0.169
age	hours-per-week	0.160	0.0538
occupation	hours-per-week	0.156
age	sex	0.151
marital-status	occupation	0.135
marital-status	hours-per-week	0.134

Please enable javascript

Exploring the target variable

Besides dataframes, the TableReport handles series and mono- and bi-dimensional numpy arrays.

So, let’s take a closer look at the target variable, which indicates whether an individual’s income exceeds $50K per year. We can create a separate TableReport for the target variable to explore its distribution:

TableReport(target)

Processing column   1 / 1

	class
0	<=50K
1	<=50K
2	>50K
3	>50K
4	<=50K

48,837	<=50K
48,838	>50K
48,839	<=50K
48,840	<=50K
48,841	>50K

Column	Column name	dtype	Is sorted	Null values	Unique values	Mean	Std	Min	Median	Max
0	class	ObjectDType	False	0 (0.0%)	2 (< 0.1%)

Please enable javascript

Configuring and saving the `TableReport`

The TableReport can be saved on disk as an HTML.

TableReport(data).write_html("report.html")

Then, the report can be opened using any internet browser, with no need to run a Jupyter notebok or a python interactive console.

Working with big tables

TableReport(
    data, max_association_columns=3, max_plot_columns=3, open_tab="distributions"
)

Processing column   1 / 14Processing column   2 / 14Processing column   3 / 14Processing column   4 / 14Processing column   5 / 14Processing column   6 / 14Processing column   7 / 14Processing column   8 / 14Processing column   9 / 14Processing column  10 / 14Processing column  11 / 14Processing column  12 / 14Processing column  13 / 14Processing column  14 / 14

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country
0	25	Private	226,802	11th	7	Never-married	Machine-op-inspct	Own-child	Black	Male	0	0	40	United-States
1	38	Private	89,814	HS-grad	9	Married-civ-spouse	Farming-fishing	Husband	White	Male	0	0	50	United-States
2	28	Local-gov	336,951	Assoc-acdm	12	Married-civ-spouse	Protective-serv	Husband	White	Male	0	0	40	United-States
3	44	Private	160,323	Some-college	10	Married-civ-spouse	Machine-op-inspct	Husband	Black	Male	7,688	0	40	United-States
4	18		103,497	Some-college	10	Never-married		Own-child	White	Female	0	0	30	United-States

48,837	27	Private	257,302	Assoc-acdm	12	Married-civ-spouse	Tech-support	Wife	White	Female	0	0	38	United-States
48,838	40	Private	154,374	HS-grad	9	Married-civ-spouse	Machine-op-inspct	Husband	White	Male	0	0	40	United-States
48,839	58	Private	151,910	HS-grad	9	Widowed	Adm-clerical	Unmarried	White	Female	0	0	40	United-States
48,840	22	Private	201,490	HS-grad	9	Never-married	Adm-clerical	Own-child	White	Male	0	0	20	United-States
48,841	52	Self-emp-inc	287,927	HS-grad	9	Married-civ-spouse	Exec-managerial	Wife	White	Female	15,024	0	40	United-States

Column	Column name	dtype	Is sorted	Null values	Unique values	Mean	Std	Min	Median	Max
0	age	Int64DType	False	0 (0.0%)	74 (0.2%)	38.6	13.7	17	37	90
1	workclass	ObjectDType	False	2799 (5.7%)	8 (< 0.1%)
2	fnlwgt	Int64DType	False	0 (0.0%)	28523 (58.4%)	1.90e+05	1.06e+05	12,285	178,142	1,490,400
3	education	ObjectDType	False	0 (0.0%)	16 (< 0.1%)
4	education-num	Int64DType	False	0 (0.0%)	16 (< 0.1%)	10.1	2.57	1	10	16
5	marital-status	ObjectDType	False	0 (0.0%)	7 (< 0.1%)
6	occupation	ObjectDType	False	2809 (5.8%)	14 (< 0.1%)
7	relationship	ObjectDType	False	0 (0.0%)	6 (< 0.1%)
8	race	ObjectDType	False	0 (0.0%)	5 (< 0.1%)
9	sex	ObjectDType	False	0 (0.0%)	2 (< 0.1%)
10	capital-gain	Int64DType	False	0 (0.0%)	123 (0.3%)	1.08e+03	7.45e+03	0	0	99,999
11	capital-loss	Int64DType	False	0 (0.0%)	99 (0.2%)	87.5	403.	0	0	4,356
12	hours-per-week	Int64DType	False	0 (0.0%)	96 (0.2%)	40.4	12.4	1	40	99
13	native-country	ObjectDType	False	857 (1.8%)	41 (< 0.1%)

Please enable javascript

Conclusions

The TableReport is a powerful EDA tool
It shows a rich preview of the content of the dataframe
It provides precomputed statistics for all the columns
It prepares distribution plots for each column
It measures the association between columns

Exploring dataframes with skrub

Introduction

Exploring data with Pandas tools

Exploring data with the TableReport

age

workclass

fnlwgt

education

education-num

marital-status

occupation

relationship

race

sex

capital-gain

capital-loss

hours-per-week

native-country

age

workclass

fnlwgt

education

education-num

marital-status

occupation

relationship

race

sex

capital-gain

capital-loss

hours-per-week

native-country

Please enable javascript

The “Stats” tab

age

workclass

fnlwgt

education

education-num

marital-status

occupation

relationship

race

sex

capital-gain

capital-loss

hours-per-week

native-country

age

workclass

fnlwgt

education

education-num

marital-status

occupation

relationship

race

sex

capital-gain

capital-loss

hours-per-week

native-country

Please enable javascript

The “Distributions” tab

age

workclass

fnlwgt

education

education-num

marital-status

occupation

relationship

race

sex

capital-gain

capital-loss

hours-per-week

native-country

age

workclass

Exploring data with the `TableReport`

Configuring and saving the `TableReport`