Exploring dataframes with skrub

Introduction

In this chapter, we will show how we use the skrub TableReport to explore tabular data. We will use the Adult Census dataset as our example table, and perform some exploratory analysis to learn about the characteristics of the data.

First, let’s import the necessary libraries and load the dataset.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import fetch_openml

# Load the Adult Census dataset
data =  pd.read_csv("../data/adult_census/data.csv")
target =  pd.read_csv("../data/adult_census/target.csv")

Now that we have a dataframe we can work with, here is a list of features of the data we would like to find out:

  • The size of the dataset.
  • The data types and names of the columns.
  • The distribution of values in the columns.
  • Whether null values are present, in what measure and where.
  • Discrete/categorical features, and their cardinality.
  • Columns strongly correlated with each other.

Exploring data with Pandas tools

Let’s first explore the data using Pandas only.

We can get an idea of the content of the table by printing the first few lines, which gives an idea of the datatypes and the columns we are dealing with.

data.head(5)
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
0 25 Private 226802 11th 7 Never-married Machine-op-inspct Own-child Black Male 0 0 40 United-States
1 38 Private 89814 HS-grad 9 Married-civ-spouse Farming-fishing Husband White Male 0 0 50 United-States
2 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States
3 44 Private 160323 Some-college 10 Married-civ-spouse Machine-op-inspct Husband Black Male 7688 0 40 United-States
4 18 NaN 103497 Some-college 10 Never-married NaN Own-child White Female 0 0 30 United-States

If we want to have a simpler view of the datatypes in the dataframe, we must use data.info():

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       46043 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      46033 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  47985 non-null  object
dtypes: int64(6), object(8)
memory usage: 5.2+ MB

With .info() we can find out the shape of the dataframe (the number of rows and columns), the datatype and the number of non-null values for each column.

We can also get a richer summary of the data with the .describe() method:

data.describe(include="all")
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
count 48842.000000 46043 4.884200e+04 48842 48842.000000 48842 46033 48842 48842 48842 48842.000000 48842.000000 48842.000000 47985
unique NaN 8 NaN 16 NaN 7 14 6 5 2 NaN NaN NaN 41
top NaN Private NaN HS-grad NaN Married-civ-spouse Prof-specialty Husband White Male NaN NaN NaN United-States
freq NaN 33906 NaN 15784 NaN 22379 6172 19716 41762 32650 NaN NaN NaN 43832
mean 38.643585 NaN 1.896641e+05 NaN 10.078089 NaN NaN NaN NaN NaN 1079.067626 87.502314 40.422382 NaN
std 13.710510 NaN 1.056040e+05 NaN 2.570973 NaN NaN NaN NaN NaN 7452.019058 403.004552 12.391444 NaN
min 17.000000 NaN 1.228500e+04 NaN 1.000000 NaN NaN NaN NaN NaN 0.000000 0.000000 1.000000 NaN
25% 28.000000 NaN 1.175505e+05 NaN 9.000000 NaN NaN NaN NaN NaN 0.000000 0.000000 40.000000 NaN
50% 37.000000 NaN 1.781445e+05 NaN 10.000000 NaN NaN NaN NaN NaN 0.000000 0.000000 40.000000 NaN
75% 48.000000 NaN 2.376420e+05 NaN 12.000000 NaN NaN NaN NaN NaN 0.000000 0.000000 45.000000 NaN
max 90.000000 NaN 1.490400e+06 NaN 16.000000 NaN NaN NaN NaN NaN 99999.000000 4356.000000 99.000000 NaN

This gives us useful information about all the features in the dataset. Among others, we can find the number of unique values in each column, various statistics for the numerical columns and the number of null values.

Exploring data with the TableReport

Now, let’s create a TableReport to explore the dataset.

from skrub import TableReport
TableReport(data, verbose=0)

Please enable javascript

The skrub table reports need javascript to display correctly. If you are displaying a report in a Jupyter notebook and you see this message, you may need to re-execute the cell or to trust the notebook (button on the top right or "File > Trust notebook").

The “Stats” tab

TableReport(data, open_tab="stats")
Processing column   1 / 14Processing column   2 / 14Processing column   3 / 14Processing column   4 / 14Processing column   5 / 14Processing column   6 / 14Processing column   7 / 14Processing column   8 / 14Processing column   9 / 14Processing column  10 / 14Processing column  11 / 14Processing column  12 / 14Processing column  13 / 14Processing column  14 / 14

Please enable javascript

The skrub table reports need javascript to display correctly. If you are displaying a report in a Jupyter notebook and you see this message, you may need to re-execute the cell or to trust the notebook (button on the top right or "File > Trust notebook").

Filters

Pre-made column filters are also available, allowing to select columns by dtype or other characteristics. Filters are shared across tabs.

The “Distributions” tab

TableReport(data, open_tab="distributions")
Processing column   1 / 14Processing column   2 / 14Processing column   3 / 14Processing column   4 / 14Processing column   5 / 14Processing column   6 / 14Processing column   7 / 14Processing column   8 / 14Processing column   9 / 14Processing column  10 / 14Processing column  11 / 14Processing column  12 / 14Processing column  13 / 14Processing column  14 / 14

Please enable javascript

The skrub table reports need javascript to display correctly. If you are displaying a report in a Jupyter notebook and you see this message, you may need to re-execute the cell or to trust the notebook (button on the top right or "File > Trust notebook").

The “Associations” tab

TableReport(data, open_tab="associations")
Processing column   1 / 14Processing column   2 / 14Processing column   3 / 14Processing column   4 / 14Processing column   5 / 14Processing column   6 / 14Processing column   7 / 14Processing column   8 / 14Processing column   9 / 14Processing column  10 / 14Processing column  11 / 14Processing column  12 / 14Processing column  13 / 14Processing column  14 / 14

Please enable javascript

The skrub table reports need javascript to display correctly. If you are displaying a report in a Jupyter notebook and you see this message, you may need to re-execute the cell or to trust the notebook (button on the top right or "File > Trust notebook").

Exploring the target variable

Besides dataframes, the TableReport handles series and mono- and bi-dimensional numpy arrays.

So, let’s take a closer look at the target variable, which indicates whether an individual’s income exceeds $50K per year. We can create a separate TableReport for the target variable to explore its distribution:

TableReport(target)
Processing column   1 / 1

Please enable javascript

The skrub table reports need javascript to display correctly. If you are displaying a report in a Jupyter notebook and you see this message, you may need to re-execute the cell or to trust the notebook (button on the top right or "File > Trust notebook").

Configuring and saving the TableReport

The TableReport can be saved on disk as an HTML.

TableReport(data).write_html("report.html")

Then, the report can be opened using any internet browser, with no need to run a Jupyter notebok or a python interactive console.

Working with big tables

TableReport(
    data, max_association_columns=3, max_plot_columns=3, open_tab="distributions"
)
Processing column   1 / 14Processing column   2 / 14Processing column   3 / 14Processing column   4 / 14Processing column   5 / 14Processing column   6 / 14Processing column   7 / 14Processing column   8 / 14Processing column   9 / 14Processing column  10 / 14Processing column  11 / 14Processing column  12 / 14Processing column  13 / 14Processing column  14 / 14

Please enable javascript

The skrub table reports need javascript to display correctly. If you are displaying a report in a Jupyter notebook and you see this message, you may need to re-execute the cell or to trust the notebook (button on the top right or "File > Trust notebook").

Conclusions

  • The TableReport is a powerful EDA tool
  • It shows a rich preview of the content of the dataframe
  • It provides precomputed statistics for all the columns
  • It prepares distribution plots for each column
  • It measures the association between columns