2  Exploring dataframes with skrub

2.1 Introduction

In this chapter, we will show how we use the skrub TableReport to explore tabular data. We will use the Adult Census dataset as our example table, and perform some exploratory analysis to learn about the characteristics of the data.

First, let’s import the necessary libraries and load the dataset.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import fetch_openml

# Load the Adult Census dataset
data =  pd.read_csv("../data/adult_census/data.csv")
target =  pd.read_csv("../data/adult_census/target.csv")

Now that we have a dataframe we can work with, here is a list of features of the data we would like to find out:

  • The size of the dataset.
  • The data types and names of the columns.
  • The distribution of values in the columns.
  • Whether null values are present, in what measure and where.
  • Discrete/categorical features, and their cardinality.
  • Columns strongly correlated with each other.

2.2 Exploring data with Pandas tools

Let’s first explore the data using Pandas only.

We can get an idea of the content of the table by printing the first few lines, which gives an idea of the datatypes and the columns we are dealing with.

data.head(5)
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
0 25 Private 226802 11th 7 Never-married Machine-op-inspct Own-child Black Male 0 0 40 United-States
1 38 Private 89814 HS-grad 9 Married-civ-spouse Farming-fishing Husband White Male 0 0 50 United-States
2 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States
3 44 Private 160323 Some-college 10 Married-civ-spouse Machine-op-inspct Husband Black Male 7688 0 40 United-States
4 18 NaN 103497 Some-college 10 Never-married NaN Own-child White Female 0 0 30 United-States

If we want to have a simpler view of the datatypes in the dataframe, we must use data.info():

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       46043 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      46033 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  47985 non-null  object
dtypes: int64(6), object(8)
memory usage: 5.2+ MB

With .info() we can find out the shape of the dataframe (the number of rows and columns), the datatype and the number of non-null values for each column.

We can also get a richer summary of the data with the .describe() method:

data.describe(include="all")
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
count 48842.000000 46043 4.884200e+04 48842 48842.000000 48842 46033 48842 48842 48842 48842.000000 48842.000000 48842.000000 47985
unique NaN 8 NaN 16 NaN 7 14 6 5 2 NaN NaN NaN 41
top NaN Private NaN HS-grad NaN Married-civ-spouse Prof-specialty Husband White Male NaN NaN NaN United-States
freq NaN 33906 NaN 15784 NaN 22379 6172 19716 41762 32650 NaN NaN NaN 43832
mean 38.643585 NaN 1.896641e+05 NaN 10.078089 NaN NaN NaN NaN NaN 1079.067626 87.502314 40.422382 NaN
std 13.710510 NaN 1.056040e+05 NaN 2.570973 NaN NaN NaN NaN NaN 7452.019058 403.004552 12.391444 NaN
min 17.000000 NaN 1.228500e+04 NaN 1.000000 NaN NaN NaN NaN NaN 0.000000 0.000000 1.000000 NaN
25% 28.000000 NaN 1.175505e+05 NaN 9.000000 NaN NaN NaN NaN NaN 0.000000 0.000000 40.000000 NaN
50% 37.000000 NaN 1.781445e+05 NaN 10.000000 NaN NaN NaN NaN NaN 0.000000 0.000000 40.000000 NaN
75% 48.000000 NaN 2.376420e+05 NaN 12.000000 NaN NaN NaN NaN NaN 0.000000 0.000000 45.000000 NaN
max 90.000000 NaN 1.490400e+06 NaN 16.000000 NaN NaN NaN NaN NaN 99999.000000 4356.000000 99.000000 NaN

This gives us useful information about all the features in the dataset. Among others, we can find the number of unique values in each column, various statistics for the numerical columns and the number of null values.

2.3 Exploring data with the TableReport

Now, let’s create a TableReport to explore the dataset.

from skrub import TableReport
TableReport(data, verbose=0)

Please enable javascript

The skrub table reports need javascript to display correctly. If you are displaying a report in a Jupyter notebook and you see this message, you may need to re-execute the cell or to trust the notebook (button on the top right or "File > Trust notebook").

2.3.1 Default view of the TableReport

The TableReport gives us a comprehensive overview of the dataset. The default view shows all the columns in the dataset, and allows to select and copy the content of the cells shown in the preview.

The TableReport is intended to show a preview of the data, so it does not contain all the rows in the dataset, rather it shows only the first and last few rows by default. Similarly, it stores only the top 10 most frequent values for each column, if column distributions are plotted.

2.3.2 The “Stats” tab

TableReport(data, open_tab="stats")

Please enable javascript

The skrub table reports need javascript to display correctly. If you are displaying a report in a Jupyter notebook and you see this message, you may need to re-execute the cell or to trust the notebook (button on the top right or "File > Trust notebook").

The “Stats” tab provides a variety of descriptive statistics for each column in the dataset. This includes:

  • The column name
  • The detected data type of the column
  • Whether the column is sorted or not
  • The number of null values in the column, as well as the percentage
  • The number of unique values in the column

For numerical columns, additional statistics are provided:

  • Mean
  • Standard deviation
  • Minimum and maximum values
  • Median

Stat columns can also be sorted, for example to quickly identify which columns contain the most nulls, or have the largest cardinality (number of unique values).

NoneFilters

Pre-made column filters are also available, allowing to select columns by dtype or other characteristics. Filters are shared across tabs.

2.3.3 The “Distributions” tab

TableReport(data, open_tab="distributions")

Please enable javascript

The skrub table reports need javascript to display correctly. If you are displaying a report in a Jupyter notebook and you see this message, you may need to re-execute the cell or to trust the notebook (button on the top right or "File > Trust notebook").

The “Distributions” tab provides visualizations of the distributions of values in each column. This includes histograms for numerical columns and bar plots for categorical columns.

The “Distributions” tab helps with detecting potential issues in the data, such as:

  • Skewed distributions
  • Outliers
  • Unexpected value frequencies

For example, in this dataset we can see that some columns are heavily skewed, such as “workclass”, “race”, and “native-country”: this is important information to keep track of, because these columns may require special handling during data preprocessing or modeling.

Additionally, the “Distributions” tab allows to select columns manually, so that they can be added to a script and selected for further analysis or modeling.

CautionOutlier detection

The TableReport detects outliers using a simple interquartile test, marking as outliers all values that are beyond the IQR. This is a simple heuristic, and should not be treated as perfect. If your problem requires reliable outlier detection, you should not rely exclusively on what the TableReport shows.

2.3.4 The “Associations” tab

TableReport(data, open_tab="associations")

Please enable javascript

The skrub table reports need javascript to display correctly. If you are displaying a report in a Jupyter notebook and you see this message, you may need to re-execute the cell or to trust the notebook (button on the top right or "File > Trust notebook").

The “Associations” tab provides insights into the relationships between different columns in the dataset. It shows Pearson’s correlation coefficient for numerical columns, as well as Cramér’s V for all columns.

While this is a somewhat rough measure of association, it can help identify potential relationships worth exploring further during the analysis, and highlights highly correlated columns: depending on the modeling technique used, these may need to be handled specially to avoid issues with multicollinearity.

In this example, we can see that “education-num” and “education” have perfect correlation, which means that one of the two columns can be dropped without losing information.

2.4 Exploring the target variable

Besides dataframes, the TableReport handles series and mono- and bi-dimensional numpy arrays.

So, let’s take a closer look at the target variable, which indicates whether an individual’s income exceeds $50K per year. We can create a separate TableReport for the target variable to explore its distribution:

TableReport(target)

Please enable javascript

The skrub table reports need javascript to display correctly. If you are displaying a report in a Jupyter notebook and you see this message, you may need to re-execute the cell or to trust the notebook (button on the top right or "File > Trust notebook").

2.5 Configuring and saving the TableReport

The TableReport can be saved on disk as an HTML.

TableReport(data).write_html("report.html")

Then, the report can be opened using any internet browser, with no need to run a Jupyter notebok or a python interactive console.

It is possible to configure various parameters using the skrub global config. For example, it is possible to replace the default Pandas or Polars dataframe display with the TableReport by using patch_display (and unpatch_display):

from skrub import patch_display, unpatch_display

# replace the default pandas repr 
patch_display()
data
Processing column   1 / 14Processing column   2 / 14Processing column   3 / 14Processing column   4 / 14Processing column   5 / 14Processing column   6 / 14Processing column   7 / 14Processing column   8 / 14Processing column   9 / 14Processing column  10 / 14Processing column  11 / 14Processing column  12 / 14Processing column  13 / 14Processing column  14 / 14

Please enable javascript

The skrub table reports need javascript to display correctly. If you are displaying a report in a Jupyter notebook and you see this message, you may need to re-execute the cell or to trust the notebook (button on the top right or "File > Trust notebook").

To disable, use unpatch_display:

unpatch_display()
data
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
0 25 Private 226802 11th 7 Never-married Machine-op-inspct Own-child Black Male 0 0 40 United-States
1 38 Private 89814 HS-grad 9 Married-civ-spouse Farming-fishing Husband White Male 0 0 50 United-States
2 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States
3 44 Private 160323 Some-college 10 Married-civ-spouse Machine-op-inspct Husband Black Male 7688 0 40 United-States
4 18 NaN 103497 Some-college 10 Never-married NaN Own-child White Female 0 0 30 United-States
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48837 27 Private 257302 Assoc-acdm 12 Married-civ-spouse Tech-support Wife White Female 0 0 38 United-States
48838 40 Private 154374 HS-grad 9 Married-civ-spouse Machine-op-inspct Husband White Male 0 0 40 United-States
48839 58 Private 151910 HS-grad 9 Widowed Adm-clerical Unmarried White Female 0 0 40 United-States
48840 22 Private 201490 HS-grad 9 Never-married Adm-clerical Own-child White Male 0 0 20 United-States
48841 52 Self-emp-inc 287927 HS-grad 9 Married-civ-spouse Exec-managerial Wife White Female 15024 0 40 United-States

48842 rows × 14 columns

More detail on the skrub configuration is reported in the User Guide.

2.6 Working with big tables

Plotting and measuring the column correlations are expensive operations and may take a long time, so when the dataframe under study is large it may be more convenient to skip them for quicker development.

The max_plot_columns and max_association_columns parameters allow to set a threshold on the number of columns: the TableReport will skip the respective task if the number of colums in the dataframe is larger than the threshold:

When the number of columns is too large, an information message is shown in the respective tab instead of the plots or correlations.

TableReport(
    data, max_association_columns=3, max_plot_columns=3, open_tab="distributions"
)

Please enable javascript

The skrub table reports need javascript to display correctly. If you are displaying a report in a Jupyter notebook and you see this message, you may need to re-execute the cell or to trust the notebook (button on the top right or "File > Trust notebook").

2.7 Conclusions

In this chapter we have learned how the TableReport can be used to speed up data exploration, allowing us to find possible criticalities in the data.

We covered:

  • Creating and configuring a TableReport for fast, interactive data exploration
  • Exploring column statistics, value distributions, and associations visually
  • Detecting nulls, outliers, and highly correlated columns at a glance
  • Filtering columns by type or characteristics using built-in filters
  • Saving and sharing interactive reports as standalone HTML files
  • Adjusting TableReport settings for large datasets to optimize performance

In the next chapter, we will find out how to address some of the possible problems using the skrub Cleaner.