In this chapter, we will show how we use the skrub TableReport to explore tabular data. We will use the Adult Census dataset as our example table, and perform some exploratory analysis to learn about the characteristics of the data.
First, let’s import the necessary libraries and load the dataset.
import pandas as pdfrom sklearn.ensemble import RandomForestClassifierfrom sklearn.datasets import fetch_openml# Load the Adult Census datasetdata = pd.read_csv("../data/adult_census/data.csv")target = pd.read_csv("../data/adult_census/target.csv")
Now that we have a dataframe we can work with, here is a list of features of the data we would like to find out:
The size of the dataset.
The data types and names of the columns.
The distribution of values in the columns.
Whether null values are present, in what measure and where.
Discrete/categorical features, and their cardinality.
Columns strongly correlated with each other.
2.2 Exploring data with Pandas tools
Let’s first explore the data using Pandas only.
We can get an idea of the content of the table by printing the first few lines, which gives an idea of the datatypes and the columns we are dealing with.
data.head(5)
age
workclass
fnlwgt
education
education-num
marital-status
occupation
relationship
race
sex
capital-gain
capital-loss
hours-per-week
native-country
0
25
Private
226802
11th
7
Never-married
Machine-op-inspct
Own-child
Black
Male
0
0
40
United-States
1
38
Private
89814
HS-grad
9
Married-civ-spouse
Farming-fishing
Husband
White
Male
0
0
50
United-States
2
28
Local-gov
336951
Assoc-acdm
12
Married-civ-spouse
Protective-serv
Husband
White
Male
0
0
40
United-States
3
44
Private
160323
Some-college
10
Married-civ-spouse
Machine-op-inspct
Husband
Black
Male
7688
0
40
United-States
4
18
NaN
103497
Some-college
10
Never-married
NaN
Own-child
White
Female
0
0
30
United-States
If we want to have a simpler view of the datatypes in the dataframe, we must use data.info():
With .info() we can find out the shape of the dataframe (the number of rows and columns), the datatype and the number of non-null values for each column.
We can also get a richer summary of the data with the .describe() method:
data.describe(include="all")
age
workclass
fnlwgt
education
education-num
marital-status
occupation
relationship
race
sex
capital-gain
capital-loss
hours-per-week
native-country
count
48842.000000
46043
4.884200e+04
48842
48842.000000
48842
46033
48842
48842
48842
48842.000000
48842.000000
48842.000000
47985
unique
NaN
8
NaN
16
NaN
7
14
6
5
2
NaN
NaN
NaN
41
top
NaN
Private
NaN
HS-grad
NaN
Married-civ-spouse
Prof-specialty
Husband
White
Male
NaN
NaN
NaN
United-States
freq
NaN
33906
NaN
15784
NaN
22379
6172
19716
41762
32650
NaN
NaN
NaN
43832
mean
38.643585
NaN
1.896641e+05
NaN
10.078089
NaN
NaN
NaN
NaN
NaN
1079.067626
87.502314
40.422382
NaN
std
13.710510
NaN
1.056040e+05
NaN
2.570973
NaN
NaN
NaN
NaN
NaN
7452.019058
403.004552
12.391444
NaN
min
17.000000
NaN
1.228500e+04
NaN
1.000000
NaN
NaN
NaN
NaN
NaN
0.000000
0.000000
1.000000
NaN
25%
28.000000
NaN
1.175505e+05
NaN
9.000000
NaN
NaN
NaN
NaN
NaN
0.000000
0.000000
40.000000
NaN
50%
37.000000
NaN
1.781445e+05
NaN
10.000000
NaN
NaN
NaN
NaN
NaN
0.000000
0.000000
40.000000
NaN
75%
48.000000
NaN
2.376420e+05
NaN
12.000000
NaN
NaN
NaN
NaN
NaN
0.000000
0.000000
45.000000
NaN
max
90.000000
NaN
1.490400e+06
NaN
16.000000
NaN
NaN
NaN
NaN
NaN
99999.000000
4356.000000
99.000000
NaN
This gives us useful information about all the features in the dataset. Among others, we can find the number of unique values in each column, various statistics for the numerical columns and the number of null values.
2.3 Exploring data with the TableReport
Now, let’s create a TableReport to explore the dataset.
from skrub import TableReportTableReport(data, verbose=0)
Click a table cell for more info about its column.
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column
Column name
dtype
Is sorted
Null values
Unique values
Mean
Std
Min
Median
Max
0
age
Int64DType
False
0 (0.0%)
74 (0.2%)
38.6
13.7
17
37
90
1
workclass
ObjectDType
False
2799 (5.7%)
8 (< 0.1%)
2
fnlwgt
Int64DType
False
0 (0.0%)
28523 (58.4%)
1.90e+05
1.06e+05
12,285
178,142
1,490,400
3
education
ObjectDType
False
0 (0.0%)
16 (< 0.1%)
4
education-num
Int64DType
False
0 (0.0%)
16 (< 0.1%)
10.1
2.57
1
10
16
5
marital-status
ObjectDType
False
0 (0.0%)
7 (< 0.1%)
6
occupation
ObjectDType
False
2809 (5.8%)
14 (< 0.1%)
7
relationship
ObjectDType
False
0 (0.0%)
6 (< 0.1%)
8
race
ObjectDType
False
0 (0.0%)
5 (< 0.1%)
9
sex
ObjectDType
False
0 (0.0%)
2 (< 0.1%)
10
capital-gain
Int64DType
False
0 (0.0%)
123 (0.3%)
1.08e+03
7.45e+03
0
0
99,999
11
capital-loss
Int64DType
False
0 (0.0%)
99 (0.2%)
87.5
403.
0
0
4,356
12
hours-per-week
Int64DType
False
0 (0.0%)
96 (0.2%)
40.4
12.4
1
40
99
13
native-country
ObjectDType
False
857 (1.8%)
41 (< 0.1%)
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
To construct a list of column names that you can easily copy-paste
(in the box), select some columns using the checkboxes next
to the column names or the "Select all" button.
The table below shows the strength of association between the most similar columns in the dataframe.
Cramér's V statistic is a number between 0 and 1.
When it is close to 1 the columns are strongly associated — they contain similar information.
In this case, one of them may be redundant and for some models (such as linear models) it might be beneficial to remove it.
Please enable javascript
The skrub table reports need javascript to display correctly. If you are
displaying a report in a Jupyter notebook and you see this message, you may need to
re-execute the cell or to trust the notebook (button on the top right or
"File > Trust notebook").
2.3.1 Default view of the TableReport
The TableReport gives us a comprehensive overview of the dataset. The default view shows all the columns in the dataset, and allows to select and copy the content of the cells shown in the preview.
The TableReport is intended to show a preview of the data, so it does not contain all the rows in the dataset, rather it shows only the first and last few rows by default. Similarly, it stores only the top 10 most frequent values for each column, if column distributions are plotted.
2.3.2 The “Stats” tab
TableReport(data, open_tab="stats")
Click a table cell for more info about its column.
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column
Column name
dtype
Is sorted
Null values
Unique values
Mean
Std
Min
Median
Max
0
age
Int64DType
False
0 (0.0%)
74 (0.2%)
38.6
13.7
17
37
90
1
workclass
ObjectDType
False
2799 (5.7%)
8 (< 0.1%)
2
fnlwgt
Int64DType
False
0 (0.0%)
28523 (58.4%)
1.90e+05
1.06e+05
12,285
178,142
1,490,400
3
education
ObjectDType
False
0 (0.0%)
16 (< 0.1%)
4
education-num
Int64DType
False
0 (0.0%)
16 (< 0.1%)
10.1
2.57
1
10
16
5
marital-status
ObjectDType
False
0 (0.0%)
7 (< 0.1%)
6
occupation
ObjectDType
False
2809 (5.8%)
14 (< 0.1%)
7
relationship
ObjectDType
False
0 (0.0%)
6 (< 0.1%)
8
race
ObjectDType
False
0 (0.0%)
5 (< 0.1%)
9
sex
ObjectDType
False
0 (0.0%)
2 (< 0.1%)
10
capital-gain
Int64DType
False
0 (0.0%)
123 (0.3%)
1.08e+03
7.45e+03
0
0
99,999
11
capital-loss
Int64DType
False
0 (0.0%)
99 (0.2%)
87.5
403.
0
0
4,356
12
hours-per-week
Int64DType
False
0 (0.0%)
96 (0.2%)
40.4
12.4
1
40
99
13
native-country
ObjectDType
False
857 (1.8%)
41 (< 0.1%)
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
To construct a list of column names that you can easily copy-paste
(in the box), select some columns using the checkboxes next
to the column names or the "Select all" button.
The table below shows the strength of association between the most similar columns in the dataframe.
Cramér's V statistic is a number between 0 and 1.
When it is close to 1 the columns are strongly associated — they contain similar information.
In this case, one of them may be redundant and for some models (such as linear models) it might be beneficial to remove it.
Please enable javascript
The skrub table reports need javascript to display correctly. If you are
displaying a report in a Jupyter notebook and you see this message, you may need to
re-execute the cell or to trust the notebook (button on the top right or
"File > Trust notebook").
The “Stats” tab provides a variety of descriptive statistics for each column in the dataset. This includes:
The column name
The detected data type of the column
Whether the column is sorted or not
The number of null values in the column, as well as the percentage
The number of unique values in the column
For numerical columns, additional statistics are provided:
Mean
Standard deviation
Minimum and maximum values
Median
Stat columns can also be sorted, for example to quickly identify which columns contain the most nulls, or have the largest cardinality (number of unique values).
NoneFilters
Pre-made column filters are also available, allowing to select columns by dtype or other characteristics. Filters are shared across tabs.
2.3.3 The “Distributions” tab
TableReport(data, open_tab="distributions")
Click a table cell for more info about its column.
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column
Column name
dtype
Is sorted
Null values
Unique values
Mean
Std
Min
Median
Max
0
age
Int64DType
False
0 (0.0%)
74 (0.2%)
38.6
13.7
17
37
90
1
workclass
ObjectDType
False
2799 (5.7%)
8 (< 0.1%)
2
fnlwgt
Int64DType
False
0 (0.0%)
28523 (58.4%)
1.90e+05
1.06e+05
12,285
178,142
1,490,400
3
education
ObjectDType
False
0 (0.0%)
16 (< 0.1%)
4
education-num
Int64DType
False
0 (0.0%)
16 (< 0.1%)
10.1
2.57
1
10
16
5
marital-status
ObjectDType
False
0 (0.0%)
7 (< 0.1%)
6
occupation
ObjectDType
False
2809 (5.8%)
14 (< 0.1%)
7
relationship
ObjectDType
False
0 (0.0%)
6 (< 0.1%)
8
race
ObjectDType
False
0 (0.0%)
5 (< 0.1%)
9
sex
ObjectDType
False
0 (0.0%)
2 (< 0.1%)
10
capital-gain
Int64DType
False
0 (0.0%)
123 (0.3%)
1.08e+03
7.45e+03
0
0
99,999
11
capital-loss
Int64DType
False
0 (0.0%)
99 (0.2%)
87.5
403.
0
0
4,356
12
hours-per-week
Int64DType
False
0 (0.0%)
96 (0.2%)
40.4
12.4
1
40
99
13
native-country
ObjectDType
False
857 (1.8%)
41 (< 0.1%)
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
To construct a list of column names that you can easily copy-paste
(in the box), select some columns using the checkboxes next
to the column names or the "Select all" button.
The table below shows the strength of association between the most similar columns in the dataframe.
Cramér's V statistic is a number between 0 and 1.
When it is close to 1 the columns are strongly associated — they contain similar information.
In this case, one of them may be redundant and for some models (such as linear models) it might be beneficial to remove it.
Please enable javascript
The skrub table reports need javascript to display correctly. If you are
displaying a report in a Jupyter notebook and you see this message, you may need to
re-execute the cell or to trust the notebook (button on the top right or
"File > Trust notebook").
The “Distributions” tab provides visualizations of the distributions of values in each column. This includes histograms for numerical columns and bar plots for categorical columns.
The “Distributions” tab helps with detecting potential issues in the data, such as:
Skewed distributions
Outliers
Unexpected value frequencies
For example, in this dataset we can see that some columns are heavily skewed, such as “workclass”, “race”, and “native-country”: this is important information to keep track of, because these columns may require special handling during data preprocessing or modeling.
Additionally, the “Distributions” tab allows to select columns manually, so that they can be added to a script and selected for further analysis or modeling.
CautionOutlier detection
The TableReport detects outliers using a simple interquartile test, marking as outliers all values that are beyond the IQR. This is a simple heuristic, and should not be treated as perfect. If your problem requires reliable outlier detection, you should not rely exclusively on what the TableReport shows.
2.3.4 The “Associations” tab
TableReport(data, open_tab="associations")
Click a table cell for more info about its column.
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column
Column name
dtype
Is sorted
Null values
Unique values
Mean
Std
Min
Median
Max
0
age
Int64DType
False
0 (0.0%)
74 (0.2%)
38.6
13.7
17
37
90
1
workclass
ObjectDType
False
2799 (5.7%)
8 (< 0.1%)
2
fnlwgt
Int64DType
False
0 (0.0%)
28523 (58.4%)
1.90e+05
1.06e+05
12,285
178,142
1,490,400
3
education
ObjectDType
False
0 (0.0%)
16 (< 0.1%)
4
education-num
Int64DType
False
0 (0.0%)
16 (< 0.1%)
10.1
2.57
1
10
16
5
marital-status
ObjectDType
False
0 (0.0%)
7 (< 0.1%)
6
occupation
ObjectDType
False
2809 (5.8%)
14 (< 0.1%)
7
relationship
ObjectDType
False
0 (0.0%)
6 (< 0.1%)
8
race
ObjectDType
False
0 (0.0%)
5 (< 0.1%)
9
sex
ObjectDType
False
0 (0.0%)
2 (< 0.1%)
10
capital-gain
Int64DType
False
0 (0.0%)
123 (0.3%)
1.08e+03
7.45e+03
0
0
99,999
11
capital-loss
Int64DType
False
0 (0.0%)
99 (0.2%)
87.5
403.
0
0
4,356
12
hours-per-week
Int64DType
False
0 (0.0%)
96 (0.2%)
40.4
12.4
1
40
99
13
native-country
ObjectDType
False
857 (1.8%)
41 (< 0.1%)
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
To construct a list of column names that you can easily copy-paste
(in the box), select some columns using the checkboxes next
to the column names or the "Select all" button.
The table below shows the strength of association between the most similar columns in the dataframe.
Cramér's V statistic is a number between 0 and 1.
When it is close to 1 the columns are strongly associated — they contain similar information.
In this case, one of them may be redundant and for some models (such as linear models) it might be beneficial to remove it.
Please enable javascript
The skrub table reports need javascript to display correctly. If you are
displaying a report in a Jupyter notebook and you see this message, you may need to
re-execute the cell or to trust the notebook (button on the top right or
"File > Trust notebook").
The “Associations” tab provides insights into the relationships between different columns in the dataset. It shows Pearson’s correlation coefficient for numerical columns, as well as Cramér’s V for all columns.
While this is a somewhat rough measure of association, it can help identify potential relationships worth exploring further during the analysis, and highlights highly correlated columns: depending on the modeling technique used, these may need to be handled specially to avoid issues with multicollinearity.
In this example, we can see that “education-num” and “education” have perfect correlation, which means that one of the two columns can be dropped without losing information.
2.4 Exploring the target variable
Besides dataframes, the TableReport handles series and mono- and bi-dimensional numpy arrays.
So, let’s take a closer look at the target variable, which indicates whether an individual’s income exceeds $50K per year. We can create a separate TableReport for the target variable to explore its distribution:
TableReport(target)
Click a table cell for more info about its column.
class
0
<=50K
1
<=50K
2
>50K
3
>50K
4
<=50K
48,837
<=50K
48,838
>50K
48,839
<=50K
48,840
<=50K
48,841
>50K
class
ObjectDType
Null values
0 (0.0%)
Unique values
2 (< 0.1%)
Most frequent values
<=50K
>50K
List:
['<=50K', '>50K']
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column
Column name
dtype
Is sorted
Null values
Unique values
Mean
Std
Min
Median
Max
0
class
ObjectDType
False
0 (0.0%)
2 (< 0.1%)
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
To construct a list of column names that you can easily copy-paste
(in the box), select some columns using the checkboxes next
to the column names or the "Select all" button.
class
ObjectDType
Null values
0 (0.0%)
Unique values
2 (< 0.1%)
Most frequent values
<=50K
>50K
List:
['<=50K', '>50K']
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
No strong associations between any pair of columns were identified by a quick screening of a subsample of the dataframe.
Please enable javascript
The skrub table reports need javascript to display correctly. If you are
displaying a report in a Jupyter notebook and you see this message, you may need to
re-execute the cell or to trust the notebook (button on the top right or
"File > Trust notebook").
2.5 Configuring and saving the TableReport
The TableReport can be saved on disk as an HTML.
TableReport(data).write_html("report.html")
Then, the report can be opened using any internet browser, with no need to run a Jupyter notebok or a python interactive console.
It is possible to configure various parameters using the skrub global config. For example, it is possible to replace the default Pandas or Polars dataframe display with the TableReport by using patch_display (and unpatch_display):
from skrub import patch_display, unpatch_display# replace the default pandas repr patch_display()data
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column
Column name
dtype
Is sorted
Null values
Unique values
Mean
Std
Min
Median
Max
0
age
Int64DType
False
0 (0.0%)
74 (0.2%)
38.6
13.7
17
37
90
1
workclass
ObjectDType
False
2799 (5.7%)
8 (< 0.1%)
2
fnlwgt
Int64DType
False
0 (0.0%)
28523 (58.4%)
1.90e+05
1.06e+05
12,285
178,142
1,490,400
3
education
ObjectDType
False
0 (0.0%)
16 (< 0.1%)
4
education-num
Int64DType
False
0 (0.0%)
16 (< 0.1%)
10.1
2.57
1
10
16
5
marital-status
ObjectDType
False
0 (0.0%)
7 (< 0.1%)
6
occupation
ObjectDType
False
2809 (5.8%)
14 (< 0.1%)
7
relationship
ObjectDType
False
0 (0.0%)
6 (< 0.1%)
8
race
ObjectDType
False
0 (0.0%)
5 (< 0.1%)
9
sex
ObjectDType
False
0 (0.0%)
2 (< 0.1%)
10
capital-gain
Int64DType
False
0 (0.0%)
123 (0.3%)
1.08e+03
7.45e+03
0
0
99,999
11
capital-loss
Int64DType
False
0 (0.0%)
99 (0.2%)
87.5
403.
0
0
4,356
12
hours-per-week
Int64DType
False
0 (0.0%)
96 (0.2%)
40.4
12.4
1
40
99
13
native-country
ObjectDType
False
857 (1.8%)
41 (< 0.1%)
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
To construct a list of column names that you can easily copy-paste
(in the box), select some columns using the checkboxes next
to the column names or the "Select all" button.
The table below shows the strength of association between the most similar columns in the dataframe.
Cramér's V statistic is a number between 0 and 1.
When it is close to 1 the columns are strongly associated — they contain similar information.
In this case, one of them may be redundant and for some models (such as linear models) it might be beneficial to remove it.
Please enable javascript
The skrub table reports need javascript to display correctly. If you are
displaying a report in a Jupyter notebook and you see this message, you may need to
re-execute the cell or to trust the notebook (button on the top right or
"File > Trust notebook").
To disable, use unpatch_display:
unpatch_display()data
age
workclass
fnlwgt
education
education-num
marital-status
occupation
relationship
race
sex
capital-gain
capital-loss
hours-per-week
native-country
0
25
Private
226802
11th
7
Never-married
Machine-op-inspct
Own-child
Black
Male
0
0
40
United-States
1
38
Private
89814
HS-grad
9
Married-civ-spouse
Farming-fishing
Husband
White
Male
0
0
50
United-States
2
28
Local-gov
336951
Assoc-acdm
12
Married-civ-spouse
Protective-serv
Husband
White
Male
0
0
40
United-States
3
44
Private
160323
Some-college
10
Married-civ-spouse
Machine-op-inspct
Husband
Black
Male
7688
0
40
United-States
4
18
NaN
103497
Some-college
10
Never-married
NaN
Own-child
White
Female
0
0
30
United-States
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
48837
27
Private
257302
Assoc-acdm
12
Married-civ-spouse
Tech-support
Wife
White
Female
0
0
38
United-States
48838
40
Private
154374
HS-grad
9
Married-civ-spouse
Machine-op-inspct
Husband
White
Male
0
0
40
United-States
48839
58
Private
151910
HS-grad
9
Widowed
Adm-clerical
Unmarried
White
Female
0
0
40
United-States
48840
22
Private
201490
HS-grad
9
Never-married
Adm-clerical
Own-child
White
Male
0
0
20
United-States
48841
52
Self-emp-inc
287927
HS-grad
9
Married-civ-spouse
Exec-managerial
Wife
White
Female
15024
0
40
United-States
48842 rows × 14 columns
More detail on the skrub configuration is reported in the User Guide.
2.6 Working with big tables
Plotting and measuring the column correlations are expensive operations and may take a long time, so when the dataframe under study is large it may be more convenient to skip them for quicker development.
The max_plot_columns and max_association_columns parameters allow to set a threshold on the number of columns: the TableReport will skip the respective task if the number of colums in the dataframe is larger than the threshold:
When the number of columns is too large, an information message is shown in the respective tab instead of the plots or correlations.
Click a table cell for more info about its column.
age
workclass
fnlwgt
education
education-num
marital-status
occupation
relationship
race
sex
capital-gain
capital-loss
hours-per-week
native-country
0
25
Private
226,802
11th
7
Never-married
Machine-op-inspct
Own-child
Black
Male
0
0
40
United-States
1
38
Private
89,814
HS-grad
9
Married-civ-spouse
Farming-fishing
Husband
White
Male
0
0
50
United-States
2
28
Local-gov
336,951
Assoc-acdm
12
Married-civ-spouse
Protective-serv
Husband
White
Male
0
0
40
United-States
3
44
Private
160,323
Some-college
10
Married-civ-spouse
Machine-op-inspct
Husband
Black
Male
7,688
0
40
United-States
4
18
103,497
Some-college
10
Never-married
Own-child
White
Female
0
0
30
United-States
48,837
27
Private
257,302
Assoc-acdm
12
Married-civ-spouse
Tech-support
Wife
White
Female
0
0
38
United-States
48,838
40
Private
154,374
HS-grad
9
Married-civ-spouse
Machine-op-inspct
Husband
White
Male
0
0
40
United-States
48,839
58
Private
151,910
HS-grad
9
Widowed
Adm-clerical
Unmarried
White
Female
0
0
40
United-States
48,840
22
Private
201,490
HS-grad
9
Never-married
Adm-clerical
Own-child
White
Male
0
0
20
United-States
48,841
52
Self-emp-inc
287,927
HS-grad
9
Married-civ-spouse
Exec-managerial
Wife
White
Female
15,024
0
40
United-States
age
Int64DType
Null values
0 (0.0%)
Unique values
74 (0.2%)
This column has a high cardinality (> 40).
Mean ± Std
38.6 ±
13.7
Median ± IQR
37 ±
20
Min | Max
17 |
90
workclass
ObjectDType
Null values
2,799 (5.7%)
Unique values
8 (< 0.1%)
fnlwgt
Int64DType
Null values
0 (0.0%)
Unique values
28,523 (58.4%)
This column has a high cardinality (> 40).
Mean ± Std
1.90e+05 ±
1.06e+05
Median ± IQR
178,142 ±
120,097
Min | Max
12,285 |
1,490,400
education
ObjectDType
Null values
0 (0.0%)
Unique values
16 (< 0.1%)
education-num
Int64DType
Null values
0 (0.0%)
Unique values
16 (< 0.1%)
Mean ± Std
10.1 ±
2.57
Median ± IQR
10 ±
3
Min | Max
1 |
16
marital-status
ObjectDType
Null values
0 (0.0%)
Unique values
7 (< 0.1%)
occupation
ObjectDType
Null values
2,809 (5.8%)
Unique values
14 (< 0.1%)
relationship
ObjectDType
Null values
0 (0.0%)
Unique values
6 (< 0.1%)
race
ObjectDType
Null values
0 (0.0%)
Unique values
5 (< 0.1%)
sex
ObjectDType
Null values
0 (0.0%)
Unique values
2 (< 0.1%)
capital-gain
Int64DType
Null values
0 (0.0%)
Unique values
123 (0.3%)
This column has a high cardinality (> 40).
Mean ± Std
1.08e+03 ±
7.45e+03
Median ± IQR
0 ±
0
Min | Max
0 |
99,999
capital-loss
Int64DType
Null values
0 (0.0%)
Unique values
99 (0.2%)
This column has a high cardinality (> 40).
Mean ± Std
87.5 ±
403.
Median ± IQR
0 ±
0
Min | Max
0 |
4,356
hours-per-week
Int64DType
Null values
0 (0.0%)
Unique values
96 (0.2%)
This column has a high cardinality (> 40).
Mean ± Std
40.4 ±
12.4
Median ± IQR
40 ±
5
Min | Max
1 |
99
native-country
ObjectDType
Null values
857 (1.8%)
Unique values
41 (< 0.1%)
This column has a high cardinality (> 40).
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column
Column name
dtype
Is sorted
Null values
Unique values
Mean
Std
Min
Median
Max
0
age
Int64DType
False
0 (0.0%)
74 (0.2%)
38.6
13.7
17
37
90
1
workclass
ObjectDType
False
2799 (5.7%)
8 (< 0.1%)
2
fnlwgt
Int64DType
False
0 (0.0%)
28523 (58.4%)
1.90e+05
1.06e+05
12,285
178,142
1,490,400
3
education
ObjectDType
False
0 (0.0%)
16 (< 0.1%)
4
education-num
Int64DType
False
0 (0.0%)
16 (< 0.1%)
10.1
2.57
1
10
16
5
marital-status
ObjectDType
False
0 (0.0%)
7 (< 0.1%)
6
occupation
ObjectDType
False
2809 (5.8%)
14 (< 0.1%)
7
relationship
ObjectDType
False
0 (0.0%)
6 (< 0.1%)
8
race
ObjectDType
False
0 (0.0%)
5 (< 0.1%)
9
sex
ObjectDType
False
0 (0.0%)
2 (< 0.1%)
10
capital-gain
Int64DType
False
0 (0.0%)
123 (0.3%)
1.08e+03
7.45e+03
0
0
99,999
11
capital-loss
Int64DType
False
0 (0.0%)
99 (0.2%)
87.5
403.
0
0
4,356
12
hours-per-week
Int64DType
False
0 (0.0%)
96 (0.2%)
40.4
12.4
1
40
99
13
native-country
ObjectDType
False
857 (1.8%)
41 (< 0.1%)
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
To construct a list of column names that you can easily copy-paste
(in the box), select some columns using the checkboxes next
to the column names or the "Select all" button.
Plotting was skipped as the dataframe exceeded the
max_plot_columns
limit set for the TableReport during report creation.
You can adjust this behavior in several ways:
To fix it for a single report, change the parameter passed when creating TableReport (default: 30):
report = TableReport(max_plot_columns=50)
To change the default number of columns to be plotted during the current Python session, use skrub.set_config:
from skrub import set_config
set_config(max_plot_columns=50)
To make the change permanent, use an environment variable:
export SKB_MAX_PLOT_COLUMNS=50
Note that TableReport accepts two parameters:
The max_plot_columns limits the number of columns that are plotted.
The max_association_columns limits the number of columns for which associations should be computed.
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Computing pairwise associations was skipped because the dataframe has many columns.
See TableReport's max_association_columns parameter.
Please enable javascript
The skrub table reports need javascript to display correctly. If you are
displaying a report in a Jupyter notebook and you see this message, you may need to
re-execute the cell or to trust the notebook (button on the top right or
"File > Trust notebook").
2.7 Conclusions
In this chapter we have learned how the TableReport can be used to speed up data exploration, allowing us to find possible criticalities in the data.
We covered:
Creating and configuring a TableReport for fast, interactive data exploration
Exploring column statistics, value distributions, and associations visually
Detecting nulls, outliers, and highly correlated columns at a glance
Filtering columns by type or characteristics using built-in filters
Saving and sharing interactive reports as standalone HTML files
Adjusting TableReport settings for large datasets to optimize performance
In the next chapter, we will find out how to address some of the possible problems using the skrub Cleaner.
---title: "Exploring dataframes with skrub"format: html: toc: true revealjs: slide-number: true toc: false code-fold: false code-tools: true footer: "[← Back to Slides Index](../../slides/index.html)"---## IntroductionIn this chapter, we will show how we use the skrub `TableReport` to exploretabular data. We will use the Adult Census dataset as our example table, and perform some exploratory analysis to learn about the characteristics of the data. First, let's import the necessary libraries and load the dataset.```{python}import pandas as pdfrom sklearn.ensemble import RandomForestClassifierfrom sklearn.datasets import fetch_openml# Load the Adult Census datasetdata = pd.read_csv("../data/adult_census/data.csv")target = pd.read_csv("../data/adult_census/target.csv")```Now that we have a dataframe we can work with, here is a list of features of the data we would like to find out:- The size of the dataset. - The data types and names of the columns. - The distribution of values in the columns. - Whether null values are present, in what measure and where. - Discrete/categorical features, and their cardinality.- Columns strongly correlated with each other. ## Exploring data with Pandas toolsLet's first explore the data using Pandas only.We can get an idea of the content of the table by printing the first few lines, which gives an idea of the datatypes and the columns we are dealing with. ```{python}data.head(5)```If we want to have a simpler view of the datatypes in the dataframe, we must use `data.info()`:```{python}data.info()```With `.info()` we can find out the shape of the dataframe (the number of rows and columns), the datatype and the number of non-null values for each column. We can also get a richer summary of the data with the `.describe()` method:```{python}data.describe(include="all")```This gives us useful information about all the features in the dataset. Among others, we can find the number of unique values in each column, various statisticsfor the numerical columns and the number of null values.## Exploring data with the `TableReport`Now, let's create a TableReport to explore the dataset.```{python}from skrub import TableReportTableReport(data, verbose=0)```::: {.content-hidden when-format="revealjs"}### Default view of the TableReportThe `TableReport` gives us a comprehensive overview of the dataset. The defaultview shows all the columns in the dataset, and allows to select and copy the contentof the cells shown in the preview. The `TableReport` is intended to show a preview of the data, so it does not contain all the rows in the dataset, rather it shows only the first and lastfew rows by default. Similarly, it stores only the top 10 most frequent valuesfor each column, if column distributions are plotted.:::### The "Stats" tab```{python}TableReport(data, open_tab="stats")```::: {.content-hidden when-format="revealjs"}The "Stats" tab provides a variety of descriptive statistics for each column inthe dataset.This includes:- The column name- The detected data type of the column- Whether the column is sorted or not - The number of null values in the column, as well as the percentage- The number of unique values in the columnFor numerical columns, additional statistics are provided:- Mean- Standard deviation- Minimum and maximum values- MedianStat columns can also be sorted, for example to quickly identify which columns contain the most nulls, or have the largest cardinality (number of unique values).:::::: {.callout}### FiltersPre-made column filters are also available, allowing to select columns by dtype or other characteristics. Filters are shared across tabs. :::### The "Distributions" tab```{python}TableReport(data, open_tab="distributions")```::: {.content-hidden when-format="revealjs"}The "Distributions" tab provides visualizations of the distributions of values in each column. This includes histograms for numerical columns and bar plots forcategorical columns.The "Distributions" tab helps with detecting potential issues in the data, such as:- Skewed distributions- Outliers- Unexpected value frequenciesFor example, in this dataset we can see that some columns are heavily skewed, such as "workclass", "race", and "native-country": this is important information to keep track of, because these columns may require special handlingduring data preprocessing or modeling.Additionally, the "Distributions" tab allows to select columns manually, so thatthey can be added to a script and selected for further analysis or modeling.::: {.callout-caution}#### Outlier detectionThe `TableReport` detects outliers using a simple interquartile test, marking as outliers all values that are beyond the IQR. This is a simple heuristic, and should not be treated as perfect. If your problem requires reliable outlier detection, you should not rely exclusively on what the `TableReport` shows. ::::::### The "Associations" tab```{python}TableReport(data, open_tab="associations")```::: {.content-hidden when-format="revealjs"}The "Associations" tab provides insights into the relationships between differentcolumns in the dataset.It shows [Pearson's correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)coefficient for numerical columns, as well as [Cramér's V](https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V) for all columns. While this is a somewhat rough measure of association, it can help identify potentialrelationships worth exploring further during the analysis, and highlights highly correlated columns: depending on the modeling technique used, these may need to be handled specially to avoid issues with multicollinearity.In this example, we can see that "education-num" and "education" have perfect correlation, which means that one of the two columns can be dropped without losinginformation.:::## Exploring the target variableBesides dataframes, the `TableReport` handles series and mono- and bi-dimensional numpy arrays.So, let's take a closer look at the target variable, which indicates whether anindividual's income exceeds $50K per year. We can create a separate `TableReport`for the target variable to explore its distribution: ```{python}TableReport(target)```## Configuring and saving the `TableReport` The `TableReport` can be saved on disk as an HTML. ```{.python}TableReport(data).write_html("report.html")```Then, the report can be opened using any internet browser, with no need to run a Jupyter notebok or a python interactive console. ::: {.content-hidden when-format="revealjs"}It is possible to configure various parameters using the skrub global config. For example, it is possible to replace the default Pandas or Polars dataframedisplay with the TableReport by using `patch_display` (and `unpatch_display`):```{python}from skrub import patch_display, unpatch_display# replace the default pandas repr patch_display()data```To disable, use `unpatch_display`:```{python}unpatch_display()data```More detail on the skrub configuration is reported in the [User Guide](https://skrub-data.org/dev/modules/configuration_and_utils/customizing_configuration.html).:::## Working with big tables::: {.content-hidden when-format="revealjs"}Plotting and measuring the column correlations are expensive operations and maytake a long time, so when the dataframe under study is large it may be moreconvenient to skip them for quicker development. The `max_plot_columns` and `max_association_columns` parameters allow to set a threshold on the number of columns: the `TableReport` will skip the respectivetask if the number of colums in the dataframe is larger than the threshold:When the number of columns is too large, an information message is shown in the respective tab instead of the plots or correlations. :::```{python}TableReport( data, max_association_columns=3, max_plot_columns=3, open_tab="distributions")```## Conclusions::: {.content-hidden when-format="revealjs"}In this chapter we have learned how the `TableReport` can be used to speed up data exploration, allowing us to find possible criticalities in the data. We covered:- Creating and configuring a `TableReport` for fast, interactive data exploration- Exploring column statistics, value distributions, and associations visually- Detecting nulls, outliers, and highly correlated columns at a glance- Filtering columns by type or characteristics using built-in filters- Saving and sharing interactive reports as standalone HTML files- Adjusting `TableReport` settings for large datasets to optimize performanceIn thenext chapter, we will find out how to address some of the possible problems using the skrub `Cleaner`.:::::: {.content-visible when-format="revealjs"}- The `TableReport` is a powerful EDA tool- It shows a rich preview of the content of the dataframe- It provides precomputed statistics for all the columns- It prepares distribution plots for each column- It measures the association between columns:::