5  Quiz: Exploring and sanitizing dataframes with skrub

5.1 Question 1

What do I need to open a TableReport saved with .write_html("report.html")?

Answer: B)

Once generated, the TableReport can be saved to disk as an HTML file. The file can be opened with any regular web browser.

The saved TableReport is a static snapshot: it is not updated dynamically, and it is not connected to a Python console or a running kernel.

5.2 Question 2

Consider this dataframe and TableReport, then answer the question.

import pandas as pd
from skrub import TableReport

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [70000, 80000, 90000, 100000, 110000],
    'Department': ['HR', 'Finance', 'IT', 'Marketing', 'Sales']
})

TableReport(df, max_plot_columns=5, max_association_columns=3)

What does the “Distributions” tab show? What about the “Associations” tab?

Answer: B)

The “Distributions” tab contains the usual distribution plots for all five columns, since max_plot_columns (5) is not exceeded. The “Associations” tab is empty: the computation of the associations was skipped because the number of columns in the dataframe (5) is larger than max_association_columns (3).

5.3 Question 3

Does the TableReport parse datetimes or other data types?

Answer: No, the TableReport is generated on the basis of the datatypes found in the supplied dataframe. Any datatype parsing must be done before generating the report, e.g., by using the Cleaner.

5.4 Question 4

Which of these transformations is executed by default when the Cleaner is fitted on a dataframe?

Answer: B)

Columns that contain only missing values, i.e., where the fraction of missing values is 1.0, are dropped. This is controlled by the drop_null_fraction parameter.

5.5 Question 5

Consider the following dataframe.

import pandas as pd
medical_df = pd.DataFrame({
    'Patient_ID': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'Visit_Date': ['10 Jan 2023', '15 Feb 2023', '20 Mar 2023', '25 Apr 2023', None],
    'Blood_Pressure': [120.5, 130.2, 125.8, 140.0, 135.6],
    'Diagnosis': ['Hypertension', '?', '?', 'Hypertension', 'Diabetes'],
})

medical_df
  Patient_ID   Visit_Date  Blood_Pressure     Diagnosis
0       P001  10 Jan 2023           120.5  Hypertension
1       P002  15 Feb 2023           130.2             ?
2       P003  20 Mar 2023           125.8             ?
3       P004  25 Apr 2023           140.0  Hypertension
4       P005         None           135.6      Diabetes

What is the output of this cleaner?

from skrub import Cleaner
cleaner = Cleaner()
df_clean = cleaner.fit_transform(medical_df)

A)
  Patient_ID Visit_Date  Blood_Pressure     Diagnosis
0       P001 2023-01-10           120.5  Hypertension
1       P002 2023-02-15           130.2          None
2       P003 2023-03-20           125.8          None
3       P004 2023-04-25           140.0  Hypertension
4       P005        NaT           135.6      Diabetes

B)
  Patient_ID   Visit_Date  Blood_Pressure     Diagnosis
0       P001  10 Jan 2023           120.5  Hypertension
1       P002  15 Feb 2023             NaN             ?
2       P003  20 Mar 2023           125.8             ?
3       P004  25 Apr 2023           140.0  Hypertension
4       P005         None           135.6      Diabetes

C)
  Patient_ID   Visit_Date  Blood_Pressure     Diagnosis
0       P001  10 Jan 2023           120.5  Hypertension
1       P002  15 Feb 2023             NaN          None
2       P003  20 Mar 2023           125.8          None
3       P004  25 Apr 2023           140.0  Hypertension
4       P005         None           135.6      Diabetes

Answer: A)

The Cleaner replaces strings that are commonly used to denote missing values (such as “?”) with nulls, and it parses string columns written in common datetime formats into datetimes.

No empty columns are present, so no further transformations are made.