5 Quiz: Exploring and sanitizing dataframes with skrub – Inria Academy

5.1 Question 1

What do I need to open a TableReport saved with .write_html("report.html")?

A) A python console
B) An internet browser
C) A Jupyter notebook

Solution

Answer: B)

Once generated, the TableReport can be saved to disk as an HTML file. The file can be opened with any regular internet browser.

The TableReport is not updated dynamically, and is not connected to python consoles or running kernels.

5.2 Question 2

Consider this dataframe and TableReport, then answer the question.

import pandas as pd
from skrub import TableReport

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [70000, 80000, 90000, 100000, 110000],
    'Department': ['HR', 'Finance', 'IT', 'Marketing', 'Sales']
})

TableReport(df, max_plot_columns=5, max_association_columns=3)

	Name	Age	City	Salary	Department
0	Alice	25	New York	70,000	HR
1	Bob	30	Los Angeles	80,000	Finance
2	Charlie	35	Chicago	90,000	IT
3	Diana	40	Houston	100,000	Marketing
4	Eve	45	Phoenix	110,000	Sales

Column	Column name	dtype	Is sorted	Unique values	Mean	Std	Min	Median	Max
0	Name	ObjectDType	True	5 (100.0%)
1	Age	Int64DType	True	5 (100.0%)	35.0	7.91	25	35	45
2	City	ObjectDType	False	5 (100.0%)
3	Salary	Int64DType	True	5 (100.0%)	9.00e+04	1.58e+04	70,000	90,000	110,000
4	Department	ObjectDType	False	5 (100.0%)

What does the “Distributions” tab show? What about the “Associations” tab?

A) Both tabs work as normal.
B) The “Distributions” tab shows the plots, but the associations are not shown.
C) Both tabs contain a message explaining their operation was skipped.

Solution

Answer: B)

The “Distributions” tab contains the usual distribution plots, because the number of columns in the dataframe (5) does not exceed max_plot_columns (5). The computation of the associations was skipped because the number of columns (5) is larger than max_association_columns (3).
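The rule behind this answer can be sketched in plain Python. The helper report_sections below is hypothetical and is not part of skrub. It only assumes, in line with the solution above, that a tab is skipped when the number of columns is strictly larger than the corresponding limit.

```python
def report_sections(n_columns, max_plot_columns, max_association_columns):
    """Hypothetical helper: tell which tabs would be computed.

    Assumption: a tab is skipped only when the number of columns is
    strictly larger than its limit.
    """
    return {
        "distributions": n_columns <= max_plot_columns,
        "associations": n_columns <= max_association_columns,
    }


# Values from the question: 5 columns, limits of 5 and 3.
sections = report_sections(
    n_columns=5, max_plot_columns=5, max_association_columns=3
)
print(sections)  # {'distributions': True, 'associations': False}
```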

5.3 Question 3

Does the TableReport parse datetimes or other data types?

A) Yes, the TableReport automatically converts datetime strings to datetime objects and strings that contain numbers into floats.
B) No, the TableReport does not perform any conversion.

Solution

Answer: B)

The TableReport is generated on the basis of the datatypes found in the supplied dataframe. Any datatype parsing must be done before generating the report, e.g., by using the Cleaner.

5.4 Question 4

Which of these transformations is executed by default when the Cleaner is fitted on a dataframe?

A) Dropping constant columns
B) Dropping columns that contain only missing values
C) Dropping columns that contain more than 90% of missing values
D) Dropping columns where all values are distinct

Solution

Answer: B)

Columns that contain only missing values, i.e., where the fraction of missing values is 1.0, are dropped. This is controlled by the drop_null_fraction parameter.

5.5 Question 5

Consider the following dataframe.

import pandas as pd
medical_df = pd.DataFrame({
    'Patient_ID': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'Visit_Date': ['10 Jan 2023', '15 Feb 2023', '20 Mar 2023', '25 Apr 2023', None],
    'Blood_Pressure': [120.5, 130.2, 125.8, 140.0, 135.6],
    'Diagnosis': ['Hypertension', '?', '?', 'Hypertension', 'Diabetes'],
})

medical_df

	Patient_ID	Visit_Date	Blood_Pressure	Diagnosis
0	P001	10 Jan 2023	120.5	Hypertension
1	P002	15 Feb 2023	130.2	?
2	P003	20 Mar 2023	125.8	?
3	P004	25 Apr 2023	140.0	Hypertension
4	P005	None	135.6	Diabetes

What is the output of this cleaner?

from skrub import Cleaner
cleaner = Cleaner()
df_clean = cleaner.fit_transform(medical_df)

A)

	Patient_ID	Visit_Date	Blood_Pressure	Diagnosis
0	P001	2023-01-10	120.5	Hypertension
1	P002	2023-02-15	130.2	None
2	P003	2023-03-20	125.8	None
3	P004	2023-04-25	140.0	Hypertension
4	P005	NaT	135.6	Diabetes

B)

	Patient_ID	Visit_Date	Blood_Pressure	Diagnosis
0	P001	10 Jan 2023	120.5	Hypertension
1	P002	15 Feb 2023	NaN	?
2	P003	20 Mar 2023	125.8	?
3	P004	25 Apr 2023	140.0	Hypertension
4	P005	None	135.6	Diabetes

C)

	Patient_ID	Visit_Date	Blood_Pressure	Diagnosis
0	P001	10 Jan 2023	120.5	Hypertension
1	P002	15 Feb 2023	NaN	None
2	P003	20 Mar 2023	125.8	None
3	P004	25 Apr 2023	140.0	Hypertension
4	P005	None	135.6	Diabetes

D)

Solution

Answer: A)

The Cleaner replaces strings that are commonly used to denote missing values (such as “?”) with null values, and parses string columns written in a common datetime format into datetime columns.

No empty columns are present, so no further transformations are made.