8 Quiz: Column-level transformations

9 Column transformers

9.1 Question 1

Consider this diagram. Which column transformer can replicate this behavior if it wraps a OneHotEncoder?

Answer: A) ApplyToCols takes a transformer, clones it, and applies one clone separately to each selected column (in this case, Name and Desc). Columns that were not selected are left unchanged.

9.2 Question 2

Consider this diagram. Which column transformer can replicate this behavior if it wraps a PCA?

Answer: B) ApplyToFrame takes a transformer and a list of columns (usually a subset of the columns in the dataframe), then applies the transformer to all the selected columns at once, replacing them with the output of the transformer. Columns that were not selected are left unchanged.
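To contrast with the per-column case, the following sketch applies a single PCA to a block of columns at once, again using only pandas, NumPy and scikit-learn. It is not skrub's implementation: the helper `apply_to_block`, the toy data, the `component_i` output names, and the placement of the new columns before the untouched ones are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical toy data: three numeric columns and one column left untouched.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(6, 3)), columns=["a", "b", "c"])
df["label"] = list("xyzxyz")


def apply_to_block(transformer, frame, cols):
    """Fit one transformer on all selected columns jointly and replace them."""
    block = transformer.fit_transform(frame[cols])
    new_cols = pd.DataFrame(
        block,
        index=frame.index,
        # Hypothetical output names; a real library may name them differently.
        columns=[f"component_{i}" for i in range(block.shape[1])],
    )
    # Columns that were not selected are kept unchanged.
    rest = frame.drop(columns=cols)
    return pd.concat([new_cols, rest], axis=1)


out = apply_to_block(PCA(n_components=2), df, ["a", "b", "c"])
print(out)
```

The three input columns are consumed together and replaced by two components, which would not be possible if the transformer only saw one column at a time.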

9.3 Question 3

Is the following statement true or false?

The output of these two snippets is the same:

encode = ApplyToCols(OneHotEncoder(sparse_output=False), cols=s.string())
scale = ApplyToCols(StandardScaler(), cols=s.numeric())
case_1 = make_pipeline(encode, scale)
case_1.fit_transform(df)
case_2 = make_pipeline(scale, encode)
case_2.fit_transform(df)

Answer: False.

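The order of the operations matters, and a different order leads to different results. In case_1 the encoder runs first, so the one-hot columns it creates are already numeric when the scaler runs and are standardized along with age. In case_2 the scaler runs first and only sees age; the one-hot columns created afterwards stay as 0/1 indicators.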

import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from skrub import SelectCols, ApplyToCols, ApplyToFrame
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import skrub.selectors as s 

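# No random seed is set, so the age and sex values differ between runs;
# the outputs shown below come from one particular run.
n_patients = 5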
df = pd.DataFrame(
    {
        "patient_id": [f"P{i:03d}" for i in range(n_patients)],
        "age": np.random.randint(18, 80, size=n_patients),
        "sex": np.random.choice(["M", "F"], size=n_patients),
    }
)
encode = ApplyToCols(OneHotEncoder(sparse_output=False), cols=s.string())
scale = ApplyToCols(StandardScaler(), cols=s.numeric())
case_1 = make_pipeline(encode, scale)
df_1 = case_1.fit_transform(df)
df_1.head(5)
   patient_id_P000  patient_id_P001  patient_id_P002  patient_id_P003  patient_id_P004        age  sex_F  sex_M
0              2.0             -0.5             -0.5             -0.5             -0.5  -0.368230    0.5   -0.5
1             -0.5              2.0             -0.5             -0.5             -0.5  -1.657034    0.5   -0.5
2             -0.5             -0.5              2.0             -0.5             -0.5   0.368230   -2.0    2.0
3             -0.5             -0.5             -0.5              2.0             -0.5   1.380862    0.5   -0.5
4             -0.5             -0.5             -0.5             -0.5              2.0   0.276172    0.5   -0.5
case_2 = make_pipeline(scale, encode)
df_2 = case_2.fit_transform(df)
df_2.head(5)
   patient_id_P000  patient_id_P001  patient_id_P002  patient_id_P003  patient_id_P004        age  sex_F  sex_M
0              1.0              0.0              0.0              0.0              0.0  -0.368230    1.0    0.0
1              0.0              1.0              0.0              0.0              0.0  -1.657034    1.0    0.0
2              0.0              0.0              1.0              0.0              0.0   0.368230    0.0    1.0
3              0.0              0.0              0.0              1.0              0.0   1.380862    1.0    0.0
4              0.0              0.0              0.0              0.0              1.0   0.276172    1.0    0.0
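The values 2.0, -0.5, 0.5 and -2.0 that appear in df_1 can be checked by hand. The sketch below standardizes two of the 0/1 indicator columns printed for df_2 with plain NumPy, using the population standard deviation (`ddof=0`), which is the convention StandardScaler follows. The helper `standardize` is a hypothetical name used only here.

```python
import numpy as np


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()  # np.std defaults to ddof=0


# Indicator columns copied from the df_2 output.
sex_F = [1.0, 1.0, 0.0, 1.0, 1.0]
patient_id_P000 = [1.0, 0.0, 0.0, 0.0, 0.0]

scaled_sex_F = standardize(sex_F)
scaled_p000 = standardize(patient_id_P000)
print(scaled_sex_F)
print(scaled_p000)
```

Both columns have a standard deviation of 0.4, with means of 0.8 and 0.2 respectively, which gives the 0.5 / -2.0 and 2.0 / -0.5 values seen in the sex_F and patient_id_P000 columns of df_1.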

10 Selectors

For the following questions, refer to this example dataframe:

import pandas as pd
import datetime

data = {
    "age": [25, 34, 29, 42, 31],
    "salary": [45000.0, 52000.0, 61000.0, None, 48000.0],
    "employment_type": ["full-time", "part-time", None, "contract", "full-time"],
    "job_title": ["engineer", "analyst", "consultant", "designer", "developer"],
    "department_title": ["IT", "Finance", "Consulting", "Design", "Development"],
    "is_remote": [False, True, False, True, False],
    "performance_rating": pd.Categorical(["excellent", None, "good", "excellent", "average"]),
    "bonus_category": pd.Categorical(["5K+", "10K+", "15K+", "7K+", "12K+"]),
    "hire_date": [
        datetime.datetime.fromisoformat(dt)
        for dt in [
            "2018-06-01T09:00:00",
            "2019-09-15T14:30:00",
            "2020-11-20T10:15:00",
            "2021-04-10T16:45:00",
        ]
    ]
    + [None],

}
df = pd.DataFrame(data)
df
   age   salary  employment_type   job_title  department_title  is_remote  performance_rating  bonus_category            hire_date
0   25  45000.0        full-time    engineer                IT      False           excellent             5K+  2018-06-01 09:00:00
1   34  52000.0        part-time     analyst           Finance       True                 NaN            10K+  2019-09-15 14:30:00
2   29  61000.0             None  consultant        Consulting      False                good            15K+  2020-11-20 10:15:00
3   42      NaN         contract    designer            Design       True           excellent             7K+  2021-04-10 16:45:00
4   31  48000.0        full-time   developer       Development      False             average            12K+                  NaT

10.1 Question 4

What does this selector do?

from skrub import SelectCols
import skrub.selectors as s

def fun(col):
    mean = col.mean()
    return mean > 40000

sel = s.numeric() & s.filter(fun) 
t = SelectCols(cols=sel)
t.fit_transform(df)
    salary
0  45000.0
1  52000.0
2  61000.0
3      NaN
4  48000.0

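Answer: B) The selector keeps the columns that are numeric and for which fun returns True, that is, numeric columns whose mean is greater than 40000. Only salary qualifies: its mean is 51500 (the missing value is skipped), while the mean of age is 32.2.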
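The logic of this selector can be approximated in plain pandas. The sketch below is not how skrub evaluates selectors; it only mirrors the "numeric AND mean above 40000" condition on a reduced copy of the example dataframe. The dtype check with `is_numeric_dtype` is an assumption standing in for `s.numeric()`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Reduced copy of the example dataframe (three of its columns).
df = pd.DataFrame(
    {
        "age": [25, 34, 29, 42, 31],
        "salary": [45000.0, 52000.0, 61000.0, None, 48000.0],
        "job_title": ["engineer", "analyst", "consultant", "designer", "developer"],
    }
)


def fun(col):
    # pandas skips missing values when computing the mean.
    return col.mean() > 40000


# `and` short-circuits, so fun is never called on the non-numeric column.
selected = [
    name for name in df.columns if is_numeric_dtype(df[name]) and fun(df[name])
]
print(selected)
print(df[selected])
```

Here `age` passes the numeric check but fails the mean condition, and `job_title` fails the numeric check, so only `salary` remains.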

10.2 Question 5

What does this selector do?

sel = s.cols("salary") | s.filter_names(lambda name: name.endswith("_title"))

t = SelectCols(cols=sel)
t.fit_transform(df)
    salary   job_title  department_title
0  45000.0    engineer                IT
1  52000.0     analyst           Finance
2  61000.0  consultant        Consulting
3      NaN    designer            Design
4  48000.0   developer       Development

Answer: C) The selector keeps salary, which is selected by name, together with every column whose name ends with "_title": job_title and department_title.
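The union in this selector can also be mimicked with a list comprehension over column names. This is a pandas-only sketch of the "salary OR name ends with _title" condition on a reduced copy of the example dataframe, not a description of how skrub resolves selectors.

```python
import pandas as pd

# Reduced copy of the example dataframe (five of its columns).
df = pd.DataFrame(
    {
        "age": [25, 34, 29, 42, 31],
        "salary": [45000.0, 52000.0, 61000.0, None, 48000.0],
        "employment_type": ["full-time", "part-time", None, "contract", "full-time"],
        "job_title": ["engineer", "analyst", "consultant", "designer", "developer"],
        "department_title": ["IT", "Finance", "Consulting", "Design", "Development"],
    }
)

# A column is kept if it matches either condition of the union.
selected = [
    name for name in df.columns if name == "salary" or name.endswith("_title")
]
print(selected)
print(df[selected])
```

Note that `employment_type` is not selected: its name contains "type", not the "_title" suffix.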