This is a regression task: we want to predict the value of current_annual_salary.
Strategizing
We can begin by exploring the dataframe with .describe(), and then devise a plan for pre-processing our data.
X.describe(include="all")
        gender  department       department_name                division  assignment_category  employee_position_title  date_first_hired  year_first_hired
count     9211        9228                  9228                    9228                 9228                     9228              9228       9228.000000
unique       2          37                    37                     694                    2                      443              2264               NaN
top          M         POL  Department of Police  School Health Services     Fulltime-Regular             Bus Operator        12/12/2016               NaN
freq      5481        1844                  1844                     300                 8394                      638                87               NaN
mean       NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       2003.597529
std        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN          9.327078
min        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       1965.000000
25%        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       1998.000000
50%        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       2005.000000
75%        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       2012.000000
max        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       2016.000000
Our plan
We need to:
- Impute the missing values in the gender column (it has 9211 non-null entries out of 9228 rows).
- Encode categorical features as numerical features.
- Convert the column date_first_hired into numerical features.
- Scale numerical features.
- Evaluate the performance of the model.
Step 1: Convert date features to numerical
We extract numerical features from the date_first_hired column.
# Create a copy to work with
X_processed = X.copy()

# Parse the date column
X_processed['date_first_hired'] = pd.to_datetime(X_processed['date_first_hired'])

# Extract numerical features from the date
X_processed['hired_month'] = X_processed['date_first_hired'].dt.month
X_processed['hired_year'] = X_processed['date_first_hired'].dt.year

# Drop the original date column
X_processed = X_processed.drop('date_first_hired', axis=1)

print("Features after date transformation:")
print("\nShape:", X_processed.shape)
Features after date transformation:
Shape: (9228, 9)
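One detail worth flagging: dates such as 12/12/2016 are ambiguous, and pd.to_datetime guesses the format from the data. If we know the dataset uses US-style month/day/year dates (an assumption here), passing an explicit format makes the parsing deterministic. The pd.to_datetime call in the cell above could instead be written as:

# Assuming US-style month/day/year dates; an explicit format
# avoids silent day/month swaps during inference
pd.to_datetime(X['date_first_hired'], format='%m/%d/%Y')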
Step 2: Encode categorical features
We encode the categorical features using one-hot encoding.
# Identify only the non-numerical (truly categorical) columns
categorical_cols = X_processed.select_dtypes(include=['object']).columns.tolist()
print("Categorical columns to encode:", categorical_cols)

# Apply one-hot encoding only to categorical columns
X_encoded = pd.get_dummies(X_processed, columns=categorical_cols)
print("\nShape after encoding:", X_encoded.shape)
Categorical columns to encode: ['gender', 'department', 'department_name', 'division', 'assignment_category', 'employee_position_title']
Shape after encoding: (9228, 1218)
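The jump to 1218 columns is easy to verify from the .describe() output above: one-hot encoding creates one column per distinct category, so the six categorical columns contribute 2 + 37 + 37 + 694 + 2 + 443 = 1215 columns, and the three remaining numerical columns (year_first_hired, hired_month, hired_year) account for the rest:

# Sanity check: one dummy column per category, plus the numerical columns
n_dummies = sum(X_processed[col].nunique() for col in categorical_cols)
n_numeric = X_processed.shape[1] - len(categorical_cols)
print(n_dummies, "+", n_numeric, "=", n_dummies + n_numeric)  # 1215 + 3 = 1218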
Step 3: Impute missing values
We impute the missing values in the gender column.
from sklearn.impute import SimpleImputer

# Impute missing values with the most frequent value
imputer = SimpleImputer(strategy='most_frequent')
X_encoded_imputed = pd.DataFrame(
    imputer.fit_transform(X_encoded),
    columns=X_encoded.columns,
)
Step 4: Scale numerical features
Scale numerical features for the Ridge regression model.
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the data
X_scaled = scaler.fit_transform(X_encoded_imputed)
X_scaled = pd.DataFrame(X_scaled, columns=X_encoded_imputed.columns)
Step 5: Train Ridge model with cross-validation
Train a Ridge regression model and evaluate with cross-validation.
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
import numpy as np

# Initialize the Ridge model
ridge = Ridge(alpha=1.0)

# Perform cross-validation (5-fold)
cv_results = cross_validate(ridge, X_scaled, y, cv=5,
                            scoring=["r2", "neg_mean_squared_error"])

# Convert MSE to RMSE
test_rmse = np.sqrt(-cv_results["test_neg_mean_squared_error"])

# Display results
print("Cross-Validation Results:")
print(f"Mean test R²: {cv_results['test_r2'].mean():.4f} (+/- {cv_results['test_r2'].std():.4f})")
print(f"Mean test RMSE: {test_rmse.mean():.4f} (+/- {test_rmse.std():.4f})")
Cross-Validation Results:
Mean test R²: 0.8722 (+/- 0.0274)
Mean test RMSE: 10367.1206 (+/- 1403.4322)
“Just ask an agent to write the code”
The pipeline above runs, but it shows the classic failure modes of naively generated code:
- Operations were performed in the wrong order.
- It tried to impute categorical features without first converting them to numeric values.
- The datetime feature was treated like a categorical feature.
- Cells could not be executed in order without proper debugging and re-prompting.
- pd.get_dummies was executed on the full dataframe, rather than only on the training split of each fold, leading to data leakage (a leakage-free sketch follows this list).
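For reference, here is a minimal leakage-free sketch of the same model, assuming the X_processed and categorical_cols variables from the steps above. Because every preprocessing step lives inside a single Pipeline, each cross-validation fold fits the imputer, encoder, and scaler on its own training split only:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Impute, then encode, the categorical columns; scale the numerical ones
preprocessing = ColumnTransformer([
    ("categorical",
     make_pipeline(
         SimpleImputer(strategy="most_frequent"),
         OneHotEncoder(handle_unknown="ignore"),
     ),
     categorical_cols),
    ("numerical",
     StandardScaler(),
     ["year_first_hired", "hired_month", "hired_year"]),
])

# The whole chain is refitted inside each CV fold: no leakage
model = make_pipeline(preprocessing, Ridge(alpha=1.0))
cv_results = cross_validate(model, X_processed, y, cv=5,
                            scoring=["r2", "neg_mean_squared_error"])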
Waking up from a nightmare
Thankfully, we can import skrub:
from skrub import tabular_pipeline

# Perform cross-validation (5-fold)
cv_results = cross_validate(tabular_pipeline("regression"), X, y, cv=5,
                            scoring=['r2', 'neg_mean_squared_error'],
                            return_train_score=True)

# Convert MSE to RMSE
train_rmse = np.sqrt(-cv_results['train_neg_mean_squared_error'])
test_rmse = np.sqrt(-cv_results['test_neg_mean_squared_error'])

# Display results
print("Cross-Validation Results:")
print(f"Mean test R²: {cv_results['test_r2'].mean():.4f} (+/- {cv_results['test_r2'].std():.4f})")
print(f"Mean test RMSE: {test_rmse.mean():.4f} (+/- {test_rmse.std():.4f})")
Cross-Validation Results:
Mean test R²: 0.9089 (+/- 0.0161)
Mean test RMSE: 8773.7865 (+/- 1052.5788)
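Under the hood, tabular_pipeline("regression") assembles a full scikit-learn pipeline for us. As a rough mental model (simplified; the exact defaults depend on the skrub version), it is close to combining a TableVectorizer, which handles dates, categories, and missing values automatically, with a gradient-boosted tree model that needs no feature scaling:

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer

# A rough, simplified equivalent of tabular_pipeline("regression"):
# TableVectorizer turns any dataframe into a numerical array,
# and the tree-based model is robust to unscaled features
model = make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())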
Roadmap for the course
- Data exploration with skrub’s TableReport
- Data cleaning and sanitization with the Cleaner
- Intermission: simplifying column operations with skrub
- Feature engineering with the skrub encoders
- Putting everything together: TableVectorizer and tabular_pipeline
What we saw in this chapter
We built a predictive pipeline using traditional tools, and saw how skrub’s tabular_pipeline simplifies the same task.