2  A world without skrub

Let’s begin the lesson by imagining a world without skrub, where we can use only Pandas and scikit-learn to clean data and prepare a machine learning model.

import pandas as pd
import numpy as np

X = pd.read_csv("../data/employee_salaries/data.csv")
y = pd.read_csv("../data/employee_salaries/target.csv")
X.head(5)
  | gender | department | department_name | division | assignment_category | employee_position_title | date_first_hired | year_first_hired
0 | F | POL | Department of Police | MSB Information Mgmt and Tech Division Records... | Fulltime-Regular | Office Services Coordinator | 09/22/1986 | 1986
1 | M | POL | Department of Police | ISB Major Crimes Division Fugitive Section | Fulltime-Regular | Master Police Officer | 09/12/1988 | 1988
2 | F | HHS | Department of Health and Human Services | Adult Protective and Case Management Services | Fulltime-Regular | Social Worker IV | 11/19/1989 | 1989
3 | M | COR | Correction and Rehabilitation | PRRS Facility and Security | Fulltime-Regular | Resident Supervisor II | 05/05/2014 | 2014
4 | M | HCA | Department of Housing and Community Affairs | Affordable Housing Programs | Fulltime-Regular | Planning Specialist III | 03/05/2007 | 2007

Let’s take a look at the target:

y.head(5)
current_annual_salary
0 69222.18
1 97392.47
2 104717.28
3 52734.57
4 93396.00

This is a numerical column, and our task is predicting the value of current_annual_salary.

2.1 Strategizing

We can begin by exploring the dataframe with .describe, and then think of a plan for pre-processing our data.

X.describe(include="all")
       | gender | department | department_name | division | assignment_category | employee_position_title | date_first_hired | year_first_hired
count  | 9211 | 9228 | 9228 | 9228 | 9228 | 9228 | 9228 | 9228.000000
unique | 2 | 37 | 37 | 694 | 2 | 443 | 2264 | NaN
top    | M | POL | Department of Police | School Health Services | Fulltime-Regular | Bus Operator | 12/12/2016 | NaN
freq   | 5481 | 1844 | 1844 | 300 | 8394 | 638 | 87 | NaN
mean   | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.597529
std    | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9.327078
min    | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1965.000000
25%    | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1998.000000
50%    | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.000000
75%    | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2012.000000
max    | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2016.000000

In this example we want to train a regression model to predict the salary, and we will be using a linear model (Ridge) to do so.

Therefore, we need to:

  • Impute some missing values in the gender column.
  • Encode categorical features as numerical features.
  • Convert the column date_first_hired into numerical features.
  • Scale numerical features.

Finally, we want to evaluate the performance of the method across multiple cross-validation splits.

2.2 Building a traditional pipeline

Let’s build a traditional predictive pipeline following the steps we just discussed.

2.2.1 Step 1: Convert date features to numerical

Extract numerical features from the date_first_hired column.

# Create a copy to work with
X_processed = X.copy()

# Parse the date column
X_processed['date_first_hired'] = pd.to_datetime(X_processed['date_first_hired'])

# Extract numerical features from date
X_processed['hired_month'] = X_processed['date_first_hired'].dt.month
X_processed['hired_year'] = X_processed['date_first_hired'].dt.year

# Drop original date column
X_processed = X_processed.drop('date_first_hired', axis=1)

print("Features after date transformation:")
print("\nShape:", X_processed.shape)
Features after date transformation:

Shape: (9228, 9)
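
The call to pd.to_datetime above leaves pandas to infer the date format. The values in the preview (for example 09/22/1986) look like month/day/year, so a slightly more defensive variant, assuming that format holds for the whole column, passes it explicitly. The toy dataframe below is made up for illustration and reuses three dates from the preview.

```python
import pandas as pd

# Toy data for illustration: three dates copied from the preview of X.
toy = pd.DataFrame({"date_first_hired": ["09/22/1986", "09/12/1988", "03/05/2007"]})

# Assumption: every value follows month/day/year. With an explicit format,
# pandas raises an error on a malformed value instead of guessing.
parsed = pd.to_datetime(toy["date_first_hired"], format="%m/%d/%Y")

# Same two features as in the step above.
toy["hired_month"] = parsed.dt.month
toy["hired_year"] = parsed.dt.year
print(toy[["hired_month", "hired_year"]])
```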

2.2.2 Step 2: Encode categorical features

Encode only the non-numerical categorical features using one-hot encoding.

# Identify only the non-numerical (truly categorical) columns
categorical_cols = X_processed.select_dtypes(include=['object']).columns.tolist()
print("Categorical columns to encode:", categorical_cols)

# Apply one-hot encoding only to categorical columns
X_encoded = pd.get_dummies(X_processed, columns=categorical_cols)
print("\nShape after encoding:", X_encoded.shape)
Categorical columns to encode: ['gender', 'department', 'department_name', 'division', 'assignment_category', 'employee_position_title']

Shape after encoding: (9228, 1218)

2.2.3 Step 3: Impute missing values

We’ll impute missing values in the gender column using the most frequent strategy.

from sklearn.impute import SimpleImputer

# Impute missing values with most frequent value
imputer = SimpleImputer(strategy='most_frequent')
X_encoded_imputed = pd.DataFrame(
    imputer.fit_transform(X_encoded),
    columns=X_encoded.columns
)
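
It can be worth checking whether anything is left to impute at this point. With its default dummy_na=False, pd.get_dummies encodes a missing value as zeros in every indicator column of that feature, so the encoded dataframe no longer contains NaN. This suggests that an imputer applied after encoding may have nothing to fill in. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy data for illustration: one missing gender value.
toy = pd.DataFrame({"gender": ["F", np.nan, "M"]})

encoded = pd.get_dummies(toy, columns=["gender"])
print(encoded)

# No NaN survives the encoding, so an imputer applied afterwards
# has nothing to fill in.
print("Missing values after encoding:", int(encoded.isna().sum().sum()))

# The row that had a missing gender is zero in both indicator columns.
print("Indicators set in row 1:", int(encoded.iloc[1].sum()))
```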

2.2.4 Step 4: Scale numerical features

Scale numerical features for the Ridge regression model.

from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the data
X_scaled = scaler.fit_transform(X_encoded_imputed)
X_scaled = pd.DataFrame(X_scaled, columns=X_encoded_imputed.columns)

2.2.5 Step 5: Train Ridge model with cross-validation

Train a Ridge regression model and evaluate with cross-validation.

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, cross_validate
import numpy as np

# Initialize Ridge model
ridge = Ridge(alpha=1.0)

# Perform cross-validation (5-fold)
cv_results = cross_validate(
    ridge,
    X_scaled,
    y,
    cv=5,
    scoring=["r2", "neg_mean_squared_error"],
)

# Convert MSE to RMSE
test_rmse = np.sqrt(-cv_results["test_neg_mean_squared_error"])

# Display results
print("Cross-Validation Results:")
print(
    f"Mean test R²: {cv_results['test_r2'].mean():.4f} (+/- {cv_results['test_r2'].std():.4f})"
)
print(f"Mean test RMSE: {test_rmse.mean():.4f} (+/- {test_rmse.std():.4f})")
Cross-Validation Results:
Mean test R²: 0.8722 (+/- 0.0274)
Mean test RMSE: 10367.1206 (+/- 1403.4322)
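
The sign flip in the conversion above is needed because scikit-learn scorers follow a "greater is better" convention, so the mean squared error is reported negated. As a sketch on synthetic data (generated with make_regression, not the salary dataset), the built-in "neg_root_mean_squared_error" scorer should give the same per-fold RMSE without the manual square root:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic data for illustration only.
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

cv_toy = cross_validate(
    Ridge(alpha=1.0),
    X_toy,
    y_toy,
    cv=5,
    scoring=["neg_mean_squared_error", "neg_root_mean_squared_error"],
)

# Manual conversion: flip the sign, then take the square root per fold.
rmse_manual = np.sqrt(-cv_toy["test_neg_mean_squared_error"])
# Built-in scorer: only the sign needs flipping.
rmse_builtin = -cv_toy["test_neg_root_mean_squared_error"]

print(np.allclose(rmse_manual, rmse_builtin))
```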

2.2.6 “Just ask an agent to write the code”

It’s what I did. Here are some of the issues I noticed:

  • Operations in the wrong order.
  • Trying to impute categorical features without encoding them as numerical values.
  • The datetime feature was encoded as a categorical (i.e., with dummies).
  • Cells could not be executed in order without proper debugging and re-prompting.
  • pd.get_dummies was executed on the full dataframe, rather than only on the training split, leading to data leakage.

This means that I had to spend time re-prompting the model to get the code to run, and the version shown here (intentionally) still contains the leakage.
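
One way to address the ordering and leakage issues without skrub is to wrap every preprocessing step in a scikit-learn Pipeline, so that cross_validate refits the imputer, encoder, and scaler on the training part of each split. The following is a minimal sketch under stated assumptions, not the code used above: the toy dataframe and the helper add_date_parts are made up for illustration, and the column names only mimic the employee salaries data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Toy data for illustration only (not the employee salaries dataset).
rng = np.random.default_rng(0)
n = 60
X_toy = pd.DataFrame({
    "gender": rng.choice(["F", "M"], size=n),
    "department": rng.choice(["POL", "HHS", "COR"], size=n),
    "date_first_hired": pd.date_range("2000-01-01", periods=n, freq="30D").strftime("%m/%d/%Y"),
})
X_toy.loc[::10, "gender"] = np.nan  # a few missing values
y_toy = pd.Series(rng.normal(60_000, 10_000, size=n), name="current_annual_salary")


def add_date_parts(df):
    """Hypothetical helper: turn the date column into month and year."""
    parsed = pd.to_datetime(df["date_first_hired"], format="%m/%d/%Y")
    return pd.DataFrame(
        {"hired_month": parsed.dt.month, "hired_year": parsed.dt.year},
        index=df.index,
    )


# Categorical columns: impute first, then one-hot encode.
categorical = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
# Date column: extract numerical parts, then scale them.
date_features = make_pipeline(FunctionTransformer(add_date_parts), StandardScaler())

preprocessor = ColumnTransformer([
    ("categorical", categorical, ["gender", "department"]),
    ("dates", date_features, ["date_first_hired"]),
])
model = make_pipeline(preprocessor, Ridge(alpha=1.0))

# Each fold fits the whole pipeline on its own training split only.
cv_toy = cross_validate(model, X_toy, y_toy, cv=5, scoring="r2")
print(cv_toy["test_score"])
```

Because the target here is random noise, the scores themselves are meaningless. The point is only the structure: nothing is fitted on rows that a fold later uses for testing.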

2.3 Waking up from a nightmare

Thankfully, we live in a world where we can import skrub. Let’s see what we can get if we use skrub.tabular_pipeline.

from skrub import tabular_pipeline

# Perform cross-validation (5-fold)
cv_results = cross_validate(tabular_pipeline("regression"), X, y, cv=5, 
                            scoring=['r2', 'neg_mean_squared_error'],
                            return_train_score=True)

# Convert MSE to RMSE
train_rmse = np.sqrt(-cv_results['train_neg_mean_squared_error'])
test_rmse = np.sqrt(-cv_results['test_neg_mean_squared_error'])

# Display results
print("Cross-Validation Results:")
print(f"Mean test R²: {cv_results['test_r2'].mean():.4f} (+/- {cv_results['test_r2'].std():.4f})")
print(f"Mean test RMSE: {test_rmse.mean():.4f} (+/- {test_rmse.std():.4f})")
/Users/rcap/work/skrub-tutorials/.pixi/envs/doc/lib/python3.14/site-packages/sklearn/utils/validation.py:1352: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
Cross-Validation Results:
Mean test R²: 0.9105 (+/- 0.0160)
Mean test RMSE: 8698.4600 (+/- 1053.9529)

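All the code from before, the tokens and the debugging are replaced by a single import and one function call, and the results are better: a mean test R² of 0.9105 instead of 0.8722, and a mean test RMSE of about 8698 instead of about 10367.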

Throughout the tutorial, we will see how each step can be simplified, replaced, or improved with skrub, working through its features one by one until we get back to tabular_pipeline.
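
The DataConversionWarning in the output above appears because y was read with pd.read_csv and is therefore a one-column dataframe, while the estimator expects a 1D target. A minimal sketch of one way to avoid it, using a made-up one-column dataframe in place of the real target file:

```python
import pandas as pd

# Toy stand-in for the target: read_csv returns a dataframe,
# even when the file has a single column.
y_frame = pd.DataFrame({"current_annual_salary": [69222.18, 97392.47, 104717.28]})
print(y_frame.shape)

# Selecting the column by name gives a 1D Series, the shape scikit-learn asks for.
y_1d = y_frame["current_annual_salary"]
print(y_1d.shape)
```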

2.4 Roadmap for the course

We are going to build what could be a typical pre-processing pipeline:

  1. We will explore the data to identify possible problems and figure out what needs to be cleaned.
  2. We will then sanitize the data to address some common problems.
  3. There will be an intermission on various skrub features that simplify these steps.
  4. Then, we will show how to perform feature engineering using various skrub encoders.
  5. Finally, we will show how we can put everything together.