Let’s begin the lesson by imagining a world without skrub, where we can use only Pandas and scikit-learn to clean data and prepare a machine learning model.
```python
import pandas as pd
import numpy as np

X = pd.read_csv("../data/employee_salaries/data.csv")
y = pd.read_csv("../data/employee_salaries/target.csv")
X.head(5)
```
|   | gender | department | department_name | division | assignment_category | employee_position_title | date_first_hired | year_first_hired |
|---|--------|------------|-----------------|----------|---------------------|-------------------------|------------------|------------------|
| 0 | F | POL | Department of Police | MSB Information Mgmt and Tech Division Records... | Fulltime-Regular | Office Services Coordinator | 09/22/1986 | 1986 |
| 1 | M | POL | Department of Police | ISB Major Crimes Division Fugitive Section | Fulltime-Regular | Master Police Officer | 09/12/1988 | 1988 |
| 2 | F | HHS | Department of Health and Human Services | Adult Protective and Case Management Services | Fulltime-Regular | Social Worker IV | 11/19/1989 | 1989 |
| 3 | M | COR | Correction and Rehabilitation | PRRS Facility and Security | Fulltime-Regular | Resident Supervisor II | 05/05/2014 | 2014 |
| 4 | M | HCA | Department of Housing and Community Affairs | Affordable Housing Programs | Fulltime-Regular | Planning Specialist III | 03/05/2007 | 2007 |
1.1 The target
Let's take a look at the target:
```python
y.head(5)
```
|   | current_annual_salary |
|---|-----------------------|
| 0 | 69222.18 |
| 1 | 97392.47 |
| 2 | 104717.28 |
| 3 | 52734.57 |
| 4 | 93396.00 |
This is a numerical column, and our task is to predict the value of `current_annual_salary`.
1.2 Strategizing
We can begin by exploring the dataframe with `.describe()` and then devise a plan for pre-processing our data.
```python
X.describe(include="all")
```
|        | gender | department | department_name | division | assignment_category | employee_position_title | date_first_hired | year_first_hired |
|--------|--------|------------|-----------------|----------|---------------------|-------------------------|------------------|------------------|
| count  | 9211 | 9228 | 9228 | 9228 | 9228 | 9228 | 9228 | 9228.000000 |
| unique | 2 | 37 | 37 | 694 | 2 | 443 | 2264 | NaN |
| top    | M | POL | Department of Police | School Health Services | Fulltime-Regular | Bus Operator | 12/12/2016 | NaN |
| freq   | 5481 | 1844 | 1844 | 300 | 8394 | 638 | 87 | NaN |
| mean   | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.597529 |
| std    | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9.327078 |
| min    | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1965.000000 |
| 25%    | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1998.000000 |
| 50%    | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.000000 |
| 75%    | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2012.000000 |
| max    | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2016.000000 |
1.3 Our plan
We want to train a linear regression model (`Ridge`) to predict the salary. Therefore, we need to:

- Impute the missing values in the `gender` column.
- Encode categorical features as numerical features.
- Convert the `date_first_hired` column into numerical features.
- Scale the numerical features.
- Evaluate the performance of the model.
1.4 Building a traditional pipeline
Let’s build a traditional predictive pipeline following the steps we just discussed.
1.5 Step 1: Convert date features to numerical
Extract numerical features from the `date_first_hired` column.
```python
# Create a copy to work with
X_processed = X.copy()

# Parse the date column
X_processed['date_first_hired'] = pd.to_datetime(X_processed['date_first_hired'])

# Extract numerical features from the date
X_processed['hired_month'] = X_processed['date_first_hired'].dt.month
X_processed['hired_year'] = X_processed['date_first_hired'].dt.year

# Drop the original date column
X_processed = X_processed.drop('date_first_hired', axis=1)

print("Features after date transformation:")
print("\nShape:", X_processed.shape)
```
Features after date transformation:
Shape: (9228, 9)
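As an aside, pandas' `.dt` accessor exposes other numerical features that could be extracted in exactly the same way. This is an optional extension, not part of the original pipeline:

```python
# Optional aside (not in the original pipeline): other numerical features
# pandas can derive from the parsed dates.
dates = pd.to_datetime(X["date_first_hired"])
print(dates.dt.dayofweek.head(3))  # day of the week, encoded 0-6 (Mon-Sun)
```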
1.6 Step 2: Encode categorical features
Encode only the non-numerical categorical features using one-hot encoding.
```python
# Identify only the non-numerical (truly categorical) columns
categorical_cols = X_processed.select_dtypes(include=['object']).columns.tolist()
print("Categorical columns to encode:", categorical_cols)

# Apply one-hot encoding only to the categorical columns
X_encoded = pd.get_dummies(X_processed, columns=categorical_cols)
print("\nShape after encoding:", X_encoded.shape)
```
Categorical columns to encode: ['gender', 'department', 'department_name', 'division', 'assignment_category', 'employee_position_title']
Shape after encoding: (9228, 1218)
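Where do all those columns come from? A quick cardinality check (an aside, not part of the original code) shows that high-cardinality columns such as `division` and `employee_position_title` dominate:

```python
# Count distinct values per categorical column; one-hot encoding creates one
# new column per distinct value, hence the explosion to 1218 columns.
print(X_processed.select_dtypes(include=["object"]).nunique().sort_values(ascending=False))
```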
1.7 Step 3: Impute missing values
We'll impute missing values in the `gender` column using the most frequent strategy.
```python
from sklearn.impute import SimpleImputer

# Impute missing values with the most frequent value
imputer = SimpleImputer(strategy='most_frequent')
X_encoded_imputed = pd.DataFrame(
    imputer.fit_transform(X_encoded),
    columns=X_encoded.columns
)
```
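Note that, for this dataset at least, the imputer has nothing left to do at this point: `pd.get_dummies` has already turned missing `gender` values into all-zero dummy rows. A quick sanity check (an aside) should confirm it:

```python
# Sanity check: after one-hot encoding, no NaN values remain for the
# imputer to fill (missing gender values became all-zero dummy rows).
print(X_encoded.isna().sum().sum())  # expected: 0
```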
1.8 Step 4: Scale numerical features
Scale numerical features for the Ridge regression model.
```python
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the data
X_scaled = scaler.fit_transform(X_encoded_imputed)
X_scaled = pd.DataFrame(X_scaled, columns=X_encoded_imputed.columns)
```
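A quick verification (an aside): after `StandardScaler`, every column should be approximately centered with unit variance.

```python
# Verify scaling: column means should be ~0 and standard deviations ~1.
print(X_scaled.mean().abs().max())
print(X_scaled.std().mean())
```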
1.9 Step 5: Train Ridge model with cross-validation
Train a `Ridge` regression model and evaluate it with cross-validation.
```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Initialize the Ridge model
ridge = Ridge(alpha=1.0)

# Perform cross-validation (5-fold)
cv_results = cross_validate(
    ridge, X_scaled, y, cv=5,
    scoring=["r2", "neg_mean_squared_error"]
)

# Convert MSE to RMSE
test_rmse = np.sqrt(-cv_results["test_neg_mean_squared_error"])

# Display results
print("Cross-Validation Results:")
print(f"Mean test R²: {cv_results['test_r2'].mean():.4f} (+/- {cv_results['test_r2'].std():.4f})")
print(f"Mean test RMSE: {test_rmse.mean():.4f} (+/- {test_rmse.std():.4f})")
```
Cross-Validation Results:
Mean test R²: 0.8722 (+/- 0.0274)
Mean test RMSE: 10367.1206 (+/- 1403.4322)
1.10 “Just ask an agent to write the code”
That's what I did. Here are some of the issues I noticed:

- Operations were performed in the wrong order.
- It tried to impute categorical features without first encoding them as numerical values.
- The datetime feature was encoded as a categorical (i.e., with dummies).
- The cells could not be executed in order without debugging and re-prompting.
- `pd.get_dummies` was executed on the full dataframe, rather than only on the training split, leading to data leakage.

This means I had to spend time re-prompting the model just to get the code to run, and that's (intentionally) without removing the leakage.
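For reference, the standard way to avoid that leakage in scikit-learn is to put every preprocessing step inside a `Pipeline`, so that `cross_validate` fits the imputer, encoder, and scaler on the training split of each fold. Here is a minimal sketch (an alternative fix, not the agent's code, assuming the `X_processed` frame from Step 1):

```python
# A minimal leakage-free sketch: all preprocessing is fit inside each
# cross-validation fold via a Pipeline + ColumnTransformer.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_cols = ["gender", "department", "department_name", "division",
                    "assignment_category", "employee_position_title"]
numerical_cols = ["year_first_hired", "hired_month", "hired_year"]

preprocess = ColumnTransformer([
    # Impute, then one-hot encode, fitting on the training fold only
    ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore")), categorical_cols),
    # Scale the numerical columns
    ("num", StandardScaler(), numerical_cols),
])

model = Pipeline([("preprocess", preprocess), ("ridge", Ridge(alpha=1.0))])

# Every transformer is now fit on the training split of each fold
cv_results = cross_validate(model, X_processed, y, cv=5, scoring="r2")
print(cv_results["test_r2"].mean())
```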
1.11 Waking up from a nightmare
Thankfully, we live in a world where we can `import skrub`. Let's see what we can get if we use `skrub.tabular_pipeline`.
```python
from skrub import tabular_pipeline

# Perform cross-validation (5-fold)
cv_results = cross_validate(
    tabular_pipeline("regression"), X, y, cv=5,
    scoring=['r2', 'neg_mean_squared_error'],
    return_train_score=True
)

# Convert MSE to RMSE
train_rmse = np.sqrt(-cv_results['train_neg_mean_squared_error'])
test_rmse = np.sqrt(-cv_results['test_neg_mean_squared_error'])

# Display results
print("Cross-Validation Results:")
print(f"Mean test R²: {cv_results['test_r2'].mean():.4f} (+/- {cv_results['test_r2'].std():.4f})")
print(f"Mean test RMSE: {test_rmse.mean():.4f} (+/- {test_rmse.std():.4f})")
```
Cross-Validation Results:
Mean test R²: 0.9101 (+/- 0.0148)
Mean test RMSE: 8718.2014 (+/- 992.7882)
All of the earlier code, tokens, and debugging are replaced by a single import that gives better results.
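If you are curious about what that single call builds, the returned object is a regular scikit-learn estimator (its exact components depend on your skrub version), so you can inspect it like any other:

```python
# Print the pipeline that skrub assembles for a regression task;
# the exact steps may vary across skrub versions.
from skrub import tabular_pipeline

model = tabular_pipeline("regression")
print(model)
```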
Throughout the tutorial, we will see how each step can be simplified, replaced, or improved with skrub, working through its various features until we get to the `tabular_pipeline`.
1.12 Roadmap for the course
We are going to build what could be a typical pre-processing pipeline:

1. We will explore the data to identify possible problems and figure out what needs to be cleaned.
2. We will then sanitize the data to address some common problems.
3. There will be an intermission on various skrub features that simplify common tasks.
4. Then, we will show how to perform feature engineering using various skrub encoders.
5. Finally, we will show how we can put everything together.
1.13 What we saw in this chapter
- We built a predictive pipeline using traditional tools.
- We saw some of its possible shortcomings.
- We tested skrub's `tabular_pipeline` to avoid some of those issues.