This is a regression task: we want to predict the value of current_annual_salary.
Strategizing
We can begin by exploring the dataframe with .describe(), and then devise a plan for pre-processing our data.
X.describe(include="all")
        gender  department       department_name                division  assignment_category  employee_position_title  date_first_hired  year_first_hired
count     9211        9228                  9228                    9228                 9228                     9228              9228       9228.000000
unique       2          37                    37                     694                    2                      443              2264               NaN
top          M         POL  Department of Police  School Health Services     Fulltime-Regular             Bus Operator        12/12/2016               NaN
freq      5481        1844                  1844                     300                 8394                      638                87               NaN
mean       NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       2003.597529
std        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN          9.327078
min        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       1965.000000
25%        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       1998.000000
50%        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       2005.000000
75%        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       2012.000000
max        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       2016.000000
Our plan
We need to:
- Impute the missing values in the gender column (it has 9211 non-null entries out of 9228 rows).
- Encode categorical features as numerical features.
- Convert the column date_first_hired into numerical features.
- Scale numerical features.
- Evaluate the performance of the model.
Step 1: Convert date features to numerical
We extract numerical features from the date_first_hired column.
# Create a copy to work with
X_processed = X.copy()

# Parse the date column
X_processed['date_first_hired'] = pd.to_datetime(X_processed['date_first_hired'])

# Extract numerical features from the date
X_processed['hired_month'] = X_processed['date_first_hired'].dt.month
X_processed['hired_year'] = X_processed['date_first_hired'].dt.year

# Drop the original date column
X_processed = X_processed.drop('date_first_hired', axis=1)

print("Features after date transformation:")
print("\nShape:", X_processed.shape)
Features after date transformation:
Shape: (9228, 9)
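One detail worth flagging: dates such as 12/12/2016 are ambiguous, and pd.to_datetime guesses the format from the data. If we know the dataset uses US-style month/day/year dates (an assumption here), passing an explicit format makes the parsing deterministic. The pd.to_datetime call in the cell above could instead be written as:

# Assuming US-style month/day/year dates; an explicit format
# avoids silent day/month swaps during inference
pd.to_datetime(X['date_first_hired'], format='%m/%d/%Y')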
Step 2: Encode categorical features
We encode the categorical features using one-hot encoding.
# Identify only the non-numerical (truly categorical) columns
categorical_cols = X_processed.select_dtypes(include=['object']).columns.tolist()
print("Categorical columns to encode:", categorical_cols)

# Apply one-hot encoding only to categorical columns
X_encoded = pd.get_dummies(X_processed, columns=categorical_cols)
print("\nShape after encoding:", X_encoded.shape)
Categorical columns to encode: ['gender', 'department', 'department_name', 'division', 'assignment_category', 'employee_position_title']
Shape after encoding: (9228, 1218)
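The jump to 1218 columns is easy to verify from the .describe() output above: one-hot encoding creates one column per distinct category, so the six categorical columns contribute 2 + 37 + 37 + 694 + 2 + 443 = 1215 columns, and the three remaining numerical columns (year_first_hired, hired_month, hired_year) account for the rest:

# Sanity check: one dummy column per category, plus the numerical columns
n_dummies = sum(X_processed[col].nunique() for col in categorical_cols)
n_numeric = X_processed.shape[1] - len(categorical_cols)
print(n_dummies, "+", n_numeric, "=", n_dummies + n_numeric)  # 1215 + 3 = 1218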
Step 3: Impute missing values
We impute the missing values in the gender column.
from sklearn.impute import SimpleImputer

# Impute missing values with the most frequent value
imputer = SimpleImputer(strategy='most_frequent')
X_encoded_imputed = pd.DataFrame(
    imputer.fit_transform(X_encoded),
    columns=X_encoded.columns,
)
Step 4: Scale numerical features
Scale numerical features for the Ridge regression model.
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the data
X_scaled = scaler.fit_transform(X_encoded_imputed)
X_scaled = pd.DataFrame(X_scaled, columns=X_encoded_imputed.columns)
Step 5: Train Ridge model with cross-validation
Train a Ridge regression model and evaluate with cross-validation.
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
import numpy as np

# Initialize the Ridge model
ridge = Ridge(alpha=1.0)

# Perform cross-validation (5-fold)
cv_results = cross_validate(ridge, X_scaled, y, cv=5,
                            scoring=["r2", "neg_mean_squared_error"])

# Convert MSE to RMSE
test_rmse = np.sqrt(-cv_results["test_neg_mean_squared_error"])

# Display results
print("Cross-Validation Results:")
print(f"Mean test R²: {cv_results['test_r2'].mean():.4f} (+/- {cv_results['test_r2'].std():.4f})")
print(f"Mean test RMSE: {test_rmse.mean():.4f} (+/- {test_rmse.std():.4f})")
Cross-Validation Results:
Mean test R²: 0.8722 (+/- 0.0274)
Mean test RMSE: 10367.1206 (+/- 1403.4322)
“Just ask an agent to write the code”
The pipeline above runs, but it shows the classic failure modes of naively generated code:
- Operations were performed in the wrong order.
- It tried to impute categorical features without first converting them to numeric values.
- The datetime feature was treated like a categorical feature.
- Cells could not be executed in order without proper debugging and re-prompting.
- pd.get_dummies was executed on the full dataframe, rather than only on the training split of each fold, leading to data leakage (a leakage-free sketch follows this list).
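For reference, here is a minimal leakage-free sketch of the same model, assuming the X_processed and categorical_cols variables from the steps above. Because every preprocessing step lives inside a single Pipeline, each cross-validation fold fits the imputer, encoder, and scaler on its own training split only:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Impute, then encode, the categorical columns; scale the numerical ones
preprocessing = ColumnTransformer([
    ("categorical",
     make_pipeline(
         SimpleImputer(strategy="most_frequent"),
         OneHotEncoder(handle_unknown="ignore"),
     ),
     categorical_cols),
    ("numerical",
     StandardScaler(),
     ["year_first_hired", "hired_month", "hired_year"]),
])

# The whole chain is refitted inside each CV fold: no leakage
model = make_pipeline(preprocessing, Ridge(alpha=1.0))
cv_results = cross_validate(model, X_processed, y, cv=5,
                            scoring=["r2", "neg_mean_squared_error"])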
Waking up from a nightmare
Thankfully, we can import skrub:
from skrub import tabular_pipeline

# Perform cross-validation (5-fold)
cv_results = cross_validate(tabular_pipeline("regression"), X, y, cv=5,
                            scoring=['r2', 'neg_mean_squared_error'],
                            return_train_score=True)

# Convert MSE to RMSE
train_rmse = np.sqrt(-cv_results['train_neg_mean_squared_error'])
test_rmse = np.sqrt(-cv_results['test_neg_mean_squared_error'])

# Display results
print("Cross-Validation Results:")
print(f"Mean test R²: {cv_results['test_r2'].mean():.4f} (+/- {cv_results['test_r2'].std():.4f})")
print(f"Mean test RMSE: {test_rmse.mean():.4f} (+/- {test_rmse.std():.4f})")
Cross-Validation Results:
Mean test R²: 0.9089 (+/- 0.0161)
Mean test RMSE: 8773.7865 (+/- 1052.5788)
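Under the hood, tabular_pipeline("regression") assembles a full scikit-learn pipeline for us. As a rough mental model (simplified; the exact defaults depend on the skrub version), it is close to combining a TableVectorizer, which handles dates, categories, and missing values automatically, with a gradient-boosted tree model that needs no feature scaling:

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer

# A rough, simplified equivalent of tabular_pipeline("regression"):
# TableVectorizer turns any dataframe into a numerical array,
# and the tree-based model is robust to unscaled features
model = make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())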
Roadmap for the course
- Data exploration with skrub’s TableReport
- Data cleaning and sanitization with the Cleaner
- Intermission: simplifying column operations with skrub
- Feature engineering with the skrub encoders
- Putting everything together: TableVectorizer and tabular_pipeline
What we saw in this chapter
We built a predictive pipeline using traditional tools, and saw how skrub’s tabular_pipeline simplifies the same task.