Encoding: from a dataframe to a numerical matrix for machine learning#
This example demonstrates how to transform a somewhat complicated dataframe into a numerical matrix well suited for machine learning. We study the case of predicting wages using the employee salaries dataset.
A simple prediction pipeline#
Let’s first retrieve the dataset:
from skrub.datasets import fetch_employee_salaries
dataset = fetch_employee_salaries()
We denote X, the employees' characteristics (our input data), and y, the annual salary (our target column):
X = dataset.X
y = dataset.y
We observe diverse columns in the dataset:

- binary ('gender'),
- numerical ('employee_annual_salary'),
- categorical ('department', 'department_name', 'assignment_category'),
- datetime ('date_first_hired'),
- dirty categorical ('employee_position_title', 'division').
Using skrub’s TableVectorizer, we can now build a machine-learning pipeline and train it:
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer
pipeline = make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())
pipeline.fit(X, y)
What just happened here?
We gave our dataframe as input to the TableVectorizer, and it returned an output suitable for the scikit-learn model.
Let’s explore the internals of our encoder, the TableVectorizer:
from pprint import pprint
# Recover the TableVectorizer from the Pipeline
tv = pipeline.named_steps["tablevectorizer"]
pprint(tv.transformers_)
[('numeric', 'passthrough', ['year_first_hired']),
('datetime', DatetimeEncoder(), ['date_first_hired']),
('low_card_cat',
OneHotEncoder(drop='if_binary', handle_unknown='ignore', sparse_output=False),
['gender', 'department', 'department_name', 'assignment_category']),
('high_card_cat',
GapEncoder(n_components=30),
['division', 'employee_position_title'])]
We observe that it has automatically assigned an appropriate encoder to each column:

The OneHotEncoder for low-cardinality string variables: the columns 'gender', 'department', 'department_name' and 'assignment_category'.
tv.named_transformers_["low_card_cat"].get_feature_names_out()
array(['gender_F', 'gender_M', 'gender_nan', 'department_BOA',
'department_BOE', 'department_CAT', 'department_CCL',
'department_CEC', 'department_CEX', 'department_COR',
'department_CUS', 'department_DEP', 'department_DGS',
'department_DHS', 'department_DLC', 'department_DOT',
'department_DPS', 'department_DTS', 'department_ECM',
'department_FIN', 'department_FRS', 'department_HCA',
'department_HHS', 'department_HRC', 'department_IGR',
'department_LIB', 'department_MPB', 'department_NDA',
'department_OAG', 'department_OCP', 'department_OHR',
'department_OIG', 'department_OLO', 'department_OMB',
'department_PIO', 'department_POL', 'department_PRO',
'department_REC', 'department_SHF', 'department_ZAH',
'department_name_Board of Appeals Department',
'department_name_Board of Elections',
'department_name_Community Engagement Cluster',
'department_name_Community Use of Public Facilities',
'department_name_Correction and Rehabilitation',
"department_name_County Attorney's Office",
'department_name_County Council',
'department_name_Department of Environmental Protection',
'department_name_Department of Finance',
'department_name_Department of General Services',
'department_name_Department of Health and Human Services',
'department_name_Department of Housing and Community Affairs',
'department_name_Department of Liquor Control',
'department_name_Department of Permitting Services',
'department_name_Department of Police',
'department_name_Department of Public Libraries',
'department_name_Department of Recreation',
'department_name_Department of Technology Services',
'department_name_Department of Transportation',
'department_name_Ethics Commission',
'department_name_Fire and Rescue Services',
'department_name_Merit System Protection Board Department',
'department_name_Non-Departmental Account',
'department_name_Office of Agriculture',
'department_name_Office of Consumer Protection',
'department_name_Office of Emergency Management and Homeland Security',
'department_name_Office of Human Resources',
'department_name_Office of Human Rights',
'department_name_Office of Intergovernmental Relations Department',
'department_name_Office of Legislative Oversight',
'department_name_Office of Management and Budget',
'department_name_Office of Procurement',
'department_name_Office of Public Information',
'department_name_Office of Zoning and Administrative Hearings',
'department_name_Office of the Inspector General',
'department_name_Offices of the County Executive',
"department_name_Sheriff's Office",
'assignment_category_Parttime-Regular'], dtype=object)
The GapEncoder for high-cardinality string columns: 'employee_position_title' and 'division'. The GapEncoder is a powerful encoder that can handle dirty categorical columns.
tv.named_transformers_["high_card_cat"].get_feature_names_out()
['programs, projects, project', 'behavioral, health, school', 'accounts, council, members', 'training, recruit, office', 'safety, traffic, collision', 'district, squad, 3rd', 'spring, silver, ride', 'station, state, estate', 'custody, planning, toddlers', 'supports, support, network', 'highway, welfare, services', 'gaithersburg, clarksburg, the', 'security, mc311, mccf', 'medical, animal, fiscal', 'management, equipment, automotive', 'technology, administration, parking', 'central, montrose, duplicating', 'communications, communication, telecommunications', 'warehouse, employee, liquor', 'patrol, 4th, 6th', 'divisioincrime, family, pedophile', 'compliance, assistance, emergency', 'delivery, cloverly, operations', 'eligibility, maintenance, facilities', 'recycling, collection, solid', 'nicholson, transit, taxicab', 'building, director, resource', 'rockville, twinbrook, downtown', 'environmental, regulatory, centers', 'investigative, investigations, criminal', 'liquor, clerk, store', 'officer, office, police', 'supervisory, supervisor, librarian', 'operator, bus, operations', 'manager, engineer, management', 'planning, senior, background', 'therapist, the, estate', 'school, health, room', 'firefighter, rescuer, recruit', 'communications, telecommunications, safety', 'income, assistance, client', 'coordinator, services, service', 'crossing, guard, parking', 'community, nurse, security', 'information, technology, renovation', 'enforcement, permitting, inspector', 'master, registered, meter', 'specialist, special, environmental', 'lieutenant, maintenance, shift', 'sergeant, cadet, police', 'accountant, assistant, county', 'equipment, investigator, investment', 'program, programs, projects', 'captain, chief, mcfrs', 'representative, legislative, executive', 'technician, mechanic, supply', 'correctional, correction, corporal', 'candidate, sheriff, deputy', 'warehouse, craftsworker, welfare', 'administrative, principal, administration']
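The topic-like feature names above come from a matrix factorization over character n-gram counts. As a rough, simplified sketch of the idea (not skrub's actual implementation, which uses a Gamma-Poisson factorization), one can factor n-gram counts of toy job titles with scikit-learn's NMF:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

# Toy dirty categories: near-duplicate job titles (hypothetical values).
titles = [
    "Police Officer III",
    "Police Aide",
    "Firefighter/Rescuer III",
    "Master Firefighter/Rescuer",
    "Office Services Coordinator",
]
# Count character 3-grams, then factor the count matrix into 2 latent topics.
counts = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).fit_transform(titles)
topics = NMF(n_components=2, init="nndsvda", random_state=0).fit_transform(counts)
print(topics.shape)  # (5, 2): one activation per (title, latent topic)
```

Similar strings share n-grams, so they load on the same latent topics even when they are not exact duplicates; this is what makes the encoding robust to dirty categories.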
The DatetimeEncoder for the 'date_first_hired' column. The DatetimeEncoder can encode datetime columns for machine learning.
tv.named_transformers_["datetime"].get_feature_names_out()
['date_first_hired_year', 'date_first_hired_month', 'date_first_hired_day', 'date_first_hired_total_seconds']
As we can see, it gave us interpretable column names.
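These features are, roughly, what pandas' `.dt` accessor exposes; a minimal sketch on toy dates (not the dataset's):

```python
import pandas as pd

# Roughly what the DatetimeEncoder extracts, sketched with pandas.
dates = pd.to_datetime(pd.Series(["1986-09-22", "2001-03-14"], name="date_first_hired"))
features = pd.DataFrame(
    {
        "date_first_hired_year": dates.dt.year,
        "date_first_hired_month": dates.dt.month,
        "date_first_hired_day": dates.dt.day,
        # Nanoseconds since the Unix epoch, converted to seconds.
        "date_first_hired_total_seconds": dates.astype("int64") // 10**9,
    }
)
print(features.iloc[0].tolist())
```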
In total, we have a reasonable number of encoded columns:
feature_names = tv.get_feature_names_out()
len(feature_names)
143
Let’s look at the cross-validated R2 score of our model:
from sklearn.model_selection import cross_val_score
import numpy as np
scores = cross_val_score(pipeline, X, y)
print(f"R2 score: mean: {np.mean(scores):.3f}; std: {np.std(scores):.3f}\n")
R2 score: mean: 0.923; std: 0.012
The simple pipeline applied on this complex dataset gave us very good results.
Feature importances in the statistical model#
In this section, after training a regressor, we will plot the feature importances.
First, let’s train another scikit-learn regressor, the RandomForestRegressor
:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor()
pipeline = make_pipeline(TableVectorizer(), regressor)
pipeline.fit(X, y)
We retrieve the feature importances and their standard deviation across the forest's trees:
avg_importances = regressor.feature_importances_
std_importances = np.std(
[tree.feature_importances_ for tree in regressor.estimators_], axis=0
)
indices = np.argsort(avg_importances)[::-1]
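The argsort line above is a common NumPy idiom for ranking values in descending order; a minimal illustration:

```python
import numpy as np

# argsort gives ascending indices; [::-1] reverses them to descending.
vals = np.array([0.1, 0.7, 0.2])
order = np.argsort(vals)[::-1]
print(order.tolist())  # [1, 2, 0]: largest value (0.7) first
```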
And plotting the results:
import matplotlib.pyplot as plt
top_indices = indices[:20]
labels = np.array(feature_names)[top_indices]
plt.figure(figsize=(12, 9))
plt.barh(
    y=labels,
    width=avg_importances[top_indices],
    xerr=std_importances[top_indices],
    color="b",
)
plt.yticks(fontsize=15)
plt.title("Feature importances")
plt.tight_layout(pad=1)
plt.show()

We can see that features such as the time elapsed since being hired, full-time employment, and the position title seem to be the most informative for prediction.
However, feature importances must not be over-interpreted – they capture statistical associations rather than causal effects.
Moreover, the fast feature-importance method used here is biased in favour of features with larger cardinality, as illustrated in a scikit-learn example.
In general, we should prefer permutation_importance(), although it is a slower method.
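A sketch of permutation_importance() on synthetic data (not the salaries dataset):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic regression problem: 5 features, only 2 of them informative.
X, y = make_regression(n_samples=200, n_features=5, n_informative=2, random_state=0)
model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)

# Each feature is shuffled n_repeats times; the drop in score measures
# how much the model relies on that feature.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean.shape)  # (5,): one mean importance per feature
```

It is slower because the model is re-scored once per feature per repeat, but it is not biased toward high-cardinality features.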
Conclusion#
In this example, we motivated the need for a simple machine learning pipeline, which we built using the TableVectorizer and a HistGradientBoostingRegressor.
We saw that by default, it works well on a heterogeneous dataset.
To better understand our dataset, and without much effort, we were also able to plot the feature importances.
Total running time of the script: (1 minute 24.938 seconds)