Encoding: from a dataframe to a numerical matrix for machine learning#
This example demonstrates how to transform a somewhat complicated dataframe into a numerical matrix well suited for machine learning. We study the case of predicting wages using the employee salaries dataset.
A simple prediction pipeline#
Let’s first retrieve the dataset:
from skrub.datasets import fetch_employee_salaries
dataset = fetch_employee_salaries()
We denote X, the employees' characteristics (our input data), and y, the annual salary (our target column):
X = dataset.X
y = dataset.y
We observe diverse columns in the dataset:
- binary ('gender'),
- numerical ('employee_annual_salary'),
- categorical ('department', 'department_name', 'assignment_category'),
- datetime ('date_first_hired'),
- dirty categorical ('employee_position_title', 'division').
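These column kinds can be seen directly from pandas' dtype inference. A minimal sketch on a toy dataframe (the values below are illustrative, not rows from the real dataset):

```python
import pandas as pd

# Toy dataframe mimicking the mix of column kinds in the salaries dataset.
df = pd.DataFrame(
    {
        "gender": ["F", "M", "F"],  # binary category, stored as strings
        "department": ["POL", "FRS", "POL"],  # low-cardinality category
        "employee_position_title": [  # dirty / high-cardinality text
            "Office Services Coordinator",
            "Master Police Officer",
            "Police Aide",
        ],
        "date_first_hired": pd.to_datetime(
            ["1986-09-22", "1988-09-12", "2007-05-14"]
        ),
        "year_first_hired": [1986, 1988, 2007],  # plain numeric column
    }
)
print(df.dtypes)
```

Only `year_first_hired` is directly usable by a numerical model; the other columns need encoding, which is exactly the TableVectorizer's job.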
Using skrub's TableVectorizer, we can now already build a machine-learning pipeline and train it:
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer
pipeline = make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())
pipeline.fit(X, y)
What just happened here?
We gave our dataframe as input to the TableVectorizer, and it returned an output suitable for the scikit-learn model.
Let's explore the internals of our encoder, the TableVectorizer:
from pprint import pprint
# Recover the TableVectorizer from the Pipeline
tv = pipeline.named_steps["tablevectorizer"]
pprint(tv.transformers_)
[('numeric', 'passthrough', ['year_first_hired']),
('datetime', DatetimeEncoder(), ['date_first_hired']),
('low_cardinality',
OneHotEncoder(drop='if_binary', handle_unknown='ignore', sparse_output=False),
['gender', 'department', 'department_name', 'assignment_category']),
('high_cardinality',
GapEncoder(n_components=30),
['division', 'employee_position_title'])]
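The split between the one-hot and Gap encoders is driven by each string column's number of unique values. A minimal sketch of that dispatch rule on toy data (the threshold below is deliberately tiny for illustration; skrub exposes it as the `cardinality_threshold` parameter, whose real default is much larger — check your version):

```python
import pandas as pd

# Illustrative dispatch rule: string columns with few unique values are
# one-hot encoded, the rest go to a high-cardinality encoder.
# A real TableVectorizer uses a much higher threshold than this toy value.
CARDINALITY_THRESHOLD = 4

df = pd.DataFrame(
    {
        "department": ["POL", "FRS", "POL", "HHS"],
        "employee_position_title": [
            "Office Services Coordinator",
            "Master Police Officer",
            "Police Aide",
            "Social Worker IV",
        ],
    }
)

low_card = [c for c in df.columns if df[c].nunique() < CARDINALITY_THRESHOLD]
high_card = [c for c in df.columns if df[c].nunique() >= CARDINALITY_THRESHOLD]
print(low_card, high_card)
```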
We observe that it has automatically assigned an appropriate encoder to each column:
The OneHotEncoder for low-cardinality string columns: 'gender', 'department', 'department_name' and 'assignment_category'.
tv.named_transformers_["low_cardinality"].get_feature_names_out()
array(['gender_F', 'gender_M', 'gender_nan', 'department_BOA',
'department_BOE', 'department_CAT', 'department_CCL',
'department_CEC', 'department_CEX', 'department_COR',
'department_CUS', 'department_DEP', 'department_DGS',
'department_DHS', 'department_DLC', 'department_DOT',
'department_DPS', 'department_DTS', 'department_ECM',
'department_FIN', 'department_FRS', 'department_HCA',
'department_HHS', 'department_HRC', 'department_IGR',
'department_LIB', 'department_MPB', 'department_NDA',
'department_OAG', 'department_OCP', 'department_OHR',
'department_OIG', 'department_OLO', 'department_OMB',
'department_PIO', 'department_POL', 'department_PRO',
'department_REC', 'department_SHF', 'department_ZAH',
'department_name_Board of Appeals Department',
'department_name_Board of Elections',
'department_name_Community Engagement Cluster',
'department_name_Community Use of Public Facilities',
'department_name_Correction and Rehabilitation',
"department_name_County Attorney's Office",
'department_name_County Council',
'department_name_Department of Environmental Protection',
'department_name_Department of Finance',
'department_name_Department of General Services',
'department_name_Department of Health and Human Services',
'department_name_Department of Housing and Community Affairs',
'department_name_Department of Liquor Control',
'department_name_Department of Permitting Services',
'department_name_Department of Police',
'department_name_Department of Public Libraries',
'department_name_Department of Recreation',
'department_name_Department of Technology Services',
'department_name_Department of Transportation',
'department_name_Ethics Commission',
'department_name_Fire and Rescue Services',
'department_name_Merit System Protection Board Department',
'department_name_Non-Departmental Account',
'department_name_Office of Agriculture',
'department_name_Office of Consumer Protection',
'department_name_Office of Emergency Management and Homeland Security',
'department_name_Office of Human Resources',
'department_name_Office of Human Rights',
'department_name_Office of Intergovernmental Relations Department',
'department_name_Office of Legislative Oversight',
'department_name_Office of Management and Budget',
'department_name_Office of Procurement',
'department_name_Office of Public Information',
'department_name_Office of Zoning and Administrative Hearings',
'department_name_Office of the Inspector General',
'department_name_Offices of the County Executive',
"department_name_Sheriff's Office",
'assignment_category_Parttime-Regular'], dtype=object)
The GapEncoder for high-cardinality string columns: 'employee_position_title' and 'division'. The GapEncoder is a powerful encoder that can handle dirty categorical columns.
tv.named_transformers_["high_cardinality"].get_feature_names_out()
array(['compliance, building, violence', 'gaithersburg, clarksburg, the',
'station, state, estate', 'development, planning, accounting',
'patrol, 4th, 5th', 'traffic, safety, alcohol',
'management, equipment, budget', 'toddlers, custody, members',
'services, highway, service', 'behavioral, health, school',
'collection, inspections, operations', 'family, crimes, outreach',
'welfare, childhood, child', 'security, mccf, unit',
'supports, support, network', 'emergency, centers, center',
'district, squad, urban', 'maintenance, facilities, recruit',
'administration, battalion, admin', 'nicholson, transit, taxicab',
'warehouse, delivery, cloverly',
'communications, communication, education', 'spring, silver, king',
'assessment, protective, projects',
'technology, telephone, systems', 'rockville, twinbrook, downtown',
'director, officers, officer', 'assignment, assistance, medical',
'animal, virtual, regional',
'investigative, investigations, explosive',
'firefighter, rescuer, recruit', 'operator, bus, operations',
'officer, office, security', 'government, employee, budget',
'liquor, clerk, store', 'information, technology, renovation',
'manager, engineer, iii', 'income, assistance, client',
'administrative, administration, administrator',
'coordinator, coordinating, transit',
'technician, mechanic, supply', 'accountant, attendant, attorney',
'corporal, pfc, dietary', 'community, health, nurse',
'school, room, behavioral', 'services, supervisor, service',
'enforcement, permitting, inspector', 'lieutenant, captain, chief',
'assistant, library, librarian',
'communications, telecommunications, safety',
'warehouse, welfare, caseworker', 'specialist, special, therapist',
'crossing, purchasing, planning', 'candidate, sheriff, deputy',
'legislative, principal, executive',
'equipment, investment, investigator',
'program, programs, property',
'correctional, correction, regional', 'sergeant, police, cadet',
'master, registered, meter'], dtype=object)
The DatetimeEncoder for the 'date_first_hired' column. The DatetimeEncoder can encode datetime columns for machine learning.
tv.named_transformers_["datetime"].get_feature_names_out()
array(['date_first_hired_year', 'date_first_hired_month',
'date_first_hired_day', 'date_first_hired_total_seconds'],
dtype=object)
As we can see, it gave us interpretable column names.
In total, we have a reasonable number of encoded columns:
len(tv.get_feature_names_out())
143
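The DatetimeEncoder's output can be approximated with plain pandas, which clarifies what the model actually receives. A minimal sketch, assuming the `total_seconds` feature counts seconds since the Unix epoch (giving the model a linear notion of elapsed time):

```python
import pandas as pd

dates = pd.to_datetime(
    pd.Series(["1986-09-22", "2007-05-14"], name="date_first_hired")
)

# Roughly the four features the DatetimeEncoder produced above.
features = pd.DataFrame(
    {
        "date_first_hired_year": dates.dt.year,
        "date_first_hired_month": dates.dt.month,
        "date_first_hired_day": dates.dt.day,
        "date_first_hired_total_seconds": (
            (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
        ),
    }
)
print(features)
```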
Let’s look at the cross-validated R2 score of our model:
from sklearn.model_selection import cross_val_score
import numpy as np
scores = cross_val_score(pipeline, X, y)
print(f"R2 score: mean: {np.mean(scores):.3f}; std: {np.std(scores):.3f}\n")
R2 score: mean: 0.923; std: 0.014
The simple pipeline applied on this complex dataset gave us very good results.
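As a reminder of what this score means: cross_val_score uses the estimator's default scorer, which for regressors is R². R² compares the model's squared error to that of always predicting the mean (1.0 is a perfect fit, 0.0 no better than the mean). A quick sketch with illustrative numbers:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([50_000.0, 60_000.0, 70_000.0, 80_000.0])
y_pred = np.array([52_000.0, 59_000.0, 69_000.0, 81_000.0])

# R² = 1 - SS_res / SS_tot, with SS_tot the error of the mean predictor.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot
print(manual_r2, r2_score(y_true, y_pred))
```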
Feature importances in the statistical model#
In this section, after training a regressor, we will plot the feature importances.
First, let's train another scikit-learn regressor, the RandomForestRegressor:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor()
pipeline = make_pipeline(TableVectorizer(), regressor)
pipeline.fit(X, y)
We retrieve the feature importances and their variability across the forest's trees:
avg_importances = regressor.feature_importances_
std_importances = np.std(
[tree.feature_importances_ for tree in regressor.estimators_], axis=0
)
indices = np.argsort(avg_importances)[::-1]
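The `[::-1]` trick used above can be checked on a toy array: np.argsort returns indices in ascending order, so reversing them puts the most important features first.

```python
import numpy as np

# argsort gives ascending order; [::-1] reverses it to descending.
importances = np.array([0.1, 0.5, 0.2, 0.05])
indices = np.argsort(importances)[::-1]
print(indices)  # index of the largest importance comes first
```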
And plotting the results:
import matplotlib.pyplot as plt
# Recover the encoded feature names from the fitted TableVectorizer
feature_names = pipeline.named_steps["tablevectorizer"].get_feature_names_out()
top_indices = indices[:20]
labels = feature_names[top_indices]
plt.figure(figsize=(12, 9))
plt.barh(
y=labels,
width=avg_importances[top_indices],
xerr=std_importances[top_indices],
color="b",
)
plt.yticks(fontsize=15)
plt.title("Feature importances")
plt.tight_layout(pad=1)
plt.show()
We can see that features such as the time elapsed since being hired, having full-time employment, and the position seem to be the most informative for prediction.
However, feature importances must not be over-interpreted: they capture statistical associations rather than causal effects.
Moreover, the fast feature importance method used here suffers from biases favouring features with larger cardinality, as illustrated in a scikit-learn example.
In general we should prefer permutation_importance(), but it is a slower method.
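For reference, a minimal sketch of permutation_importance on synthetic data (the dataset and model here are illustrative, not the salaries pipeline): the target depends only on the first feature, so permuting it should degrade the score far more than permuting the second.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)  # only feature 0 matters

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Importance = average score drop when a column is shuffled.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)
```

The trade-off: each feature is re-scored `n_repeats` times on a shuffled copy, which is why this is slower than reading `feature_importances_` off the fitted forest.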
Conclusion#
In this example, we motivated the need for a simple machine-learning pipeline, which we built using the TableVectorizer and a HistGradientBoostingRegressor.
We saw that by default, it works well on a heterogeneous dataset.
To better understand our dataset, and without much effort, we were also able to plot the feature importances.
Total running time of the script: (1 minute 27.261 seconds)