Encoding: from a dataframe to a numerical matrix for machine learning#

This example demonstrates how to transform a somewhat complicated dataframe into a numerical matrix that is well suited for machine learning. We study the case of predicting wages using the employee salaries dataset.

A simple prediction pipeline#

Let’s first retrieve the dataset:
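A minimal sketch of this step, using skrub's dataset fetcher:

from skrub.datasets import fetch_employee_salaries

# Download (or load from the local cache) the employee salaries dataset
dataset = fetch_employee_salaries()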

We denote by X the employees' characteristics (our input data), and by y the annual salary (our target column):
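A sketch of the split, using the X and y attributes exposed by the fetcher (consistent with the output below):

X = dataset.X
y = dataset.y
X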

     | gender | department | department_name                              | division                                           | assignment_category | employee_position_title                   | date_first_hired | year_first_hired
0    | F      | POL        | Department of Police                         | MSB Information Mgmt and Tech Division Records...  | Fulltime-Regular    | Office Services Coordinator               | 09/22/1986       | 1986
1    | M      | POL        | Department of Police                         | ISB Major Crimes Division Fugitive Section         | Fulltime-Regular    | Master Police Officer                     | 09/12/1988       | 1988
2    | F      | HHS        | Department of Health and Human Services      | Adult Protective and Case Management Services      | Fulltime-Regular    | Social Worker IV                          | 11/19/1989       | 1989
3    | M      | COR        | Correction and Rehabilitation                | PRRS Facility and Security                         | Fulltime-Regular    | Resident Supervisor II                    | 05/05/2014       | 2014
4    | M      | HCA        | Department of Housing and Community Affairs  | Affordable Housing Programs                        | Fulltime-Regular    | Planning Specialist III                   | 03/05/2007       | 2007
...  | ...    | ...        | ...                                          | ...                                                | ...                 | ...                                       | ...              | ...
9223 | F      | HHS        | Department of Health and Human Services      | School Based Health Centers                        | Fulltime-Regular    | Community Health Nurse II                 | 11/03/2015       | 2015
9224 | F      | FRS        | Fire and Rescue Services                     | Human Resources Division                           | Fulltime-Regular    | Fire/Rescue Division Chief                | 11/28/1988       | 1988
9225 | M      | HHS        | Department of Health and Human Services      | Child and Adolescent Mental Health Clinic Serv...  | Parttime-Regular    | Medical Doctor IV - Psychiatrist          | 04/30/2001       | 2001
9226 | M      | CCL        | County Council                               | Council Central Staff                              | Fulltime-Regular    | Manager II                                | 09/05/2006       | 2006
9227 | M      | DLC        | Department of Liquor Control                 | Licensure, Regulation and Education                | Fulltime-Regular    | Alcohol/Tobacco Enforcement Specialist II | 01/30/2012       | 2012

9228 rows × 8 columns



We observe diverse kinds of columns in the dataset, which we can confirm with the dtype check shown after this list:

  • binary ('gender'),

  • numerical ('year_first_hired', as well as the target 'employee_annual_salary'),

  • categorical ('department', 'department_name', 'assignment_category'),

  • datetime ('date_first_hired'),

  • dirty categorical ('employee_position_title', 'division').
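A quick way to verify these kinds (a sketch; in this dataset the dates are stored as plain strings, which the TableVectorizer below detects and parses as datetimes):

X.dtypes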

Using skrub’s TableVectorizer, we can build a machine-learning pipeline and train it right away:

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer

pipeline = make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())
pipeline.fit(X, y)
Pipeline(steps=[('tablevectorizer', TableVectorizer()),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor())])


What just happened here?

We passed our raw dataframe to the TableVectorizer, and it produced a numerical output suitable for the scikit-learn model.
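As a usage example, we can now predict directly from raw data (a sketch):

# Predict annual salaries for the first five employees, straight from the raw dataframe
predictions = pipeline.predict(X.head())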

Let’s explore the internals of our encoder, the TableVectorizer:

from pprint import pprint

# Recover the TableVectorizer from the Pipeline
tv = pipeline.named_steps["tablevectorizer"]

pprint(tv.transformers_)
[('numeric', 'passthrough', ['year_first_hired']),
 ('datetime', DatetimeEncoder(), ['date_first_hired']),
 ('low_cardinality',
  OneHotEncoder(drop='if_binary', handle_unknown='ignore', sparse_output=False),
  ['gender', 'department', 'department_name', 'assignment_category']),
 ('high_cardinality',
  GapEncoder(n_components=30),
  ['division', 'employee_position_title'])]

We observe that it automatically assigned an appropriate encoder to each kind of column:

  • The OneHotEncoder for low-cardinality string columns: 'gender', 'department', 'department_name' and 'assignment_category'.

tv.named_transformers_["low_cardinality"].get_feature_names_out()
array(['gender_F', 'gender_M', 'gender_nan', 'department_BOA',
       'department_BOE', 'department_CAT', 'department_CCL',
       'department_CEC', 'department_CEX', 'department_COR',
       'department_CUS', 'department_DEP', 'department_DGS',
       'department_DHS', 'department_DLC', 'department_DOT',
       'department_DPS', 'department_DTS', 'department_ECM',
       'department_FIN', 'department_FRS', 'department_HCA',
       'department_HHS', 'department_HRC', 'department_IGR',
       'department_LIB', 'department_MPB', 'department_NDA',
       'department_OAG', 'department_OCP', 'department_OHR',
       'department_OIG', 'department_OLO', 'department_OMB',
       'department_PIO', 'department_POL', 'department_PRO',
       'department_REC', 'department_SHF', 'department_ZAH',
       'department_name_Board of Appeals Department',
       'department_name_Board of Elections',
       'department_name_Community Engagement Cluster',
       'department_name_Community Use of Public Facilities',
       'department_name_Correction and Rehabilitation',
       "department_name_County Attorney's Office",
       'department_name_County Council',
       'department_name_Department of Environmental Protection',
       'department_name_Department of Finance',
       'department_name_Department of General Services',
       'department_name_Department of Health and Human Services',
       'department_name_Department of Housing and Community Affairs',
       'department_name_Department of Liquor Control',
       'department_name_Department of Permitting Services',
       'department_name_Department of Police',
       'department_name_Department of Public Libraries',
       'department_name_Department of Recreation',
       'department_name_Department of Technology Services',
       'department_name_Department of Transportation',
       'department_name_Ethics Commission',
       'department_name_Fire and Rescue Services',
       'department_name_Merit System Protection Board Department',
       'department_name_Non-Departmental Account',
       'department_name_Office of Agriculture',
       'department_name_Office of Consumer Protection',
       'department_name_Office of Emergency Management and Homeland Security',
       'department_name_Office of Human Resources',
       'department_name_Office of Human Rights',
       'department_name_Office of Intergovernmental Relations Department',
       'department_name_Office of Legislative Oversight',
       'department_name_Office of Management and Budget',
       'department_name_Office of Procurement',
       'department_name_Office of Public Information',
       'department_name_Office of Zoning and Administrative Hearings',
       'department_name_Office of the Inspector General',
       'department_name_Offices of the County Executive',
       "department_name_Sheriff's Office",
       'assignment_category_Parttime-Regular'], dtype=object)
  • The GapEncoder for high-cardinality string columns, 'employee_position_title' and 'division'. The GapEncoder is a powerful encoder that can handle dirty categorical columns: it represents each entry as a mixture of latent topics, which is why the feature names below read as groups of related words.

tv.named_transformers_["high_cardinality"].get_feature_names_out()
array(['compliance, building, violence', 'gaithersburg, clarksburg, the',
       'station, state, estate', 'development, planning, accounting',
       'patrol, 4th, 5th', 'traffic, safety, alcohol',
       'management, equipment, budget', 'toddlers, custody, members',
       'services, highway, service', 'behavioral, health, school',
       'collection, inspections, operations', 'family, crimes, outreach',
       'welfare, childhood, child', 'security, mccf, unit',
       'supports, support, network', 'emergency, centers, center',
       'district, squad, urban', 'maintenance, facilities, recruit',
       'administration, battalion, admin', 'nicholson, transit, taxicab',
       'warehouse, delivery, cloverly',
       'communications, communication, education', 'spring, silver, king',
       'assessment, protective, projects',
       'technology, telephone, systems', 'rockville, twinbrook, downtown',
       'director, officers, officer', 'assignment, assistance, medical',
       'animal, virtual, regional',
       'investigative, investigations, explosive',
       'firefighter, rescuer, recruit', 'operator, bus, operations',
       'officer, office, security', 'government, employee, budget',
       'liquor, clerk, store', 'information, technology, renovation',
       'manager, engineer, iii', 'income, assistance, client',
       'administrative, administration, administrator',
       'coordinator, coordinating, transit',
       'technician, mechanic, supply', 'accountant, attendant, attorney',
       'corporal, pfc, dietary', 'community, health, nurse',
       'school, room, behavioral', 'services, supervisor, service',
       'enforcement, permitting, inspector', 'lieutenant, captain, chief',
       'assistant, library, librarian',
       'communications, telecommunications, safety',
       'warehouse, welfare, caseworker', 'specialist, special, therapist',
       'crossing, purchasing, planning', 'candidate, sheriff, deputy',
       'legislative, principal, executive',
       'equipment, investment, investigator',
       'program, programs, property',
       'correctional, correction, regional', 'sergeant, police, cadet',
       'master, registered, meter'], dtype=object)
  • The DatetimeEncoder for the datetime column 'date_first_hired', from which it extracts several numerical features.

tv.named_transformers_["datetime"].get_feature_names_out()
array(['date_first_hired_year', 'date_first_hired_month',
       'date_first_hired_day', 'date_first_hired_total_seconds'],
      dtype=object)

As we can see, it gave us interpretable column names.

In total, we have a reasonable number of encoded columns:
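This count can be read off the fitted vectorizer (a sketch):

len(tv.get_feature_names_out())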

143

Let’s look at the cross-validated R2 score of our model:

from sklearn.model_selection import cross_val_score
import numpy as np

scores = cross_val_score(pipeline, X, y)
print(f"R2 score:  mean: {np.mean(scores):.3f}; std: {np.std(scores):.3f}\n")
R2 score:  mean: 0.923; std: 0.014

The simple pipeline applied on this complex dataset gave us very good results.

Feature importances in the statistical model#

In this section, after training a regressor, we will plot the feature importances.

First, let’s train another scikit-learn regressor, the RandomForestRegressor:
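A minimal sketch of this step, reusing the same vectorization (the model is left with its default hyperparameters, matching the representation below):

from sklearn.ensemble import RandomForestRegressor

# Same vectorization as before, with a random forest as the final estimator
regressor = RandomForestRegressor()
pipeline = make_pipeline(TableVectorizer(), regressor)
pipeline.fit(X, y)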

Pipeline(steps=[('tablevectorizer', TableVectorizer()),
                ('randomforestregressor', RandomForestRegressor())])


We then retrieve the feature importances from the fitted forest:
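A sketch of this step (the variable names are chosen to match the plotting code below; the per-tree standard deviation is a common way to show variability):

# Feature names from the vectorizer of the newly fitted pipeline
feature_names = pipeline.named_steps["tablevectorizer"].get_feature_names_out()

# Mean impurity-based importances and their spread across the trees
avg_importances = regressor.feature_importances_
std_importances = np.std(
    [tree.feature_importances_ for tree in regressor.estimators_], axis=0
)

# Indices of the features sorted by decreasing importance
indices = np.argsort(avg_importances)[::-1]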

And plotting the results:

import matplotlib.pyplot as plt

# Keep the 20 most important features
top_indices = indices[:20]
labels = feature_names[top_indices]

plt.figure(figsize=(12, 9))
plt.barh(
    y=labels,
    width=avg_importances[top_indices],
    xerr=std_importances[top_indices],  # horizontal error bars for horizontal bars
    color="b",
)
plt.yticks(fontsize=15)
plt.title("Feature importances")
plt.tight_layout(pad=1)
plt.show()
[Figure: horizontal bar chart of the top 20 feature importances]

We can see that features such as the time elapsed since being hired, having a full-time position, and the position title seem to be the most informative for prediction. However, feature importances must not be over-interpreted: they capture statistical associations rather than causal effects. Moreover, the fast feature-importance method used here suffers from biases favouring features with larger cardinality, as illustrated in a scikit-learn example. In general we should prefer permutation_importance(), but it is a slower method; a sketch of its use follows.
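A minimal sketch of that alternative (n_repeats and random_state are illustrative choices):

from sklearn.inspection import permutation_importance

# Shuffle each input column in turn and measure the drop in the pipeline's score;
# this works on the raw dataframe because the pipeline handles the encoding itself
result = permutation_importance(pipeline, X, y, n_repeats=5, random_state=0)
perm_importances = result.importances_mean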

Conclusion#

In this example, we motivated the need for a simple machine learning pipeline, which we built using the TableVectorizer and a HistGradientBoostingRegressor.

We saw that by default, it works well on a heterogeneous dataset.

To better understand our dataset, and without much effort, we were also able to plot the feature importances.

Total running time of the script: (1 minute 27.261 seconds)
