Note

Go to the end to download the full example code. or to run this example in your browser via JupyterLite or Binder

Encoding: from a dataframe to a numerical matrix for machine learning#

This example shows how to transform a rich dataframe with columns of various types into a numerical matrix on which machine-learning algorithms can be applied. We study the case of predicting wages using the employee salaries dataset.

Easy learning on a dataframe#

Let’s first retrieve the dataset, using one of the downloaders from the skrub.datasets module. As all the downloaders, fetch_employee_salaries() returns a dataset with attributes X, and y. X is a dataframe which contains the features (aka design matrix, explanatory variables, independent variables). y is a column (pandas Series) which contains the target (aka dependent, response variable) that we want to learn to predict from X. In this case y is the annual salary.

from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()
employees, salaries = dataset.X, dataset.y
employees

	gender	department	department_name	division	assignment_category	employee_position_title	date_first_hired	year_first_hired
0	F	POL	Department of Police	MSB Information Mgmt and Tech Division Records...	Fulltime-Regular	Office Services Coordinator	09/22/1986	1986
1	M	POL	Department of Police	ISB Major Crimes Division Fugitive Section	Fulltime-Regular	Master Police Officer	09/12/1988	1988
2	F	HHS	Department of Health and Human Services	Adult Protective and Case Management Services	Fulltime-Regular	Social Worker IV	11/19/1989	1989
3	M	COR	Correction and Rehabilitation	PRRS Facility and Security	Fulltime-Regular	Resident Supervisor II	05/05/2014	2014
4	M	HCA	Department of Housing and Community Affairs	Affordable Housing Programs	Fulltime-Regular	Planning Specialist III	03/05/2007	2007
...	...	...	...	...	...	...	...	...
9223	F	HHS	Department of Health and Human Services	School Based Health Centers	Fulltime-Regular	Community Health Nurse II	11/03/2015	2015
9224	F	FRS	Fire and Rescue Services	Human Resources Division	Fulltime-Regular	Fire/Rescue Division Chief	11/28/1988	1988
9225	M	HHS	Department of Health and Human Services	Child and Adolescent Mental Health Clinic Serv...	Parttime-Regular	Medical Doctor IV - Psychiatrist	04/30/2001	2001
9226	M	CCL	County Council	Council Central Staff	Fulltime-Regular	Manager II	09/05/2006	2006
9227	M	DLC	Department of Liquor Control	Licensure, Regulation and Education	Fulltime-Regular	Alcohol/Tobacco Enforcement Specialist II	01/30/2012	2012

9228 rows × 8 columns

Most machine-learning algorithms work with arrays of numbers. The challenge here is that the employees dataframe is a heterogeneous set of columns: some are numerical ('year_first_hired'), some dates ('date_first_hired'), some have a few categorical entries ('gender'), some many ('employee_position_title'). Therefore our table needs to be “vectorized”: processed to extract numeric features.

skrub provides an easy way to build a simple but reliable machine-learning model which includes this step, working well on most tabular data.

from sklearn.model_selection import cross_validate

from skrub import tabular_pipeline

model = tabular_pipeline("regressor")
results = cross_validate(model, employees, salaries)
results["test_score"]

array([0.90452368, 0.88345288, 0.91244   , 0.92039782, 0.9244769 ])

The estimator returned by tabular_pipeline combines 2 steps:

a TableVectorizer to preprocess the dataframe and vectorize the features
a supervised learner (by default a HistGradientBoostingRegressor)

model

Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(low_cardinality=ToCategorical())),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor())])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Pipeline

?Documentation for PipelineiNot fitted

Parameters

	steps	[('tablevectorizer', ...), ('histgradientboostingregressor', ...)]
	transform_input	None
	memory	None
	verbose	False

tablevectorizer: TableVectorizer

Parameters

	cardinality_threshold	40
	low_cardinality	ToCategorical()
	high_cardinality	StringEncoder()
	numeric	PassThrough()
	datetime	DatetimeEncoder()
	specific_transformers	()
	drop_null_fraction	1.0
	drop_if_constant	False
	drop_if_unique	False
	datetime_format	None
	n_jobs	None

numeric

PassThrough

Parameters

datetime

DatetimeEncoder

Parameters

	resolution	'hour'
	add_weekday	False
	add_total_seconds	True
	add_day_of_year	False
	periodic_encoding	None

low_cardinality

ToCategorical

Parameters

high_cardinality

StringEncoder

Parameters

	n_components	30
	vectorizer	'tfidf'
	ngram_range	(3, ...)
	analyzer	'char_wb'
	stop_words	None
	random_state	None

HistGradientBoostingRegressor

?Documentation for HistGradientBoostingRegressor

Parameters

	loss	'squared_error'
	quantile	None
	learning_rate	0.1
	max_iter	100
	max_leaf_nodes	31
	max_depth	None
	min_samples_leaf	20
	l2_regularization	0.0
	max_features	1.0
	max_bins	255
	categorical_features	'from_dtype'
	monotonic_cst	None
	interaction_cst	None
	warm_start	False
	early_stopping	'auto'
	scoring	'loss'
	validation_fraction	0.1
	n_iter_no_change	10
	tol	1e-07
	verbose	0
	random_state	None

In the rest of this example, we focus on the first step and explore the capabilities of skrub’s TableVectorizer.

More details on encoding tabular data#

from skrub import TableVectorizer

vectorizer = TableVectorizer()
vectorized_employees = vectorizer.fit_transform(employees)
vectorized_employees

	gender_F	gender_M	gender_nan	department_BOA	department_BOE	department_CAT	department_CCL	department_CEC	department_CEX	department_COR	department_CUS	department_DEP	department_DGS	department_DHS	department_DLC	department_DOT	department_DPS	department_DTS	department_ECM	department_FIN	department_FRS	department_HCA	department_HHS	department_HRC	department_IGR	department_LIB	department_MPB	department_NDA	department_OAG	department_OCP	department_OHR	department_OIG	department_OLO	department_OMB	department_PIO	department_POL	department_PRO	department_REC	department_SHF	department_ZAH	...	division_26	division_27	division_28	division_29	assignment_category_Parttime-Regular	employee_position_title_00	employee_position_title_01	employee_position_title_02	employee_position_title_03	employee_position_title_04	employee_position_title_05	employee_position_title_06	employee_position_title_07	employee_position_title_08	employee_position_title_09	employee_position_title_10	employee_position_title_11	employee_position_title_12	employee_position_title_13	employee_position_title_14	employee_position_title_15	employee_position_title_16	employee_position_title_17	employee_position_title_18	employee_position_title_19	employee_position_title_20	employee_position_title_21	employee_position_title_22	employee_position_title_23	employee_position_title_24	employee_position_title_25	employee_position_title_26	employee_position_title_27	employee_position_title_28	employee_position_title_29	date_first_hired_year	date_first_hired_month	date_first_hired_day	date_first_hired_total_seconds	year_first_hired
0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	...	0.029187	0.009604	-0.031597	0.032416	0.0	0.397703	-0.145889	0.179803	-0.065339	0.094766	0.093161	0.781628	0.305310	-0.285337	-0.127317	-0.118575	-0.416054	-0.052083	-0.171975	0.202518	0.009069	-0.117606	-0.022875	0.089707	-0.115838	-0.029446	0.097631	-0.036076	-0.015420	0.020086	-0.028533	0.044979	-0.096462	-0.093053	0.033528	1986.0	9.0	22.0	5.277312e+08	1986.0
1	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	...	-0.065379	-0.036309	0.078563	0.118464	0.0	0.847239	-0.117935	-0.048590	-0.110615	-0.052886	-0.048624	-0.107769	0.008099	-0.023591	-0.055244	-0.056615	-0.008624	0.056052	-0.058611	-0.021353	-0.002972	-0.006452	-0.078454	-0.147510	-0.020218	-0.104189	0.169656	-0.006593	-0.069411	-0.095294	0.660841	0.196425	0.073967	0.077667	0.098836	1988.0	9.0	12.0	5.900256e+08	1988.0
2	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	-0.101815	-0.227951	-0.307631	-0.123017	0.0	0.048005	0.016050	0.007036	0.087218	0.119996	0.058963	0.212715	-0.234665	0.375214	0.732489	-0.460234	0.047673	0.059960	0.029502	0.175537	0.031689	0.012878	-0.017573	-0.054939	-0.060721	-0.086759	0.028220	0.029657	0.006286	-0.005152	0.060892	0.033596	-0.042682	0.002878	0.110310	1989.0	11.0	19.0	6.274368e+08	1989.0
3	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	-0.056698	0.057097	-0.017365	-0.021537	0.0	0.046073	0.023713	0.069390	0.038924	0.057340	0.048957	0.131004	0.062024	0.020833	0.018177	0.062882	-0.017079	-0.060764	-0.007685	-0.025632	-0.019886	0.116336	0.046973	-0.089283	0.130436	0.152650	-0.012450	0.045580	0.176030	-0.001896	0.074023	-0.234357	-0.311416	0.596178	0.019439	2014.0	5.0	5.0	1.399248e+09	2014.0
4	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	-0.106608	-0.042736	0.003595	0.040278	0.0	0.090343	0.024198	0.025948	0.243111	0.389858	-0.063121	-0.014224	-0.145517	-0.033803	0.044335	0.004473	-0.080458	0.018213	0.122334	-0.190568	0.062593	-0.062285	0.051005	0.072456	-0.091485	0.043049	0.041001	-0.254452	-0.038836	0.014083	-0.136793	0.164698	0.156661	0.176543	0.013222	2007.0	3.0	5.0	1.173053e+09	2007.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9223	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	-0.134804	-0.033937	0.215199	0.081492	0.0	0.049991	0.011395	0.010624	0.089169	0.122423	0.520378	0.270872	0.083378	-0.199043	-0.111103	-0.006210	0.814793	0.236683	0.186492	-0.159872	-0.011200	-0.128973	0.021882	-0.002121	-0.085134	-0.177344	-0.020176	0.097230	0.086775	0.049881	-0.009618	0.210970	-0.177573	-0.011662	-0.034892	2015.0	11.0	3.0	1.446509e+09	2015.0
9224	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	-0.028559	-0.041015	0.124541	0.040464	0.0	0.062241	0.229413	0.002450	-0.031309	0.035628	-0.002899	0.007269	0.051493	0.047162	-0.009746	0.078445	-0.009662	-0.130781	0.252426	0.151820	0.030861	-0.099108	-0.016205	-0.002092	0.000045	0.017420	0.003927	0.039990	0.021610	-0.023790	0.011079	-0.038769	-0.031236	0.022176	-0.002360	1988.0	11.0	28.0	5.966784e+08	1988.0
9225	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	-0.094059	0.035389	0.120561	0.055037	1.0	0.007742	-0.000054	0.042824	0.020265	0.042899	0.003150	0.019045	-0.006844	0.016752	0.022611	0.005787	-0.013206	0.003410	0.004962	-0.005662	0.000369	-0.001978	0.025254	-0.001523	0.005178	0.019769	-0.012185	-0.015940	0.007297	0.008000	0.020587	0.008453	-0.023543	-0.031323	0.020125	2001.0	4.0	30.0	9.885888e+08	2001.0
9226	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	-0.073294	0.177879	-0.237405	0.016165	0.0	0.201066	0.112069	0.031427	0.880795	-0.610547	0.117353	0.099269	-0.098671	0.092763	-0.056669	0.054343	-0.050832	-0.090569	0.032294	-0.102014	-0.034524	-0.037607	-0.037316	-0.034372	-0.078253	0.057185	0.038950	-0.041395	0.025661	0.003578	0.050107	0.071181	-0.021336	-0.020872	-0.067173	2006.0	9.0	5.0	1.157414e+09	2006.0
9227	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.003857	0.034795	0.046666	0.059956	0.0	0.033905	0.010537	0.029032	0.137317	0.229570	-0.019878	-0.002935	-0.060315	-0.014835	0.025116	0.019305	-0.073234	-0.002619	0.095801	-0.129263	-0.009118	-0.018017	0.241626	-0.029612	0.017963	-0.028004	-0.051172	-0.112206	-0.010353	0.004339	0.033606	0.033913	-0.058043	-0.052797	0.027186	2012.0	1.0	30.0	1.327882e+09	2012.0

9228 rows × 143 columns

From our 8 columns, the TableVectorizer has extracted 143 numerical features. Most of them are one-hot encoded representations of the categorical features. For example, we can see that 3 columns 'gender_F', 'gender_M', 'gender_nan' were created to encode the 'gender' column.

By performing appropriate transformations on our complex data, the TableVectorizer produced numeric features that we can use for machine-learning:

from sklearn.ensemble import HistGradientBoostingRegressor

HistGradientBoostingRegressor().fit(vectorized_employees, salaries)

HistGradientBoostingRegressor()

The TableVectorizer bridges the gap between tabular data and machine-learning pipelines. It allows us to apply a machine-learning estimator to our dataframe without manual data wrangling and feature extraction.

Inspecting the TableVectorizer#

The TableVectorizer distinguishes between 4 basic kinds of columns (more may be added in the future). For each kind, it applies a different transformation, which we can configure. The kinds of columns and the default transformation for each of them are:

numeric columns: simply casting to floating-point
datetime columns: extracting features such as year, day, hour with the DatetimeEncoder
low-cardinality categorical columns: one-hot encoding
high-cardinality categorical columns: a simple and effective text representation pipeline provided by the GapEncoder

vectorizer

TableVectorizer()

TableVectorizer

iFitted

Parameters

	cardinality_threshold	40
	low_cardinality	OneHotEncoder..._output=False)
	high_cardinality	StringEncoder()
	numeric	PassThrough()
	datetime	DatetimeEncoder()
	specific_transformers	()
	drop_null_fraction	1.0
	drop_if_constant	False
	drop_if_unique	False
	datetime_format	None
	n_jobs	None

numeric

['year_first_hired']

PassThrough

Parameters

datetime

['date_first_hired']

DatetimeEncoder

Parameters

	resolution	'hour'
	add_weekday	False
	add_total_seconds	True
	add_day_of_year	False
	periodic_encoding	None

low_cardinality

['gender', 'department', 'department_name', 'assignment_category']

OneHotEncoder

?Documentation for OneHotEncoder

Parameters

	categories	'auto'
	drop	'if_binary'
	sparse_output	False
	dtype	'float32'
	handle_unknown	'ignore'
	min_frequency	None
	max_categories	None
	feature_name_combiner	'concat'

high_cardinality

['division', 'employee_position_title']

StringEncoder

Parameters

	n_components	30
	vectorizer	'tfidf'
	ngram_range	(3, ...)
	analyzer	'char_wb'
	stop_words	None
	random_state	None

We can inspect which transformation was chosen for a each column and retrieve the fitted transformer. vectorizer.kind_to_columns_ provides an overview of how the vectorizer categorized columns in our input:

vectorizer.kind_to_columns_

{'numeric': ['year_first_hired'], 'datetime': ['date_first_hired'], 'low_cardinality': ['gender', 'department', 'department_name', 'assignment_category'], 'high_cardinality': ['division', 'employee_position_title'], 'specific': []}

The reverse mapping is given by:

vectorizer.column_to_kind_

{'year_first_hired': 'numeric', 'date_first_hired': 'datetime', 'gender': 'low_cardinality', 'department': 'low_cardinality', 'department_name': 'low_cardinality', 'assignment_category': 'low_cardinality', 'division': 'high_cardinality', 'employee_position_title': 'high_cardinality'}

vectorizer.transformers_ gives us a dictionary which maps column names to the corresponding transformer.

vectorizer.transformers_["date_first_hired"]

DatetimeEncoder()

We can also see which features in the vectorizer’s output were derived from a given input column.

vectorizer.input_to_outputs_["date_first_hired"]

['date_first_hired_year', 'date_first_hired_month', 'date_first_hired_day', 'date_first_hired_total_seconds']

vectorized_employees[vectorizer.input_to_outputs_["date_first_hired"]]

	date_first_hired_year	date_first_hired_month	date_first_hired_day	date_first_hired_total_seconds
0	1986.0	9.0	22.0	5.277312e+08
1	1988.0	9.0	12.0	5.900256e+08
2	1989.0	11.0	19.0	6.274368e+08
3	2014.0	5.0	5.0	1.399248e+09
4	2007.0	3.0	5.0	1.173053e+09
...	...	...	...	...
9223	2015.0	11.0	3.0	1.446509e+09
9224	1988.0	11.0	28.0	5.966784e+08
9225	2001.0	4.0	30.0	9.885888e+08
9226	2006.0	9.0	5.0	1.157414e+09
9227	2012.0	1.0	30.0	1.327882e+09

9228 rows × 4 columns

Finally, we can go in the opposite direction: given a column in the input, find out from which input column it was derived.

vectorizer.output_to_input_["department_BOA"]

'department'

Dataframe preprocessing#

Note that "date_first_hired" has been recognized and processed as a datetime ßcolumn.

vectorizer.column_to_kind_["date_first_hired"]

'datetime'

But looking closer at our original dataframe, it was encoded as a string.

employees["date_first_hired"]

     09/22/1986
     09/12/1988
     11/19/1989
     05/05/2014
     03/05/2007
           ...
  11/03/2015
  11/28/1988
  04/30/2001
  09/05/2006
  01/30/2012
Name: date_first_hired, Length: 9228, dtype: object

Note the dtype: object in the output above. Before applying the transformers we specify, the TableVectorizer performs a few preprocessing steps.

For example, strings commonly used to represent missing values such as "N/A" are replaced with actual null. As we saw above, columns containing strings that represent dates (e.g. '2024-05-15') are detected and converted to proper datetimes.

We can inspect the list of steps that were applied to a given column:

vectorizer.all_processing_steps_["date_first_hired"]

[CleanNullStrings(), DropUninformative(), ToDatetime(), DatetimeEncoder(), {'date_first_hired_day': ToFloat32(), 'date_first_hired_month': ToFloat32(), ...}]

These preprocessing steps depend on the column:

vectorizer.all_processing_steps_["department"]

[CleanNullStrings(), DropUninformative(), ToStr(), OneHotEncoder(drop='if_binary', dtype='float32', handle_unknown='ignore',
              sparse_output=False), {'department_BOA': ToFloat32(), 'department_BOE': ToFloat32(), ...}]

A simple Pipeline for tabular data#

The TableVectorizer outputs data that can be understood by a scikit-learn estimator. Therefore we can easily build a 2-step scikit-learn Pipeline that we can fit, test or cross-validate and that works well on tabular data.

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())

results = cross_validate(pipeline, employees, salaries)
scores = results["test_score"]
print(f"R2 score:  mean: {np.mean(scores):.3f}; std: {np.std(scores):.3f}")
print(f"mean fit time: {np.mean(results['fit_time']):.3f} seconds")

R2 score:  mean: 0.913; std: 0.016
mean fit time: 1.882 seconds

Specializing the TableVectorizer for HistGradientBoosting#

The encoders used by default by the TableVectorizer are safe choices for a wide range of downstream estimators. If we know we want to use it with a HistGradientBoostingRegressor (or classifier) model, we can make some different choices that are only well-suited for tree-based models but can yield a faster pipeline. We make 2 changes.

The HistGradientBoostingRegressor has built-in support for categorical features, so we do not need to one-hot encode them. We do need to tell it which features should be treated as categorical with the categorical_features parameter. In recent versions of scikit-learn, we can set categorical_features='from_dtype', and it will treat all columns in the input that have a Categorical dtype as such. Therefore we change the encoder for low-cardinality columns: instead of OneHotEncoder, we use skrub’s ToCategorical. This transformer will simply ensure our columns have an actual Categorical dtype (as opposed to string for example), so that they can be recognized by the HistGradientBoostingRegressor.

The second change replaces the GapEncoder with a MinHashEncoder. The GapEncoder is a topic model. It produces interpretable embeddings in a vector space where distances are meaningful, which is great for interpretation and necessary for some downstream supervised learners such as linear models. However fitting the topic model is costly in computation time and memory. The MinHashEncoder produces features that are not easy to interpret, but that decision trees can efficiently use to test for the occurrence of particular character n-grams (more details are provided in its documentation). Therefore it can be a faster and very effective alternative, when the supervised learner is built on top of decision trees, which is the case for the HistGradientBoostingRegressor.

The resulting pipeline is identical to the one produced by default by tabular_pipeline.

from skrub import MinHashEncoder, ToCategorical

vectorizer = TableVectorizer(
    low_cardinality=ToCategorical(), high_cardinality=MinHashEncoder()
)
pipeline = make_pipeline(
    vectorizer, HistGradientBoostingRegressor(categorical_features="from_dtype")
)

results = cross_validate(pipeline, employees, salaries)
scores = results["test_score"]
print(f"R2 score:  mean: {np.mean(scores):.3f}; std: {np.std(scores):.3f}")
print(f"mean fit time: {np.mean(results['fit_time']):.3f} seconds")

R2 score:  mean: 0.911; std: 0.014
mean fit time: 0.961 seconds

We can see that this new pipeline achieves a similar score but is fitted much faster. This is mostly due to replacing GapEncoder with MinHashEncoder (however this makes the features less interpretable).

Conclusion#

In this example, we motivated the need for a simple machine learning pipeline, which we built using the TableVectorizer and a HistGradientBoostingRegressor.

We saw that by default, it works well on a heterogeneous dataset.

To better understand our dataset, and without much effort, we were also able to plot the feature importances.

Total running time of the script: (0 minutes 41.018 seconds)

Gallery generated by Sphinx-Gallery

Encoding: from a dataframe to a numerical matrix for machine learning#

Easy learning on a dataframe#

More details on encoding tabular data#

Inspecting the TableVectorizer#

Dataframe preprocessing#

A simple Pipeline for tabular data#

Specializing the TableVectorizer for HistGradientBoosting#

Feature importances in the statistical model#

Conclusion#

This Page

	steps	[('tablevectorizer', ...), ('randomforestregressor', ...)]
	transform_input	None
	memory	None
	verbose	False

	n_estimators	50
	criterion	'squared_error'
	max_depth	20
	min_samples_split	2
	min_samples_leaf	1
	min_weight_fraction_leaf	0.0
	max_features	1.0
	max_leaf_nodes	None
	min_impurity_decrease	0.0
	bootstrap	True
	oob_score	False
	n_jobs	None
	random_state	0
	verbose	0
	warm_start	False
	ccp_alpha	0.0
	max_samples	None
	monotonic_cst	None