What’s the best way to encode categorical features? A use case with Skrub encoders
Author
Riccardo Cappuzzo
Encoding categorical values (such as names and addresses, but also textual data) is a very common problem when preparing tabular data for training ML models, and Skrub provides four encoders to this end:
skrub.MinHashEncoder, a simple encoder based on hashing categories.
skrub.GapEncoder, which encodes strings based on latent categories estimated from the data.
skrub.TextEncoder, a language model-based encoder that uses pre-trained language models to produce vectors for each string.
skrub.StringEncoder, an encoder that vectorizes data with tf-idf and then applies SVD to reduce the number of features.
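To give an intuition for what "hashing categories" means, here is a minimal, standard-library sketch of min-hashing over character 3-grams. It is not skrub's implementation: the helper names (`char_ngrams`, `seeded_hash`, `minhash_encode`, `shared_components`) and the use of SHA-256 are assumptions made for illustration. The idea is that two strings sharing many n-grams tend to agree on many components of their encoding.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of lowercase character n-grams of a space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def seeded_hash(ngram, seed):
    """Deterministic 64-bit integer hash of an n-gram for a given seed."""
    digest = hashlib.sha256(f"{seed}:{ngram}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_encode(text, n_components=30):
    """Keep, for each seed, the smallest hash over all n-grams of the string."""
    grams = char_ngrams(text)
    if not grams:  # strings too short to contain any n-gram
        return [0] * n_components
    return [min(seeded_hash(g, seed) for g in grams) for seed in range(n_components)]


def shared_components(a, b, n_components=30):
    """Count the positions where the encodings of two strings agree."""
    va = minhash_encode(a, n_components)
    vb = minhash_encode(b, n_components)
    return sum(x == y for x, y in zip(va, vb))


# Two similar categories versus two unrelated ones
similar = shared_components("Transit Silver Spring Ride On", "Transit Gaithersburg Ride On")
different = shared_components("Transit Silver Spring Ride On", "Child Welfare Services")
print(similar, different)
```

Categories that overlap in spelling end up with partially identical vectors, which is the kind of signal a tree-based learner can split on.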
The objective of this post is to test the performance of each encoder on a few datasets in order to find out which methods should be considered in various circumstances.
Preparing the datasets
We begin by importing and preparing the datasets that should be used for the experiments:
We can test each method by building a scikit-learn pipeline for each categorical encoder, using the default HistGradientBoostingClassifier and HistGradientBoostingRegressor as the prediction models.
Then, we use the cross_validate function to track the fit and score time, as well as the prediction performance of each pipeline over different splits. For simplicity, we are not performing hyperparameter optimization for either the categorical encoder or the learner.
from matplotlib import pyplot as plt
from sklearn.ensemble import (
    HistGradientBoostingClassifier,
    HistGradientBoostingRegressor,
)
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
import polars as pl
from skrub import (
    GapEncoder,
    MinHashEncoder,
    StringEncoder,
    TableVectorizer,
    TextEncoder,
)


def run_experiments(X, y, task, dataset_name):
    if task == "regression":
        model = HistGradientBoostingRegressor()
        scoring = "r2"
    else:
        model = HistGradientBoostingClassifier()
        scoring = "roc_auc"

    results = []

    # For each encoder, create a new pipeline
    gap_pipe = make_pipeline(
        TableVectorizer(high_cardinality=GapEncoder(n_components=30)), model
    )
    minhash_pipe = make_pipeline(
        TableVectorizer(high_cardinality=MinHashEncoder(n_components=30)), model
    )
    text_encoder = TextEncoder(
        "sentence-transformers/paraphrase-albert-small-v2",
        device="cpu",
    )
    text_encoder_pipe = make_pipeline(
        TableVectorizer(high_cardinality=text_encoder),
        model,
    )
    string_encoder = StringEncoder(ngram_range=(3, 4), analyzer="char_wb")
    string_encoder_pipe = make_pipeline(
        TableVectorizer(high_cardinality=string_encoder),
        model,
    )

    pipes = [
        ("GapEncoder", gap_pipe),
        ("MinHashEncoder", minhash_pipe),
        ("TextEncoder", text_encoder_pipe),
        ("StringEncoder", string_encoder_pipe),
    ]

    for name, p in pipes:
        cross_validate_results = cross_validate(p, X, y, scoring=scoring)
        results.append(
            pl.DataFrame(add_results(name, dataset_name, cross_validate_results))
        )

    df_results = pl.concat(results).with_columns(task=pl.lit(task))
    return df_results
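The function above calls an `add_results` helper that is not shown in this post. The following is a hypothetical reconstruction, assuming it only needs to turn the dictionary returned by cross_validate (with keys `fit_time`, `score_time` and `test_score`) into columns that `pl.DataFrame` can consume. The column names `estimator`, `dataset`, `fit_time` and `test_score` are the ones used later by the plotting code; everything else is an assumption.

```python
def add_results(estimator_name, dataset_name, cv_results):
    """Flatten one cross_validate output into column-oriented records.

    Hypothetical helper: one row per cross-validation split, tagged with
    the encoder name and the dataset name.
    """
    n_splits = len(cv_results["test_score"])
    return {
        "estimator": [estimator_name] * n_splits,
        "dataset": [dataset_name] * n_splits,
        "fit_time": [float(t) for t in cv_results["fit_time"]],
        "score_time": [float(t) for t in cv_results["score_time"]],
        "test_score": [float(s) for s in cv_results["test_score"]],
    }


# Made-up timings and scores, for illustration only
fake_cv_results = {
    "fit_time": [0.5, 0.7],
    "score_time": [0.1, 0.2],
    "test_score": [0.8, 0.9],
}
records = add_results("StringEncoder", "toxicity", fake_cv_results)
print(records["estimator"], records["test_score"])
```

Returning a dictionary of equal-length lists keeps the helper independent of any dataframe library while remaining directly usable as input to a dataframe constructor.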
Running the experiments and saving the results
Finally, I ran the cross-validation step for each encoder on all datasets and recorded the results in a CSV file. This step took quite some time and was done offline in a separate script.
all_results = []
for dataset_name, v in datasets.items():
    X, y, task = v
    results = run_experiments(X, y, task, dataset_name)
    all_results.append(results)

df_all_results = pl.concat(all_results)
df_all_results.write_csv("results-encoder_benchmark.csv")
Plotting the results
Now we can load the results and start plotting them. We first split the results into two subtables based on the task (either regression or classification), to avoid mixing metrics.
import polars as pl
import matplotlib.pyplot as plt

df = pl.read_csv("results-encoder_benchmark.csv")
df_regression = df.filter(task="regression")
df_classification = df.filter(task="classification")
To see the tradeoff between fit time and prediction performance, we use a scatterplot in which each point is one cross-validation split, with error bars showing the mean and standard deviation of the fit time and test score for each method.
Then, we can plot the prediction performance as a function of the run time.
def plot_scatter_errorbar(df, ylabel, sharey=False, suptitle=""):
    # Fixing the colors for each cluster of points and the error bars
    tab10_colors = plt.get_cmap("tab10").colors
    colors = dict(zip(df["estimator"].unique().sort().to_list(), tab10_colors[:4]))
    fig, axs = plt.subplots(1, 2, sharey=sharey, layout="constrained", figsize=(8, 3))

    # Each dataset gets a subplot
    for idx, (dataset, g) in enumerate(df.group_by("dataset")):
        ax = axs[idx]
        # Each estimator is plotted separately as a cluster of points
        for estimator, gdf in g.group_by("estimator"):
            estim = estimator[0]
            color = colors[estim]
            x = gdf["fit_time"].to_numpy()
            y = gdf["test_score"].to_numpy()
            label = estim if idx == 0 else "_" + estim
            ax.scatter(x=x, y=y, label=label, color=color)

            # find the mean and the error bars
            xerr_mean = gdf["fit_time"].mean()
            yerr_mean = gdf["test_score"].mean()
            x_err = gdf["fit_time"].std()
            y_err = gdf["test_score"].std()

            # plot the error bars
            ax.errorbar(xerr_mean, yerr_mean, xerr=x_err, fmt="none", color=color)
            ax.errorbar(xerr_mean, yerr_mean, yerr=y_err, fmt="none", color=color)

        ax.set_title(dataset[0])
        ax.set_xlabel("Fit time (s)")
        ax.set_ylabel(ylabel)
        ax.set_xscale("log")

    fig.suptitle(suptitle)
    fig.legend(loc="lower center", ncols=2)
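The error bars in the plotting function are centered on the per-estimator mean and span one sample standard deviation. As a rough sketch of that aggregation, independent of polars and matplotlib, the snippet below computes the same quantities with the standard library. The `summarize` function and the numbers in `rows` are made up for illustration and are not results from the benchmark.

```python
import statistics
from collections import defaultdict


def summarize(rows):
    """Group rows by estimator; return mean and sample std of fit time and score."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["estimator"]].append(row)

    summary = {}
    for estimator, group in groups.items():
        fit_times = [r["fit_time"] for r in group]
        scores = [r["test_score"] for r in group]
        # statistics.stdev is the sample standard deviation and
        # needs at least two values per group
        summary[estimator] = {
            "fit_time_mean": statistics.mean(fit_times),
            "fit_time_std": statistics.stdev(fit_times),
            "test_score_mean": statistics.mean(scores),
            "test_score_std": statistics.stdev(scores),
        }
    return summary


# Made-up values: two cross-validation splits for two fictional encoders
rows = [
    {"estimator": "A", "fit_time": 1.0, "test_score": 0.80},
    {"estimator": "A", "fit_time": 3.0, "test_score": 0.90},
    {"estimator": "B", "fit_time": 10.0, "test_score": 0.70},
    {"estimator": "B", "fit_time": 14.0, "test_score": 0.74},
]
summary = summarize(rows)
for name, stats in summary.items():
    print(name, stats)
```

Each entry of `summary` corresponds to one cross in the scatterplot: the mean gives its center and the standard deviation gives the half-length of each arm.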
The prediction performance is fairly consistent across methods, although this depends on the table under observation. MinHashEncoder and StringEncoder are consistently faster than the alternatives. TextEncoder is always much slower than the other methods; note, however, that this example was run on a CPU rather than on a much faster GPU.
To have a better idea of why some methods may outperform others, we should take a look at the actual tables. We can do so very easily thanks to the skrub TableReport object.
from skrub import TableReport
from skrub.datasets import (
    fetch_toxicity,
    fetch_movielens,
    fetch_employee_salaries,
    fetch_open_payments,
)

# OPEN PAYMENTS
dataset = fetch_open_payments()
X, y = dataset.X, dataset.y
TableReport(X)
# TOXICITY
dataset = fetch_toxicity()
X, y = dataset.X, dataset.y
TableReport(X)
[TableReport output (interactive widget, not rendered here). The table has a single column, text (ObjectDType), with no null values and 999 unique values (99.9%); each row is a free-text comment.]
[TableReport output (interactive widget, not rendered here) for the table with columns movieId (Int64, 9,724 unique values), title (String, 9,719 unique values, 99.9%) and genres (String, 951 unique values, 9.8%); no column has null values.]
[TableReport output (interactive widget, not rendered here) for the table with columns gender (String, 2 unique values, 17 nulls), department (String, 37 unique values), department_name (String, 37 unique values), division (String, 694 unique values, 7.5%), assignment_category (String, 2 unique values), employee_position_title (String, 443 unique values, 4.8%), date_first_hired (String, 2,264 unique values, 24.5%) and year_first_hired (Int64, 51 unique values).]
All datasets include high-cardinality features, which must be encoded using one of the skrub encoders. The Toxicity dataset differs from the others in that it contains free-form text (tweets), while all other tables include a (possibly) large number of unique categories.
This explains why the TextEncoder is so much better than the other encoders on Toxicity, while its performance on the other datasets is more in line with the others.
On the other hand, the StringEncoder shows strong performance in all cases, while being among the two fastest methods on average in terms of fit time.
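The difference between a column of repeated categories and a column of free text can be checked directly from the number of unique values, which is what the "Unique values" entries of the table reports summarize. Below is a minimal sketch of such a check; the `column_profile` function, the threshold of 40 and the example columns are assumptions chosen for illustration, not skrub code or data from the benchmark.

```python
def column_profile(values, cardinality_threshold=40):
    """Count unique non-null values and flag whether the column is high cardinality."""
    non_null = [v for v in values if v is not None]
    n_unique = len(set(non_null))
    return {
        "n_unique": n_unique,
        "unique_ratio": n_unique / len(non_null) if non_null else 0.0,
        # Assumed rule: at or above the threshold, treat as high cardinality
        "high_cardinality": n_unique >= cardinality_threshold,
    }


# Made-up columns: a binary category and a column where every entry is distinct
binary_column = ["F", "M"] * 50
distinct_column = [f"Movie {i}" for i in range(100)]

print(column_profile(binary_column))
print(column_profile(distinct_column))
```

A unique ratio close to 1, as in the text column of the Toxicity table, suggests that each entry is essentially its own category, which is where an encoder that models the content of the string has the most to offer.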
In summary
The skrub TableVectorizer transforms categorical features into numbers so that ML models can make better use of the information they contain. The StringEncoder can be considered the best all-rounder, being fast to train in most cases, while maintaining strong performance in general. The TextEncoder shines when textual data is available as it can make full use of the pre-trained language models it relies on. The MinHashEncoder and the GapEncoder are more specialized models that may work better than the alternatives in specific circumstances.