Less wrangling, more machine learning

skrub is a Python library to ease preprocessing and feature engineering for tabular machine learning.
Our long-term goal is to directly connect database tables to machine learning estimators.

Get started →

Effortless Pipelines

Create strong scikit-learn pipeline baselines effortlessly with TableVectorizer and tabular_learner.

Given a complex dataframe df:

from skrub.datasets import fetch_employee_salaries
dataset = fetch_employee_salaries()
df = dataset.X
y = dataset.y
df
        

	gender	department	department_name	division	assignment_category	employee_position_title	date_first_hired	year_first_hired
0	F	POL	Department of Police	MSB Information Mgmt and...	Fulltime-Regular	Office Services Coordinator	09/22/1986	1986
1	M	POL	Department of Police	ISB Major Crimes...	Fulltime-Regular	Master Police Officer	09/12/1988	1988
...	...	...	...	...	...	...	...	...
9226	M	CCL	County Council	Council Central Staff	Fulltime-Regular	Manager II	09/05/2006	2006
9227	M	DLC	Department of Liquor Control	Licensure, Regulation...	Fulltime-Regular	Alcohol/Tobacco Enforcement Specialist II	01/30/2012	2012

from sklearn.model_selection import cross_val_score
from skrub import tabular_learner
cross_val_score(tabular_learner('regressor'), df, y)
        

array([0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666])

Powerful Feature Engineering

Encode text and high-cardinality categorical data with the GapEncoder and MinHashEncoder, and extract features from dates with the DatetimeEncoder.

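import matplotlib.pyplot as plt
from skrub import GapEncoder

# Learn latent topics from the job titles
gap = GapEncoder().fit(df["employee_position_title"])
# Encode the first five titles as topic activations
encoded_labels = gap.transform(
    df["employee_position_title"].head()
)
plt.imshow(encoded_labels)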
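To give an intuition for min-hash encoding of high-cardinality strings, here is a minimal standard-library sketch. It is not skrub's MinHashEncoder implementation: the helper names (`char_ngrams`, `minhash_signature`, `estimated_similarity`), the choice of character 3-grams, and the use of `hashlib.blake2b` are all assumptions made for illustration.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a padded, lowercased string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))}


def _hash01(token, seed):
    """Deterministically hash a token to a float in [0, 1], salted by seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little") / 2**64


def minhash_signature(text, n_components=8, n=3):
    """For each seed, keep the smallest hash value over the string's n-grams."""
    grams = char_ngrams(text, n)
    return [min(_hash01(g, seed) for g in grams) for seed in range(n_components)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching components, an estimate of n-gram set overlap."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


sig_officer = minhash_signature("Master Police Officer")
sig_officer_again = minhash_signature("Master Police Officer")
sig_manager = minhash_signature("Manager II")
```

Strings that share many n-grams tend to agree on more signature components, so similar job titles end up with similar fixed-length numeric vectors without building a vocabulary first.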
    
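The idea behind extracting features from dates can be sketched with the standard library alone. This is only an illustration of the concept, not the DatetimeEncoder's actual behavior or output: the `date_features` helper, the chosen features, and the `%m/%d/%Y` format (inferred from values such as `09/22/1986` in `date_first_hired`) are assumptions.

```python
from datetime import datetime


def date_features(text, fmt="%m/%d/%Y"):
    """Parse a date string and split it into simple numeric features."""
    parsed = datetime.strptime(text, fmt)
    return {
        "year": parsed.year,
        "month": parsed.month,
        "day": parsed.day,
        "weekday": parsed.weekday(),  # Monday is 0, Sunday is 6
    }


features = date_features("09/22/1986")
```

Turning one opaque string column into several numeric columns lets a downstream estimator pick up patterns such as seasonality or long-term trends.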

Interactive Data Exploration

Explore your dataframes interactively with TableReport.

from skrub import TableReport
TableReport(df)
    

	gender	department	department_name	division	assignment_category	employee_position_title	date_first_hired	year_first_hired
0	F	POL	Department of Police	MSB Information Mgmt and Tech Division Records Management Section	Fulltime-Regular	Office Services Coordinator	09/22/1986	1986
1	M	POL	Department of Police	ISB Major Crimes Division Fugitive Section	Fulltime-Regular	Master Police Officer	09/12/1988	1988
2	F	HHS	Department of Health and Human Services	Adult Protective and Case Management Services	Fulltime-Regular	Social Worker IV	11/19/1989	1989

9226	M	CCL	County Council	Council Central Staff	Fulltime-Regular	Manager II	09/05/2006	2006
9227	M	DLC	Department of Liquor Control	Licensure, Regulation and Education	Fulltime-Regular	Alcohol/Tobacco Enforcement Specialist II	01/30/2012	2012

Column	Column name	dtype	Null values	Unique values	Mean	Std	Min	Median	Max
0	gender	ObjectDType	17 (0.2%)	2 (< 0.1%)
1	department	ObjectDType	0 (0.0%)	37 (0.4%)
2	department_name	ObjectDType	0 (0.0%)	37 (0.4%)
3	division	ObjectDType	0 (0.0%)	694 (7.5%)
4	assignment_category	ObjectDType	0 (0.0%)	2 (< 0.1%)
5	employee_position_title	ObjectDType	0 (0.0%)	443 (4.8%)
6	date_first_hired	ObjectDType	0 (0.0%)	2264 (24.5%)
7	year_first_hired	Int64DType	0 (0.0%)		2.00e+03	9.33	1,965	2,005	2,016

Column 1	Column 2	Cramér's V
department	department_name	1.00
division	assignment_category	0.613
assignment_category	employee_position_title	0.521
division	employee_position_title	0.423
department_name	employee_position_title	0.415
department	employee_position_title	0.415
department	assignment_category	0.400
department_name	assignment_category	0.400
gender	department	0.396
gender	department_name	0.396
department	division	0.362
department_name	division	0.362
gender	employee_position_title	0.290
gender	assignment_category	0.263
gender	division	0.253
employee_position_title	date_first_hired	0.245
date_first_hired	year_first_hired	0.165
department	date_first_hired	0.160
department_name	date_first_hired	0.160
employee_position_title	year_first_hired	0.122
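The association scores above are Cramér's V values, which range from 0 (no association) to 1 (one column fully determines the other). As a rough guide to how such a score can be obtained, here is a sketch of the textbook formula, V = sqrt(chi2 / (n * (min(rows, cols) - 1))). The `cramers_v` helper is hypothetical, and the report may compute its values differently (for example with binning of numeric columns or other adjustments), so this is not expected to reproduce the table exactly.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Textbook Cramér's V between two equally long categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)

    # Chi-squared statistic over every cell of the contingency table
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    if k == 0:
        # A constant column carries no information about the other one
        return 0.0
    return sqrt(chi2 / (n * k))


# Toy data: each code maps to exactly one name, as with department/department_name
codes = ["POL", "POL", "HHS", "HHS"]
names = ["Police", "Police", "Health", "Health"]
v_dependent = cramers_v(codes, names)

# Toy data: every combination appears equally often, so there is no association
v_independent = cramers_v(["a", "a", "b", "b"], ["x", "y", "x", "y"])
```

A score of 1.00, as reported for `department` and `department_name`, suggests the two columns are redundant and one of them could be dropped.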

Our Community

The skrub project is powered by the efforts of a worldwide community of contributors.

Try it yourself!

Ready to write less code and get more insights? Dive into skrub now and be part of an emerging community!

Get started →
