Less wrangling, more machine learning

skrub is a Python library to ease preprocessing and feature engineering for tabular machine learning.
Our long-term goal is to directly connect database tables to machine learning estimators.

Effortless Pipelines

Create strong scikit-learn pipeline baselines effortlessly with TableVectorizer and tabular_learner.

Given a complex dataframe df:
from skrub.datasets import fetch_employee_salaries
dataset = fetch_employee_salaries()
df = dataset.X
y = dataset.y
df
      gender  department  department_name               division                     assignment_category  employee_position_title                    date_first_hired  year_first_hired
0     F       POL         Department of Police          MSB Information Mgmt and...  Fulltime-Regular     Office Services Coordinator                09/22/1986        1986
1     M       POL         Department of Police          ISB Major Crimes...          Fulltime-Regular     Master Police Officer                      09/12/1988        1988
...   ...     ...         ...                           ...                          ...                  ...                                        ...               ...
9226  M       CCL         County Council                Council Central Staff        Fulltime-Regular     Manager II                                 09/05/2006        2006
9227  M       DLC         Department of Liquor Control  Licensure, Regulation...     Fulltime-Regular     Alcohol/Tobacco Enforcement Specialist II  01/30/2012        2012
from sklearn.model_selection import cross_val_score
from skrub import tabular_learner
cross_val_score(tabular_learner('regressor'), df, y)
array([0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666])

Powerful Feature Engineering

Encode text and high-cardinality categorical data with the GapEncoder and MinHashEncoder, and extract features from dates with the DatetimeEncoder.

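import matplotlib.pyplot as plt
from skrub import GapEncoder

# Fit the encoder on the job-title column
gap = GapEncoder().fit(df["employee_position_title"])
# Encode the first five job titles
encoded_labels = gap.transform(
    df["employee_position_title"].head()
)
# Display the encoded values as a heatmap
plt.imshow(encoded_labels)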

Interactive Data Exploration

Explore your dataframes interactively with TableReport.

from skrub import TableReport
TableReport(df)

Our Community

The skrub project is powered by the efforts of a worldwide community of contributors.

Try it yourself!

Ready to write less code and get more insights? Dive into skrub now and be part of an emerging community!