Skrub workshop - Introducing Skrub

Less wrangling, more machine learning

Riccardo Cappuzzo, Jérôme Dockès, Vincent Maladiere

Inria, P16, Probabl.

2025-01-29

Plan for the presentation

  • Introducing skrub
    • Example use case
    • Detailed explanation of the features
    • Getting involved

In the beginning…

Skrub stems from the development of dirty_cat, a package that provided support for handling dirty columns and performing fuzzy joins across tables.

It has since evolved into a package that provides:

  • interactive data exploration
  • automated pre-processing of pandas and polars dataframes
  • powerful feature engineering

Skrub’s vision

The goal of skrub is to facilitate building and deploying machine-learning models on pandas and polars dataframes (later, SQL databases…)

Skrub is high-level, with a philosophy and an API matching those of scikit-learn. It strives to bridge the worlds of databases and machine learning, enabling the assembly and representation of data even when it is noisy or imperfect.

Let’s set the stage

An example use case

  1. Gather some data
    • Employee salaries, census, customer churn…
  2. Explore the data
    • Null values, dtypes, correlated features…
  3. Pre-process the data
  4. Build a scikit-learn estimator
  5. ???
  6. Profit 📈

Exploring the data

import skrub
import pandas as pd
from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()
employees, salaries = dataset.X, dataset.y

employees.describe(include="all")
gender department department_name division assignment_category employee_position_title date_first_hired year_first_hired
count 9211 9228 9228 9228 9228 9228 9228 9228.000000
unique 2 37 37 694 2 443 2264 NaN
top M POL Department of Police School Health Services Fulltime-Regular Bus Operator 12/12/2016 NaN
freq 5481 1844 1844 300 8394 638 87 NaN
mean NaN NaN NaN NaN NaN NaN NaN 2003.597529
std NaN NaN NaN NaN NaN NaN NaN 9.327078
min NaN NaN NaN NaN NaN NaN NaN 1965.000000
25% NaN NaN NaN NaN NaN NaN NaN 1998.000000
50% NaN NaN NaN NaN NaN NaN NaN 2005.000000
75% NaN NaN NaN NaN NaN NaN NaN 2012.000000
max NaN NaN NaN NaN NaN NaN NaN 2016.000000

Exploring the data… interactively!

from skrub import TableReport
TableReport(employees)

Preview

Main features:

  • Obtain high-level statistics about the data (number of unique values, missing values…)
  • Explore the distribution of values and find outliers
  • Discover highly correlated columns
  • Export and share the report as an HTML file (see the snippet below)
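
A minimal sketch of exporting a report, assuming the employees dataframe loaded earlier (write_html saves the interactive report as a standalone HTML file):

from skrub import TableReport

report = TableReport(employees)
# save the interactive report so it can be opened in any browser and shared
report.write_html("employees_report.html")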

Build a predictive pipeline

from sklearn.linear_model import Ridge
model = Ridge()

Build a predictive pipeline

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(StandardScaler(), Ridge())

Build a predictive pipeline

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

model = make_pipeline(SimpleImputer(), StandardScaler(), Ridge())

Build a predictive pipeline

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer

categorical_columns = selector(dtype_include=object)(employees)
numerical_columns = selector(dtype_exclude=object)(employees)

ct = make_column_transformer(
      (StandardScaler(),
       numerical_columns),
      (OneHotEncoder(handle_unknown="ignore"),
       categorical_columns))

model = make_pipeline(ct, SimpleImputer(), Ridge())

Build a predictive pipeline

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['year_first_hired']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['gender', 'department',
                                                   'department_name',
                                                   'division',
                                                   'assignment_category',
                                                   'employee_position_title',
                                                   'date_first_hired'])])),
                ('simpleimputer', SimpleImputer()), ('ridge', Ridge())])

Enter: tabular_learner

import skrub
from sklearn.linear_model import Ridge
tl = skrub.tabular_learner(Ridge())

Enter: tabular_learner

Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(datetime=DatetimeEncoder(periodic_encoding='spline'))),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler()), ('ridge', Ridge())])

A robust baseline: tabular_learner

Given a scikit-learn estimator, tabular_learner wraps it in a full pipeline: a TableVectorizer that turns the dataframe into numeric features, missing-value imputation, and scaling when the model benefits from it (as in the Ridge pipeline shown above).

You can also write tabular_learner("regressor") to get a strong default model built around a HistGradientBoostingRegressor:

Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(low_cardinality=ToCategorical())),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor())])

Skrub’s main features

Unmasking the tabular_learner

Under the hood: TableVectorizer

  1. Pre-process the data
  2. Convert complex data types (datetimes, text) into numerical features

Under the hood: TableVectorizer

Pre-process the data

Under the hood: TableVectorizer

Convert complex data types (datetimes, text) into numerical features

High-cardinality encoders are robust in the presence of typos and dirty data (see the sketch below).
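
A minimal sketch of one such encoder on a dirty column, assuming the employees dataframe from earlier and skrub's single-column encoder API (GapEncoder learns latent "topics" from the strings):

from skrub import GapEncoder

# encode a high-cardinality, typo-prone column into a few latent topics
encoder = GapEncoder(n_components=10)
position_features = encoder.fit_transform(employees["employee_position_title"])
print(position_features.shape)           # one row per employee, 10 topic columns
print(encoder.get_feature_names_out())   # human-readable topic names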

Under the hood: TableVectorizer

vectorizer = skrub.TableVectorizer()
transformed = vectorizer.fit_transform(employees)
from pprint import pprint

pprint(vectorizer.column_to_kind_)
{'assignment_category': 'low_cardinality',
 'date_first_hired': 'datetime',
 'department': 'low_cardinality',
 'department_name': 'low_cardinality',
 'division': 'high_cardinality',
 'employee_position_title': 'high_cardinality',
 'gender': 'low_cardinality',
 'year_first_hired': 'numeric'}
pprint(vectorizer.all_processing_steps_["date_first_hired"])
[CleanNullStrings(),
 DropUninformative(),
 ToDatetime(),
 DatetimeEncoder(),
 {'date_first_hired_day': ToFloat32(), 'date_first_hired_month': ToFloat32(), ...}]

Encoding datetime features

import pandas as pd

data = pd.read_csv(
    "https://raw.githubusercontent.com/skrub-data/datasets/master"
    "/data/bike-sharing-dataset.csv"
)
# Extract our input data (X) and the target column (y)
y = data["cnt"]
X = data[["date", "holiday", "temp", "hum", "windspeed", "weathersit"]]

X
date holiday temp hum windspeed weathersit
0 2011-01-01 00:00:00 0 0.24 0.81 0.0000 1
1 2011-01-01 01:00:00 0 0.22 0.80 0.0000 1
2 2011-01-01 02:00:00 0 0.22 0.80 0.0000 1
3 2011-01-01 03:00:00 0 0.24 0.75 0.0000 1
4 2011-01-01 04:00:00 0 0.24 0.75 0.0000 1
... ... ... ... ... ... ...
17374 2012-12-31 19:00:00 0 0.26 0.60 0.1642 2
17375 2012-12-31 20:00:00 0 0.26 0.60 0.1642 2
17376 2012-12-31 21:00:00 0 0.26 0.60 0.1642 1
17377 2012-12-31 22:00:00 0 0.26 0.56 0.1343 1
17378 2012-12-31 23:00:00 0 0.26 0.65 0.1343 1

17379 rows × 6 columns

Encoding datetime features

from pprint import pprint
from skrub import TableVectorizer, DatetimeEncoder

table_vec_weekday = TableVectorizer(datetime=DatetimeEncoder(add_weekday=True)).fit(X)
pprint(table_vec_weekday.get_feature_names_out())
array(['date_year', 'date_month', 'date_day', 'date_hour',
       'date_total_seconds', 'date_weekday', 'holiday', 'temp', 'hum',
       'windspeed', 'weathersit'], dtype='<U18')

Encoding datetime features

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

pipeline_weekday = make_pipeline(table_vec_weekday, HistGradientBoostingRegressor())

cross_val_score(
    pipeline_weekday, X, y, scoring="neg_mean_squared_error",
    cv=TimeSeriesSplit(n_splits=5),
)
array([ -3694.45159469,  -3180.1148674 , -15183.44808403,  -4824.29173547,
        -5391.06731737])
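
As a variant of the pipeline above, a sketch that encodes the date with smooth periodic (spline) features instead, using the periodic_encoding option seen in the tabular_learner pipeline earlier; scores are omitted here:

from skrub import TableVectorizer, DatetimeEncoder
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# same pipeline, but hour/month become periodic spline features rather than raw integers
table_vec_periodic = TableVectorizer(datetime=DatetimeEncoder(periodic_encoding="spline"))
pipeline_periodic = make_pipeline(table_vec_periodic, HistGradientBoostingRegressor())

cross_val_score(
    pipeline_periodic, X, y, scoring="neg_mean_squared_error",
    cv=TimeSeriesSplit(n_splits=5),
)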

Encoding datetime features

Timeseries support in skrub is still in its early stages! Please stay tuned for new developments.

Augmenting features through joining
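
Beyond exact joins, skrub's Joiner performs fuzzy joins: it matches rows whose keys are close but not identical. A minimal sketch with made-up tables:

import pandas as pd
import skrub

# main table with a misspelled country name
main = pd.DataFrame({"Country": ["France", "Germany", "Itali"]})
# auxiliary table with canonical spellings and extra information to bring in
aux = pd.DataFrame({
    "Country": ["France", "Germany", "Italy"],
    "Capital": ["Paris", "Berlin", "Rome"]})

# fuzzy-join the capital onto the main table despite the typo in "Itali"
joiner = skrub.Joiner(aux, main_key="Country", aux_key="Country", suffix="_aux")
joiner.fit_transform(main)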

AggJoiner: automatically aggregate values

import skrub
import pandas as pd

df1 = pd.DataFrame({
    "UID": [28, 32, 28], 
    "Basket ID": [1100, 1300, 1400]})
df2 = pd.DataFrame({
    "Basket ID": [1100, 1100, 1100, 1300, 1400], 
    "Product ID": ["A521", "B695", "F221", "W214", "B695",], 
    "Price": [25, 30, 10, 320, 30]})

joiner = skrub.AggJoiner(df2, operations="sum", key="Basket ID", cols="Price")
joiner.fit_transform(df1)
UID Basket ID Price_sum
0 28 1100 65
1 32 1300 320
2 28 1400 30

InterpolationJoiner: infer missing values

import pandas as pd

from skrub.datasets import fetch_flight_delays

dataset = fetch_flight_delays()
weather = dataset.weather
weather = weather.sample(100_000, random_state=0, ignore_index=True)
stations = dataset.stations
weather = stations.merge(weather, on="ID")[
    ["LATITUDE", "LONGITUDE", "YEAR/MONTH/DAY", "TMAX", "PRCP", "SNOW"]
]
weather["YEAR/MONTH/DAY"] = pd.to_datetime(weather["YEAR/MONTH/DAY"])
LATITUDE LONGITUDE YEAR/MONTH/DAY TMAX PRCP SNOW
0 25.333 55.517 2008-11-16 297.0 NaN NaN
1 25.333 55.517 2008-04-12 333.0 NaN NaN
2 25.255 55.364 2008-08-28 430.0 0.0 NaN
3 25.255 55.364 2008-02-17 264.0 0.0 NaN
4 25.255 55.364 2008-11-25 291.0 NaN NaN

InterpolationJoiner

from skrub import InterpolationJoiner

# split weather into a main table (values to predict) and an auxiliary table
# the joiner learns from; any disjoint split of the rows works for this demo
n_main = weather.shape[0] // 2
main_table = weather.iloc[:n_main]
aux_table = weather.iloc[n_main:]
joiner = InterpolationJoiner(
    aux_table,
    key=["LATITUDE", "LONGITUDE", "YEAR/MONTH/DAY"],
    suffix="_predicted",
).fit(main_table)
join = joiner.transform(main_table)
join.head()
LATITUDE LONGITUDE YEAR/MONTH/DAY TMAX PRCP SNOW TMAX_predicted PRCP_predicted SNOW_predicted
0 25.333 55.517 2008-11-16 297.0 NaN NaN 253.131219 10.099097 -0.001023
1 25.333 55.517 2008-04-12 333.0 NaN NaN 292.267381 8.770696 0.142712
2 25.255 55.364 2008-08-28 430.0 0.0 NaN 330.768383 22.541307 -0.053227
3 25.255 55.364 2008-02-17 264.0 0.0 NaN 274.812990 12.980406 0.239012
4 25.255 55.364 2008-11-25 291.0 NaN NaN 260.411121 6.621378 -0.014548

InterpolationJoiner

Augmenting features through joining

All Joiner objects are scikit-learn estimators, so they can be used in a Pipeline.
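
A minimal sketch of a joiner as a pipeline step, reusing the baskets tables from the AggJoiner slide (the Ridge model at the end is just a placeholder):

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
import skrub

# enrich each basket with the aggregated price before vectorizing and fitting
pipeline = make_pipeline(
    skrub.AggJoiner(df2, operations="sum", key="Basket ID", cols="Price"),
    skrub.TableVectorizer(),
    Ridge(),
)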

Additional goodies: deduplication

deduplicate maps misspelled categories to their most likely correct spelling:

from pprint import pprint
from skrub import deduplicate

# a small list of categories containing typos
duplicated = ["ulack", "black", "black", "xudte", "white", "white"]
pprint(duplicated)
deduplicate_correspondence = deduplicate(duplicated)
pprint(deduplicate_correspondence.to_dict())
['ulack', 'black', 'black', 'xudte', 'white', 'white']
{'black': 'black', 'ulack': 'black', 'white': 'white', 'xudte': 'white'}

Doc

Additional goodies: Wikipedia embeddings as features

KEN embeddings capture relational information about all entities in Wikipedia.

from skrub.datasets import fetch_ken_embeddings
embedding_games = fetch_ken_embeddings(
    search_types="game",
    exclude="companies|developer",
    embedding_table_id="games",
)

Doc

Getting involved

Install skrub

Base installation:

# with pip
pip install skrub -U
# with conda
conda install -c conda-forge skrub

For deep learning features such as TextEncoder:

# with pip
pip install skrub[transformers] -U
# with conda (conda has no "extras" syntax: install the optional
# dependencies, such as transformers and sentence-transformers, alongside skrub)
conda install -c conda-forge skrub transformers sentence-transformers
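
Once those optional dependencies are installed, TextEncoder can embed a column of free text with a pretrained language model. A minimal sketch (the toy column is made up; the default pretrained model is downloaded on first use):

import pandas as pd
from skrub import TextEncoder

# a toy free-text column; TextEncoder turns each string into a dense embedding
positions = pd.Series(["Bus Operator", "Police Officer", "School Nurse"], name="position")
encoder = TextEncoder()  # wraps a pretrained sentence-transformers model
embeddings = encoder.fit_transform(positions)
print(embeddings.shape)  # one row per string, one column per embedding dimension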

Join the community

Contribute to skrub

Q & A time!