Skrub

SODA Kickoff 2025 - Machine Learning with Dataframes

Riccardo Cappuzzo

2025-06-03

Fun facts

  • I’m Italian, but I don’t drink coffee or wine, and I like pizza with fries
  • I did my PhD on the Côte d’Azur, and I moved away because it was too sunny and I don’t like the sea

Fun facts

  • I’m mildly obsessed with matplotlib

Boost your productivity with skrub!

skrub simplifies many tedious data preparation operations

An example pipeline

  1. Gather some data
  2. Explore the data
  3. Pre-process the data
  4. Perform feature engineering
  5. Build a scikit-learn pipeline
  6. ???
  7. Profit?

Exploring the data

import pandas as pd
import matplotlib.pyplot as plt
import skrub
import skrub.datasets

dataset = skrub.datasets.fetch_employee_salaries()
employees, salaries = dataset.X, dataset.y

df = pd.DataFrame(employees)

# Plot the distribution of the numerical values using a histogram
fig, axs = plt.subplots(2,1, figsize=(10, 6))
ax1, ax2 = axs

ax1.hist(df['year_first_hired'], bins=30, edgecolor='black', alpha=0.7)
ax1.set_xlabel('Year first hired')
ax1.set_ylabel('Frequency')
ax1.grid(True, linestyle='--', alpha=0.5)

# Count the frequency of each category
category_counts = df['department'].value_counts()

# Create a bar plot
category_counts.plot(kind='bar', edgecolor='black', ax=ax2)

# Add labels and title
ax2.set_xlabel('Department')
ax2.set_ylabel('Frequency')
ax2.grid(True, linestyle='--', axis='y', alpha=0.5)  # Add grid lines for y-axis

fig.suptitle("Distribution of values")

# Show the plot
plt.show()

Exploring the data

Exploring the data with skrub

from skrub import TableReport
TableReport(employees)

Preview

Main features:

  • Obtain high-level statistics about the data
  • Explore the distribution of values and find outliers
  • Discover highly correlated columns
  • Export and share the report as an HTML file (sketch below)
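
For example, a minimal sketch of exporting the report, assuming the employees table loaded earlier:

from skrub import TableReport

report = TableReport(employees)
report.write_html("employees_report.html")  # standalone file you can share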

Data cleaning with Pandas

import pandas as pd
import numpy as np

data = {
    'A': [1, 1, 1],  # Single unique value
    'B': [2, 3, 2],  # Multiple unique values
    'C': ['x', 'x', 'x'],  # Single unique value
    'D': [4, 5, 6],  # Multiple unique values
    'E': [np.nan, np.nan, np.nan],  # All missing values 
    'F': ['', '', ''],  # All empty strings
    'Date': ['01/01/2023', '02/01/2023', '03/01/2023'],
}
df = pd.DataFrame(data)
display(df)
   A  B  C  D    E  F        Date
0  1  2  x  4  NaN     01/01/2023
1  1  3  x  5  NaN     02/01/2023
2  1  2  x  6  NaN     03/01/2023

Data cleaning with Pandas

# Parse the datetime strings with a specific format
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

# Drop columns with at most one unique non-missing value
df_cleaned = df.loc[:, df.nunique(dropna=True) > 1]

# Function to drop columns with only missing values or empty strings
def drop_empty_columns(df):
    # Drop columns with only missing values
    df_cleaned = df.dropna(axis=1, how='all')
    # Drop columns with only empty strings
    empty_string_cols = df_cleaned.columns[df_cleaned.eq('').all()]
    df_cleaned = df_cleaned.drop(columns=empty_string_cols)
    return df_cleaned

# Apply the function to the DataFrame
df_cleaned = drop_empty_columns(df_cleaned)
display(df_cleaned)
   B  D       Date
0  2  4 2023-01-01
1  3  5 2023-01-02
2  2  6 2023-01-03

Lightweight data cleaning: Cleaner

from skrub import Cleaner
cleaner = Cleaner(drop_if_constant=True, datetime_format='%d/%m/%Y')
df_cleaned = cleaner.fit_transform(df)
display(df_cleaned)
   B  D       Date
0  2  4 2023-01-01
1  3  5 2023-01-02
2  2  6 2023-01-03

Encoding datetime features with Pandas

import pandas as pd
import numpy as np
data = {
    'date': ['2023-01-01 12:34:56', '2023-02-15 08:45:23', '2023-03-20 18:12:45'],
    'value': [10, 20, 30]
}
df = pd.DataFrame(data)
df_expanded = df.copy()
datetime_column = "date"
df_expanded[datetime_column] = pd.to_datetime(df_expanded[datetime_column], errors='coerce')

df_expanded['year'] = df_expanded[datetime_column].dt.year
df_expanded['month'] = df_expanded[datetime_column].dt.month
df_expanded['day'] = df_expanded[datetime_column].dt.day
df_expanded['hour'] = df_expanded[datetime_column].dt.hour
df_expanded['minute'] = df_expanded[datetime_column].dt.minute
df_expanded['second'] = df_expanded[datetime_column].dt.second

Encoding datetime features with Pandas

df_expanded['hour_sin'] = np.sin(2 * np.pi * df_expanded['hour'] / 24)
df_expanded['hour_cos'] = np.cos(2 * np.pi * df_expanded['hour'] / 24)

df_expanded['month_sin'] = np.sin(2 * np.pi * df_expanded['month'] / 12)
df_expanded['month_cos'] = np.cos(2 * np.pi * df_expanded['month'] / 12)

print("Original DataFrame:")
print(df)
print("\nDataFrame with expanded datetime columns:")
print(df_expanded)
Original DataFrame:
                  date  value
0  2023-01-01 12:34:56     10
1  2023-02-15 08:45:23     20
2  2023-03-20 18:12:45     30

DataFrame with expanded datetime columns:
                 date  value  year  month  day  hour  minute  second  \
0 2023-01-01 12:34:56     10  2023      1    1    12      34      56   
1 2023-02-15 08:45:23     20  2023      2   15     8      45      23   
2 2023-03-20 18:12:45     30  2023      3   20    18      12      45   

       hour_sin      hour_cos  month_sin     month_cos  
0  1.224647e-16 -1.000000e+00   0.500000  8.660254e-01  
1  8.660254e-01 -5.000000e-01   0.866025  5.000000e-01  
2 -1.000000e+00 -1.836970e-16   1.000000  6.123234e-17  

Encoding datetime features with skrub.DatetimeEncoder

from skrub import DatetimeEncoder, ToDatetime

de = DatetimeEncoder(periodic_encoding="circular")
X_date = ToDatetime().fit_transform(df["date"])
X_enc = de.fit_transform(X_date)
print(X_enc)
   date_year  date_total_seconds  date_month_circular_0  \
0     2023.0        1.672577e+09               0.500000   
1     2023.0        1.676451e+09               0.866025   
2     2023.0        1.679336e+09               1.000000   

   date_month_circular_1  date_day_circular_0  date_day_circular_1  \
0           8.660254e-01         2.079117e-01             0.978148   
1           5.000000e-01         1.224647e-16            -1.000000   
2           6.123234e-17        -8.660254e-01            -0.500000   

   date_hour_circular_0  date_hour_circular_1  
0          1.224647e-16         -1.000000e+00  
1          8.660254e-01         -5.000000e-01  
2         -1.000000e+00         -1.836970e-16  


Encoding all the features: TableVectorizer
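
A minimal sketch, assuming the employees table loaded earlier and TableVectorizer’s default parameters:

from skrub import TableVectorizer

tv = TableVectorizer()
# Numeric columns pass through, datetimes are expanded, low-cardinality
# strings are one-hot encoded, high-cardinality strings get a dedicated encoder
employees_encoded = tv.fit_transform(employees)
print(employees_encoded.shape)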

Build a predictive pipeline

from sklearn.linear_model import Ridge
model = Ridge()

Build a predictive pipeline

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

model = make_pipeline(StandardScaler(), SimpleImputer(), Ridge())

Build a predictive pipeline

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer

categorical_columns = selector(dtype_include=object)(employees)
numerical_columns = selector(dtype_exclude=object)(employees)

ct = make_column_transformer(
      (StandardScaler(),
       numerical_columns),
      (OneHotEncoder(handle_unknown="ignore"),
       categorical_columns))

model = make_pipeline(ct, SimpleImputer(), Ridge())

Build a predictive pipeline with tabular_learner

import skrub
from sklearn.linear_model import Ridge
model = skrub.tabular_learner(Ridge())
model
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(datetime=DatetimeEncoder(periodic_encoding='spline'))),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler()), ('ridge', Ridge())])

We now have a pipeline!

  1. Gather some data
    • skrub.datasets, or user data
  2. Explore the data
    • skrub.TableReport
  3. Pre-process the data
    • skrub.TableVectorizer, Cleaner, DatetimeEncoder
  4. Perform feature engineering
    • skrub.TableVectorizer, TextEncoder, StringEncoder (sketch after this list)
  5. Build a scikit-learn pipeline
    • tabular_learner, sklearn.pipeline.make_pipeline
  6. ???
  7. Profit 📈
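
As a taste of step 4, a sketch of StringEncoder on a single text column; the column name and n_components value are illustrative choices, not from the original slides:

from skrub import StringEncoder

# Embed a high-cardinality string column into 10 numeric features
encoder = StringEncoder(n_components=10)
title_features = encoder.fit_transform(employees["employee_position_title"])
print(title_features.shape)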

What if we had a better pipeline?

skrub expressions

When a normal pipe is not enough…

Expressions come to the rescue 🚒:

  • Keep track of train, validation and test splits to avoid data leakage
  • Simplify hyperparameter tuning and reporting
  • Handle complex pipelines that involve multiple tables and custom transformations
  • Persist all objects for reproducibility

Starting with expressions

data = skrub.datasets.fetch_credit_fraud()

X = skrub.X(data.baskets[["ID"]]) # mark as "X"
y = skrub.y(data.baskets["fraud_flag"]) # mark as "y"
products = skrub.var("products", data.products) # add a new variable

  • X, y, products represent inputs to the pipeline
  • skrub keeps track of splits
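
From these inputs we can build the predictor pred used on the next slide. A hedged sketch loosely following skrub’s credit-fraud example; the basket_ID/ID column names and HistGradientBoostingClassifier are assumptions, not part of the original slides:

from sklearn.ensemble import HistGradientBoostingClassifier
from skrub import selectors as s

# Vectorize product rows while leaving the basket_ID join key untouched
vectorized = products.skb.apply(skrub.TableVectorizer(), cols=s.all() - "basket_ID")
# Aggregate product features per basket, then join them onto the baskets
aggregated = vectorized.groupby("basket_ID").agg("mean").reset_index()
features = X.merge(aggregated, left_on="ID", right_on="basket_ID")
features = features.drop(columns=["ID", "basket_ID"])
# Apply the final estimator; skrub tracks train/test splits through the graph
pred = features.skb.apply(HistGradientBoostingClassifier(), y=y)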

Build and inspect complex pipelines

pred.skb.full_report()


report

Hyperparameter tuning in scikit-learn

from scipy.stats import loguniform, randint
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([("dim_reduction", PCA()), ("regressor", Ridge())])
grid = [
    {
        "dim_reduction": [PCA()],
        "dim_reduction__n_components": [10, 20, 30],
        "regressor": [Ridge()],
        "regressor__alpha": loguniform(0.1, 10.0),
    },
    {
        "dim_reduction": [SelectKBest()],
        "dim_reduction__k": [10, 20, 30],
        "regressor": [Ridge()],
        "regressor__alpha": loguniform(0.1, 10.0),
    },
    {
        "dim_reduction": [PCA()],
        "dim_reduction__n_components": [10, 20, 30],
        "regressor": [RandomForestClassifier()],
        "regressor__n_estimators": randint(20, 200),
    },
    {
        "dim_reduction": [SelectKBest()],
        "dim_reduction__k": [10, 20, 30],
        "regressor": [RandomForestClassifier()],
        "regressor__n_estimators": randint(20, 200),
    },
]
model = RandomizedSearchCV(pipe, grid)

Hyperparameter tuning with skrub expressions

dim_reduction = X.skb.apply(
    skrub.choose_from(
        {
            "PCA": PCA(n_components=skrub.choose_int(10, 30)),
            "SelectKBest": SelectKBest(k=skrub.choose_int(10, 30))
        }, name="dim_reduction"
    )
)
regressor = dim_reduction.skb.apply(
    skrub.choose_from(
        {
            "Ridge": Ridge(alpha=skrub.choose_float(0.1, 10.0, log=True)),
            "RandomForest": RandomForestClassifier(
                n_estimators=skrub.choose_int(20, 200, log=True)
            )
        }, name="regressor"
    )
)
regressor.skb.get_randomized_search(scoring="roc_auc", fitted=True)

Observe the impact of the hyperparameters

search = pred.skb.get_randomized_search(scoring="roc_auc", fitted=True)

search.plot_parallel_coord()

tl;dw

skrub provides

  • interactive data exploration
  • automated pre-processing of pandas and polars dataframes
  • powerful feature engineering
  • soon™️, complex pipelines, hyperparameter tuning, (almost) no leakage

That’s it!

Getting involved