Skrub

SODA Kickoff 2025 - Machine Learning with Dataframes

Riccardo Cappuzzo

2025-06-03

Fun facts

  • I’m Italian, but I don’t drink coffee or wine, and I like pizza with fries
  • I did my PhD on the Côte d’Azur, and I moved away because it was too sunny and I don’t like the sea

Fun facts

  • I’m mildly obsessed with matplotlib

Boost your productivity with skrub!

skrub simplifies many tedious data preparation operations

An example pipeline

  1. Gather some data
  2. Explore the data
  3. Pre-process the data
  4. Perform feature engineering
  5. Build a scikit-learn pipeline
  6. ???
  7. Profit?

Exploring the data

import pandas as pd
import matplotlib.pyplot as plt
import skrub
import skrub.datasets

dataset = skrub.datasets.fetch_employee_salaries()
employees, salaries = dataset.X, dataset.y

df = pd.DataFrame(employees)

# Plot the distribution of the numerical values using a histogram
fig, axs = plt.subplots(2,1, figsize=(10, 6))
ax1, ax2 = axs

ax1.hist(df['year_first_hired'], bins=30, edgecolor='black', alpha=0.7)
ax1.set_xlabel('Year first hired')
ax1.set_ylabel('Frequency')
ax1.grid(True, linestyle='--', alpha=0.5)

# Count the frequency of each category
category_counts = df['department'].value_counts()

# Create a bar plot
category_counts.plot(kind='bar', edgecolor='black', ax=ax2)

# Add labels and title
ax2.set_xlabel('Department')
ax2.set_ylabel('Frequency')
ax2.grid(True, linestyle='--', axis='y', alpha=0.5)  # Add grid lines for y-axis

fig.suptitle("Distribution of values")

# Show the plot
plt.show()

Exploring the data

Exploring the data with skrub

from skrub import TableReport
TableReport(employees)

Preview

Main features:

  • Obtain high-level statistics about the data
  • Explore the distribution of values and find outliers
  • Discover highly correlated columns
  • Export and share the report as an HTML file (sketch below)
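
For example, a minimal sketch of exporting the report, assuming the employees table loaded earlier:

from skrub import TableReport

report = TableReport(employees)
report.write_html("employees_report.html")  # standalone file you can share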

Data cleaning with Pandas

import pandas as pd
import numpy as np

data = {
    'A': [1, 1, 1],  # Single unique value
    'B': [2, 3, 2],  # Multiple unique values
    'C': ['x', 'x', 'x'],  # Single unique value
    'D': [4, 5, 6],  # Multiple unique values
    'E': [np.nan, np.nan, np.nan],  # All missing values 
    'F': ['', '', ''],  # All empty strings
    'Date': ['01/01/2023', '02/01/2023', '03/01/2023'],
}
df = pd.DataFrame(data)
display(df)
   A  B  C  D    E  F        Date
0  1  2  x  4  NaN     01/01/2023
1  1  3  x  5  NaN     02/01/2023
2  1  2  x  6  NaN     03/01/2023

Data cleaning with Pandas

# Parse the datetime strings with a specific format
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

# Drop columns with at most one unique non-missing value
df_cleaned = df.loc[:, df.nunique(dropna=True) > 1]

# Function to drop columns with only missing values or empty strings
def drop_empty_columns(df):
    # Drop columns with only missing values
    df_cleaned = df.dropna(axis=1, how='all')
    # Drop columns with only empty strings
    empty_string_cols = df_cleaned.columns[df_cleaned.eq('').all()]
    df_cleaned = df_cleaned.drop(columns=empty_string_cols)
    return df_cleaned

# Apply the function to the DataFrame
df_cleaned = drop_empty_columns(df_cleaned)
display(df_cleaned)
   B  D       Date
0  2  4 2023-01-01
1  3  5 2023-01-02
2  2  6 2023-01-03

Lightweight data cleaning: Cleaner

from skrub import Cleaner
cleaner = Cleaner(drop_if_constant=True, datetime_format='%d/%m/%Y')
df_cleaned = cleaner.fit_transform(df)
display(df_cleaned)
   B  D       Date
0  2  4 2023-01-01
1  3  5 2023-01-02
2  2  6 2023-01-03

Encoding datetime features with Pandas

import pandas as pd
import numpy as np
data = {
    'date': ['2023-01-01 12:34:56', '2023-02-15 08:45:23', '2023-03-20 18:12:45'],
    'value': [10, 20, 30]
}
df = pd.DataFrame(data)
df_expanded = df.copy()
datetime_column = "date"
df_expanded[datetime_column] = pd.to_datetime(df_expanded[datetime_column], errors='coerce')

df_expanded['year'] = df_expanded[datetime_column].dt.year
df_expanded['month'] = df_expanded[datetime_column].dt.month
df_expanded['day'] = df_expanded[datetime_column].dt.day
df_expanded['hour'] = df_expanded[datetime_column].dt.hour
df_expanded['minute'] = df_expanded[datetime_column].dt.minute
df_expanded['second'] = df_expanded[datetime_column].dt.second

Encoding datetime features with Pandas

df_expanded['hour_sin'] = np.sin(2 * np.pi * df_expanded['hour'] / 24)
df_expanded['hour_cos'] = np.cos(2 * np.pi * df_expanded['hour'] / 24)

df_expanded['month_sin'] = np.sin(2 * np.pi * df_expanded['month'] / 12)
df_expanded['month_cos'] = np.cos(2 * np.pi * df_expanded['month'] / 12)

print("Original DataFrame:")
print(df)
print("\nDataFrame with expanded datetime columns:")
print(df_expanded)
Original DataFrame:
                  date  value
0  2023-01-01 12:34:56     10
1  2023-02-15 08:45:23     20
2  2023-03-20 18:12:45     30

DataFrame with expanded datetime columns:
                 date  value  year  month  day  hour  minute  second  \
0 2023-01-01 12:34:56     10  2023      1    1    12      34      56   
1 2023-02-15 08:45:23     20  2023      2   15     8      45      23   
2 2023-03-20 18:12:45     30  2023      3   20    18      12      45   

       hour_sin      hour_cos  month_sin     month_cos  
0  1.224647e-16 -1.000000e+00   0.500000  8.660254e-01  
1  8.660254e-01 -5.000000e-01   0.866025  5.000000e-01  
2 -1.000000e+00 -1.836970e-16   1.000000  6.123234e-17  

Encoding datetime features with skrub.DatetimeEncoder

from skrub import DatetimeEncoder, ToDatetime

de = DatetimeEncoder(periodic_encoding="circular")
X_date = ToDatetime().fit_transform(df["date"])
X_enc = de.fit_transform(X_date)
print(X_enc)
   date_year  date_total_seconds  date_month_circular_0  \
0     2023.0        1.672577e+09               0.500000   
1     2023.0        1.676451e+09               0.866025   
2     2023.0        1.679336e+09               1.000000   

   date_month_circular_1  date_day_circular_0  date_day_circular_1  \
0           8.660254e-01         2.079117e-01             0.978148   
1           5.000000e-01         1.224647e-16            -1.000000   
2           6.123234e-17        -8.660254e-01            -0.500000   

   date_hour_circular_0  date_hour_circular_1  
0          1.224647e-16         -1.000000e+00  
1          8.660254e-01         -5.000000e-01  
2         -1.000000e+00         -1.836970e-16  


Encoding all the features: TableVectorizer
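
A minimal sketch, assuming the employees table loaded earlier and TableVectorizer’s default parameters:

from skrub import TableVectorizer

tv = TableVectorizer()
# Numeric columns pass through, datetimes are expanded, low-cardinality
# strings are one-hot encoded, high-cardinality strings get a dedicated encoder
employees_encoded = tv.fit_transform(employees)
print(employees_encoded.shape)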

Build a predictive pipeline

from sklearn.linear_model import Ridge
model = Ridge()

Build a predictive pipeline

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

model = make_pipeline(StandardScaler(), SimpleImputer(), Ridge())

Build a predictive pipeline

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer

categorical_columns = selector(dtype_include=object)(employees)
numerical_columns = selector(dtype_exclude=object)(employees)

ct = make_column_transformer(
      (StandardScaler(),
       numerical_columns),
      (OneHotEncoder(handle_unknown="ignore"),
       categorical_columns))

model = make_pipeline(ct, SimpleImputer(), Ridge())

Build a predictive pipeline with tabular_learner

import skrub
from sklearn.linear_model import Ridge
model = skrub.tabular_learner(Ridge())
model
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(datetime=DatetimeEncoder(periodic_encoding='spline'))),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler()), ('ridge', Ridge())])

We now have a pipeline!

  1. Gather some data
    • skrub.datasets, or user data
  2. Explore the data
    • skrub.TableReport
  3. Pre-process the data
    • skrub.TableVectorizer, Cleaner, DatetimeEncoder
  4. Perform feature engineering
    • skrub.TableVectorizer, TextEncoder, StringEncoder (sketch after this list)
  5. Build a scikit-learn pipeline
    • tabular_learner, sklearn.pipeline.make_pipeline
  6. ???
  7. Profit 📈
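
As a taste of step 4, a sketch of StringEncoder on a single text column; the column name and n_components value are illustrative choices, not from the original slides:

from skrub import StringEncoder

# Embed a high-cardinality string column into 10 numeric features
encoder = StringEncoder(n_components=10)
title_features = encoder.fit_transform(employees["employee_position_title"])
print(title_features.shape)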

What if we had a better pipeline?

skrub expressions

When a normal pipe is not enough…

Expressions come to the rescue 🚒:

  • Keep track of train, validation and test splits to avoid data leakage
  • Simplify hyperparameter tuning and reporting
  • Handle complex pipelines that involve multiple tables and custom transformations
  • Persist all objects for reproducibility

Starting with expressions

data = skrub.datasets.fetch_credit_fraud()

X = skrub.X(data.baskets[["ID"]]) # mark as "X"
y = skrub.y(data.baskets["fraud_flag"]) # mark as "y"
products = skrub.var("products", data.products) # add a new variable

  • X, y, products represent inputs to the pipeline
  • skrub keeps track of splits
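
From these inputs we can build the predictor pred used on the next slide. A hedged sketch loosely following skrub’s credit-fraud example; the basket_ID/ID column names and HistGradientBoostingClassifier are assumptions, not part of the original slides:

from sklearn.ensemble import HistGradientBoostingClassifier
from skrub import selectors as s

# Vectorize product rows while leaving the basket_ID join key untouched
vectorized = products.skb.apply(skrub.TableVectorizer(), cols=s.all() - "basket_ID")
# Aggregate product features per basket, then join them onto the baskets
aggregated = vectorized.groupby("basket_ID").agg("mean").reset_index()
features = X.merge(aggregated, left_on="ID", right_on="basket_ID")
features = features.drop(columns=["ID", "basket_ID"])
# Apply the final estimator; skrub tracks train/test splits through the graph
pred = features.skb.apply(HistGradientBoostingClassifier(), y=y)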

Build and inspect complex pipelines

pred.skb.full_report()


report

Hyperparameter tuning in scikit-learn

from scipy.stats import loguniform, randint
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([("dim_reduction", PCA()), ("regressor", Ridge())])
grid = [
    {
        "dim_reduction": [PCA()],
        "dim_reduction__n_components": [10, 20, 30],
        "regressor": [Ridge()],
        "regressor__alpha": loguniform(0.1, 10.0),
    },
    {
        "dim_reduction": [SelectKBest()],
        "dim_reduction__k": [10, 20, 30],
        "regressor": [Ridge()],
        "regressor__alpha": loguniform(0.1, 10.0),
    },
    {
        "dim_reduction": [PCA()],
        "dim_reduction__n_components": [10, 20, 30],
        "regressor": [RandomForestClassifier()],
        "regressor__n_estimators": randint(20, 200),
    },
    {
        "dim_reduction": [SelectKBest()],
        "dim_reduction__k": [10, 20, 30],
        "regressor": [RandomForestClassifier()],
        "regressor__n_estimators": randint(20, 200),
    },
]
model = RandomizedSearchCV(pipe, grid)

Hyperparameter tuning with skrub expressions

dim_reduction = X.skb.apply(
    skrub.choose_from(
        {
            "PCA": PCA(n_components=skrub.choose_int(10, 30)),
            "SelectKBest": SelectKBest(k=skrub.choose_int(10, 30))
        }, name="dim_reduction"
    )
)
regressor = dim_reduction.skb.apply(
    skrub.choose_from(
        {
            "Ridge": Ridge(alpha=skrub.choose_float(0.1, 10.0, log=True)),
            "RandomForest": RandomForestClassifier(
                n_estimators=skrub.choose_int(20, 200, log=True)
            )
        }, name="regressor"
    )
)
regressor.skb.get_randomized_search(scoring="roc_auc", fitted=True)

Observe the impact of the hyperparameters

search = pred.skb.get_randomized_search(scoring="roc_auc", fitted=True)

search.plot_parallel_coord()

tl;dw

skrub provides

  • interactive data exploration
  • automated pre-processing of pandas and polars dataframes
  • powerful feature engineering
  • soon™️, complex pipelines, hyperparameter tuning, (almost) no leakage

That’s it!

Getting involved