Probabl Sprint - July 2025

Machine Learning with Dataframes

Riccardo Cappuzzo

2025-07-09

Fun facts

  • I’m Italian, but I don’t drink coffee or wine, and I like pizza with fries
  • I did my PhD on the Côte d’Azur, and I moved away because it was too sunny and I don’t like the sea

Boost your productivity with skrub!

skrub simplifies many tedious data preparation operations

An example pipeline

  1. Gather some data
  2. Explore the data
  3. Pre-process the data
  4. Perform feature engineering
  5. Build a scikit-learn pipeline
  6. ???
  7. Profit?

Exploring the data

import pandas as pd
import matplotlib.pyplot as plt
import skrub

dataset = skrub.datasets.fetch_employee_salaries()
employees, salaries = dataset.X, dataset.y

df = pd.DataFrame(employees)

# Plot the distribution of the numerical values using a histogram
fig, axs = plt.subplots(2, 1, figsize=(10, 6))
ax1, ax2 = axs

ax1.hist(df['year_first_hired'], bins=30, edgecolor='black', alpha=0.7)
ax1.set_xlabel('Year first hired')
ax1.set_ylabel('Frequency')
ax1.grid(True, linestyle='--', alpha=0.5)

# Count the frequency of each category
category_counts = df['department'].value_counts()

# Create a bar plot
category_counts.plot(kind='bar', edgecolor='black', ax=ax2)

# Add labels and title
ax2.set_xlabel('Department')
ax2.set_ylabel('Frequency')
ax2.grid(True, linestyle='--', axis='y', alpha=0.5)  # Add grid lines for y-axis

fig.suptitle("Distribution of values")

# Show the plot
plt.show()

Exploring the data

Exploring the data with skrub

from skrub import TableReport
TableReport(employees)

Preview

Main features:

  • Obtain high-level statistics about the data
  • Explore the distribution of values and find outliers
  • Discover highly correlated columns
  • Export and share the report as an HTML file (see the sketch below)
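For instance, the report can be generated and exported from a script. A minimal sketch, assuming TableReport exposes an .html() method returning the page as a string (check the skrub documentation for the exact export API):

from skrub import TableReport

report = TableReport(employees)

# Write the interactive report to a standalone HTML file that can be shared.
# (.html() returning the full page as a string is an assumption here.)
with open("employees_report.html", "w", encoding="utf-8") as fp:
    fp.write(report.html())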

Data cleaning with Pandas

import pandas as pd
import numpy as np

data = {
    'Constant int': [1, 1, 1],  # Single unique value
    'B': [2, 3, 2],  # Multiple unique values
    'Constant str': ['x', 'x', 'x'],  # Single unique value
    'D': [4, 5, 6],  # Multiple unique values
    'All nan': [np.nan, np.nan, np.nan],  # All missing values 
    'All empty': ['', '', ''],  # All empty strings
    'Date': ['01/01/2023', '02/01/2023', '03/01/2023'],
}
df = pd.DataFrame(data)
display(df)
   Constant int  B Constant str  D  All nan All empty        Date
0             1  2            x  4      NaN            01/01/2023
1             1  3            x  5      NaN            02/01/2023
2             1  2            x  6      NaN            03/01/2023

Data cleaning with Pandas

# Parse the datetime strings with a specific format
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

# Drop columns with only a single unique value
df_cleaned = df.loc[:, df.nunique(dropna=True) > 1]

# Function to drop columns with only missing values or empty strings
def drop_empty_columns(df):
    # Drop columns with only missing values
    df_cleaned = df.dropna(axis=1, how='all')
    # Drop columns with only empty strings
    empty_string_cols = df_cleaned.columns[df_cleaned.eq('').all()]
    df_cleaned = df_cleaned.drop(columns=empty_string_cols)
    return df_cleaned

# Apply the function to the DataFrame
df_cleaned = drop_empty_columns(df_cleaned)
display(df_cleaned)
B D Date
0 2 4 2023-01-01
1 3 5 2023-01-02
2 2 6 2023-01-03

Lightweight data cleaning: Cleaner

from skrub import Cleaner
cleaner = Cleaner(drop_if_constant=True, datetime_format='%d/%m/%Y')
df_cleaned = cleaner.fit_transform(df)
display(df_cleaned)
B D Date
0 2 4 2023-01-01
1 3 5 2023-01-02
2 2 6 2023-01-03

Encoding datetime features with Pandas

import pandas as pd
data = {
    'date': ['2023-01-01 12:34:56', '2023-02-15 08:45:23', '2023-03-20 18:12:45'],
    'value': [10, 20, 30]
}
df = pd.DataFrame(data)
df_expanded = df.copy()
datetime_column = "date"
df_expanded[datetime_column] = pd.to_datetime(df_expanded[datetime_column], errors='coerce')

df_expanded['year'] = df_expanded[datetime_column].dt.year
df_expanded['month'] = df_expanded[datetime_column].dt.month
df_expanded['day'] = df_expanded[datetime_column].dt.day
df_expanded['hour'] = df_expanded[datetime_column].dt.hour
df_expanded['minute'] = df_expanded[datetime_column].dt.minute
df_expanded['second'] = df_expanded[datetime_column].dt.second

Encoding datetime features with Pandas

import numpy as np

df_expanded['hour_sin'] = np.sin(2 * np.pi * df_expanded['hour'] / 24)
df_expanded['hour_cos'] = np.cos(2 * np.pi * df_expanded['hour'] / 24)

df_expanded['month_sin'] = np.sin(2 * np.pi * df_expanded['month'] / 12)
df_expanded['month_cos'] = np.cos(2 * np.pi * df_expanded['month'] / 12)

print("Original DataFrame:")
print(df)
print("\nDataFrame with expanded datetime columns:")
print(df_expanded)
Original DataFrame:
                  date  value
0  2023-01-01 12:34:56     10
1  2023-02-15 08:45:23     20
2  2023-03-20 18:12:45     30

DataFrame with expanded datetime columns:
                 date  value  year  month  day  hour  minute  second  \
0 2023-01-01 12:34:56     10  2023      1    1    12      34      56   
1 2023-02-15 08:45:23     20  2023      2   15     8      45      23   
2 2023-03-20 18:12:45     30  2023      3   20    18      12      45   

       hour_sin      hour_cos  month_sin     month_cos  
0  1.224647e-16 -1.000000e+00   0.500000  8.660254e-01  
1  8.660254e-01 -5.000000e-01   0.866025  5.000000e-01  
2 -1.000000e+00 -1.836970e-16   1.000000  6.123234e-17  

Encoding datetime features with skrub.DatetimeEncoder

from skrub import DatetimeEncoder, ToDatetime

de = DatetimeEncoder(periodic_encoding="circular")
X_date = ToDatetime().fit_transform(df["date"])
X_enc = de.fit_transform(X_date)
print(X_enc)
   date_year  date_total_seconds  date_month_circular_0  \
0     2023.0        1.672577e+09               0.500000   
1     2023.0        1.676451e+09               0.866025   
2     2023.0        1.679336e+09               1.000000   

   date_month_circular_1  date_day_circular_0  date_day_circular_1  \
0           8.660254e-01         2.079117e-01             0.978148   
1           5.000000e-01         1.224647e-16            -1.000000   
2           6.123234e-17        -8.660254e-01            -0.500000   

   date_hour_circular_0  date_hour_circular_1  
0          1.224647e-16         -1.000000e+00  
1          8.660254e-01         -5.000000e-01  
2         -1.000000e+00         -1.836970e-16  

What periodic features look like

Encoding all the features: TableVectorizer
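The previous slides show how much per-column work manual encoding takes. A minimal sketch of letting the TableVectorizer handle all columns at once (reusing the employees table from the earlier slides):

from skrub import TableVectorizer

vectorizer = TableVectorizer()
X_encoded = vectorizer.fit_transform(employees)

# Numeric columns pass through, datetimes are expanded into features,
# and string columns are routed to low- or high-cardinality encoders.
print(X_encoded.shape)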

Build a predictive pipeline

from sklearn.linear_model import Ridge
model = Ridge()

Build a predictive pipeline

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

model = make_pipeline(StandardScaler(), SimpleImputer(), Ridge())

Build a predictive pipeline

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer

categorical_columns = selector(dtype_include=object)(employees)
numerical_columns = selector(dtype_exclude=object)(employees)

ct = make_column_transformer(
      (StandardScaler(),
       numerical_columns),
      (OneHotEncoder(handle_unknown="ignore"),
       categorical_columns))

model = make_pipeline(ct, SimpleImputer(), Ridge())

Build a predictive pipeline with tabular_learner

import skrub
from sklearn.linear_model import Ridge
model = skrub.tabular_learner(Ridge())
model
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(datetime=DatetimeEncoder(periodic_encoding='spline'))),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler()), ('ridge', Ridge())])

We now have a pipeline!

  1. Gather some data
    • skrub.datasets, or user data
  2. Explore the data
    • skrub.TableReport
  3. Pre-process the data
    • skrub.TableVectorizer, Cleaner, DatetimeEncoder
  4. Perform feature engineering
    • skrub.TableVectorizer, TextEncoder, StringEncoder
  5. Build a scikit-learn pipeline
    • tabular_learner, sklearn.pipeline.make_pipeline
  6. ???
  7. Profit 📈

What if we had a better pipeline?

A realistic scenario

A data scientist needs to train a ML model, but features are spread across multiple tables.

Warning

Many issues with this!

skrub DataOps

When a normal pipe is not enough…

… the skrub DataOps come to the rescue 🚒

DataOps…

DataOps, Data Plans, learners: oh my!

How do DataOps work, though?

DataOps wrap around user operations, where user operations are:

  • operations on the underlying dataframes (e.g., merge, groupby, column selection)
  • scikit-learn estimators and transformers, applied with .skb.apply()

Important

DataOps record user operations, so that they can later be replayed in the same order and with the same arguments on unseen data.
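A minimal sketch of the record-and-replay idea, using only pieces that appear later in this deck (skrub.var and make_learner); the toy dataframe and column names are illustrative:

import pandas as pd
import skrub

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Wrapping the dataframe in a skrub variable makes every operation on it a DataOp.
data = skrub.var("data", df)
totals = data.assign(total=data["a"] + data["b"])  # recorded, previewed on the example data

# The recorded steps can later be replayed on unseen data by exporting a learner
# and passing new variable bindings, e.g. {"data": new_df} (see the following slides).
learner = totals.skb.make_learner()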

Starting with the DataOps

data = skrub.datasets.fetch_credit_fraud()

baskets = skrub.var("baskets", data.baskets)
X = baskets[["ID"]].skb.mark_as_X()
y = baskets["fraud_flag"].skb.mark_as_y()

products = skrub.var("products", data.products) # add a new variable

Building a full data plan

from skrub import selectors as s
from sklearn.ensemble import ExtraTreesClassifier  

vectorizer = skrub.TableVectorizer(high_cardinality=skrub.StringEncoder(), n_jobs=-1)
vectorized_products = products.skb.apply(vectorizer, cols=s.all() - "basket_ID")
aggregated_products = vectorized_products.groupby("basket_ID").agg("mean").reset_index()
features = X.merge(aggregated_products, left_on="ID", right_on="basket_ID")
features = features.drop(columns=["ID", "basket_ID"])
predictions = features.skb.apply(ExtraTreesClassifier(n_jobs=-1), y=y)

Inspecting the data plan

predictions.skb.full_report()



Exporting the plan in a learner

The data plan can be exported as a learner:

# anywhere in the code
learner = predictions.skb.make_learner()
# `search` is an HPO object (see the tuning slides); keep the best learner it found
best_learner = search.best_learner_

Then, the learner can be pickled …

import pickle

with open("learner.bin", "wb") as fp:
    pickle.dump(learner, fp)

… and loaded

with open("learner.bin", "rb") as fp:
    learner = pickle.load(fp)

learner.predict({"baskets": new_baskets, "products": new_products})

Hyperparameter tuning in a Data Plan

skrub implements four choose_* functions, sketched below:

  • choose_from: select from a list or dict of options
  • choose_int: sample an integer from a range
  • choose_float: sample a float from a range (optionally on a log scale)
  • choose_bool: choose between True and False
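A small sketch of attaching each kind of choice directly to a hyperparameter before applying the estimator in a Data Plan (the estimator, ranges, and the name= keywords are illustrative; check the skrub docs for exact signatures):

import skrub
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(
    C=skrub.choose_float(0.01, 10.0, log=True, name="C"),
    solver=skrub.choose_from(["lbfgs", "liblinear"], name="solver"),
    max_iter=skrub.choose_int(100, 500, name="max_iter"),
    fit_intercept=skrub.choose_bool(name="fit_intercept"),
)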

Hyperparameter tuning in a Data Plan

It’s possible to nest these functions to create complex grids:

X.skb.apply(
    skrub.choose_from(
        {
            "PCA": PCA(n_components=skrub.choose_int(10, 30)),
            "SelectKBest": SelectKBest(k=skrub.choose_int(10, 30))
        }, name="dim_reduction"
    )
)

Tuning in scikit-learn can be complex

from scipy.stats import loguniform, randint
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([("dim_reduction", PCA()), ("regressor", Ridge())])
grid = [
    {
        "dim_reduction": [PCA()],
        "dim_reduction__n_components": [10, 20, 30],
        "regressor": [Ridge()],
        "regressor__alpha": loguniform(0.1, 10.0),
    },
    {
        "dim_reduction": [SelectKBest()],
        "dim_reduction__k": [10, 20, 30],
        "regressor": [Ridge()],
        "regressor__alpha": loguniform(0.1, 10.0),
    },
    {
        "dim_reduction": [PCA()],
        "dim_reduction__n_components": [10, 20, 30],
        "regressor": [RandomForestClassifier()],
        "regressor__n_estimators": loguniform(20, 200),
    },
    {
        "dim_reduction": [SelectKBest()],
        "dim_reduction__k": [10, 20, 30],
        "regressor": [RandomForestClassifier()],
        "regressor__n_estimators": loguniform(20, 200),
    },
]
model = RandomizedSearchCV(pipe, grid)

Tuning with DataOps is simple!

dim_reduction = X.skb.apply(
    skrub.choose_from(
        {
            "PCA": PCA(n_components=skrub.choose_int(10, 30)),
            "SelectKBest": SelectKBest(k=skrub.choose_int(10, 30))
        }, name="dim_reduction"
    )
)
regressor = dim_reduction.skb.apply(
    skrub.choose_from(
        {
            "Ridge": Ridge(alpha=skrub.choose_float(0.1, 10.0, log=True)),
            "RandomForest": RandomForestClassifier(
                n_estimators=skrub.choose_int(20, 200, log=True)
            )
        }, name="regressor"
    )
)
regressor.skb.get_randomized_search(scoring="roc_auc", fitted=True)

Observe the impact of the hyperparameters

search = predictions.skb.get_randomized_search(scoring="roc_auc", fitted=True)

search.plot_parallel_coord()

tl;dw

skrub provides:

  • interactive data exploration with TableReport
  • lightweight cleaning (Cleaner) and rich encoders (DatetimeEncoder, TableVectorizer, StringEncoder, TextEncoder)
  • strong baseline pipelines in one line with tabular_learner
  • flexible, tunable, multi-table Data Plans with DataOps, exportable as learners

That’s it!

Getting involved