Skrub

Less wrangling, more machine learning

Riccardo Cappuzzo, Jérôme Dockès, Vincent Maladiere

Inria, P16, Probabl.

Plan for the presentation

  • Introducing skrub
    • Example use case
    • Detailed explanation of the features
    • Getting involved
  • Development roadmap
    • Recipe
    • Timeseries
    • Discover
    • Database interfaces

Before we start

  • If you have questions, raise your hand and someone will come to you
  • Write your suggestions on Slido
  • Open issues on Github (we can help with that)
  • All material is available at https://bit.ly/skrub-workshop

In the beginning…

Skrub stems from the development of dirty_cat, a package that provided support for handling dirty columns and performing fuzzy joins across tables.

It has since evolved into a package that provides:

  • interactive data exploration
  • automated pre-processing of pandas and polars dataframes
  • powerful feature engineering

Skrub’s vision

The goal of skrub is to facilitate building and deploying machine-learning models on pandas and polars dataframes (later, SQL databases…)

Skrub is high-level, with a philosophy and an API matching those of scikit-learn. It strives to bridge the worlds of databases and machine learning, supporting imperfect assembly and representation of the data when it is noisy.

Let’s set the stage

An example use case

  1. Gather some data
    • Employee salaries, census, customer churn…
  2. Explore the data
    • Null values, dtypes, correlated features…
  3. Pre-process the data
  4. Build a scikit-learn estimator
  5. ???
  6. Profit 📈

Exploring the data

import skrub
import pandas as pd
from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()
employees, salaries = dataset.X, dataset.y

employees.describe(include="all")
        gender  department       department_name                division  assignment_category  employee_position_title  date_first_hired  year_first_hired
count     9211        9228                  9228                    9228                 9228                     9228              9228       9228.000000
unique       2          37                    37                     694                    2                      443              2264               NaN
top          M         POL  Department of Police  School Health Services     Fulltime-Regular             Bus Operator        12/12/2016               NaN
freq      5481        1844                  1844                     300                 8394                      638                87               NaN
mean       NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       2003.597529
std        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN          9.327078
min        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       1965.000000
25%        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       1998.000000
50%        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       2005.000000
75%        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       2012.000000
max        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       2016.000000

Exploring the data… interactively!

from skrub import TableReport
TableReport(employees)

Preview

Main features:

  • Obtain high-level statistics about the data (number of uniques, missing values…)
  • Explore the distribution of values and find outliers
  • Discover highly correlated columns
  • Export and share the report as an html file
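
A minimal sketch of exporting the report, assuming the write_html method available in recent skrub releases:

from skrub import TableReport

# Build the interactive report and save it as a standalone, shareable HTML file.
report = TableReport(employees)
report.write_html("employee_salaries_report.html")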

Build a predictive pipeline

from sklearn.linear_model import Ridge
model = Ridge()

Build a predictive pipeline

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(StandardScaler(), Ridge())

Build a predictive pipeline

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

model = make_pipeline(SimpleImputer(), StandardScaler(), Ridge())

Build a predictive pipeline

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer

categorical_columns = selector(dtype_include=object)(employees)
numerical_columns = selector(dtype_exclude=object)(employees)

ct = make_column_transformer(
      (StandardScaler(),
       numerical_columns),
      (OneHotEncoder(handle_unknown="ignore"),
       categorical_columns))

model = make_pipeline(ct, SimpleImputer(), Ridge())

Build a predictive pipeline

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['year_first_hired']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['gender', 'department',
                                                   'department_name',
                                                   'division',
                                                   'assignment_category',
                                                   'employee_position_title',
                                                   'date_first_hired'])])),
                ('simpleimputer', SimpleImputer()), ('ridge', Ridge())])

Enter: tabular_learner

import skrub
from sklearn.linear_model import Ridge
tl = skrub.tabular_learner(Ridge())

Enter: tabular_learner

Pipeline(steps=[('tablevectorizer', TableVectorizer()),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler()), ('ridge', Ridge())])

A robust baseline: tabular_learner

Given a scikit-learn estimator, tabular_learner builds a pipeline that:

  • extracts numerical features with TableVectorizer
  • imputes missing values with SimpleImputer (optional)
  • scales the data with StandardScaler (optional)

You can also pass a string instead of an estimator, e.g. tabular_learner("regressor"):

Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(high_cardinality=MinHashEncoder(),
                                 low_cardinality=ToCategorical())),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor())])
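
As a usage sketch (not from the slides), the string variant can be cross-validated directly on the employee salaries data loaded earlier:

import skrub
from sklearn.model_selection import cross_val_score

# Evaluate the default "regressor" baseline on the employees / salaries data
# from the earlier slides (scores are R² by default for regressors).
baseline = skrub.tabular_learner("regressor")
scores = cross_val_score(baseline, employees, salaries)
print(scores.mean())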

Skrub’s main features

Unmasking the tabular_learner

Under the hood: TableVectorizer

  1. Pre-process the data
  2. Convert complex data types (datetimes, text) into numerical features

Under the hood: TableVectorizer

Pre-process the data

  • Ensure consistent column names
  • Detect missing values such as “N/A”
  • Drop empty columns
  • Check and convert dtypes to np.float32
  • Parse dates, ensuring consistent dtype and timezone
  • Identify which categorical features are low- and high-cardinality

Under the hood: TableVectorizer

Convert complex data types (datetimes, text) into numerical features

  • Encode dates with DatetimeEncoder
  • Encode low-cardinality features (<= 30 categories) with OneHotEncoder
  • Encode high-cardinality features (> 30 categories) with:
    • GapEncoder: relatively slow, easily interpretable, good-quality embeddings. Gamma-Poisson factorization of n-gram counts.
    • MinHashEncoder: very fast, lower-quality embeddings. Hashes n-grams.
    • TextEncoder: very slow, relies on pre-trained language models; the best option for free text and when context matters.
    • StringEncoder: the best trade-off between compute cost and embedding quality. Tf-idf followed by SVD.

High-cardinality encoders are robust in the presence of typos and dirty data.
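
The encoder used for high-cardinality columns can be swapped on the TableVectorizer; a minimal sketch, reusing the employees table from earlier and assuming a recent skrub release that ships StringEncoder:

import skrub

# Use the tf-idf + SVD StringEncoder for high-cardinality string columns;
# low-cardinality columns keep the default encoding.
vectorizer = skrub.TableVectorizer(high_cardinality=skrub.StringEncoder())
features = vectorizer.fit_transform(employees)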

Under the hood: TableVectorizer

vectorizer = skrub.TableVectorizer()
transformed = vectorizer.fit_transform(employees)
from pprint import pprint

pprint(vectorizer.column_to_kind_)
{'assignment_category': 'low_cardinality',
 'date_first_hired': 'datetime',
 'department': 'low_cardinality',
 'department_name': 'low_cardinality',
 'division': 'high_cardinality',
 'employee_position_title': 'high_cardinality',
 'gender': 'low_cardinality',
 'year_first_hired': 'numeric'}
pprint(vectorizer.all_processing_steps_["date_first_hired"])
[CleanNullStrings(),
 DropIfTooManyNulls(),
 ToDatetime(),
 DatetimeEncoder(),
 {'date_first_hired_day': ToFloat32(), 'date_first_hired_month': ToFloat32(), ...}]

Encoding datetime features

import pandas as pd

data = pd.read_csv(
    "https://raw.githubusercontent.com/skrub-data/datasets/master"
    "/data/bike-sharing-dataset.csv"
)
# Extract our input data (X) and the target column (y)
y = data["cnt"]
X = data[["date", "holiday", "temp", "hum", "windspeed", "weathersit"]]

X
date holiday temp hum windspeed weathersit
0 2011-01-01 00:00:00 0 0.24 0.81 0.0000 1
1 2011-01-01 01:00:00 0 0.22 0.80 0.0000 1
2 2011-01-01 02:00:00 0 0.22 0.80 0.0000 1
3 2011-01-01 03:00:00 0 0.24 0.75 0.0000 1
4 2011-01-01 04:00:00 0 0.24 0.75 0.0000 1
... ... ... ... ... ... ...
17374 2012-12-31 19:00:00 0 0.26 0.60 0.1642 2
17375 2012-12-31 20:00:00 0 0.26 0.60 0.1642 2
17376 2012-12-31 21:00:00 0 0.26 0.60 0.1642 1
17377 2012-12-31 22:00:00 0 0.26 0.56 0.1343 1
17378 2012-12-31 23:00:00 0 0.26 0.65 0.1343 1

17379 rows × 6 columns

Encoding datetime features

from pprint import pprint
from skrub import TableVectorizer, DatetimeEncoder

table_vec_weekday = TableVectorizer(datetime=DatetimeEncoder(add_weekday=True)).fit(X)
pprint(table_vec_weekday.get_feature_names_out())
array(['date_year', 'date_month', 'date_day', 'date_hour',
       'date_total_seconds', 'date_weekday', 'holiday', 'temp', 'hum',
       'windspeed', 'weathersit'], dtype='<U18')

Encoding datetime features

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

pipeline_weekday = make_pipeline(table_vec_weekday, HistGradientBoostingRegressor())

cross_val_score(
    pipeline_weekday, X, y, scoring="neg_mean_squared_error",
    cv=TimeSeriesSplit(n_splits=5),
)
array([ -3694.45159469,  -3180.1148674 , -15183.44808403,  -4778.26021095,
        -5413.53488034])

Encoding datetime features

Timeseries support in skrub is still in its early stages! Please stay tuned for new developments.

Augmenting features through joining

AggJoiner: automatically aggregate values

import skrub
import pandas as pd

df1 = pd.DataFrame({
    "UID": [28, 32, 28], 
    "Basket ID": [1100, 1300, 1400]})
df2 = pd.DataFrame({
    "Basket ID": [1100, 1100, 1100, 1300, 1400], 
    "Product ID": ["A521", "B695", "F221", "W214", "B695",], 
    "Price": [25, 30, 10, 320, 30]})

joiner = skrub.AggJoiner(df2, operations="sum", key="Basket ID", cols="Price")
joiner.fit_transform(df1)
   UID  Basket ID  Price_sum
0   28       1100         65
1   32       1300        320
2   28       1400         30
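
Several aggregations can be requested at once; a hypothetical variant of the example above, assuming operations also accepts a list of operation names:

# Sum and mean of the basket prices in a single join.
joiner = skrub.AggJoiner(df2, operations=["sum", "mean"], key="Basket ID", cols="Price")
joiner.fit_transform(df1)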

InterpolationJoiner: infer missing values

import pandas as pd

from skrub.datasets import fetch_flight_delays

dataset = fetch_flight_delays()
weather = dataset.weather
weather = weather.sample(100_000, random_state=0, ignore_index=True)
stations = dataset.stations
weather = stations.merge(weather, on="ID")[
    ["LATITUDE", "LONGITUDE", "YEAR/MONTH/DAY", "TMAX", "PRCP", "SNOW"]
]
weather["YEAR/MONTH/DAY"] = pd.to_datetime(weather["YEAR/MONTH/DAY"])
weather.head()
   LATITUDE  LONGITUDE YEAR/MONTH/DAY   TMAX  PRCP  SNOW
0    25.333     55.517     2008-11-16  297.0   NaN   NaN
1    25.333     55.517     2008-04-12  333.0   NaN   NaN
2    25.255     55.364     2008-08-28  430.0   0.0   NaN
3    25.255     55.364     2008-02-17  264.0   0.0   NaN
4    25.255     55.364     2008-11-25  291.0   NaN   NaN

InterpolationJoiner

from skrub import InterpolationJoiner

# main_table / aux_table: two halves of the weather table (one possible split;
# the slides do not show how the data was divided).
n_main = weather.shape[0] // 2
main_table = weather.iloc[:n_main]
aux_table = weather.iloc[n_main:]

joiner = InterpolationJoiner(
    aux_table,
    key=["LATITUDE", "LONGITUDE", "YEAR/MONTH/DAY"],
    suffix="_predicted",
).fit(main_table)
join = joiner.transform(main_table)
join.head()
   LATITUDE  LONGITUDE YEAR/MONTH/DAY   TMAX  PRCP  SNOW  TMAX_predicted  PRCP_predicted  SNOW_predicted
0    25.333     55.517     2008-11-16  297.0   NaN   NaN      283.132081       17.235774        0.064046
1    25.333     55.517     2008-04-12  333.0   NaN   NaN      308.980154       18.474388        0.177245
2    25.255     55.364     2008-08-28  430.0   0.0   NaN      356.484409       61.132701        0.010053
3    25.255     55.364     2008-02-17  264.0   0.0   NaN      298.779061       42.264107        0.402316
4    25.255     55.364     2008-11-25  291.0   NaN   NaN      276.454237       12.241012       -0.007211

Augmenting features through joining

  • Joiner: performs fuzzy joining, matching columns that contain similar-looking values (see the sketch below).
  • AggJoiner: aggregates an auxiliary dataframe before joining it on a base dataframe, creating new features that aggregate (sum, mean, mode…) the values in its columns.
  • MultiAggJoiner: extends AggJoiner to a multi-table scenario.
  • InterpolationJoiner: performs an equi-join and estimates what missing rows would contain if they existed in the table.

All Joiner objects are scikit-learn estimators, so they can be used in a Pipeline.
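
As an illustration of the fuzzy join (toy data, not from the slides; parameter names follow the current Joiner API):

import pandas as pd
from skrub import Joiner

# Toy example: match country names despite the misspelling "Frence".
airports = pd.DataFrame({"airport_id": [1, 2], "country": ["Germany", "Frence"]})
capitals = pd.DataFrame(
    {"country_name": ["Germany", "France"], "capital": ["Berlin", "Paris"]}
)

joiner = Joiner(
    capitals,
    main_key="country",
    aux_key="country_name",
    suffix="_aux",
    max_dist=float("inf"),  # keep the closest match regardless of distance
)
augmented = joiner.fit_transform(airports)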

Additional goodies: deduplication

Deduplicate misspelled categories

from pprint import pprint
from skrub import deduplicate

# A small list of categories containing misspellings.
duplicated = ["ulack", "black", "black", "xudte", "white", "white"]
pprint(duplicated)
deduplicate_correspondence = deduplicate(duplicated)
pprint(deduplicate_correspondence.to_dict())
['ulack', 'black', 'black', 'xudte', 'white', 'white']
{'black': 'black', 'ulack': 'black', 'white': 'white', 'xudte': 'white'}
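
The returned mapping can then be applied to clean the original column, for example with pandas:

import pandas as pd

# Replace each misspelled category by its deduplicated counterpart.
clean = pd.Series(duplicated).map(deduplicate_correspondence)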

Doc

Additional goodies: Wikipedia embeddings as features

KEN embeddings capture relational information about all entities in Wikipedia.

from skrub.datasets import fetch_ken_embeddings
embedding_games = fetch_ken_embeddings(
    search_types="game",
    exclude="companies|developer",
    embedding_table_id="games",
)

Doc

Getting involved

Install skrub

Base installation:

# with pip
pip install skrub -U
# with conda
conda install -c conda-forge skrub

For deep learning features such as TextEncoder:

# with pip
pip install skrub[transformers] -U
# with conda ("[transformers]" extras are pip-only; install the extra
# dependencies, such as sentence-transformers, alongside skrub)
conda install -c conda-forge skrub sentence-transformers
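
Once the extra dependencies are installed, the language-model encoder plugs into the vectorizer like any other (a sketch; the underlying model is downloaded on first use):

import skrub

# Hypothetical usage: encode high-cardinality text columns with a pre-trained
# language model instead of the default encoder.
vectorizer = skrub.TableVectorizer(high_cardinality=skrub.TextEncoder())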

Join the community

Contribute to skrub

  • Open an issue on GitHub
  • Check out the documentation on how to contribute
  • Join the breakout session this afternoon

Q & A time!

Please play around with the TableReport examples on the website!

  • Do you write code directly? If so, in which language(s)?
  • Do you use specific platforms (e.g., Dataiku)?
  • Do you use mostly pandas, polars, or other libraries?
  • Where is your data stored?
  • What dev environment do you use? Notebooks, IDEs…
  • How do you organize your code? A complete package, a few scripts, …
  • Do you use circular or spline encoding for datetime features?

https://bit.ly/skrub-workshop