Skrub

Less wrangling, more machine learning

Riccardo Cappuzzo, Jérôme Dockès, Vincent Maladiere

Inria, P16, Probabl.

Plan for the presentation

  • Introducing skrub
    • Example use case
    • Detailed explanation of the features
    • Getting involved
  • Development roadmap
    • Recipe
    • Timeseries
    • Discover
    • Database interfaces

Before we start

  • If you have questions, raise your hand and someone will come to you
  • Write your suggestions on Slido
  • Open issues on Github (we can help with that)
  • All material is available at https://bit.ly/skrub-workshop

In the beginning…

Skrub stems from the development of dirty_cat, a package that provided support for handling dirty columns and performing fuzzy joins across tables.

It has since evolved into a package that provides:

  • interactive data exploration
  • automated pre-processing of pandas and polars dataframes
  • powerful feature engineering

Skrub’s vision

The goal of skrub is to facilitate building and deploying machine-learning models on pandas and polars dataframes (later, SQL databases…)

Skrub is high-level, with a philosophy and an API matching those of scikit-learn. It strives to bridge the worlds of databases and machine learning, supporting imperfect assembly and representation of the data when it is noisy.

Let’s set the stage

An example use case

  1. Gather some data
    • Employee salaries, census, customer churn…
  2. Explore the data
    • Null values, dtypes, correlated features…
  3. Pre-process the data
  4. Build a scikit-learn estimator
  5. ???
  6. Profit 📈

Exploring the data

import skrub
import pandas as pd
from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()
employees, salaries = dataset.X, dataset.y

employees.describe(include="all")
        gender  department       department_name                division  assignment_category  employee_position_title  date_first_hired  year_first_hired
count     9211        9228                  9228                    9228                 9228                     9228              9228       9228.000000
unique       2          37                    37                     694                    2                      443              2264               NaN
top          M         POL  Department of Police  School Health Services     Fulltime-Regular             Bus Operator        12/12/2016               NaN
freq      5481        1844                  1844                     300                 8394                      638                87               NaN
mean       NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       2003.597529
std        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN          9.327078
min        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       1965.000000
25%        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       1998.000000
50%        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       2005.000000
75%        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       2012.000000
max        NaN         NaN                   NaN                     NaN                  NaN                      NaN               NaN       2016.000000

Exploring the data… interactively!

from skrub import TableReport
TableReport(employees)

Preview

Main features:

  • Obtain high-level statistics about the data (number of uniques, missing values…)
  • Explore the distribution of values and find outliers
  • Discover highly correlated columns
  • Export and share the report as an html file
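
A minimal sketch of exporting the report, assuming the write_html method available in recent skrub releases:

from skrub import TableReport

# Build the interactive report and save it as a standalone, shareable HTML file.
report = TableReport(employees)
report.write_html("employee_salaries_report.html")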

Build a predictive pipeline

from sklearn.linear_model import Ridge
model = Ridge()

Build a predictive pipeline

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(StandardScaler(), Ridge())

Build a predictive pipeline

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

model = make_pipeline(SimpleImputer(), StandardScaler(), Ridge())

Build a predictive pipeline

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer

categorical_columns = selector(dtype_include=object)(employees)
numerical_columns = selector(dtype_exclude=object)(employees)

ct = make_column_transformer(
      (StandardScaler(),
       numerical_columns),
      (OneHotEncoder(handle_unknown="ignore"),
       categorical_columns))

model = make_pipeline(ct, SimpleImputer(), Ridge())

Build a predictive pipeline

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['year_first_hired']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['gender', 'department',
                                                   'department_name',
                                                   'division',
                                                   'assignment_category',
                                                   'employee_position_title',
                                                   'date_first_hired'])])),
                ('simpleimputer', SimpleImputer()), ('ridge', Ridge())])

Enter: tabular_learner

import skrub
from sklearn.linear_model import Ridge
tl = skrub.tabular_learner(Ridge())

Enter: tabular_learner

Pipeline(steps=[('tablevectorizer', TableVectorizer()),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler()), ('ridge', Ridge())])

A robust baseline: tabular_learner

Given a scikit-learn estimator, tabular_learner builds a pipeline that:

  • extracts numerical features with TableVectorizer
  • imputes missing values with SimpleImputer (optional)
  • scales the data with StandardScaler (optional)

You can also pass a string instead of an estimator, e.g. tabular_learner("regressor"):

Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(high_cardinality=MinHashEncoder(),
                                 low_cardinality=ToCategorical())),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor())])
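
As a usage sketch (not from the slides), the string variant can be cross-validated directly on the employee salaries data loaded earlier:

import skrub
from sklearn.model_selection import cross_val_score

# Evaluate the default "regressor" baseline on the employees / salaries data
# from the earlier slides (scores are R² by default for regressors).
baseline = skrub.tabular_learner("regressor")
scores = cross_val_score(baseline, employees, salaries)
print(scores.mean())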

Skrub’s main features

Unmasking the tabular_learner

Under the hood: TableVectorizer

  1. Pre-process the data
  2. Convert complex data types (datetimes, text) into numerical features

Under the hood: TableVectorizer

Pre-process the data

  • Ensure consistent column names
  • Detect missing values such as “N/A”
  • Drop empty columns
  • Check and convert dtypes to np.float32
  • Parse dates, ensuring consistent dtype and timezone
  • Identify which categorical features are low- and high-cardinality

Under the hood: TableVectorizer

Convert complex data types (datetimes, text) into numerical features

  • Encode dates with DatetimeEncoder
  • Encode low-cardinality features (<= 30 categories) with OneHotEncoder
  • Encode high-cardinality features (> 30 categories) with:
    • GapEncoder: relatively slow, easily interpretable, good-quality embeddings. Gamma-Poisson factorization of n-gram counts.
    • MinHashEncoder: very fast, lower-quality embeddings. Hashes n-grams.
    • TextEncoder: very slow, relies on pre-trained language models; the best option for free text and when context matters.
    • StringEncoder: the best trade-off between compute cost and embedding quality. Tf-idf followed by SVD.

High-cardinality encoders are robust in the presence of typos and dirty data.
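
The encoder used for high-cardinality columns can be swapped on the TableVectorizer; a minimal sketch, reusing the employees table from earlier and assuming a recent skrub release that ships StringEncoder:

import skrub

# Use the tf-idf + SVD StringEncoder for high-cardinality string columns;
# low-cardinality columns keep the default encoding.
vectorizer = skrub.TableVectorizer(high_cardinality=skrub.StringEncoder())
features = vectorizer.fit_transform(employees)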

Under the hood: TableVectorizer

vectorizer = skrub.TableVectorizer()
transformed = vectorizer.fit_transform(employees)
from pprint import pprint

pprint(vectorizer.column_to_kind_)
{'assignment_category': 'low_cardinality',
 'date_first_hired': 'datetime',
 'department': 'low_cardinality',
 'department_name': 'low_cardinality',
 'division': 'high_cardinality',
 'employee_position_title': 'high_cardinality',
 'gender': 'low_cardinality',
 'year_first_hired': 'numeric'}
pprint(vectorizer.all_processing_steps_["date_first_hired"])
[CleanNullStrings(),
 DropIfTooManyNulls(),
 ToDatetime(),
 DatetimeEncoder(),
 {'date_first_hired_day': ToFloat32(), 'date_first_hired_month': ToFloat32(), ...}]

Encoding datetime features

import pandas as pd

data = pd.read_csv(
    "https://raw.githubusercontent.com/skrub-data/datasets/master"
    "/data/bike-sharing-dataset.csv"
)
# Extract our input data (X) and the target column (y)
y = data["cnt"]
X = data[["date", "holiday", "temp", "hum", "windspeed", "weathersit"]]

X
date holiday temp hum windspeed weathersit
0 2011-01-01 00:00:00 0 0.24 0.81 0.0000 1
1 2011-01-01 01:00:00 0 0.22 0.80 0.0000 1
2 2011-01-01 02:00:00 0 0.22 0.80 0.0000 1
3 2011-01-01 03:00:00 0 0.24 0.75 0.0000 1
4 2011-01-01 04:00:00 0 0.24 0.75 0.0000 1
... ... ... ... ... ... ...
17374 2012-12-31 19:00:00 0 0.26 0.60 0.1642 2
17375 2012-12-31 20:00:00 0 0.26 0.60 0.1642 2
17376 2012-12-31 21:00:00 0 0.26 0.60 0.1642 1
17377 2012-12-31 22:00:00 0 0.26 0.56 0.1343 1
17378 2012-12-31 23:00:00 0 0.26 0.65 0.1343 1

17379 rows × 6 columns

Encoding datetime features

from pprint import pprint
from skrub import TableVectorizer, DatetimeEncoder

table_vec_weekday = TableVectorizer(datetime=DatetimeEncoder(add_weekday=True)).fit(X)
pprint(table_vec_weekday.get_feature_names_out())
array(['date_year', 'date_month', 'date_day', 'date_hour',
       'date_total_seconds', 'date_weekday', 'holiday', 'temp', 'hum',
       'windspeed', 'weathersit'], dtype='<U18')

Encoding datetime features

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

pipeline_weekday = make_pipeline(table_vec_weekday, HistGradientBoostingRegressor())

cross_val_score(
    pipeline_weekday, X, y, scoring="neg_mean_squared_error",
    cv=TimeSeriesSplit(n_splits=5),
)
array([ -3694.45159469,  -3180.1148674 , -15183.44808403,  -4778.26021095,
        -5413.53488034])

Encoding datetime features

Timeseries support in skrub is still in its early stages! Please stay tuned for new developments.

Augmenting features through joining

AggJoiner: automatically aggregate values

import skrub
import pandas as pd

df1 = pd.DataFrame({
    "UID": [28, 32, 28], 
    "Basket ID": [1100, 1300, 1400]})
df2 = pd.DataFrame({
    "Basket ID": [1100, 1100, 1100, 1300, 1400], 
    "Product ID": ["A521", "B695", "F221", "W214", "B695",], 
    "Price": [25, 30, 10, 320, 30]})

joiner = skrub.AggJoiner(df2, operations="sum", key="Basket ID", cols="Price")
joiner.fit_transform(df1)
   UID  Basket ID  Price_sum
0   28       1100         65
1   32       1300        320
2   28       1400         30
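
Several aggregations can be requested at once; a hypothetical variant of the example above, assuming operations also accepts a list of operation names:

# Sum and mean of the basket prices in a single join.
joiner = skrub.AggJoiner(df2, operations=["sum", "mean"], key="Basket ID", cols="Price")
joiner.fit_transform(df1)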

InterpolationJoiner: infer missing values

import pandas as pd

from skrub.datasets import fetch_flight_delays

dataset = fetch_flight_delays()
weather = dataset.weather
weather = weather.sample(100_000, random_state=0, ignore_index=True)
stations = dataset.stations
weather = stations.merge(weather, on="ID")[
    ["LATITUDE", "LONGITUDE", "YEAR/MONTH/DAY", "TMAX", "PRCP", "SNOW"]
]
weather["YEAR/MONTH/DAY"] = pd.to_datetime(weather["YEAR/MONTH/DAY"])
weather.head()
   LATITUDE  LONGITUDE YEAR/MONTH/DAY   TMAX  PRCP  SNOW
0    25.333     55.517     2008-11-16  297.0   NaN   NaN
1    25.333     55.517     2008-04-12  333.0   NaN   NaN
2    25.255     55.364     2008-08-28  430.0   0.0   NaN
3    25.255     55.364     2008-02-17  264.0   0.0   NaN
4    25.255     55.364     2008-11-25  291.0   NaN   NaN

InterpolationJoiner

from skrub import InterpolationJoiner

# main_table / aux_table: two halves of the weather table (one possible split;
# the slides do not show how the data was divided).
n_main = weather.shape[0] // 2
main_table = weather.iloc[:n_main]
aux_table = weather.iloc[n_main:]

joiner = InterpolationJoiner(
    aux_table,
    key=["LATITUDE", "LONGITUDE", "YEAR/MONTH/DAY"],
    suffix="_predicted",
).fit(main_table)
join = joiner.transform(main_table)
join.head()
   LATITUDE  LONGITUDE YEAR/MONTH/DAY   TMAX  PRCP  SNOW  TMAX_predicted  PRCP_predicted  SNOW_predicted
0    25.333     55.517     2008-11-16  297.0   NaN   NaN      283.132081       17.235774        0.064046
1    25.333     55.517     2008-04-12  333.0   NaN   NaN      308.980154       18.474388        0.177245
2    25.255     55.364     2008-08-28  430.0   0.0   NaN      356.484409       61.132701        0.010053
3    25.255     55.364     2008-02-17  264.0   0.0   NaN      298.779061       42.264107        0.402316
4    25.255     55.364     2008-11-25  291.0   NaN   NaN      276.454237       12.241012       -0.007211

Augmenting features through joining

  • Joiner: performs fuzzy joining, matching columns that contain similar-looking values (see the sketch below).
  • AggJoiner: aggregates an auxiliary dataframe before joining it on a base dataframe, creating new features that aggregate (sum, mean, mode…) the values in its columns.
  • MultiAggJoiner: extends AggJoiner to a multi-table scenario.
  • InterpolationJoiner: performs an equi-join and estimates what missing rows would contain if they existed in the table.

All Joiner objects are scikit-learn estimators, so they can be used in a Pipeline.
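
As an illustration of the fuzzy join (toy data, not from the slides; parameter names follow the current Joiner API):

import pandas as pd
from skrub import Joiner

# Toy example: match country names despite the misspelling "Frence".
airports = pd.DataFrame({"airport_id": [1, 2], "country": ["Germany", "Frence"]})
capitals = pd.DataFrame(
    {"country_name": ["Germany", "France"], "capital": ["Berlin", "Paris"]}
)

joiner = Joiner(
    capitals,
    main_key="country",
    aux_key="country_name",
    suffix="_aux",
    max_dist=float("inf"),  # keep the closest match regardless of distance
)
augmented = joiner.fit_transform(airports)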

Additional goodies: deduplication

Deduplicate misspelled categories

from pprint import pprint
from skrub import deduplicate

# A small list of categories containing misspellings.
duplicated = ["ulack", "black", "black", "xudte", "white", "white"]
pprint(duplicated)
deduplicate_correspondence = deduplicate(duplicated)
pprint(deduplicate_correspondence.to_dict())
['ulack', 'black', 'black', 'xudte', 'white', 'white']
{'black': 'black', 'ulack': 'black', 'white': 'white', 'xudte': 'white'}
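
The returned mapping can then be applied to clean the original column, for example with pandas:

import pandas as pd

# Replace each misspelled category by its deduplicated counterpart.
clean = pd.Series(duplicated).map(deduplicate_correspondence)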

Doc

Additional goodies: Wikipedia embeddings as features

KEN embeddings capture relational information about all entities in Wikipedia.

from skrub.datasets import fetch_ken_embeddings
embedding_games = fetch_ken_embeddings(
    search_types="game",
    exclude="companies|developer",
    embedding_table_id="games",
)

Doc

Getting involved

Install skrub

Base installation:

# with pip
pip install skrub -U
# with conda
conda install -c conda-forge skrub

For deep learning features such as TextEncoder:

# with pip
pip install skrub[transformers] -U
# with conda ("[transformers]" extras are pip-only; install the extra
# dependencies, such as sentence-transformers, alongside skrub)
conda install -c conda-forge skrub sentence-transformers
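
Once the extra dependencies are installed, the language-model encoder plugs into the vectorizer like any other (a sketch; the underlying model is downloaded on first use):

import skrub

# Hypothetical usage: encode high-cardinality text columns with a pre-trained
# language model instead of the default encoder.
vectorizer = skrub.TableVectorizer(high_cardinality=skrub.TextEncoder())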

Join the community

Contribute to skrub

  • Open an issue on GitHub
  • Check out the documentation on how to contribute
  • Join the breakout session this afternoon

Q & A time!

Please play around with the TableReport examples on the website!

  • Do you write code directly? If so, in which language(s)?
  • Do you use specific platforms (e.g., Dataiku)?
  • Do you use mostly pandas, polars, or other libraries?
  • Where is your data stored?
  • What dev environment do you use? Notebooks, IDEs…
  • How do you organize your code? A complete package, a few scripts, …
  • Do you use circular or spline encoding for datetime features?

https://bit.ly/skrub-workshop