Skrub workshop - Introducing Skrub

Less wrangling, more machine learning

Riccardo Cappuzzo, Jérôme Dockès, Vincent Maladiere

Inria, P16, Probabl.

2025-01-29

Plan for the presentation

  • Introducing skrub
    • Example use case
    • Detailed explanation of the features
    • Getting involved

In the beginning…

Skrub stems from the development of dirty_cat, a package that provided support for handling dirty columns and performing fuzzy joins across tables.

It has since evolved into a package that provides:

  • interactive data exploration
  • automated pre-processing of pandas and polars dataframes
  • powerful feature engineering

Skrub’s vision

The goal of skrub is to facilitate building and deploying machine-learning models on pandas and polars dataframes (later, SQL databases…)

Skrub is high-level, with a philosophy and an API matching those of scikit-learn. It strives to bridge the worlds of databases and machine learning, enabling the assembly and representation of data even when it is noisy or imperfect.

Let’s set the stage

An example use case

  1. Gather some data
    • Employee salaries, census, customer churn…
  2. Explore the data
    • Null values, dtypes, correlated features…
  3. Pre-process the data
  4. Build a scikit-learn estimator
  5. ???
  6. Profit 📈

Exploring the data

import skrub
import pandas as pd
from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()
employees, salaries = dataset.X, dataset.y

employees.describe(include="all")
gender department department_name division assignment_category employee_position_title date_first_hired year_first_hired
count 9211 9228 9228 9228 9228 9228 9228 9228.000000
unique 2 37 37 694 2 443 2264 NaN
top M POL Department of Police School Health Services Fulltime-Regular Bus Operator 12/12/2016 NaN
freq 5481 1844 1844 300 8394 638 87 NaN
mean NaN NaN NaN NaN NaN NaN NaN 2003.597529
std NaN NaN NaN NaN NaN NaN NaN 9.327078
min NaN NaN NaN NaN NaN NaN NaN 1965.000000
25% NaN NaN NaN NaN NaN NaN NaN 1998.000000
50% NaN NaN NaN NaN NaN NaN NaN 2005.000000
75% NaN NaN NaN NaN NaN NaN NaN 2012.000000
max NaN NaN NaN NaN NaN NaN NaN 2016.000000

Exploring the data… interactively!

from skrub import TableReport
TableReport(employees)

Preview

Main features:

  • Obtain high-level statistics about the data (number of unique values, missing values…)
  • Explore the distribution of values and find outliers
  • Discover highly correlated columns
  • Export and share the report as an HTML file (see the snippet below)
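
A minimal sketch of exporting a report, assuming the employees dataframe loaded earlier (write_html saves the interactive report as a standalone HTML file):

from skrub import TableReport

report = TableReport(employees)
# save the interactive report so it can be opened in any browser and shared
report.write_html("employees_report.html")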

Build a predictive pipeline

from sklearn.linear_model import Ridge
model = Ridge()

Build a predictive pipeline

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(StandardScaler(), Ridge())

Build a predictive pipeline

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

model = make_pipeline(SimpleImputer(), StandardScaler(), Ridge())

Build a predictive pipeline

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer

categorical_columns = selector(dtype_include=object)(employees)
numerical_columns = selector(dtype_exclude=object)(employees)

ct = make_column_transformer(
      (StandardScaler(),
       numerical_columns),
      (OneHotEncoder(handle_unknown="ignore"),
       categorical_columns))

model = make_pipeline(ct, SimpleImputer(), Ridge())

Build a predictive pipeline

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['year_first_hired']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['gender', 'department',
                                                   'department_name',
                                                   'division',
                                                   'assignment_category',
                                                   'employee_position_title',
                                                   'date_first_hired'])])),
                ('simpleimputer', SimpleImputer()), ('ridge', Ridge())])

Enter: tabular_learner

import skrub
from sklearn.linear_model import Ridge
tl = skrub.tabular_learner(Ridge())

Enter: tabular_learner

Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(datetime=DatetimeEncoder(periodic_encoding='spline'))),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler()), ('ridge', Ridge())])

A robust baseline: tabular_learner

Given a scikit-learn estimator, tabular_learner wraps it in a full pipeline: a TableVectorizer that turns the dataframe into numeric features, missing-value imputation, and scaling when the model benefits from it (as in the Ridge pipeline shown above).

You can also write tabular_learner("regressor") to get a strong default model built around a HistGradientBoostingRegressor:

Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(low_cardinality=ToCategorical())),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor())])

Skrub’s main features

Unmasking the tabular_learner

Under the hood: TableVectorizer

  1. Pre-process the data
  2. Convert complex data types (datetimes, text) into numerical features

Under the hood: TableVectorizer

Pre-process the data

Under the hood: TableVectorizer

Convert complex data types (datetimes, text) into numerical features

High-cardinality encoders are robust in the presence of typos and dirty data (see the sketch below).
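
A minimal sketch of one such encoder on a dirty column, assuming the employees dataframe from earlier and skrub's single-column encoder API (GapEncoder learns latent "topics" from the strings):

from skrub import GapEncoder

# encode a high-cardinality, typo-prone column into a few latent topics
encoder = GapEncoder(n_components=10)
position_features = encoder.fit_transform(employees["employee_position_title"])
print(position_features.shape)           # one row per employee, 10 topic columns
print(encoder.get_feature_names_out())   # human-readable topic names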

Under the hood: TableVectorizer

vectorizer = skrub.TableVectorizer()
transformed = vectorizer.fit_transform(employees)
from pprint import pprint

pprint(vectorizer.column_to_kind_)
{'assignment_category': 'low_cardinality',
 'date_first_hired': 'datetime',
 'department': 'low_cardinality',
 'department_name': 'low_cardinality',
 'division': 'high_cardinality',
 'employee_position_title': 'high_cardinality',
 'gender': 'low_cardinality',
 'year_first_hired': 'numeric'}
pprint(vectorizer.all_processing_steps_["date_first_hired"])
[CleanNullStrings(),
 DropUninformative(),
 ToDatetime(),
 DatetimeEncoder(),
 {'date_first_hired_day': ToFloat32(), 'date_first_hired_month': ToFloat32(), ...}]

Encoding datetime features

import pandas as pd

data = pd.read_csv(
    "https://raw.githubusercontent.com/skrub-data/datasets/master"
    "/data/bike-sharing-dataset.csv"
)
# Extract our input data (X) and the target column (y)
y = data["cnt"]
X = data[["date", "holiday", "temp", "hum", "windspeed", "weathersit"]]

X
date holiday temp hum windspeed weathersit
0 2011-01-01 00:00:00 0 0.24 0.81 0.0000 1
1 2011-01-01 01:00:00 0 0.22 0.80 0.0000 1
2 2011-01-01 02:00:00 0 0.22 0.80 0.0000 1
3 2011-01-01 03:00:00 0 0.24 0.75 0.0000 1
4 2011-01-01 04:00:00 0 0.24 0.75 0.0000 1
... ... ... ... ... ... ...
17374 2012-12-31 19:00:00 0 0.26 0.60 0.1642 2
17375 2012-12-31 20:00:00 0 0.26 0.60 0.1642 2
17376 2012-12-31 21:00:00 0 0.26 0.60 0.1642 1
17377 2012-12-31 22:00:00 0 0.26 0.56 0.1343 1
17378 2012-12-31 23:00:00 0 0.26 0.65 0.1343 1

17379 rows × 6 columns

Encoding datetime features

from pprint import pprint
from skrub import TableVectorizer, DatetimeEncoder

table_vec_weekday = TableVectorizer(datetime=DatetimeEncoder(add_weekday=True)).fit(X)
pprint(table_vec_weekday.get_feature_names_out())
array(['date_year', 'date_month', 'date_day', 'date_hour',
       'date_total_seconds', 'date_weekday', 'holiday', 'temp', 'hum',
       'windspeed', 'weathersit'], dtype='<U18')

Encoding datetime features

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

pipeline_weekday = make_pipeline(table_vec_weekday, HistGradientBoostingRegressor())

cross_val_score(
    pipeline_weekday, X, y, scoring="neg_mean_squared_error",
    cv=TimeSeriesSplit(n_splits=5),
)
array([ -3694.45159469,  -3180.1148674 , -15183.44808403,  -4824.29173547,
        -5391.06731737])
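
As a variant of the pipeline above, a sketch that encodes the date with smooth periodic (spline) features instead, using the periodic_encoding option seen in the tabular_learner pipeline earlier; scores are omitted here:

from skrub import TableVectorizer, DatetimeEncoder
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# same pipeline, but hour/month become periodic spline features rather than raw integers
table_vec_periodic = TableVectorizer(datetime=DatetimeEncoder(periodic_encoding="spline"))
pipeline_periodic = make_pipeline(table_vec_periodic, HistGradientBoostingRegressor())

cross_val_score(
    pipeline_periodic, X, y, scoring="neg_mean_squared_error",
    cv=TimeSeriesSplit(n_splits=5),
)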

Encoding datetime features

Timeseries support in skrub is still in its early stages! Please stay tuned for new developments.

Augmenting features through joining
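
Beyond exact joins, skrub's Joiner performs fuzzy joins: it matches rows whose keys are close but not identical. A minimal sketch with made-up tables:

import pandas as pd
import skrub

# main table with a misspelled country name
main = pd.DataFrame({"Country": ["France", "Germany", "Itali"]})
# auxiliary table with canonical spellings and extra information to bring in
aux = pd.DataFrame({
    "Country": ["France", "Germany", "Italy"],
    "Capital": ["Paris", "Berlin", "Rome"]})

# fuzzy-join the capital onto the main table despite the typo in "Itali"
joiner = skrub.Joiner(aux, main_key="Country", aux_key="Country", suffix="_aux")
joiner.fit_transform(main)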

AggJoiner: automatically aggregate values

import skrub
import pandas as pd

df1 = pd.DataFrame({
    "UID": [28, 32, 28], 
    "Basket ID": [1100, 1300, 1400]})
df2 = pd.DataFrame({
    "Basket ID": [1100, 1100, 1100, 1300, 1400], 
    "Product ID": ["A521", "B695", "F221", "W214", "B695",], 
    "Price": [25, 30, 10, 320, 30]})

joiner = skrub.AggJoiner(df2, operations="sum", key="Basket ID", cols="Price")
joiner.fit_transform(df1)
UID Basket ID Price_sum
0 28 1100 65
1 32 1300 320
2 28 1400 30

InterpolationJoiner: infer missing values

import pandas as pd

from skrub.datasets import fetch_flight_delays

dataset = fetch_flight_delays()
weather = dataset.weather
weather = weather.sample(100_000, random_state=0, ignore_index=True)
stations = dataset.stations
weather = stations.merge(weather, on="ID")[
    ["LATITUDE", "LONGITUDE", "YEAR/MONTH/DAY", "TMAX", "PRCP", "SNOW"]
]
weather["YEAR/MONTH/DAY"] = pd.to_datetime(weather["YEAR/MONTH/DAY"])
LATITUDE LONGITUDE YEAR/MONTH/DAY TMAX PRCP SNOW
0 25.333 55.517 2008-11-16 297.0 NaN NaN
1 25.333 55.517 2008-04-12 333.0 NaN NaN
2 25.255 55.364 2008-08-28 430.0 0.0 NaN
3 25.255 55.364 2008-02-17 264.0 0.0 NaN
4 25.255 55.364 2008-11-25 291.0 NaN NaN

InterpolationJoiner

from skrub import InterpolationJoiner

# split weather into a main table (values to predict) and an auxiliary table
# the joiner learns from; any disjoint split of the rows works for this demo
n_main = weather.shape[0] // 2
main_table = weather.iloc[:n_main]
aux_table = weather.iloc[n_main:]
joiner = InterpolationJoiner(
    aux_table,
    key=["LATITUDE", "LONGITUDE", "YEAR/MONTH/DAY"],
    suffix="_predicted",
).fit(main_table)
join = joiner.transform(main_table)
join.head()
LATITUDE LONGITUDE YEAR/MONTH/DAY TMAX PRCP SNOW TMAX_predicted PRCP_predicted SNOW_predicted
0 25.333 55.517 2008-11-16 297.0 NaN NaN 253.131219 10.099097 -0.001023
1 25.333 55.517 2008-04-12 333.0 NaN NaN 292.267381 8.770696 0.142712
2 25.255 55.364 2008-08-28 430.0 0.0 NaN 330.768383 22.541307 -0.053227
3 25.255 55.364 2008-02-17 264.0 0.0 NaN 274.812990 12.980406 0.239012
4 25.255 55.364 2008-11-25 291.0 NaN NaN 260.411121 6.621378 -0.014548

InterpolationJoiner

Augmenting features through joining

All Joiner objects are scikit-learn estimators, so they can be used in a Pipeline.
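
A minimal sketch of a joiner as a pipeline step, reusing the baskets tables from the AggJoiner slide (the Ridge model at the end is just a placeholder):

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
import skrub

# enrich each basket with the aggregated price before vectorizing and fitting
pipeline = make_pipeline(
    skrub.AggJoiner(df2, operations="sum", key="Basket ID", cols="Price"),
    skrub.TableVectorizer(),
    Ridge(),
)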

Additional goodies: deduplication

deduplicate maps misspelled categories to their most likely correct spelling:

from pprint import pprint
from skrub import deduplicate

# a small list of categories containing typos
duplicated = ["ulack", "black", "black", "xudte", "white", "white"]
pprint(duplicated)
deduplicate_correspondence = deduplicate(duplicated)
pprint(deduplicate_correspondence.to_dict())
['ulack', 'black', 'black', 'xudte', 'white', 'white']
{'black': 'black', 'ulack': 'black', 'white': 'white', 'xudte': 'white'}

Doc

Additional goodies: Wikipedia embeddings as features

KEN embeddings capture relational information about all entities in Wikipedia.

from skrub.datasets import fetch_ken_embeddings
embedding_games = fetch_ken_embeddings(
    search_types="game",
    exclude="companies|developer",
    embedding_table_id="games",
)

Doc

Getting involved

Install skrub

Base installation:

# with pip
pip install skrub -U
# with conda
conda install -c conda-forge skrub

For deep learning features such as TextEncoder:

# with pip
pip install skrub[transformers] -U
# with conda (conda has no "extras" syntax: install the optional
# dependencies, such as transformers and sentence-transformers, alongside skrub)
conda install -c conda-forge skrub transformers sentence-transformers
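
Once those optional dependencies are installed, TextEncoder can embed a column of free text with a pretrained language model. A minimal sketch (the toy column is made up; the default pretrained model is downloaded on first use):

import pandas as pd
from skrub import TextEncoder

# a toy free-text column; TextEncoder turns each string into a dense embedding
positions = pd.Series(["Bus Operator", "Police Officer", "School Nurse"], name="position")
encoder = TextEncoder()  # wraps a pretrained sentence-transformers model
embeddings = encoder.fit_transform(positions)
print(embeddings.shape)  # one row per string, one column per embedding dimension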

Join the community

Contribute to skrub

Q & A time!