Getting Started#

This guide showcases the features of skrub, an open-source package that aims to bridge the gap between tabular data sources and machine-learning models.

Much of skrub revolves around vectorizing, assembling, and encoding tabular data, preparing it in a format that shallow or classic machine-learning models understand.

Downloading example datasets#

The datasets module allows us to download tabular datasets and demonstrate skrub’s features.

Note

You can control the directory where the datasets are stored by:

  • setting the SKRUB_DATA_DIRECTORY environment variable to an absolute directory path,

  • passing the data_directory parameter to the fetch functions, which takes precedence over the environment variable.

By default, the datasets are stored in a folder named “skrub_data” in the user home folder.

Explore all the available datasets in the Downloading a dataset section of the documentation.

Generating an interactive report for a dataframe#

The Cleaner cleans a dataframe by parsing nulls and dates and dropping columns that contain too many nulls. To quickly get an overview of a dataframe’s contents, use the TableReport.




You can use the interactive display above to explore the dataset visually.

Note

You can see a few more example reports online. We also provide an experimental online demo that allows you to select a CSV or parquet file and generate a report directly in your web browser, without installing anything.

It is also possible to tell skrub to replace the default pandas & polars displays with TableReport.

from skrub import set_config

set_config(use_tablereport=True)

employees_df




This setting can easily be reverted:

set_config(use_tablereport=False)

employees_df
     | gender | department | department_name | division | assignment_category | employee_position_title | date_first_hired | year_first_hired
   0 | F | POL | Department of Police | MSB Information Mgmt and Tech Division Records... | Fulltime-Regular | Office Services Coordinator | 1986-09-22 | 1986
   1 | M | POL | Department of Police | ISB Major Crimes Division Fugitive Section | Fulltime-Regular | Master Police Officer | 1988-09-12 | 1988
   2 | F | HHS | Department of Health and Human Services | Adult Protective and Case Management Services | Fulltime-Regular | Social Worker IV | 1989-11-19 | 1989
   3 | M | COR | Correction and Rehabilitation | PRRS Facility and Security | Fulltime-Regular | Resident Supervisor II | 2014-05-05 | 2014
   4 | M | HCA | Department of Housing and Community Affairs | Affordable Housing Programs | Fulltime-Regular | Planning Specialist III | 2007-03-05 | 2007
 ... | ... | ... | ... | ... | ... | ... | ... | ...
9223 | F | HHS | Department of Health and Human Services | School Based Health Centers | Fulltime-Regular | Community Health Nurse II | 2015-11-03 | 2015
9224 | F | FRS | Fire and Rescue Services | Human Resources Division | Fulltime-Regular | Fire/Rescue Division Chief | 1988-11-28 | 1988
9225 | M | HHS | Department of Health and Human Services | Child and Adolescent Mental Health Clinic Serv... | Parttime-Regular | Medical Doctor IV - Psychiatrist | 2001-04-30 | 2001
9226 | M | CCL | County Council | Council Central Staff | Fulltime-Regular | Manager II | 2006-09-05 | 2006
9227 | M | DLC | Department of Liquor Control | Licensure, Regulation and Education | Fulltime-Regular | Alcohol/Tobacco Enforcement Specialist II | 2012-01-30 | 2012

9228 rows × 8 columns



Easily building a strong baseline for tabular machine learning#

The goal of skrub is to ease tabular data preparation for machine learning. The tabular_learner() function provides an easy way to build a simple yet reliable machine-learning model that works well on most tabular data.

from sklearn.model_selection import cross_validate

from skrub import tabular_learner

model = tabular_learner("regressor")
results = cross_validate(model, employees_df, salaries)
results["test_score"]
array([0.91129818, 0.88013711, 0.91451364, 0.92117174, 0.92487738])

To handle rich tabular data and feed it to a machine-learning model, the pipeline returned by tabular_learner() preprocesses and encodes strings, categories and dates using the TableVectorizer. See its documentation or Encoding: from a dataframe to a numerical matrix for machine learning for more details. An overview of the chosen defaults is available in End-to-end predictive models.

Assembling data#

Skrub allows imperfect assembly of data, such as joining dataframes on columns that contain typos. Skrub’s joiners have fit and transform methods, storing information about the data across calls.

The Joiner allows fuzzy-joining multiple tables: each row of a main table is augmented with values from the best match in the auxiliary table. You can control how distant fuzzy matches are allowed to be with the max_dist parameter.

In the following, we add information about countries to a table containing airports and the cities they are in:

import pandas as pd

from skrub import Joiner

airports = pd.DataFrame(
    {
        "airport_id": [1, 2],
        "airport_name": ["Charles de Gaulle", "Aeroporto Leonardo da Vinci"],
        "city": ["Paris", "Roma"],
    }
)
# notice the "Rome" instead of "Roma"
capitals = pd.DataFrame(
    {"capital": ["Berlin", "Paris", "Rome"], "country": ["Germany", "France", "Italy"]}
)
joiner = Joiner(
    capitals,
    main_key="city",
    aux_key="capital",
    max_dist=0.8,
    add_match_info=False,
)
joiner.fit_transform(airports)
   | airport_id | airport_name | city | capital | country
 0 | 1 | Charles de Gaulle | Paris | Paris | France
 1 | 2 | Aeroporto Leonardo da Vinci | Roma | Rome | Italy


Information about countries has been added, even though the rows don’t match exactly.

It’s also possible to augment data by joining and aggregating multiple dataframes with the AggJoiner. This is particularly useful for summarizing information scattered across tables, for instance adding statistics about flights to the dataframe of airports:

from skrub import AggJoiner

flights = pd.DataFrame(
    {
        "flight_id": range(1, 7),
        "from_airport": [1, 1, 1, 2, 2, 2],
        "total_passengers": [90, 120, 100, 70, 80, 90],
        "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
    }
)
agg_joiner = AggJoiner(
    aux_table=flights,
    main_key="airport_id",
    aux_key="from_airport",
    cols=["total_passengers"],  # the cols to perform aggregation on
    operations=["mean", "std"],  # the operations to compute
)
agg_joiner.fit_transform(airports)
   | airport_id | airport_name | city | total_passengers_mean | total_passengers_std
 0 | 1 | Charles de Gaulle | Paris | 103.333333 | 15.275252
 1 | 2 | Aeroporto Leonardo da Vinci | Roma | 80.000000 | 10.000000


For joining multiple auxiliary tables on a main table at once, use the MultiAggJoiner.

See other ways to join multiple tables in Assembling: joining multiple tables.

Encoding data#

When a column contains categories with variations and typos, it can be encoded using one of skrub’s encoders, such as the GapEncoder.

The GapEncoder creates a continuous encoding based on the activation of latent categories, built from combinations of substrings that frequently co-occur.

For instance, we might want to encode a column X that contains information about cities, either Madrid or Rome:

from skrub import GapEncoder

X = pd.Series(
    [
        "Rome, Italy",
        "Rome",
        "Roma, Italia",
        "Madrid, SP",
        "Madrid, spain",
        "Madrid",
        "Romq",
        "Rome, It",
    ],
    name="city",
)
enc = GapEncoder(n_components=2, random_state=0)  # 2 topics in the data
enc.fit(X)
GapEncoder(n_components=2, random_state=0)


The GapEncoder has found the following two topics:

['city: madrid, spain, sp', 'city: italia, italy, romq']

These correspond to the two cities.

Let’s see the activation of each topic depending on the rows of X:

encoded = enc.fit_transform(X).assign(original=X)
encoded
   | city: madrid, spain, sp | city: italia, italy, romq | original
 0 | 0.052257 | 13.547743 | Rome, Italy
 1 | 0.050202 | 3.049798 | Rome
 2 | 0.063282 | 15.036718 | Roma, Italia
 3 | 12.047028 | 0.052972 | Madrid, SP
 4 | 16.547818 | 0.052182 | Madrid, spain
 5 | 6.048861 | 0.051139 | Madrid
 6 | 0.050019 | 3.049981 | Romq
 7 | 0.053193 | 9.046807 | Rome, It


The higher the activation, the closer the row is to the latent topic. These columns can now be understood by a machine-learning model.

The other encoders are presented in Encoding: creating feature matrices.

Next steps#

We have briefly covered pipeline creation, vectorizing, assembling, and encoding data. We presented the main functionalities of skrub, but there is much more to it!

Please refer to our User guide for a more in-depth presentation of skrub’s concepts, or visit our examples for more illustrations of the tools that we provide!

Total running time of the script: (0 minutes 8.493 seconds)

Gallery generated by Sphinx-Gallery