
Getting Started#

This guide showcases the features of skrub, an open-source package that aims to bridge the gap between tabular data sources and machine-learning models.

Much of skrub revolves around vectorizing, assembling, and encoding tabular data, to prepare data in a format that shallow or classic machine-learning models understand.

Downloading example datasets#

The datasets module allows us to download tabular datasets and demonstrate skrub’s features.

Note

You can control the directory where the datasets are stored by:

setting the SKRUB_DATA_DIRECTORY environment variable to an absolute directory path,
using the data_directory parameter of the fetch functions, which takes precedence over the environment variable.

By default, the datasets are stored in a folder named “skrub_data” in the user home folder.
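The precedence described in this note can be sketched with the standard library alone. The helper name resolve_data_dir is hypothetical and is not part of skrub; it only mirrors the order stated above (explicit argument, then environment variable, then the default folder).

```python
import os
from pathlib import Path


def resolve_data_dir(data_directory=None):
    """Hypothetical helper mirroring the precedence described in the note."""
    if data_directory is not None:
        # An explicit argument wins over the environment variable.
        return Path(data_directory)
    env_dir = os.environ.get("SKRUB_DATA_DIRECTORY")
    if env_dir:
        return Path(env_dir)
    # Fall back to a "skrub_data" folder in the user's home directory.
    return Path.home() / "skrub_data"


os.environ["SKRUB_DATA_DIRECTORY"] = "/tmp/skrub_env"
print(resolve_data_dir())                  # the environment variable is used
print(resolve_data_dir("/tmp/skrub_arg"))  # the explicit argument takes precedence
```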

from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()
employees_df, salaries = dataset.X, dataset.y

Explore all the available datasets in Datasets.

Generating an interactive report for a dataframe#

The Cleaner cleans a dataframe by parsing null values and dates, and by dropping columns with too many nulls. To quickly get an overview of a dataframe’s contents, use the TableReport.

from skrub import Cleaner, TableReport

employees_df = Cleaner().fit_transform(employees_df)
TableReport(employees_df)

	gender	department	department_name	division	assignment_category	employee_position_title	date_first_hired	year_first_hired
0	F	POL	Department of Police	MSB Information Mgmt and Tech Division Records Management Section	Fulltime-Regular	Office Services Coordinator	1986-09-22 00:00:00	1,986
1	M	POL	Department of Police	ISB Major Crimes Division Fugitive Section	Fulltime-Regular	Master Police Officer	1988-09-12 00:00:00	1,988
2	F	HHS	Department of Health and Human Services	Adult Protective and Case Management Services	Fulltime-Regular	Social Worker IV	1989-11-19 00:00:00	1,989
3	M	COR	Correction and Rehabilitation	PRRS Facility and Security	Fulltime-Regular	Resident Supervisor II	2014-05-05 00:00:00	2,014
4	M	HCA	Department of Housing and Community Affairs	Affordable Housing Programs	Fulltime-Regular	Planning Specialist III	2007-03-05 00:00:00	2,007

9,223	F	HHS	Department of Health and Human Services	School Based Health Centers	Fulltime-Regular	Community Health Nurse II	2015-11-03 00:00:00	2,015
9,224	F	FRS	Fire and Rescue Services	Human Resources Division	Fulltime-Regular	Fire/Rescue Division Chief	1988-11-28 00:00:00	1,988
9,225	M	HHS	Department of Health and Human Services	Child and Adolescent Mental Health Clinic Services	Parttime-Regular	Medical Doctor IV - Psychiatrist	2001-04-30 00:00:00	2,001
9,226	M	CCL	County Council	Council Central Staff	Fulltime-Regular	Manager II	2006-09-05 00:00:00	2,006
9,227	M	DLC	Department of Liquor Control	Licensure, Regulation and Education	Fulltime-Regular	Alcohol/Tobacco Enforcement Specialist II	2012-01-30 00:00:00	2,012

Column	Column name	dtype	Is sorted	Null values	Unique values	Mean	Std	Min	Median	Max
0	gender	ObjectDType	False	17 (0.2%)	2 (< 0.1%)
1	department	ObjectDType	False	0 (0.0%)	37 (0.4%)
2	department_name	ObjectDType	False	0 (0.0%)	37 (0.4%)
3	division	ObjectDType	False	0 (0.0%)	694 (7.5%)
4	assignment_category	ObjectDType	False	0 (0.0%)	2 (< 0.1%)
5	employee_position_title	ObjectDType	False	0 (0.0%)	443 (4.8%)
6	date_first_hired	DateTime64DType	False	0 (0.0%)	2264 (24.5%)			1965-09-30T00:00:00		2016-12-27T00:00:00
7	year_first_hired	Int64DType	False	0 (0.0%)	51 (0.6%)	2.00e+03	9.33	1,965	2,005	2,016

Column 1	Column 2	Cramér's V
department	department_name	1.00
date_first_hired	year_first_hired	0.944
division	assignment_category	0.594
assignment_category	employee_position_title	0.480
department_name	employee_position_title	0.420
department	employee_position_title	0.420
division	employee_position_title	0.414
department	assignment_category	0.411
department_name	assignment_category	0.411
gender	department	0.373
gender	department_name	0.373
department	division	0.363
department_name	division	0.363
gender	employee_position_title	0.277
gender	division	0.259
gender	assignment_category	0.240
employee_position_title	year_first_hired	0.125
employee_position_title	date_first_hired	0.124
department_name	date_first_hired	0.0817
department	date_first_hired	0.0817


You can use the interactive display above to explore the dataset visually.
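The report lists column associations as Cramér's V, a score between 0 (no association) and 1 (one column fully determines the other). As a rough illustration, here is the textbook, uncorrected formula for two categorical sequences. This is an assumption for illustration only: skrub's report may apply corrections or bin numeric and datetime columns, so its values need not match this sketch.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equally long categorical sequences."""
    n = len(xs)
    observed = Counter(zip(xs, ys))
    row_totals, col_totals = Counter(xs), Counter(ys)
    chi2 = 0.0
    for x, row_total in row_totals.items():
        for y, col_total in col_totals.items():
            # Count expected under independence of the two columns.
            expected = row_total * col_total / n
            chi2 += (observed[(x, y)] - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals))
    # With a single category on either side the statistic is undefined; return 0.
    return sqrt(chi2 / (n * (k - 1))) if k > 1 else 0.0


# A code and its full name carry the same information, so V is 1.
codes = ["POL", "POL", "HHS", "COR"]
names = ["Police", "Police", "Health", "Correction"]
print(round(cramers_v(codes, names), 3))
```

This is why department and department_name reach 1.00 in the table: each department code maps to exactly one department name.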

Note

You can see a few more example reports online. We also provide an experimental online demo that allows you to select a CSV or parquet file and generate a report directly in your web browser, without installing anything.

It is also possible to tell skrub to replace the default pandas and polars displays with the TableReport.

from skrub import set_config

set_config(use_table_report=True)

employees_df

	gender	department	department_name	division	assignment_category	employee_position_title	date_first_hired	year_first_hired
0	F	POL	Department of Police	MSB Information Mgmt and Tech Division Records Management Section	Fulltime-Regular	Office Services Coordinator	1986-09-22 00:00:00	1,986
1	M	POL	Department of Police	ISB Major Crimes Division Fugitive Section	Fulltime-Regular	Master Police Officer	1988-09-12 00:00:00	1,988
2	F	HHS	Department of Health and Human Services	Adult Protective and Case Management Services	Fulltime-Regular	Social Worker IV	1989-11-19 00:00:00	1,989
3	M	COR	Correction and Rehabilitation	PRRS Facility and Security	Fulltime-Regular	Resident Supervisor II	2014-05-05 00:00:00	2,014
4	M	HCA	Department of Housing and Community Affairs	Affordable Housing Programs	Fulltime-Regular	Planning Specialist III	2007-03-05 00:00:00	2,007

9,223	F	HHS	Department of Health and Human Services	School Based Health Centers	Fulltime-Regular	Community Health Nurse II	2015-11-03 00:00:00	2,015
9,224	F	FRS	Fire and Rescue Services	Human Resources Division	Fulltime-Regular	Fire/Rescue Division Chief	1988-11-28 00:00:00	1,988
9,225	M	HHS	Department of Health and Human Services	Child and Adolescent Mental Health Clinic Services	Parttime-Regular	Medical Doctor IV - Psychiatrist	2001-04-30 00:00:00	2,001
9,226	M	CCL	County Council	Council Central Staff	Fulltime-Regular	Manager II	2006-09-05 00:00:00	2,006
9,227	M	DLC	Department of Liquor Control	Licensure, Regulation and Education	Fulltime-Regular	Alcohol/Tobacco Enforcement Specialist II	2012-01-30 00:00:00	2,012

Column	Column name	dtype	Is sorted	Null values	Unique values	Mean	Std	Min	Median	Max
0	gender	ObjectDType	False	17 (0.2%)	2 (< 0.1%)
1	department	ObjectDType	False	0 (0.0%)	37 (0.4%)
2	department_name	ObjectDType	False	0 (0.0%)	37 (0.4%)
3	division	ObjectDType	False	0 (0.0%)	694 (7.5%)
4	assignment_category	ObjectDType	False	0 (0.0%)	2 (< 0.1%)
5	employee_position_title	ObjectDType	False	0 (0.0%)	443 (4.8%)
6	date_first_hired	DateTime64DType	False	0 (0.0%)	2264 (24.5%)			1965-09-30T00:00:00		2016-12-27T00:00:00
7	year_first_hired	Int64DType	False	0 (0.0%)	51 (0.6%)	2.00e+03	9.33	1,965	2,005	2,016

Column 1	Column 2	Cramér's V
department	department_name	1.00
date_first_hired	year_first_hired	0.948
division	assignment_category	0.626
assignment_category	employee_position_title	0.533
division	employee_position_title	0.434
department_name	employee_position_title	0.418
department	employee_position_title	0.418
department	assignment_category	0.379
department_name	assignment_category	0.379
gender	department	0.375
gender	department_name	0.375
department	division	0.372
department_name	division	0.372
gender	employee_position_title	0.271
gender	assignment_category	0.251
gender	division	0.247
employee_position_title	date_first_hired	0.145
employee_position_title	year_first_hired	0.145
gender	date_first_hired	0.0823
gender	year_first_hired	0.0804


This setting can easily be reverted:

set_config(use_table_report=False)

employees_df

	gender	department	department_name	division	assignment_category	employee_position_title	date_first_hired	year_first_hired
0	F	POL	Department of Police	MSB Information Mgmt and Tech Division Records...	Fulltime-Regular	Office Services Coordinator	1986-09-22	1986
1	M	POL	Department of Police	ISB Major Crimes Division Fugitive Section	Fulltime-Regular	Master Police Officer	1988-09-12	1988
2	F	HHS	Department of Health and Human Services	Adult Protective and Case Management Services	Fulltime-Regular	Social Worker IV	1989-11-19	1989
3	M	COR	Correction and Rehabilitation	PRRS Facility and Security	Fulltime-Regular	Resident Supervisor II	2014-05-05	2014
4	M	HCA	Department of Housing and Community Affairs	Affordable Housing Programs	Fulltime-Regular	Planning Specialist III	2007-03-05	2007
...	...	...	...	...	...	...	...	...
9223	F	HHS	Department of Health and Human Services	School Based Health Centers	Fulltime-Regular	Community Health Nurse II	2015-11-03	2015
9224	F	FRS	Fire and Rescue Services	Human Resources Division	Fulltime-Regular	Fire/Rescue Division Chief	1988-11-28	1988
9225	M	HHS	Department of Health and Human Services	Child and Adolescent Mental Health Clinic Serv...	Parttime-Regular	Medical Doctor IV - Psychiatrist	2001-04-30	2001
9226	M	CCL	County Council	Council Central Staff	Fulltime-Regular	Manager II	2006-09-05	2006
9227	M	DLC	Department of Liquor Control	Licensure, Regulation and Education	Fulltime-Regular	Alcohol/Tobacco Enforcement Specialist II	2012-01-30	2012

9228 rows × 8 columns

Easily building a strong baseline for tabular machine learning#

The goal of skrub is to ease tabular data preparation for machine learning. The tabular_pipeline() function provides an easy way to build a simple but reliable machine-learning model, working well on most tabular data.

from sklearn.model_selection import cross_validate

from skrub import tabular_pipeline

model = tabular_pipeline("regressor")
results = cross_validate(model, employees_df, salaries)
results["test_score"]

array([0.90530035, 0.87919783, 0.9167169 , 0.91965008, 0.92252475])
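The five values are per-fold test scores. For a regressor, scikit-learn's default score is the coefficient of determination (R²). A common way to report them is as a mean and a spread across folds. The sketch below copies the scores printed above and summarizes them with the standard library; the variable names are illustrative.

```python
from statistics import mean, stdev

# Per-fold test scores copied from the output above.
test_scores = [0.90530035, 0.87919783, 0.9167169, 0.91965008, 0.92252475]

score_mean = mean(test_scores)
score_std = stdev(test_scores)  # sample standard deviation across the folds
print(f"R2: {score_mean:.3f} +/- {score_std:.3f}")
```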

To handle rich tabular data and feed it to a machine-learning model, the pipeline returned by tabular_pipeline() preprocesses and encodes strings, categories and dates using the TableVectorizer. See its documentation or Encoding: from a dataframe to a numerical matrix for machine learning for more details. An overview of the chosen defaults is available in Strong baseline pipelines.

Assembling data#

Skrub allows imperfect assembly of data, such as joining dataframes on columns that contain typos. Skrub’s joiners have fit and transform methods, storing information about the data across calls.

The Joiner fuzzy-joins two tables: each row of the main table is augmented with values from its best match in the auxiliary table. You can control how distant fuzzy matches are allowed to be with the max_dist parameter.

In the following, we add information about countries to a table containing airports and the cities they are in:

import pandas as pd

from skrub import Joiner

airports = pd.DataFrame(
    {
        "airport_id": [1, 2],
        "airport_name": ["Charles de Gaulle", "Aeroporto Leonardo da Vinci"],
        "city": ["Paris", "Roma"],
    }
)
# notice the "Rome" instead of "Roma"
capitals = pd.DataFrame(
    {"capital": ["Berlin", "Paris", "Rome"], "country": ["Germany", "France", "Italy"]}
)
joiner = Joiner(
    capitals,
    main_key="city",
    aux_key="capital",
    max_dist=0.8,
    add_match_info=False,
)
joiner.fit_transform(airports)

	airport_id	airport_name	city	capital	country
0	1	Charles de Gaulle	Paris	Paris	France
1	2	Aeroporto Leonardo da Vinci	Roma	Rome	Italy

Information about countries has been added, even though the keys do not match exactly.
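To build intuition for the role of a distance threshold, here is a minimal sketch of nearest-match lookup using difflib from the standard library. It is a stand-in technique and not the Joiner's actual matching method: the fuzzy_match helper is hypothetical, and its distance scale (1 minus difflib's similarity ratio) is an assumption that is not comparable to the Joiner's max_dist values.

```python
from difflib import SequenceMatcher


def fuzzy_match(value, candidates, max_dist=0.8):
    """Return the closest candidate, or None if it is farther than max_dist."""
    best, best_dist = None, float("inf")
    for candidate in candidates:
        # Distance in [0, 1]: 0 means the two strings are identical.
        ratio = SequenceMatcher(None, value.lower(), candidate.lower()).ratio()
        dist = 1.0 - ratio
        if dist < best_dist:
            best, best_dist = candidate, dist
    return best if best_dist <= max_dist else None


capitals = {"Berlin": "Germany", "Paris": "France", "Rome": "Italy"}
for city in ["Paris", "Roma"]:
    match = fuzzy_match(city, capitals)
    print(city, "->", match, "->", capitals.get(match))
```

Lowering the threshold makes the matching stricter: with a small enough max_dist, "Roma" no longer matches "Rome" and the row would get no country.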

It’s also possible to augment data by joining and aggregating multiple dataframes with the AggJoiner. This is particularly useful to summarize information scattered across tables, for instance adding statistics about flights to the dataframe of airports:

from skrub import AggJoiner

flights = pd.DataFrame(
    {
        "flight_id": range(1, 7),
        "from_airport": [1, 1, 1, 2, 2, 2],
        "total_passengers": [90, 120, 100, 70, 80, 90],
        "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
    }
)
agg_joiner = AggJoiner(
    aux_table=flights,
    main_key="airport_id",
    aux_key="from_airport",
    cols=["total_passengers"],  # the cols to perform aggregation on
    operations=["mean", "std"],  # the operations to compute
)
agg_joiner.fit_transform(airports)

	airport_id	airport_name	city	total_passengers_mean	total_passengers_std
0	1	Charles de Gaulle	Paris	103.333333	15.275252
1	2	Aeroporto Leonardo da Vinci	Roma	80.000000	10.000000
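Conceptually, this amounts to grouping the auxiliary table by the join key, aggregating each group, and attaching the result to the main table. The sketch below reproduces the two statistics with plain Python to make those steps explicit. It illustrates the idea only and is not how the AggJoiner is implemented.

```python
from collections import defaultdict
from statistics import mean, stdev

airports = [
    {"airport_id": 1, "city": "Paris"},
    {"airport_id": 2, "city": "Roma"},
]
flights = [
    {"from_airport": 1, "total_passengers": 90},
    {"from_airport": 1, "total_passengers": 120},
    {"from_airport": 1, "total_passengers": 100},
    {"from_airport": 2, "total_passengers": 70},
    {"from_airport": 2, "total_passengers": 80},
    {"from_airport": 2, "total_passengers": 90},
]

# Step 1: group the auxiliary values by the join key.
passengers = defaultdict(list)
for flight in flights:
    passengers[flight["from_airport"]].append(flight["total_passengers"])

# Step 2: aggregate each group and attach the statistics to the main rows.
augmented = [
    {
        **airport,
        "total_passengers_mean": mean(passengers[airport["airport_id"]]),
        # Sample standard deviation, matching the values shown above.
        "total_passengers_std": stdev(passengers[airport["airport_id"]]),
    }
    for airport in airports
]
for row in augmented:
    print(row)
```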

For joining multiple auxiliary tables on a main table at once, use the MultiAggJoiner.

See other ways to join multiple tables in Assembling: joining multiple tables.

Encoding data#

When a column contains categories with variations and typos, it can be encoded using one of skrub’s encoders, such as the GapEncoder.

The GapEncoder creates a continuous encoding, based on the activation of latent categories. It will create the encoding based on combinations of substrings which frequently co-occur.

For instance, we might want to encode a column X that contains information about cities, each entry referring to either Madrid or Rome:

from skrub import GapEncoder

X = pd.Series(
    [
        "Rome, Italy",
        "Rome",
        "Roma, Italia",
        "Madrid, SP",
        "Madrid, spain",
        "Madrid",
        "Romq",
        "Rome, It",
    ],
    name="city",
)
enc = GapEncoder(n_components=2, random_state=0)  # 2 topics in the data
enc.fit(X)

GapEncoder(n_components=2, random_state=0)


The GapEncoder has found the following two topics:

enc.get_feature_names_out()

['city: madrid, spain, sp', 'city: italia, italy, romq']

These correspond to the two cities.

Let’s see the activation of each topic depending on the rows of X:

encoded = enc.fit_transform(X).assign(original=X)
encoded

	city: madrid, spain, sp	city: italia, italy, romq	original
0	0.005360	1.389512	Rome, Italy
1	0.005149	0.312800	Rome
2	0.006490	1.542228	Roma, Italia
3	1.235593	0.005433	Madrid, SP
4	1.697212	0.005352	Madrid, spain
5	0.620396	0.005245	Madrid
6	0.005130	0.312819	Romq
7	0.005456	0.927878	Rome, It

The higher the activation, the closer the row to the latent topic. These columns can now be understood by a machine-learning model.
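To see why a misspelling such as "Romq" still activates the same topic as "Rome", it helps to look at shared character n-grams. The sketch below is not the GapEncoder algorithm: it only compares sets of character 3-grams (a size chosen for illustration) to show the kind of substring overlap such an encoding can exploit. Both helper functions are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lower-cased string."""
    text = text.lower()
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def ngram_overlap(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    # Strings shorter than n have no n-grams; treat the overlap as 0.
    return len(grams_a & grams_b) / len(union) if union else 0.0


print(char_ngrams("Rome") & char_ngrams("Romq"))  # shared substrings
print(ngram_overlap("Rome", "Romq"))
print(ngram_overlap("Rome", "Madrid"))
```

"Rome" and "Romq" share the substring "rom", while "Rome" and "Madrid" share no 3-gram at all.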

The other encoders are presented in Feature engineering for categorical data.

Next steps#

We have briefly covered pipeline creation, vectorizing, assembling, and encoding data. We presented the main functionalities of skrub, but there is much more to it!

Please refer to our User Guide for a more in-depth presentation of skrub’s concepts, or visit our examples for more illustrations of the tools that we provide!

