Getting Started

This guide showcases some of the features of skrub, an open-source package that aims to bridge the gap between tabular data stored in Pandas or Polars dataframes and machine-learning models.

Much of skrub is about simplifying the tasks involved in pre-processing raw data into the format that shallow or classic machine-learning models understand, that is, numerical data.

skrub does this by vectorizing, assembling, and encoding tabular data through a number of features that we present in this example and the ones that follow.

Downloading example datasets

The datasets module allows us to download tabular datasets and demonstrate skrub’s features.

Note

You can control the directory where the datasets are stored by:

  • setting the SKRUB_DATA_DIRECTORY environment variable to an absolute directory path,

  • using the data_directory parameter in fetch functions, which takes precedence over the environment variable.

By default, the datasets are stored in a folder named “skrub_data” in the user home folder.

Explore all the available datasets in Datasets.

Preliminary exploration and parsing of data

Typically, the first operations done on new data involve data exploration and parsing. To quickly get an overview of a dataframe’s contents, use the TableReport. Here, we also use the Cleaner, a transformer that cleans the dataframe by parsing nulls and dates, and by dropping “uninformative” columns (e.g., columns that contain too many nulls or that are constant).

(An interactive TableReport of the dataframe is displayed here; viewing it requires JavaScript.)



From the Report above, we can see that there are datetime columns, so we use the Cleaner to parse them.

(An interactive TableReport of the cleaned dataframe is displayed here; viewing it requires JavaScript.)



You can use the interactive display above to explore the dataset visually.

Note

You can see a few more example reports online. We also provide an experimental online demo that allows you to select a CSV or parquet file and generate a report directly in your web browser, without installing anything.

It is also possible to tell skrub to replace the default pandas and Polars displays with the TableReport by modifying the global config with set_config().

from skrub import set_config

set_config(use_table_report=True)

employees_df

(An interactive TableReport of employees_df is displayed here; viewing it requires JavaScript.)



This setting can easily be reverted:

set_config(use_table_report=False)

employees_df
gender department department_name division assignment_category employee_position_title date_first_hired year_first_hired
0 F POL Department of Police MSB Information Mgmt and Tech Division Records... Fulltime-Regular Office Services Coordinator 1986-09-22 1986
1 M POL Department of Police ISB Major Crimes Division Fugitive Section Fulltime-Regular Master Police Officer 1988-09-12 1988
2 F HHS Department of Health and Human Services Adult Protective and Case Management Services Fulltime-Regular Social Worker IV 1989-11-19 1989
3 M COR Correction and Rehabilitation PRRS Facility and Security Fulltime-Regular Resident Supervisor II 2014-05-05 2014
4 M HCA Department of Housing and Community Affairs Affordable Housing Programs Fulltime-Regular Planning Specialist III 2007-03-05 2007
... ... ... ... ... ... ... ... ...
9223 F HHS Department of Health and Human Services School Based Health Centers Fulltime-Regular Community Health Nurse II 2015-11-03 2015
9224 F FRS Fire and Rescue Services Human Resources Division Fulltime-Regular Fire/Rescue Division Chief 1988-11-28 1988
9225 M HHS Department of Health and Human Services Child and Adolescent Mental Health Clinic Serv... Parttime-Regular Medical Doctor IV - Psychiatrist 2001-04-30 2001
9226 M CCL County Council Council Central Staff Fulltime-Regular Manager II 2006-09-05 2006
9227 M DLC Department of Liquor Control Licensure, Regulation and Education Fulltime-Regular Alcohol/Tobacco Enforcement Specialist II 2012-01-30 2012

9228 rows × 8 columns



Easily building a strong baseline for tabular machine learning

The goal of skrub is to ease tabular data preparation for machine learning. The tabular_pipeline() function provides an easy way to build a simple but reliable machine learning model that works well on most tabular data.

from sklearn.model_selection import cross_validate

from skrub import tabular_pipeline

model = tabular_pipeline("regressor")
results = cross_validate(model, employees_df, salaries)
results["test_score"]
array([0.90515874, 0.88150207, 0.91658913, 0.9211787 , 0.92464814])

To handle rich tabular data and feed it to a machine learning model, the pipeline returned by tabular_pipeline() preprocesses and encodes strings, categories and dates using the TableVectorizer. See its documentation or Encoding: from a dataframe to a numerical matrix for machine learning for more details. An overview of the chosen defaults is available in Strong baseline pipelines.

Assembling data

skrub allows imperfect assembly of data, such as joining dataframes on columns that contain typos. skrub’s joiners have fit and transform methods, storing information about the data across calls.

The Joiner allows fuzzy-joining multiple tables: each row of a main table is augmented with values from the best match in the auxiliary table. You can control how distant fuzzy matches are allowed to be with the max_dist parameter.

In the following, we add information about countries to a table containing airports and the cities they are in:

import pandas as pd

from skrub import Joiner

airports = pd.DataFrame(
    {
        "airport_id": [1, 2],
        "airport_name": ["Charles de Gaulle", "Aeroporto Leonardo da Vinci"],
        "city": ["Paris", "Roma"],
    }
)
# Notice the "Rome" instead of "Roma"
capitals = pd.DataFrame(
    {"capital": ["Berlin", "Paris", "Rome"], "country": ["Germany", "France", "Italy"]}
)
joiner = Joiner(
    capitals,
    main_key="city",
    aux_key="capital",
    max_dist=0.8,
    add_match_info=False,
)
joiner.fit_transform(airports)
airport_id airport_name city capital country
0 1 Charles de Gaulle Paris Paris France
1 2 Aeroporto Leonardo da Vinci Roma Rome Italy


Information about countries has been added, even though the rows don’t match exactly.

skrub can also aggregate and join multiple tables according to various strategies: you can see other ways to join multiple tables in Assembling: joining multiple tables.

Encoding any data as numerical features

Tabular data can contain a variety of datatypes, ranging from numerical, to datetimes, to categories, strings, and text. Encoding features in a meaningful way requires a lot of effort and is a major part of the feature engineering process that is required to properly train machine learning models.

skrub helps with this by providing various transformers that automatically encode different datatypes into float32 features.

For numerical features, the SquashingScaler applies a robust scaling technique that is less sensitive to outliers. See the relevant example for more information on this feature.

For datetime columns, skrub provides the DatetimeEncoder which can extract useful features such as year, month, day, as well as additional features such as weekday or day of year. Periodic encoding with trigonometric or spline features is also available. Refer to the DatetimeEncoder documentation for more detail.

import pandas as pd

from skrub import Cleaner, TableReport

data = pd.DataFrame(
    {
        "event": ["A", "B", "C"],
        "date_1": ["2020-01-01", "2020-06-15", "2021-03-22"],
        "date_2": ["2020-01-15", "2020-07-01", "2021-04-05"],
    }
)
data = Cleaner().fit_transform(data)
TableReport(data)

(An interactive TableReport of the cleaned frame is displayed here; viewing it requires JavaScript.)



skrub transformers are applied column-by-column, but it is possible to use the ApplyToCols meta-transformer to apply a transformer to multiple columns at once. Complex column selection is possible using skrub’s column selectors.

from skrub import ApplyToCols, DatetimeEncoder

ApplyToCols(
    DatetimeEncoder(add_total_seconds=False), cols=["date_1", "date_2"]
).fit_transform(data)
event date_1_year date_1_month date_1_day date_2_year date_2_month date_2_day
0 A 2020.0 1.0 1.0 2020.0 1.0 15.0
1 B 2020.0 6.0 15.0 2020.0 7.0 1.0
2 C 2021.0 3.0 22.0 2021.0 4.0 5.0


Finally, when a column contains categorical or string data, it can be encoded using various encoders provided by skrub. The default encoder is the StringEncoder, which encodes categories using Latent Semantic Analysis (LSA). It is a simple and efficient way to encode categories, and works well in practice.

data = pd.DataFrame(
    {
        "city": ["Paris", "London", "Berlin", "Madrid", "Rome"],
        "country": ["France", "UK", "Germany", "Spain", "Italy"],
    }
)
TableReport(data)
from skrub import StringEncoder

StringEncoder(n_components=3).fit_transform(data["city"])
city_0 city_1 city_2
0 9.549681e-08 -2.087682e-08 1.525498e+00
1 1.014391e+00 -1.847013e-01 -8.526597e-08
2 8.380367e-01 9.361470e-01 1.514242e-07
3 2.806120e-01 5.614111e-01 -4.905649e-08
4 -7.191011e-01 1.049512e+00 3.392050e-08


If your data includes a lot of text, you may want to use the TextEncoder, which uses pre-trained language models retrieved from the HuggingFace hub to create meaningful text embeddings. See Feature engineering for categorical data for more details on all the categorical encoders provided by skrub, and Encoding: from a dataframe to a numerical matrix for machine learning for a comparison between the different methods.

Advanced use cases

If your use case involves more complex data preparation, hyperparameter tuning, or model selection; if you want to build a multi-table pipeline that requires assembling and preparing several tables; or if you want to make sure that the data preparation can be reproduced exactly, you can use the skrub Data Ops, a powerful framework that provides tools to build complex data processing pipelines. See the relevant user guide and the Skrub DataOps examples for more details.

Next steps

We have briefly covered pipeline creation, vectorizing, assembling, and encoding data. We presented the main functionalities of skrub, but there is much more to it!

Please refer to our User Guide for a more in-depth presentation of skrub’s concepts, or visit our examples for more illustrations of the tools that we provide!

Total running time of the script: (0 minutes 9.884 seconds)

Gallery generated by Sphinx-Gallery