Getting Started#

This guide showcases the features of skrub, an open-source package that aims to bridge the gap between tabular data sources and machine-learning models.

Much of skrub revolves around vectorizing, assembling, and encoding tabular data, preparing it in a format that shallow or classic machine-learning models understand.

Downloading example datasets#

The datasets module allows us to download tabular datasets and demonstrate skrub’s features.

Note

You can control the directory where the datasets are stored by:

  • setting the SKRUB_DATA_DIRECTORY environment variable to an absolute directory path,

  • passing the data_directory parameter to the fetch functions, which takes precedence over the environment variable.

By default, the datasets are stored in a folder named “skrub_data” in the user home folder.

Explore all the available datasets in the Downloading a dataset section of the documentation.

Generating an interactive report for a dataframe#

The Cleaner cleans a dataframe by parsing nulls and dates and dropping columns that contain too many nulls. To quickly get an overview of a dataframe’s contents, use the TableReport.




You can use the interactive display above to explore the dataset visually.

Note

You can see a few more example reports online. We also provide an experimental online demo that allows you to select a CSV or parquet file and generate a report directly in your web browser, without installing anything.

It is also possible to tell skrub to replace the default pandas & polars displays with TableReport.

from skrub import set_config

set_config(use_tablereport=True)

employees_df




This setting can easily be reverted:

set_config(use_tablereport=False)

employees_df
     | gender | department | department_name | division | assignment_category | employee_position_title | date_first_hired | year_first_hired
   0 | F | POL | Department of Police | MSB Information Mgmt and Tech Division Records... | Fulltime-Regular | Office Services Coordinator | 1986-09-22 | 1986
   1 | M | POL | Department of Police | ISB Major Crimes Division Fugitive Section | Fulltime-Regular | Master Police Officer | 1988-09-12 | 1988
   2 | F | HHS | Department of Health and Human Services | Adult Protective and Case Management Services | Fulltime-Regular | Social Worker IV | 1989-11-19 | 1989
   3 | M | COR | Correction and Rehabilitation | PRRS Facility and Security | Fulltime-Regular | Resident Supervisor II | 2014-05-05 | 2014
   4 | M | HCA | Department of Housing and Community Affairs | Affordable Housing Programs | Fulltime-Regular | Planning Specialist III | 2007-03-05 | 2007
 ... | ... | ... | ... | ... | ... | ... | ... | ...
9223 | F | HHS | Department of Health and Human Services | School Based Health Centers | Fulltime-Regular | Community Health Nurse II | 2015-11-03 | 2015
9224 | F | FRS | Fire and Rescue Services | Human Resources Division | Fulltime-Regular | Fire/Rescue Division Chief | 1988-11-28 | 1988
9225 | M | HHS | Department of Health and Human Services | Child and Adolescent Mental Health Clinic Serv... | Parttime-Regular | Medical Doctor IV - Psychiatrist | 2001-04-30 | 2001
9226 | M | CCL | County Council | Council Central Staff | Fulltime-Regular | Manager II | 2006-09-05 | 2006
9227 | M | DLC | Department of Liquor Control | Licensure, Regulation and Education | Fulltime-Regular | Alcohol/Tobacco Enforcement Specialist II | 2012-01-30 | 2012

9228 rows × 8 columns



Easily building a strong baseline for tabular machine learning#

The goal of skrub is to ease tabular data preparation for machine learning. The tabular_learner() function provides an easy way to build a simple yet reliable machine-learning model that works well on most tabular data.

from sklearn.model_selection import cross_validate

from skrub import tabular_learner

model = tabular_learner("regressor")
results = cross_validate(model, employees_df, salaries)
results["test_score"]
array([0.91129818, 0.88013711, 0.91451364, 0.92117174, 0.92487738])

To handle rich tabular data and feed it to a machine-learning model, the pipeline returned by tabular_learner() preprocesses and encodes strings, categories and dates using the TableVectorizer. See its documentation or Encoding: from a dataframe to a numerical matrix for machine learning for more details. An overview of the chosen defaults is available in End-to-end predictive models.

Assembling data#

Skrub allows imperfect assembly of data, such as joining dataframes on columns that contain typos. Skrub’s joiners have fit and transform methods, storing information about the data across calls.

The Joiner allows fuzzy-joining multiple tables: each row of a main table is augmented with values from the best match in the auxiliary table. You can control how distant fuzzy matches are allowed to be with the max_dist parameter.

In the following, we add information about countries to a table containing airports and the cities they are in:

import pandas as pd

from skrub import Joiner

airports = pd.DataFrame(
    {
        "airport_id": [1, 2],
        "airport_name": ["Charles de Gaulle", "Aeroporto Leonardo da Vinci"],
        "city": ["Paris", "Roma"],
    }
)
# notice the "Rome" instead of "Roma"
capitals = pd.DataFrame(
    {"capital": ["Berlin", "Paris", "Rome"], "country": ["Germany", "France", "Italy"]}
)
joiner = Joiner(
    capitals,
    main_key="city",
    aux_key="capital",
    max_dist=0.8,
    add_match_info=False,
)
joiner.fit_transform(airports)
   | airport_id | airport_name | city | capital | country
 0 | 1 | Charles de Gaulle | Paris | Paris | France
 1 | 2 | Aeroporto Leonardo da Vinci | Roma | Rome | Italy


Information about countries has been added, even though the rows don’t match exactly.

It’s also possible to augment data by joining and aggregating multiple dataframes with the AggJoiner. This is particularly useful for summarizing information scattered across tables, for instance adding statistics about flights to the dataframe of airports:

from skrub import AggJoiner

flights = pd.DataFrame(
    {
        "flight_id": range(1, 7),
        "from_airport": [1, 1, 1, 2, 2, 2],
        "total_passengers": [90, 120, 100, 70, 80, 90],
        "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
    }
)
agg_joiner = AggJoiner(
    aux_table=flights,
    main_key="airport_id",
    aux_key="from_airport",
    cols=["total_passengers"],  # the cols to perform aggregation on
    operations=["mean", "std"],  # the operations to compute
)
agg_joiner.fit_transform(airports)
   | airport_id | airport_name | city | total_passengers_mean | total_passengers_std
 0 | 1 | Charles de Gaulle | Paris | 103.333333 | 15.275252
 1 | 2 | Aeroporto Leonardo da Vinci | Roma | 80.000000 | 10.000000


For joining multiple auxiliary tables on a main table at once, use the MultiAggJoiner.

See other ways to join multiple tables in Assembling: joining multiple tables.

Encoding data#

When a column contains categories with variations and typos, it can be encoded using one of skrub’s encoders, such as the GapEncoder.

The GapEncoder creates a continuous encoding based on the activation of latent categories, built from combinations of substrings that frequently co-occur.

For instance, we might want to encode a column X that contains information about cities, either Madrid or Rome:

from skrub import GapEncoder

X = pd.Series(
    [
        "Rome, Italy",
        "Rome",
        "Roma, Italia",
        "Madrid, SP",
        "Madrid, spain",
        "Madrid",
        "Romq",
        "Rome, It",
    ],
    name="city",
)
enc = GapEncoder(n_components=2, random_state=0)  # 2 topics in the data
enc.fit(X)
GapEncoder(n_components=2, random_state=0)


The GapEncoder has found the following two topics:

['city: madrid, spain, sp', 'city: italia, italy, romq']

These correspond to the two cities.

Let’s see the activation of each topic depending on the rows of X:

encoded = enc.fit_transform(X).assign(original=X)
encoded
   | city: madrid, spain, sp | city: italia, italy, romq | original
 0 | 0.052257 | 13.547743 | Rome, Italy
 1 | 0.050202 | 3.049798 | Rome
 2 | 0.063282 | 15.036718 | Roma, Italia
 3 | 12.047028 | 0.052972 | Madrid, SP
 4 | 16.547818 | 0.052182 | Madrid, spain
 5 | 6.048861 | 0.051139 | Madrid
 6 | 0.050019 | 3.049981 | Romq
 7 | 0.053193 | 9.046807 | Rome, It


The higher the activation, the closer the row is to the latent topic. These columns can now be understood by a machine-learning model.

The other encoders are presented in Encoding: creating feature matrices.

Next steps#

We have briefly covered pipeline creation, vectorizing, assembling, and encoding data. We presented the main functionalities of skrub, but there is much more to it!

Please refer to our User guide for a more in-depth presentation of skrub’s concepts, or visit our examples for more illustrations of the tools that we provide!

Total running time of the script: (0 minutes 8.493 seconds)

Gallery generated by Sphinx-Gallery