Getting Started

This guide showcases the features of skrub, an open-source package that aims to bridge the gap between tabular data sources and machine-learning models.

Much of skrub revolves around vectorizing, assembling, and encoding tabular data, to prepare data in a format that shallow or classic machine-learning models understand.

Downloading example datasets

The datasets module allows us to download tabular datasets and demonstrate skrub’s features.

Explore all the available datasets in Downloading a dataset.

Generating an interactive report for a dataframe

To quickly get an overview of a dataframe’s contents, use the TableReport.

The report is an interactive display that lets you explore the dataset visually.

Note

You can see a few more example reports online. We also provide an experimental online demo that allows you to select a CSV or parquet file and generate a report directly in your web browser, without installing anything.

It is also possible to tell skrub to replace the default pandas and polars displays with TableReport by calling skrub.patch_display().

The effect of patch_display() can be undone with skrub.unpatch_display().

Easily building a strong baseline for tabular machine learning

The goal of skrub is to ease tabular data preparation for machine learning. The tabular_learner() function provides an easy way to build a simple but reliable machine-learning model that works well on most tabular data.

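from sklearn.model_selection import cross_validate

from skrub import tabular_learner

# employees_df (features) and salaries (target) are assumed to come from
# the example dataset downloaded at the start of this guide
model = tabular_learner("regressor")
# 5-fold cross-validation by default; for a regressor the score is R²
results = cross_validate(model, employees_df, salaries)
results["test_score"]
array([0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666])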

To handle rich tabular data and feed it to a machine-learning model, the pipeline returned by tabular_learner() preprocesses and encodes strings, categories and dates using the TableVectorizer. See its documentation or Encoding: from a dataframe to a numerical matrix for machine learning for more details. An overview of the chosen defaults is available in End-to-end predictive models.

Assembling data

Skrub allows imperfect assembly of data, such as joining dataframes on columns that contain typos. Skrub’s joiners have fit and transform methods, storing information about the data across calls.

The Joiner fuzzy-joins tables: each row of the main table is augmented with values from its best match in the auxiliary table. The max_dist parameter controls how distant a fuzzy match is allowed to be.

In the following, we add information about countries to a table containing airports and the cities they are in:

import pandas as pd

from skrub import Joiner

airports = pd.DataFrame(
    {
        "airport_id": [1, 2],
        "airport_name": ["Charles de Gaulle", "Aeroporto Leonardo da Vinci"],
        "city": ["Paris", "Roma"],
    }
)
# notice the "Rome" instead of "Roma"
capitals = pd.DataFrame(
    {"capital": ["Berlin", "Paris", "Rome"], "country": ["Germany", "France", "Italy"]}
)
joiner = Joiner(
    capitals,
    main_key="city",
    aux_key="capital",
    max_dist=0.8,
    add_match_info=False,
)
joiner.fit_transform(airports)
   airport_id                 airport_name   city  capital  country
0           1            Charles de Gaulle  Paris    Paris   France
1           2  Aeroporto Leonardo da Vinci   Roma     Rome    Italy


Information about countries has been added, even though the keys do not match exactly.
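To make the idea of a thresholded best match more concrete, here is a toy sketch that uses only the Python standard library. It is not how the Joiner is implemented: the string distance (based on difflib), the helper names string_distance and fuzzy_left_join, and the extra "Haneda" row are all assumptions made for illustration, and the threshold used here is not on the same scale as skrub's max_dist.

```python
from difflib import SequenceMatcher


def string_distance(a, b):
    """Return a distance in [0, 1]; 0 means the strings are identical."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_left_join(main_rows, aux_rows, main_key, aux_key, max_dist):
    """Attach the closest auxiliary row to each main row (toy version)."""
    joined = []
    for row in main_rows:
        # Find the auxiliary row whose key is closest to this row's key.
        best = min(
            aux_rows,
            key=lambda aux: string_distance(row[main_key], aux[aux_key]),
        )
        if string_distance(row[main_key], best[aux_key]) <= max_dist:
            joined.append({**row, **best})
        else:
            # Too far from every candidate: keep the row, leave aux fields empty.
            joined.append({**row, **{name: None for name in best}})
    return joined


airports = [
    {"airport_name": "Charles de Gaulle", "city": "Paris"},
    {"airport_name": "Aeroporto Leonardo da Vinci", "city": "Roma"},
    {"airport_name": "Haneda", "city": "Tokyo"},  # no matching capital below
]
capitals = [
    {"capital": "Berlin", "country": "Germany"},
    {"capital": "Paris", "country": "France"},
    {"capital": "Rome", "country": "Italy"},
]

for row in fuzzy_left_join(airports, capitals, "city", "capital", max_dist=0.5):
    print(row["city"], "->", row["capital"], row["country"])
```

With this distance, "Roma" is matched to "Rome", while "Tokyo" is farther than the threshold from every capital and gets empty values. Lowering the threshold makes matching stricter, which is the role max_dist plays in the Joiner.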

It’s also possible to augment data by joining and aggregating multiple dataframes with the AggJoiner. This is particularly useful to summarize information scattered across tables, for instance adding statistics about flights to the dataframe of airports:

from skrub import AggJoiner

flights = pd.DataFrame(
    {
        "flight_id": range(1, 7),
        "from_airport": [1, 1, 1, 2, 2, 2],
        "total_passengers": [90, 120, 100, 70, 80, 90],
        "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
    }
)
agg_joiner = AggJoiner(
    aux_table=flights,
    main_key="airport_id",
    aux_key="from_airport",
    cols=["total_passengers"],  # the cols to perform aggregation on
    operations=["mean", "std"],  # the operations to compute
)
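As a point of comparison, the kind of result an aggregate-then-join step produces can be sketched with plain pandas. This is not how AggJoiner is implemented, and the output column names (total_passengers_mean, total_passengers_std) are chosen here for illustration and may differ from the ones skrub generates.

```python
import pandas as pd

airports = pd.DataFrame(
    {
        "airport_id": [1, 2],
        "airport_name": ["Charles de Gaulle", "Aeroporto Leonardo da Vinci"],
        "city": ["Paris", "Roma"],
    }
)
flights = pd.DataFrame(
    {
        "flight_id": range(1, 7),
        "from_airport": [1, 1, 1, 2, 2, 2],
        "total_passengers": [90, 120, 100, 70, 80, 90],
        "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
    }
)

# Step 1: aggregate the auxiliary table per key (one row per airport).
stats = (
    flights.groupby("from_airport")["total_passengers"]
    .agg(["mean", "std"])
    .add_prefix("total_passengers_")  # illustrative column names
    .reset_index()
)

# Step 2: left-join the aggregates onto the main table.
augmented = airports.merge(
    stats, how="left", left_on="airport_id", right_on="from_airport"
).drop(columns="from_airport")

print(augmented)
```

Each airport row ends up with the mean and the sample standard deviation of the passenger counts of the flights departing from it.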