Skrub

skrub-data.org

Less wrangling, more machine learning

Machine learning and tabular data

  • ML expects numeric arrays
  • Real data is more complex:
    • Multiple tables
    • Dates, categories, text, locations, …

Skrub: bridge the gap between dataframes and scikit-learn.

Skrub helps at several stages of a tabular learning project

  1. What’s in the data? (EDA)
  2. Can we learn anything? (baselines)
  3. How do I represent the data? (feature extraction)
  4. How do I bring it all together? (building a pipeline)

1. What’s in the data?

import skrub
import pandas as pd

employee_salaries = pd.read_csv("https://figshare.com/ndownloader/files/51755822")
employees = employee_salaries.drop("current_annual_salary", axis=1)
salaries = employee_salaries["current_annual_salary"]
employees.iloc[0]
gender                                                                     F
department                                                               POL
department_name                                         Department of Police
division                   MSB Information Mgmt and Tech Division Records...
assignment_category                                         Fulltime-Regular
employee_position_title                          Office Services Coordinator
date_first_hired                                                  09/22/1986
year_first_hired                                                        1986
Name: 0, dtype: object

TableReport: interactive display of a dataframe

skrub.TableReport(employees, verbose=0)

(interactive TableReport of the employees dataframe displayed here)
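A report can also be saved outside the notebook. A minimal sketch (write_html is available in recent skrub versions):

report = skrub.TableReport(employees)
report.write_html("employees_report.html")  # standalone page, opens in any browser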

We can tell skrub to patch the default display of polars and pandas dataframes.

skrub.patch_display(verbose=0)
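The patch applies globally; recent skrub versions also provide skrub.unpatch_display() to restore the default display.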

2. Can we learn anything?

X, y = employees, salaries

tabular_learner: a pre-made robust baseline

learner = skrub.tabular_learner("regressor")
learner
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(high_cardinality=MinHashEncoder(),
                                 low_cardinality=ToCategorical())),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor(categorical_features='from_dtype'))])
from sklearn.model_selection import cross_val_score

cross_val_score(learner, X, y, scoring="r2")
array([0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666])
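The learner is a regular scikit-learn pipeline, so the usual fit / predict API applies:

learner.fit(X, y)
learner.predict(X.head())  # salary predictions for the first five employees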

The tabular_learner adapts to the supervised estimator we choose

from sklearn.linear_model import Ridge

learner = skrub.tabular_learner(Ridge())
learner
Pipeline(steps=[('tablevectorizer', TableVectorizer()),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler()), ('ridge', Ridge())])
cross_val_score(learner, X, y, scoring="r2")
array([0.7332394 , 0.71975525, 0.75090551, 0.74173251, 0.73988883])

3. How do I represent the data?

Skrub helps extract informative features from tabular data.

TableVectorizer: apply an appropriate transformer to each column

vectorizer = skrub.TableVectorizer()
transformed = vectorizer.fit_transform(X)

The TableVectorizer identifies several kinds of columns:

  • categorical, low cardinality
  • categorical, high cardinality
  • datetime
  • numeric
  • … we may add more
from pprint import pprint

pprint(vectorizer.column_to_kind_)
{'assignment_category': 'low_cardinality',
 'date_first_hired': 'datetime',
 'department': 'low_cardinality',
 'department_name': 'low_cardinality',
 'division': 'high_cardinality',
 'employee_position_title': 'high_cardinality',
 'gender': 'low_cardinality',
 'year_first_hired': 'numeric'}

For each kind, it applies an appropriate transformer

vectorizer.transformers_["department"]  # low-cardinality categorical
OneHotEncoder(drop='if_binary', dtype='float32', handle_unknown='ignore',
              sparse_output=False)
vectorizer.transformers_["employee_position_title"]  # high-cardinality categorical
GapEncoder(n_components=30)
vectorizer.transformers_["date_first_hired"]  # datetime
DatetimeEncoder()

… and those transformers turn the input into numeric features that can be used for ML

transformed[vectorizer.input_to_outputs_["date_first_hired"]]

(preview of the extracted datetime features displayed here)

For high-cardinality categorical columns the default GapEncoder identifies sparse topics (more later).

transformed[vectorizer.input_to_outputs_["employee_position_title"]]

(preview of the GapEncoder topic features displayed here)

The transformer used for each column kind can be easily configured.
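For instance, a sketch reusing the parameters seen in the baseline pipeline above (the DatetimeEncoder resolution option is an assumption that may vary across versions):

vectorizer = skrub.TableVectorizer(
    high_cardinality=skrub.MinHashEncoder(),  # instead of the default GapEncoder
    datetime=skrub.DatetimeEncoder(resolution="day"),  # coarser datetime features
)
transformed = vectorizer.fit_transform(X)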

Preprocessing in the TableVectorizer

The TableVectorizer actually performs a lot of preprocessing before applying the final transformers, such as:

  • ensuring consistent column names
  • detecting missing values such as "N/A"
  • dropping empty columns
  • handling pandas dtypes (float64 with nan vs. nullable Float64 with NA)
  • parsing numbers
  • parsing dates, ensuring consistent dtype and timezone
  • converting numbers to float32 for faster computation & less memory downstream
pprint(vectorizer.all_processing_steps_["date_first_hired"])
[CleanNullStrings(),
 DropIfTooManyNulls(),
 ToDatetime(),
 DatetimeEncoder(),
 {'date_first_hired_day': ToFloat32(), 'date_first_hired_month': ToFloat32(), ...}]

Extracting good features

Skrub offers several encoders to extract features from different kinds of columns, in particular categorical ones.

GapEncoder

Categories are somewhere between text and an enumeration… The GapEncoder is somewhere between a topic model and a one-hot encoder!

import seaborn as sns
from matplotlib import pyplot as plt

gap = skrub.GapEncoder()
pos_title = X["employee_position_title"]
# loadings of the first 5 job titles on the learned latent topics
loadings = gap.fit_transform(pos_title).set_index(pos_title.values).head()

# keep only the topic labels, dropping the "employee_position_title: " prefix
loadings.columns = [c.split(": ")[1] for c in loadings.columns]
sns.heatmap(loadings)
_ = plt.setp(plt.gca().get_xticklabels(), rotation=45, ha="right")

TextEncoder

Extract embeddings from a text column using any model from the HuggingFace Hub.

# a small vocabulary mixing travel-related and fruit-related words
words = pd.Series(["airport", "flight", "plane", "pineapple", "fruit"])
encoder = skrub.TextEncoder(model_name="all-MiniLM-L6-v2", n_components=None)
embeddings = encoder.fit_transform(words).set_index(words.values)

sns.heatmap(embeddings @ embeddings.T)  # pairwise similarities between the words

MinHashEncoder

A fast, stateless way of encoding strings that works especially well with models based on decision trees (gradient boosting, random forest).
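A minimal sketch on the job-title column (since min-hashing is stateless, fitting learns nothing):

minhash = skrub.MinHashEncoder(n_components=10)
hashed = minhash.fit_transform(X["employee_position_title"])
hashed.shape  # one row per employee, 10 min-hash features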

4. How do I bring it all together?

Skrub has several transformers that allow performing typical dataframe operations, such as projections, joins and aggregations, inside a scikit-learn pipeline.

Performing these operations in the machine-learning pipeline has several advantages:

  • Choices / hyperparameters can be optimized
  • Relevant state can be stored to ensure consistent transformations
  • All transformations are packaged together in an estimator

There are several transformers, such as SelectCols, Joiner (fuzzy joining; a sketch follows the AggJoiner example below), InterpolationJoiner, AggJoiner, …

A toy example using the AggJoiner:

from skrub import AggJoiner

airports = pd.DataFrame(
    {
        "airport_id": [1, 2],
        "airport_name": ["Charles de Gaulle", "Aeroporto Leonardo da Vinci"],
        "city": ["Paris", "Roma"],
    }
)
airports

(display of the airports dataframe)

flights = pd.DataFrame(
    {
        "flight_id": range(1, 7),
        "from_airport": [1, 1, 1, 2, 2, 2],
        "total_passengers": [90, 120, 100, 70, 80, 90],
        "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
    }
)
flights

(display of the flights dataframe)

agg_joiner = AggJoiner(
    aux_table=flights,
    main_key="airport_id",
    aux_key="from_airport",
    cols=["total_passengers"],
    operations=["mean", "std"],
)
agg_joiner.fit_transform(airports)

(the airports table augmented with the mean and std of total_passengers per airport)
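The Joiner is similar but matches keys approximately. A sketch on the same airports table (the capitals table and its contents are made up for the example):

from skrub import Joiner

capitals = pd.DataFrame(
    {"city_name": ["Paris", "Rome"], "country": ["France", "Italy"]}
)
joiner = Joiner(capitals, main_key="city", aux_key="city_name")
joiner.fit_transform(airports)  # "Roma" should fuzzy-match "Rome"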

More interactive and expressive pipelines

To go beyond what scikit-learn Pipelines and the skrub transformers shown above can express, we are developing new utilities to easily define and inspect flexible pipelines that process several dataframes.

A prototype will be shown in a separate notebook.