Hands-On with Column Selection and Transformers#

In previous examples, we saw how skrub provides powerful abstractions like TableVectorizer and tabular_learner() to create pipelines.

In this example, we show how to build more flexible pipelines by selecting and transforming dataframe columns with arbitrary logic.

We begin by loading a dataset with heterogeneous datatypes and replacing the default pandas display with the TableReport display via skrub.set_config().

import skrub
from skrub.datasets import fetch_employee_salaries

skrub.set_config(use_tablereport=True)
data = fetch_employee_salaries()
X, y = data.X, data.y
X

Our goal is now to apply a StringEncoder to two columns of our choosing: division and employee_position_title.

We can achieve this using ApplyToCols, whose job is to apply a transformer to each selected column independently and to let unmatched columns through without changes. It can be seen as a handy drop-in replacement for scikit-learn's ColumnTransformer.

Since we select two columns and set the number of components to 30, ApplyToCols creates 2 × 30 = 60 embedding columns in the output dataframe Xt, whose names we prefix with lsa_ via rename_columns.

from skrub import ApplyToCols, StringEncoder

apply_string_encoder = ApplyToCols(
    StringEncoder(n_components=30),
    cols=["division", "employee_position_title"],
    rename_columns="lsa_{}",
)
Xt = apply_string_encoder.fit_transform(X)
Xt
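
As a quick sanity check on the claim above, we can count the newly created embedding columns (a small verification snippet, not part of the original example):

lsa_cols = [c for c in Xt.columns if c.startswith("lsa_")]
print(len(lsa_cols))  # 2 columns x 30 components = 60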

In addition to the ApplyToCols class, the ApplyToFrame class is useful for transformers that operate on multiple columns at once, such as PCA, which jointly reduces the dimensionality of the selected columns.

To select columns without hardcoding their names, we introduce selectors, which allow for flexible matching patterns and composable logic.

The regex selector below matches all columns prefixed with "lsa" and passes them to ApplyToFrame, which assembles these columns into a dataframe and finally hands it to the PCA.

from sklearn.decomposition import PCA

from skrub import ApplyToFrame
from skrub import selectors as s

apply_pca = ApplyToFrame(PCA(n_components=8), cols=s.regex("lsa"))
Xt = apply_pca.fit_transform(Xt)
Xt

These two transformers are scikit-learn compatible and can be chained together within a Pipeline.

from sklearn.pipeline import make_pipeline

Xt = make_pipeline(
    apply_string_encoder,
    apply_pca,
).fit_transform(X)

Note that selectors also come in handy in a pipeline to select or drop columns, using SelectCols and DropCols; a DropCols variant is sketched after the example below.

from sklearn.preprocessing import StandardScaler

from skrub import SelectCols

# Select only numerical columns
pipeline = make_pipeline(
    SelectCols(cols=s.numeric()),
    StandardScaler(),
).set_output(transform="pandas")
pipeline.fit_transform(Xt)
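
DropCols works the other way around: instead of listing the columns to keep, we list the columns to remove. Assuming DropCols accepts selectors the same way SelectCols does, the sketch below drops every non-numeric column and should yield the same result:

from skrub import DropCols

# Drop everything that is *not* numeric, rather than selecting numeric columns
pipeline = make_pipeline(
    DropCols(cols=~s.numeric()),
    StandardScaler(),
).set_output(transform="pandas")
pipeline.fit_transform(Xt)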

Let’s run through one more example to showcase the expressiveness of the selectors. Suppose we want to apply an OrdinalEncoder on categorical columns with low cardinality (e.g., fewer than 40 unique values).

We define a column filter using skrub selectors with a lambda function. Note that the same effect can be obtained directly with the built-in cardinality_below() selector, as sketched after the example below.

from sklearn.preprocessing import OrdinalEncoder

low_cardinality = s.filter(lambda col: col.nunique() < 40)
ApplyToCols(OrdinalEncoder(), cols=s.string() & low_cardinality).fit_transform(X)
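
For reference, here is the same selection written with the built-in cardinality_below() selector instead of a lambda (assuming a threshold of 40 reproduces the filter above):

# Same selection using the built-in cardinality_below() selector
ApplyToCols(
    OrdinalEncoder(), cols=s.string() & s.cardinality_below(40)
).fit_transform(X)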

Notice how we composed the selector with string() using the & (logical and) operator. The resulting selector matches string columns with cardinality below 40.
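
Selectors support the usual boolean operators (& for intersection, | for union, ~ for negation), and their expand() method lets us inspect which columns a selector matches on a given dataframe. A minimal sketch:

# Which columns does each selector match on X?
print(s.string().expand(X))
print((s.string() & low_cardinality).expand(X))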

We can also define the opposite selector high_cardinality using the negation operator ~ and apply a skrub.StringEncoder to vectorize those columns.

from sklearn.ensemble import HistGradientBoostingRegressor

high_cardinality = ~low_cardinality
pipeline = make_pipeline(
    ApplyToCols(
        OrdinalEncoder(),
        cols=s.string() & low_cardinality,
    ),
    ApplyToCols(
        StringEncoder(),
        cols=s.string() & high_cardinality,
    ),
    HistGradientBoostingRegressor(),
).fit(X, y)
pipeline
Pipeline(steps=[('applytocols-1',
                 ApplyToCols(cols=(string() & filter(<lambda>)),
                             transformer=OrdinalEncoder())),
                ('applytocols-2',
                 ApplyToCols(cols=(string() & (~filter(<lambda>))),
                             transformer=StringEncoder())),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor())])
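
As a quick sanity check, the full pipeline can be evaluated with cross-validation; a minimal sketch (the exact scores depend on your environment):

from sklearn.model_selection import cross_val_score

# R^2 scores (the default for regressors), estimated with 3-fold cross-validation
scores = cross_val_score(pipeline, X, y, cv=3)
print(scores.mean())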


Interestingly, the pipeline above is similar to the datatype dispatching performed by TableVectorizer, also used in tabular_learner().

Click on the dropdown arrows next to each datatype to see how the columns are mapped to the different transformers in TableVectorizer.

from skrub import tabular_learner

tabular_learner("regressor").fit(X, y)
/home/circleci/project/skrub/_tabular_pipeline.py:75: FutureWarning:

tabular_learner will be deprecated in the next release. Equivalent functionality is available in skrub.set_config.
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(low_cardinality=ToCategorical())),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor())])

