Skrub Selectors: helpers for selecting columns in a dataframe#
In Skrub, a selector represents a column selection rule, such as “all columns that have
numerical data types, except the column 'User ID'
”.
Selectors have two main benefits:
Expressing complex selection rules in a simple and concise way by combining selectors with operators. A range of useful selectors is provided by this module.
Delayed selection: passing a selection rule which will evaluated later on a dataframe that is not yet available. For example, without selectors, it is not possible to instantiate a
SelectCols
that selects all columns except those with the suffix ‘ID’ if the data on which it will be fitted is not yet available.
Usage#
Here is an example dataframe. Note that selectors support both Pandas and Polars dataframes:
>>> import pandas as pd
>>> df = pd.DataFrame(
... {
... "height_mm": [297.0, 420.0],
... "width_mm": [210.0, 297.0],
... "kind": ["A4", "A3"],
... "ID": [4, 3],
... }
... )
cols()
is a simple kind of selector which selects a fixed list of
column names:
>>> from skrub import selectors as s
>>> mm_cols = s.cols('height_mm', 'width_mm')
>>> mm_cols
cols('height_mm', 'width_mm')
This selector can then be passed to a select()
function:
>>> s.select(df, mm_cols)
height_mm width_mm
0 297.0 210.0
1 420.0 297.0
It can also be passed to SelectCols
or DropCols
to be embedded in scikit-learn pipelines:
>>> from skrub import SelectCols
>>> SelectCols(cols=mm_cols).fit_transform(df)
height_mm width_mm
0 297.0 210.0
1 420.0 297.0
Last but not least, selectors can be passed to skrub expressions when applying an
estimator with the skrub.Expr.skb.apply()
function:
>>> import skrub
>>> from sklearn.preprocessing import StandardScaler
>>> skrub.X(df).skb.apply(StandardScaler(), cols=mm_cols)
<Apply StandardScaler>
Result:
―――――――
kind ID height_mm width_mm
0 A4 4 -1.0 -1.0
1 A3 3 1.0 1.0
Type of selectors#
all()
is another simple selector, especially useful for default
arguments since it keeps all columns:
>>> SelectCols(cols=s.all()).fit_transform(df)
height_mm width_mm kind ID
0 297.0 210.0 A4 4
1 420.0 297.0 A3 3
Selectors can be combined with operators, for example if we wanted all columns except the “mm” columns above:
>>> SelectCols(s.all() - s.cols("height_mm", "width_mm")).fit_transform(df)
kind ID
0 A4 4
1 A3 3
This module provides several kinds of selectors, which allow to select columns by name, data type, contents, or according to arbitrary user-provided rules.
>>> SelectCols(s.numeric()).fit_transform(df)
height_mm width_mm ID
0 297.0 210.0 4
1 420.0 297.0 3
>>> SelectCols(s.glob('*_mm')).fit_transform(df)
height_mm width_mm
0 297.0 210.0
1 420.0 297.0
See Selecting columns in a DataFrame for an exhaustive list.
The available operators are |
, &
, -
, ^
with the meaning of usual
python sets, and ~
to invert a selection.
>>> SelectCols(s.glob('*_mm')).fit_transform(df)
height_mm width_mm
0 297.0 210.0
1 420.0 297.0
>>> SelectCols(~s.glob('*_mm')).fit_transform(df)
kind ID
0 A4 4
1 A3 3
>>> SelectCols(s.glob('*_mm') | s.cols('ID')).fit_transform(df)
height_mm width_mm ID
0 297.0 210.0 4
1 420.0 297.0 3
>>> SelectCols(s.glob('*_mm') & s.glob('height_*')).fit_transform(df)
height_mm
0 297.0
1 420.0
>>> SelectCols(s.glob('*_mm') ^ s.string()).fit_transform(df)
height_mm width_mm kind
0 297.0 210.0 A4
1 420.0 297.0 A3
The operators respect the usual short-circuit rules. For example, the following selector won’t compute the cardinality of non-categorical columns:
>>> s.categorical() & s.cardinality_below(10)
(categorical() & cardinality_below(10))
Advanced selectors: filter and filter_names#
skrub.selectors.filter()
and skrub.selectors.filter_names()
allow
selecting columns based on arbitrary user-defined criteria. These are also used to
implement many of the other selectors provided in this module.
skrub.selectors.filter()
accepts a function which will be called on a column
(i.e., a Pandas or polars Series). This function, called a predicate, must return
True
if the column should be selected.
>>> s.select(df, s.filter(lambda col: "A4" in col.tolist()))
kind
0 A4
1 A3
skrub.selectors.filter_names()
accepts a predicate that is passed the column name,
instead of the column.
>>> s.select(df, s.filter_names(lambda name: name.endswith('mm')))
height_mm width_mm
0 297.0 210.0
1 420.0 297.0
We can pass args and kwargs that will be forwarded to the predicate, to help avoid lambda or local functions and thus ensure the selector is picklable.
>>> s.select(df, s.filter_names(str.endswith, 'mm'))
height_mm width_mm
0 297.0 210.0
1 420.0 297.0