Very often, column selection is more complex than simply passing a list of column names to a transformer: it may be necessary to select all columns that have a specific data type, or based on some other characteristic (presence of nulls, column cardinality, etc.).
The skrub selectors implement a number of selection strategies that can be combined in various ways to build complex filtering conditions that can then be employed by ApplyToCols, ApplyToFrame, SelectCols and DropCols.
Selectors are available from the skrub.selectors namespace:
We will use this example dataframe to test some of the selectors:
```python
import datetime

import pandas as pd

data = {
    "int": [15, 56, 63, 12, 44],
    "float": [5.2, 2.4, 6.2, 10.45, 9.0],
    "str1": ["public", "private", None, "private", "public"],
    "str2": ["officer", "manager", "lawyer", "chef", "teacher"],
    "bool": [True, False, True, False, True],
    "cat1": pd.Categorical(["yes", "yes", None, "yes", "no"]),
    "cat2": pd.Categorical(["20K+", "40K+", "60K+", "30K+", "50K+"]),
    "datetime-col": [
        datetime.datetime.fromisoformat(dt)
        for dt in [
            "2020-02-03T12:30:05",
            "2021-03-15T00:37:15",
            "2022-02-13T17:03:25",
            "2023-05-22T08:45:55",
        ]
    ]
    + [None],
}
df = pd.DataFrame(data)
df
```

|   | int | float | str1 | str2 | bool | cat1 | cat2 | datetime-col |
|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 5.20 | public | officer | True | yes | 20K+ | 2020-02-03 12:30:05 |
| 1 | 56 | 2.40 | private | manager | False | yes | 40K+ | 2021-03-15 00:37:15 |
| 2 | 63 | 6.20 | None | lawyer | True | NaN | 60K+ | 2022-02-13 17:03:25 |
| 3 | 12 | 10.45 | private | chef | False | yes | 30K+ | 2023-05-22 08:45:55 |
| 4 | 44 | 9.00 | public | teacher | True | no | 50K+ | NaT |
Selectors should be used in conjunction with the transformers described in the previous chapter: ApplyToCols, ApplyToFrame, SelectCols and DropCols.
Selectors allow filtering columns by data type:
- `.float`: floating-point columns
- `.integer`: integer columns
- `.any_date`: date or datetime columns
- `.boolean`: boolean columns
- `.string`: columns with a string data type
- `.categorical`: columns with a Categorical data type
- `.numeric`: numeric (either integer or float) columns

For example, selecting the string columns of the example dataframe keeps only str1 and str2:

|   | str1 | str2 |
|---|---|---|
| 0 | public | officer |
| 1 | private | manager |
| 2 | None | lawyer |
| 3 | private | chef |
| 4 | public | teacher |
Additional conditions include:
- `.all`: select all columns
- `.cardinality_below`: select all columns with a number of unique values lower than the given threshold
- `.has_nulls`: select all columns that include at least one null value

For example, selecting the columns that contain at least one null value:

|   | str1 | cat1 | datetime-col |
|---|---|---|---|
| 0 | public | yes | 2020-02-03 12:30:05 |
| 1 | private | yes | 2021-03-15 00:37:15 |
| 2 | None | NaN | 2022-02-13 17:03:25 |
| 3 | private | yes | 2023-05-22 08:45:55 |
| 4 | public | no | NaT |
Several selectors choose columns based on their name:

- `.cols`: select the provided column name (or list of names)
- `.glob`: select column names using Unix shell-style globs
- `.regex`: select column names using regular expressions

Selectors can be inverted using `.inv` or the logical operator `~` to select all other columns, and they can be combined using the `&` and `|` logical operators. It is also possible to remove columns from a selection with `-`; for example, to select all columns except for "datetime-col", one would write:
|   | int | float | str1 | str2 | bool | cat1 | cat2 |
|---|---|---|---|---|---|---|---|
| 0 | 15 | 5.20 | public | officer | True | yes | 20K+ |
| 1 | 56 | 2.40 | private | manager | False | yes | 40K+ |
| 2 | 63 | 6.20 | None | lawyer | True | NaN | 60K+ |
| 3 | 12 | 10.45 | private | chef | False | yes | 30K+ |
| 4 | 44 | 9.00 | public | teacher | True | no | 50K+ |
To select all datetime columns OR all string columns that do not contain nulls, we can do:
Selectors provide the expand and expand_index methods, which return the names (or positional indices) of the columns that a selector matches on a given dataframe:
This can be used, for example, to pass a list of columns to a dataframe library.
Finally, it is possible to define function-based selectors using .filter and .filter_names.
.filter selects the columns for which a user-defined predicate function, called on each column, returns True. Extra arguments can also be passed to the function to tweak the condition: for example, we can select the columns whose share of null values stays below a given threshold by defining a function like the following:
|   | b |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
.filter_names is similar to .filter: it also takes a predicate function, but in this case the function is evaluated on the column names rather than on the columns themselves.
If we define this example dataframe:
|   | height_mm | width_mm | kind | ID |
|---|---|---|---|---|
| 0 | 297.0 | 210.0 | A4 | 4 |
| 1 | 420.0 | 297.0 | A3 | 3 |
We can select all the columns that end with "_mm" as follows:
In this chapter we covered the skrub selectors and how they select specific columns, either through simple conditions or by combining different selectors. More use cases of ApplyToCols and the selectors will appear throughout the rest of the course, starting with the next chapter on feature engineering.
Path to the exercise: content/exercises/04_selectors.ipynb
Consider this example dataframe:
```python
import pandas as pd

df = pd.DataFrame(
    {
        "metric_1": [10.5, 20.3, 30.1, 40.2],
        "metric_2": [5.1, 15.6, None, 35.8],
        "metric_3": [1.1, 3.3, 2.6, 0.8],
        "num_id": [101, 102, 103, 104],
        "str_id": ["A101", "A102", "A103", "A104"],
        "description": ["apple", None, "cherry", "date"],
        "name": ["Alice", "Bob", "Charlie", "David"],
    }
)
df
```

|   | metric_1 | metric_2 | metric_3 | num_id | str_id | description | name |
|---|---|---|---|---|---|---|---|
| 0 | 10.5 | 5.1 | 1.1 | 101 | A101 | apple | Alice |
| 1 | 20.3 | 15.6 | 3.3 | 102 | A102 | None | Bob |
| 2 | 30.1 | NaN | 2.6 | 103 | A103 | cherry | Charlie |
| 3 | 40.2 | 35.8 | 0.8 | 104 | A104 | date | David |
Using the skrub selectors and ApplyToCols:

- apply StandardScaler to the numeric columns, except "num_id";
- apply OneHotEncoder with sparse_output=False to all string columns except "str_id".

```python
import skrub.selectors as s
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from skrub import ApplyToCols

scaler = ApplyToCols(StandardScaler(), cols=s.numeric() - "num_id")
one_hot = ApplyToCols(OneHotEncoder(sparse_output=False), cols=s.string() - "str_id")
transformer = make_pipeline(scaler, one_hot)
transformer.fit_transform(df)
```

|   | metric_1 | metric_2 | metric_3 | num_id | str_id | description_apple | description_cherry | description_date | description_None | name_Alice | name_Bob | name_Charlie | name_David |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.336178 | -1.077965 | -0.820768 | 101 | A101 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | -0.449914 | -0.253793 | 1.303572 | 102 | A102 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.436349 | NaN | 0.627646 | 103 | A103 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 1.349743 | 1.331758 | -1.110450 | 104 | A104 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Given the same dataframe and using selectors, drop only string columns that contain nulls.
|   | metric_1 | metric_2 | metric_3 | num_id | str_id | name |
|---|---|---|---|---|---|---|
| 0 | 10.5 | 5.1 | 1.1 | 101 | A101 | Alice |
| 1 | 20.3 | 15.6 | 3.3 | 102 | A102 | Bob |
| 2 | 30.1 | NaN | 2.6 | 103 | A103 | Charlie |
| 3 | 40.2 | 35.8 | 0.8 | 104 | A104 | David |
Now write a custom function that selects columns where all values are lower than 10.0.