7 Choose your column: selectors
8 Introduction
Very often, column selection is more complex than simply passing a list of column names to a transformer: it may be necessary to select all columns with a specific data type, or all columns that share some other characteristic (presence of nulls, column cardinality, etc.).
The skrub selectors implement a number of selection strategies that can be combined in various ways to build complex filtering conditions that can then be employed by ApplyToCols, ApplyToFrame, SelectCols and DropCols.
8.1 Skrub selectors
Selectors are available from the skrub.selectors namespace:
import skrub.selectors as s
We will use this example dataframe to test some of the selectors:
import pandas as pd
import datetime
data = {
    "int": [15, 56, 63, 12, 44],
    "float": [5.2, 2.4, 6.2, 10.45, 9.0],
    "str1": ["public", "private", None, "private", "public"],
    "str2": ["officer", "manager", "lawyer", "chef", "teacher"],
    "bool": [True, False, True, False, True],
    "cat1": pd.Categorical(["yes", "yes", None, "yes", "no"]),
    "cat2": pd.Categorical(["20K+", "40K+", "60K+", "30K+", "50K+"]),
    "datetime-col": [
        datetime.datetime.fromisoformat(dt)
        for dt in [
            "2020-02-03T12:30:05",
            "2021-03-15T00:37:15",
            "2022-02-13T17:03:25",
            "2023-05-22T08:45:55",
        ]
    ]
    + [None],
}
df = pd.DataFrame(data)
df
|   | int | float | str1 | str2 | bool | cat1 | cat2 | datetime-col |
|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 5.20 | public | officer | True | yes | 20K+ | 2020-02-03 12:30:05 |
| 1 | 56 | 2.40 | private | manager | False | yes | 40K+ | 2021-03-15 00:37:15 |
| 2 | 63 | 6.20 | None | lawyer | True | NaN | 60K+ | 2022-02-13 17:03:25 |
| 3 | 12 | 10.45 | private | chef | False | yes | 30K+ | 2023-05-22 08:45:55 |
| 4 | 44 | 9.00 | public | teacher | True | no | 50K+ | NaT |
Selectors should be used in conjunction with the transformers described in the previous chapter: ApplyToCols, ApplyToFrame, SelectCols and DropCols.
Selectors allow filtering columns by data type:
- .float: floating-point columns
- .integer: integer columns
- .any_date: date or datetime columns
- .boolean: boolean columns
- .string: columns with a String data type
- .categorical: columns with a Categorical data type
- .numeric: numeric (either integer or float) columns
from skrub import SelectCols
string_selector = s.string()
SelectCols(cols=string_selector).fit_transform(df)
|   | str1 | str2 |
|---|---|---|
| 0 | public | officer |
| 1 | private | manager |
| 2 | None | lawyer |
| 3 | private | chef |
| 4 | public | teacher |
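The other data-type selectors work the same way. As a quick sketch (not from the original examples), s.categorical() should keep only the two Categorical columns of the example dataframe:
SelectCols(cols=s.categorical()).fit_transform(df)  # expected to keep "cat1" and "cat2"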
Additional conditions include:
- .all: select all columns
- .cardinality_below: select all columns with a number of unique values lower than the given threshold
- .has_nulls: select all columns that include at least one null value
SelectCols(cols=s.has_nulls()).fit_transform(df)
|   | str1 | cat1 | datetime-col |
|---|---|---|---|
| 0 | public | yes | 2020-02-03 12:30:05 |
| 1 | private | yes | 2021-03-15 00:37:15 |
| 2 | None | NaN | 2022-02-13 17:03:25 |
| 3 | private | yes | 2023-05-22 08:45:55 |
| 4 | public | no | NaT |
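As an additional sketch (not shown above), cardinality_below takes the threshold as an argument; on the example dataframe it would keep low-cardinality columns such as "bool":
SelectCols(cols=s.cardinality_below(3)).fit_transform(df)  # columns with fewer than 3 unique values, e.g. "bool"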
Various selectors allow choosing columns based on their name:
- .cols: choose the provided column name (or list of names); note that transformers that accept selectors can also take column names or lists of column names directly
- .glob: use Unix shell-style glob patterns to select column names
- .regex: select columns using regular expressions
SelectCols(cols=s.glob("cat*")).fit_transform(df)
|   | cat1 | cat2 |
|---|---|---|
| 0 | yes | 20K+ |
| 1 | yes | 40K+ |
| 2 | NaN | 60K+ |
| 3 | yes | 30K+ |
| 4 | no | 50K+ |
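A similar selection can be written with a regular expression. This is a sketch rather than an example from the original text:
SelectCols(cols=s.regex("str.*")).fit_transform(df)  # should keep "str1" and "str2"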
8.2 Combining selectors
Selectors can be inverted using .inv or the logical operator ~ to select all other columns, and they can be combined using the & and | logical operators. It is also possible to remove columns from a selection with -; for example, to select all columns except for “datetime-col”, one would write:
SelectCols(cols=s.all() - "datetime-col").fit_transform(df)
|   | int | float | str1 | str2 | bool | cat1 | cat2 |
|---|---|---|---|---|---|---|---|
| 0 | 15 | 5.20 | public | officer | True | yes | 20K+ |
| 1 | 56 | 2.40 | private | manager | False | yes | 40K+ |
| 2 | 63 | 6.20 | None | lawyer | True | NaN | 60K+ |
| 3 | 12 | 10.45 | private | chef | False | yes | 30K+ |
| 4 | 44 | 9.00 | public | teacher | True | no | 50K+ |
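The same result can also be obtained by inverting a name-based selector with ~; a brief sketch:
SelectCols(cols=~s.cols("datetime-col")).fit_transform(df)  # everything except "datetime-col"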
To select all datetime columns OR all string columns that do not contain nulls, we can do:
SelectCols(cols=s.any_date() | (s.string() & ~s.has_nulls())).fit_transform(df)
|   | str2 | datetime-col |
|---|---|---|
| 0 | officer | 2020-02-03 12:30:05 |
| 1 | manager | 2021-03-15 00:37:15 |
| 2 | lawyer | 2022-02-13 17:03:25 |
| 3 | chef | 2023-05-22 08:45:55 |
| 4 | teacher | NaT |
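Combined selectors can be passed to the transformers from the previous chapter in exactly the same way. The following sketch (assuming scikit-learn's StandardScaler) scales only the numeric columns that contain no nulls:
from sklearn.preprocessing import StandardScaler
from skrub import ApplyToCols
ApplyToCols(StandardScaler(), cols=s.numeric() & ~s.has_nulls()).fit_transform(df)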
8.3 Extracting selected columns
Selectors provide the expand and expand_index methods, which return the columns that have been selected from a given dataframe:
has_nulls = s.has_nulls()
has_nulls.expand(df)
['str1', 'cat1', 'datetime-col']
This can be used, for example, to pass a list of column names directly to a dataframe method:
df.drop(columns=has_nulls.expand(df))
|   | int | float | str2 | bool | cat2 |
|---|---|---|---|---|---|
| 0 | 15 | 5.20 | officer | True | 20K+ |
| 1 | 56 | 2.40 | manager | False | 40K+ |
| 2 | 63 | 6.20 | lawyer | True | 60K+ |
| 3 | 12 | 10.45 | chef | False | 30K+ |
| 4 | 44 | 9.00 | teacher | True | 50K+ |
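expand_index works in the same way but is assumed here to return the positional indices of the selected columns rather than their names; a minimal sketch:
has_nulls.expand_index(df)  # assumed to return the positions of 'str1', 'cat1' and 'datetime-col'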
8.4 Designing custom filters
Finally, it is possible to define function-based selectors using .filter and .filter_names.
.filter selects the columns for which a user-defined predicate function, evaluated on each column, returns True. Additional arguments can be passed to the function to further tweak the condition. For example, it is possible to select columns that contain more than a given fraction of nulls by defining a function like the following:
import pandas as pd
import skrub.selectors as s
from skrub import DropCols
df = pd.DataFrame({"a": [None, None, None, 1], "b": [1,2,3,4]})
def more_nulls_than(col, threshold=0.5):
    return col.isnull().sum() / len(col) > threshold

DropCols(cols=s.filter(more_nulls_than, threshold=0.5)).fit_transform(df)
|   | b |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
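To illustrate the argument passing mentioned above, here is a small sketch that raises the threshold: column "a" contains 75% nulls, so with threshold=0.8 nothing is selected and DropCols leaves the dataframe unchanged:
DropCols(cols=s.filter(more_nulls_than, threshold=0.8)).fit_transform(df)  # no column exceeds 80% nulls, nothing is dropped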
.filter_names is similar to .filter in that it also takes a predicate function, but in this case the function is evaluated on the column names rather than on the column values.
If we define this example dataframe:
from skrub import selectors as s
import pandas as pd
df = pd.DataFrame(
    {
        "height_mm": [297.0, 420.0],
        "width_mm": [210.0, 297.0],
        "kind": ["A4", "A3"],
        "ID": [4, 3],
    }
)
df
|   | height_mm | width_mm | kind | ID |
|---|---|---|---|---|
| 0 | 297.0 | 210.0 | A4 | 4 |
| 1 | 420.0 | 297.0 | A3 | 3 |
We can select all the columns that end with "_mm" as follows:
selector = s.filter_names(lambda name: name.endswith('_mm'))
s.select(df, selector)
|   | height_mm | width_mm |
|---|---|---|
| 0 | 297.0 | 210.0 |
| 1 | 420.0 | 297.0 |
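For a simple suffix pattern like this one, a glob selector would be an equivalent, more compact alternative (a sketch, not part of the original example):
s.select(df, s.glob("*_mm"))  # selects "height_mm" and "width_mm"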
8.5 Conclusion
In this chapter we covered the skrub selectors and how they allow selecting specific columns, either through simple conditions or by combining different selectors. More use cases of ApplyToCols and the selectors will be shown throughout the rest of the course, starting from the next chapter on feature engineering.
9 Exercise: using selectors together with ApplyToCols
Path to the exercise: content/exercises/04_selectors.ipynb
Consider this example dataframe:
import pandas as pd
df = pd.DataFrame(
    {
        "metric_1": [10.5, 20.3, 30.1, 40.2],
        "metric_2": [5.1, 15.6, None, 35.8],
        "metric_3": [1.1, 3.3, 2.6, 0.8],
        "num_id": [101, 102, 103, 104],
        "str_id": ["A101", "A102", "A103", "A104"],
        "description": ["apple", None, "cherry", "date"],
        "name": ["Alice", "Bob", "Charlie", "David"],
    }
)
df
|   | metric_1 | metric_2 | metric_3 | num_id | str_id | description | name |
|---|---|---|---|---|---|---|---|
| 0 | 10.5 | 5.1 | 1.1 | 101 | A101 | apple | Alice |
| 1 | 20.3 | 15.6 | 3.3 | 102 | A102 | None | Bob |
| 2 | 30.1 | NaN | 2.6 | 103 | A103 | cherry | Charlie |
| 3 | 40.2 | 35.8 | 0.8 | 104 | A104 | date | David |
Using the skrub selectors and ApplyToCols:
- Apply the StandardScaler to numeric columns, except "num_id".
- Apply a OneHotEncoder with sparse_output=False on all string columns except "str_id".
import skrub.selectors as s
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from skrub import ApplyToCols
from sklearn.pipeline import make_pipeline
# Write your solution here
#
#
#
#
#
#
#
#
#
import skrub.selectors as s
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from skrub import ApplyToCols
from sklearn.pipeline import make_pipeline
scaler = ApplyToCols(StandardScaler(), cols=s.numeric() - "num_id")
one_hot = ApplyToCols(OneHotEncoder(sparse_output=False), cols=s.string() - "str_id")
transformer = make_pipeline(scaler, one_hot)
transformer.fit_transform(df)
|   | metric_1 | metric_2 | metric_3 | num_id | str_id | description_apple | description_cherry | description_date | description_None | name_Alice | name_Bob | name_Charlie | name_David |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.336178 | -1.077965 | -0.820768 | 101 | A101 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | -0.449914 | -0.253793 | 1.303572 | 102 | A102 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.436349 | NaN | 0.627646 | 103 | A103 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 1.349743 | 1.331758 | -1.110450 | 104 | A104 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Given the same dataframe and using selectors, drop only string columns that contain nulls.
from skrub import DropCols
# Write your solution here
#
#
#
#
#
#
#
from skrub import DropCols
DropCols(cols=s.has_nulls() & s.string()).fit_transform(df)
|   | metric_1 | metric_2 | metric_3 | num_id | str_id | name |
|---|---|---|---|---|---|---|
| 0 | 10.5 | 5.1 | 1.1 | 101 | A101 | Alice |
| 1 | 20.3 | 15.6 | 3.3 | 102 | A102 | Bob |
| 2 | 30.1 | NaN | 2.6 | 103 | A103 | Charlie |
| 3 | 40.2 | 35.8 | 0.8 | 104 | A104 | David |
Now write a custom function that selects columns where all values are lower than 10.0.
from skrub import SelectCols
# Write your solution here
#
#
#
#
#
#
#
from skrub import SelectCols
def lower_than(col):
    return all(col < 10.0)

SelectCols(cols=s.numeric() & s.filter(lower_than)).fit_transform(df)
|   | metric_3 |
|---|---|
| 0 | 1.1 |
| 1 | 3.3 |
| 2 | 2.6 |
| 3 | 0.8 |