8  Choose your column: selectors

Introduction

Very often, column selection is more complex than simply passing a list of column names to a transformer: it may be necessary to select all columns with a specific data type, or all columns that share some other characteristic (presence of nulls, cardinality, etc.).

The skrub selectors implement a number of selection strategies that can be combined in various ways to build complex filtering conditions that can then be employed by ApplyToCols, ApplyToFrame, SelectCols and DropCols.

8.1 Skrub selectors

Selectors are available from the skrub.selectors namespace:

import skrub.selectors as s

We will use this example dataframe to test some of the selectors:

import pandas as pd
import datetime

data = {
    "int": [15, 56, 63, 12, 44],
    "float": [5.2, 2.4, 6.2, 10.45, 9.0],
    "str1": ["public", "private", None, "private", "public"],
    "str2": ["officer", "manager", "lawyer", "chef", "teacher"],
    "bool": [True, False, True, False, True],
    "cat1": pd.Categorical(["yes", "yes", None, "yes", "no"]),
    "cat2": pd.Categorical(["20K+", "40K+", "60K+", "30K+", "50K+"]),
    "datetime-col": [
        datetime.datetime.fromisoformat(dt)
        for dt in [
            "2020-02-03T12:30:05",
            "2021-03-15T00:37:15",
            "2022-02-13T17:03:25",
            "2023-05-22T08:45:55",
        ]
    ]
    + [None],
}
df = pd.DataFrame(data)
df
int float str1 str2 bool cat1 cat2 datetime-col
0 15 5.20 public officer True yes 20K+ 2020-02-03 12:30:05
1 56 2.40 private manager False yes 40K+ 2021-03-15 00:37:15
2 63 6.20 None lawyer True NaN 60K+ 2022-02-13 17:03:25
3 12 10.45 private chef False yes 30K+ 2023-05-22 08:45:55
4 44 9.00 public teacher True no 50K+ NaT

Selectors should be used in conjunction with the transformers described in the previous chapter: ApplyToCols, ApplyToFrame, SelectCols and DropCols.

Selectors allow filtering columns by data type:

  • .float: floating-point columns
  • .integer: integer columns
  • .any_date: date or datetime columns
  • .boolean: boolean columns
  • .string: columns with a String data type
  • .categorical: columns with a Categorical data type
  • .numeric: numeric (either integer or float) columns

from skrub import SelectCols
string_selector = s.string()

SelectCols(cols=string_selector).fit_transform(df)
str1 str2
0 public officer
1 private manager
2 None lawyer
3 private chef
4 public teacher

Additional conditions include:

  • .all: select all columns
  • .cardinality_below: select all columns with a number of unique values lower than the given threshold
  • .has_nulls: select all columns that include at least one null value

SelectCols(cols=s.has_nulls()).fit_transform(df)
str1 cat1 datetime-col
0 public yes 2020-02-03 12:30:05
1 private yes 2021-03-15 00:37:15
2 None NaN 2022-02-13 17:03:25
3 private yes 2023-05-22 08:45:55
4 public no NaT

Various selectors allow choosing columns based on their names:

  • .cols: select the provided column name (or list of names)
    • note that transformers that accept selectors can also take a column name or a list of names directly
  • .glob: use Unix shell-style globbing to select column names
  • .regex: select columns whose names match a regular expression

SelectCols(cols=s.glob("cat*")).fit_transform(df)
cat1 cat2
0 yes 20K+
1 yes 40K+
2 NaN 60K+
3 yes 30K+
4 no 50K+

8.2 Combining selectors

Selectors can be inverted with .inv or the logical operator ~ to select all the other columns, and they can be combined with the & and | logical operators. It is also possible to remove columns from a selection with -; for example, to select all columns except for “datetime-col”, one would write:

SelectCols(cols=s.all() - "datetime-col").fit_transform(df)
int float str1 str2 bool cat1 cat2
0 15 5.20 public officer True yes 20K+
1 56 2.40 private manager False yes 40K+
2 63 6.20 None lawyer True NaN 60K+
3 12 10.45 private chef False yes 30K+
4 44 9.00 public teacher True no 50K+

To select all datetime columns OR all string columns that do not contain nulls, we can do:

SelectCols(cols=s.any_date() | (s.string() & ~s.has_nulls())).fit_transform(df)
str2 datetime-col
0 officer 2020-02-03 12:30:05
1 manager 2021-03-15 00:37:15
2 lawyer 2022-02-13 17:03:25
3 chef 2023-05-22 08:45:55
4 teacher NaT

8.3 Extracting selected columns

Selectors provide the expand and expand_index methods to retrieve, respectively, the names and the positional indices of the columns they select:

has_nulls = s.has_nulls()
has_nulls.expand(df)
['str1', 'cat1', 'datetime-col']

This can be used, for example, to pass a list of column names to a regular dataframe method.

df.drop(columns=has_nulls.expand(df))
int float str2 bool cat2
0 15 5.20 officer True 20K+
1 56 2.40 manager False 40K+
2 63 6.20 lawyer True 60K+
3 12 10.45 chef False 30K+
4 44 9.00 teacher True 50K+

8.4 Designing custom filters

Finally, it is possible to define function-based selectors using .filter and .filter_names.

.filter selects the columns for which a user-defined function, called on each column, returns True. Keyword arguments can also be passed to the function to further tweak the condition. For example, it is possible to select columns that contain more than a given fraction of nulls by defining a function like the following:

import pandas as pd
import skrub.selectors as s
from skrub import DropCols

df = pd.DataFrame({"a": [None, None, None, 1], "b": [1,2,3,4]})

def more_nulls_than(col, threshold=0.5):
    # Fraction of null values in the column, compared to the threshold.
    return col.isnull().sum() / len(col) > threshold

DropCols(cols=s.filter(more_nulls_than, threshold=0.5)).fit_transform(df)
b
0 1
1 2
2 3
3 4

.filter_names is similar to .filter in that it also takes a predicate function, but in this case the function is evaluated on the column names rather than on the columns themselves.

If we define this example dataframe:

from skrub import selectors as s
import pandas as pd
df = pd.DataFrame(
    {
        "height_mm": [297.0, 420.0],
        "width_mm": [210.0, 297.0],
        "kind": ["A4", "A3"],
        "ID": [4, 3],
    }
)
df
height_mm width_mm kind ID
0 297.0 210.0 A4 4
1 420.0 297.0 A3 3

We can select all the columns that end with "_mm" as follows:

selector = s.filter_names(lambda name: name.endswith('_mm'))
s.select(df, selector)
height_mm width_mm
0 297.0 210.0
1 420.0 297.0

8.5 Conclusion

In this chapter we covered the skrub selectors and how they make it possible to select specific columns, either through simple conditions or by combining different selectors. More use cases of ApplyToCols and the selectors will be shown throughout the rest of the course, starting from the next chapter on feature engineering.

9 Exercise: using selectors together with ApplyToCols

Path to the exercise: content/exercises/04_selectors.ipynb

Consider this example dataframe:

import pandas as pd

df = pd.DataFrame(
    {
        "metric_1": [10.5, 20.3, 30.1, 40.2],
        "metric_2": [5.1, 15.6, None, 35.8],
        "metric_3": [1.1, 3.3, 2.6, .8],
        "num_id": [101, 102, 103, 104],
        "str_id": ["A101", "A102", "A103", "A104"],
        "description": ["apple", None, "cherry", "date"],
        "name": ["Alice", "Bob", "Charlie", "David"],
    }
)
df
metric_1 metric_2 metric_3 num_id str_id description name
0 10.5 5.1 1.1 101 A101 apple Alice
1 20.3 15.6 3.3 102 A102 None Bob
2 30.1 NaN 2.6 103 A103 cherry Charlie
3 40.2 35.8 0.8 104 A104 date David

Using the skrub selectors and ApplyToCols:

  • Apply the StandardScaler to numeric columns, except "num_id".
  • Apply a OneHotEncoder with sparse_output=False on all string columns except "str_id".

import skrub.selectors as s
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from skrub import ApplyToCols
from sklearn.pipeline import make_pipeline

# Write your solution here
# 
# 
# 
# 
# 
# 
# 
# 
# 
import skrub.selectors as s
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from skrub import ApplyToCols
from sklearn.pipeline import make_pipeline

scaler = ApplyToCols(StandardScaler(), cols=s.numeric() - "num_id")
one_hot = ApplyToCols(OneHotEncoder(sparse_output=False), cols=s.string() - "str_id")

transformer = make_pipeline(scaler, one_hot)

transformer.fit_transform(df)
metric_1 metric_2 metric_3 num_id str_id description_apple description_cherry description_date description_None name_Alice name_Bob name_Charlie name_David
0 -1.336178 -1.077965 -0.820768 101 A101 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
1 -0.449914 -0.253793 1.303572 102 A102 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0
2 0.436349 NaN 0.627646 103 A103 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
3 1.349743 1.331758 -1.110450 104 A104 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0

Given the same dataframe and using selectors, drop only string columns that contain nulls.

from skrub import DropCols

# Write your solution here
# 
# 
# 
# 
# 
# 
# 
from skrub import DropCols

DropCols(cols=s.has_nulls() & s.string()).fit_transform(df)
metric_1 metric_2 metric_3 num_id str_id name
0 10.5 5.1 1.1 101 A101 Alice
1 20.3 15.6 3.3 102 A102 Bob
2 30.1 NaN 2.6 103 A103 Charlie
3 40.2 35.8 0.8 104 A104 David

Now write a custom function that selects columns where all values are lower than 10.0.

from skrub import SelectCols

# Write your solution here
# 
# 
# 
# 
# 
# 
# 
from skrub import SelectCols

def lower_than(col):
    return all(col < 10.0)

SelectCols(cols=s.numeric() & s.filter(lower_than)).fit_transform(df)
metric_3
0 1.1
1 3.3
2 2.6
3 0.8