Very often, column selection is more complex than simply passing a list of column names to a transformer: it may be necessary to select all columns that have a specific data type, or based on some other characteristic (presence of nulls, column cardinality, etc.).
The skrub selectors implement a number of selection strategies that can be combined in various ways to build complex filtering conditions that can then be employed by ApplyToCols, ApplyToFrame, SelectCols and DropCols.
Selectors are available from the skrub.selectors namespace:
We will use this example dataframe to test some of the selectors:
```python
import datetime

import pandas as pd

data = {
    "int": [15, 56, 63, 12, 44],
    "float": [5.2, 2.4, 6.2, 10.45, 9.0],
    "str1": ["public", "private", None, "private", "public"],
    "str2": ["officer", "manager", "lawyer", "chef", "teacher"],
    "bool": [True, False, True, False, True],
    "cat1": pd.Categorical(["yes", "yes", None, "yes", "no"]),
    "cat2": pd.Categorical(["20K+", "40K+", "60K+", "30K+", "50K+"]),
    "datetime-col": [
        datetime.datetime.fromisoformat(dt)
        for dt in [
            "2020-02-03T12:30:05",
            "2021-03-15T00:37:15",
            "2022-02-13T17:03:25",
            "2023-05-22T08:45:55",
        ]
    ]
    + [None],
}
df = pd.DataFrame(data)
df
```

|   | int | float | str1 | str2 | bool | cat1 | cat2 | datetime-col |
|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 5.20 | public | officer | True | yes | 20K+ | 2020-02-03 12:30:05 |
| 1 | 56 | 2.40 | private | manager | False | yes | 40K+ | 2021-03-15 00:37:15 |
| 2 | 63 | 6.20 | None | lawyer | True | NaN | 60K+ | 2022-02-13 17:03:25 |
| 3 | 12 | 10.45 | private | chef | False | yes | 30K+ | 2023-05-22 08:45:55 |
| 4 | 44 | 9.00 | public | teacher | True | no | 50K+ | NaT |
Selectors should be used in conjunction with the transformers described in the previous chapter: ApplyToCols, ApplyToFrame, SelectCols and DropCols.
Selectors allow filtering columns by data type:
- `.float`: floating-point columns
- `.integer`: integer columns
- `.any_date`: date or datetime columns
- `.boolean`: boolean columns
- `.string`: columns with a string data type
- `.categorical`: columns with a Categorical data type
- `.numeric`: numeric (either integer or float) columns

For example, selecting the string columns of the example dataframe keeps only str1 and str2:

|   | str1 | str2 |
|---|---|---|
| 0 | public | officer |
| 1 | private | manager |
| 2 | None | lawyer |
| 3 | private | chef |
| 4 | public | teacher |
Additional conditions include:
- `.all`: select all columns
- `.cardinality_below`: select all columns with a number of unique values lower than the given threshold
- `.has_nulls`: select all columns that include at least one null value

For example, selecting the columns that contain at least one null value:

|   | str1 | cat1 | datetime-col |
|---|---|---|---|
| 0 | public | yes | 2020-02-03 12:30:05 |
| 1 | private | yes | 2021-03-15 00:37:15 |
| 2 | None | NaN | 2022-02-13 17:03:25 |
| 3 | private | yes | 2023-05-22 08:45:55 |
| 4 | public | no | NaT |
Several selectors choose columns based on their name:

- `.cols`: select the provided column name (or list of names)
- `.glob`: select column names using Unix shell-style globs
- `.regex`: select column names using regular expressions

Selectors can be inverted using `.inv` or the logical operator `~` to select all other columns, and they can be combined using the `&` and `|` logical operators. It is also possible to remove columns from a selection with `-`; for example, to select all columns except for "datetime-col", one would write:
|   | int | float | str1 | str2 | bool | cat1 | cat2 |
|---|---|---|---|---|---|---|---|
| 0 | 15 | 5.20 | public | officer | True | yes | 20K+ |
| 1 | 56 | 2.40 | private | manager | False | yes | 40K+ |
| 2 | 63 | 6.20 | None | lawyer | True | NaN | 60K+ |
| 3 | 12 | 10.45 | private | chef | False | yes | 30K+ |
| 4 | 44 | 9.00 | public | teacher | True | no | 50K+ |
To select all datetime columns OR all string columns that do not contain nulls, we can do:
Selectors provide the expand and expand_index methods, which return the names (or positional indices) of the columns that a selector matches on a given dataframe:
This can be used, for example, to pass a list of columns to a dataframe library.
Finally, it is possible to define function-based selectors using .filter and .filter_names.
.filter selects the columns for which a user-defined predicate function, called on each column, returns True. Extra arguments can also be passed to the function to tweak the condition: for example, we can select the columns whose share of null values stays below a given threshold by defining a function like the following:
|   | b |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
.filter_names is similar to .filter: it also takes a predicate function, but in this case the function is evaluated on the column names rather than on the columns themselves.
If we define this example dataframe:
|   | height_mm | width_mm | kind | ID |
|---|---|---|---|---|
| 0 | 297.0 | 210.0 | A4 | 4 |
| 1 | 420.0 | 297.0 | A3 | 3 |
We can select all the columns that end with "_mm" as follows:
In this chapter we covered the skrub selectors and how they select specific columns, either through simple conditions or by combining different selectors. More use cases of ApplyToCols and the selectors will appear throughout the rest of the course, starting with the next chapter on feature engineering.
Path to the exercise: content/exercises/04_selectors.ipynb
Consider this example dataframe:
```python
import pandas as pd

df = pd.DataFrame(
    {
        "metric_1": [10.5, 20.3, 30.1, 40.2],
        "metric_2": [5.1, 15.6, None, 35.8],
        "metric_3": [1.1, 3.3, 2.6, 0.8],
        "num_id": [101, 102, 103, 104],
        "str_id": ["A101", "A102", "A103", "A104"],
        "description": ["apple", None, "cherry", "date"],
        "name": ["Alice", "Bob", "Charlie", "David"],
    }
)
df
```

|   | metric_1 | metric_2 | metric_3 | num_id | str_id | description | name |
|---|---|---|---|---|---|---|---|
| 0 | 10.5 | 5.1 | 1.1 | 101 | A101 | apple | Alice |
| 1 | 20.3 | 15.6 | 3.3 | 102 | A102 | None | Bob |
| 2 | 30.1 | NaN | 2.6 | 103 | A103 | cherry | Charlie |
| 3 | 40.2 | 35.8 | 0.8 | 104 | A104 | date | David |
Using the skrub selectors and ApplyToCols:

- apply StandardScaler to the numeric columns, except "num_id";
- apply OneHotEncoder with sparse_output=False to all string columns except "str_id".

```python
import skrub.selectors as s
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from skrub import ApplyToCols

scaler = ApplyToCols(StandardScaler(), cols=s.numeric() - "num_id")
one_hot = ApplyToCols(OneHotEncoder(sparse_output=False), cols=s.string() - "str_id")
transformer = make_pipeline(scaler, one_hot)
transformer.fit_transform(df)
```

|   | metric_1 | metric_2 | metric_3 | num_id | str_id | description_apple | description_cherry | description_date | description_None | name_Alice | name_Bob | name_Charlie | name_David |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.336178 | -1.077965 | -0.820768 | 101 | A101 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | -0.449914 | -0.253793 | 1.303572 | 102 | A102 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.436349 | NaN | 0.627646 | 103 | A103 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 1.349743 | 1.331758 | -1.110450 | 104 | A104 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Given the same dataframe and using selectors, drop only string columns that contain nulls.
|   | metric_1 | metric_2 | metric_3 | num_id | str_id | name |
|---|---|---|---|---|---|---|
| 0 | 10.5 | 5.1 | 1.1 | 101 | A101 | Alice |
| 1 | 20.3 | 15.6 | 3.3 | 102 | A102 | Bob |
| 2 | 30.1 | NaN | 2.6 | 103 | A103 | Charlie |
| 3 | 40.2 | 35.8 | 0.8 | 104 | A104 | David |
Now write a custom function that selects columns where all values are lower than 10.0.