Choose your columns with selectors

Introduction to Selectors

Selectors express column selections that go beyond a simple list of names:

import skrub.selectors as s

Selecting by Data Type

Available selectors:

  • .numeric(): Numeric columns (int or float)
  • .integer(): Integer columns only
  • .float(): Floating-point columns
  • .string(): String columns
  • .categorical(): Categorical columns
  • .any_date(): Date or datetime columns
  • .boolean(): Boolean columns

Example: Select String Columns

from skrub import SelectCols
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, 35],
    "name": ["Alice", None, "Charlie"],
    "city": ["NYC", "LA", "Chicago"]
})

SelectCols(cols=s.string()).fit_transform(df)

      name     city
0    Alice      NYC
1     None       LA
2  Charlie  Chicago

Example: Select/Drop Columns

df = pd.DataFrame({
    "age": [25, 30, 35],
    "name": ["Alice", None, "Charlie"],
    "city": ["NYC", "LA", "Chicago"]
})

selector = s.string()
s.select(df, selector)

      name     city
0    Alice      NYC
1     None       LA
2  Charlie  Chicago

Selecting by Characteristics

  • .all(): All columns
  • .has_nulls(): Columns with at least one null
  • .cardinality_below(threshold): Columns with fewer than threshold unique values

df = pd.DataFrame({
    "age": [25, 30, 35],
    "name": ["Alice", None, "Charlie"],
    "city": ["NYC", "LA", "Chicago"]
})
SelectCols(cols=s.has_nulls()).fit_transform(df)

      name
0    Alice
1     None
2  Charlie

Selecting by Name

  • .cols("name"): Specific column name(s)
  • .glob("pattern*"): Unix shell-style globbing
  • .regex("pattern"): Regular expressions

df = pd.DataFrame({
    "patient_id": [101, 102, 103],
    "metric_1": [0.3, 0.3, 0.6],
    "metric_2": [3, 5, 10],
})

SelectCols(cols=s.glob("metric_*")).fit_transform(df)

   metric_1  metric_2
0       0.3         3
1       0.3         5
2       0.6        10

Combining Selectors

Use logical operators:

# Inverse: NOT numeric
~s.numeric()

# OR: datetime columns OR string columns
s.any_date() | s.string()

# AND: string columns without nulls
s.string() & ~s.has_nulls()

# XOR: string columns, or columns whose name starts with "date_", but not both
s.string() ^ s.glob("date_*")

# Include: string columns plus the column named "date"
s.string() | "date"

# Exclude: all columns except "datetime-col"
s.all() - "datetime-col"

Extracting Selected Columns

Get the list of selected columns:

df = pd.DataFrame(
    {
        "age": [25, 30, 35],
        "name": ["Alice", None, "Charlie"],
        "city": ["NYC", "LA", "Chicago"],
    }
)

selector = s.has_nulls()
columns_with_nulls = selector.expand(df)

# Use in dataframe operations: upper-case the columns that contain nulls
df[columns_with_nulls].apply(lambda x: x.str.upper())

      name
0    ALICE
1     None
2  CHARLIE

Custom Selectors: Filter by Condition

def more_nulls_than(col, threshold=0.5):
    return col.isnull().sum() / len(col) > threshold

df = pd.DataFrame(
    {
        "no-nulls": [1, 2, 3, 4],
        "lotsa-nulls": [None, None, None, 4],
    }
)

selector = s.filter(more_nulls_than, threshold=0.5)
s.select(df, selector)

   lotsa-nulls
0          NaN
1          NaN
2          NaN
3          4.0

Custom Selectors: Filter by Name

df = pd.DataFrame(
    {
        "patient_id": [101, 102, 103],
        "metric_1": [0.3, 0.3, 0.6],
        "metric_2": [3, 5, 10],
    }
)

selector = s.filter_names(lambda name: name.startswith("metric_"))
s.select(df, selector)

   metric_1  metric_2
0       0.3         3
1       0.3         5
2       0.6        10

What we have seen in this chapter

  • Use selectors for flexible, reusable column selection
  • Combine selectors with logical operators
  • expand() extracts selected column names
  • Create custom selectors for domain-specific logic
  • Selectors plug directly into ApplyToCols, SelectCols, and DropCols