.. currentmodule:: skrub.selectors .. |ApplyToCols| replace:: :class:`~skrub.ApplyToCols` .. |StandardScaler| replace:: :class:`~sklearn.preprocessing.StandardScaler` .. |filter| replace:: :func:`filter` .. |filter_names| replace:: :func:`filter_names` .. _user_guide_advanced_selectors: |filter| and |filter_names| to select with user-defined criteria ----------------------------------------------------------------- :func:`filter` and :func:`filter_names` allow selecting columns based on arbitrary user-defined criteria. These are also used to implement many of the other selectors provided in this module. :func:`filter` accepts a function which will be called on a column (i.e., a Pandas or polars Series). This function, called a predicate, must return ``True`` if the column should be selected. >>> import pandas as pd >>> import skrub.selectors as s >>> df = pd.DataFrame( ... { ... "height_mm": [297.0, 420.0], ... "width_mm": [210.0, 297.0], ... "kind": ["A4", "A3"], ... "ID": [4, 3], ... } ... ) >>> s.select(df, s.filter(lambda col: "A4" in col.tolist())) kind 0 A4 1 A3 :func:`filter_names` accepts a predicate that is passed the column name, instead of the column. >>> s.select(df, s.filter_names(lambda name: name.endswith('mm'))) height_mm width_mm 0 297.0 210.0 1 420.0 297.0 We can pass args and kwargs that will be forwarded to the predicate, to help avoid lambda or local functions and thus ensure the selector is picklable. >>> s.select(df, s.filter_names(str.endswith, 'mm')) height_mm width_mm 0 297.0 210.0 1 420.0 297.0 Example of custom criteria in :func:`filter`: selecting columns with outliers ............................................................................. The :func:`filter` selector can be used to select columns based on custom criteria. For example, we can define a function that checks if a column contains outliers using the Interquartile Range (IQR) method, and then use this function with :func:`filter` to select such columns. Specifically, we define a function that computes the IQR (Inter Quartile Range) of a column and checks if any data points extend further than 2 IQRs of the lower and upper quartile. >>> def has_outliers(column): ... q1 = column.quantile(0.25) ... q3 = column.quantile(0.75) ... IQR = q3 - q1 ... lower_bound = q1 - 2 * IQR ... upper_bound = q3 + 2 * IQR ... outliers = (column < lower_bound) | (column > upper_bound) ... return any(outliers) >>> from skrub import SelectCols >>> select = SelectCols(s.filter(has_outliers)) >>> data = pd.DataFrame({ ... "A": [10, 12, 14, 15, 100], # Outlier in column A ... "B": [20, 22, 21, 19, 20], # No outliers in column B ... "C": [30, 29, 31, 32, 300] # Outlier in column C ... }) >>> select.fit_transform(data) A C 0 10 30 1 12 29 2 14 31 3 15 32 4 100 300