.. currentmodule:: skrub .. |ApplyToCols| replace:: :class:`ApplyToCols` .. |TableVectorizer| replace:: :class:`TableVectorizer` .. |s.string| replace:: :meth:`~skrub.selectors.string` .. |s.numeric| replace:: :meth:`~skrub.selectors.numeric` .. |RejectColumn| replace:: :class:`core.RejectColumn` .. |ToDatetime| replace:: :class:`ToDatetime` .. |SingleColumnTransformer| replace:: :class:`~skrub.core.SingleColumnTransformer` .. |StandardScaler| replace:: :class:`~sklearn.preprocessing.StandardScaler` .. |OneHotEncoder| replace:: :class:`~sklearn.preprocessing.OneHotEncoder` .. |OrdinalEncoder| replace:: :class:`~sklearn.preprocessing.OrdinalEncoder` .. |make_pipeline| replace:: :class:`~sklearn.pipeline.make_pipeline` .. _user_guide_multiple_columns: Transforming selected columns with |ApplyToCols| =========================================================== Very often and for various reasons, transformers must be applied only to some of the columns in a dataframe. For example, all numeric columns in a dataframe may need to be scaled at the same time, while string columns should be left alone. While the heuristics used by the :class:`TableVectorizer` are usually good enough to apply the proper transformers to different datatypes, using it may not be an option in all cases. In scikit-learn pipelines, the column selection operation can be done with the :class:`~sklearn.compose.ColumnTransformer`. Skrub provides the |ApplyToCols| transformer to achieve the same results with a larger degree of control over which columns are being transformed. |ApplyToCols| maps a transformer to columns in a dataframe, so that all columns that satisfy a certain condition are transformed, while the others are left untouched. .. tip:: If a skrub transformer has a ``cols`` parameter to specify a column list, that can be a selector as well. Selectors give more control over which columns are being transformed: they are discussed at length in the :ref:`selectors user guide`. |ApplyToCols| can be used to transform a subset of columns in a dataframe, while leaving the non-selected columns unchanged. In this example, we want to apply an |OrdinalEncoder| only on the text column, and a |StandardScaler| on the numeric column. Columns that aren't selected are passed through unchanged, and this allows to concatenate |ApplyToCols| transformers with |make_pipeline|. >>> import pandas as pd >>> df = pd.DataFrame({"text": ["foo", "bar", "baz"], "number": [1, 2, 3]}) We use the |s.string| selector to choose only the text column, and |s.numeric| to select only the numeric column: >>> import skrub.selectors as s >>> from skrub import ApplyToCols >>> from sklearn.preprocessing import OrdinalEncoder, StandardScaler >>> >>> numeric = ApplyToCols(StandardScaler(), cols=s.numeric()) >>> string = ApplyToCols(OrdinalEncoder(), cols=s.string()) We then concatenate the two with |make_pipeline|: >>> from sklearn.pipeline import make_pipeline >>> transformed = make_pipeline(numeric, string).fit_transform(df) >>> transformed number text 0 -1.224745 2.0 1 0.000000 0.0 2 1.224745 1.0 If |ApplyToCols| is used with a transformer that inherits from |SingleColumnTransformer|, or one that has the ``__single_column_transformer__`` attribute, then the transformer will be cloned and applied separately to each column. Most skrub transformers belong to this category. Here we want to apply |ToDatetime| to each of the datetime columns to convert them to datetime dtype. |ApplyToCols| automatically detects that |ToDatetime| should be applied to each column separately: >>> from skrub._to_datetime import ToDatetime >>> df = pd.DataFrame({ ... 'date_1': ['2024-01-15', '2024-02-20', '2024-03-10'], ... 'date_2': ['2023-12-01', '2024-01-05', '2024-02-28'] ... }) >>> df_enc = ApplyToCols(ToDatetime()).fit_transform(df) >>> df_enc date_1 date_2 0 2024-01-15 2023-12-01 1 2024-02-20 2024-01-05 2 2024-03-10 2024-02-28 >>> df_enc.dtypes date_1 datetime64[...] date_2 datetime64[...] dtype: ... We can also combine |ApplyToCols| with |TableVectorizer| to only vectorize columns specific columns and avoid others, like ID columns: >>> from skrub import TableVectorizer >>> df = pd.DataFrame({ ... 'id': ["c1", "c2", "c3"], ... 'city': ['Paris', 'Rome', 'Madrid'], ... 'date': ['2023-01-15', '2023-02-20', '2023-03-10'] ... }) >>> ApplyToCols(TableVectorizer(), cols=s.all() - "id").fit_transform(df) # doctest: +SKIP id city_Madrid city_Paris city_Rome date_year date_month date_day date_total_seconds 0 c1 0.0 1.0 0.0 2023.0 1.0 15.0 1.673741e+09 1 c2 0.0 0.0 1.0 2023.0 2.0 20.0 1.676851e+09 2 c3 1.0 0.0 0.0 2023.0 3.0 10.0 1.678406e+09 Note that the column "id" was not encoded and was instead left as-is. Dealing with columns that cannot be handled by a transformer ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |ApplyToCols| can allow the underlying encoder to decide which columns it can be applied to. For example, if we do not know in advance which columns can be transformed to datetime, we can use |ApplyToCols| to map |ToDatetime| to all columns in a dataframe and pass ``allow_reject=True``. In that case, non-datetime columns. By default, all columns in ``cols`` must be transformed, and if one of them cannot be transformed an exception will be raised and the transformation will fail. It is possible to change how rejected columns are handled through the ``allow_reject`` parameter. By default, no special handling is performed and rejections are considered to be errors: >>> from skrub._to_datetime import ToDatetime >>> df = pd.DataFrame(dict(birthday=["29/01/2024"], city=["London"])) >>> df birthday city 0 29/01/2024 London >>> to_datetime = ApplyToCols(ToDatetime()) >>> to_datetime.fit_transform(df) Traceback (most recent call last): ... ValueError: Transformer ToDatetime.fit_transform failed on column 'city'. See above for the full traceback. However, setting ``allow_reject=True`` gives the transformer itself some control over which columns it should be applied to. For example, we can try to parse all columns but allow the transformer to reject those that, upon inspection, do not contain dates. >>> to_datetime = ApplyToCols(ToDatetime(), allow_reject=True) >>> transformed = to_datetime.fit_transform(df) >>> transformed birthday city 0 2024-01-29 London >>> transformed.dtypes birthday datetime64[...] city ... dtype: ... Advanced usage of |ApplyToCols| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For more advanced use cases, refer to the examples section of the |ApplyToCols| docstring, and to :ref:`this user guide section `.