.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/10_apply_on_cols.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_10_apply_on_cols.py>`
        to download the full example code, or to run this example in your
        browser via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_10_apply_on_cols.py:

Hands-On with Column Selection and Transformers
===============================================

In previous examples, we saw how skrub provides powerful abstractions like
:class:`~skrub.TableVectorizer` and :func:`~skrub.tabular_learner` to create
pipelines. In this example, we show how to create more flexible pipelines by
selecting and transforming dataframe columns using arbitrary logic.

.. GENERATED FROM PYTHON SOURCE LINES 13-15

We begin by loading a dataset with heterogeneous datatypes and replacing the
default pandas display with the ``TableReport`` display via
:func:`skrub.set_config`.

.. GENERATED FROM PYTHON SOURCE LINES 15-23

.. code-block:: Python

    import skrub
    from skrub.datasets import fetch_employee_salaries

    skrub.set_config(use_tablereport=True)

    data = fetch_employee_salaries()
    X, y = data.X, data.y
    X
.. GENERATED FROM PYTHON SOURCE LINES 24-36

Our goal is now to apply a :class:`~skrub.StringEncoder` to two columns of our
choosing: ``division`` and ``employee_position_title``. We can achieve this
using :class:`~skrub.ApplyToCols`, whose job is to apply a transformer to
multiple columns independently, while letting unmatched columns pass through
unchanged. It can be seen as a handy drop-in replacement for
:class:`~sklearn.compose.ColumnTransformer`.

Since we selected two columns and set the number of components to ``30`` for
each, :class:`~skrub.ApplyToCols` will create ``2 * 30`` embedding columns in
the dataframe ``Xt``, which we prefix with ``lsa_``.

.. GENERATED FROM PYTHON SOURCE LINES 36-46

.. code-block:: Python

    from skrub import ApplyToCols, StringEncoder

    apply_string_encoder = ApplyToCols(
        StringEncoder(n_components=30),
        cols=["division", "employee_position_title"],
        rename_columns="lsa_{}",
    )
    Xt = apply_string_encoder.fit_transform(X)
    Xt
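Since :class:`~skrub.ApplyToCols` is described above as a drop-in replacement
for :class:`~sklearn.compose.ColumnTransformer`, here is a minimal sketch of
the equivalent plain scikit-learn construction, on a small hypothetical frame
(:class:`~sklearn.feature_extraction.text.TfidfVectorizer` stands in for the
skrub :class:`~skrub.StringEncoder`; the toy data is made up for illustration):

.. code-block:: Python

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical toy data standing in for the employee salaries table.
    toy = pd.DataFrame(
        {
            "division": ["police", "fire", "police"],
            "employee_position_title": ["officer", "chief", "sergeant"],
            "year_first_hired": [2001, 1998, 2010],
        }
    )

    # One transformer per text column; the other columns pass through
    # unchanged, mirroring the unmatched-columns behavior of ApplyToCols.
    ct = ColumnTransformer(
        [
            ("lsa_division", TfidfVectorizer(), "division"),
            ("lsa_title", TfidfVectorizer(), "employee_position_title"),
        ],
        remainder="passthrough",
    )
    out = ct.fit_transform(toy)
    # 2 tf-idf features for "division", 3 for the titles, 1 passthrough column.
    print(out.shape)

The main practical difference is that ``ColumnTransformer`` asks for one
``(name, transformer, columns)`` triple per block, while ``ApplyToCols``
clones a single transformer across every matched column.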
.. GENERATED FROM PYTHON SOURCE LINES 47-59

In addition to :class:`~skrub.ApplyToCols`, the :class:`~skrub.ApplyToFrame`
class is useful for transformers that work on multiple columns at once, such
as :class:`~sklearn.decomposition.PCA`, which reduces the number of
components.

To select columns without hardcoding their names, we introduce
:ref:`selectors`, which allow for flexible matching patterns and composable
logic. The regex selector below matches all columns prefixed with ``"lsa"``
and passes them to :class:`~skrub.ApplyToFrame`, which assembles these columns
into a dataframe and finally passes it to the PCA.

.. GENERATED FROM PYTHON SOURCE LINES 59-68

.. code-block:: Python

    from sklearn.decomposition import PCA

    from skrub import ApplyToFrame
    from skrub import selectors as s

    apply_pca = ApplyToFrame(PCA(n_components=8), cols=s.regex("lsa"))
    Xt = apply_pca.fit_transform(Xt)
    Xt
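To make the column flow of this step explicit, the regex-then-PCA pattern can
be sketched with plain pandas and scikit-learn on hypothetical data: gather
the matching columns into one block, reduce that block, and reattach the
result to the untouched columns.

.. code-block:: Python

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA

    # Hypothetical frame with three "lsa"-prefixed embedding columns.
    rng = np.random.default_rng(0)
    frame = pd.DataFrame(
        rng.normal(size=(10, 5)),
        columns=["lsa_0", "lsa_1", "lsa_2", "year_first_hired", "salary"],
    )

    # Gather the columns matching the "lsa" prefix into one block...
    lsa_cols = frame.filter(regex="^lsa").columns
    # ...reduce that block with a single PCA fit...
    reduced = PCA(n_components=2).fit_transform(frame[lsa_cols])
    # ...and let the remaining columns through unchanged.
    out = pd.concat(
        [
            frame.drop(columns=lsa_cols),
            pd.DataFrame(reduced, columns=["pca0", "pca1"], index=frame.index),
        ],
        axis=1,
    )

This is exactly what distinguishes ``ApplyToFrame`` from ``ApplyToCols``: the
matched columns are processed jointly, as one frame, instead of one by one.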
.. GENERATED FROM PYTHON SOURCE LINES 69-71

These two transformers are scikit-learn compatible and can be chained together
within a :class:`~sklearn.pipeline.Pipeline`.

.. GENERATED FROM PYTHON SOURCE LINES 71-78

.. code-block:: Python

    from sklearn.pipeline import make_pipeline

    model = make_pipeline(
        apply_string_encoder,
        apply_pca,
    ).fit_transform(X)

.. GENERATED FROM PYTHON SOURCE LINES 79-81

Note that selectors also come in handy in a pipeline to select or drop
columns, using :class:`~skrub.SelectCols` and :class:`~skrub.DropCols`!

.. GENERATED FROM PYTHON SOURCE LINES 81-92

.. code-block:: Python

    from sklearn.preprocessing import StandardScaler

    from skrub import SelectCols

    # Select only numerical columns
    pipeline = make_pipeline(
        SelectCols(cols=s.numeric()),
        StandardScaler(),
    ).set_output(transform="pandas")
    pipeline.fit_transform(Xt)
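As a rough plain-pandas analogue of ``SelectCols(cols=s.numeric())`` followed
by scaling, ``select_dtypes`` can stand in for the numeric selector (the toy
frame below is hypothetical):

.. code-block:: Python

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    frame = pd.DataFrame(
        {
            "division": ["a", "b", "c"],
            "salary": [1.0, 2.0, 3.0],
            "years": [10, 20, 30],
        }
    )

    # Keep only the numeric columns, then standardize them.
    numeric = frame.select_dtypes("number")
    scaled = pd.DataFrame(
        StandardScaler().fit_transform(numeric),
        columns=numeric.columns,
        index=numeric.index,
    )

The advantage of ``SelectCols`` over this manual version is that the selection
is part of the pipeline, so it is re-applied consistently at ``transform``
time on new data.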
.. GENERATED FROM PYTHON SOURCE LINES 93-100

Let's run through one more example to showcase the expressiveness of the
selectors. Suppose we want to apply an
:class:`~sklearn.preprocessing.OrdinalEncoder` on categorical columns with low
cardinality (e.g., fewer than ``40`` unique values).

We define a column filter using skrub selectors with a lambda function. Note
that the same effect can be obtained directly by using
:func:`~skrub.selectors.cardinality_below`.

.. GENERATED FROM PYTHON SOURCE LINES 100-105

.. code-block:: Python

    from sklearn.preprocessing import OrdinalEncoder

    low_cardinality = s.filter(lambda col: col.nunique() < 40)
    ApplyToCols(OrdinalEncoder(), cols=s.string() & low_cardinality).fit_transform(X)
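The cardinality filter itself is easy to emulate with plain pandas, which can
help when checking which columns a ``s.filter(...)`` selector would match.
This sketch uses a hypothetical toy frame and threshold:

.. code-block:: Python

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    frame = pd.DataFrame(
        {
            "division": ["a", "b", "a", "c"],
            "employee_position_title": ["t1", "t2", "t3", "t4"],
            "salary": [1.0, 2.0, 3.0, 4.0],
        }
    )

    # String columns whose number of unique values falls below the threshold,
    # mirroring s.string() & s.filter(lambda col: col.nunique() < threshold).
    threshold = 4
    low_card = [
        c
        for c in frame.columns
        if frame[c].dtype == object and frame[c].nunique() < threshold
    ]
    frame[low_card] = OrdinalEncoder().fit_transform(frame[low_card])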
.. GENERATED FROM PYTHON SOURCE LINES 106-113

Notice how we composed the selector with :func:`~skrub.selectors.string`
using a logical operator. The resulting selector matches string columns with
cardinality below ``40``.

We can also define the opposite selector, ``high_cardinality``, using the
negation operator ``~``, and apply a :class:`~skrub.StringEncoder` to
vectorize those columns.

.. GENERATED FROM PYTHON SOURCE LINES 113-129

.. code-block:: Python

    from sklearn.ensemble import HistGradientBoostingRegressor

    high_cardinality = ~low_cardinality
    pipeline = make_pipeline(
        ApplyToCols(
            OrdinalEncoder(),
            cols=s.string() & low_cardinality,
        ),
        ApplyToCols(
            StringEncoder(),
            cols=s.string() & high_cardinality,
        ),
        HistGradientBoostingRegressor(),
    ).fit(X, y)
    pipeline

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Pipeline(steps=[('applytocols-1',
                     ApplyToCols(cols=(string() & filter(<lambda>)),
                                 transformer=OrdinalEncoder())),
                    ('applytocols-2',
                     ApplyToCols(cols=(string() & (~filter(<lambda>))),
                                 transformer=StringEncoder())),
                    ('histgradientboostingregressor',
                     HistGradientBoostingRegressor())])


.. GENERATED FROM PYTHON SOURCE LINES 130-135

Interestingly, the pipeline above is similar to the datatype dispatching
performed by :class:`~skrub.TableVectorizer`, which is also used in
:func:`~skrub.tabular_learner`. Click on the dropdown arrows next to each
datatype to see how the columns are mapped to the different transformers in
:class:`~skrub.TableVectorizer`.

.. GENERATED FROM PYTHON SOURCE LINES 135-138

.. code-block:: Python

    from skrub import tabular_learner

    tabular_learner("regressor").fit(X, y)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    /home/circleci/project/skrub/_tabular_pipeline.py:75: FutureWarning: tabular_learner will be deprecated in the next release. Equivalent functionality is available in skrub.set_config.

    Pipeline(steps=[('tablevectorizer',
                     TableVectorizer(low_cardinality=ToCategorical())),
                    ('histgradientboostingregressor',
                     HistGradientBoostingRegressor())])


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 8.264 seconds)


.. _sphx_glr_download_auto_examples_10_apply_on_cols.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/skrub-data/skrub/main?urlpath=lab/tree/notebooks/auto_examples/10_apply_on_cols.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../lite/lab/index.html?path=auto_examples/10_apply_on_cols.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 10_apply_on_cols.ipynb <10_apply_on_cols.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 10_apply_on_cols.py <10_apply_on_cols.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 10_apply_on_cols.zip <10_apply_on_cols.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_