.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/11_squashing_scaler.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_11_squashing_scaler.py>`
        to download the full example code or to run this example in your
        browser via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_11_squashing_scaler.py:


SquashingScaler: Robust numerical preprocessing for neural networks
===================================================================

The following example illustrates the use of the :class:`~skrub.SquashingScaler`, a
transformer that rescales and squashes numerical features to a range that works
well with neural networks and possibly other related models. Its basic idea is to
rescale the features based on quantile statistics (to be robust to outliers), and
then apply a smooth squashing function to limit the outputs to a pre-defined
range. This transform has been found to work well even when applied to one-hot
encoded features.

In this example, we want to fit a neural network to predict employee salaries.
The dataset contains numerical features, categorical features, text features,
and dates. These features are first converted to numerical features using
:class:`~skrub.TableVectorizer`. Since the encoded features are not normalized,
we apply a numerical transformation to them.

Finally, we fit a simple neural network and compare the R2 scores obtained with
different numerical transformations.

Although we use a plain :class:`~sklearn.neural_network.MLPRegressor` here for
simplicity, we generally recommend using better neural network implementations
or tree-based models whenever low test errors are desired.

..
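
The two steps described above (robust rescaling from quantile statistics, then
a smooth squashing to a bounded range) can be illustrated with a small NumPy
sketch. This is not skrub's implementation: the helper name ``robust_squash``,
the use of the median and interquartile range, the squashing function
``z / sqrt(1 + (z / B)^2)``, and the bound ``B = 3`` are all assumptions chosen
for illustration only.

.. code-block:: Python

    import numpy as np


    def robust_squash(x, max_abs=3.0, q_low=25.0, q_high=75.0):
        """Hypothetical helper: quantile-based rescaling, then soft clipping."""
        x = np.asarray(x, dtype=float)
        # Step 1: center on the median and scale by the interquantile range,
        # so that a few extreme values barely affect the scaling.
        median = np.median(x)
        low, high = np.percentile(x, [q_low, q_high])
        scale = high - low
        if scale == 0:
            # Assumed fallback for (nearly) constant features.
            scale = 1.0
        z = (x - median) / scale
        # Step 2: smooth, monotonic squashing towards [-max_abs, max_abs].
        return z / np.sqrt(1.0 + (z / max_abs) ** 2)


    values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])
    squashed = robust_squash(values)
    print(squashed)

In this sketch the outlier ``1000.0`` is mapped close to the bound instead of
dominating the scale, while the ordering of the values is preserved.
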
.. GENERATED FROM PYTHON SOURCE LINES 28-36

Comparing numerical preprocessings
----------------------------------
We compare the :class:`~skrub.SquashingScaler` with the
:class:`~sklearn.preprocessing.StandardScaler` and the
:class:`~sklearn.preprocessing.QuantileTransformer` from scikit-learn. Each
scaler is placed in a pipeline with a TableVectorizer and a simple MLPRegressor.
We then print the R2 score on each fold's validation set in a three-fold
cross-validation.

.. GENERATED FROM PYTHON SOURCE LINES 36-76

.. code-block:: Python

    import warnings

    import numpy as np
    from sklearn.compose import TransformedTargetRegressor
    from sklearn.exceptions import ConvergenceWarning
    from sklearn.model_selection import cross_validate
    from sklearn.neural_network import MLPRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import QuantileTransformer, StandardScaler

    from skrub import DatetimeEncoder, SquashingScaler, TableVectorizer
    from skrub.datasets import fetch_employee_salaries

    np.random.seed(0)
    data = fetch_employee_salaries()

    for num_transformer in [
        StandardScaler(),
        QuantileTransformer(output_distribution="normal", random_state=0),
        SquashingScaler(),
    ]:
        pipeline = make_pipeline(
            TableVectorizer(datetime=DatetimeEncoder(periodic_encoding="circular")),
            num_transformer,
            TransformedTargetRegressor(
                # We use lbfgs for faster convergence
                MLPRegressor(solver="lbfgs", max_iter=100),
                transformer=StandardScaler(),
            ),
        )
        with warnings.catch_warnings():
            # Ignore warnings about the MLPRegressor not converging
            warnings.simplefilter("ignore", category=ConvergenceWarning)
            scores = cross_validate(pipeline, data.X, data.y, cv=3, scoring="r2")

        print(
            f"Cross-validation R2 scores for {num_transformer.__class__.__name__}"
            f" (higher is better):\n{scores['test_score']}\n"
        )


.. rst-class:: sphx-glr-script-out

..
.. code-block:: none

    Cross-validation R2 scores for StandardScaler (higher is better):
    [0.84757689 0.83732283 0.85409122]

    Cross-validation R2 scores for QuantileTransformer (higher is better):
    [0.76847544 0.77914599 0.77413903]

    Cross-validation R2 scores for SquashingScaler (higher is better):
    [0.88737156 0.89555145 0.90645875]


.. GENERATED FROM PYTHON SOURCE LINES 77-80

On the employee salaries dataset, the SquashingScaler performs better than
StandardScaler and QuantileTransformer on all cross-validation folds.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 23.332 seconds)


.. _sphx_glr_download_auto_examples_11_squashing_scaler.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/skrub-data/skrub/main?urlpath=lab/tree/notebooks/auto_examples/11_squashing_scaler.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../lite/lab/index.html?path=auto_examples/11_squashing_scaler.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 11_squashing_scaler.ipynb <11_squashing_scaler.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 11_squashing_scaler.py <11_squashing_scaler.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 11_squashing_scaler.zip <11_squashing_scaler.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_